07-26-2016 08:40 AM - edited 07-26-2016 08:42 AM
I would like to change how often some of the reports summarize their data. I am not finding any documentation on how to do this or on the effects of making changes. The first report is DHCPv4 Range Utilization Trend. Right now it runs every 4 hours and gives the average usage over that time. For many of our DHCP ranges, that loses the granularity needed to alert and make predictions.
I have a couple options I see.
- Change the cron schedule of the built-in report so it runs every 1 or 2 hours. There seem to be some dashboards that assume 4 hours, though, so there would likely be some ripple effects.
- Change the built-in search \ alert so that it also records the max usage every 4 hours. This is where I’m leaning now, but again I’m not sure what that might break.
- Create my own range utilization trend that runs more often, duplicating the built-in one. I’m not completely sure I can do this in 7.3.4, and I’ve been hesitant to try to create my own data types and files as the documentation seems to be very lacking.
On many of the reports, like leases per second or queries per second, I would like to see more granularity as well. Our current tool summarizes these every two minutes. For real-time troubleshooting of issues, this near real-time information is needed. How do I create these kinds of summary reports or dashboards for real-time information? How do I make sure that this data is only kept for a short time (days or weeks), while the less granular summaries are kept for months or years?
07-26-2016 09:35 AM
You are hitting the granularity of the source data. E.g., if you put this into search "index=ib_dhcp_summary report=si_dhcp_range_utilization_trend" you will see that those values are only updated every 4 hours.
If you need to do real-time troubleshooting, there will always be a lag with the reporting server, so it is not the best approach. But you should use the raw syslog feed (in 7.3.200) to see and filter individual events.
07-26-2016 09:59 AM - edited 07-26-2016 10:02 AM
I can see where the report si_dhcp_range_utilization_trend runs every 4 hours and creates the summary files. I'd like that to run more often, or add some data to the files it creates, like the max in the 4 hours instead of the average. Max DHCP scope utilization over a time window is generally far more useful than the average, as there is a hard upper limit that needs to be managed.
Infoblox's old reporting solution, ibgraph and bloxtools, has only a 2 minute lag for queries per second and leases per second. It was free and is still working within our environment. I am trying to get that functionality moved into the new reporting tool.
I've been able to move much of our alerting and forecasting over from bloxtools to the reporting server after the upgrade to 7.3. However, the lag and granularity are still an issue in some cases. I can see the actual data coming over more often; DHCP range utilization is reported by the members to the reporting member every hour. I am simply looking for the best way to get to that data. I have not had time to dig into the other items, but I assume they are also being reported by the members more often than the reports currently give me access to.
We are waiting for the raw syslog feed to be available; however, these particular items are not syslog items.
07-26-2016 02:34 PM - edited 07-26-2016 02:36 PM
After playing with this some more, I've figured out how to get back to the source files and get finer granularity in reports and dashboards.
The reports and dashboards that run against the raw data seem to come up quickly and have little effect on the CPU of the reporting member.
I have a search like this now on a dashboard that lets me choose a member or group of members to look at over windows from the last hour up to a day.
index=ib_dns source="/infoblox/var/reporting/query-top-rr-type.txt" $member_str$ | bucket _time span=2m | timechart span=2m eval(sum(COUNT)/120) by host
This seems to create the graph as fast or maybe even faster than using the summary reports.
Is there something I'm missing here, or was I simply overcomplicating this by trying to put the summary and indexes in the middle of the process?
I assume the raw source data goes away faster than the summaries, but if all I care about is "now" for alerting and troubleshooting, is that an issue?
I have a feeling I just haven't quite grasped the Splunk back end of the reporter yet.
07-27-2016 01:55 PM
OK, so after digging much further into this: the search window will let you build the above query and run it; however, that doesn't work in reports and dashboards. On closer inspection, my dashboard was not giving the data I thought it was. The query only seems to work if you click through the suggestions to build a search. It doesn’t even work if you cut and paste it back in later. I can still go back in and run it if I type index=ib_dns in the search window and then choose the txt file as a source, but that is the only way it functions. So I've gone back and built my own summary that runs on a 2 minute schedule and now gives me a report that I can use to build dashboards and reports.
It certainly was not a simple task (or I was not finding all the settings in the GUI), but once I found the advanced edit under “Searches, reports, and alerts” I was able to get all the settings needed so that the data was (nearly) correctly summarized every 2 minutes.
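For anyone trying to reproduce this: the settings in that advanced edit appear to map to a savedsearches.conf stanza underneath. A rough sketch of what a cloned 2 minute summary search might look like at the .conf level — the stanza name and summary index name here are my own examples, not Infoblox built-ins:

```
# Hypothetical stanza for a cloned 2-minute summary search.
# "my_dns_qps_2m" and "ib_dns_summary_2m" are example names.
[my_dns_qps_2m]
enableSched = 1
cron_schedule = */2 * * * *
dispatch.earliest_time = -3m@m
dispatch.latest_time = -1m@m
action.summary_index = 1
action.summary_index._name = ib_dns_summary_2m
search = index=ib_dns sourcetype=ib:dns:query:qps | bucket span=2m _time | sistats avg(COUNT) as COUNT by _time, TYPE, VIEW, host
```

These are standard Splunk saved-search keys; I have not verified exactly which of them the Infoblox GUI exposes versus overrides.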
Once I get the below issues worked out, I'll try and write up a how to.
This generated a few questions:
I was, of course, cloning the built-in alerts \ searches and editing them for shorter time windows. Once I started doing this, moving the summary time windows closer to “now” and making the window itself smaller, I didn’t always have the same number of data sets in a given time window. So basically it appeared as though I had 1, 2 or 3 reports in a 2 minute window because of rounding and snapping to the nearest minute.
The alerts I cloned summed the raw data over the time frame. When searches and dashboards were run against this data, the number of data points appeared consistent at one per minute. I assume this because the conversion from total (summed) queries in 10 minutes to queries per second was hard coded as “divide by 600”.
In my new summary report, I’ve tried averaging the total queries over the time window as part of the summary, so that the stats command would take care of the issue where a report cycle moved from one 2 minute window to the next. By averaging the data sets instead of summing the totals, I’m getting more consistent numbers; however, the problem doesn’t seem to have completely gone away. I’m winding up with “extra” queries sometimes; it appears that some data is “rounding” into two different 2 minute blocks. Maybe…
Can you speak to how this issue was addressed in the built in reports? I don’t see anything that is catching this “error”.
Secondly, I’ve added another summary that I assume will grow the DNS data much more rapidly on the reporting member. I could find no way through the GUI to tell the reporting member to scavenge this data on a different cycle than the rest of the reporting data. I would like to keep these more granular files for only something like a week. Is there any way to manually go in and remove these files?
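In plain Splunk, retention is set per index in indexes.conf via frozenTimePeriodInSecs, so if the 2 minute summaries go into their own index they could, in principle, be aged out on their own schedule. I have not confirmed whether the Infoblox reporting member exposes or preserves this setting, so treat this as a sketch (the index name is my example from above):

```
# Hypothetical: age out the fine-grained 2-minute summary index
# after ~1 week (604800 seconds). Index name is an example.
[ib_dns_summary_2m]
frozenTimePeriodInSecs = 604800
```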
07-28-2016 09:44 AM
Just another note: if you create the every-2-minute jobs, make sure to set the expire time short. The searches I cloned had a 24 hour expire time, so I came in today to this message on the reporter.
"Too many search jobs found in the dispatch directory (found=3403, warning level=3000). This could negatively impact Splunk's performance, consider removing some of the old search jobs."
I found the expire time in the advanced edit and the job count is dropping.
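The expire time appears to correspond to dispatch.ttl in savedsearches.conf. Something like the following (again using my example stanza name, not a built-in) keeps each finished 2 minute job's artifacts around for only a few minutes instead of 24 hours:

```
# Hypothetical: shorten how long finished search jobs
# linger in the dispatch directory (seconds).
[my_dns_qps_2m]
dispatch.ttl = 300
```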
07-28-2016 02:11 PM
After further educating myself on Splunk, indexes and summary indexes, I think I’ve come to a solution for the real-time data. A simple query in a dashboard against index=ib_dns sourcetype=ib:dns:query:qps will get to the information I was looking for in near real time. It gives 1 minute resolution, and if the time window is set up to allow it, it will update automatically as new data comes in. I had assumed this was not something already built in as a canned report because it would have a large CPU cost. I built a dashboard with 10 DNS servers running a 1 hour window, updating in real time. If this caused a CPU hit it was maybe 1%. Overall I think this will be less of an impact than trying to build a report every two minutes, 24x7.
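The dashboard panel search is nothing fancy; roughly this sketch, where $member_str$ is a token from my own form inputs and the divide-by-60 assumes COUNT accumulates per 1 minute interval (mirroring the divide-by-120 the 2 minute raw search used):

```
index=ib_dns sourcetype=ib:dns:query:qps $member_str$
  | timechart span=1m eval(sum(COUNT)/60) by host
```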
Is there a reason that this is not a recommended solution?
I’m only looking to do this for real-time troubleshooting, so I don’t want long time windows or need extended history. This is for things like:
Did the traffic load between this group of servers change when the new anycast route was injected?
Did the configuration change that the application group just made stop their application from pounding on the DNS server?
07-28-2016 02:25 PM
I didn't see your reply. The search that I was running every 2 minutes was this:
index=ib_dns sourcetype=ib:dns:query:qps | bucket span=2m _time | sistats avg(COUNT) as COUNT by _time, TYPE, VIEW, host
I was running it with different variations of every 2 minutes: now to -2m, or -1m@m to -3m@m.
It didn't seem to matter if I had the sistats as an avg or a sum; every once in a while the queries per second would be much larger than actual when I used this data. I had my reports running against 4 members whose normal query rate I know well, and once or twice an hour their 2 minute average would be about double what I expected. I'm guessing this has to do with some of the markers on where the summary was stopping and starting, but I wasn't sure. I finally decided that maybe there was some setting left over from the every-10-minute report I cloned from that I just couldn't find.
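If anyone else hits the doubling: my current theory is overlap between the dispatch window and the bucket boundaries. A window like now to -2m isn't snapped, so the same 2 minute bucket can be written by two consecutive runs and then counted twice at report time. Snapping both window edges and lagging by one full bucket should make each run cover exactly one closed, non-overlapping bucket. A sketch, using my cloned search:

```
# Run on an even-minute cron (*/2 * * * *) with both window
# edges snapped, so each run covers exactly one closed bucket:
#   dispatch.earliest_time = -4m@m
#   dispatch.latest_time   = -2m@m
index=ib_dns sourcetype=ib:dns:query:qps
  | bucket span=2m _time
  | sistats avg(COUNT) as COUNT by _time, TYPE, VIEW, host
```

This is an untested hypothesis about the cause, not something I've confirmed against the built-in reports.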
The more I read while troubleshooting the issue, the less what I was doing seemed like the best solution for what I wanted. That's where I decided to go back, start over with the "raw" data, and see what other solutions I could come up with.