
Debug alert \ dashboard for DNS servers that have an above-normal query rate.

[ Edited ]
Expert
Posts: 181

I'm trying to run a report \ dashboard that will give me the members whose DNS queries/sec have increased to 1.5 times the rate for the same hour last week.  With some other intelligence wrapped around it, I hope to do some alerting with this same query.

Most of the time the search below works.   But sometimes the "last week" time frame has 0 queries, or maybe 1% of the queries, for a handful of members.   It will happen intermittently for 10 to 20 minutes at a time and for only a handful of grid members out of 100+.  If you keep re-running the report, suddenly that hour last week shows the correct number of queries again and the report will be "correct".

Looking at the raw data, or running the "last week" query manually outside of the full search below, shows the correct number of queries for the member over the same time frame where the full report is currently showing 0 queries.  It seems to be only when I put this all together in this format that it doesn't work consistently.

Is there some sort of rounding problem with the time frames I'm giving it?  Am I running something out of memory or timing out the search?  Any ideas on troubleshooting this further?

 

**********  Edited: Working correctly now.  **********

 

index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h@h latest=@h
            | stats sum(COUNT) as TodayLastHour by host
            | join host [search index=ib_dns sourcetype=ib:dns:query:qps earliest=-169h@h latest=-168h@h
            | stats sum(COUNT) as LastWeekLastHour by host]
            | where TodayLastHour > 1.5 * LastWeekLastHour
            | eval PercentChange=round(((TodayLastHour/LastWeekLastHour) * 100),0)
            | eval TodayQPSLastHour=round((TodayLastHour/3600),1)
            | eval LastWeekQPSLastHour=round((LastWeekLastHour/3600),1)
            | rename host as Member TodayQPSLastHour as "Avg QPS Last Hour" LastWeekQPSLastHour as "Avg QPS Last Week Last Hour"
            | sort - PercentChange
            | fields - TodayLastHour LastWeekLastHour PercentChange
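
For anyone checking the math on the offsets: 7 days × 24 hours = 168 hours, so earliest=-169h@h latest=-168h@h is exactly the same one-hour window one week back. The two-weeks-back pair farther down the thread, -337h@h to -336h@h, is the same idea with 2 × 168 = 336.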


 

Re: Debug alert \ dashboard for DNS servers that have an above-normal query rate.

[ Edited ]
Adviser
Posts: 97

I can't comment too much on this, as the "where" statement always returns no results in my test environment, possibly because I don't have any 1.5X increases.

 

One thing I can suggest, though, is to use an anomaly detection algorithm instead. This will catch any sudden spike or dip in volume.

 

http://docs.splunk.com/Documentation/SplunkCloud/6.6.3/SearchReference/Anomalydetection
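
A minimal sketch of what that could look like on hourly totals (the 1-hour span and action=filter here are just illustrative choices, not tuned to your data):

index=ib_dns sourcetype=ib:dns:query:qps
            | bin _time span=1h
            | stats sum(COUNT) as HourlyQueries by _time host
            | anomalydetection action=filter HourlyQueries

action=filter keeps only the events the command flags as anomalous, which is usually what you want to feed an alert.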

 

Re: Debug alert \ dashboard for DNS servers that have an above-normal query rate.

Expert
Posts: 181

It looks like the subsearch is timing out.   If I run it with a "host = wildcard" in both searches that limits it to around 20 members, it works correctly.   But if I turn it loose on the full grid, the subsearch of last week's data times out partway through.   I messed with the appendcols timeout and max-results settings and put them crazy high, and it made no difference.     The search completes in < 2 seconds with no errors, regardless of whether it's on 1 host or the entire grid, from what I can see in the logs.   I see some threads out there on limits and issues with subsearches in Splunk... most recommend just not using them because of the same kinds of issues I'm seeing.

I found a new example that looks at the average of the last 4 weeks of hourly data.    It gives 100% accurate results but takes 10 times as long, even if I only run it on one member and limit it to only last week's data.   The example had it looking at 4 weeks of data, and that took about 4 minutes to run on the grid, but it was correct when it finally finished.   I've dropped it back to only 2 weeks and changed over the summary report.  It is better, but it still takes around a minute to run on the full grid.

That would likely be OK for an alert, but not really OK for a home dashboard widget.   The first example, I'm sure because of the 2 very precise time frames, is a much faster search.   Without subsearches, I can't figure out how to pull just the "one hour last week" time window (a subsearch-free sketch is below, after the summary-index search).  Pulling weeks' worth of data and then searching for the 1-hour window is just not going to be efficient.



index=ib_dns_summary report=si_dns_member_qps_trend earliest=-15d@h latest=@h
            | eval hourstart=relative_time(now(),"-1h@h")
            | eval hourend=relative_time(now(),"@h")
            | eval weekwrap="0 1 2"
            | makemv weekwrap
            | mvexpand weekwrap
            | eval _time = _time + weekwrap*604800
            | eval week = case(weekwrap=0,"current",true(),"prior_".weekwrap)
            | where (_time <= hourend) AND (_time >= hourstart)
            | eval Time = strftime(_time,"%H:%M")
            | stats sum(QCOUNT) as QCOUNT by orig_host Time week
            | appendpipe [| where like(week,"p%") | stats avg(QCOUNT) as QCOUNT by orig_host Time | eval week="prior_avg", QCOUNT=ceiling(QCOUNT)]
            | eval {week} = QCOUNT
            | stats sum(*) as * by orig_host Time
            | table orig_host Time curr* prior*
            | where current > prior_avg * 1.5
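
On the "one hour last week without a subsearch" point: the search command will accept OR'd time ranges, so something along these lines might pull just the two windows without a join or appendcols. This is only a sketch I haven't timed against the grid, and it assumes the two windows never overlap:

index=ib_dns sourcetype=ib:dns:query:qps ((earliest=-1h@h latest=@h) OR (earliest=-169h@h latest=-168h@h))
            | eval Window=if(_time >= relative_time(now(),"-1h@h"), "TodayLastHour", "LastWeekLastHour")
            | chart sum(COUNT) over host by Window
            | where TodayLastHour > 1.5 * LastWeekLastHour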

The built-in anomaly detection functions are interesting, but with the roller coaster that is our DNS and DHCP QPS, they tend to give a lot of false positives during ramp-up and ramp-down times.  If you stretch the time frames out, then they miss some spikes or are very delayed in reporting them.   I never could find a good middle ground, or a way to tell them to look specifically at weekly time frames when doing their analysis.  It may also have to do with the fact that my last "stats" class was a very long time ago, so some of the settings are a bit meaningless to me.

 

Re: Debug alert \ dashboard for DNS servers that have an above-normal query rate.

[ Edited ]
Expert
Posts: 181

I found the issue and made the change in the first post.  The problem is that appendcols assumes both result sets are sorted the same and have the same number of rows.    That is not the case with this data.   When you had one member, it worked great.   When you had a group of members that were similar (covered by the same wildcard), I couldn't tell that it was broken, but it likely was.

Just need to change from appendcols to join.
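
To see the misalignment on its own, here is a contrived, self-contained example using makeresults (not our DNS data; the subsearch field is named sub_host only so you can see which row its values land on). The subsearch is missing host1, so appendcols pastes host2's count onto host1's row; a join on host would have matched by field value instead:

| makeresults count=3
            | streamstats count as n
            | eval host="host".n, TodayLastHour=n*1000
            | fields host TodayLastHour
            | appendcols [| makeresults count=2 | streamstats count as n | eval sub_host="host".(n+1), LastWeekLastHour=n*100 | fields sub_host LastWeekLastHour]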


Here is the code, looking at the last hour and at the same hour last week and 2 weeks ago.  It compares the average of the previous 2 weeks to the current 1-hour window.   This still runs in about 2 seconds on my grid.   It should be easy to adjust this code from here to look at longer past time frames to get a better "normal", or at wider current windows to smooth out some spikes.

Could go back and look at the summary data over a full day as well and do some less urgent alerting \ reporting…




index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h@h latest=@h
            | stats sum(COUNT) as TodayLastHour by host
            | join host [search index=ib_dns sourcetype=ib:dns:query:qps earliest=-169h@h latest=-168h@h
            | stats sum(COUNT) as LastWeekLastHour by host]
            | join host [search index=ib_dns sourcetype=ib:dns:query:qps earliest=-337h@h latest=-336h@h
            | stats sum(COUNT) as TwoWeeksLastHour by host]
            | eval Last2Avg=((LastWeekLastHour+TwoWeeksLastHour)/2)
            | where TodayLastHour > 1.5 * Last2Avg
            | eval PercentChange=round(((TodayLastHour/Last2Avg) * 100),0)
            | eval TodayQPSLastHour=round((TodayLastHour/3600),1)
            | eval LastWeekQPSLastHour=round((Last2Avg/3600),1)
            | rename host as Member TodayQPSLastHour as "Avg QPS Last Hour" LastWeekQPSLastHour as "Avg QPS Last 2 Weeks Last Hour"
            | sort - PercentChange
            | fields - TodayLastHour LastWeekLastHour PercentChange TwoWeeksLastHour Last2Avg

 
