Alert on KPIs - (Part 1 & 2)

Adviser
For all of the alerts below, the trigger condition is "number of search results > 0".
All reports marked with * use the ib_syslog index and require syslog redirection to Reporting, which is available in NIOS 7.3.200 and newer.

 

DNS
- Cache hit ratio
 
Threshold
sourcetype=ib:dns:query:cache_hit_rate index=ib_dns (HITS>0 OR MISSES>0)  | lookup dns_viewkey_displayname_lookup VIEW output display_name | eval PERCENT=if(HITS+MISSES > 0,(HITS*100/(HITS+MISSES)),0) | bucket span=1m _time | stats last(PERCENT) as CHR by host | search CHR<80
 
Variation
sourcetype=ib:dns:query:cache_hit_rate index=ib_dns (HITS>0 OR MISSES>0)  earliest=-1h latest=now | lookup dns_viewkey_displayname_lookup VIEW output display_name | eval PERCENT=if(HITS+MISSES > 0,(HITS*100/(HITS+MISSES)),0) | stats last(PERCENT) as lastCHR by host
| join type=outer [search sourcetype=ib:dns:query:cache_hit_rate index=ib_dns (HITS>0 OR MISSES>0)  earliest=-4h latest=now | lookup dns_viewkey_displayname_lookup VIEW output display_name | eval PERCENT=if(HITS+MISSES > 0,(HITS*100/(HITS+MISSES)),0) | stats avg(PERCENT) as avgCHRlast4h by host] | eval CHRdiff=(avgCHRlast4h-lastCHR) | search CHRdiff>20
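
To make the arithmetic in the two cache hit ratio searches easier to follow, here is a minimal Python sketch of the same calculation. It is only an illustration, not something Reporting runs: the function names and sample numbers are made up, and the 80% floor and 20-point drop are simply the values used in the searches above.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Percentage of lookups answered from cache; 0 when there was no traffic."""
    total = hits + misses
    return hits * 100 / total if total > 0 else 0.0


def chr_alerts(last_chr: float, avg_chr_4h: float,
               floor: float = 80.0, max_drop: float = 20.0) -> dict:
    """Evaluate both alert conditions for one host."""
    return {
        # Threshold search: "search CHR<80"
        "threshold": last_chr < floor,
        # Variation search: "eval CHRdiff=(avgCHRlast4h-lastCHR) | search CHRdiff>20"
        "variation": (avg_chr_4h - last_chr) > max_drop,
    }


# Hypothetical host: 600 hits and 400 misses in the latest sample,
# against an 85% average over the last 4 hours.
last = cache_hit_ratio(hits=600, misses=400)
print(last, chr_alerts(last, avg_chr_4h=85.0))
```

Note that the variation check only fires on a drop: a cache hit ratio that rises above its 4-hour average gives a negative CHRdiff and never matches.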

- Top DNS clients
 
Threshold
index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-1h latest=now | stats sum(COUNT) as Countlast1h by CLIENT
| join type=outer [search index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-7d latest=-1h |stats sum(COUNT) as CountEarlier by CLIENT] | search Countlast1h > 100
 
Variation
index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-5h latest=now | stats avg(COUNT) as Avglast5h by CLIENT
| join type=outer [search index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-7d latest=-5h |stats avg(COUNT) as AvgEarlier by CLIENT] | eval Change=(100*Avglast5h/AvgEarlier)-100 | search Change > 50
 
New client occurrence
index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-5h latest=now | stats sum(COUNT) as Countlast5h by CLIENT
| join type=outer [search index=ib_dns sourcetype="ib:dns:query:top_clients" earliest=-7d latest=-5h |stats sum(COUNT) as CountEarlier by CLIENT] | where isnull(CountEarlier)
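
The three top-client searches differ only in what they do after the outer join. As a rough illustration, the Python sketch below reproduces the "Variation" and "New client occurrence" logic on plain dictionaries. The client addresses and averages are invented, and the function names are hypothetical. A client with no baseline gets no Change value, which is why the variation search cannot catch it and a separate isnull() search is needed.

```python
from typing import Dict, List, Optional, Tuple


def percent_change(recent: float, earlier: Optional[float]) -> Optional[float]:
    """Same formula as: eval Change=(100*Avglast5h/AvgEarlier)-100."""
    if earlier is None or earlier == 0:
        return None  # no usable baseline, so no Change value
    return 100 * recent / earlier - 100


def classify(recent: Dict[str, float], earlier: Dict[str, float],
             limit: float = 50.0) -> Tuple[List[str], List[str]]:
    """Return (new clients, clients whose average rose by more than limit %)."""
    new_clients = sorted(c for c in recent if c not in earlier)  # where isnull(...)
    changed = []
    for client, value in recent.items():
        change = percent_change(value, earlier.get(client))
        if change is not None and change > limit:                # search Change > 50
            changed.append(client)
    return new_clients, sorted(changed)


# Hypothetical per-client averages: last 5 hours vs. the rest of the week.
recent = {"10.0.0.1": 300.0, "10.0.0.2": 110.0, "10.0.0.9": 40.0}
earlier = {"10.0.0.1": 100.0, "10.0.0.2": 100.0}
print(classify(recent, earlier))
```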
 
- Top DNS domains
 
Threshold
 
index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names  earliest=-1h latest=now | stats sum(COUNT) as Countlast1h by HNAME
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names earliest=-7d latest=-1h |stats sum(COUNT) as CountEarlier by HNAME] | search Countlast1h > 100
 
Variation
index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names  earliest=-5h latest=now  | stats avg(COUNT) as Avglast5h by HNAME
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names earliest=-7d latest=-5h |stats avg(COUNT) as AvgEarlier by HNAME] | eval Change=(100*Avglast5h/AvgEarlier)-100 | search Change > 50
 
 
New domain occurrence
index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names  earliest=-5h latest=now  | stats avg(COUNT) as Avglast5h by HNAME
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:top_requested_domain_names earliest=-7d latest=-5h |stats avg(COUNT) as AvgEarlier by HNAME] | where isnull(AvgEarlier)

- Query rate by server (threshold & variation)

Threshold
index=ib_dns sourcetype=ib:dns:query:by_member earliest=-1h latest=now | bucket span=10m _time | stats sum(eval(QCOUNT/600)) as QPS by _time host | search QPS > 100

Variation
index=ib_dns sourcetype=ib:dns:query:by_member earliest=-1h latest=now  | stats sum(eval(QCOUNT/3600)) as Avglast1h by host
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:by_member earliest=-7d latest=-1h | stats sum(eval(QCOUNT/((7*24-1)*3600))) as AvgEarlier by host] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50
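
The divisors in the query rate searches are just window lengths in seconds: 600 for a 10-minute bucket, 3600 for the last hour, and (7*24-1)*3600 for the remainder of the 7-day window. The following Python sketch spells that out under the assumption that QCOUNT is a plain query count per event; the totals are invented for the example.

```python
BUCKET_SECONDS = 10 * 60                  # bucket span=10m  -> QCOUNT/600
RECENT_SECONDS = 3600                     # earliest=-1h     -> QCOUNT/3600
BASELINE_SECONDS = (7 * 24 - 1) * 3600    # -7d to -1h, i.e. 167 hours


def qps(total_queries: int, window_seconds: int) -> float:
    """Average queries per second over a window."""
    return total_queries / window_seconds


def percent_change(recent_qps: float, baseline_qps: float) -> float:
    """Same formula as: eval Change=(100*Avglast1h/AvgEarlier)-100."""
    return 100 * recent_qps / baseline_qps - 100


# Hypothetical totals for one member.
bucket_qps = qps(90_000, BUCKET_SECONDS)            # threshold search, one bucket
recent_qps = qps(720_000, RECENT_SECONDS)           # last hour
baseline_qps = qps(60_120_000, BASELINE_SECONDS)    # previous 167 hours
print(bucket_qps, recent_qps, baseline_qps,
      percent_change(recent_qps, baseline_qps))
```

With these numbers the member would trip both alerts: 150 QPS in the bucket is above the 100 QPS threshold, and the last hour is 100% above the weekly baseline.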
 
- Query rate by type (threshold & variation)
Threshold
index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h latest=now  | bucket span=1m _time | stats sum(eval(COUNT/60)) as QPS by _time TYPE | search TYPE = A QPS > 3000
index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h latest=now  | bucket span=1m _time | stats sum(eval(COUNT/60)) as QPS by _time TYPE | search TYPE = ANY QPS > 30
index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h latest=now  | bucket span=1m _time | stats sum(eval(COUNT/60)) as QPS by _time TYPE | search TYPE = TXT QPS > 30

Variation
index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h latest=now  | stats sum(eval(COUNT/3600)) as Avglast1h by host TYPE
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:qps earliest=-7d latest=-1h | stats sum(eval(COUNT/((7*24-1)*3600))) as AvgEarlier by host TYPE] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search TYPE = A Change > 50
 
index=ib_dns sourcetype=ib:dns:query:qps earliest=-1h latest=now  | stats sum(eval(COUNT/3600)) as Avglast1h by host TYPE
| join type=outer [search index=ib_dns sourcetype=ib:dns:query:qps earliest=-7d latest=-1h | stats sum(eval(COUNT/((7*24-1)*3600))) as AvgEarlier by host TYPE] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search TYPE = ANY Change > 500

- Top DNS Timed Out queries
 
Threshold
index=ib_dns_summary report=si_top_timeout_queries earliest=-1h latest=now | stats avg(COUNT) as Queries by orig_host, NAME | search Queries > 50
 
Variation
index=ib_dns_summary report=si_top_timeout_queries  earliest=-1h latest=now  | stats avg(COUNT) as Avglast1h by orig_host, NAME
| join type=outer [search index=ib_dns_summary report=si_top_timeout_queries earliest=-7d latest=-1h | stats avg(COUNT) as AvgEarlier by orig_host, NAME] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50
 
- Top DNS Servfail received
Threshold
index=ib_dns_summary report=si_top_servfail_received_queries   earliest=-1h latest=now  | stats avg(COUNT) as Queries by orig_host, NAME  | search Queries > 100
 
Variation
index=ib_dns_summary report=si_top_servfail_received_queries   earliest=-1h latest=now  | stats avg(COUNT) as Avglast1h by orig_host, NAME | join type=outer [search index=ib_dns_summary report=si_top_servfail_received_queries earliest=-7d latest=-1h | stats avg(COUNT) as AvgEarlier by orig_host, NAME] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50
 
- Top DNS Servfail sent
Threshold
index=ib_dns_summary report=si_top_servfail_sent_queries   earliest=-1h latest=now  | stats avg(COUNT) as Queries by orig_host, NAME  | search Queries > 100
 
Variation
index=ib_dns_summary report=si_top_servfail_sent_queries earliest=-1h latest=now  | stats avg(COUNT) as Avglast1h by orig_host, NAME | join type=outer [search index=ib_dns_summary report=si_top_servfail_sent_queries earliest=-7d latest=-1h | stats avg(COUNT) as AvgEarlier by orig_host, NAME] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50
 
- fetches per zones / server (activation)*
Not built yet: I lack a NIOS 7.3.200 environment with the related logs.

- percentage of NXDOMAIN responses*
Use the native SNMP trap from the DNS > Security feature.

- DNS recursive cache size*
Threshold
index=ib_syslog "Recursion cache view" earliest=-1h latest=now | search size >  1000000000
 
Variation
index=ib_syslog "Recursion cache view"  earliest=-1h latest=now  | stats avg(size) as Avglast1h by host
| join type=outer [search index="ib_syslog" "Recursion cache view"  earliest=-7d latest=-1h | stats avg(size) as AvgEarlier by host] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50

- Concurrent recursive clients*
 
Field extraction
Recursion\s+client\s+quota[^=\n]*=\s+(?P<used>\d+)/(?P<max>\d+)/(?P<soft_limit>\d+)/(?P<s_over>\d+)/(?P<hard_limit>\d+)/(?P<h_over>\d+)/(?P<low_pri>\d+)
 
I built this field extraction using the Splunk field extractor documented here:
 
Threshold
index="ib_syslog" "Recursion client quota"  earliest=-1h latest=now | search used > 800
 
Variation
index="ib_syslog" "Recursion client quota"  earliest=-1h latest=now  | stats avg(used) as Avglast1h by host
| join type=outer [search index="ib_syslog" "Recursion client quota"  earliest=-7d latest=-1h | stats avg(used) as AvgEarlier by host] | eval Change=(100*Avglast1h/AvgEarlier)-100 | search Change > 50
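
The field extraction above uses the (?P<name>...) named-group syntax, which Python's re module also accepts, so it can be tried outside Splunk before saving it. This is only a sketch: the sample syslog line is hypothetical (check it against a real line from your own ib_syslog index), and the 800 limit is the value from the threshold search.

```python
import re

# The extraction regex from the post, split across lines for readability.
QUOTA_RE = re.compile(
    r"Recursion\s+client\s+quota[^=\n]*=\s+"
    r"(?P<used>\d+)/(?P<max>\d+)/(?P<soft_limit>\d+)/(?P<s_over>\d+)/"
    r"(?P<hard_limit>\d+)/(?P<h_over>\d+)/(?P<low_pri>\d+)"
)

# Hypothetical log line, made up for this example.
SAMPLE_LINE = (
    "named[1234]: Recursion client quota: "
    "used/max/soft-limit/s-over/hard-limit/h-over/low-pri = 850/1000/900/0/1000/0/0"
)


def extract_quota(line: str):
    """Return the seven quota fields as integers, or None if the line does not match."""
    match = QUOTA_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


fields = extract_quota(SAMPLE_LINE)
print(fields)
print("alert" if fields and fields["used"] > 800 else "ok")  # search used > 800
```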


DDNS
- DDNS update clients & variation*
Not built yet: I lack a NIOS 7.3.200 environment with the related logs.
 
- DDNS update domain & variation*
Not built yet: I lack a NIOS 7.3.200 environment with the related logs.

- Failed updates by reason (YXDOMAIN, etc)*
Threshold
index=ib_syslog dhcp_updater_default | rex "client (?<Client>[^#]+).+zone '(?<Zone>[^\/]+)\/IN'.+unsuccessful: (?<FQDN>[^:]+):.*\((?<Error>.+)\)" | stats count as Errors by Client Zone FQDN Error | search Errors > 50
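
The rex pattern in this search can also be checked offline. The Python sketch below uses the same pattern with one change: Splunk's (?<Name>...) groups are written as (?P<Name>...), which is what Python requires. The log lines are hypothetical and only shaped to match the pattern, so test against real entries before relying on it.

```python
import re
from collections import Counter

# Same pattern as the rex command, with Python-style named groups.
DDNS_FAIL_RE = re.compile(
    r"client (?P<Client>[^#]+).+zone '(?P<Zone>[^/]+)/IN'"
    r".+unsuccessful: (?P<FQDN>[^:]+):.*\((?P<Error>.+)\)"
)

# Hypothetical log lines, made up for this example.
SAMPLE_LINES = [
    "client 10.0.0.5#49152: updating zone 'example.com/IN': update unsuccessful: "
    "host1.example.com: 'name not in use' prerequisite not satisfied (YXDOMAIN)",
    "client 10.0.0.5#49153: updating zone 'example.com/IN': update unsuccessful: "
    "host1.example.com: 'name not in use' prerequisite not satisfied (YXDOMAIN)",
    "client 10.0.0.7#50000: updating zone 'example.com/IN': adding an RR at "
    "'host2.example.com' A",
]


def count_errors(lines):
    """Equivalent of: stats count as Errors by Client Zone FQDN Error."""
    counts = Counter()
    for line in lines:
        match = DDNS_FAIL_RE.search(line)
        if match:
            counts[(match["Client"], match["Zone"], match["FQDN"], match["Error"])] += 1
    return counts


errors = count_errors(SAMPLE_LINES)
for key, total in errors.items():
    # The search alerts on Errors > 50; here the totals are simply printed.
    print(key, total)
```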
Check out our new Tech docs website at http://docs.infobox.com for latest documentation on Infoblox products.

Re: Alert on KPIs - (Part 1 & 2)

Adviser

Updated with part 2.


Re: Alert on KPIs - (Part 1 & 2)

Expert

I really like what you're doing with this. I think I understand most of the 'code' you're putting in, but could you add a couple of bullet-point explanations of what is going on with some of the more complex lines? It may save me and others some trial and error as we tweak it to fit our environment.

Maybe also note which ones will only work in 7.3.200, as opposed to the ones that also work in the primary 7.3 line.


Re: Alert on KPIs - (Part 1 & 2)

Community Manager

In general, separate the search clauses at the '|' character; this makes them easier to decode.

E.g.,

 

 

index=ib_syslog "Recursion cache view"  earliest=-1h latest=now 
| stats avg(size) as Avglast1h by host
| join type=outer [search index="ib_syslog" "Recursion cache view" earliest=-7d latest=-1h
| stats avg(size) as AvgEarlier by host] 
| eval Change=(100*Avglast1h/AvgEarlier)-100 
| search Change > 50

The basic commands being used here are:

 

 

'eval': creates a new field from an expression, which can reference one or more existing fields

   ( NewField = expression )

 

'stats': computes an aggregate (count, sum, avg, last, ...) over the events, optionally renaming the result and grouping it by a field

   function(fieldname) as RenamedfieldName by GroupingField
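
For readers who think more easily in a general-purpose language, the piped search above can be read as four stages: stats, join, eval, search. The Python sketch below mimics them on lists of dictionaries. It is an analogy rather than a description of how Splunk executes the search, and the helper names, hosts and sizes are invented.

```python
from collections import defaultdict
from statistics import mean


def stats_avg_by(events, value_field, group_field):
    """stats avg(value_field) ... by group_field"""
    groups = defaultdict(list)
    for event in events:
        if value_field in event and group_field in event:
            groups[event[group_field]].append(event[value_field])
    return {key: mean(values) for key, values in groups.items()}


def variation_alert(recent_events, earlier_events, limit=50):
    recent = stats_avg_by(recent_events, "size", "host")    # ... as Avglast1h by host
    earlier = stats_avg_by(earlier_events, "size", "host")  # subsearch: AvgEarlier by host
    alerts = {}
    for host, avg_last_1h in recent.items():                # join type=outer on host
        avg_earlier = earlier.get(host)
        if not avg_earlier:
            continue                                        # no baseline, no Change value
        change = 100 * avg_last_1h / avg_earlier - 100      # eval Change=...
        if change > limit:                                  # search Change > 50
            alerts[host] = change
    return alerts


# Hypothetical events: cache sizes reported per host.
recent_events = [{"host": "ns1", "size": 300}, {"host": "ns1", "size": 500},
                 {"host": "ns2", "size": 100}]
earlier_events = [{"host": "ns1", "size": 200}, {"host": "ns2", "size": 100}]
print(variation_alert(recent_events, earlier_events))
```

Here only ns1 is reported: its average doubled (a 100% change), while ns2 stayed flat.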

 

 

 

 


Re: Alert on KPIs - (Part 1 & 2)

Techie

Hi!

 

Queries:

 

1. What are the recommended threshold values for the following?

• NXDOMAIN
• Referral
• NXRRSET
• Failure
• Recursion

• Cache Hit Ratio

 

(e.g.

• Recursion: should be less than SUCCESS; normally it is only about 15% of the SUCCESS rate

• NXDOMAIN: about 5% of total queries is normal

• Cache Hit Ratio: 90% to 100%)

 

2. What is the reference document for this?

 

3. How are DNS queries computed (e.g. Total Queries = NXDOMAIN + NXRRSET + SUCCESS)? Which document contains the computation?

 

 

Thank you for your assistance.

 

Good day and have a wonderful weekend!

 

Best regards,

Dandy


Re: Alert on KPIs - (Part 1 & 2)

Techie

Hi!

 

image.png

image.png

 

As shown in the figures above, which are based on data extracted from the Grid Master, NXDOMAIN exceeded the threshold values.

- What variations or differences could explain why it exceeds the threshold?

- What are the possible reasons and contributing factors, and how can we justify it?

 

Your expert advice would be appreciated.

 

Thank you and good day.

 

Best regards,

Dandy

 

 

 
