Recursive DNS Performance & Troubleshoot Dashboard - v5

[ Edited ]
Adviser
Posts: 43
24511     16

In large environments, when you are in charge of DNS, there is always a day when an application or Internet access breaks and DNS ends up in the spotlight.

The problem is that DNS is often a black box with no indicators, and manual troubleshooting requires deep DNS expertise.

After this experience, it became clear: "You Can't Manage What You Don't Measure".

 

This Recursive DNS Performance & Troubleshoot dashboard is meant to give, within a minute, a clear view of all the key indicators of a recursive DNS server, so you can tell whether DNS behavior is optimal or not.

 

Three inputs must be set to start:

- Grid member: the member you want the report for

- Member Type: the hardware model

- Time: the period you want to investigate; this can be a real-time 4-hour window for a live display

 

[Dashboard screenshots: 1.png, 2.png, 3.png, 5.jpg, 4.png, engine.png, Capture d’écran 2017-07-05 à 12.37.21.png, original.png, 8.png, 9.png, 10.png, 11.png, 12.jpg]

 

 

1) Performance

- Queries per second (QPS)

This line graph shows the number of DNS queries per second (QPS) processed.

- Cache hit ratio (CHR) %

This line graph shows the cache hit ratio (CHR): the share of queries answered from the DNS engine cache, as opposed to queries for which the DNS engine has to recurse to get the answer, potentially issuing multiple queries.

Typical latency for an entry in cache is about 1 ms. When the entry is not in cache, about 300 ms is common. CHR is a very important value when it comes to analyzing DNS performance: at 0% CHR, DNS performance is only about 10% of the maximum QPS at 100% CHR.

CHR for a service provider (SP) is generally between 85% and 98%.

CHR for an enterprise is generally over 75%.

 

- Latency in ms

This line graph shows the DNS engine latency.

A typical latency is about 1 ms.

- Max QPS at 100% Cache Hit Ratio (CHR)

This figure shows the maximum queries per second (QPS) for the selected member type.

- QPS, CHR and Max QPS for this CHR comparison

This line graph displays the actual QPS and CHR, and derives the Max QPS for the actual CHR.

- DNS engine usage % (Max QPS for CHR vs QPS), CPU & CHR

This line graph displays the CPU and CHR, and determines the DNS engine usage percentage, which is the ratio between the actual QPS and the Max QPS for this CHR. A projection of the DNS engine usage % based on the selected period is also displayed.

- DNS Engine maximum load %

This figure shows the maximum DNS engine usage percentage over the selected period.
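
To make the arithmetic behind the performance panels concrete, here is a minimal Python sketch. It assumes CHR is computed as hits / (hits + misses), reuses the typical 1 ms and 300 ms latencies quoted above, and assumes the Max QPS scales linearly from roughly 10% of the rated maximum at 0% CHR up to 100% at 100% CHR. The post does not give the dashboard's actual interpolation, so the linear model, the function names and the 50,000 QPS rating are illustrative assumptions, not the dashboard's search logic.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Return CHR as a percentage, assuming CHR = hits / (hits + misses)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def expected_latency_ms(chr_percent: float, hit_ms: float = 1.0,
                        miss_ms: float = 300.0) -> float:
    """Blend the typical cached (1 ms) and recursive (300 ms) latencies by CHR."""
    hit_fraction = chr_percent / 100.0
    return hit_fraction * hit_ms + (1.0 - hit_fraction) * miss_ms


def max_qps_for_chr(rated_max_qps: float, chr_percent: float,
                    floor_fraction: float = 0.10) -> float:
    """Assumed linear model: floor_fraction of the rated QPS at 0% CHR,
    the full rated QPS at 100% CHR."""
    chr_fraction = min(max(chr_percent, 0.0), 100.0) / 100.0
    return rated_max_qps * (floor_fraction + (1.0 - floor_fraction) * chr_fraction)


def engine_usage_percent(actual_qps: float, rated_max_qps: float,
                         chr_percent: float) -> float:
    """DNS engine usage %: actual QPS divided by the Max QPS for the current CHR."""
    return 100.0 * actual_qps / max_qps_for_chr(rated_max_qps, chr_percent)


if __name__ == "__main__":
    RATED_MAX_QPS = 50_000  # hypothetical rating, not a real model figure
    chr_now = cache_hit_ratio(hits=9_000, misses=1_000)
    print(f"CHR: {chr_now:.1f}%")
    print(f"Expected average latency: {expected_latency_ms(chr_now):.1f} ms")
    print(f"Max QPS at this CHR: {max_qps_for_chr(RATED_MAX_QPS, chr_now):.0f}")
    print(f"Engine usage at 2750 QPS: "
          f"{engine_usage_percent(2_750, RATED_MAX_QPS, chr_now):.1f}%")
```

The sketch shows why the same QPS can mean very different engine loads: a drop in CHR shrinks the denominator, so the usage percentage rises even if traffic is flat.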

 

2) DNS Indicators - Top

- Top 10 DNS Clients

This statistics table shows the top 10 DNS client IPs that have sent the largest volume of DNS requests processed by the DNS engine.

- Top 10 Requested FQDN

This statistics table shows the top 10 domain names that have been queried, with their percentage of the total number of queries in the top index.

- Top 10 Requested FQDN not in Alexa 2000

This statistics table shows the top 10 domains that are not in the Alexa 2000 lookup list.

Note that you can upload any list of FQDNs to the lookup list, such as a local country Alexa list or corporate FQDNs. The CSV format is a single fqdn field.

 

3) DNS Engine Indicators

- Recursion client quota

This line graph displays the number of concurrent recursion clients (used) vs the maximum (max).

When an entry is not in cache, the DNS engine issues DNS queries (recurses) to get the answer, and uses a recursion client to do so. However, the number of concurrent recursion clients has a maximum (1000 by default). When this maximum is reached (under a phantom domain attack, for example), legitimate queries whose answers are not in cache cannot get an answer from the Internet.

- DNS recursive cache size

This line graph displays the cache size, which is an important indicator:

once the cache size goes over 7/8 of its maximum, older entries get cleaned out

a cache size that goes down without any change in the volume and distribution of queries is also a bad sign

a cache size that does not go up after a restart points to an issue in rebuilding a hot cache (e.g. downstream queries not answered)

- DNS engine messages by severity & Top 5 DNS engine messages by severity

This line graph displays messages by severity over time, which helps identify whether an event has generated new types of messages and lets you drill down into those messages.

 

4) DNS problem indicators

- Request time-outs

This line graph displays the number of DNS requests that end in a timeout.

- Top 10 time-outs domains

This statistics table shows the top 10 domains for which queries have ended in a timeout.

- Requests resolved after disabling EDNS

This line graph displays the number of DNS requests for which EDNS had to be disabled in order to get an answer.

- Top 10 domains resolved after disabling EDNS

This statistics table shows the top 10 domains for which EDNS had to be disabled in order to get an answer to queries.

- Requests resolved after reducing EDNS to 512

This line graph displays the number of DNS requests for which the EDNS packet size had to be lowered to 512 (default is 4096) in order to get an answer.

- Top 10 domains resolved after reducing EDNS to 512

This statistics table shows the top 10 domains for which the EDNS packet size had to be lowered to 512 in order to get an answer to queries.

- LAME delegations

This line graph displays the number of requests that could not be resolved due to lame delegations. A lame delegation occurs when an authoritative DNS server (e.g. .com) has a delegation (e.g. lamedelegation.com) to other DNS servers that are not authoritative for this zone.

- Top 10 LAME delegations domains

This statistics table shows the top 10 domains whose resolution ended in a lame delegation.

- Unexpected REFUSED return code

This line graph displays the number of requests that could not be resolved because a DNS server issued a REFUSED answer (often because it is not authoritative for the requested zones; it could also be due to query ACLs).

- Top 10 Unexpected REFUSED return code domains

This statistics table shows the top 10 domains whose resolution failed with a REFUSED answer.

- Unexpected SERVFAIL return code

This line graph displays the number of requests that could not be resolved because a DNS server issued a SERVFAIL answer.

- Top 10 Unexpected SERVFAIL return code domains

This statistics table shows the top 10 domains whose resolution failed with a SERVFAIL answer.

- Unexpected FORMERR return code

This line graph displays the number of requests that could not be resolved because a DNS server issued a FORMERR answer.

- Top 10 Unexpected FORMERR return code domains

This statistics table shows the top 10 domains whose resolution failed with a FORMERR answer.

 

5) Security related indicators

- Fetches per server events

This line graph displays the number of events triggered by the fetches-per-server feature.

This feature ensures that only a limited number of outstanding queries are sent to a given DNS server, to mitigate phantom domain attacks and so make the best use of the DNS recursive client quota for legitimate queries.

A significant volume of these events outside the cache rebuilding period (after a DNS engine restart) is not normal and is likely to indicate a phantom domain attack.

- Top 10 fetches per server IPs

This statistics table shows the top 10 IP addresses of DNS servers for which the fetches-per-server feature has been triggered.

- Fetches per zone events

This line graph displays the number of events triggered by the fetches-per-zone feature.

This feature ensures that only a limited number of outstanding queries are sent for a given zone, to mitigate phantom domain attacks and so make the best use of the DNS recursive client quota for legitimate queries.

A significant volume of these events outside the cache rebuilding period (after a DNS engine restart) is not normal and is likely to indicate a phantom domain attack.

- Top 10 fetches per zone FQDNs

This statistics table shows the top 10 FQDNs for which the fetches-per-zone feature has been triggered.

 

 

Prerequisites:

 

Field extraction:

All of the following are inline extractions (owner: infoblox-admin, app: infoblox, status: Enabled).

ib:syslog : EXTRACT-client_ip,port,fqdn
^(?:[^ \n]* ){7}(?P<client_ip>[^#]+)#(?P<port>\d+)\s+\((?P<fqdn>[^\)]+)

ib:syslog : EXTRACT-dns_view
^[^"\n]*"(?P<dns_view>\w+)

ib:syslog : EXTRACT-fetches-zones
too\s+many\s+simultaneous\s+fetches\s+for\s+(?P<fetches_zone_name>[^\s]*)\s+\(allowed\s+(?P<allowed_fetches_number>\d+),\s+forced\s+(?P<forced_fetches_number>\d+)\)

ib:syslog : EXTRACT-fetches_server
adb:\s+quota\s+(?P<fetches_server_ip>[^\s]*)\s+\((?P<fetches_number>[^/]*)/(?P<fetches_frequency>[^\)]*)\):\s+(atr|avg\.\s+timeout\s+ratio)\s+(?P<fetches_atr>[^,]*),\s+quota\s+\w+\s+to\s+(?P<fetches_quota>\d+)

ib:syslog : EXTRACT-fqdn-resolving
resolving\s+\'(?P<fqdn>[^\']*)\'

ib:syslog : EXTRACT-holddown_time,holddown_IP,holddown_timeout_number
adb:\s+timeout:\s+setting\s+(?P<holddown_time>[^\s]*)\s+second\s+holddown\s+for\s+(?P<holddown_IP>[^\s]*)\s+after\s+(?P<holddown_timeout_number>[^\s]*)\s+timeouts

ib:syslog : EXTRACT-limit,max,avg,soft_limit,limit_over,hard_limit,h_over,est_max_req
clients\s+per\s+query[^=\n]*=\s+(?P<limit>[^/]+)/(?P<max>[^/]+)/(?P<avg>[^/]+)/(?P<soft_limit>[^/]+)/(?P<limit_over>[^/]+)/(?P<hard_limit>[^/]+)/(?P<h_over>[^/]+)/(?P<est_max_req>\d+)

ib:syslog : EXTRACT-recursion-quota
Recursion\s+client\s+quota[^=\n]*=\s+(?P<used>\d+)/(?P<max>\d+)/(?P<soft_limit>\d+)/(?P<s_over>\d+)/(?P<hard_limit>\d+)/(?P<h_over>\d+)/(?P<low_pri>\d+)

ib:syslog : EXTRACT-process,process_id,severity,message
^(?:[^ \n]* ){3}(?P<process>[^\[]+)\[(?P<process_id>\d+)\]:\s+(?P<severity>\w+)\s+(?P<message>.+)
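
Before pasting an extraction into Splunk, it can help to check it against a sample message offline. The sketch below runs the recursion-quota pattern through Python's `re` module, which accepts the same `(?P<name>...)` group syntax for this pattern. The sample line is hypothetical (written to fit the pattern, not copied from an appliance), and treating `hard_limit` as the quota ceiling is an assumption for illustration.

```python
import re

# Pattern copied from the EXTRACT-recursion-quota entry above.
RECURSION_QUOTA = re.compile(
    r"Recursion\s+client\s+quota[^=\n]*=\s+(?P<used>\d+)/(?P<max>\d+)/"
    r"(?P<soft_limit>\d+)/(?P<s_over>\d+)/(?P<hard_limit>\d+)/"
    r"(?P<h_over>\d+)/(?P<low_pri>\d+)"
)

# Hypothetical sample message, not taken from a real appliance.
SAMPLE = ("Recursion client quota: used/max/soft-limit/s-over/hard-limit/"
          "h-over/low-pri = 12/57/900/0/1000/0/0")


def parse_recursion_quota(line: str):
    """Return the extracted fields as integers, or None if the line does not match."""
    match = RECURSION_QUOTA.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    fields = parse_recursion_quota(SAMPLE)
    print(fields)
    # Assumption: hard_limit is the ceiling the "used" count is compared against.
    print(f"quota used: {100.0 * fields['used'] / fields['hard_limit']:.1f}%")
```

The same approach works for the other extractions: paste the pattern, feed it a few real messages from your own syslog, and confirm every named group is populated.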

 

 

Lookup Table:

alexa2000global.csv

syntax:

fqdn

infoblox.com

infoblox.fr

....
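
As an illustration of what the lookup table is used for, here is a minimal Python sketch of the "not in the list" filter. It assumes a CSV with a single `fqdn` header as described above and an exact, case-insensitive match on the name; the dashboard's own Splunk search may match differently, and the function names and sample queries are hypothetical.

```python
import csv
import io
from collections import Counter


def load_lookup(csv_text: str) -> set:
    """Read a single-column CSV with an 'fqdn' header into a set of names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["fqdn"].strip().lower().rstrip(".")
            for row in reader if row.get("fqdn")}


def top_unlisted(queried, lookup, limit: int = 10):
    """Return the most-queried names that are absent from the lookup list."""
    counts = Counter(name.strip().lower().rstrip(".") for name in queried)
    unlisted = [(name, n) for name, n in counts.most_common() if name not in lookup]
    return unlisted[:limit]


if __name__ == "__main__":
    lookup = load_lookup("fqdn\ninfoblox.com\ninfoblox.fr\n")
    # Hypothetical query names for illustration only.
    queries = ["infoblox.com", "example.test", "example.test",
               "Infoblox.FR.", "other.test"]
    print(top_unlisted(queries, lookup))
```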

 

 

Update v5:

Enhancements to drive ActiveTrust deployment (memory & RPZ feed update status)

Check out our new Tech docs website at http://docs.infobox.com for latest documentation on Infoblox products.

Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

Wow, this is a very powerful dashboard for both short-term capacity predictions and real-time troubleshooting.
The ability to factor the CHR into predicting a specific member's QPS capacity is HUGE.  This is something that we have struggled with in our environment for years.

The ability to see all these different indicators for a single member quickly will be a very useful tool in troubleshooting problems. Some of these indicators are not very likely suspects but do burn you from time to time.  They also take some time digging through logs to find, so having them there at a glance to quickly rule out will be very nice.

Now we just have to wait for the syslog field extraction to make it into the primary code branch so we can use all the features.



Re: Recursive DNS Performance & Troubleshoot Dashboard

Adviser
Posts: 43
24512     16

small update with 815/825/1415/1415/2215/2225


Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Expert
Posts: 173
24512     16

I finally got a chance to install this report and see it with some real data.  It looks like it is going to be just as useful as I thought.  I was actually cutting and pasting in the syslog extractions when we had a switch failure in a data center affecting DNS queries in a HRD.  I wasn't quite done with it when I had to switch over and troubleshoot the problem, so I wasn't able to use it, but I was able to go back and look at the data afterwards. I could see the event and see the members moving closer to some of the thresholds as queries timed out and failed over to different data centers.



A couple questions:

 

1.  On the panel "DNS Engine maximum load %", what is the meaning / use of the "(HITS=0 OR MISSES=0)" statement when calculating the cache hit ratio as part of the join?  None of my members would generate any data on that search until I removed that portion of the calculation.  With that gone, it now seems to correctly pick out the MAX value on the previous panel, "DNS engine usage % (Max QPS for CHR vs QPS), CPU & CHR".  I can't get my head around what the goal of that command was.

2.  Any thoughts on further weighting the DNS Engine Load for members that have RPZs loaded and/or DNSSEC validation turned on?

DNSSEC is partially taken into account in the CHR and CPU calculation but probably not completely accounted for vs not having to do that work.

RPZ seems harder:  How many lines of RPZ do you have loaded?  What percent of the queries hit your first passthrough RPZ and are never checked against the other 500,000 lines?  How does the reporting tool tell which members have which RPZs loaded?

3.  I've mentioned this in another thread, but being able to dynamically pull which members are which models seems to be a needed feature.

Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Adviser
Posts: 43
24512     16

Hello DEvans,

 

Thanks a lot for your feedback. Please find my answers below:

 

1) You are right, it is a mistake from an early test, which also gave me some bad results in a recent troubleshooting session (Nicolas D., sorry for that). The same error existed in the other search, which is also corrected. Please find attached the updated version of the dashboard.

 

2) This one is not obvious, but I have found a way: at DNS engine restart, syslog provides relevant information that allows you

- to know if RPZ is enabled:

configured RPZ white list for view 4 (maxsize=24542670, ttl=60)

- to know if DNSSEC validation is enabled:

set up managed keys zone for view _default, file 'managed-keys.bind'

3) Syslog provides relevant model information at boot time for physical appliances:

debug [ 0.000000] DMI: Infoblox PT-1400/X8SIU, BIOS 1.0b 09/14/2012

but this does not work for virtual appliances:

debug [ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
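
As a rough sketch of how these restart and boot messages could be turned into flags, the Python below classifies the four sample lines quoted in this reply. The three regular expressions are illustrative guesses written against those samples only, not extractions shipped with the dashboard, and real messages may vary between NIOS versions.

```python
import re

# Illustrative patterns written against the sample lines in this reply.
RPZ_RE = re.compile(r"configured RPZ white list for view (?P<value>\S+)")
DNSSEC_RE = re.compile(r"set up managed keys zone for view (?P<value>[^,\s]+)")
MODEL_RE = re.compile(r"DMI: Infoblox (?P<value>[^/]+)/")

PATTERNS = [
    ("rpz", RPZ_RE),
    ("dnssec_validation", DNSSEC_RE),
    ("model", MODEL_RE),
]


def classify(line: str):
    """Return (indicator, value) for the first matching pattern, else None."""
    for indicator, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return indicator, match.group("value")
    return None


if __name__ == "__main__":
    samples = [
        "configured RPZ white list for view 4 (maxsize=24542670, ttl=60)",
        "set up managed keys zone for view _default, file 'managed-keys.bind'",
        "debug [ 0.000000] DMI: Infoblox PT-1400/X8SIU, BIOS 1.0b 09/14/2012",
        "debug [ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX "
        "Desktop Reference Platform, BIOS 6.00 04/14/2014",
    ]
    for line in samples:
        print(classify(line))
```

The virtual appliance line returns nothing, which matches the limitation noted above: the DMI string identifies the hypervisor, not the Infoblox model.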

 

Thanks again and do not hesitate to implement modifications and share with us.

 

Nicolas

 


Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

@NJeanselme wrote:

 

3) Syslog provide relevant model information at boot time for physical appliances:

debug [ 0.000000] DMI: Infoblox PT-1400/X8SIU, BIOS 1.0b 09/14/2012

but does not work for virtual appliance:

debug [ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

 

 

 


 

ahhh...  I just turned off debug syslog events yesterday across the grid to stay under 20 gig a day......  

Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

 

Here are a couple code updates.

 

This is my model auto selection.   It pulls from my custom syslog injection of the system model (and system temp for no real reason) from SNMP pulls.  So not overly useful for anyone not doing that but a good POC for future alerting on some of these indicators.  Just to keep it simple, I left the rest of the <change> <condition> code the same for the panel so the rest of the max QPS still populate as though a user made the selection manually. 

    <input type="dropdown" token="member_type" searchWhenChanged="true">
      <label>Select Member Type</label>
      <search>
        <query>index=ib_syslog FORREPORTERsys orig_host="$grid_member_var$" | stats values(Model) as Model</query>
      </search>
      <fieldForLabel>Model</fieldForLabel>
      <fieldForValue>Model</fieldForValue>
      <selectFirstChoice>true</selectFirstChoice>


I updated the recursive cache to be in MB just so it was a bit more readable.

<title>DNS recursive cache size</title>
        <search>
          <query>index=ib_syslog "Recursion cache view" host="$grid_member_var$" | timechart avg(eval(size/1000000)) as size_Mb</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>

 

And with my second SNMP pull, custom syslog injection, I switched over so the recursive DNS Performance Dashboard displays the actual recursive latency instead of the authoritative answer latency.

<title>Recursive Latency in ms</title>
        <search>
          <query>index=ib_syslog recursive_lat="*" orig_host=$grid_member_var$ | bin span=1m _time | timechart avg(eval(recursive_lat/1000)) as mSec by orig_host</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>




 

Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

Any thoughts on taking CPU into account when calculating the DNS engine utilization, instead of just the CHR?  That would seem to help with the RPZ, DNSSEC, traffic director, DHCP, extra reporting functions... that might be running on this member, without the complications of looking at each of those independently.  I have 800 series boxes that run 40% CPU at 20 QPS and 90% CHR because of the other work they are doing.  I'm pretty sure they are going to start dropping packets well before the straight CHR prediction says they will.

I played around with a couple of Splunk formulas around CPU usage I found on the Internet and really wasn't happy with the results.  They seemed to help with weighting "now" but appeared not to work correctly with the "forecast" function (at least with the way I was hacking them into the above searches).

Right now, I've just halved all your provided initial QPS max values per model.  This is partially to help take the other services into account, but also because most of our DNS servers are in some kind of failover setup where I need to know if they can handle their partner's load as well.  Not really an elegant solution, but it gets me closer to what I really want to know.

I started down a path for "forecast" that would look at when the "new" QPS MAX would be hit OR when the CPU would hit 90% separately (instead of a combined DNS engine usage value) but have not gotten a good search / alert out of that in the time I've had to spend.  That is probably an overall easier solution.  But a single "member utilization" prediction would be very cool to have.

Re: Recursive DNS Performance & Troubleshoot Dashboard

RossG
Techie
Posts: 12
24512     16

When I try to add in the source for this dashboard, it returns the following error:

 

Encountered the following error while trying to update: In handler 'views': Error parsing XML on line 303: xmlParseEntityRef: no name

 

What might I have done wrong in taking the source and putting it in to the dashboard source?

Re: Recursive DNS Performance & Troubleshoot Dashboard

RossG
Techie
Posts: 12
24512     16

Never mind, I found the error.  For some reason, it will not allow the use of the ampersand.  When I removed it, it threw an error on line 2, which also had an ampersand.  When I removed that one, everything worked as expected.
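
For anyone hitting the same "xmlParseEntityRef: no name" error: in XML a bare ampersand starts an entity reference, so it has to be written as `&amp;` rather than removed. The short Python sketch below reproduces the failure and the fix with the standard library parser; the `<form>`/`<label>` fragment is a cut-down stand-in for the dashboard source, not the full file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


title = "Recursive DNS Performance & Troubleshoot"

# A raw ampersand starts an entity reference, so the parser rejects it.
broken = f"<form><label>{title}</label></form>"

# escape() rewrites & as &amp; (and < > as &lt; &gt;), which parses cleanly.
fixed = f"<form><label>{escape(title)}</label></form>"

if __name__ == "__main__":
    print(is_well_formed(broken))
    print(is_well_formed(fixed))
    print(ET.fromstring(fixed).find("label").text)
```

Escaping keeps the title intact when the dashboard is rendered, whereas deleting the ampersand changes the displayed text.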


@RossG wrote:

When I try to add in the source for this dashboard, it returns the following error:

 

Encountered the following error while trying to update: In handler 'views': Error parsing XML on line 303: xmlParseEntityRef: no name

 

What might I have done wrong in taking the source and putting it in to the dashboard source?


 

Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Adviser
Posts: 118
24512     16

Here is a version that uses the new NIOS 8.0 functionality to automatically provide the model number and max QPS when selecting the grid member.

Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Expert
Posts: 173
24512     16

The "Top 10 FQDN quota reached" panel does not work as written.  You need to add another field extraction and change the search.

Also, a note if you're grepping syslog for these messages:  the line for clients per query that gives the current limits and maxes has a small "c" in clients.  The Clients per query line that gives the domain that is actually dropped when the limit is hit has a capital "C" in Clients.  That may save you a few hours of frustrated looking for these messages.

Field extraction:

ib:syslog : EXTRACT-FQDN,QueryType Inline (?=[^C]*(?:Clients per query over limit|C.*Clients per query over limit))^(?:[^:\n]*:){6}\s+(?P<FQDN>\w+\.\w+)\.:\s+(?P<QueryType>\w+)

 

 

<panel>
      <table>
        <title>Top 10 FQDN quota reached</title>
        <search>
          <query>index=ib_syslog host="$grid_member_var$" "Clients per query over" | top FQDN limit=10</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="wrap">undefined</option>
        <option name="rowNumbers">undefined</option>
        <option name="drilldown">row</option>
      </table>
    </panel>

 

Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

@DEvans wrote:

.

Field extraction:

ib:syslog : EXTRACT-FQDN,QueryType Inline (?=[^C]*(?:Clients per query over limit|C.*Clients per query over limit))^(?:[^:\n]*:){6}\s+(?P<FQDN>\w+\.\w+)\.:\s+(?P<QueryType>\w+)

 

 

 

This field extraction is not 100%.  I have not taken the time to figure it out, but there must be some variation in the syslog message that it doesn't take into account.  I was troubleshooting a member that had gone over the client limit, but it wasn't pulling the actual domain that caused the problem for that member.  I could see the syslog message in Splunk with a search, but this extraction wasn't pulling it. It was working for some other members though, so I'm not sure.  I went back to just a space-delimited extraction and it's working for all members again, but pulling way more fields than I need.

Let me know if someone comes up with a better one.

 

Re: Recursive DNS Performance

Adviser
Posts: 43
24512     16
Hello @DEvans
Would you have the syslog message so I can fix the regex?
Nicolas

Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Adviser
Posts: 43
24512     16

Hello Evans,

 

This one matches 100% of the syslog messages you have shared with me:

 

lients\sper\squery\sover\slimit,\sdropping\squery:\s(?P<fqdn>[^:]+):\s+(?P<QueryType>[^\s]+)\s
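
To see how this pattern behaves, the sketch below runs it through Python's `re` module. Both sample messages are hypothetical, written only to exercise the pattern, since the real syslog lines were shared privately. Starting the pattern at "lients" sidesteps the small "c" versus capital "C" difference mentioned earlier in the thread, and the trailing `\s` means a message that ends right after the query type would not match.

```python
import re

# Pattern copied from the line above.
CLIENTS_OVER_LIMIT = re.compile(
    r"lients\sper\squery\sover\slimit,\sdropping\squery:\s"
    r"(?P<fqdn>[^:]+):\s+(?P<QueryType>[^\s]+)\s"
)

# Hypothetical messages, not real appliance output.
WITH_TRAILING_TEXT = ("Clients per query over limit, dropping query: "
                      "example.com.: A (hypothetical trailing text)")
ENDS_AT_TYPE = "Clients per query over limit, dropping query: example.com.: A"


def extract(line: str):
    """Return (fqdn, query type) if the pattern matches, else None."""
    match = CLIENTS_OVER_LIMIT.search(line)
    if match is None:
        return None
    return match.group("fqdn"), match.group("QueryType")


if __name__ == "__main__":
    print(extract(WITH_TRAILING_TEXT))
    print(extract(ENDS_AT_TYPE))
```

If your own messages do end at the query type, making the final `\s` optional would be one way to adapt the pattern.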

 

Regards

 

Nicolas


Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

That one appears to be working.

Re: Recursive DNS Performance & Troubleshoot Dashboard

Authority
Posts: 16
24512     16

Trying to add this dashboard but I'm getting no data (all boxes just report that they're waiting for input).  Do I have to create any panels or inputs before pasting the XML code?

 

Thanks in advance.

Re: Recursive DNS Performance & Troubleshoot Dashboard

Adviser
Posts: 43
24512     16

Hello stevediani,

 

You have to select Grid member, Member type & Time to get the report working.

 

Regards

 

Nicolas


Re: Recursive DNS Performance & Troubleshoot Dashboard

Authority
Posts: 16
24512     16

Ok.  Where?

 

Capture.PNG

Re: Recursive DNS Performance & Troubleshoot Dashboard

Adviser
Posts: 43
24512     16

On top left just below the dashboard title:

 

1.png


Re: Recursive DNS Performance & Troubleshoot Dashboard

Authority
Posts: 16
24512     16

Look at my screenshot from my previous post, I don't have those input selections. 

Re: Recursive DNS Performance & Troubleshoot Dashboard

Adviser
Posts: 43
24512     16

Could you attach this screenshot again please, I don't see an attached picture.


Re: Recursive DNS Performance & Troubleshoot Dashboard

Authority
Posts: 16
24512     16

Capture.PNG

 

Re: Recursive DNS Performance & Troubleshoot Dashboard

[ Edited ]
Adviser
Posts: 43
24512     16

You have to remove the existing XML code and paste in the one attached to the thread, so that it becomes a form with inputs.

The final code should look like:

 

<form>

<label>Recursive DNS Performance &amp; Troubleshoot v4 draft</label>

[...]

</form>


Re: Recursive DNS Performance & Troubleshoot Dashboard

Authority
Posts: 16
24512     16

That worked thanks; replaced <dashboard> with <form>. 

Re: Recursive DNS Performance & Troubleshoot Dashboard

Expert
Posts: 173
24512     16

Any updates to the "Max QPS for this CHR comparison" and "DNS Engine MAX load" algorithms now that the Fault Tolerant Cache is a feature?  Turning that on seems to greatly increase the reported CHR; however, the member is still going out in the background and doing the necessary iterative lookups for the recursive queries.

I agree with the reported CHR, but it throws off the above estimations of the true load on the members.  I'm guessing that, even though the client experience is greatly improved, this is an overall load increase on the actual member, having to process through two different caches. The above algorithms show a huge load decrease after turning on the fault tolerant cache instead of a slight increase.

