02-20-2017 12:57 PM - edited 12-18-2018 10:55 AM
I've been using the SNMP values below to watch for issues with response latency.
IB-PLATFORMONE-MIB::ibNetworkMonitorDNSNonAAT1AvgLatency.0
IB-PLATFORMONE-MIB::ibNetworkMonitorDNSAAT1AvgLatency.0
These pulls give me response latency in microseconds for authoritative and recursive lookups.
The recursive latency gives me a way to troubleshoot upstream network or DNS server issues. The authoritative latency shows whether a box is just getting hammered for other reasons.
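For anyone who wants to script the same check, here is a minimal sketch of turning those two SNMP values into milliseconds and flagging the slow ones. It assumes the text comes from a net-snmp style `snmpget` in the usual `name = TYPE: value` form, and it assumes the NonAA object is the recursive figure and the AA object is the authoritative one, which is my reading of the object names. The function names, sample values, and threshold are made up for illustration.

```python
import re

# Assumed shape of one snmpget output line: "<MIB>::<object>.0 = <TYPE>: <value>"
LINE_RE = re.compile(
    r"^IB-PLATFORMONE-MIB::(?P<name>\w+)\.0\s*=\s*\w+:\s*(?P<value>\d+)"
)

# Assumption: NonAA (non-authoritative answer) means recursive, AA means authoritative.
LABELS = {
    "ibNetworkMonitorDNSNonAAT1AvgLatency": "recursive",
    "ibNetworkMonitorDNSAAT1AvgLatency": "authoritative",
}


def parse_latency_ms(output):
    """Return {label: latency in milliseconds} from snmpget-style text."""
    result = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("name") in LABELS:
            # The counters are reported in microseconds, so divide by 1000.
            result[LABELS[match.group("name")]] = int(match.group("value")) / 1000.0
    return result


def over_threshold(latencies_ms, threshold_ms):
    """Return the labels whose latency exceeds threshold_ms, sorted by name."""
    return sorted(k for k, v in latencies_ms.items() if v > threshold_ms)


# Hypothetical sample values, not taken from a real member.
sample = (
    "IB-PLATFORMONE-MIB::ibNetworkMonitorDNSNonAAT1AvgLatency.0 = INTEGER: 45210\n"
    "IB-PLATFORMONE-MIB::ibNetworkMonitorDNSAAT1AvgLatency.0 = INTEGER: 380\n"
)
latencies = parse_latency_ms(sample)
print(latencies)
print(over_threshold(latencies, 10.0))
```

Keeping the parsing separate from the SNMP call means the same function can be fed from a cron job, a saved capture, or whatever poller is already in place.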
I started to look at the DNS Response Latency Trend report under sourcetype=ib:dnserf index=ib_dns. I can see some values there, but there are only occasional spikes above 1 millisecond. These seem to correlate more closely with DNS service restarts than with any issue on the server. The rest of the values are all zeros.
There is no documentation on how these numbers are generated, and they don't seem to correlate with the SNMP counters. They certainly don't match the recursive SNMP counters, which on some of our servers are consistently in the tens to hundreds of milliseconds.
Has anyone found a use for the built-in response latency report, or a way to get more useful response latency values within the reporting tool?
02-23-2017 06:35 AM
I’ve found a few more things about this report.
First, it includes DDNS update times in its calculations. In looking for members that had some non-zero values, I found that the members accepting GSS-TSIG updates showed latency in the 500 to 1000 ms range. Recursive and authoritative queries to these same members were always under 1 ms in testing. When I looked at the syslog messages for dynamic update latency for these members, those values closely matched and tracked with this report.
It seems odd that DDNS update times would be included in the calculations when there is a separate syslog status specifically for them. I had never looked at these members for latency before, so I was surprised when they showed up as some of the highest latency servers in our grid.
Secondly, this report is only for authoritative queries. The charts below are from the same member over the same time period. This member is primarily a recursive resolver with a very low cache hit ratio (30%), but it has some zones loaded and gets some queries for those zones. The SNMP pulls in the first chart show the difference in latency between the two kinds of queries. It's pretty clear from this data, and from other members where the mix of queries is heavily skewed in one direction or the other, that the built-in report is only looking at authoritative latency and DDNS update latency.
02-27-2017 10:13 AM
I asked our engineering team to research this, and they were able to confirm that the scripts that measure latency do indeed only look at authoritative zones and are not measuring DDNS updates. The additional latency that you are seeing is likely a result of the collection scripts being delayed by other tasks in the job queue (such as GSS-TSIG and DDNS activities). We agreed that this is suboptimal, and I will create a Request for Enhancement to initiate the process of improving the measurement methodology.
08-29-2018 06:32 AM
Any updates on this RFE? This is on the short list of things that require me to keep maintaining the ib-graph / SNMP scripts in Bloxstools.
3 weeks ago
You may want to take a look at this report: