07-09-2018 02:54 PM
We have been using the Fault Tolerant DNS cache option in NIOS 8.2 for some time. It has decreased recursive latency from the clients' point of view and has masked a few WAN and server outages from the clients as well.
This weekend, however, it worked a little too well. A routing change with a bad mask accidentally rendered a set of DNS servers unreachable within our intranet. The fault cache kicked in and masked the issue, as it should have. Normally this problem would have been caught quickly as the TTLs on cached answers expired and clients started to fail lookups. But because the routing problem ONLY affected these DNS servers, and because of this extra layer of cache, the issue didn't surface until a significant amount of time after the routing change, which made it take longer to put everything together and find the root cause. (I know there are a hundred other places this could have been caught, but I think you see where I'm going.)
Are there any counters currently, or in an RFE, for finding failing infrastructure / servers / domains while they are being answered from the fault tolerant cache, so they can be fixed before they fall out of that second cache? Has anyone looked at the usefulness of this data?
The DNS Top SERVFAIL Errors Received report had some data in it, but I struggled to get through the noise. Depending on the query rate against the broken infrastructure, it is very possible that some infrastructure will never make the thresholds for a canned "top" list, so I'm always hesitant to build any kind of alerting on those reports.
Although, as I read back through this post, the SERVFAIL data may be where the final solution lies. The infrastructure that had an issue this weekend does not have a high query rate, but I still did not see the domain as high on this list as I would have expected, so I dismissed digging into it very far. Maybe that is because the clients didn't get a SERVFAIL message back and therefore never went into their "retry storm"? If so, it will complicate weeding through this data to make it useful: a domain will show a limited number of SERVFAILs while it's still in the fault tolerant cache, and then could spike once it falls out of the cache and clients finally see the SERVFAIL message.
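To sidestep the top-N thresholds mentioned above, the per-domain SERVFAIL counting could be done directly from query/response logs, keeping every domain rather than just the noisiest ones. A minimal sketch, assuming a hypothetical log line format where the queried name and the rcode both appear in the line (the exact NIOS/BIND logging layout varies, so the regex is an assumption to adapt to your configuration):

```python
import re
from collections import Counter

# Hypothetical log-line pattern; the real field layout depends on your
# logging configuration, so adjust this regex before relying on it.
RESPONSE_RE = re.compile(r"query: (?P<name>\S+) IN \S+ .*rcode: (?P<rcode>\w+)")

def servfail_counts(lines):
    """Count SERVFAIL responses per registered domain, with no top-N cutoff."""
    counts = Counter()
    for line in lines:
        m = RESPONSE_RE.search(line)
        if m and m.group("rcode") == "SERVFAIL":
            labels = m.group("name").rstrip(".").split(".")
            counts[".".join(labels[-2:])] += 1  # aggregate by last two labels
    return counts

# Fabricated sample lines for illustration only.
sample = [
    "client 10.0.0.5 query: host1.corp.example. IN A rcode: SERVFAIL",
    "client 10.0.0.6 query: host2.corp.example. IN A rcode: NOERROR",
    "client 10.0.0.7 query: host3.corp.example. IN AAAA rcode: SERVFAIL",
]
print(servfail_counts(sample))  # -> Counter({'corp.example': 2})
```

Because every domain is retained, even low-query-rate infrastructure shows up, and a sudden jump in a domain's count after a quiet period would line up with the "falls out of the cache, then spikes" pattern described above.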
We have put an "odd" TTL on the answers from the Fault Tolerant Cache so they stand out to the DNS team. (i.e., if you're consistently getting 17-second TTLs for a record or domain, from lots of the infrastructure, we have a problem.) However, most of the IT staff only care that they got a DNS response, and they happily move forward with any valid answer. I thought about grepping the query logs for that unique TTL. Although I think I'd eventually get the data I'd like, I'm not sure I want to buy the server horsepower to do the analysis in anything near real time.
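A cheaper alternative to bulk log grepping might be to probe actively: periodically resolve a handful of representative names and flag any answer carrying the sentinel TTL. A sketch of the flagging logic, assuming the 17-second sentinel from the post and (name, observed TTL) pairs collected by whatever probe you already run (a periodic dig, a monitoring check, etc.):

```python
SENTINEL_TTL = 17  # the "odd" TTL stamped on fault-tolerant-cache answers

def flag_fault_cache_answers(answers, sentinel=SENTINEL_TTL):
    """Return the (name, ttl) pairs whose TTL matches the sentinel,
    i.e. answers likely being served from the fault tolerant cache."""
    return [(name, ttl) for name, ttl in answers if ttl == sentinel]

# Fabricated probe results for illustration: (queried name, TTL in the answer).
observed = [
    ("app.corp.example.", 17),
    ("db.corp.example.", 3600),
    ("mail.corp.example.", 17),
]
print(flag_fault_cache_answers(observed))
# -> [('app.corp.example.', 17), ('mail.corp.example.', 17)]
```

The same check could feed an alert threshold ("N sentinel answers in a row across M members"), which catches the problem during the fault-cache window without touching the full query logs.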
I know the intended function of this cache was to mask Internet DDoS attacks, which are most likely aimed at infrastructure out of your control, so the ability to "fix" the servers in question is likely very limited. But when it is used on an intranet, knowing that a large percentage of a specific type of query is coming from the fault tolerant cache would allow you to fix some issues within this new cache window, before they affect your clients.
07-18-2018 06:22 AM