
DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Hello - we are running NIOS 8.2.1, and have suddenly started encountering DNS service outages on our internal responders (configured with recursion on and forward-only to a separate tier of Infoblox servers). In working this with support, the suggestion is that once recursive cache usage reaches 7/8 of the max value, BIND begins housekeeping - evicting the 2 least-recently-used cached records for each new response it caches. We are also being told that it's normal, from the client perspective, for sporadic DNS queries to start failing with no response provided by the server.

 

Is that other people's experience when dealing with recursive cache utilization? I've never encountered this before in managing raw BIND servers - they seemed well behaved when the cache hit high levels, though performance does take a hit.

 

We are encountering cases where the cache memory starts climbing precipitously toward the max, and eventually DNS simply stops responding on the member (with no apparent reason for the climb - log analysis shows no increase in query rate, no increase in NXDOMAIN responses, and nothing else suspicious or indicating an attack - and in fact it happens even on very lightly loaded servers, e.g. ones handling < 10 QPS). But we are also encountering cases where the cache memory holds at the 7/8-of-max level without issue, for days and even weeks - yet suddenly, while still at that 7/8 level, the DNS service stops responding, for hours at a time if we don't restart it.

 

I am of course working this with support, but wanted to supplement that with any info from the community. I have trouble believing it's simply expected behavior that you're in trouble if you don't keep cache memory usage below the 7/8 threshold - but tell me I'm wrong if I am! If I'm not wrong, does anyone have suggestions on root causes for the outages?

 

If I AM wrong, it sounds like it's absolutely imperative to keep cache memory usage below that 7/8 level - if so, what strategies do people employ to ensure that?

 

Thank you!

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Adviser
Posts: 109

The DNS service for BIND runs within an allocated amount of memory, and determining how much memory to allocate can be a bit of a balancing act, since whatever the cache is given is taken away from other functions. NIOS has traditionally been a little on the conservative side with this allocation, and if you find that you are approaching the limit under normal operating conditions (at peak load), it may be advisable to increase the memory allocation limit.

 

In most cases, you can safely increase this limit to two or even three times the default for your appliance. Infoblox knowledge base article #3336 (CLI command to increase DNS cache size) provides instructions on how to adjust this limit, an operation that is done per DNS View.

 

It is important to note that when the DNS service first starts, only a minimum amount of memory is allocated, and it grows as needed. With that said, one area that can trip up NIOS administrators is multiple DNS Views, because that initial memory allocation is distributed evenly across the DNS Views, regardless of whether a view is using recursion or not. This is generally not an issue when only a couple of DNS Views are in use, but with a larger number of views you are 'stealing' memory from the view(s) where it is needed most. If you have DNS Views that are unused, make sure to disable or remove them to help free up memory.
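
As a rough back-of-the-envelope illustration (a sketch only, assuming - per the explanation above and the 322MB-to-536MB per-view change reported later in this thread - that the configured recursive cache memory is simply divided evenly by the number of views):

    # Back-of-the-envelope check of per-view recursive cache allocation.
    # Assumption (not from Infoblox documentation): the configured cache
    # memory is split evenly across DNS Views, which is consistent with the
    # per-view change seen in this thread when going from 5 views to 3.

    def per_view_cache_mb(total_mb: float, num_views: int) -> float:
        """Recursive cache each DNS View receives, in MB."""
        return total_mb / num_views

    total = 322 * 5   # ~1610MB total, inferred from 5 views x 322MB each

    for views in (5, 3, 1):
        print(f"{views} view(s): ~{per_view_cache_mb(total, views):.0f} MB per view")
    # 5 view(s): ~322 MB, 3 view(s): ~537 MB, 1 view(s): ~1610 MB

Unused views are, in effect, holding recursive cache memory hostage.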

 

When looking at and adjusting the memory allocated for the recursion cache, it is important to monitor overall appliance memory and CPU utilization. You will see high CPU utilization when you reach 7/8 of the recursive cache memory limit, and increasing the limit will help in that case. But as you raise the limit, there will be a point where you max out what your appliance can handle, and you want to make sure you do not reach that point. In most cases that would only be an issue if the appliance comes under attack or is undersized for the expected load, but it is still something to balance, as allowing too much can lead to resource exhaustion for other services as well.

 

Definitely follow the recommendation from Infoblox Support, but doubling or even tripling the default limit should be safe and allow you to handle your expected recursive query traffic.

 

Regards,

Tony

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Thank you, Tony - that aligns with recommendations from support, and we have already started increasing recursive cache memory (going from 5 views to 3 raised the per-view cache from 322MB to 536MB - and we're doing more ASAP). But we are still experiencing outages.

 

Can I ask, have others experienced extended DNS service outages when that 7/8 threshold is reached? And have others experienced the cache memory continuing to ramp up (i.e., the BIND least-recently-used eviction routine, which clears 2 records from cache for every entry added, failing to keep the memory from being exhausted)? We're experiencing both, and I've never encountered either in running raw BIND servers. Are these known issues, and is anyone aware of a fix?

 

Support has said that it's a known issue with BIND that you're going to experience outages if cache memory gets to that 7/8 point ... and I'm loath to question that, but ... I've just never seen it anywhere else using BIND, and can't find any reference to such an issue in ISC's records, mailing lists, etc.

 

Support's hypothesis on the memory exhaustion is that each new record being cached is very large - more than twice as large as the existing entries in the cache - and that's why the LRU eviction routine isn't keeping up. But traffic captures don't bear that out: incoming queries are very typical, and responses are not unusually large. And that "2x the average" condition would need to hold continuously.
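
Just to make the arithmetic concrete, here is a toy byte-accounting model of "cache 1 new record, evict the 2 least-recently-used" (purely illustrative - this is not BIND or NIOS code, and the sizes are made up). It shows the total only growing while each new record outweighs the two oldest records combined, which can't continue once the older, smaller records have all been evicted:

    from collections import deque

    # Toy model of "cache 1 new record, evict the 2 least-recently-used".
    # Illustration only -- not how BIND/NIOS is actually implemented.

    def run(new_size, existing_size=100, existing_count=50_000, steps=60_000):
        cache = deque([existing_size] * existing_count)   # oldest entries on the left
        total = existing_size * existing_count
        samples = []
        for step in range(1, steps + 1):
            cache.append(new_size)                        # cache the new response
            total += new_size
            for _ in range(2):                            # evict 2 LRU entries
                if cache:
                    total -= cache.popleft()
            if step % 20_000 == 0:
                samples.append(total)
        return samples

    print("new records same size as existing:", run(new_size=100))
    print("new records 3x existing size:     ", run(new_size=300))
    # The 3x case grows only while the small legacy records remain to be
    # evicted; once they are gone, eviction outpaces insertion and it falls.

So for the cache to keep climbing for hours, that "2x the average" condition really would have to hold continuously - which our captures don't support.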

 

For the DNS service outages, there's no clear hypothesis.  It happens when the server exhausts cache memory - but also when the server is successfully holding memory usage at the 7/8 threshold.

 

Again, just trying to understand - and convince myself it's not a NIOS issue, or something else. If it truly is an issue in BIND, then we need to become maniacal about making sure we never, ever reach that 7/8-of-cache-memory threshold, and restart services continuously. I have trouble believing that could be the case. Can it?

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Expert
Posts: 12

If you look at this thread... we have seen this from time to time going back to 2013. I have a couple of different threads on this, but this is the first one that came up in a search.


https://community.infoblox.com/t5/DNS-DHCP-IPAM/DNS-Cache-size-increased/m-p/783/highlight/true#M313

We have not seen an issue that caused an outage in several years, but we watch for the cache hitting 7/8 and staying there, and put those devices on the list for a service restart. A restart of the DNS service puts the cache back to trimming normally for months. We still see a strong correlation between the change in the way cache trimming and cache cleanup works and running a clear cache from the GUI or CLI. Support never seemed to have a lot of interest in that correlation, as it didn't seem to be 100%.

If I remember correctly, not every cache clear caused a change in behavior, but nearly every change in behavior was preceded by a cache clear in the previous few hours. As we go months to years between clearing the cache on any one box, it seemed like something worth digging into.



Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Thank you, DEvans.  That thread is a treasure trove.  I'm beginning to engage with support on the basis of the info there. We're on 8.2.1, btw - and enough of the symptoms align that I strongly suspect the issue is still present in that baseline.

 

I'm going to hold this thread open, instead of marking your reply as "the answer", and will come back to update here - then mark it answered once we see how it plays out.

 

Thx again!

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Adviser
Posts: 109

One additional note: internal clients infected with malware - or, if your server(s) are accessible from outside your network and recursion is not restricted, external clients - can expose you to a common attack vector where they 'crawl' your server with recursive lookups for what amounts to garbage, designed to tie up the memory of a DNS server (another form of DDoS attack that can affect any type of DNS server). Attacks that can contribute to this include 'slow drip', 'phantom domain' and 'random subdomain-generated' attacks.

 

Here are a few helpful Infoblox KB articles relating to various attack methods:

 

KB #3142: Protecting recursive DNS servers from phantom domain attacks

KB #3143: Protecting authoritative servers from random subdomain-generated attacks

KB #3398: Protecting Networks from DDoS Attacks

 

A sudden spike in activity is something you want to watch for. If you are not an ISP and do not otherwise require that your server be open, restricting recursive queries to clients on trusted networks will help mitigate this.

 

The Infoblox Reporting solution has built-in reports that make this easy to identify, while DNS query logs can also provide the evidence necessary to identify what is contributing to recursion activity and help you spot what falls outside of what you would consider normal. And for environments where servers must be open to the outside world, the Infoblox Advanced DNS Protection (ADP) solution is specifically built for that environment and is what ISPs depend on for maintaining reliable service levels.

 

What will work best for you depends on what your requirements are and what you would consider normal. Infoblox Support should have tools available to help rule out whether you are facing an attack. If not already done, be sure to inquire about this and make sure it is ruled out as a potential cause.

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Expert
Posts: 188

This thread is really quite concerning. In all the years I have been working with Infoblox and other DDI solutions I have never come across this. I set up a farm of BIND resolvers at an ISP a few years ago and also never saw any problems like this. Have you tried posting in BIND-users to see if there is a general issue with the cache cleaning algorithms in the version of BIND you are running? Or do we think it's more likely just down to lack of memory allocated by NIOS?

Paul Roberts
PCN (UK) Ltd

All opinions expressed are my own and not representative of PCN Inc./PCN (UK) Ltd. E&OE

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Paul, re: engaging with the BIND team - no, we've adopted a "we're going to pay Infoblox a significant amount of money to do that for us" posture :-). In reality, though, the situation is getting messier and messier for us: we're having outages when the cache DB is nowhere near the trimming threshold, as well as when it's exactly at the threshold (and holding successfully), and a few times when it's growing past the threshold (as if the trimming is failing to keep up, or just not working at all). The outage duration is considerably longer when the cache DB is larger, though - so we're not sure if we have multiple root causes, or what. We're engaged with Tier 3 support at Infoblox - I'll come back and post if any clarity emerges.

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Authority
Posts: 24

Have you looked at the recursive client query limit setting on the members that are having issues? By default, BIND (and Infoblox) have a limit of 1000 concurrent recursive clients. Once the limit is hit, any new recursive requests are dropped. The limit can be configured in Data Management -> DNS -> Members by selecting the member and editing it; under Queries, on the Advanced tab, you will see the check box and dialog box to enter a different limit. Just because the box is not checked does not mean no limit is set - the limit is 1000 unless specified otherwise.

 

To help determine if this is part of the issue, look in the syslog for the member and search for "Recursion client quota"; that message is published periodically to give you an idea of how many recursive clients you are seeing and whether you are hitting the max. The message has 7 numbers separated by /; the important ones are the first, second, and sixth. The first is how many clients you have right now, the second is the maximum you have seen on that device, and the sixth is how many times the limit has been hit. If the sixth number is anything above 0, then your recursive client query limit is being hit, and that is probably what is causing the resolution issues you are seeing.
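
If it helps, here is a rough sketch for scanning an exported member syslog for those lines (the exact message layout is my assumption based on the description above - seven "/"-separated numbers, with fields 1, 2 and 6 being the interesting ones - so adjust the regex if your syslog differs):

    import re
    import sys

    # Scan an exported member syslog for "Recursion client quota" messages
    # and pull out: field 1 = current recursive clients, field 2 = high-water
    # mark, field 6 = number of times the quota has been hit. The exact line
    # format is an assumption; adjust PATTERN to match your syslog.

    PATTERN = re.compile(r"Recursion client quota.*?(\d+(?:/\d+){6})")

    def scan(path):
        worst_high = worst_drops = 0
        for line in open(path, errors="replace"):
            m = PATTERN.search(line)
            if not m:
                continue
            fields = [int(x) for x in m.group(1).split("/")]
            high_water, drops = fields[1], fields[5]
            worst_high = max(worst_high, high_water)
            worst_drops = max(worst_drops, drops)
            if drops > 0:
                print("quota hit:", line.rstrip())
        print("max concurrent recursive clients seen:", worst_high)
        print("times the quota was exceeded:", worst_drops)

    if __name__ == "__main__":
        scan(sys.argv[1])   # e.g. python quota_scan.py member-syslog.txt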

 

If you find that is the issue, work with support to adjust that setting and try to get some relief. Do keep in mind that the setting exists for a reason; it is a control for balancing resources on the device, which is particularly important if the device is serving multiple roles. If, however, the device having issues is a dedicated recursive DNS server, then you will likely want a higher limit than the default.

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Thx, Ross - yes, we looked at that with support, but we're never getting anywhere near 1,000 (maxing at about 400) ... and we're seeing outages of hours at a time - even in the middle of the night with no active users, and while concurrent captures show very nominal request load.

 

But very good information to have in our pockets!

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Authority
Posts: 24

What model device is in use?  What services (if any) is it running other than DNS?  Are you running multiple views?  Is query and/or response logging enabled?

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

Good questions ... they're IB-1410/IB-1420s with 16GB of memory; we've had outages on 3 of our 4 internal responders, one of which also runs DHCP, but the other two of which only run DNS. We do have multiple views - at the start we had 5, and each view had 322MB of recursive cache memory ... we removed 2, and the remaining 3 went to 536MB of cache each, but outages continued to happen. Of the 3, only 1 is in active use, so we upped the cache memory to 1GB for it and haven't had an outage since - but that was just a couple of days ago. And frankly, the outages are happening irrespective of the cache size at the time - we've seen outages in all 3 of these scenarios:

 

1) cache size is at 7/8 of max, and holding (i.e., trimming is working)

2) cache size is ballooning beyond the 7/8, and climbing toward the max

3) cache size is very small - 50MB, heck, we've had outages with only 14MB in the cache

 

In case 2, the queries simply could not explain the ballooning of the cache size - support hypothesized that each incoming query created a response bigger than the 2 records the trimming algorithm ejected, but that would have had to be happening for hours at a time, and captures didn't bear that out.  So there's SOME defect, somewhere, affecting the trimming algorithm (whether in NIOS or BIND, dunno).

 

But cases 1 and 3 make us think it may not have anything to do with the cache (at least not directly). And, frankly, case 3 is the most common case. There is a thought that possibly something in NIOS is corrupting the cache DB and triggering the behavior - but dumps of the cache DB at the time of outages appear fine.

 

All very perplexing.  We're currently looking at the external forwarding tier, to which our internal responders forward non-authoritative requests.  They seem fine, but maybe ...

 

Oh, and one other clue - though the data on this is iffy - it seems that when an outage is occurring, based on the syslog, the responder in question is still receiving queries, and thinks it's sending responses ... but no clients are receiving them; they're all timing out. That sounds like a network or network interface issue ... but it doesn't appear to be the network in general, because another responder on the exact same subnet has no issues, and other servers in the same enclosure have no network issues.
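
One thing we may try next time (a minimal client-side sketch using the dnspython library - the server addresses and test name are placeholders): leave a probe running against the suspect member and a healthy peer on the same subnet, so we can timestamp exactly when one stops answering while the other still does.

    import time
    import dns.resolver    # pip install dnspython
    import dns.exception

    # Query the suspect member and a healthy peer every few seconds and log
    # when one stops answering while the other still does. The server IPs
    # and the test name below are placeholders.

    SERVERS = {"suspect": "10.0.0.53", "healthy-peer": "10.0.0.54"}
    TEST_NAME = "www.example.com"   # any name the responders can resolve

    def probe(server_ip, timeout=2.0):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server_ip]
        r.lifetime = timeout
        start = time.time()
        try:
            r.resolve(TEST_NAME, "A")
            return "ok ({:.2f}s)".format(time.time() - start)
        except dns.exception.Timeout:
            return "TIMEOUT"
        except Exception as exc:        # SERVFAIL, REFUSED, etc.
            return "error: {}".format(exc)

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        results = "  ".join("{}={}".format(n, probe(ip)) for n, ip in SERVERS.items())
        print(stamp, results)
        time.sleep(5)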

 

<shrug>

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Authority
Posts: 24

Do you have a reporting server in your Grid?  If so, have a look at the top DNS clients and top DNS queries during the times outages have occurred.  I'm wondering if you are getting flooded with requests that aren't necessary, for example from someone running a security tool that is doing a lot of reverse lookups.

 

Have you implemented reverse zones as per RFC 6303?

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Posts: 10

We're working to put in a reporting VM, but don't have one at the moment. That'll definitely give us better visibility, and I'm looking forward to it. But captures during the events don't show a high amount of traffic or unusual lookups ... in fact, we've had outages with as little as 6 QPS hitting the server.

 

Hmm, we have some 1918 reverse zones in place, but not comprehensively, and we DO see a fair proportion of NXDOMAINs on reverse lookups ... the thought is, maybe we're overwhelming our external forwarders? That's something I'll look at, at least for all private IPs. Not seeing many IPv6 lookups. But - again, the outages happen at low traffic levels as well as high, and the forwarders seem fine. But good thought.
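
Something we plan to use to check whether a private-range PTR lookup is being absorbed locally (RFC 6303-style empty zone) or forwarded upstream is to look at the SOA in the authority section of the NXDOMAIN. A small dnspython sketch - the server address and query name are placeholders:

    import dns.message
    import dns.query      # pip install dnspython
    import dns.rcode
    import dns.rdatatype

    # Ask the internal responder for a PTR in RFC 1918 space and inspect the
    # authority section of the (likely NXDOMAIN) answer. If a local empty
    # zone is serving the space, the SOA owner will be a zone such as
    # 168.192.in-addr.arpa rather than something returned via the forwarders.

    SERVER = "10.0.0.53"                   # placeholder: internal responder
    QNAME = "1.0.168.192.in-addr.arpa."    # reverse name for 192.168.0.1

    query = dns.message.make_query(QNAME, "PTR")
    resp = dns.query.udp(query, SERVER, timeout=2)

    print("rcode:", dns.rcode.to_text(resp.rcode()))
    for rrset in resp.authority:
        print("authority:", rrset.name, dns.rdatatype.to_text(rrset.rdtype))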

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Adviser
Posts: 109

I think the reason why you have not experienced this issue with other BIND installations is that the default for max-cache-size is 0 (unlimited). This means that the cache will grow as needed, to the point of system resource exhaustion and crash.

 

Because NIOS appliances share their resources across services, it is a balancing act to determine how much memory an individual process should be allowed to consume. A common challenge that administrators run into today is that the default limit for some appliances has been historically low, a carryover from when appliances had much smaller amounts of system memory available (a long time back indeed).

 

The default limits have been increased in recent versions of NIOS, but probably not by as much as some need. This is why the command is available in the CLI, so those who need to adjust these limits can do so. With your appliance(s) having 16GB of memory, I would even wager that allowing 2GB for DNS is safe (with the caveat that you would want to monitor for high memory and CPU utilization, of course).

 

Now for your specific circumstances: the memory demand keeps increasing because of the number of queries your server is processing, and once you reach the limit, the process is almost self-defeating, because the system is forced to prune the cache to bring it down to the allowed size, and that pruning can be very CPU intensive. As entries disappear and the server gets busier and busier, clients keep retrying their queries until the administrator flushes the cache altogether, clients give up and move on to other servers, or the server is finally able to catch up.

 

To figure out why you are seeing this cache demand, some questions to ask would include:

 

  • What clients are these queries coming from?
  • Are there any queries that stand out?

Without reporting, you would need to enable query logging to capture the data necessary to analyze this properly. A Traffic Capture is nice but can only capture a small window and will not give you the overall picture. With the query log data, you would look for:

 

  • The number of queries overall (averaged to per second over different time frames depending on peak times)
  • The top clients sending queries
  • The top domains being queried

This will help narrow down anything suspicious. If nothing stands out, then it may very well be that your server(s) are in high demand and you will need to maintain a large DNS cache.
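
If it helps, a minimal sketch of that kind of summary (this assumes a BIND-style query log line containing "client <ip>#<port> ... query: <name> IN <type>"; the exact NIOS log layout may differ slightly, so treat the regex as a starting point):

    import re
    import sys
    from collections import Counter

    # Quick-and-dirty summary of a query log: total queries, top clients,
    # top queried domains and query types. Assumes a BIND-style line such as
    # "client 10.1.2.3#53001 ... query: host.example.com IN A"; adjust the
    # regex for your exact log format (IPv4-only as written).

    LINE = re.compile(r"client ([\d.]+)#\d+.*query: (\S+) IN (\S+)")

    clients, domains, rrtypes = Counter(), Counter(), Counter()

    for line in open(sys.argv[1], errors="replace"):
        m = LINE.search(line)
        if not m:
            continue
        ip, qname, rrtype = m.groups()
        clients[ip] += 1
        labels = qname.rstrip(".").split(".")
        domains[".".join(labels[-2:])] += 1   # group by the last two labels
        rrtypes[rrtype] += 1

    print("total queries:", sum(clients.values()))
    print("top clients:  ", clients.most_common(10))
    print("top domains:  ", domains.most_common(10))
    print("query types:  ", rrtypes.most_common())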

 

And for the comment about the server responding to queries while clients still time out: it is easy for it to look that way, because NIOS has a default timeout of 5 seconds while clients typically time out after about 2 seconds. By the time NIOS completes one transaction, it is quite possible for the client to have retried its query multiple times. By the time NIOS responds, the client has closed its socket and will drop (refuse) the response.

 

The only way to prove that is happening (aside from seeing the client refuse the response) is to compare the transaction IDs on the packets and see the mismatch. Depending on your NIOS version, the transaction IDs might be available in the query logs, but this is easy to see in a Traffic Capture. The symptoms alone are generally enough to tell you that is what's happening, though, and what you described in your previous post does match this behavior.
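
If you want to confirm it from a capture, a sketch along these lines (using the scapy library against an exported pcap - the file name is a placeholder, and this is only an illustration, not an Infoblox tool) will match queries to responses by client address/port and transaction ID and flag answers that arrive after the ~2-second client timeout:

    from scapy.all import rdpcap, DNS, UDP, IP   # pip install scapy

    # Match DNS queries to responses by (client IP, client port, transaction
    # ID) and flag responses arriving after the ~2s most stub resolvers wait,
    # i.e. answers the client has almost certainly already given up on.

    CLIENT_TIMEOUT = 2.0
    pending = {}    # (client_ip, client_port, txid) -> query timestamp

    for pkt in rdpcap("outage-window.pcap"):     # placeholder file name
        if not (pkt.haslayer(UDP) and pkt.haslayer(DNS)):
            continue
        d = pkt[DNS]
        if d.qr == 0:                            # query from client
            pending[(pkt[IP].src, pkt[UDP].sport, d.id)] = float(pkt.time)
        else:                                    # response from server
            key = (pkt[IP].dst, pkt[UDP].dport, d.id)
            sent = pending.pop(key, None)
            if sent is None:
                continue
            delay = float(pkt.time) - sent
            if delay > CLIENT_TIMEOUT:
                qname = d.qd.qname.decode() if d.qd else "?"
                print("late answer ({:.1f}s) to {} for {}".format(delay, key[0], qname))

    print(len(pending), "queries never got any response in this capture")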

 

For your next steps, you would want to make sure to:

 

  • Enable query logging (you will probably want to make sure you are sending your syslog to an external syslog server since this data can build up quite rapidly)
  • Review how much free memory is still available for the appliance. If overall memory utilization is still under 30% or so, feel free to increase the limit even further (though I suspect that 1GB is probably more than enough now).
  • Work with support to analyze the query logs. If you are handy with Linux and AWK, you can do this yourself but doing so does involve quite a bit of effort to format the data and parse through it.
  • If using an external syslog server that is capable, configure alerts to monitor for recursion cache size for the DNS View that you are most concerned about so that you can see if the size grows beyond a certain threshold (say 40% or 50% of the max cache size).

 

Regards,

Tony

Re: DNS service outage after hitting 7/8 of max cache - root cause, how to prevent, mitigate

Expert
Posts: 12

Looking into the query rates, query mix and responses is a good place to start any troubleshooting, but I remember pounding my head on that wall for months (years, really) when we had this issue.

The change in behavior - where the recursive cache size would no longer drop during times of low query rates - would generally happen after a cache flush.

Getting the cache size to go back to its normal pattern (a sine wave, where the overall cache size would drop as the query rate dropped) would require a DNS service restart. This change in overall cache size, and the start of the slow climb to 7/8 full, would happen on boxes with query rates of < 100 per second. Something would break, and at least the reported recursive cache size would no longer be reduced when records were removed from the cache. This alone would not cause an outage, however.
We also had devices that would get to 7/8 full simply because of the limited cache size on the -A appliances and the query load. I think the cache-clearing script we had, and the change of behavior after the cache was cleared, simply got more devices to 7/8 full more often, so we saw the issue more often.

 

The outages would occur for us when the cache got to 7/8 full (regardless of why) and sat there for days to weeks. The DNS server would be functioning normally during this time. There was no particular change in the query mix or query rate when the outage finally occurred. It behaved very much as if the procedure/code for the "aggressive cache clearing" events at 7/8 would eventually corrupt the actual cache in memory, and BIND would no longer be able to read the cache.

Two boxes behind the same equal-cost anycast address might both have been at 7/8 full for weeks; one would suddenly have the issue and the other would not.

 

Although the above troubleshooting is exactly where we started when we first had the issue on the old -A hardware, an exhaustive search through the query logs and query rates could never find any specific trigger in the data.
I went as far as to grab the previous few days' query logs from a box that had failed and send those exact queries to a test box, on repeat, for weeks, with the recursive cache size first set the same as the failed box and then set to a ridiculously low size. I could never reproduce the outages in the lab.
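
For anyone who wants to try the same kind of replay, a rough sketch of the approach (the query-log regex, the test server address and the dnspython dependency are all my assumptions - adjust for your own environment):

    import re
    import sys
    import dns.resolver   # pip install dnspython

    # Pull (name, type) pairs out of saved BIND-style query logs and resend
    # them, on repeat, at a lab member configured with the same recursive
    # cache limit as the failed box. Log format and server IP are placeholders.

    TEST_SERVER = "10.0.0.99"
    LINE = re.compile(r"query: (\S+) IN (\S+)")

    def load_queries(paths):
        queries = []
        for path in paths:
            for line in open(path, errors="replace"):
                m = LINE.search(line)
                if m:
                    queries.append(m.groups())
        return queries

    def replay_forever(queries):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [TEST_SERVER]
        r.lifetime = 2.0
        while True:                          # loop the whole log on repeat
            for qname, rrtype in queries:
                try:
                    r.resolve(qname, rrtype)
                except Exception:
                    pass                     # timeouts/NXDOMAIN are expected

    if __name__ == "__main__":
        replay_forever(load_queries(sys.argv[1:]))   # pass one or more log files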

Moving to the TE series with much larger available caches, being much more judicious with our use of manual cache clearing, and watching for the behavior where the cache size is stuck at 7/8 full has at least masked the issue for us.

