08-05-2020 11:03 PM
We are running a cluster of Infoblox for a large organisation.
It is authoritative for about 200 domains and is also the recursive server for the Intranet clients (~30k clients).
Yesterday, we experienced an issue - our ISPs went down for a few minutes and we lost Internet connectivity.
We ran into a very weird situation with the Infoblox.
During the Internet downtime, our Infoblox stopped responding to DNS requests; we were getting timeouts.
It stopped responding even for the zones for which it is authoritative, which makes no sense to me.
It was still "pingable", and absolutely no network issues inside the network were reported. Only Internet connectivity was missing.
We are currently investigating the issue, but we are struggling to find any link between the two events or any reason why it would act like that.
Does anyone have any idea about a parameter that might trigger this behavior?
Maybe it's a parameter or something weird that I'm missing?
08-06-2020 03:37 PM
One parameter to look at is the "Recursive client quota". If this quota was being exceeded continuously (which is likely: with the ISP down, any DNS resolution that depended on the ISP would have stalled completely), BIND could end up dropping packets, depending on whether the Soft Limit or the Hard Limit was exceeded. Another possibility is that the DNS server treated the situation as a DNS attack, provided various other conditions were met, and was dropping UDPv4 packets as a result. I have seen scenarios where authoritative queries were also dropped, since at the end of the day they are UDP packets with the Recursion Desired bit set.
However, a multitude of factors need to be considered here, and that can only be done by looking at the Support Bundle from the DNS server concerned. I would suggest opening a ticket with Infoblox Support to get to the bottom of why this occurred in your environment.
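One way to check the Recursion Desired point during a test outage is to send the same question for an authoritative zone twice, once with RD set and once with it cleared, and see which one gets an answer. The following is only a minimal sketch using the Python standard library; the server address and zone name in the usage note are placeholders, and the `build_query` and `probe` helpers are hypothetical names, not part of any Infoblox tooling.

```python
import socket
import struct

TXID = 0x1234  # arbitrary transaction ID used to match the reply


def build_query(name: str, qtype: int = 1, rd: bool = True) -> bytes:
    """Build a minimal DNS query in RFC 1035 wire format (QTYPE 1 = A)."""
    flags = 0x0100 if rd else 0x0000  # RD is the 0x0100 bit of the flags word
    header = struct.pack("!HHHHHH", TXID, flags, 1, 0, 0, 0)  # one question
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN


def probe(server: str, name: str, rd: bool, timeout: float = 2.0) -> bool:
    """Return True if `server` sends a matching UDP reply within `timeout`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, rd=rd), (server, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False
    return reply[:2] == struct.pack("!H", TXID)
```

For example, `probe("192.0.2.53", "intranet.example.com", rd=False)` and the same call with `rd=True` (both values are placeholders) could be run in a loop while the uplink is down. If only the RD=1 probes time out, that would be consistent with the behaviour described above, though it would not prove the cause.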
08-24-2020 07:18 AM
Thank you for your reply.
I dug into this, and indeed, during the Internet connectivity issue, the logs show that I reached the "soft-limit" for the number of recursive clients allowed.
I guess the issue comes from there, but I'm still not sure how to solve it.
If I understood correctly, the parameter "Limit number of recursive clients to", even if UNCHECKED, is enforced as "1000" by the appliance. So my first thought is to check it and set a higher limit.
Also, the Recursive Query timeout is set to "0" by default, which according to the documentation seems to mean "30 seconds". I can lower it to 10 seconds, which is the minimum allowed and in my opinion BY FAR enough.
Is this a good practice?
Are there any other parameters that I can play with to mitigate this issue?
As a reminder, we have a lot of clients using the appliances (up to 40k depending on the days).
When the Internet goes down (which is rare, but still happens), it is normal to see "a lot" of outstanding recursive queries (I have no idea how many that would represent).
It's sad that the appliances still deny authoritative requests and thus impact the intranet when only the Internet is down, but yeah, I guess it's normal behaviour.
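As a rough way to put a number on "a lot": if every query that needs recursion hangs until the timeout during an outage, the number of outstanding queries is approximately the arrival rate multiplied by the timeout (Little's law). The sketch below is back-of-the-envelope only. The rate of 200 recursive queries per second is an assumed figure for illustration, not a measurement from this environment; the 30 s and 10 s timeouts and the 1000-client limit are the values mentioned above.

```python
def outstanding_queries(recursive_qps: float, timeout_s: float) -> float:
    """Little's law: items in flight = arrival rate x time spent in the system."""
    return recursive_qps * timeout_s


def seconds_until_quota(recursive_qps: float, quota: int) -> float:
    """Time for a quota to fill if nothing completes (upstream unreachable)."""
    return quota / recursive_qps


ASSUMED_QPS = 200.0  # ASSUMPTION: cache-miss queries per second, not measured
QUOTA = 1000         # recursive client limit quoted in the post above

for timeout in (30, 10):
    pending = outstanding_queries(ASSUMED_QPS, timeout)
    print(f"timeout={timeout}s -> about {pending:.0f} outstanding queries")

print(f"quota of {QUOTA} fills in about {seconds_until_quota(ASSUMED_QPS, QUOTA):.0f}s")
```

Under that assumed rate, both timeouts leave more queries pending than the limit allows, so lowering the timeout would help but might not be sufficient on its own. Plugging in the real cache-miss rate from the appliance statistics would give a more useful estimate.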
08-25-2020 06:35 AM
Go into DNS member or Grid DNS properties and check the security tab to see if you have any mitigations enabled for "non-responsive servers" or "bogus query alerting and mitigation". I have been caught out by these as we did an upgrade and some of these settings were enabled by default, meaning the server was detecting what it thought were threats and was enabling holddown. It might not be the problem but no harm checking.
PCN (UK) Ltd
All opinions expressed are my own and not representative of PCN Inc./PCN (UK) Ltd. E&OE
09-01-2020 12:20 AM
Thank you for your reply.
Everything is unchecked in the Security tab and I couldn't see anything related to that in the logs.
Here is what I could see during the issue regarding recursive quotas:
named: Recursion cache view "_default": size = 197756156, hits = 128115632, misses = 46384569
named: Recursion client quota: used/max/soft-limit/s-over/hard-limit/h-over/low-pri = 25/906/900/32103/1000/0/21
named: Recursion view "_default" clients per query: limit/max/avg/soft-limit/limit-over/hard-limit/h-over/est-max-req = 100/100/1.07/100/0/100/145/217
On the second line, it looks like the soft limit was blown through (max = 906 against a soft-limit of 900, with s-over = 32103), so everything points towards this.
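The slash-separated counter lines quoted above are easier to compare across log samples once each name is paired with its value. Here is a small sketch that does this for lines of the form `names = values`; the `parse_counters` helper is a hypothetical name, and the regular expression is only based on the two lines shown in this thread, so other log formats may need adjusting. It does not handle the comma-separated cache line.

```python
import re

# Matches e.g.: named: <label>: a/b/c = 1/2/3
LINE_RE = re.compile(
    r"^named: (?P<label>.+?): (?P<names>[\w/-]+) = (?P<values>[\d./]+)$"
)


def parse_counters(line: str) -> dict:
    """Map each slash-separated counter name to its numeric value."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised counter line: {line!r}")
    names = match.group("names").split("/")
    values = match.group("values").split("/")
    if len(names) != len(values):
        raise ValueError("counter names and values do not line up")
    # Keep integers as int; only values with a decimal point become float.
    return {n: float(v) if "." in v else int(v) for n, v in zip(names, values)}


quota = parse_counters(
    "named: Recursion client quota: used/max/soft-limit/s-over/"
    "hard-limit/h-over/low-pri = 25/906/900/32103/1000/0/21"
)
print(quota["soft-limit"], quota["s-over"], quota["hard-limit"], quota["h-over"])
```

Running the same function on the "clients per query" line gives `h-over` = 145 there, while `h-over` is 0 on the client quota line, so the two limits are worth looking at separately.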
Can you confirm that the parameter Limit number of recursive clients to will change the limit for the "Recursion client quota" line?
What about the third line: how do I change the clients per query parameter in the Infoblox GUI?
I will be reproducing the issue in a few days and I would like to have as many things to test to try to resolve the issue.
Also, how can we check the current state of the recursive cache, or any useful stats showing what is happening there (either in CLI or GUI)?
Thank you in advance