Infoblox stops responding when losing Internet connectivity (second edition)

Hello everyone,

 

In 2020, I created a post about this issue (https://community.infoblox.com/t5/nios-dns-dhcp-ipam/infoblox-stop-responding-when-losing-internet-c...) which was promptly solved.

 

We ran into (as far as I know) the exact same issue, but this time with some limitations that we don't know how to deal with.

 

I contacted IB's support, but maybe I can get help here also.

Instead of rewriting everything, here is a copy/paste of what I sent them, describing the issue.

 

 

Description of the issue
These appliances serve as authoritative DNS servers (200+ domains), and are also used as internal recursive servers by up to 40k devices simultaneously, depending on the day.
Under normal circumstances, they are perfectly fine, with CPU barely hitting 50% on peak days, memory under 30% and DB utilization around 50%.
 
The issue appears when we lose our Internet connectivity and the DNS servers can no longer reach outside DNS servers to resolve queries for external domains.
 
In this case, we see the CPU skyrocket, the Cache Hit Ratio drop to 0 (or close to it), and the servers stop replying, even to queries for internal authoritative zones, which is a really big issue.
 
We already had this kind of issue in the past, and I posted about it on the IB forums (see the 2020 post linked above).
Back then, someone told me to check the "Limit number of recursive clients to" setting, which is 1000 by default. We raised this limit to 40000, which is the maximum allowed by Infoblox, and it worked fine.
 
However, we have this issue again, and according to our analysis, this is simply due to the growth in the number of users/devices and the fact that devices make more and more DNS queries.
Basically, the servers get overwhelmed by queries for Google, Apple, Microsoft, ... from devices trying to reach the outside world.
 
I believe this is very similar to a Phantom Domain Attack (described by IB here: https://docs.infoblox.com/space/nios85/35915059/Automated+Mitigation+of+Phantom+Domain+Attacks),
except that it's more or less legitimate traffic in this case, not an actual attack.
 
The queue gets filled with recursive requests waiting for a response, and the servers start answering everything else with SERVFAIL, including queries for authoritative/internal domains.
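
To illustrate why the limit is reached so quickly, here is a rough back-of-the-envelope sketch based on Little's law (requests in flight = arrival rate x waiting time). The query rate, cache miss ratio and timeout below are assumptions made up for the example, not measurements from our appliances; only the 40000 limit comes from our configuration.

```python
def outstanding_recursions(qps: float, miss_ratio: float, wait_seconds: float) -> float:
    """Little's law: average requests in flight = arrival rate * time each one waits."""
    return qps * miss_ratio * wait_seconds


def seconds_to_fill(limit: int, qps: float, miss_ratio: float) -> float:
    """Time to occupy `limit` slots if no pending recursion completes in the meantime."""
    return limit / (qps * miss_ratio)


RECURSIVE_CLIENT_LIMIT = 40_000   # the maximum we are allowed to configure
ASSUMED_QPS = 5_000.0             # hypothetical client query rate
ASSUMED_TIMEOUT = 10.0            # hypothetical seconds a pending recursion holds its slot

# Normal operation (assumed): 10% of queries miss the cache and resolve in 50 ms.
normal = outstanding_recursions(ASSUMED_QPS, 0.10, 0.05)

# Outage (assumed worst case): the cache hit ratio is 0 and every query waits for the timeout.
outage = outstanding_recursions(ASSUMED_QPS, 1.0, ASSUMED_TIMEOUT)

print(f"normal: ~{normal:.0f} slots in use")
print(f"outage: ~{outage:.0f} slots wanted, limit is {RECURSIVE_CLIENT_LIMIT}")
print(f"limit reached after ~{seconds_to_fill(RECURSIVE_CLIENT_LIMIT, ASSUMED_QPS, 1.0):.0f} s")
```

With numbers of this order, a larger limit would only delay saturation by a few seconds, which is why I am also asking about the per-server and per-zone limits below.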
 
  • There is plenty of memory available. Is there any way to go higher than 40000 for the "Limit number of recursive clients to" parameter?
  • Can we solve this with the security features "Enable holddown for non-responsive servers", "Limit recursive queries per server" or "Limit recursive queries per zone"?

These parameters are currently disabled. If they are what we need, what would be good values for them, knowing that we use Global Forwarders (toward our upstream DNS security providers)?

I think these parameters behave a bit differently in the Forwarders scenario, since every external query is sent to the same few upstream servers.

 

There is also a feature called "Enable Fault Tolerant Caching", which I believe Infoblox advises enabling. Could this help in this case?
 
So basically, what we are looking for is guidelines on how to protect internal, authoritative queries when the DNS servers lose their Internet connectivity.
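
To check whether any of these settings actually protects the internal zones, one option would be to probe an internal authoritative record while the upstream link is cut. Below is a minimal sketch using only the Python standard library. It builds a plain DNS query (RFC 1035 wire format) and reports the response code. The helper names are mine, and the server address and record name in the usage note are placeholders, not our real ones.

```python
import socket
import struct

# A few common response codes; anything else is reported by number.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query for `name` (type A by default) with the RD flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the DNS header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one UDP query to `server` and return the rcode name, or "TIMEOUT"."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_of(response)
```

For example, calling `probe("192.0.2.53", "intranet.example.com")` in a loop during a simulated outage (both values are placeholders) would show whether internal names keep returning NOERROR or start returning SERVFAIL or timing out.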