
DNS Cache size increased

JSingh
Techie
Posts: 11
8313     0

Hi All,

I need help understanding the following error and how to fix it:

Daemon : ERROR: [named///] nxbl_trim_cache: No progress reducing cache size for view "_default" [Max cache:111061273 cursize : 121071273 prevsize : 121071273]

I'm seeing the error above; any pointers would be appreciated.

Thanks....Jay Shankar

Re: DNS Cache size increased

I-Team Employee
Employee
Posts: 186
8314     0

There's a note on this in the support knowledgebase. The DNS cache is too small. You can try this from the console and see if it solves it (if not, I suggest filing a support ticket on this):

Use the steps below to increase the affected members' DNS cache sizes from the console:
 
Infoblox > set recursion_cache_size 200 viewname
 
To set the "viewname" 's cache size to to 200MB. 
 
To set it to default
 
Infoblox > set recursion_cache_size 0 viewname
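 
For context on the numbers in the original error: the reported max cache of 111061273 bytes is roughly 106 MB, and the cursize of 121071273 is roughly 115 MB, so the cache had already overshot its limit. Setting the view to 200 MB as above roughly doubles the headroom.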
 

 

So how should we determine

NCTM
Techie
Posts: 1
8314     0
So how should we determine and identify an optimal value for the cache?

I have a lengthy tech support

Expert
Posts: 181
8314     0

I have a lengthy tech support ticket open with these same questions. The basic answer is: if the cache size is set so that BIND has to remove things from the cache before their TTL expires, your cache size is too small. There are problems with the code that Infoblox/BIND uses to do these "aggressive cache sizing events", and we experienced DNS service failures during and well after these trim events.

Right now there are no alerts when the cache is trimmed, and there are no reports anywhere within Infoblox to watch the cache size. Our default cache size was different across the same models even though we never changed it before these issues started happening. So it was a significant amount of work to dig into which DNS servers had the size set too low and adjust upward as needed.

 

I have written code to watch the cache size syslog messages that each Infoblox server generates every 5 minutes and graph them using the ib-graph bloxtools infrastructure. Basically you have to watch the cache size and look for large drops that keep happening in the same cache size range (around 7/8ths of whatever your cache size is set to). Then you can start to adjust your cache size upwards to get rid of these trim events.
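
As a rough illustration of the approach (not the actual monitoring script), the Perl sketch below reads syslog lines on stdin, pulls a cache-size number out of each matching line, and flags a sharp drop that starts near 7/8ths of the configured maximum. The 200 MB maximum and the regex are placeholders; the exact periodic cache-size message varies by NIOS release.

    #!/usr/bin/perl
    # Rough sketch only: flag likely cache-trim events from syslog lines on stdin.
    use strict;
    use warnings;

    my $max_cache = 200 * 1024 * 1024;     # placeholder: your configured max, in bytes
    my $trim_zone = $max_cache * 7 / 8;    # trims were observed around 7/8ths of the max
    my $prev_size;

    while (my $line = <STDIN>) {
        # Placeholder pattern; adjust to the exact cache-size message your NIOS logs.
        next unless $line =~ /cache.*?size[^0-9]*(\d+)/i;
        my $size = $1;

        if (defined $prev_size
            && $prev_size >= $trim_zone        # we were sitting near the trim threshold...
            && $size < $prev_size * 0.75) {    # ...and the size just dropped sharply
            print "Possible cache trim event: $prev_size -> $size bytes\n";
        }
        $prev_size = $size;
    }

Repeated hits in the same size range are the signal to bump recursion_cache_size upward.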

6.11.x is supposed to change how the cache size is managed when it nears the hard limit and get rid of this issue.

@David Evans - Need help - Subsequent to your last reply.

DDoshi
Techie
Posts: 16
8314     0

Hello David,

First of all, thank you for posting this. We have exactly the same issue with one of our customers. They are running NIOS 6.12.7-281477. Their DNS service stopped working all of a sudden, and IB Support found the issue to be the cache-size allocation per view. As IB Support mentioned, the cache size is around 2 GB in my case (not sure if it varies from model to model), which is distributed evenly among the views I create.

In the past we had 8 views when we encountered this issue, which we reduced to 4. However, after a year we are facing the same danger again. Can you please tell me how you managed to keep an eye on cache utilization and how you resolved the issue permanently?

 

Thank you,

Darshan

We have all the syslogs from

Expert
Posts: 181
8314     0

We have all the syslogs from our grid coming to a central Linux syslog repository. We also pulled our bloxtools environment off of the grid and are running it on the same Linux server the syslogs go to. This allows us to use all the code from the ib-graph tool that used to be available from Infoblox to monitor our grid, and to leverage the same code and libraries to write our own custom monitoring.

I used a Perl script, logmon.pl, to continuously tail the syslogs and take action on specific events. It's available here: http://www.unixlore.net/

We have modified it to do a good deal of reporting and alerting that is missing from the grid and the grid reporting tool.

For the recursive cache events, I parse the lines and dump them into the same round-robin database (RRD) format that was used in the ib-graph tool.

This produces a nice graph, and the RRD file can be monitored and alerted on with various tools.
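
For anyone who wants to reproduce the RRD side, here is a minimal sketch assuming one cache-size sample per member every 5 minutes. The file path, data-source name, and retention are my own assumptions, not the exact layout ib-graph used.

    #!/usr/bin/perl
    # Minimal sketch: store one cache-size sample in a per-member RRD.
    use strict;
    use warnings;
    use RRDs;

    my ($member, $size) = @ARGV;                # e.g. dns1.example.com 121071273
    my $rrd = "/var/rrd/cache_$member.rrd";     # hypothetical location

    unless (-e $rrd) {
        RRDs::create($rrd,
            "--step", 300,                      # one sample every 5 minutes
            "DS:cachesize:GAUGE:600:0:U",       # cache size in bytes, no upper bound
            "RRA:AVERAGE:0.5:1:105120");        # keep about a year of 5-minute samples
        die "RRD create failed: " . RRDs::error() . "\n" if RRDs::error();
    }

    RRDs::update($rrd, "N:$size");
    die "RRD update failed: " . RRDs::error() . "\n" if RRDs::error();

rrdtool graph (or any tool that can read RRDs) can then draw the week-over-week cache growth and alert on it.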

Our initial problems were in the fall of 2013 through spring of 2014, so I don't have those graphs, but they would show rapid growth up to 7/8ths of the max size setting and eventually not drop off even over the weekends. From July through November above, we were clearing the cache every 12 hours to keep it from hitting the limits and crashing BIND. It appeared that every so often the cache trim code would corrupt the cache so that BIND could not read it at all. BIND would still kind of function, so it would not trip any of Infoblox's alerts, but every inbound query resulted in a recursive lookup, so the load on the box would spike and queries would be dropped.

Then in November of 2014 we were told it was fixed with 6.11.5, then 6.11.6, and then 6.12.?, each of which we went to through March of 2015. As you can see, as soon as we would turn off our cache-clearing script, the cache would slowly grow. The grid is set to only accept TTLs of 24 hours or less, so the cache should reach a maximum size in about that time and start to flatten its growth, but it would continue to grow steadily for months if we didn't clear the cache or restart DNS. What would happen is that instead of the cache trimming code running rarely, the "base" cache size was so large that it would start to run every few minutes.

The problem continued off and on for about 2 years before we got on 6.12.6. We have not seen the problem come back in the ~90 days we have been on 6.12.6.

Thanks for all the details

Expert
Posts: 259
8314     0

Thanks for providing all of the details on how you troubleshot this and for your contributions at unixlore!

Re: Thanks for all the details

Expert
Posts: 181
8314     0

Well I thought that 6.12.9 had the issue fixed but it does not.   Here is one of our graphs showing the recursive cache size.

 

recursive.JPG

 

Where the dip is, just before the climb starts on week 41, we used the Perl API to flush the cache on every DNS server in the grid because of a change in some long-TTL records we had cached throughout the enterprise.

# $session is an existing Infoblox::Session; $memberIP and $memberName
# come from the surrounding per-member loop (not shown here).
my $response = $session->clear_dns_cache(
    member => "$memberIP");
unless ($response) {
    print("Clear DNS cache failed on ", $memberName, ": ",
          $session->status_detail(), "\n");
}

It appears that flushing the cache gets it into a state where it does not trim correctly. This graph "shape" appears on every DNS server in the grid, regardless of whether it handles 1 query per second or 5,000.

We are going to 7.x code in a few weeks, so I'm not going to bother with a ticket. That, and it took hours of troubleshooting to convince Infoblox this was a problem 2 years ago, and I don't have the time to help debug their code again. But keep an eye on your memory usage. It appears that actually forcing a restart of the DNS service clears the memory utilization.

Re: Thanks for all the details

Expert
Posts: 181
8314     0

7.2.4 still has this issue, so I have opened a ticket. I can see this kind of cache growth on every member in the grid after I use the Perl API to clear the cache.

I'll keep this thread updated with any progress.

Re: Thanks for all the details

Expert
Posts: 181
8314     0

We never really got this solved. Support basically said that the cache cleanup is a very low-priority task and that my servers simply never get to the point where this service kicks in to remove stale records from the cache and free the memory.

 

These are TE-1420s that drop below 100 queries per second for hours at a time over the weekend.

They will also trim the cache correctly (no week-to-week, ever-climbing cache size) after a service restart, right up until you use the GUI or command line to clear the cache.

The final answer was that even when the higher-priority aggressive cache clearing kicks in at ~1.4 GB of cache, the new machines will handle that task without the issues we saw in the 6.x line and on the older -A hardware.

I feel that there is still some lurking bug in the interaction between the manual cache clear and the automated process that purges records from the cache as their TTLs expire, but since it has not caused an actual outage in more than a year, I've decided to drop it and just see what happens.


