11-03-2017 10:33 AM
We have a grid-wide RPZ license and have expanded our use of RPZs.
Today I removed a DHCP-only grid member that, according to my audit logs, NEVER had the DNS service started during its 5-year existence.
The removal went fine, but a few minutes later I noticed every grid member with an RPZ loaded running through a sequential restart.
It turns out that the default setting is to add all members to the ACL infoblox-deny-rpz. So any grid member removal, and I'm guessing any add, will update that ACL. When the ACL's IP list changes, every DNS server with an active RPZ zone will request a service restart. Whoever the next admin is that clicks restart as needed, maybe thinking they are just restarting a couple of DHCP services, will restart every DNS service as well.
I understand the need for this, but it is not well documented that something that should have zero service impact, like removing a grid member that has been turned off for a week and never was a DNS server, now causes a DNS service restart to ripple through the grid.
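To make the chain of events concrete, here is a minimal Python sketch of the behaviour as I observed it. It is only a model, not Infoblox code or the NIOS API: the `Member` class, the `members_pending_restart` function, and the IP addresses are all made up for illustration. The only rule it encodes is the one described above: any change to the ACL flags every member with an active RPZ zone for a DNS restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a grid member (not a NIOS object)."""
    name: str
    ip: str
    has_active_rpz: bool  # True if the member serves at least one RPZ zone


def members_pending_restart(grid, acl_before, acl_after):
    """Return the names of members that would be flagged for a DNS restart.

    Assumed rule (from the behaviour seen in this thread): if the ACL
    changed at all, every member with an active RPZ zone is flagged.
    """
    if acl_before == acl_after:
        return []
    return sorted(m.name for m in grid if m.has_active_rpz)


# Example grid: two DNS members serving RPZ zones and one DHCP-only member.
grid = [
    Member("dns1", "10.0.0.1", True),
    Member("dns2", "10.0.0.2", True),
    Member("dhcp1", "10.0.0.3", False),
]
acl_before = {m.ip for m in grid}

# Removing the DHCP-only member changes the ACL...
remaining = [m for m in grid if m.name != "dhcp1"]
acl_after = {m.ip for m in remaining}

# ...so both DNS members end up pending a restart.
pending = members_pending_restart(remaining, acl_before, acl_after)
print(pending)
```

The point of the sketch is that the removed member's own role never enters the decision: only the ACL change and the RPZ status of the remaining members matter.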
I can't think of a fix for this, other than some documentation around the grid RPZ license that mentions that now that the RPZ feature is available on the grid, the need to restart the DNS services in your grid is GREATLY increased.
This also SHOULD not be a service interruption; however, we try not to do grid-wide service restarts in the middle of the day, for the cache rebuild time if nothing else.
11-03-2017 12:01 PM - edited 11-03-2017 12:02 PM
I can certainly understand how that would catch anyone off guard, and I agree that it might be better to trigger some sort of warning when making a change like that.
One thing to point out is that you can find the following snippet in the "About Infoblox DNS Firewall" section in the NIOS Administrators Guide:
For RPZ, Infoblox uses the ACL infoblox-deny-rpz, which contains a list of addresses for bypassing RPZ actions. The infoblox-deny-rpz list excludes Grid members that do not have an RPZ license.
Of course I know how easy it is to spot something like that in the ever so tiny admin guide, and since your RPZ license is Grid wide, it would be easy to not realize that this has been done automatically for you. I would recommend opening a case with Infoblox Support and asking them to open a feature request on your behalf to improve the warning displayed when deleting a Grid member, so that it states that the DNS service on other Grid members can be affected when the deleted member has the RPZ license applied. This might save others from finding out about this behavior the hard way.
On a side note, while all versions of NIOS allow you to poll Grid members to see what services will be affected, newer versions of NIOS also give you the ability to view pending changes that require a service restart to take effect. I'm sure most admins won't think to check this every time, but if there are concerns like that, it gives you the ability to at least see what's going on. The other thing that some organizations do is set up scheduled service restarts so that they can plan when those will happen and then don't bother with doing restarts themselves. This way you can schedule them to take place during off-peak times. Additionally, you can set up restart groups to control when each server does its restart, so that servers providing the same services don't go down at the same time.
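The idea behind restart groups can be sketched in a few lines of Python. This is only an illustration of the ordering principle, not how NIOS implements restart groups: the `restart_waves` function, the pool names, and the member names are hypothetical, and it assumes each member belongs to exactly one pool of servers providing the same service.

```python
from itertools import zip_longest


def restart_waves(members_by_pool):
    """Split members into sequential restart waves.

    Each wave takes at most one member from every pool, so servers that
    provide the same service are never restarted in the same wave.
    Assumes a member appears in only one pool.
    """
    waves = []
    for wave in zip_longest(*members_by_pool.values()):
        # zip_longest pads shorter pools with None; drop the padding.
        waves.append([m for m in wave if m is not None])
    return waves


# Hypothetical pools of servers that back each other up.
pools = {
    "site-a-dns": ["a1", "a2"],
    "site-b-dns": ["b1", "b2", "b3"],
}
waves = restart_waves(pools)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

With the pools above, the waves come out as `a1` and `b1` first, then `a2` and `b2`, then `b3` alone, so each site always keeps at least one DNS server up while the others restart.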
11-03-2017 01:05 PM
I found that snippet from the admin guide and it really doesn't help. With a grid-wide license, every member has an RPZ license. So, what it should state is that the ACL contains every member that does not currently have an RPZ zone assigned to it, even if the DNS service is not running on it. That statement might prompt an admin to correctly work through the repercussions of adding or removing a grid member. The leap from "this is a DHCP-only member that never ran the DNS service" to "simply deleting it from the grid causes a grid-wide DNS service restart" is always going to be a difficult one.
The "Affected Members and Services" tab has been broken differently over the years as our grid has grown. It currently, eventually (20-30 seconds), pulls up the list of grid members and their service restart status. If you are fast enough to scroll through it you can see the status of each member. However, it auto refreshes, graying out and locking the menu for 10 seconds, every 10 seconds, resetting you back to the top of the list each time. Spending 2 to 3 minutes attempting to scroll through a large grid, using this constantly pausing and resetting status window, by each admin, for each restart that they do really isn't something that is likely to happen.
A quick test on the "view pending changes" will show that I removed a member, but not that the ACL was updated on every DNS server and they are all pending restarts.
This tab is also currently broken in our grid and shows random "pending changes" from admins over the last several days to weeks. So, you are never sure what is really pending until you look at the date they did the work and guess that the service has likely been restarted since then.
Yes, over time I have opened tickets for these issues. And they have gotten fixed for a while. But as these tabs have been broken for months to years at a time, they are rarely used.
I was glad to see that the reporting members' IPs were not listed in the ACL, so at least adding and removing reporting members should not cause a restart.
The long and short of it is that with a Grid-wide RPZ license installed, there are far more grid changes that can cause a DNS service restart. Some of these changes used to have zero service impact, and their new link to a service impact is not at all intuitive.
A popup warning that deleting this grid member will affect all the members running RPZ is likely the best idea.
11-06-2017 01:54 AM
Ouch, that is a good catch. I have worked at organisations that require service restarts to be done under change control. Whilst this provides some "career protection" should something go wrong, it doesn't help if the change control is wrong due to these unforeseen consequences. It only takes one person causing one outage to do reputational damage; I have seen it happen and have had to deal with the fallout (sometimes for years).
IMHO this is something Infoblox should try to remediate with better reporting of "pending changes".
PCN (UK) Ltd
All opinions expressed are my own and not representative of PCN Inc./PCN (UK) Ltd. E&OE