11-08-2018 05:00 AM
So one of my grid members suddenly stopped resolving any and all queries to SOME of the zones that it is authoritative for. This particular grid member only resolves external zones at the moment and some of the subzones it is authoritative for. All other grid members are working fine.
I'm battling to find the cause of the problem. I've checked the replication status and it says it's OK. If I look at the statistics for that member, it has processed zero queries for the zone in question.
I've restarted it, rebooted it and so on. I've made no changes recently to the grid; it just stopped working. Doing a dig returns SERVFAIL for anything in that zone/domain.
I see the following error in that member's log, but it's not even for the zone in question:
general: zone domaindnszones.ZONEDELETED.COM/IN: refresh: skipping zone transfer as master IPDELETED#53 (source IPDELETED#0) is unreachable (cached)
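To see which zones and masters are affected by this kind of message, the member's log can be scanned for it. The following is a minimal Python sketch, assuming the log lines follow the same wording as the one quoted above; the function name `unreachable_masters` is hypothetical and not part of any Infoblox or BIND tooling.

```python
import re

# Matches the "skipping zone transfer ... is unreachable" message quoted above.
# The pattern is derived from that single sample line, so other BIND versions
# may word the message differently.
UNREACHABLE = re.compile(
    r"zone (?P<zone>\S+)/(?P<rdclass>\w+): refresh: skipping zone transfer "
    r"as master (?P<master>\S+)#(?P<port>\d+) "
    r"\(source (?P<source>\S+)#(?P<source_port>\d+)\) is unreachable"
)


def unreachable_masters(lines):
    """Map each zone name to the set of masters reported as unreachable."""
    hits = {}
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            hits.setdefault(match.group("zone"), set()).add(match.group("master"))
    return hits
```

Feeding it the syslog lines from the member gives a quick list of every secondary zone whose refresh is being skipped, which helps confirm whether the zone in question is among them.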
I'm running 8.2.6 across the entire grid. There is no firewall between the grid members and grid master and all network connectivity has been verified.
Any advice appreciated.
11-08-2018 05:52 AM
Hmmm, sounds strange to me, but it's hard to say without seeing the complete configuration (named.conf). Better to open a support ticket with Infoblox so they can investigate it.
11-08-2018 03:46 PM - edited 11-08-2018 03:47 PM
What stands out to me here is "refresh: skipping zone transfer as master is unreachable". This tells me that this is a secondary zone on this server and the primary has stopped responding. The SERVFAIL response that you are seeing when querying this server is expected in this case.
If you can connect to the CLI for this server, you can attempt a zone transfer and see if that is successful or not. Example:
dig @<server_address> <zone.name> axfr
A SERVFAIL or REFUSED response will confirm that the primary is no longer responding properly and that you need to check that server. Verify whether it is seeing zone transfer requests from this server and, if so, what it is doing in response (it may no longer have the zone loaded, or its zone transfer settings/ACL may no longer allow your server). If it is not seeing the zone transfer requests at all, you have a firewall issue or a misconfiguration somewhere.
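If you want to run this check against several zones, the dig output can be classified with a few lines of Python. This is only a rough sketch: the marker strings it looks for (";; XFR size:", "; Transfer failed.", "status: ...") are what common dig versions print, but the exact wording can differ between releases, and the function name `classify_dig_output` is made up for this example.

```python
import re


def classify_dig_output(text: str) -> str:
    """Roughly classify the text printed by dig for an AXFR or a normal query."""
    # A completed transfer ends with a summary line such as
    # ";; XFR size: 12 records (messages 1, bytes 512)".
    if re.search(r"^;; XFR size: \d+ records", text, re.MULTILINE):
        return "transfer-ok"
    # dig prints this when the server refuses or aborts the transfer.
    if "; Transfer failed." in text:
        return "transfer-failed"
    # Normal queries carry the response code in the header line.
    status = re.search(r"status: ([A-Z]+)", text)
    if status:
        return status.group(1).lower()
    if "timed out" in text or "no servers could be reached" in text:
        return "unreachable"
    return "unknown"
```

"transfer-failed" points at the primary (zone not loaded, or the transfer ACL rejects you), while "unreachable" points at the network path between the two servers.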
Hope this helps.
11-08-2018 07:41 PM - edited 11-08-2018 07:43 PM
To follow up on what Tony has stated:
By default, dig uses the LAN1 interface. So when running dig as he suggested, you also need to make sure the request is sent from the ‘transfer-source’ address set in the named.conf file. To do that, append -b to the request:
dig @<primary_server_address> <zone.name> axfr -b <transfer-source-IP>
If your transfer source in named.conf is indeed the LAN1 IP address, you don’t have to use the ‘-b’ option. Comparing the pasted system log with the SERVFAIL responses, I am assuming that the secondary zone could have expired on those servers. If the above query is successful, you may use the following command to force a zone transfer (it does the traditional rndc underneath):
set dns transfer <zone_name> <DNS_view_name>
This command would pull a complete set of data from the 'master' listed in named.conf. Starting a packet capture while it runs would help you gather more information if the transfer is refused or fails for any reason.
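For anyone scripting the dig test from a machine with Python and dig installed, here is a minimal sketch of building the same command with and without the -b source address. The helper names `build_axfr_command` and `run_axfr` are hypothetical, and the addresses in the usage note are documentation placeholders, not values from this grid.

```python
import subprocess


def build_axfr_command(primary, zone, transfer_source=None):
    """Build the dig argument list for an AXFR test against the primary."""
    cmd = ["dig", f"@{primary}", zone, "axfr"]
    if transfer_source:
        # -b binds the query to a specific local address, so the request
        # leaves from the same IP that named uses as its transfer-source.
        cmd += ["-b", transfer_source]
    return cmd


def run_axfr(primary, zone, transfer_source=None, timeout=30):
    """Run dig and return its combined output (requires dig on the PATH)."""
    result = subprocess.run(
        build_axfr_command(primary, zone, transfer_source),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr
```

For example, `build_axfr_command("192.0.2.1", "example.com", "192.0.2.53")` gives the argument list for `dig @192.0.2.1 example.com axfr -b 192.0.2.53`. Running the test both with and without the source address shows whether the primary treats the two source IPs differently.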
Additional note: Your statement, “I see the following error in that members log but it's not even for the zone in question:” tells me that the pasted system log may or may not be related to the affected zones. If this affects an active production environment and is impacting dependent clients, please contact Infoblox technical support immediately to expedite resolution!
11-09-2018 12:52 AM
Thank you all for the replies.
Your input has been most valuable.
I tested from the member and it was not able to do any zone transfers at all (via dig or via set dns transfer). There are a few zones and subzones and none of them worked.
Doing a packet capture, I could not even see the zone transfer attempt (query type 252) from that member, and I captured on ALL ports.
Other members in my grid work perfectly and doing the same test and capture I can clearly see the zone transfer happening from the same primary server.
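For reference, this is roughly what "query type 252" means at the packet level. The Python sketch below builds and inspects raw DNS messages with only the standard library; the helper names are made up for illustration. It assumes the 2-byte length prefix that DNS over TCP adds has already been stripped, and it only looks at the first question in the message.

```python
import struct

AXFR = 252  # QTYPE value for a full zone transfer request
SOA = 6     # a secondary normally asks for the SOA before requesting a transfer


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query message (header plus one question)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def question_qtype(message):
    """Return the QTYPE of the first question in a raw DNS message."""
    qdcount = struct.unpack("!H", message[4:6])[0]
    if qdcount < 1:
        raise ValueError("message has no question section")
    offset = 12  # the fixed DNS header is 12 bytes
    while message[offset] != 0:  # walk the length-prefixed labels
        offset += 1 + message[offset]
    offset += 1  # skip the terminating zero-length label
    qtype, _qclass = struct.unpack("!HH", message[offset:offset + 4])
    return qtype


def is_axfr_request(message):
    """True if the message is a query (QR bit clear) asking for AXFR."""
    flags = struct.unpack("!H", message[2:4])[0]
    return (flags & 0x8000) == 0 and question_qtype(message) == AXFR
```

Because a refresh usually starts with an SOA query (type 6), it is worth filtering the capture for those as well as for type 252: if neither shows up, the member is not even starting the refresh.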
I've checked as many settings as I can find and all the ACEs etc. look correct. There is no config mismatch between this particular member and the others on the grid.
The search continues......
4 weeks ago
Without knowing more: zone transfer requests don't always happen right away, so it is possible that the capture did not run long enough. That, or the DNS service has an issue. I would recommend working with Infoblox Support to troubleshoot this further; they can help review the logs to see if those give any indication as to why it is not working as expected.
Just to bring this full circle (somewhat).
The problem has been resolved by removing the offending member from the grid and rejoining it to the grid. This obviously forces all sorts of background syncs to happen and that fixed the problem.
Sadly I can't say why it happened in the first place, but at least it's resolved, and rather simply at that.
Thanks again for all your inputs.