3 weeks ago
Hi Expert Team,
I'm seeing an issue at a customer site where one member of a DHCP failover association was disconnected from the grid (they use the MGMT port for grid communication). Once the member could no longer reach the Grid Master, the DHCP failover status changed to Failure. On the Grid Master, the disconnected member's status shows as Unknown. My question is: does Infoblox use grid communication for DHCP failover? As far as I know, DHCP failover uses TCP port 647, so I assume it uses LAN1 (where DHCP is running).
Please advise.
3 weeks ago
You are correct: Infoblox does not use grid communication for DHCP failover traffic between peers. DHCP failover communication happens over TCP port 647 (the default, which is configurable).
Peer-to-peer DHCP failover communication can source from LAN1, LAN2, or the VIP of an appliance, depending on whether the appliance is HA or standalone and whether the DHCP service is enabled and running on LAN1, LAN2, or both.
- If the appliance is standalone and has DHCP enabled only on LAN1, failover communication sources from LAN1.
- If the appliance is HA and has DHCP enabled only on LAN1, failover communication will source from the LAN1/VIP.
- If the appliance is HA and has DHCP enabled on both LAN1 and LAN2, failover communication will source from LAN1/VIP.
- If DHCP is enabled only on LAN2, failover communication sources from LAN2.
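The four cases above can be written down as a small decision function. This is only a sketch of the rules as listed in this reply: the `Appliance` class and `failover_source` function are hypothetical names, not part of any Infoblox API. The reply does not cover a standalone appliance with DHCP on both LAN1 and LAN2, so the sketch assumes it prefers LAN1 as the HA case does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appliance:
    """Hypothetical description of a DHCP member; not an Infoblox API object."""
    is_ha: bool
    dhcp_on_lan1: bool
    dhcp_on_lan2: bool


def failover_source(appliance: Appliance) -> str:
    """Return the expected failover source, following the cases listed above."""
    if appliance.dhcp_on_lan1:
        # LAN1 wins whenever DHCP runs on it; an HA pair sources from LAN1/VIP.
        # Assumption: a standalone appliance with DHCP on both ports also uses LAN1.
        return "LAN1/VIP" if appliance.is_ha else "LAN1"
    if appliance.dhcp_on_lan2:
        # DHCP enabled only on LAN2.
        return "LAN2"
    raise ValueError("DHCP is not enabled on LAN1 or LAN2")


print(failover_source(Appliance(is_ha=True, dhcp_on_lan1=True, dhcp_on_lan2=True)))
```

In the original poster's setup (DHCP on LAN1, grid communication on MGMT), this would give LAN1 for a standalone member, which matches the assumption in the question.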
Although DHCP failover communication and grid communication are separate channels, you have likely run into one of the problems below.
1. While grid communication over MGMT to the GM was interrupted, failover communication was also affected, and the affected node could not reach the other failover peer.
2. Someone changed the failover configuration or its associated DHCP ranges/networks and restarted the DHCP service, and the change did not propagate to the failover peer that was offline (from the grid communication perspective). After the DHCP restart on the working peer, the conf files no longer match, which prevents failover communication between the peers. [This is unlikely, because it usually produces a "Communication Interrupted" status rather than "Failure".]
3. You may want to get a serial console to the affected node and verify that it is fully functional and can reach the GM, the failover peer, and other required appliances.
2 weeks ago
I ran tcpdump on both peers, filtering on port 647, while grid communication over the MGMT port was interrupted. Both peers were still communicating and responding to each other, but the GUI still shows the disconnected peer as Failure. When I checked from the client side with a release and renew, it seems the MCLT lease time comes into play.
Do you have any thoughts on this?
2 weeks ago - last edited 2 weeks ago
To simulate your issue, I did the following.
- Created a failover association (FOA) and associated it with a network and range.
- The FOA used members that are not the Grid Master (GM) but regular DHCP grid members.
- Verified that the FOA status is "Running OK".
- Took the secondary peer of the FOA and disabled its grid communication to the Grid Master using iptables.
iptables -I INPUT -p udp --dport 1194 -j DROP
- Secondary peer is now "Offline" from the grid communication perspective and has no reach to the GM.
- FOA status changed from "Running OK" to "Failure".
- Performed a tcpdump and verified that the failover communication between the peers is intact.
- Clicking on the FOA "Failure" status from the GM UI shows the Primary peer in "Normal" state and Secondary peer in "Unknown" state.
- If the Primary peer is in "Normal" state, the Secondary peer must be in "Normal" state as well.
- The UI shows "Unknown" simply because the GM is unable to retrieve the status of this failover peer via grid communication.
Please confirm the following:
- Click on the FOA "Failure" status. Do you find one of the peers in "Normal" state?
- Log in to the CLI of both peers and issue "show log /I move/". You will find messages related to failover state changes; verify whether the last recorded message is a state change "to normal".
If both of the above are true, the issue is simply a result of the GM not being able to retrieve the status of one peer, and you should focus on fixing grid communication.
If your member is part of multiple failover associations, verify the FOA name in the above logs as well.
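If you save the log output to a file, the check for the last state change per FOA can be sketched as below. This assumes the messages follow the ISC dhcpd wording "failover peer <name>: I move from <old> to <new>"; the exact line layout on your NIOS version may differ. The function name, the FOA names, and the sample lines are hypothetical and for illustration only.

```python
import re
from typing import Optional

# Assumed ISC-dhcpd-style wording; adjust if your log lines look different.
STATE_CHANGE = re.compile(
    r"failover peer (?P<foa>.+?): I move from (?P<old>\S+) to (?P<new>\S+)"
)


def last_failover_state(log_text: str, foa_name: str) -> Optional[str]:
    """Return the most recent state this peer moved to for the given FOA."""
    last = None
    for line in log_text.splitlines():
        match = STATE_CHANGE.search(line)
        # Only count lines for the requested FOA, since a member may be
        # part of several failover associations.
        if match and match.group("foa") == foa_name:
            last = match.group("new")
    return last


# Illustrative sample only; these lines are not taken from a real appliance.
SAMPLE_LOG = """\
dhcpd[1234]: failover peer foa-lab: I move from normal to communications-interrupted
dhcpd[1234]: failover peer foa-other: I move from startup to normal
dhcpd[1234]: failover peer foa-lab: I move from communications-interrupted to normal
"""

print(last_failover_state(SAMPLE_LOG, "foa-lab"))
```

If the function returns "normal" on both peers for the FOA in question, that supports the conclusion that only the GM's view of the peer is stale.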
I tried to run iptables on the Grid Master, but the iptables command does not seem to be available in either maintenance mode or expert mode. Could you share how to access a root session?
My apologies for not clarifying.
I performed this in a lab environment strictly for issue reproduction; it is not recommended in production.
Infoblox does not offer root access to customers, for several security and stability reasons. However, if you are attempting to simulate the issue, you can block this traffic on a firewall (such as Meraki or Fortigate) in between.
While I do not know which NIOS version you are on, you could use the CLI command "set vpn_comm [ block | unblock ]" to temporarily block grid communication on a member on newer NIOS versions.
Please note that these are disruptive commands; use them only if you have a clear use case, and only in a maintenance window or a lab environment.
Thanks. Actually, I'm doing this in a lab environment to replicate the issue. Maybe I need to add a firewall to the lab.