DHCP failover pair keeps going into recover-wait state on NIOS 8.4.1

paulr · 08-15-2019

Hey all,

Just an observation: we have deployed a new grid running NIOS 8.4.1. Every time we add new scopes to the DHCP failover association, one of the peers goes into recover-wait state and stays there for around the MCLT (1 hour) before eventually returning to normal.

We're not entirely sure what's happening. The servers aren't heavily utilised yet; I think there are only a couple of thousand leases, so it shouldn't take this long to sync leases each time.

We will send support bundles to support, but it has happened a couple of times now and I'm curious whether this is normal. Has anyone else seen anything like this on NIOS 8.4.x?

Cheers,

Paul

Paul Roberts
PCN (UK) Ltd

All opinions expressed are my own and not representative of PCN Inc./PCN (UK) Ltd. E&OE

bkoshy · 08-20-2019

It should do that only during a Force Recovery to recover from conflicts or while resuming operations after a Partner Down state on its peer.

I agree it is not ideal and is ambiguous. Are you able to replicate it in a lab, and do you see any NIC-down events (syslog/debug log messages saying eth1/eth2/LAN1/VIP is down) or communication issues between the failover peers?

Kindly open a case with Infoblox support.

Best Regards,

Bibin Thomas

paulr · 08-21-2019

Thanks. I did pull the support bundles and took a look through the logs but couldn't find anything suspicious. We added more scopes last Friday and didn't have any problems, so maybe it was just a one-off. We are adding the final batch of scopes this weekend, so we will see how we get on then. If we have more problems, we'll pull another set of support bundles and log a ticket.

Paul Roberts
PCN (UK) Ltd


bkoshy · 01-02-2020

Hello again Paul,

I learned from Engineering that this behaviour comes from a major change that went into 8.4.x.

There is a chance for the FOA to go to recover-wait for roughly the MCLT [especially while moving ranges back and forth between a single member and the FOA]. The change was made to overcome an issue with a much larger impact [pool rebalancing failure for one, multiple, or all DHCP ranges].

We will request that this be documented well in the Administrator Guide, but unfortunately the only suggestion I have for customers making such changes is to plan them in a maintenance window of [time required for the change + MCLT].
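As a rough illustration of that sizing rule, here is a minimal Python sketch. The function names and the 30-minute change duration are made up for the example; the 1-hour MCLT is the value quoted earlier in this thread.

```python
from datetime import datetime, timedelta

# MCLT quoted in this thread (1 hour); adjust to your own FOA setting.
DEFAULT_MCLT = timedelta(hours=1)


def maintenance_window(change_duration, mclt=DEFAULT_MCLT):
    """Window length = time required for the change + one MCLT."""
    return change_duration + mclt


def window_end(start, change_duration, mclt=DEFAULT_MCLT):
    """Earliest time at which the window can be considered closed."""
    return start + maintenance_window(change_duration, mclt)


if __name__ == "__main__":
    # Hypothetical plan: a 30-minute change starting at 22:00.
    start = datetime(2020, 1, 4, 22, 0)
    print("Window length:", maintenance_window(timedelta(minutes=30)))
    print("Window closes:", window_end(start, timedelta(minutes=30)))
```

This only does the arithmetic; it does not talk to the grid or check the actual failover state.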

Best Regards,
Bibin Thomas

paulr · 04-08-2020

Thanks for the reply, Bibin. I am a little concerned about this change now. I have a project coming up where I have to take an existing single DHCP server and add a failover peer to all the existing ranges.

So it seems that the new peer may well go into recover-wait state for the MCLT. That isn't too bad, I suppose, but what about the existing node? I hope it will keep operating and won't also go into recover-wait state. I am trying to deploy the failover peer without causing an outage, but you have got me worried now...

Further along into the future, once the failover association has been established, the original peer is going to be shut down, so all the ranges need to be moved to another peer - again now I wonder what impact this will cause if one or both of the nodes go into recover-wait state?

I'll test it in the lab and see what happens but would be interested to hear your thoughts on this.

Cheers,

Paul

Paul Roberts
PCN (UK) Ltd


bkoshy · 05-12-2020

Hi Paul,

Sorry I missed your response.

paulr wrote: "Further along into the future, once the failover association has been established, the original peer is going to be shut down, so all the ranges need to be moved to another peer - again now I wonder what impact this will cause if one or both of the nodes go into recover-wait state?"

1. When we move from a standalone config to an FOA, we expect one of the following to happen:

a. One peer, most probably the Primary (this may depend on the order in which the restarts complete), goes RECOVER-->RECOVER DONE-->NORMAL [should complete within a minute]
or
b. One peer goes RECOVER-->RECOVER-WAIT-->NORMAL [can take up to the MCLT]
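The two paths above can be sketched as a small lookup, which may help when reading a sequence of states out of the logs. This is only an illustration: the state strings are spelled as in this post, the 60-second figure stands in for "within a minute", and the names `PATHS` and `expected_upper_bound` are hypothetical.

```python
MCLT_SECONDS = 3600  # default MCLT mentioned in this thread

# The two expected sequences when moving from standalone to an FOA.
PATHS = {
    "a": ["RECOVER", "RECOVER DONE", "NORMAL"],
    "b": ["RECOVER", "RECOVER-WAIT", "NORMAL"],
}


def expected_upper_bound(states, mclt=MCLT_SECONDS):
    """Rough upper bound, in seconds, for a peer to reach NORMAL.

    A path through RECOVER-WAIT can take up to the MCLT; otherwise
    the transition is expected to finish within about a minute.
    """
    return mclt if "RECOVER-WAIT" in states else 60


if __name__ == "__main__":
    for label, states in PATHS.items():
        print(label, "-->".join(states), expected_upper_bound(states), "s")
```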

Once all the concerned ranges have been moved to the FOA and both peers are in the Normal--Normal state, any further change such as replacing the secondary node should not put both peers of the FOA into RECOVER-WAIT. The new Secondary peer may move to RECOVER-WAIT/RECOVER, but the remaining peer should keep serving from its pool of free/backup IP addresses. That is limited to 50% of the available IPs, however, since pool balancing may not work during this time.
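To get a feel for that 50% limit, a back-of-the-envelope check along these lines may help. It is a sketch under assumptions: the 50% share is the figure given above, the address and client counts are invented, and the function names are hypothetical.

```python
def servable_during_recovery(free_addresses, share=0.5):
    """Addresses the remaining peer can hand out while its partner
    recovers, assuming it holds `share` of the free pool."""
    return int(free_addresses * share)


def has_headroom(free_addresses, expected_new_clients, share=0.5):
    """True if the expected new clients fit in the remaining peer's share."""
    return expected_new_clients <= servable_during_recovery(free_addresses, share)


if __name__ == "__main__":
    # Hypothetical range: 200 free addresses, 80 new clients expected
    # during the recovery period.
    print(servable_during_recovery(200))
    print(has_headroom(200, 80))
```

If the check fails for a busy range, that range is a candidate for extra care when scheduling the change.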

Best Regards,

Bibin Thomas

bkoshy · 07-30-2020

Hey Paul,

I recently had a meeting with our DHCP SME in Engineering. The only suggestion available at the moment is to reduce the MCLT to 300 seconds before performing the changes and to set it back to the original value of 3600 seconds (the default) afterwards.
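The lower-then-restore sequence can be sketched as a Python context manager, so the original value is put back even if the change fails partway. This is not an Infoblox API: `set_mclt` is a hypothetical callable you would supply (for example, a wrapper around whatever tooling you use to edit the failover association). When to restore relative to the peers returning to NORMAL is not covered here.

```python
from contextlib import contextmanager

DEFAULT_MCLT = 3600  # seconds, the default mentioned above
REDUCED_MCLT = 300   # seconds, the temporary value suggested above


@contextmanager
def reduced_mclt(set_mclt, reduced=REDUCED_MCLT, original=DEFAULT_MCLT):
    """Lower the MCLT for the duration of a change, then restore it.

    `set_mclt` is a caller-supplied function taking the new value in
    seconds; it is a placeholder, not a real NIOS call.
    """
    set_mclt(reduced)
    try:
        yield
    finally:
        # Runs even if the change inside the block raises.
        set_mclt(original)


if __name__ == "__main__":
    applied = []  # records what a real setter would have been asked to do
    with reduced_mclt(applied.append):
        applied.append("move ranges")  # stand-in for the actual change
    print(applied)
```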

Best Regards,

Bibin Thomas
