AWS Grid member offline after trying to join grid

jauling · 02-08-2022

Our grid currently lives on-premises. We are slowly migrating to AWS, and the first member we're deploying there is meant to take over for an NS server. Deploying a member in AWS should be simple enough, but no matter what I've tried, the AWS member shows a status of "Offline" in red.

In the syslogs, I see that the member connects to the grid master and some OpenVPN traffic occurs. From the grid master:

Feb 8 15:05:02 10.2.10.76 openvpn-master[4036]: 10.146.12.100:5002 [VPN Node] Peer Connection Initiated with [AF_INET]10.146.12.100:5002
Feb 8 15:05:02 10.2.10.76 openvpn-master[4036]: VPN Node/10.146.12.100:5002 MULTI_sva: pool returned IPv4=169.254.0.13, IPv6=(Not enabled)
Feb 8 15:05:05 10.2.10.76 openvpn-master[4036]: VPN Node/10.146.12.100:5002 send_push_reply(): safe_cap=940

Feb 8 15:05:06 10.2.10.76 clusterd[581]: Grid member at 10.146.12.100 has connected to the Grid master.

Feb 8 15:05:14 10.2.10.76 clusterd[581]: Grid member at 10.146.12.100 is no longer connected.

But then the member disconnects a few seconds later, which I find very strange. It happens every time I run set membership from the member.
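To see how short-lived each connection is, the clusterd lines can be paired up and timed. This is only a rough sketch: it assumes the syslog format shown above, and the function name and the `year` parameter are my own additions (syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Matches only the two clusterd messages quoted in this thread.
CLUSTERD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ clusterd\[\d+\]: "
    r"Grid member at (?P<ip>\S+) "
    r"(?P<event>has connected to the Grid master|is no longer connected)\."
)

def connection_durations(lines, year=2022):
    """Return (member_ip, seconds_connected) for each connect/disconnect pair."""
    connected_at = {}
    durations = []
    for line in lines:
        m = CLUSTERD_RE.match(line.strip())
        if not m:
            continue  # not a clusterd connect/disconnect line
        # The year is assumed; it only matters for computing the difference.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        ip = m["ip"]
        if m["event"].startswith("has connected"):
            connected_at[ip] = ts
        elif ip in connected_at:
            durations.append((ip, (ts - connected_at.pop(ip)).total_seconds()))
    return durations

SAMPLE = [
    "Feb 8 15:05:06 10.2.10.76 clusterd[581]: Grid member at 10.146.12.100 has connected to the Grid master.",
    "Feb 8 15:05:14 10.2.10.76 clusterd[581]: Grid member at 10.146.12.100 is no longer connected.",
]

print(connection_durations(SAMPLE))
```

For the two lines above, the member stays connected for 8 seconds before dropping.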

The member isn't showing much of anything in its syslog, but I do see that the NTP client times out on the servers defined by hostname, namely all the pool.ntp.org servers in the list. We also have a few defined by IP address, and the NTP service eventually goes green (in the syslog). My guess is that during the grid join process, the new member attempts to contact the NTP servers defined at the grid level, and only after a successful join does it inherit its effective settings, which by default would be to sync only with the grid master.
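To pick out which NTP entries depend on DNS resolution from the new member, a list of server strings can be split into IP literals and hostnames. A minimal sketch using only the Python standard library; the function name and the sample entries are made up for illustration and are not taken from our grid.

```python
import ipaddress

def split_ntp_servers(servers):
    """Split NTP server entries into (ip_literals, hostnames)."""
    by_ip, by_hostname = [], []
    for server in servers:
        try:
            ipaddress.ip_address(server)  # raises ValueError for hostnames
        except ValueError:
            by_hostname.append(server)
        else:
            by_ip.append(server)
    return by_ip, by_hostname

# Hypothetical list: documentation-range IPs plus public pool hostnames.
example = ["192.0.2.10", "0.pool.ntp.org", "192.0.2.11", "1.pool.ntp.org"]
ips, hostnames = split_ntp_servers(example)
print("by IP:", ips)
print("by hostname (need working DNS):", hostnames)
```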

Could the ntpdate timeout on the member be the root cause of the offline status? Or am I looking in the wrong direction? Any help would be appreciated! We haven't had any issues with our on-premises members joining the grid, which makes this harder to troubleshoot. AFAIK, we've defined all the standard NIOS port rules needed on the AWS VPC side.

jauling · 02-09-2022

I've removed all the NTP servers defined by hostname from our grid, so they're all defined by IP address now, but unfortunately that didn't fix the offline status of the AWS member.

jauling · 02-10-2022

So, after a lengthy Zoom session with Infoblox support, we've resolved the issue.

We have NAT groups defined on all our production grid members. I wasn't aware of this setting, mainly because the original grid setup was done before my time. According to the support engineer, all grid members need to be in the same NAT group in order to function properly. Once I added the AWS member to the same NAT group, the grid join succeeded, and my AWS member is finally showing a green status of Running.
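For anyone wanting to check this before deploying a new member, here is a rough sketch of the comparison. It assumes you have already collected each member's NAT group by hand (for example from the member properties in the GUI). The record layout, the `nat_group` key, and the member and group names are all hypothetical, and this does not call any Infoblox API.

```python
from collections import Counter

def members_outside_common_nat_group(members):
    """Return (most_common_group, names of members not in that group)."""
    if not members:
        return None, []
    groups = Counter(m.get("nat_group") for m in members)
    common_group, _ = groups.most_common(1)[0]
    outliers = [m["name"] for m in members if m.get("nat_group") != common_group]
    return common_group, outliers

# Hypothetical inventory; None means no NAT group is configured on the member.
inventory = [
    {"name": "gm.example.com", "nat_group": "prod-nat"},
    {"name": "ns1.example.com", "nat_group": "prod-nat"},
    {"name": "aws-ns.example.com", "nat_group": None},
]

group, outliers = members_outside_common_nat_group(inventory)
print(f"common NAT group: {group}; members to review: {outliers}")
```

In this made-up inventory, the AWS member is flagged because it has no NAT group while the other members share one, which is the situation we ended up in.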

We didn't find any definitive log entries that would have pointed to this, but I'm glad it's finally resolved.
