Grid problem - HA broken

fovada
Techie
Posts: 4

Hello,

 

My name is Fovad. One of my customer's grid members won't join the grid.

The customer is having trouble associating one of their Infoblox devices with the Grid.

All of a sudden they lost management of the device, high availability broke, and the device was disassociated from the Grid Master.

As we couldn't get console access, we rebooted the device, but we are still unable to associate it with the GM.

 

I asked the customer to run the following tests:

1. Ping the unit from the Grid Master. - Works
2. Check that all access lists are in place. - OK
3. Check the IP settings and HA status. - IPs confirmed, but HA is not forming and the member is not joining the grid.

We tried the steps in the following link, without any results:

https://community.infoblox.com/t5/DNS-DHCP-IPAM/Tip-What-to-do-if-grid-members-won-t-join-the-grid-u...

 

Here are the model and version of the device:

Version: 6.12.24-349737

Model: IB1410

 

I would appreciate it if someone could offer some advice here.

 

Best Regards

/Fovad Adami

[Attachment: HA-brojen-ICA.jpg]

Re: Grid problem - HA broken

RichA
Techie
Posts: 9

My two cents. Can the HA port IP addresses be pinged? Can the two nodes ping each other on the LAN1 ports as well as the HA port IPs? Just a thought, as I had a situation last week where one of the nodes had the HA cable plugged into the wrong port.

Re: Grid problem - HA broken

fovada
Techie
Posts: 4

Thank you, RichA.

I have asked the customer to run the pings and am waiting for the results.

Re: Grid problem - HA broken

fovada
Techie
Posts: 4

Here is the setup.

--------------------------

 

Node 1

Ha1                     10.255.255.11

Lan1                    10.255.255.13

MGM1               10.119.245.112

 

Node 2 (lost Grid connection)

Ha2                     10.255.255.12

Lan2                    10.255.255.14

MGM2               10.119.245.113

 

Vip     10.255.255.10

 

Grid Master

GM = 10.107.198.10 (vip)

 

Grid traffic to GM via 10.255.255.1

 

Ping from all interfaces to all interfaces on 10.255.255.0/24 = OK

SSH to MGM1 and MGM2 = OK

(A small script to re-run this ping matrix is sketched below the diagram.)

 

------------+-------+------------------------+-------+-------+
            |Lan1   |Ha1                     |Lan2   |Ha2
            |.13    |.11                     |.14    |.12
          +------------+                   +------------+
          |  Node 1    |                   |  Node 2    |
          |            |                   |            |
          +------------+                   +------------+
                |                                |
                |Mgm1 .112                       |Mgm2 .113
----------------+--------------------------------+
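For reference, the matrix above can be re-run with a small script. Here is a minimal Python sketch, assuming a Linux host on the 10.255.255.0/24 segment with the system ping command available; the addresses and labels are copied from the plan above. It only tests reachability from the host it runs on (so to reproduce a true all-to-all matrix you would run it from each node's segment), and note that a passive node's HA interface may legitimately not answer ping:

    #!/usr/bin/env python3
    # Ping check for the HA pair interfaces (sketch).
    # Assumes a Linux host on the 10.255.255.0/24 segment; the
    # addresses below are copied from the plan in this post.
    import subprocess

    TARGETS = {
        "Node1 HA1":  "10.255.255.11",
        "Node1 LAN1": "10.255.255.13",
        "Node2 HA2":  "10.255.255.12",
        "Node2 LAN2": "10.255.255.14",
        "VIP":        "10.255.255.10",
        "GM gateway": "10.255.255.1",
    }

    def ping(ip):
        """One ICMP echo, 2 s timeout (Linux ping flags)."""
        r = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        return r.returncode == 0

    for name, ip in TARGETS.items():
        print("%-11s %-15s %s" % (name, ip, "OK" if ping(ip) else "NO REPLY"))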

 

And here are some logs:

 

2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: OpenVPN 2.1_rc20 x86_64-redhat-linux [SSL] [LZO2] [EPOLL] [PKCS11] built on Oct 14 2016 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: WARNING: --keepalive option is missing from server config 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: NOTE: OpenVPN 2.1 requires '--script-security 2' or higher to call user-defined scripts or executables 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: TUN/TAP device tun1 opened 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: /sbin/ip link set dev tun1 up mtu 1500 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20831]: /sbin/ip addr add dev tun1 local 169.254.255.1 peer 169.254.255.2 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20835]: Close error on pid file /infoblox/var/vpn_pids/tun1.pid: No space left on device (errno=28) 
2017-08-05T02:50:50+02:00 10.119.245.113 openvpn-master[20835]: Exiting 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: OpenVPN 2.1_rc20 x86_64-redhat-linux [SSL] [LZO2] [EPOLL] [PKCS11] built on Oct 14 2016 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: WARNING: --keepalive option is missing from server config 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: NOTE: OpenVPN 2.1 requires '--script-security 2' or higher to call user-defined scripts or executables 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: TUN/TAP device tun1 opened 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: /sbin/ip link set dev tun1 up mtu 1500 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24199]: /sbin/ip addr add dev tun1 local 169.254.255.1 peer 169.254.255.2 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24203]: Close error on pid file /infoblox/var/vpn_pids/tun1.pid: No space left on device (errno=28) 
2017-08-05T02:55:58+02:00 10.119.245.113 openvpn-master[24203]: Exiting 
2017-08-05T02:57:41+02:00 10.119.245.113 openvpn-member[21514]: event_wait : Interrupted system call (code=4) 
2017-08-05T02:57:41+02:00 10.119.245.113 openvpn-member[21514]: SIGTERM received, sending exit notification to peer 
2017-08-05T02:57:42+02:00 10.119.245.113 openvpn-member[21514]: /sbin/ip addr del dev tun2 local 169.254.0.8 peer 169.254.0.1 

 

 

Regards

/Fovad

Re: Grid problem - HA broken

RichA
Techie
Posts: 9

I would look at this error first: "No space left on device (errno=28)".
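One thing worth knowing about errno 28 (ENOSPC): it is also raised when a filesystem runs out of inodes, even while df still shows free blocks, so "enough space" alone does not rule it out. A NIOS appliance does not give you a shell, so Infoblox Support would have to check this on the box itself, but here is a minimal Python sketch of the check for an ordinary Linux host (the /infoblox/var path comes from the pid-file line in the logs above; the script name is arbitrary):

    #!/usr/bin/env python3
    # Report block AND inode usage for one or more paths (sketch).
    # errno 28 (ENOSPC) fires when either blocks or inodes run out.
    import os
    import sys

    def usage(path):
        st = os.statvfs(path)
        blk = 100.0 * (1 - st.f_bavail / st.f_blocks) if st.f_blocks else 0.0
        ino = 100.0 * (1 - st.f_favail / st.f_files) if st.f_files else 0.0
        print("%s: blocks %.1f%% used, inodes %.1f%% used" % (path, blk, ino))

    for p in sys.argv[1:] or ["/"]:
        usage(p)

For example: python3 diskcheck.py /infoblox/var. If inode usage is near 100%, that would explain the pid-file write failure despite "enough" free space.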

Re: Grid problem - HA broken

fovada
Techie
Posts: 4

I checked the "No space left on device (errno=28)" error first, and according to the customer, they have enough space.

 

Regards

/Fovad

Re: Grid problem - HA broken

Adviser
Posts: 107

 

Hello Fovad,

 

Issues like these require active troubleshooting, so it would be best to open a case with Infoblox Support.

 

It is not normal for the HA interface 10.255.255.12 to respond to ping. By design, the passive node's HA interface does not respond to ping unless you have specifically enabled the setting "Enable ARP on HA Passive Node".

 

If you are still facing issues on the grid, I would suggest verifying the following.

 

1. Verify that UDP ports 1194 and 2114 are bidirectionally open between the Grid Master and the member (a throwaway test for this is sketched after this list).

2. Log in to the CLI of both 10.255.255.13 and 10.255.255.14 and issue 'show status' to verify whether they display 'Active' and 'Passive' correctly. If both of them show 'Active', VRRP communication may be broken.

3. Issue 'show interface' in the CLI of both of the above nodes and verify the displayed 'Status', 'Speed', and 'Duplex' for LAN1 and HA. If this member is configured to perform grid communication via the MGMT port, you would want to verify that as well.

4. If you find anything suspicious, verify the physical cable connections and switch port link status.

 

5. Verify the switch port configuration to ensure that it meets the prerequisites for an HA pair to function properly. Some of the generally required settings are explained in the Infoblox knowledgebase article below.
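On item 1: a firewall that silently drops UDP will not show up in a simple port scan, so the most reliable check is a throwaway listener/sender pair, run once per port and direction. Here is a minimal Python sketch, assuming you have ordinary Linux hosts (or lab boxes) on the GM and member segments to run it from; 1194 and 2114 are the ports from item 1, and the script name is arbitrary:

    #!/usr/bin/env python3
    # Throwaway UDP path test (sketch). Start "listen <port>" on one
    # side, then "send <host> <port>" from the other, for each of
    # udp/1194 and udp/2114 and in both directions.
    import socket
    import sys

    def listen(port):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("", port))
        print("listening on udp/%d ..." % port)
        data, peer = s.recvfrom(1024)
        print("got %r from %s" % (data, peer))
        s.sendto(b"pong", peer)   # reply to confirm the return path

    def send(host, port):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(3)
        s.sendto(b"ping", (host, port))
        try:
            data, _ = s.recvfrom(1024)
            print("udp/%d bidirectional: got %r" % (port, data))
        except socket.timeout:
            print("udp/%d: no reply (dropped, or no listener)" % port)

    if len(sys.argv) < 2:
        sys.exit("usage: udptest.py listen <port> | send <host> <port>")
    if sys.argv[1] == "listen":
        listen(int(sys.argv[2]))
    else:
        send(sys.argv[2], int(sys.argv[3]))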


NIOS uses the Virtual Router Redundancy Protocol (VRRP) for HA communications and HA-failover.
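If 'show status' in item 2 shows Active on both nodes, one way to see whether VRRP advertisements are actually traversing the segment is to watch for them from a Linux host on the HA VLAN (e.g. via a mirrored switch port). tcpdump 'ip proto 112' does the job; here is an equivalent minimal raw-socket sketch in Python (needs root). If you only ever see advertisements from one node, that matches an active/active split:

    #!/usr/bin/env python3
    # Watch for VRRP advertisements on the local segment (sketch).
    # Run as root on a Linux host attached to the HA VLAN.
    # Equivalent to: tcpdump 'ip proto 112'
    import socket
    import struct

    VRRP_PROTO = 112            # VRRP rides directly on IP, protocol 112
    VRRP_GROUP = "224.0.0.18"   # all-VRRP-routers multicast group

    s = socket.socket(socket.AF_INET, socket.SOCK_RAW, VRRP_PROTO)
    # Join the multicast group so the NIC accepts the frames.
    mreq = struct.pack("4s4s", socket.inet_aton(VRRP_GROUP),
                       socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    print("waiting for VRRP advertisements (Ctrl-C to stop) ...")
    while True:
        pkt, (src, _) = s.recvfrom(2048)
        ihl = (pkt[0] & 0x0F) * 4          # IP header length in bytes
        vrrp = pkt[ihl:]
        print("VRRPv%d advert from %s: VRID=%d priority=%d"
              % (vrrp[0] >> 4, src, vrrp[1], vrrp[2]))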

 

Best Regards,

Bibin Thomas

Re: Grid problem - HA broken

RichA
Techie
Posts: 9

Fovada,

  I just ran into a somewhat similar situation in our lab. Unfortunately, I have never seen the GM HA pair online in our lab since I have been here. We needed to upgrade our NIOS code, and since the passive node of an HA pair is the first to be upgraded during an upgrade, we were unable to distribute the new code with the pair broken.

  Here is what I did to fix our problem. First, I broke the HA pair by making the active node a standalone device. I then upgraded our Grid to the new NIOS version that we wanted to test. That was a success. I then pre-configured the HA pair by selecting HA pair in the Network section of the GM and configured the VRID and all the IP addresses of the HA pair. After the system rebooted, I let it sit overnight to let everything settle out. The next day I logged into the GUI of the passive node and upgraded it to the new code running in the Grid. When it rebooted after the upgrade, I attempted to rejoin the Grid as the passive node of the HA pair. It worked!

  The only thing I can think of is that there was something wrong with that node's previous NIOS install. Either way, this is what fixed my issue in our lab. I do not know if you got yours fixed or not, but I thought I would throw my experience out there.
