10-06-2020 04:28 PM
i believe i've figured out the fix to a problem that i know i've experienced numerous times over the last few years, and i assume it's possible that others have as well.
if you are setting up a new grid member, and you have a setup where the lan1 interface is unable to communicate with the grid master, your only realistic choice is to get on the cli, do a join, and tell it to use the mgmt interface (sorta like checking the box in the gui that says to use the mgmt port for vpn communication).
things start off fine. the member reaches out over the mgmt interface, contacts the grid master, and verifies the grid name and secret are legit, and that it is defined as a grid member. i'm not sure what all info it might pull, not much it appears...basic interface networking info, maybe?...then it restarts.
this is when the problem occurs. when it restarts, it tries to contact the grid master again. and it can't. maybe it'll time out after 10 minutes of trying. or maybe it'll get hung for eternity until you actually pull the power plug to hard cycle the device. if you're lucky, you had physical access. if not, you had to get data center ops to do it, or worse yet, someone to drive somewhere to do it. (assuming you didn't have lom access.) next, if you're lucky, it'll come back up and give you a prompt and you can try something else. if you're unlucky, it'll just go straight into trying to contact the grid master again. another eternity. repeat. (that seems like a lot of eternities, but it's really just one. ; ) if this happens, you'll have to power cycle the box, then break to the emergency prompt, then do a database reset. fun!
what i've discovered is that after that first restart, the box conveniently forgets it is supposed to be contacting the grid master over the mgmt port, and it tries using the default route, which goes out lan1. and lan1 can't route to the ip of the grid master. oops. if you defined a route for the member in the grid gui thinking that would get around this, that info won't have been transferred during that initial communication with the grid master. oops again.
so...here's my "fix". before i do the grid join, i set a /32 static route to the grid master on the new member and set it to use the mgmt gateway ip. if you do this, after the restart it will still have the /32 route you put in, it will route over the mgmt port, it will contact the gm successfully, then it will pull all of the config stuff from the grid (including any static routes you set in the grid for the member) and restart services again. everything will be fine now.
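for anyone wondering why the /32 helps: this is just a rough python sketch of ordinary longest-prefix route selection, not anything pulled from nios itself. the addresses, the interface labels, and the `pick_route` helper are all made up for illustration. it assumes the box does normal longest-prefix matching, which would explain why a host route to the grid master beats the lan1 default route.

```python
import ipaddress

def pick_route(routes, destination):
    """Return the route entry whose network matches with the longest prefix.

    Each entry is a (network, gateway, interface) tuple. This helper is
    hypothetical; it only models generic longest-prefix matching.
    """
    dest = ipaddress.ip_address(destination)
    matches = [route for route in routes if dest in route[0]]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)

# Made-up addresses (documentation ranges), not from any real grid.
GRID_MASTER = "192.0.2.10"
LAN1_GATEWAY = "198.51.100.1"
MGMT_GATEWAY = "203.0.113.1"

# After the first restart: only the default route, which points out lan1.
default_only = [
    (ipaddress.ip_network("0.0.0.0/0"), LAN1_GATEWAY, "lan1"),
]

# Same table plus the /32 host route to the grid master via the mgmt gateway.
with_host_route = default_only + [
    (ipaddress.ip_network(f"{GRID_MASTER}/32"), MGMT_GATEWAY, "mgmt"),
]

print(pick_route(default_only, GRID_MASTER)[2])     # lan1
print(pick_route(with_host_route, GRID_MASTER)[2])  # mgmt
```

with only the default route, traffic to the grid master heads out lan1 and goes nowhere. with the /32 in place, the more specific route wins and the traffic goes out the mgmt side, which matches what i see on the box.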
i know this is probably a bit of a fringe case, as most people probably set up systems that can communicate with the grid master over lan1, even if they eventually plan to move the vpn communication to the mgmt port. but if it can't, this situation will arise.
infoblox should probably fix this, as i would assume it's not supposed to behave this way. i guess it's also possible it's been fixed in some newer version. but i know i've seen this on 8.0.x and 8.1.x versions, and i think 7.x versions as well (time heals all wounds). (for the record, i don't believe i've discussed this with infoblox or opened a ticket about it or anything. maybe someone should do that... : ) enjoy!
10-06-2020 04:34 PM
to reply to my own post, it's also possible infoblox assumes that anyone setting up these devices should know to set up routes if traffic can't flow over lan1. but it feels like if the setup allows you to choose to have grid communication happen over the mgmt port, then the setup should take care of what is necessary to have the grid communication happen over the mgmt port.