10-19-2016 02:33 AM
I have a grid consisting of:
- one virtual appliance (IPAM, grid master)
- two physical appliances (failover pair DHCP and internal DNS, grid master candidate)
- two smaller physical appliances (external DNS)
For disaster recovery purposes, we would like to periodically snapshot the virtual grid master and send it to the disaster recovery site. We are currently running it on a VMware + Veeam infrastructure. Whenever we trigger quiescing on the IPAM virtual appliance (to get a consistent backup), it crashes and has to be restarted.
Here are the versions of the software involved:
- NIOS 7.3.8 on all grid members
- IPAM = VM-1410, VMware virtual hardware version 7, VMware Tools version 2147483647 (guest managed)
- internal appliances = IB-1420
- external appliances = IB-820
- VMware ESXi version 6.0.0, 3620759
- Veeam Backup & Replication 9.0, Version: 126.96.36.1991
We ran some tests upgrading the virtual hardware to versions 8 through 11, but it only worked on an entirely disconnected clone of the IPAM virtual appliance. We opened a support ticket with Infoblox to ask whether they have tested virtual hardware upgrades and whether this could be a problem for support; they raised no significant objections.
So, here's the question: aside from shutting down the VM-1410, backing it up, and starting it again, do you have a better suggestion for sending an updated and consistent snapshot of the VM-1410 to our disaster recovery site? Unfortunately, moving one of the physical appliances to the disaster recovery site is not an option.
Thank you in advance.
10-19-2016 06:46 AM
This is a known issue when taking snapshots of virtual appliances, documented in this KB article with a reference to the supporting VMware KB.
For disaster recovery, you can use NIOS database backups, which can be scheduled daily and configured to be sent to an FTP or SCP server. This database backup can be restored on a NIOS appliance (hardware or virtual) to replicate your grid setup, including all members and their configurations.
On a side note, there is another issue triggered by snapshots on virtual machines that affects HA pairs.
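Scheduled backups only help if they actually arrive every night. As a minimal sketch, assuming the backups are uploaded as files into a single directory on the SCP server (the path `/srv/nios-backups` and the 26-hour threshold are hypothetical choices, not NIOS settings), a small Python check running on that server could flag a missing or stale backup:

```python
from __future__ import annotations

import time
from pathlib import Path

# Hypothetical threshold: one nightly backup plus a 2-hour grace period.
MAX_AGE_SECONDS = 26 * 3600


def newest_backup(backup_dir: Path) -> Path | None:
    """Return the most recently modified regular file in backup_dir, or None."""
    if not backup_dir.is_dir():
        return None
    files = [p for p in backup_dir.iterdir() if p.is_file()]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


def backup_is_fresh(backup_dir: Path, now: float | None = None) -> bool:
    """True if the newest file in backup_dir is younger than MAX_AGE_SECONDS."""
    latest = newest_backup(backup_dir)
    if latest is None:
        return False
    current = time.time() if now is None else now
    return current - latest.stat().st_mtime < MAX_AGE_SECONDS


if __name__ == "__main__":
    # Hypothetical directory on the SCP server where the NIOS backups land.
    directory = Path("/srv/nios-backups")
    print("fresh" if backup_is_fresh(directory) else "stale or missing")
```

Run from cron shortly after the scheduled backup window, this gives an early warning that the DR copy is out of date.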
10-19-2016 06:52 AM
The easiest solution would be to set up your DR server at your remote site as a member of your production Grid and enable it as a Master Candidate. By doing so, it will always retain a full and current copy of the Grid database without any further intervention required. If for some reason you need to switch over to using your DR server, you can promote it to take over the Grid Master role by connecting to its command line interface and running the command "set promote_master".
10-19-2016 07:06 AM
That would be my first choice, but unfortunately I can't leave a VM this big in the DR ESX server always on (and we also do not have another VM license suitable for the task).
@Aveen: Thank you for your answer. I'll have our ESX engineers evaluate the UUID KB article from VMware and, if needed, we'll resort to an external server.
10-19-2016 07:10 AM
That would not be an issue. Simply power on the DR system, allow it to sync back up to the Grid Master and once it shows a green status on the Grid Manager tab, power it back off. It will show up as offline when powered off but will not cause any issues and it will automatically synchronize with the Grid Master when it's powered back on and reconnects.
I imagine that some of this could even be automated via the API. Your sales rep should be able to help you with that if you need assistance.
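A rough sketch of that power-on, sync, power-off cycle in Python follows. It does not call any real API: `power_on`, `power_off` and `member_status` are hypothetical placeholders you would wire to your own vSphere power operations and Infoblox API query, and the `"green"` string simply mirrors the Grid Manager status described above rather than a specific API field.

```python
import time
from typing import Callable


def refresh_dr_member(
    power_on: Callable[[], None],
    power_off: Callable[[], None],
    member_status: Callable[[], str],
    timeout_s: float = 3600.0,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Power on the DR member, wait until it reports "green", then power it off.

    Returns True if the member synchronized before the timeout. The member is
    powered off in every case so it does not stay running on the DR host.
    """
    power_on()
    deadline = clock() + timeout_s
    synced = False
    try:
        while clock() < deadline:
            if member_status() == "green":
                synced = True
                break
            sleep(poll_interval_s)
    finally:
        power_off()
    return synced


if __name__ == "__main__":
    # Fake callables standing in for the real (hypothetical) integrations.
    statuses = iter(["red", "yellow", "green"])
    events = []
    ok = refresh_dr_member(
        power_on=lambda: events.append("on"),
        power_off=lambda: events.append("off"),
        member_status=lambda: next(statuses),
        poll_interval_s=0,
    )
    print(ok, events)  # True ['on', 'off']
```

The member is powered off in a `finally` block, so it is shut down again even if it never turns green before the timeout. That keeps the large VM from staying on in the DR ESX server, at the cost of possibly leaving it unsynchronized until the next scheduled run.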
01-18-2018 08:54 AM
I know I'm responding a few months later, sorry. Here is what I have done for the same type of situation.
Our current DR solution of this nature is built on an as-needed basis: we test it once a year and rebuild everything from scratch (from backups) every time. Over the years, I have tried numerous ways to get around the issues you are seeing. I finally settled on a nightly off-grid backup (which we take anyway) to a dedicated system that Veeam snapshots daily. That system also holds copies of the vNIOS appliances we use in production (only those). Here is the basic sequence of events we use:
A. DR network built
B. DR VM farm setup
C. DR Veeam restore of the guest mentioned above, which contains the grid backup and the vNIOS appliances
D. VM admin retrieves the virtual appliances from the guest and begins deploying them in the documented order
E. A member of my team connects and does basic configuration to the point where we can restore the backup from the day before the "day of failure".
F. VM admin continues deploying the rest of the virtualized environment
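Step E hinges on picking the right backup. A minimal sketch, assuming the nightly backups can be indexed by the date they were taken (the file names below are hypothetical), selects the newest backup taken before the day of failure; falling back to an older backup when that night's file is missing is an added assumption, not part of the procedure above:

```python
from __future__ import annotations

from datetime import date
from typing import Optional


def backup_for_restore(backups: dict[date, str], day_of_failure: date) -> Optional[str]:
    """Pick the newest backup taken strictly before the day of failure.

    Normally this is the previous day's backup; if that night's backup is
    missing, it falls back to the most recent earlier one.
    """
    candidates = [d for d in backups if d < day_of_failure]
    if not candidates:
        return None
    return backups[max(candidates)]


if __name__ == "__main__":
    # Hypothetical nightly backup files indexed by backup date.
    nightly = {
        date(2018, 1, 15): "grid-backup-2018-01-15.bak",
        date(2018, 1, 16): "grid-backup-2018-01-16.bak",
        date(2018, 1, 17): "grid-backup-2018-01-17.bak",
    }
    print(backup_for_restore(nightly, date(2018, 1, 17)))  # grid-backup-2018-01-16.bak
```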
We perform some magic to make the virtual members resemble the physical members of our grid for the tests, because too many "users" don't know how to change their own IP settings. In our environment, the physical grid members are primary and the virtual ones are secondary. At some point we will switch to anycast to get around these steps, but right now we don't have enough free time to plan that conversion.
Oh, and from my point of view, all of this is done using temporary licenses. In a real disaster we might need to migrate licenses to the new guests, but the temporary licenses last long enough for us to sort that out.
This is the only solution we could come up with that avoids constant crashes and possible database corruption. I would greatly appreciate it if Infoblox would/could make it possible to take a Veeam backup of a running VM without it crashing in the process, perhaps even by working directly with VMware/Veeam on a way to signal the guest and tell it "Hey, don't freak out, I just want to help protect you. Give me a few seconds and you can go back to the way you were."