03-08-2019 10:54 AM
Hi all, I've just configured a new grid to sync with an external NTP server. My GM is approx. 2.5 seconds out of sync, and I can't wait days while the clock slowly slews into sync; I need to get this synced now. I have tried rebooting in the hope it would force an "ntpdate" to be run, but my GM is still 2.5 seconds out.
I read somewhere that you need quite a big offset before ntpdate is run, but that doesn't help me right now.
Is there some secret command in maintenancemode or something where I can force an "ntpdate" to be executed?
03-08-2019 11:32 AM - edited 03-08-2019 11:36 AM
To the best of my knowledge:
1. Step changes (ntpdate) happen on Infoblox only in the event of a restart, and only when your offset is >300 s.
2. 2.5 seconds should be well under the tolerance limits of most protocols, and I don't expect it to take days to slew-correct it (provided you have 3 or more good NTP sources).
Unfortunately I am not aware of any maintenance/expert-mode CLI commands that would help you here. ntpdate is possible, but only through a root session.
If this were my lab setup, and if 2.5 seconds were a problem with no root access, I would manually adjust the clock on the GM by +/-5 or 6 minutes and perform a product restart (although I think the restart should trigger on its own). I know, not a good solution.
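A toy sketch of the step-vs-slew behaviour described in point 1, for anyone who wants the logic spelled out (the 300 s figure is from this thread rather than from Infoblox documentation, and the function is purely illustrative):

```python
# Assumption: step corrections (ntpdate-style jumps) only happen on a
# restart when the offset exceeds 300 s, per point 1 above; everything
# else is slewed gradually at ntpd's maximum rate of 500 PPM.
RESTART_STEP_THRESHOLD_S = 300.0

def correction_mode(offset_s: float, restarting: bool) -> str:
    """Illustrative: which correction a given offset would get."""
    if restarting and abs(offset_s) > RESTART_STEP_THRESHOLD_S:
        return "step"   # jump the clock in one go
    return "slew"       # trickle the correction in at <= 500 PPM

print(correction_mode(2.5, restarting=True))    # slew: 2.5 s is under the threshold
print(correction_mode(600.0, restarting=True))  # step
```

So a reboot with only 2.5 s of offset never triggers the step, which matches what the original poster saw.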
03-08-2019 12:41 PM
Assuming that my math is right, at 500 PPM a 2.5-second offset should be corrected in approx. 1.4 hours (2.5 s / 500 µs per second = 5,000 s).
- Provided that you have 3 or more good time sources with good reach, a sensible stratum, and low jitter.
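The arithmetic above is easy to check: ntpd's maximum slew rate is 500 PPM, i.e. 500 µs of correction applied per second of real time. A quick illustrative calculation (the function name is mine):

```python
# ntpd slews the clock at a maximum rate of 500 PPM, which means it can
# remove at most 500 microseconds of offset per second of wall time.
MAX_SLEW_PPM = 500

def slew_hours(offset_seconds: float, ppm: float = MAX_SLEW_PPM) -> float:
    """Hours needed to slew away a given offset at `ppm` parts per million."""
    seconds_needed = offset_seconds / (ppm / 1_000_000)
    return seconds_needed / 3600

print(round(slew_hours(2.5), 2))  # -> 1.39 hours for a 2.5 s offset
```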
03-12-2019 08:56 AM
Currently I only have 2 reachable NTP servers; access to the others hasn't been granted yet. The clock is drifting the wrong way and is now 6.3 seconds out, and NTP is syncing with the local clock.
nlgm101 (A) > show ntp
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 22.214.171.124  .INIT.          16 u    -  256    0    0.000    0.000   0.000
 126.96.36.199   .INIT.          16 u    -  256    0    0.000    0.000   0.000
 188.8.131.52    .INIT.          16 u    -  256    0    0.000    0.000   0.000
 184.108.40.206  .INIT.          16 u    -  256    0    0.000    0.000   0.000
 10.55.164.105   .LOCL.           1 u   24   64  377    0.777  -6349.9   0.305
 10.55.164.108   .LOCL.           1 u   65   64  377    0.630  -6349.3   0.680
*127.127.1.1     .LOCL.          12 l   42   64  377    0.000    0.000   0.000
nlgm101 (A) >
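A note for anyone reading the output above: the reach column is an octal bitmask of the last eight polls, so 377 means all eight were answered and 0 (the .INIT. peers) means none were. A small illustrative decoder (not an Infoblox tool):

```python
def decode_reach(reach_octal: str) -> str:
    """Turn ntpq's octal reach register into an 8-poll history string.

    Rightmost bit is the most recent poll; '1' means a reply was received.
    """
    return format(int(reach_octal, 8), "08b")

print(decode_reach("377"))  # '11111111' -> all of the last 8 polls answered
print(decode_reach("0"))    # '00000000' -> no replies at all (.INIT. peers)
print(decode_reach("376"))  # '11111110' -> only the most recent poll failed
```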
03-12-2019 10:00 AM
I removed the unreachable servers; an hour later it's still locked onto the local clock and the offset is getting bigger:
nlgm101 (A) > show ntp
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.55.164.105   .LOCL.           1 u   22   64  377    0.677  -6464.3   9.879
 10.55.164.108   .LOCL.           1 u   17   64  377    0.597  -6464.6  10.131
*127.127.1.1     .LOCL.          12 l   35   64  377    0.000    0.000   0.000
nlgm101 (A) >
03-12-2019 12:16 PM
I tried Bibin's suggestion of manually setting the time and then re-enabled NTP; this is what I have an hour later:
nlgm101 (A) > show ntp
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.55.164.105   .LOCL.           1 u   18   64  377    0.630  -662.08   8.297
 10.55.164.108   .LOCL.           1 u   10   64  377    0.744  -662.04   8.354
*127.127.1.1     .LOCL.          12 l   50   64  377    0.000    0.000   0.000
nlgm101 (A) >
The offset has come down, but it's still syncing with the local clock.
If it doesn't sort itself out within a couple of hours I'll log a ticket with support.
But this is symptomatic of the problems I have with Infoblox and NTP; this isn't the first time. I once had to reboot a whole grid just to sort out these kinds of problems, which is just plain ridiculous.
03-13-2019 01:19 AM
I had the same problem with a grid yesterday after initial installation.
I disabled NTP, rebooted the appliances and then re-enabled the NTP process.
For me this did the trick, bringing the offset down from 3 seconds to around 80 milliseconds.
However, this is not really possible in a production grid, so I would log the ticket and, in the worst case, get an RFE logged.
03-13-2019 11:45 AM - edited 03-13-2019 11:46 AM
Based on what an NTP expert stated years ago, 2 time sources is the worst possible NTP configuration, even worse than having just 1; 3 or more are recommended.
Having said that, the 'show ntp' output is a little confusing.
- Stratum 1 internal servers referencing their own local clocks? I'm not certain how that's a good configuration.
- The GM polled 10.55.164.105 18 seconds ago and 10.55.164.108 10 seconds ago, and we continue to poll them every 64 seconds: best possible reach value (377), a 600+ ms offset, and low jitter.
Despite all those good virtues, neither of the 10.x.x.x servers is being marked as a candidate, a bad-quality source, or a falseticker.
Perhaps 'set maintenancemode' followed by 'show ntpstats' can give more info.
Again, if this were my lab, I would try adding one more time source to the list and then disabling and re-enabling NTP sync from external sources. While I do understand that many people don't have the luxury of going out to fetch time, to be very frank I would use the following as NTP servers: 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199, alongside a couple of other NIST servers.
Wondering what those servers are? You can always do a reverse lookup.
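On the candidate/falseticker point: ntpq marks each peer's selection status with a tally character in the first column of the billboard. A rough sketch of the standard mapping (the classify_peer helper is mine, not a real ntpq feature):

```python
# Standard ntpq tally codes shown in the first column of 'ntpq -p' output.
TALLY = {
    "*": "system peer (currently selected for synchronisation)",
    "o": "PPS peer (selected, pulse-per-second signal in use)",
    "+": "candidate (survivor, included in the final combine)",
    "-": "outlier (discarded by the clustering algorithm)",
    "x": "falseticker (discarded by the intersection algorithm)",
    "#": "survivor, but beyond the maximum selection limit",
    ".": "excess peer (discarded early, too many sources)",
    " ": "rejected (unreachable, or failed basic sanity checks)",
}

def classify_peer(billboard_line: str) -> str:
    """Return the selection status of one 'ntpq -p' / 'show ntp' peer line."""
    tally = billboard_line[0] if billboard_line[0] in TALLY else " "
    return TALLY[tally]

print(classify_peer("*127.127.1.1     .LOCL.  12 l 35 64 377 0.000 0.000 0.000"))
print(classify_peer(" 10.55.164.105   .LOCL.   1 u 22 64 377 0.677 -6464.3 9.879"))
```

The blank tally against both 10.x.x.x servers above is what "not even a candidate" looks like: the peers are reachable but are being rejected outright.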
03-13-2019 12:43 PM
I totally agree that 2 is a bad number; I actually wrote a paper about this a year ago for another customer explaining why two servers is such a bad idea. I have asked this customer whether they have a third NTP reference they can provide. But now that you mention the refid being local rather than .GPS. or .PPS., you've really got me thinking: they are reporting as stratum 1, so something very fishy is going on here. I'm starting to think these aren't real stratum 1 servers, and I am not sure what the customer has set up.
The problem I have is that I am building a grid on a temporary network in preparation for a big network migration next week, when everything will be re-addressed onto its proper IPs. This means the grid doesn't have internet access and won't until the migration starts. So the customer has provided these temporary NTP servers so that we can do some testing, except they don't seem to be working (maybe because Infoblox doesn't "trust" them).
The members are all synced to the grid master, but obviously the grid master is drifting, so I guess it's going to drag the members along with it. We are doing a lot of testing this weekend, and one component is Cisco ISE, which of course needs accurate time for authentication, so I can only hope that system will work when it's using Infoblox as its NTP reference.
I'm logged in again now, here's the grid master:
nlgm101 (A) > show ntp
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.55.164.105   .LOCL.           1 u   15   64  377    0.703  -2881.2   0.159
 10.55.164.108   .LOCL.           1 u   16   64  377    0.693  -2880.6   0.505
*127.127.1.1     .LOCL.          12 l   18   64  377    0.000    0.000   0.000
nlgm101 (A) >
Here's the stats from maintenance mode:
Maintenance Mode > shpw ntpstats
unknown command 'shpw'
type 'help' for more information
Maintenance Mode > show ntpstats
ntpdc -c iostat ---------------
ntpq -c rv ----------
associd=0 status=0515 leap_none, sync_local, 1 event, clock_sync,
version="ntpd firstname.lastname@example.org Mon Mar 27 19:11:03 UTC 2017 (1)",
processor="x86_64", system="Linux/3.14.25", leap=00, stratum=13, precision=-23,
rootdelay=0.000, rootdisp=11.271, refid=LOCAL(1),
reftime=e033d85c.178f7d1b  Wed, Mar 13 2019 19:28:28.092,
clock=e033d873.8cd87104  Wed, Mar 13 2019 19:28:51.550,
peer=55199, tc=6, mintc=3, offset=0.000000, frequency=8.686,
sys_jitter=0.000000, clk_jitter=0.000, clk_wander=0.000
ntpq -c pstatus 55197 --------------------------
ntpq -c pstatus 55198 --------------------------
ntpq -c pstatus 55199 --------------------------
Maintenance Mode >
Here's a member:
nlsec101 (A) > show ntp
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*169.254.0.1     LOCAL(1)        13 u   22   64  377    0.565    0.038   0.116
 127.127.1.1     .LOCL.          14 l   6h   64    0    0.000    0.000   0.000
nlsec101 (A) > set maintenancemode
Maintenance Mode > show ntpstats
ntpdc -c iostat ---------------
ntpq -c rv ----------
associd=0 status=0638 leap_none, sync_ntp, 3 events, no_sys_peer,
version="ntpd email@example.com Mon Mar 27 19:11:03 UTC 2017 (1)",
processor="x86_64", system="Linux/3.14.25", leap=00, stratum=14, precision=-23,
rootdelay=0.557, rootdisp=13.408, refid=169.254.0.1,
reftime=e033da5c.8b817aba  Wed, Mar 13 2019 19:37:00.544,
clock=e033dad6.2067efcb  Wed, Mar 13 2019 19:39:02.126,
peer=1985, tc=6, mintc=3, offset=0.033728, frequency=0.380,
sys_jitter=0.079453, clk_jitter=0.168, clk_wander=0.001
Maintenance Mode >
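Incidentally, the "ntpq -c rv" blobs above are just comma-separated key=value pairs, so values like frequency (the drift estimate in PPM) and rootdisp are easy to pull out with a throwaway parser. A minimal illustrative sketch (it ignores the bare status words like leap_none):

```python
import re

def parse_rv(rv_text: str) -> dict:
    """Extract key=value pairs from 'ntpq -c rv' output into a dict.

    Values are either quoted strings or runs of non-comma, non-space
    characters; bare flags such as 'sync_local' are skipped.
    """
    return dict(re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', rv_text))

# Sample trimmed from the grid master output above.
rv = parse_rv(
    "associd=0 status=0515 leap_none, sync_local, 1 event, clock_sync, "
    "stratum=13, precision=-23, rootdelay=0.000, rootdisp=11.271, "
    "refid=LOCAL(1), offset=0.000000, frequency=8.686"
)
print(rv["stratum"], rv["frequency"], rv["refid"])  # 13 8.686 LOCAL(1)
```

The frequency=8.686 on the GM versus 0.380 on the member is the GM's clock-drift correction in PPM, well within the 500 PPM slew limit.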
I think I may just have to wait until the customer gets internet connectivity sorted, then I can sync with some "proper" servers.
All quite interesting though! :-)
03-14-2019 10:29 AM - edited 03-14-2019 10:30 AM
That didn't help. I was hoping to get a look at the "rootdisp" value of the 10.x.x.x servers, but apparently the output only shows the LOCAL clock.
Best guess: adding one more good time source (assuming the current 2 are also good) and restarting NTP should help.
Wish you the best on your migration.