Score:0

NTP/Chrony not keeping time synchronized on CentOS 7.9 (VM running on VMware ESXi)

cn flag

I have 3 servers running CentOS 7.9.2009 in a data center (VMware ESXi). These servers report that the time is not synchronized. I have a similar test environment running on an in-house VMware ESXi server, where the servers sync the time fine. The production environment was originally set up in exactly the same way - though updated with package updates over time - so the two environments "should" be identical, but I can no longer guarantee that. Both ESXi servers are version 6.

The servers were originally configured to use ntpd, but while troubleshooting this issue over the last couple of days I found that chrony seems to be a better choice on CentOS 7. I have therefore reconfigured the servers to use chrony - but the problem persists.

Edit: Steps used to change to Chrony

  • yum install chrony
  • systemctl stop ntpd
  • systemctl disable ntpd
  • systemctl start chronyd
  • systemctl enable chronyd

So when I use timedatectl I get this output:

      Local time: Mon 2021-08-02 09:14:43 CEST
  Universal time: Mon 2021-08-02 07:14:43 UTC
        RTC time: Mon 2021-08-02 07:16:34
       Time zone: Europe/Copenhagen (CEST, +0200)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2021-03-28 01:59:59 CET
                  Sun 2021-03-28 03:00:00 CEST
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2021-10-31 02:59:59 CEST
                  Sun 2021-10-31 02:00:00 CET

If I restart chrony using systemctl restart chronyd, then after a couple of seconds timedatectl reports:

      Local time: Mon 2021-08-02 09:26:06 CEST
  Universal time: Mon 2021-08-02 07:26:06 UTC
        RTC time: Mon 2021-08-02 07:26:08
       Time zone: Europe/Copenhagen (CEST, +0200)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2021-03-28 01:59:59 CET
                  Sun 2021-03-28 03:00:00 CEST
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2021-10-31 02:59:59 CEST
                  Sun 2021-10-31 02:00:00 CET

After some time (minutes) it is back to NTP synchronized: no.

When I run ntpstat I get:

synchronised to NTP server (217.198.219.102) at stratum 2
   time correct to within 124123 ms
   polling server every 64 s

or

unsynchronised
poll interval unknown

In the last case it will, after some time, show the first output again. But the "time correct to within ... ms" value seems pretty high - 124123 ms is more than two minutes?

Since I can get it synchronized by restarting chrony, I assume the firewall/network is OK. I use the default chrony config (as I did with ntpd before).

The VMware Tools service is installed and started (open-vm-tools, http://github.com/vmware/open-vm-tools).

I would appreciate any suggestions for troubleshooting this further - and eventually fix it ;-)

Thanks in advance!

/John

vidarlo avatar
ar flag
Is the VM configured to get time from the host as well?
Michael Hampton avatar
cz flag
Did you forget to stop ntpd?
John Dalsgaard avatar
cn flag
@MichaelHampton - Nope. I stopped and disabled ntpd ;-) I'll update the description
John Dalsgaard avatar
cn flag
@vidarlo Hmmm... good question. It shouldn't - but on the other hand that is one of the differences. How can I check that?
Paul Gear avatar
cn flag
Does your distro's version of `open-vm-tools` enable time sync by default? That will severely mess up both `chronyd` and `ntpd`.
John Dalsgaard avatar
cn flag
Good question @PaulGear, I will check up on that. But I wouldn't think so as I have the same setup running on our internal VMware ESXi server....
Paul Gear avatar
cn flag
@JohnDalsgaard Best to check syslogs to see what chronyd is saying about time adjustments.
Score:1
cn flag

I think I have solved it now.

Basically, chrony thought that the time varied too much. Following the link posted by Peter Rosenberg (and the resources it linked to) I got on the right track...

I've put this information here in case someone else searches for it.

The first step in the process was to check the status of the chronyd service:

systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-02 22:23:39 CEST; 10h ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 24758 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 24754 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 24756 (chronyd)
   CGroup: /system.slice/chronyd.service
           └─24756 /usr/sbin/chronyd

Aug 03 08:41:24 db1.aqua.dtu.dk chronyd[24756]: Selected source 162.159.200.1
Aug 03 08:41:24 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 5.118732 seconds, adjustment started
Aug 03 08:41:26 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no majority
Aug 03 08:41:33 db1.aqua.dtu.dk chronyd[24756]: Selected source 162.159.200.123
Aug 03 08:41:33 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 1.761045 seconds, adjustment started
Aug 03 08:42:29 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no majority
Aug 03 08:42:30 db1.aqua.dtu.dk chronyd[24756]: Selected source 192.36.143.130
Aug 03 08:42:30 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 4.500188 seconds, adjustment started
Aug 03 08:43:34 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 4.842190 seconds, adjustment started
Aug 03 08:44:39 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no selectable sources
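The offsets in these "System clock wrong" messages can be pulled out of the journal to get a feel for the drift rate. A quick sketch using one of the lines above (live use would pipe from journalctl -u chronyd instead of hardcoding a sample line):

```shell
# Extract the reported offset from a chronyd "System clock wrong" message.
# Live use: journalctl -u chronyd | grep "System clock wrong" | sed ...
line="Aug 03 08:42:30 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 4.500188 seconds, adjustment started"
offset=$(echo "$line" | sed -n 's/.*wrong by \([0-9.]*\) seconds.*/\1/p')
echo "$offset"
```

Seeing these offsets grow between adjustments (as they do in the log above) is what points to something outside chrony moving the clock.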

It clearly showed that something was wrong. So the next step was:

chronyc sources -v
210 Number of sources = 4

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^~ time.cloudflare.com           3   6   377     1   -17.0s[ -17.0s] +/- 1318us
^~ Time100.Stupi.SE              1   6   377     2   -16.9s[ -16.9s] +/- 4458us
^~ time.cloudflare.com           3   6   377    53   -11.2s[ -11.2s] +/- 1306us
^~ n1.taur.dk                    1   6   377    60   -10.4s[ -10.4s] +/- 4964us

Notice the '~' ("time too variable") state for all of the servers...

And chronyc tracking also shows that the time is not aligned at all:

Reference ID    : C0248F82 (Time100.Stupi.SE)
Stratum         : 2
Ref time (UTC)  : Tue Aug 03 06:46:05 2021
System time     : 132.970306396 seconds slow of NTP time
Last offset     : -4.842189789 seconds
RMS offset      : 7.720179081 seconds
Frequency       : 63.104 ppm slow
Residual freq   : -81143.852 ppm
Skew            : 90.130 ppm
Root delay      : 0.008654756 seconds
Root dispersion : 19.424978256 seconds
Update interval : 58.2 seconds
Leap status     : Normal

After some more reading of the articles referenced there, I tried adjusting the makestep directive in the /etc/chrony.conf file to force an update. I had already changed the NTP pool servers to ones "nearer" the application servers, so the config file now looks like this:

server 0.dk.pool.ntp.org iburst
server 1.dk.pool.ntp.org iburst
server 2.dk.pool.ntp.org iburst
server 3.dk.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1 -1
rtcsync

It has now been running for a little while and seems to be keeping the time synchronized ;-)

EDIT:

As Paul Gear pointed out, I had not actually solved the issue... The time still drifted.

Using /usr/bin/vmware-toolbox-cmd timesync status I found that on the production servers synchronisation of the time with the ESXi host was ENABLED (!!!). I have no idea how this happened - the VM I originally configured and uploaded to the data center guys did not have it enabled. In any case, the guest obviously should not sync its time with the host while chrony is running.

It is fairly easy to disable: /usr/bin/vmware-toolbox-cmd timesync disable
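A small guard along these lines could be dropped into provisioning so the setting stays off. This is a sketch: it assumes open-vm-tools installs the tool at the standard path and prints exactly "Enabled"/"Disabled"; the status is hardcoded here so the logic can be exercised outside a VMware guest:

```shell
# Disable VMware host time sync only if it is currently enabled.
# On a real VM, replace the hardcoded value with:
#   status=$(/usr/bin/vmware-toolbox-cmd timesync status)
status="Enabled"
if [ "$status" = "Enabled" ]; then
    echo "disabling host time sync"
    # /usr/bin/vmware-toolbox-cmd timesync disable
fi
```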

And now we have more realistic data from chronyc sources -v:

210 Number of sources = 4

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^- sweetums.eng.tdc.net          2   7   377    36    +30us[  +30us] +/-   45ms
^* 77.68.139.83                  1   7   377    92   -191us[ -184us] +/- 4742us
^- 152.115.59.244                2   7   377    39    +99us[  +99us] +/-   31ms
^- pf.safe-con.dk                2   7   377    42   +359us[ +359us] +/-   29ms

as well as chronyc tracking:

Reference ID    : 4D448B53 (77.68.139.83)
Stratum         : 2
Ref time (UTC)  : Tue Aug 03 10:45:26 2021
System time     : 0.000008465 seconds slow of NTP time
Last offset     : +0.000006720 seconds
RMS offset      : 7.358564854 seconds
Frequency       : 57.633 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.340 ppm
Root delay      : 0.009058274 seconds
Root dispersion : 0.000351956 seconds
Update interval : 128.8 seconds
Leap status     : Normal

It has now been running smoothly for half an hour, so I'm confident this is the solution. Thanks for the input!!!

Paul Gear avatar
cn flag
I don't think you've solved the problem; I think you've just worked around the symptoms. Between 08:41:33 and 08:43:34 your system went 3 seconds further out of sync. That is a very long way in 2 minutes, and I think it indicates some underlying problems with your VMware setup or your kernel configuration. A lot of old guides suggest very outdated kernel parameters to be used to "tune" for VMware, and they will often make things worse. Check over your kernel and hypervisor configuration (preferably turn off host time sync in VMware) and turn on chrony logging to see the change over time.
John Dalsgaard avatar
cn flag
@PaulGear you are right. I can see that it occasionally reports that it is not synced. The kernel parameters were needed in some contexts up to CentOS 5; from 6+ they should not be needed any more - and I don't set them. The ESXi server that these servers run on is "out of my hands", so there could be things going on in the host that I don't know about. I'll check open-vm-tools to see if I can control time sync from the service. I agree it should NOT try to sync with the ESXi server's time...
John Dalsgaard avatar
cn flag
Bingo @PaulGear. For some reason this was enabled on the production servers....???? I used: `/usr/bin/vmware-toolbox-cmd timesync status` to see that it was "Enabled" - and `/usr/bin/vmware-toolbox-cmd timesync disable` to disable it. This seems to be the root cause - and I guess I can possibly go back to a chrony.conf file that is closer to the default. Will see how it goes - and update here. Thanks!
Paul Gear avatar
cn flag
Good to hear, John. It's definitely preferable to keep your configuration file as close to the defaults as possible.
Score:0
im flag

Consider using the minimal client setup suggested here: https://www.golinuxcloud.com/configure-chrony-ntp-server-client-force-sync/ When the drift threshold is exceeded, chrony gives up on syncing, which seems to be what is happening to you.
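For completeness, a minimal client configuration along those lines looks roughly like this (a sketch based on the stock CentOS 7 defaults - substitute pool servers as appropriate; the linked article has the details):

```
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
# Step the clock if it is more than 1 second off, but only
# during the first 3 updates after startup.
makestep 1.0 3
rtcsync
```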

Michael Hampton avatar
cz flag
What is the setup? Not everyone can click through links, and links die eventually anyway, so all of the relevant information should be included in your answer.