I think I have solved it now.
Basically, chrony considered the time it received from its sources too variable. Following the link posted by Peter Rosenberg (and the resources it linked to) put me on the right track.
I've put this information here in case someone else searches for it.
The first step was to check the status of the chronyd service:
systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-02 22:23:39 CEST; 10h ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 24758 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 24754 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 24756 (chronyd)
   CGroup: /system.slice/chronyd.service
           └─24756 /usr/sbin/chronyd
Aug 03 08:41:24 db1.aqua.dtu.dk chronyd[24756]: Selected source 162.159.200.1
Aug 03 08:41:24 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 5.118732 seconds, adjustment started
Aug 03 08:41:26 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no majority
Aug 03 08:41:33 db1.aqua.dtu.dk chronyd[24756]: Selected source 162.159.200.123
Aug 03 08:41:33 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 1.761045 seconds, adjustment started
Aug 03 08:42:29 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no majority
Aug 03 08:42:30 db1.aqua.dtu.dk chronyd[24756]: Selected source 192.36.143.130
Aug 03 08:42:30 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 4.500188 seconds, adjustment started
Aug 03 08:43:34 db1.aqua.dtu.dk chronyd[24756]: System clock wrong by 4.842190 seconds, adjustment started
Aug 03 08:44:39 db1.aqua.dtu.dk chronyd[24756]: Can't synchronise: no selectable sources
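To see how often this happens without reading the whole status output, the journal can be filtered for the failure message. This is only a minimal sketch, assuming chronyd logs to the systemd journal as in the output above. The helper name count_sync_failures and the one-hour window are made up for illustration.

```shell
#!/bin/sh
# count_sync_failures: hypothetical helper (not part of chrony).
# Reads log lines on stdin and prints how many of them contain
# chronyd's "Can't synchronise" message.
count_sync_failures() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is neutralised with "|| true".
    grep -c "Can't synchronise" || true
}

# Only query the journal when journalctl is actually available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u chronyd --since "1 hour ago" --no-pager 2>/dev/null \
        | count_sync_failures
fi
```

A count that keeps growing means chronyd keeps losing its selected source, which is what the log excerpt above shows.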
This clearly showed that something was wrong, so the next step was:
chronyc sources -v
210 Number of sources = 4
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^~ time.cloudflare.com           3   6   377     1   -17.0s[ -17.0s] +/- 1318us
^~ Time100.Stupi.SE              1   6   377     2   -16.9s[ -16.9s] +/- 4458us
^~ time.cloudflare.com           3   6   377    53   -11.2s[ -11.2s] +/- 1306us
^~ n1.taur.dk                    1   6   377    60   -10.4s[ -10.4s] +/- 4964us
Notice the '~' state (time too variable) on every one of the servers.
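The same check can be scripted by counting the sources whose state column is '~'. This is a rough sketch that only relies on the column layout shown in the output above. The helper name count_variable_sources is an assumption for illustration, not something chrony provides.

```shell
#!/bin/sh
# count_variable_sources: hypothetical helper (not part of chrony).
# Reads `chronyc sources` output on stdin and prints the number of
# sources whose state character is '~' (time too variable).
count_variable_sources() {
    # A source line starts with a mode character (^, = or #)
    # immediately followed by the state character.
    awk '/^[=#^]~/ { n++ } END { print n + 0 }'
}

# Only call chronyc when it is installed.
if command -v chronyc >/dev/null 2>&1; then
    chronyc sources 2>/dev/null | count_variable_sources
fi
```

With the output above this would print 4, since all four sources are flagged.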
The output of chronyc tracking also shows that the time is not aligned at all:
Reference ID : C0248F82 (Time100.Stupi.SE)
Stratum : 2
Ref time (UTC) : Tue Aug 03 06:46:05 2021
System time : 132.970306396 seconds slow of NTP time
Last offset : -4.842189789 seconds
RMS offset : 7.720179081 seconds
Frequency : 63.104 ppm slow
Residual freq : -81143.852 ppm
Skew : 90.130 ppm
Root delay : 0.008654756 seconds
Root dispersion : 19.424978256 seconds
Update interval : 58.2 seconds
Leap status : Normal
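After some more reading in the references from the articles mentioned, I tried adjusting the makestep directive in /etc/chrony.conf to force an update. With "makestep 1 -1", chronyd steps the clock whenever the offset exceeds 1 second, and the negative second value removes the limit on how many updates may do so. I had already changed the NTP pool servers to be "nearer" the application servers, so the config file now looks like this: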
server 0.dk.pool.ntp.org iburst
server 1.dk.pool.ntp.org iburst
server 2.dk.pool.ntp.org iburst
server 3.dk.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1 -1
rtcsync
It has now been running for a little while and it seems to be keeping the time synchronized ;-)
EDIT:
As Paul Gear pointed out, I had not solved the issue... The time still drifted.
Using /usr/bin/vmware-toolbox-cmd timesync status I found that on the production servers, time synchronisation with the ESXi host was ENABLED (!!!). I have no idea how this happened; the VM I originally configured and uploaded to the data center guys did not have it enabled. Anyway, the guest obviously should not sync its time with the host while chrony is also adjusting the clock, because the two keep pulling it in different directions.
It is fairly easy to disable by using: /usr/bin/vmware-toolbox-cmd timesync disable
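The check and the fix can be combined so the same script can be run on every VM. This is a sketch under one assumption: that the status subcommand prints a line beginning with "Enabled" or "Disabled". The helper names timesync_is_enabled and ensure_host_timesync_disabled are made up here.

```shell
#!/bin/sh
# Path taken from the commands above.
TOOLBOX=/usr/bin/vmware-toolbox-cmd

# timesync_is_enabled: hypothetical helper. Succeeds when the status
# text passed as $1 starts with "Enabled" (assumed output format).
timesync_is_enabled() {
    case "$1" in
        [Ee]nabled*) return 0 ;;
        *)           return 1 ;;
    esac
}

# ensure_host_timesync_disabled: hypothetical wrapper that disables
# host time sync only when it is currently reported as enabled.
ensure_host_timesync_disabled() {
    if [ ! -x "$TOOLBOX" ]; then
        echo "vmware-toolbox-cmd not found, nothing to do"
        return 0
    fi
    status=$("$TOOLBOX" timesync status)
    if timesync_is_enabled "$status"; then
        echo "host time sync is enabled, disabling it"
        "$TOOLBOX" timesync disable
    else
        echo "host time sync is already disabled"
    fi
}

ensure_host_timesync_disabled
```

On a machine without VMware Tools the script only reports that the tool is missing and changes nothing.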
And now we get more realistic data from chronyc sources -v:
210 Number of sources = 4
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- sweetums.eng.tdc.net          2   7   377    36    +30us[  +30us] +/-   45ms
^* 77.68.139.83                  1   7   377    92   -191us[ -184us] +/- 4742us
^- 152.115.59.244                2   7   377    39    +99us[  +99us] +/-   31ms
^- pf.safe-con.dk                2   7   377    42   +359us[ +359us] +/-   29ms
as well as from chronyc tracking:
Reference ID : 4D448B53 (77.68.139.83)
Stratum : 2
Ref time (UTC) : Tue Aug 03 10:45:26 2021
System time : 0.000008465 seconds slow of NTP time
Last offset : +0.000006720 seconds
RMS offset : 7.358564854 seconds
Frequency : 57.633 ppm slow
Residual freq : +0.001 ppm
Skew : 0.340 ppm
Root delay : 0.009058274 seconds
Root dispersion : 0.000351956 seconds
Update interval : 128.8 seconds
Leap status : Normal
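To keep an eye on this afterwards, the "System time" line can be compared against a threshold. This is a minimal sketch: the helper name tracking_offset_ok and the 0.1 second limit are arbitrary choices for illustration. It deliberately ignores "RMS offset", which is a long-term average and so still reflects the earlier bad samples for a while, as the 7.36 seconds above shows.

```shell
#!/bin/sh
# tracking_offset_ok: hypothetical helper (not part of chrony).
# Reads `chronyc tracking` output on stdin and succeeds when the
# "System time" offset is below the limit in seconds given as $1.
tracking_offset_ok() {
    awk -v max="$1" '
        /^System time/ {
            # Whitespace-split fields: "System" "time" ":" <seconds> ...
            found = 1
            ok = ($4 + 0 < max + 0)
        }
        # Fail when the line is missing or the offset is too large.
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Only call chronyc when it is installed.
if command -v chronyc >/dev/null 2>&1; then
    if chronyc tracking 2>/dev/null | tracking_offset_ok 0.1; then
        echo "clock is within 0.1 s of NTP time"
    else
        echo "clock offset is too large or could not be read"
    fi
fi
```

The value on that line is unsigned (the direction is given by "slow" or "fast"), so a plain numeric comparison is enough.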
It has now been running smoothly for half an hour so I'm confident this is the solution. Thanks for the input!!!