Score:1

Troubleshooting NTP sync in VirtualBox Linux server

cn flag

I have a rather simple setup where I have two computers:

Computer A. has a normal NTP setup and uses the standard Internet sources (As per Ubuntu) to determine time. It also allows for query on IP 10.0.2.0/24:

restrict 10.0.2.0 mask 255.255.255.0 nomodify notrap

Computer B. has a normal NTP setup, except for all the sources are changed to use 10.0.2.1 (which is Computer A).

Once in a while, Computer A gets a Kiss-of-Death signal from one of its source. As a result, it totally kills Computer B's NTP (i.e. it looks like the KoD is transmitted directly).

Is there a way to know the state of an NTP server in terms of whether it will just be sending KoD message or not? (also, how do I get out of that situation? When I looked at it, all the IP addresses shown in the log were not used by the server?! so I don't understand why it insists on sending KoD to its client).


I've found two things so far:

  1. ntpq

    I can run ntpq like so:

     ntpq -pn
    

    When the NTP server works, I can see an asterisk in front of the IP address the computer is happy about. In my case, all of the status flags (first column +, -, *, #, etc.) all disappear. As far as I know, that means the NTP service is not happy and no synchronization is being performed.

    Here is an example when it still works (i.e. there are flags in the very first column):

          remote           refid      st t when poll reach   delay   offset  jitter
     ==============================================================================
      10.0.2.255      .BCST.          16 B    -   64    0    0.000    0.000   0.000
     #51.77.203.211   134.59.1.5       3 u    4   64    1  171.248  -743.64 691.917
     +72.5.72.15      216.218.254.202  2 u    2   64    1   19.223  -778.34 686.200
     +159.69.25.180   192.53.103.103   2 u    3   64    1  237.733  -775.41 701.376
     +173.0.48.220    43.77.130.254    2 u    2   64    1   35.489  -778.85 669.187
      38.229.56.9     172.16.21.35     2 u   31   64    1  153.976  -268.90 122.557
     +137.190.2.4     .PPS.            1 u   31   64    1   93.797  -253.69 116.289
     +150.136.0.232   185.125.206.71   3 u   35   64    1   95.667  -178.19 114.912
      94.154.96.7     194.29.130.252   2 u   31   64    1  237.560  -231.88 107.230
     +162.159.200.123 10.4.1.175       3 u   34   64    1   16.246  -199.68 115.561
     *216.218.254.202 .CDMA.           1 u   35   64    1   52.906  -193.84 131.148
      91.189.91.157   132.163.96.1     2 u   45   64    1   87.772   -5.716   0.000
     +204.2.134.163   44.24.199.34     3 u   34   64    1   16.711  -199.12 116.777
     +74.6.168.73     208.71.46.33     2 u   35   64    1   69.772  -189.21 128.119
      91.189.89.199   17.253.34.123    2 u   45   64    1  165.471   -3.708   0.000
     +216.229.0.49    216.218.192.202  2 u   35   64    1   71.437  -178.94  97.505
      91.189.89.198   17.253.34.123    2 u   44   64    1  172.852  -17.899   0.000
    
  2. ntpdate -q <ip>

    The ntpdate command will actually tell me whether the NTP is accepting packets. This is because it gives an error message if not:

     $ sudo ntpdate -q 10.0.2.1
     server 10.0.2.1, stratum 4, offset 5.194725, delay 0.02652
     21 Jul 15:22:48 ntpdate[13086]: no server suitable for synchronization found
    

    This happens after a little while when my main server loses the * status on the one server it was first happy to synchronize with...

Now... I still need to understand what I have to do to fix this issue...


This may be helpful, here are the logs about a restart from a full reboot:

Jul 21 18:29:13 vm-ve-ctl kernel: [  434.275481] audit: type=1400 audit(1626917353.636:43): apparmor="DENIED" operation="open" profile="/usr/sbin/ntp
d" name="/snap/bin/" pid=3896 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: ntpd [email protected] (1): Starting
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 126:129
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: proto: precision = 0.190 usec (-22)
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Cannot open logfile /var/log/ntp.log: Permission denied
Jul 21 18:29:13 vm-ve-ctl kernel: [  434.291490] audit: type=1400 audit(1626917353.652:44): apparmor="DENIED" operation="capable" profile="/usr/sbin/
ntpd" pid=3901 comm="ntpd" capability=1  capname="dac_override"
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash signature
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, expire=2021-12-28T00:00:00Z last=2017-01-01T
00:00:00Z ofs=37
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 0 v6wildcard [::]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 2 lo 127.0.0.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 3 enp0s3 192.168.2.120:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 4 enp0s8 10.0.2.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 5 lo [::1]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 6 enp0s3 [fe80::a00:27ff:fe25:38ff%2]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 7 enp0s8 [fe80::a00:27ff:fe35:c30b%3]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listening on routing socket on fd #24 for interface updates
Jul 21 18:29:14 vm-ve-ctl ntpd[3901]: Soliciting pool server 51.77.203.211
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 159.69.25.180
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 72.5.72.15
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 198.251.86.68
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 173.0.48.220
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 38.229.56.9
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 150.136.0.232
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 94.154.96.7
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 137.190.2.4
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 162.159.200.123
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.218.254.202
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.91.157
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.199
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 74.6.168.73
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 204.2.134.163
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.198
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.229.0.49
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 2604:ed40:1000:1711:d862:f5ff:fe4e:41c4
Jul 21 18:29:21 vm-ve-ctl ntpd[3901]: receive: Unexpected origin timestamp 0xe4a34871.ac57f05d does not match aorg 0000000000.00000000 from [email protected] xmt 0xe4a34871.65648c54

I do not know exactly when it starts going bad. I've also seen the following which I thought could have something to do with it (i.e. when that happens, the corresponding IP is removed from the list!), but it is already bad now and no such error occurred in my last run.

Jul 21 18:08:57 vm-ve-ctl ntpd[9764]: 92.243.6.5 local addr 192.168.2.120 -> <null>

Note: the 192.168.2.120 is the IP of the failing computer. It is a VirtualBox. It has been working for months... though, maybe something changed which makes it unhappy.

I found this post about an issue with the ... -> <null> message. However, I think we have a newer version on Ubuntu 18.04:

SUSE minimum recommended version: ntp-4.2.8p7-11.1
Ubuntu 18.04 version: 1:4.2.8p10+dfsg-5ubuntu7.3


Just in case, I tried to connect the VM to the host and I still get a huge offset and jitter. What could have changed?!

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.0.2.10       .POOL.          16 p    -   64    0    0.000    0.000   0.000
 10.0.2.10       132.163.97.6     2 u   54   64    3    0.457  -5254.2 3917.68

As asked by Paul Gear, here is the output of ntpq with additional details:

$ ntpq -ncrv
associd=0 status=0028 leap_none, sync_unspec, 2 events, no_sys_peer,
version="ntpd [email protected] (1)", processor="x86_64",
system="Linux/4.15.0-151-generic", leap=00, stratum=4, precision=-23,
rootdelay=17.930, rootdisp=5019.260, refid=173.255.215.209,
reftime=e4a44f7a.1c2ec778  Thu, Jul 22 2021 13:11:38.110,
clock=e4a45030.c8a4b259  Thu, Jul 22 2021 13:14:40.783, peer=0, tc=6,
mintc=3, offset=-109.527915, frequency=-1.707, sys_jitter=0.000000,
clk_jitter=38.724, clk_wander=0.000, tai=37, leapsec=201701010000,
expire=202112280000

Here is the list of available clocks and the one currently used:

$ grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource:kvm-clock tsc acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:kvm-clock

And finally, the dmesg output about the clocksource selection process:

$ dmesg | grep clocksource
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.283117] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    1.161844] clocksource: Switched to clocksource kvm-clock
[    1.208316] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    2.329228] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1db81a3240f, max_idle_ns: 440795250379 ns
Paul Gear avatar
cn flag
Assuming that your first `ntpq -np` output is from computer A, it seems to have an inaccurate clock (although NTP hasn't been running for a long time, so it's hard to tell for sure). It seems like you might not have the best clock source configured for running in that VM. What does `ntpq -ncrv` show after NTP has been running for at least 15 minutes? Also, what does `grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource; dmesg | grep clocksource` show?
cn flag
@PaulGear I switched the paravirtualization to **Minimal** and that made it work. I switched back to **Default** and got the same issues. I have a link in my answer of where I found the solution and details about it. I still added the output as you asked here. I'll check what clock is used in **Minimal** and post that too.
Score:1
cn flag

It looks like I found a solution. I'm not too sure why it would have been working before, though.

After many searches, I found this VirtualBox ticket:

https://www.virtualbox.org/ticket/15179

which says that you shouldn't use ntpd because the VM already does keep time using the Host time to adjust the VM's virtual clock.

However, even withouth ntpd running, my VM's clock was off and would bounce between ±30 seconds in a short period of time.

Reading that post further, they offered to adjust the paravirtualization settings to "None". That didn't seem to work for me. I restarted the VM and it got stuck. So I instead tried with "Minimal" and now the clock is working. The ntpq -np output looks a lot better:

Thu Jul 22 12:57:57 PDT 2021
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.008
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.008
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.008
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.008
+23.157.160.168  129.6.15.28      2 u   20  128  377   83.831    0.783   1.745
-104.131.139.195 163.237.218.19   2 u   28  128  377   20.162    3.622   1.451
+144.76.64.40    36.224.68.195    2 u   18  128  377  179.329    2.557   6.671
-159.89.86.140   192.5.41.40      2 u  126  128  377   87.016    3.094   1.385
-23.175.208.10   239.9.71.195     2 u   35  128  377   82.350    2.311   0.794
+206.82.16.3     47.187.174.51    2 u  127  128  377   84.683    1.385   0.753
+92.243.6.5      85.199.214.98    2 u   25  128  377  163.041    4.275   4.025
*200.160.7.186   .ONBR.           1 u   47  128  377  191.078    1.126   1.865
+51.255.197.148  217.91.44.17     2 u   96  128  367  154.201    1.225   4.742

As we can see, the Offset (max. 4.3) and Jitter (max. 6.7) are minuscule compare to what I was getting before. Now my other computer also works and can synchronize itself to this clock. The delay is around 0.7, which is enough for me (in my case, enough is 12.0 or less).

IMPORTANT NOTE: I rebooted that VM 2 or 3 times using Minimal, the boot time is excruciatingly long. So I do not recommend using this mode except as a the very last resort if you absolutely need a working system clock. If you have a better solution that works, I'm all ears!


Just in case, I wanted to see the difference in regard to the clock sources in Minimal mode. We just get the acpi_pm clock.

$ grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource:acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:acpi_pm

Now I'm wondering whether this could have an effect on the host clock. Since I also have ntp on my host, I don't think this is an issue.

vidarlo avatar
ar flag
NTP attempts to build a profile of local clock drift. This assumes that the clock is somewhat well behaving, and drifts in a linear fashion. When you have a clock that is set by a source unknown to ntpd, it gives up - as it steps randomly in unexplainable ways. You may want to read [this Q&A](https://stackoverflow.com/questions/25041213/how-to-disable-virtualbox-time-sync-from-within-the-guest-at-runtime) over at SE as well for how to disable this.
cn flag
@vidarlo Ah... the `--disable-timesync` is probably the solution. That being said, I just saw that the default clock uses the `kvm-clock` and switching to `tsc` would probably be better in terms of "not drifting". Still testing various things here.
vidarlo avatar
ar flag
For virtualized workloads, synchronize the host, and let the VM's get the time from the host. The clock drift is wild anyway due to the unknown rescheduling the hypervisor performs.
cn flag
@vidarlo In my case, though, I'm simulating an application which installs and run `ntpd` because I need a local network with a very tight timing. That being said, somehow the issue was very visible on that VM even without ntp running. Rather strange. It was working fine for nearly a year...
Score:0
cn flag

Okay, rather unusual, I'm adding a second answer because this 100% explains why the started breaking. Within a few days after I last rebooted, there is an NVidia driver upgrade. Then I started my VM (i.e. the order is very important here!)

I didn't know that the 3D environment could be missing and if not accelerated, then the VM would totally misbehave in terms of clock/time.

Note that the fact that the 3D environment is not available was not visible to me. It may have been mentioned in some logs, but as a standard user, I missed that part completely.

After a full reboot (required by that NVidia upgrade), the VM works normally again. No need to use the Minimal option. I put that VM back to Default and it works nicely all the way as before.

I hope this saves a few people the headache I had for 3 days...

For those interested, changing the clock may also help. Oracle has a nice page about how to change the clock source. On my end, I ended up using the apci_pm which seems to help greatly in maintaining time for a long time.

# echo "acpi_pm" > /sys/devices/system/clocksource/clocksource0/current_clocksource

You can also update your boot parameters so that way each time you boot you get that selected source.

GRUB_CMDLINE_LINUX="... clocksource=acpi_pm"

(I removed irrelevant parameters here, don't delete them, just add the clocksource parameter; it may also be that the variable is empty the first time you edit it: GRUB_CMDLINE_LINUX="").

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.