Ubuntu reboots randomly

I'm running some servers on Hetzner (AX101) and have been experiencing random reboots for a while now; all my investigations have led absolutely nowhere.

Environment: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)

From the system's standpoint, it looks like nothing is happening:

Feb  6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb  6 10:46:18 server4 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb  6 10:46:18 server4 kernel: [    0.000000] KERNEL supported cpus:
Feb  6 10:46:18 server4 kernel: [    0.000000]   Intel GenuineIntel
Feb  6 10:46:18 server4 kernel: [    0.000000]   AMD AuthenticAMD
Feb  6 10:46:18 server4 kernel: [    0.000000]   Hygon HygonGenuine
Feb  6 10:46:18 server4 kernel: [    0.000000]   Centaur CentaurHauls
Feb  6 10:46:18 server4 kernel: [    0.000000]   zhaoxin   Shanghai
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb  6 10:46:18 server4 kernel: [    0.000000] signal: max sigframe size: 3376
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-provided physical RAM map:
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable

Everything works as expected, until it doesn't. The server goes down for two minutes and then just reappears, booting up again.
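One way to confirm these are uncommanded resets rather than clean shutdowns is to check the journal and wtmp around the gap. A minimal sketch, assuming a persistent systemd journal (the boot offset may need adjusting):

journalctl --list-boots                # one entry per recorded boot
journalctl -b -1 -n 50 --no-pager      # last messages of the previous boot
last -x reboot shutdown | head         # reboot/shutdown records from wtmp

If the previous boot simply ends mid-stream with no shutdown messages, the machine either lost power or the kernel died before anything reached the disk.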

The NVMe disks look perfectly fine:

smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    55,954,348 [28.6 TB]
Data Units Written:                 76,540,527 [39.1 TB]
Host Read Commands:                 993,043,774
Host Write Commands:                1,875,329,624
Controller Busy Time:               1,396
Power Cycles:                       5
Power On Hours:                     4,902
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               49 Celsius

I also ran memtest, which found no issues.
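As a complementary check for hardware errors that a one-off memtest would not catch at runtime, the machine-check and EDAC counters can be inspected. A sketch, assuming Ubuntu 22.04 package names:

sudo apt install rasdaemon                                   # logs MCE/EDAC events
sudo ras-mc-ctl --summary                                    # summary of logged memory/MCE errors
journalctl -k | grep -iE 'mce|machine check|hardware error'  # anything already in the kernel log

Silence here doesn't prove the hardware is fine, but repeated corrected errors would be a strong pointer.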

From a software standpoint, there is nothing special running there: PostgreSQL, node exporter, and that's basically it.

I contacted Hetzner about this problem and they even replaced all the hardware, but the problem persists, which makes me think it is likely software (I doubt power surges).
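One way to tell a kernel panic apart from a silent power cut would be to capture the crash output itself, since a panic often never reaches the local disk. A sketch assuming Ubuntu's kdump tooling; the netconsole addresses below are placeholders:

sudo apt install linux-crashdump   # pulls in kdump-tools and sets up a crashkernel= reservation
kdump-config show                  # after a reboot, should report "ready to kdump"

# Alternative: stream kernel messages to another host over UDP,
# so a panic is visible even if nothing is written locally
sudo modprobe netconsole netconsole=6666@192.0.2.10/enp41s0,6514@192.0.2.1/

If the console stays silent right up to the reset, that would point back at power or a hardware-level fault rather than the kernel.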

Any direction in which I can dig further into this problem?

user0103: Have you ever figured this out? I started experiencing the same thing on my AX41-NVME dedicated server.
Danny: @user0103 Unfortunately, no. Generally I'd say the most efficient route is to contact Hetzner support; they can perform a full hardware check, and if the problem persists they can replace the server. Basically, that's the only thing that helped with those reboots.
user0103: Thanks for taking the time to reply. I requested the full hardware check; they ran it and reported no errors found. I suspect a PSU issue, because they check RAM and disks and stress-test the CPU, etc., but it definitely looks like a power issue. I had something similar on my desktop PC before I replaced the PSU. In the end I got two such hard resets in one day yesterday and migrated to another server. No issues so far despite running the same software and OS (Ubuntu 22.04), so it does seem to be a hardware issue, and it can only be resolved by replacing the server.
Danny: @user0103 I also suspect PSU issues, but I haven't found any confirmation of it, neither from Hetzner nor from the system side. In my experience, the best thing to do now is to patiently wait until the server fails again; once that happens, just reopen the ticket and say the issue persists.