Ubuntu reboots randomly

Server

Danny

2/6/24, 11:14 AM

I'm running some servers on Hetzner (AX101) and have been experiencing random reboots for a while now. All my investigations have led absolutely nowhere.

Environment: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)

From the system's standpoint, it looks like nothing is happening before the reboot:

Feb  6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb  6 10:46:18 server4 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb  6 10:46:18 server4 kernel: [    0.000000] KERNEL supported cpus:
Feb  6 10:46:18 server4 kernel: [    0.000000]   Intel GenuineIntel
Feb  6 10:46:18 server4 kernel: [    0.000000]   AMD AuthenticAMD
Feb  6 10:46:18 server4 kernel: [    0.000000]   Hygon HygonGenuine
Feb  6 10:46:18 server4 kernel: [    0.000000]   Centaur CentaurHauls
Feb  6 10:46:18 server4 kernel: [    0.000000]   zhaoxin   Shanghai
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb  6 10:46:18 server4 kernel: [    0.000000] signal: max sigframe size: 3376
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-provided physical RAM map:
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable

Everything works as expected, until it doesn't. The server goes down for about two minutes and then just reappears, booting the system.
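To put a number on the outage, here is a minimal sketch that scans syslog-style lines for the point where the kernel timestamp in brackets drops back toward zero and reports the wall-clock gap around it. The function name `find_reboots`, the `SAMPLE` lines (shortened from the log above), and the hard-coded year are my own assumptions for illustration; classic syslog timestamps carry no year, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches classic syslog kernel lines such as:
#   "Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)

def find_reboots(lines, year=2024):
    """Return (last_seen, boot_time, gap_seconds) for each kernel clock reset.

    The year is an assumption: syslog timestamps in this format omit it.
    """
    reboots = []
    prev_wall = None
    prev_uptime = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a kernel line in the expected format
        stamp = " ".join(m.group(1).split())  # collapse "Feb  6" to "Feb 6"
        wall = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        uptime = float(m.group(2))
        # A kernel timestamp lower than the previous one means a fresh boot.
        if prev_uptime is not None and uptime < prev_uptime:
            reboots.append((prev_wall, wall, (wall - prev_wall).total_seconds()))
        prev_wall, prev_uptime = wall, uptime
    return reboots

# Shortened copies of the log lines shown above.
SAMPLE = [
    "Feb  6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0",
    "Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0",
    "Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic",
]

for last_seen, boot_time, gap in find_reboots(SAMPLE):
    print(f"last line {last_seen}, boot at {boot_time}, gap {gap:.0f} s")
```

For this excerpt the gap is 101 seconds between the last logged line and the first boot message. That is only an upper bound on the downtime, since the machine may have kept running silently after the last UFW entry.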

The NVMe disks look perfectly fine:

smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    55,954,348 [28.6 TB]
Data Units Written:                 76,540,527 [39.1 TB]
Host Read Commands:                 993,043,774
Host Write Commands:                1,875,329,624
Controller Busy Time:               1,396
Power Cycles:                       5
Power On Hours:                     4,902
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               49 Celsius
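One thing that might be worth tracking is whether the "Power Cycles" and "Unsafe Shutdowns" counters change across a random reboot. The sketch below parses saved `smartctl -A` text into a dictionary and diffs those two counters between a snapshot taken before and one taken after. The function names and the idea of keeping snapshots are assumptions of mine, and the interpretation is only a hint: an increase in "Unsafe Shutdowns" would suggest the drive lost power without a clean shutdown, while unchanged counters would point more toward a reset that never cut power to the drive.

```python
def parse_smart(text):
    """Parse 'Key: value' lines of `smartctl -A` NVMe output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            # Normalise runs of spaces inside keys such as "Warning  Comp. ..."
            fields[" ".join(key.split())] = value.strip()
    return fields

def counter(fields, name):
    """Return an integer counter, dropping thousands separators and units."""
    return int(fields[name].split()[0].replace(",", ""))

def diff_counters(before_text, after_text,
                  names=("Power Cycles", "Unsafe Shutdowns")):
    """Return {name: after - before} for the chosen counters."""
    before = parse_smart(before_text)
    after = parse_smart(after_text)
    return {n: counter(after, n) - counter(before, n) for n in names}

# Values copied from the output above; a second snapshot taken after the
# next reboot would be passed as the second argument.
SNAPSHOT = """\
Data Units Read:                    55,954,348 [28.6 TB]
Power Cycles:                       5
Power On Hours:                     4,902
Unsafe Shutdowns:                   0
"""

print(diff_counters(SNAPSHOT, SNAPSHOT))
```

Comparing a snapshot with itself, as here, only demonstrates the parsing; the useful comparison needs one snapshot from before and one from after a reboot.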

I also ran memtest, which found no issues.

From a software standpoint, there is nothing special running there: PostgreSQL, node exporter, and that's basically it.

I contacted Hetzner about this problem and they even replaced all the hardware, but the problem persists, which makes me think it is likely a software issue (I doubt it is power surges).

Any suggestions on where I can dig further into this problem?

ubuntu

server-crashes

reboot
