I'm running some servers on Hetzner (AX101) and have been experiencing random reboots for a while now. All my investigations have led absolutely nowhere.
Environment: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)
From the system's standpoint, it looks like nothing is happening:
Feb 6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb 6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb 6 10:46:18 server4 kernel: [ 0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb 6 10:46:18 server4 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb 6 10:46:18 server4 kernel: [ 0.000000] KERNEL supported cpus:
Feb 6 10:46:18 server4 kernel: [ 0.000000] Intel GenuineIntel
Feb 6 10:46:18 server4 kernel: [ 0.000000] AMD AuthenticAMD
Feb 6 10:46:18 server4 kernel: [ 0.000000] Hygon HygonGenuine
Feb 6 10:46:18 server4 kernel: [ 0.000000] Centaur CentaurHauls
Feb 6 10:46:18 server4 kernel: [ 0.000000] zhaoxin Shanghai
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: xstate_offset[9]: 832, xstate_sizes[9]: 8
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb 6 10:46:18 server4 kernel: [ 0.000000] signal: max sigframe size: 3376
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-provided physical RAM map:
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Everything works as expected until it doesn't. The server goes down for about two minutes and then just reappears, booting the system.
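To pin down exactly when each reboot happened and how long the server was silent, the log can be scanned for the point where the kernel's uptime counter (the number in square brackets) drops back toward zero. The following is only a minimal sketch: the names `parse_kernel_line` and `find_reboots` are made up for illustration, the regular expression assumes the classic syslog line format shown in the excerpt above, and the year is an assumption because that format does not record it.

```python
import re
from datetime import datetime

# Matches kernel lines in the classic syslog format, for example:
# "Feb 6 10:46:18 server4 kernel: [ 0.000000] Linux version ..."
KERNEL_LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)


def parse_kernel_line(line, year=2023):
    """Return (wall_clock, uptime_seconds), or None for non-kernel lines.

    The year is an assumption: this syslog format omits it.
    Month names are parsed with %b, which expects an English/C locale.
    """
    match = KERNEL_LINE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, float(match.group(2))


def find_reboots(lines, year=2023):
    """Return (last_seen, first_after_boot, gap_seconds) for every place
    where the kernel uptime counter goes backwards, i.e. a fresh boot."""
    reboots = []
    previous = None
    for line in lines:
        current = parse_kernel_line(line, year)
        if current is None:
            continue
        if previous is not None and current[1] < previous[1]:
            gap = (current[0] - previous[0]).total_seconds()
            reboots.append((previous[0], current[0], gap))
        previous = current
    return reboots


if __name__ == "__main__":
    # Assumed log location on Ubuntu; adjust if the log lives elsewhere.
    with open("/var/log/kern.log", errors="replace") as handle:
        for last_seen, first_boot, gap in find_reboots(handle):
            print(f"last message {last_seen}, boot at {first_boot}, gap {gap:.0f}s")
```

For the excerpt above, the last message before the reboot is at 10:44:37 and the first boot message is at 10:46:18, so this would report a gap of 101 seconds. Note that the gap only bounds the outage from above: the machine may have kept running silently for a while after the last logged line.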
The NVMe disks look perfectly fine:
smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 55,954,348 [28.6 TB]
Data Units Written: 76,540,527 [39.1 TB]
Host Read Commands: 993,043,774
Host Write Commands: 1,875,329,624
Controller Busy Time: 1,396
Power Cycles: 5
Power On Hours: 4,902
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 49 Celsius
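Since the output above is a single snapshot, one thing that could be compared across a reboot is whether the `Power Cycles` and `Unsafe Shutdowns` counters change. Below is a rough sketch that diffs two saved copies of the `smartctl -A` output. The helper names (`parse_smart_health`, `as_int`, `counter_deltas`) are hypothetical, and it assumes the plain `Name: value` layout shown above. How to interpret a change (or the lack of one) is left open here.

```python
import re


def parse_smart_health(text):
    """Map each 'Name: value' line of the smartctl output to its raw value.

    Lines without a colon (banner, section headers) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def as_int(value):
    """Extract the leading integer, dropping thousands separators,
    e.g. '55,954,348 [28.6 TB]' -> 55954348. Returns None if absent."""
    match = re.match(r"[\d,]+", value)
    return int(match.group(0).replace(",", "")) if match else None


def counter_deltas(before_text, after_text,
                   names=("Power Cycles", "Unsafe Shutdowns")):
    """Return how much each named counter grew between two snapshots.

    Raises KeyError if a counter is missing from either snapshot.
    """
    before = parse_smart_health(before_text)
    after = parse_smart_health(after_text)
    return {name: as_int(after[name]) - as_int(before[name]) for name in names}
```

A possible way to use it: save the output of `smartctl -A /dev/nvme0n1` to a file periodically, then pass the contents of the last file before a reboot and the first file after it to `counter_deltas`.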
I also ran memtest, which found no issues.
From a software standpoint, nothing special is running there: PostgreSQL and node exporter, and that's basically it.
I contacted Hetzner about this problem, and they even replaced all the hardware, but the problem persists, which makes me think it is likely a software issue (I doubt it is power surges).
In which direction can I dig further into this problem?