I have a Debian 10 server that reboots randomly, though no errors are written to journald. The server has rebooted 20 times in the last 3 days.
$ journalctl --list-boots
-22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
...
-2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
-1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC
Usually I run memtester, which takes a couple of hours (depending on RAM size), and it's quite unlikely to actually reproduce the issue (if it really is memory).
$ apt install memtester
$ memtester 245GB 4 > memtester.log 2>&1
My server has 256GB RAM, in 16 RAM modules:
$ dmidecode -t memory | grep Size | wc -l
16
$ free -h
total used free shared buffers cached
Mem: 251G 32G 218G 113M 0B 135M
-/+ buffers/cache: 32G 219G
Swap: 0B 0B 0B
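One caveat on the module count: `grep Size | wc -l` counts every slot that dmidecode reports, including empty ones shown as "No Module Installed". The sketch below counts only populated slots and sums their sizes. `count_dimms` is a hypothetical helper name, and it assumes the `Size: <number> MB` (or `GB`) line format shown in this post.

```shell
# count_dimms: hypothetical helper; reads `dmidecode -t memory` output on stdin.
# Counts only populated slots (a numeric Size) and sums their capacity.
count_dimms() {
    awk '
        $1 == "Size:" && $2 ~ /^[0-9]+$/ {
            n++
            if ($3 == "GB") mb += $2 * 1024      # newer dmidecode prints GB
            else if ($3 == "MB") mb += $2        # older dmidecode prints MB
        }
        END { printf "populated=%d total_gib=%d\n", n, mb / 1024 }
    '
}

# Usage (needs root): dmidecode -t memory | count_dimms
```

If the populated count is lower than the slot count, the per-module size or the total would not add up to what `free` reports.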
DDR3 modules:
Handle 0x002D, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR3
Type Detail: Registered (Buffered)
Speed: 1600 MHz
Manufacturer: Hynix Semiconducto
Serial Number: 093C2E1C
Asset Tag: Dimm0_AssetTag
Part Number: HMT42GR7AFR4C-RD
Rank: 2
Configured Clock Speed: 1600 MHz
UPDATE:
The system should have ECC memory modules (this seems to be detected in dmidecode -t memory):
Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 512 GB
Error Information Handle: Not Provided
Number Of Devices: 8
After replacing all memory modules, the system shows EDAC MC0 errors (I hadn't seen those before):
Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026
Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000
Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c
Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
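To make the status word in the log above easier to read, here is a minimal sketch that decodes a few architectural fields of the machine-check status value (8c00004f000800c1). `decode_mci_status` is a hypothetical helper, not part of any package. The bit positions follow my understanding of the Intel IA32_MCi_STATUS layout, and the corrected error count field (bits 52:38) is only meaningful on CPUs that implement it, so treat the output as a reading aid rather than an authoritative decode.

```shell
# decode_mci_status HEX16: print selected fields of an IA32_MCi_STATUS value.
# Hypothetical helper; expects exactly 16 hex digits without a 0x prefix.
decode_mci_status() (
    # Split into 32-bit halves so the arithmetic never overflows a signed 64-bit int.
    hi_hex=$(printf '%s' "$1" | cut -c1-8)
    lo_hex=$(printf '%s' "$1" | cut -c9-16)
    hi=$((0x$hi_hex))    # bits 63..32
    lo=$((0x$lo_hex))    # bits 31..0

    # Flag bits 63..57 live in bits 31..25 of the upper half.
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
        $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
        $(( (hi >> 25) & 1 ))

    # Corrected error count: bits 52..38, i.e. bits 20..6 of the upper half.
    printf 'corrected_error_count=%d\n' $(( (hi >> 6) & 0x7fff ))

    mca=$(( lo & 0xffff ))            # architectural MCA error code
    model=$(( (lo >> 16) & 0xffff ))  # model-specific error code
    printf 'mca_code=0x%04x model_code=0x%04x\n' "$mca" "$model"

    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if [ $(( mca & 0xff80 )) -eq 128 ]; then
        case $(( (mca >> 4) & 7 )) in
            0) op="generic undefined request" ;;
            1) op="memory read" ;;
            2) op="memory write" ;;
            3) op="address/command" ;;
            4) op="memory scrubbing" ;;
            *) op="reserved" ;;
        esac
        printf 'memory_controller_op=%s\n' "$op"
    fi
)

decode_mci_status 8c00004f000800c1
```

For this value the sketch reports UC=0 and PCC=0 with a corrected error count of 1 and a memory scrubbing operation, which agrees with the "1 CE memory scrubbing error" and "err_code:0008:00c1" in the EDAC line. The reported page also matches the address: 0x1ff0136 shifted left by 12 bits (4 KiB pages) is 0x1ff0136000.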
UPDATE 2
I've tried disabling the EDAC kernel module (sb_edac), as suggested by Red Hat/SUSE, to rule out the possibility that the module conflicts with the hardware error correction on the motherboard:
$ echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
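A blacklist entry only affects future module loads, so it is worth confirming after a reboot that the module is really gone. The sketch below checks a modules list for a given name. `module_loaded` is a hypothetical helper, and it assumes the usual /proc/modules format with the module name in the first column. If the module were loaded from the initramfs, regenerating it (on Debian, `update-initramfs -u`) might also be needed; I have not verified whether that applies here.

```shell
# module_loaded NAME [FILE]: succeed if NAME is listed in FILE.
# FILE defaults to /proc/modules; the first column is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if module_loaded sb_edac; then
    echo "sb_edac is still loaded"
else
    echo "sb_edac is not loaded"
fi
```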
This seems to prevent the reboots, but memory allocation now fails under the workload. All memory tests still pass.
Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
Call Trace:
dump_stack+0x66/0x81
dump_header+0x6b/0x283
? ___ratelimit+0xa1/0x100
oom_kill_process.cold.30+0xb/0x1cf
out_of_memory+0x1a5/0x450
mem_cgroup_out_of_memory+0xbe/0xd0
try_charge+0x707/0x780
mem_cgroup_try_charge+0x86/0x190
__add_to_page_cache_locked+0x64/0x240
add_to_page_cache_lru+0x4a/0xe0
filemap_fault+0x34c/0x780
? filemap_map_pages+0x1ed/0x3a0
ext4_filemap_fault+0x2c/0x40 [ext4]
__do_fault+0x36/0x170
__handle_mm_fault+0xdb6/0x11b0
handle_mm_fault+0xd6/0x200
__do_page_fault+0x249/0x4f0
? page_fault+0x8/0x30
page_fault+0x1e/0x30
RIP: 0033:0x7f1e1d58ff9d
Code: Bad RIP value.
RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428
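One observation on the call trace above: it passes through mem_cgroup_try_charge and mem_cgroup_out_of_memory, which is the memory-cgroup OOM path, so the kill may come from a cgroup reaching its own limit and not from the machine running out of RAM. A minimal sketch for listing cgroups that have hit their limit follows. `cgroup_oom_candidates` is a hypothetical helper, and it assumes the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory, which I believe is the Debian 10 default but have not confirmed on this system.

```shell
# cgroup_oom_candidates [ROOT]: list cgroup v1 memory groups whose limit was hit.
# ROOT defaults to /sys/fs/cgroup/memory (an assumption about this system).
cgroup_oom_candidates() (
    root="${1:-/sys/fs/cgroup/memory}"
    find "$root" -name memory.failcnt 2>/dev/null | while read -r f; do
        dir=$(dirname "$f")
        fails=$(cat "$f")
        # memory.failcnt counts how often usage hit the group's limit.
        [ "$fails" -gt 0 ] || continue
        limit=$(cat "$dir/memory.limit_in_bytes")
        printf '%s failcnt=%s limit_bytes=%s\n' "$dir" "$fails" "$limit"
    done
)

cgroup_oom_candidates
```

Any group printed here has been denied memory at least once since it was created; comparing its limit with what the workload needs would show whether the allocation failures are a configuration limit and not a hardware symptom.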