Score:1

How to test a Linux server for hardware errors?

ng flag

I have a Debian 10 server that is randomly rebooting, though no errors were written to journald. The server has rebooted 20 times in the last 3 days.

$ journalctl --list-boots
-22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
...
 -2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
 -1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
  0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC
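
To see whether the reboots cluster on particular days, the boot list can be tallied by start date. This is only a sketch: `boots_per_day` is a hypothetical helper name, and it assumes the column layout shown above, where the fourth field is the start date (newer systemd versions print a different layout).

```shell
# Tally boots per start date from `journalctl --list-boots` output on stdin.
# Assumes the layout shown above: index, boot ID, weekday, date, time, ...
boots_per_day() {
    awk 'NF >= 4 && $1 ~ /^-?[0-9]+$/ { n[$4]++ }
         END { for (d in n) print d, n[d] }' | sort
}

# Example:
#   journalctl --list-boots | boots_per_day
```

The kernel messages from the end of the previous boot can be inspected with `journalctl -k -b -1`; an abrupt end with no shutdown messages would be consistent with a hard reset rather than a clean reboot.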

Usually I run memtester, which takes a couple of hours (depending on RAM size) and is quite unlikely to actually reproduce the issue (if it really is memory).

$ apt install memtester
$ memtester 245GB 4 > memtester.log 2>&1
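
Since memtester tries to lock the memory it tests, asking for nearly all RAM can starve the rest of the system. One way to pick a size is to derive it from `MemAvailable` in `/proc/meminfo` and leave some headroom. A minimal sketch, where `memtester_size` is a hypothetical helper and the 2048 MB default headroom is an arbitrary assumption:

```shell
# Print a memtester size argument such as "200000M":
# MemAvailable (from /proc/meminfo, in kB) minus a headroom in MB.
memtester_size() {
    meminfo=${1:-/proc/meminfo}
    headroom_mb=${2:-2048}   # assumed safety margin, adjust to taste
    awk -v h="$headroom_mb" '/^MemAvailable:/ {
        mb = int($2 / 1024) - h
        if (mb < 1) mb = 1
        print mb "M"
    }' "$meminfo"
}

# Example (as root, so memtester can lock the memory):
#   memtester "$(memtester_size)" 4 > memtester.log 2>&1
```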

My server has 256GB RAM, in 16 RAM modules:

$ dmidecode -t memory | grep Size | wc -l
16
$ free -h
             total       used       free     shared    buffers     cached
Mem:          251G        32G       218G       113M         0B       135M
-/+ buffers/cache:        32G       219G
Swap:           0B         0B         0B
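
Note that `grep Size | wc -l` counts every slot that reports a `Size:` line, including empty ones, which dmidecode prints as `Size: No Module Installed`. A small sketch that separates populated from empty slots (`populated_dimms` is a hypothetical helper name):

```shell
# Count populated and empty DIMM slots from `dmidecode -t memory` on stdin.
# Only lines that start with "Size:" are counted, so fields such as
# "Maximum Capacity" or "Volatile Size" are ignored.
populated_dimms() {
    awk '/^[[:space:]]*Size:/ {
             if ($0 ~ /No Module Installed/) empty++; else used++
         }
         END { printf "populated=%d empty=%d\n", used, empty }'
}

# Example (as root):
#   dmidecode -t memory | populated_dimms
```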

DDR3 modules:

Handle 0x002D, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconducto
        Serial Number: 093C2E1C          
        Asset Tag: Dimm0_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 2
        Configured Clock Speed: 1600 MHz

UPDATE: The system should have ECC memory modules (they seem to be detected in dmidecode -t memory):

Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 512 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

After replacing all memory modules, the system shows EDAC MC0 errors (I haven't seen those before):

Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
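
While an EDAC driver such as `sb_edac` is loaded, the kernel also exposes running error counters in sysfs, which makes it easier to see whether corrected errors (CE) keep landing on the same DIMM. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc` layout with per-DIMM `dimm_label`, `dimm_ce_count` and `dimm_ue_count` files; `edac_counts` is a hypothetical helper name:

```shell
# Print corrected (ce) and uncorrected (ue) error counters per memory
# controller and per DIMM from the EDAC sysfs tree.
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        [ -d "$mc" ] || continue
        echo "${mc##*/} ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
        for dimm in "$mc"/dimm*; do
            [ -r "$dimm/dimm_ce_count" ] || continue
            echo "  ${dimm##*/} label=$(cat "$dimm/dimm_label")" \
                 "ce=$(cat "$dimm/dimm_ce_count") ue=$(cat "$dimm/dimm_ue_count")"
        done
    done
}

# Example:
#   edac_counts
```

If the directory is empty, no EDAC driver is loaded and the function prints nothing.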

UPDATE 2: I've tried disabling the EDAC kernel module, as suggested by Red Hat/SUSE, to rule out the possibility that the module conflicts with hardware error correction on the motherboard:

echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
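
Because `>>` appends a new line every time the command is run, repeating it leaves duplicate entries. A sketch of an idempotent variant follows; `blacklist_module` is a hypothetical helper, and the default path is simply the file used above:

```shell
# Append "blacklist <module>" to <file> only if that exact line is missing.
blacklist_module() {
    mod=$1
    conf=${2:-/etc/modprobe.d/50-blacklist.conf}
    grep -qsxF "blacklist $mod" "$conf" || echo "blacklist $mod" >> "$conf"
}

# Example (as root):
#   blacklist_module sb_edac
#   lsmod | grep sb_edac    # still prints a line while the module is loaded
```

The blacklist only affects future module loading, so the module stays loaded until it is removed or the system reboots. On Debian it may also be necessary to regenerate the initramfs with `update-initramfs -u` if the module is loaded from there.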

This seems to prevent reboots, but memory allocation is now failing under workload. All memory tests still pass.

Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
Call Trace:
 dump_stack+0x66/0x81
 dump_header+0x6b/0x283
 ? ___ratelimit+0xa1/0x100
 oom_kill_process.cold.30+0xb/0x1cf
 out_of_memory+0x1a5/0x450
 mem_cgroup_out_of_memory+0xbe/0xd0
 try_charge+0x707/0x780
 mem_cgroup_try_charge+0x86/0x190
 __add_to_page_cache_locked+0x64/0x240
 add_to_page_cache_lru+0x4a/0xe0
 filemap_fault+0x34c/0x780
 ? filemap_map_pages+0x1ed/0x3a0
 ext4_filemap_fault+0x2c/0x40 [ext4]
 __do_fault+0x36/0x170
 __handle_mm_fault+0xdb6/0x11b0
 handle_mm_fault+0xd6/0x200
 __do_page_fault+0x249/0x4f0
 ? page_fault+0x8/0x30
 page_fault+0x1e/0x30
RIP: 0033:0x7f1e1d58ff9d
Code: Bad RIP value.
RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428
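
The status word in these lines (`8c00004f000800c1`) can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout from the Intel SDM: bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific code, and bits 15:0 MCA error code. Bits 52:38 are read as the corrected error count, which only holds on CPUs that support CMCI. `decode_mci_status` is a hypothetical helper name.

```shell
# Decode the flag bits of an IA32_MCi_STATUS value given as hex digits.
# The value is split into two 32-bit halves so the arithmetic never
# overflows the shell's signed 64-bit integers.
decode_mci_status() (
    hex=$(printf '%s' "0000000000000000${1#0x}" | tail -c 16)
    hi=$(( 0x$(printf '%s' "$hex" | cut -c1-8) ))    # bits 63..32
    lo=$(( 0x$(printf '%s' "$hex" | cut -c9-16) ))   # bits 31..0
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
        $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
        $(( (hi >> 25) & 1 ))
    printf 'corrected_count=%d mca_code=0x%04x model_code=0x%04x\n' \
        $(( (hi >> 6) & 0x7fff )) $(( lo & 0xffff )) $(( (lo >> 16) & 0xffff ))
)

# Example:
#   decode_mci_status 8c00004f000800c1
#   VAL=1 OVER=0 UC=0 EN=0 MISCV=1 ADDRV=1 PCC=0
#   corrected_count=1 mca_code=0x00c1 model_code=0x0008
```

With UC=0 the event is a corrected error, and ADDRV=1 means the logged ADDR field is valid. That matches the "CE memory scrubbing error" EDAC reported earlier with `err_code:0008:00c1`.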

SEWTGIYWTKHNTDS avatar
kw flag
Can you be more specific about the rebooting? Crashing and restarting? Powering off and on? Could it be a power supply issue (a UPS fault, perhaps)?
ng flag
I'm trying to rule out all possibilities. Technicians have checked the power supply and it looks OK. The only suspicious messages are `kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000`.
SEWTGIYWTKHNTDS avatar
kw flag
Is it old? I had a system that rebooted because the thermal paste on the CPU cooler had dried out and the CPU was overheating. Another server didn't like the UPS self-test; a firmware update sorted that one, but your reboot frequency seems too high for that. I see "interrupt took too long" on lots of systems, so it's probably not significant. Malicious user? Hope you sort it soon.
ng flag
I installed the system 2 weeks ago, and cooling seems to be working fine. The motherboard is a Supermicro `X9DRFR`.
Davidw avatar
in flag
Does the motherboard BIOS have built-in diagnostics?
ng flag
@Davidw I'm unable to get into the BIOS, but the technicians tried updating the BIOS and checked the configuration. The server has been passing all hardware tests, which ran for days.
Nikita Kipriyanov avatar
za flag
Supermicro servers have an IPMI BMC with its own network connection (sometimes a dedicated port, sometimes shared with NIC 1), and it keeps its own hardware error log. What's in that log? You can also read it from the OS using the `ipmitool` or `ipmiutil` package (Debian has both); try the `sel` command. Prefer `ipmiutil` (I've seen cases where it decoded messages much better).
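
As a rough sketch of that suggestion: `ipmitool sel elist` and `ipmiutil sel` both dump the BMC's System Event Log, and a filter can narrow the output to likely hardware causes. `sel_hw_events` is a hypothetical helper, and the keyword list is only a guess at what is relevant here.

```shell
# Keep only SEL lines that mention memory, ECC, power, temperature,
# watchdog, or critical events (case-insensitive). Reads stdin.
sel_hw_events() {
    grep -Ei 'memory|ecc|power|temp|watchdog|critical'
}

# Example (as root; assumes the ipmi_si and ipmi_devintf kernel modules
# are loaded so that the local BMC is reachable):
#   ipmitool sel elist | sel_hw_events
#   ipmiutil sel | sel_hw_events
```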
Score:1
br flag

Have you tried booting MemTest86 from https://www.memtest86.com/? It's always been great for me.

ng flag
Not yet, I have SSH access to a booted OS. Unfortunately, booting a custom image is not possible in this case. Is the `memtest86` algorithm very different from `memtester`?
br flag
It boots from the tester ISO, so you've no OS in the way.
ng flag
Yes, I know. I can only install/compile packages in the provided rescue system. I don't have physical access to the server. AFAIK it's not possible to install `memtest86` as a package.
Nikita Kipriyanov avatar
za flag
If you have no control over hardware and suspect a hardware problem, this is not your problem. Hand it over to the person who is in charge of the hardware.