Kernel messages: how to know whether a DIMM needs to be replaced

King David

3/2/24, 4:01 PM

We have a RHEL 7.6 server, and we noticed the following kernel messages.

[1065085.048872] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1676989040 SOCKET 0 APIC 0
[1065086.052107] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2ae958e offset:0xa00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:1 rank:0)
[1065166.234239] mce: [Hardware Error]: Machine check events logged
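
To see whether the corrected errors keep landing on the same module, the EDAC lines can be tallied per DIMM label. The following is a minimal sketch, assuming Python 3 and assuming the line layout matches the EDAC lines quoted in this post; the regular expression and the function name `tally_edac_errors` are illustrative only, not part of any Red Hat or Dell tool. Each matching line is counted as one event, because the number printed before `CE` is 0 in our output. A count by itself does not decide whether a DIMM must be replaced; it only shows where the errors concentrate.

```python
import re
from collections import Counter

# Assumed layout, derived only from the EDAC lines quoted in this post:
#   EDAC MC<n>: <num> <CE|UE> <message> on <DIMM label> (<details>)
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<num>\d+) (?P<kind>CE|UE) "
    r"(?P<msg>.*?) on (?P<label>\S+) \("
)


def tally_edac_errors(lines):
    """Count EDAC events per (kind, DIMM label); one event per matching line."""
    counts = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            counts[(match.group("kind"), match.group("label"))] += 1
    return counts


# The three lines from the first dmesg excerpt in this post.
sample = [
    "[1065085.048872] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1676989040 SOCKET 0 APIC 0",
    "[1065086.052107] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x2ae958e offset:0xa00 grain:32 syndrome:0x0 -  area:DRAM "
    "err_code:0000:009f socket:1 ha:0 channel_mask:1 rank:0)",
    "[1065166.234239] mce: [Hardware Error]: Machine check events logged",
]

for (kind, label), n in tally_edac_errors(sample).items():
    print(f"{kind} {label}: {n}")
```

In practice the input would be the full output of `dmesg` or the contents of `/var/log/messages` rather than the three sample lines.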

We also looked at the iDRAC, and this is what we saw.

From the link https://www.dell.com/support/kbdoc/en-il/000055500/vxrack-idrac-logs-the-following-event-mem0702-correctable-memory-error-rate-exceeded-for-dimm-bank-slot

we have the following information:

Cause: The memory may not be operational (see Resolution Scenarios). This is an early indicator of a possible future uncorrectable error.

Memory errors can show in a number of ways on your system, and might vary depending on the age of your system or (system generation). There might also be slight variations based on your system firmware levels. The error messages can appear in one or more of BIOS message on post, iDRAC logs, OpenManage System Administrator (OMSA) logs, System LCD display or in the Operating system.

But I am not sure whether the DIMMs in my physical machine need to be replaced.

Other links:

https://www.dell.com/support/kbdoc/en-il/000177028/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge

From another Red Hat case we saw: https://access.redhat.com/solutions/6961932

Resolution: The error code err_code:0101:0091 is from hardware.

OS only detects and reports them in the message log.

Currently, the error messages are reported from SuperMicro and HP hardware.

It is recommended to contact the hardware vendor for more information.

So I am very confused, and it is not clear to me whether we need to replace the DIMMs.

Here is additional kernel message output that we saw in dmesg:

[34226.902474] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[34226.902477] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[34226.902479] EDAC sbridge MC0: TSC 41a0d0c2a8a2 
[34226.902482] EDAC sbridge MC0: ADDR 3a2b80a00 
[34226.902484] EDAC sbridge MC0: MISC 0 
[34226.902486] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1675958197 SOCKET 0 APIC 0
[34227.566735] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x3a2b80 offset:0xa00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[34239.759292] {16}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[34239.759297] {16}[Hardware Error]: It has been corrected by h/w and requires no further action
[34239.759299] {16}[Hardware Error]: event severity: corrected
[34239.759301] {16}[Hardware Error]:  Error 0, type: corrected
[34239.759303] {16}[Hardware Error]:  fru_text: A6
[34239.759305] {16}[Hardware Error]:   section_type: memory error
[34239.759307] {16}[Hardware Error]:   error_status: 0x0000000000000400
[34239.759308] {16}[Hardware Error]:   physical_address: 0x00000009df0e0440
[34239.759319] {16}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 0 bank: 3 row: 39911 column: 16 
[34239.759321] {16}[Hardware Error]:   error_type: 2, single-bit ECC
[34239.759331] mce: [Hardware Error]: Machine check events logged
[34239.759351] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[34239.759355] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[34239.759357] EDAC sbridge MC0: TSC 41a71a0719df 
[34239.759359] EDAC sbridge MC0: ADDR 9df0e0440 
[34239.759362] EDAC sbridge MC0: MISC 0 
[34239.759364] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1675958210 SOCKET 0 APIC 0
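
The `Machine Check Event` line carries a raw status value (`940000000000009f`) that can be split into its flag bits. The following is a minimal sketch, assuming the value follows the IA32_MCi_STATUS layout from the Intel Software Developer's Manual (valid at bit 63, overflow at 62, uncorrected at 61, enabled at 60, MISC valid at 59, ADDR valid at 58, processor context corrupt at 57, MCA error code in the low 16 bits). Those bit positions are stated here as an assumption and should be checked against the manual for this CPU; the function name `decode_mci_status` is hypothetical.

```python
def decode_mci_status(status):
    """Split a 64-bit machine check status value into a few flag fields.

    Bit positions are assumed from the IA32_MCi_STATUS layout in the
    Intel SDM; verify them against the manual before relying on them.
    """
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "mca_error_code": status & 0xFFFF,
    }


# Status value taken from the "Bank 1" lines in the dmesg output above.
fields = decode_mci_status(0x940000000000009F)
for name, value in fields.items():
    shown = hex(value) if name == "mca_error_code" else value
    print(f"{name}: {shown}")
```

Under that assumed layout, the uncorrected flag is clear for this value, which agrees with the `CE` and `event severity: corrected` lines, and the low 16 bits are `0x9f`, the same value that appears in `err_code:0000:009f`.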

memory

redhat

dmesg

kernel
