Score:0

kernel messages are complained about memory.inspite all DIMM cards was replaced

gb flag

we have few DELL machines ( with RHEL 7.6) , and as we replaced the DIMM cards on machines because the Erros that we seen from kernel messages

after some time we checked again the kernel messages and we found the following and we can see the errors about the RAM memory ( also related to RHEL case - https://access.redhat.com/solutions/6961932 )

[Mon May  8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May  8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May  8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:  Error 0, type: corrected
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:  fru_text: B6
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   section_type: memory error
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   error_status: 0x0000000000000400
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   physical_address: 0x000000446e0d5f00
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888 
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: MISC 0 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May  9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May  9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May  9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported

just to be sure that this above messages are not random messages we decided to reboot machines and see if the bad messages about memory are reproduced

but the Erros messages about the RAM memory, are still remained.

so we are confused about the problem that we seen from kernel messages

how it can be that we still get Erros about RAM in spite we replaced the DIMM cards

I must gives here additional information about what we see from IDRAC

enter image description here

as we can above IDRAC not completed about the DIMM cards or about the RAM memory

so the question is - how comes dmesg ( kernel messages ) are complained about the RAM memory in spite all DIMMs was replaced?

is it possible that something else is BAD and not the DIMM cards? for example the motherboard in DELL machine?

Score:3
cn flag

The error you see is single-bit ECC correctable memory error that was corrected by hardware. These do not trigger a component listed as failed in iDRAC, at least until their number exceeds some internally defined threshold, but you should see this memory error logged under iDRAC SEL (system event log).

It is not recommended to mix single and dual rank modules, but your mileage may vary depending on processor/motherboard version.

King David avatar
gb flag
so do you recommended to find the faulty dimm ( according to dmesg slot ) and replaced them ?
Peter Zhabin avatar
cn flag
If these slots were unpopulated before you may get away with blasting the dust out of them with canned air. I recommend first rearranging the DIMMs to see if the error is slot or module related and monitor DIMM temperature, as correctable errors could be caused by insufficient cooling airflow and overheating.
King David avatar
gb flag
about the DIMM if they was before - so yes we replaced only the existing DIMM , so we not insert DIMM card on new SLOT for sure
King David avatar
gb flag
about insufficient cooling airflow - can we capture it from linux cli as example of dmidecode?
Peter Zhabin avatar
cn flag
Then rearrange the DIMMs and see if the error will migrate with the DIMM or stay with the slot. We use `ipmitool sdr list` for that.
I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.