Score:0

Memory errors in dmesg, do I need to replace a DIMM?


The following errors show up in dmesg 10-20 times per day:

MCA: Bank 5, Status 0x8c00004000010092
MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
MCA: CPU 0 COR (1) RD channel 2 memory error
MCA: Address 0xbb5561e80 (Mode: Physical Address, LSB: 6)
MCA: Misc 0x2140109086

The CPU is always 0 and the bank is always 5. The "Misc" and "Address" values vary, but the same values often recur.
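For reference, the "Status" value can be decoded by hand. The sketch below is a minimal illustration, assuming the `IA32_MCi_STATUS` bit layout documented in the Intel SDM (valid, overflow, uncorrected flags, the corrected-error count in bits 52:38, and the memory-controller error code format `0000 0000 1MMM CCCC`). The function name `decode_mci_status` and the dictionary keys are made up for this example and are not part of any tool.

```python
# Hypothetical helper: decodes a few architectural fields of IA32_MCi_STATUS.
# Bit positions follow the Intel SDM; names are illustrative only.
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status: int) -> dict:
    code = status & 0xFFFF  # MCA error code, bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),     # an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # 0 means ECC corrected it
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38; meaningful when MCG_CAP reports corrected-error counting.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        fields["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "other")
        channel = code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x8C00004000010092).items():
        print(f"{name}: {value}")
```

For the status quoted above, this yields a corrected (not uncorrected) memory read error on channel 2 with a count of 1, which agrees with the "COR (1) RD channel 2 memory error" line. Note that the channel is a memory-controller channel, not a DIMM slot label.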

The motherboard is identified thus:

CPU: Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz (3591.44-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 137438953472 (131072 MB)
avail memory = 133741539328 (127545 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <LENOVO TC-A0   >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads

Should I replace a DIMM (and if so, how do I identify which one), or is ECC doing its job so that there is no need to worry yet?

Adding output of mcelog:

Hardware event. This is not a software error.
MCE 458
CPU 0 BANK 5 TSC 10283dbf8f01bc 
MISC 21401e9e86 ADDR bb5561e80 
TIME 1665418335 Mon Oct 10 12:12:15 2022
MCG status:
STATUS cc00010000010092 MCGSTATUS 0
MCGCAP 1000c10 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45 Step 7
Score:0

Please follow the steps below.

  1. Check the mcelog output to see whether this is a hardware or a software issue.
  2. Clean the motherboard and DIMM slots, unplug and reseat the DIMM, then check the logs again.
  3. Check whether any ECC lines appear in dmesg.
  4. Run memtest if possible.
  5. Try removing or replacing the DIMM to find out whether the problem is related to the DIMM or to the motherboard.
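When comparing logs before and after reseating or swapping a DIMM, it can help to count how often each bank, channel and address combination appears. The following is a rough sketch, assuming the FreeBSD dmesg record format shown in the question (a "MCA: Bank" line, a "channel N memory error" line, then a "MCA: Address" line). The function name `tally_mca` is hypothetical.

```python
import re
from collections import Counter


def tally_mca(dmesg_text: str) -> Counter:
    """Count MCA records per (bank, channel, address) in saved dmesg output."""
    counts = Counter()
    bank = channel = None
    for line in dmesg_text.splitlines():
        line = line.strip()
        if m := re.match(r"MCA: Bank (\d+),", line):
            # A "Bank" line starts a new record.
            bank, channel = int(m.group(1)), None
        elif m := re.search(r"channel (\d+) memory error", line):
            channel = int(m.group(1))
        elif m := re.match(r"MCA: Address (0x[0-9a-fA-F]+)", line):
            counts[(bank, channel, m.group(1).lower())] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 tally_mca.py
    for key, n in tally_mca(sys.stdin.read()).most_common():
        print(n, key)
```

If the same channel keeps showing up after DIMMs have been swapped between slots, that points toward the slot or board; if the errors move with the DIMM, the DIMM is the more likely culprit.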
cn flag
I added the output of `mcelog`. The errors don't show up all the time -- only occasionally. Should the "Bank 5" correspond to some marking on the motherboard?
asktyagi avatar
in flag
Check whether you can see ECC lines in dmesg; you can also run memtest if possible. Alternatively, try removing or replacing the DIMM to find out whether this is related to the DIMM or to the motherboard.
Nikita Kipriyanov avatar
za flag
Check the IPMI SEL too (with e.g. `ipmiutil`). It usually logs memory ECC errors as well, and it may give a clue as to which memory slot is affected.
cn flag
This is a workstation, not a server -- no IPMI device...