MCE Errors but no edac-util errors?

I have an older HP Z440 tower with 4x8GB ECC DDR4, running Proxmox VE 6.4. Recently, it started logging MCE errors every few seconds. I installed rasdaemon and can see that they are corrected memory read errors. However, edac-util doesn't show any sign of problems. Memtest passed, but I understand that is expected when ECC corrects the errors before the test can see them.

There is only one socket, and the DIMMs are installed in slots 1, 3, 6, and 8 (which seems to be preferred for this model).

Am I actually having memory errors? How can I troubleshoot this further?

dmesg:

root@pve:~# dmesg
...
[ 5729.899255] mce_notify_irq: 20 callbacks suppressed
[ 5729.899260] mce: [Hardware Error]: Machine check events logged
[ 5732.907207] mce: [Hardware Error]: Machine check events logged
[ 5792.907319] mce_notify_irq: 19 callbacks suppressed
[ 5792.907323] mce: [Hardware Error]: Machine check events logged
[ 5793.899247] mce: [Hardware Error]: Machine check events logged
[ 5852.911342] mce_notify_irq: 11 callbacks suppressed
[ 5852.911347] mce: [Hardware Error]: Machine check events logged
[ 5853.903354] mce: [Hardware Error]: Machine check events logged

Errors from rasdaemon:

root@pve:~# ras-mc-ctl --errors | tail
1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007
1436 2023-05-12 14:58:06 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x4d5c831c0, misc=0x140383886, walltime=0x645e9a4f, cpuid=0x000306f2, bank=0x00000007
1437 2023-05-12 14:58:09 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x4ccdc28c0, misc=0x403aba86, walltime=0x645e9a52, cpuid=0x000306f2, bank=0x00000007
1438 2023-05-12 14:58:11 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x6fd8eee80, misc=0x140282886, walltime=0x645e9a54, cpuid=0x000306f2, bank=0x00000007
1439 2023-05-12 14:58:12 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x510122800, misc=0x140282886, walltime=0x645e9a55, cpuid=0x000306f2, bank=0x00000007
1440 2023-05-12 14:58:13 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=4, mcgcap=0x07000c16, status=0xcc00010000010091, addr=0x4ea312a80, misc=0x1403c3c86, walltime=0x645e9a56, cpuid=0x000306f2, bank=0x00000007
1441 2023-05-12 14:58:16 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x4ea342a80, misc=0x1403aba86, walltime=0x645e9a59, cpuid=0x000306f2, bank=0x00000007
1442 2023-05-12 14:58:17 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x50abf2900, misc=0x1404c4c86, walltime=0x645e9a5a, cpuid=0x000306f2, bank=0x00000007
1443 2023-05-12 14:58:18 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x52676fbc0, misc=0x140585886, walltime=0x645e9a5b, cpuid=0x000306f2, bank=0x00000007
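
To sanity-check what rasdaemon prints, the raw status value can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL in bit 63, OVER in bit 62, UC in bit 61, corrected error count in bits 52:38, MCA error code in bits 15:0). The helper names `parse_line` and `decode_status` are made up for illustration and are not part of rasdaemon.

```python
import re

# Bit positions are an assumption taken from the Intel SDM description of
# IA32_MCi_STATUS; they are not something rasdaemon's output states directly.
VAL, OVER, UC, ADDRV = 63, 62, 61, 58


def parse_line(line: str) -> dict:
    """Pull the key=value pairs out of one `ras-mc-ctl --errors` line."""
    return dict(re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)", line))


def decode_status(status: int) -> dict:
    """Split a raw MCi_STATUS value into the fields that matter here."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    info = {
        "valid": bool((status >> VAL) & 1),
        "overflow": bool((status >> OVER) & 1),
        "uncorrected": bool((status >> UC) & 1),
        "addr_valid": bool((status >> ADDRV) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": code,
    }
    # Memory controller errors use the code pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        info["mem_op"] = (code >> 4) & 0x7  # 1 means memory read
        info["channel"] = code & 0xF        # 0xF means channel not specified
    return info


# First line of the rasdaemon output above, shortened to the fields used here.
sample = ("1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER "
          "RD_CHANNEL1_ERR Transaction: Memory read error, n_errors=5, "
          "status=0xcc00014000010091, addr=0x4ccdc28c0, bank=0x00000007")

fields = parse_line(sample)
decoded = decode_status(int(fields["status"], 16))
print(fields["addr"], decoded)
```

For the first logged event this yields a corrected count of 5 with the overflow bit set, uncorrected clear, and channel 1, which agrees with the `n_errors=5`, `Error_overflow Corrected_error`, and `RD_CHANNEL1_ERR` text rasdaemon prints. So the hardware itself is reporting corrected errors, whatever the EDAC counters say.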

No errors reported by edac:

root@pve:~# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

root@pve:/sys/devices/system/edac/mc# tail -n +1 mc*/ce_* mc*/dimm*/dimm_ce_count
==> mc0/ce_count <==
0

==> mc0/ce_noinfo_count <==
0

==> mc0/dimm0/dimm_ce_count <==
0

==> mc0/dimm3/dimm_ce_count <==
0

==> mc0/dimm6/dimm_ce_count <==
0

==> mc0/dimm9/dimm_ce_count <==
0
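
The same counters can be polled from a script, which makes it easier to see whether they ever move while rasdaemon keeps logging events. This is a rough sketch that assumes only the sysfs files shown in the `tail` output above; `read_ce_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {relative_path: count} for each corrected-error counter under root."""
    base = Path(root)
    counts = {}
    patterns = ("mc*/ce_count", "mc*/ce_noinfo_count", "mc*/dimm*/dimm_ce_count")
    for pattern in patterns:
        for path in sorted(base.glob(pattern)):
            try:
                counts[str(path.relative_to(base))] = int(path.read_text().strip())
            except (OSError, ValueError):
                # Skip counters that are unreadable or not plain integers.
                continue
    return counts


if __name__ == "__main__":
    for name, value in read_ce_counts().items():
        print(f"{name}: {value}")
```

Running it before and after a burst of MCE messages and comparing the two results would show whether the EDAC driver is receiving these events at all.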