I have a Supermicro X8DT6 system that has suddenly developed a high rate of uncorrectable ECC errors. The system was running error-free until just a few days ago, and now it is experiencing uncorrectable ECC errors (and associated spontaneous reboots) many times per day. The errors are not isolated to a single DIMM.
System details: single X5650 CPU, 48 GB DDR3 RAM @ 1333 MHz in 6 DIMMs, running Debian Linux.
As far as I can tell, NO correctable ECC errors are being detected (rasdaemon shows nothing, and the IPMI event log shows only uncorrectable errors).
The problem first developed a few days ago, and you can see from this log that it initially appeared to be confined to a single DIMM:
3f | 09/13/2021 | 18:13:02 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
40 | 09/14/2021 | 03:30:49 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
41 | 09/14/2021 | 04:10:28 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
42 | 09/14/2021 | 04:11:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
43 | 09/14/2021 | 04:19:31 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
44 | 09/14/2021 | 04:27:06 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
45 | 09/14/2021 | 04:28:39 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
46 | 09/14/2021 | 04:32:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
47 | 09/14/2021 | 04:35:48 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
48 | 09/14/2021 | 04:39:51 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
49 | 09/14/2021 | 04:41:29 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4a | 09/14/2021 | 04:48:16 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4b | 09/14/2021 | 04:53:43 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4c | 09/14/2021 | 04:54:52 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4d | 09/14/2021 | 05:09:41 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4e | 09/14/2021 | 05:12:04 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
4f | 09/14/2021 | 05:20:51 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
50 | 09/14/2021 | 05:23:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
51 | 09/14/2021 | 05:34:12 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
52 | 09/14/2021 | 05:39:44 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
53 | 09/14/2021 | 05:41:24 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
54 | 09/14/2021 | 05:47:19 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
55 | 09/14/2021 | 05:55:46 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
56 | 09/14/2021 | 12:05:32 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
57 | 09/14/2021 | 16:18:36 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
58 | 09/14/2021 | 17:31:57 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
59 | 09/14/2021 | 17:59:21 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5a | 09/14/2021 | 18:09:04 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5b | 09/14/2021 | 18:10:59 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5c | 09/14/2021 | 18:41:11 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5d | 09/14/2021 | 18:43:32 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5e | 09/14/2021 | 18:49:21 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
5f | 09/14/2021 | 21:39:45 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
60 | 09/14/2021 | 21:43:26 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
61 | 09/14/2021 | 21:47:11 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
62 | 09/14/2021 | 22:35:41 | Physical Security #0xaa | General Chassis intrusion () | Asserted
I then removed DIMM 1B and powered the system back up with only 5 DIMMs installed. I believe this is a valid configuration -- there are three memory channels, and each can operate with either 1 or 2 DIMMs.
Initially this seemed to solve the problem, but as the following log entries show, errors then started appearing on other DIMMs, which made things even more confusing:
63 | 09/15/2021 | 12:21:05 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
64 | 09/15/2021 | 14:15:46 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
65 | 09/15/2021 | 14:22:07 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
66 | 09/15/2021 | 14:31:22 | Memory | Uncorrectable ECC (@DIMM2B(CPU1)) | Asserted
67 | 09/16/2021 | 05:02:38 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
68 | 09/16/2021 | 10:58:01 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
69 | 09/16/2021 | 11:17:37 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
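To make the spread across DIMMs easier to see, the entries can be tallied per slot. This is only a minimal sketch: it assumes the pipe-separated layout shown in the log lines above, and the names `tally_uncorrectable`, `DIMM_RE`, and `sample` are my own, not part of any tool.

```python
import re
from collections import Counter

# Pulls the DIMM and CPU labels out of descriptions such as
# "Uncorrectable ECC (@DIMM1B(CPU1))". The pattern is an assumption
# based only on the log lines quoted in this post.
DIMM_RE = re.compile(r"Uncorrectable ECC \(@(DIMM\w+)\((CPU\d+)\)\)")


def tally_uncorrectable(lines):
    """Count uncorrectable ECC events per (CPU, DIMM) label."""
    counts = Counter()
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        # Expected layout: id | date | time | sensor | description | state
        if len(fields) < 5 or fields[3] != "Memory":
            continue
        match = DIMM_RE.search(fields[4])
        if match:
            counts[(match.group(2), match.group(1))] += 1
    return counts


# A few lines copied from the logs above; the chassis-intrusion
# entry is skipped because its sensor field is not "Memory".
sample = [
    "61 | 09/14/2021 | 21:47:11 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted",
    "62 | 09/14/2021 | 22:35:41 | Physical Security #0xaa | General Chassis intrusion () | Asserted",
    "63 | 09/15/2021 | 12:21:05 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted",
    "65 | 09/15/2021 | 14:22:07 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted",
]

for (cpu, dimm), n in sorted(tally_uncorrectable(sample).items()):
    print(cpu, dimm, n)
```

Feeding it the full event log instead of `sample` gives one count per slot, which should show whether the errors cluster on one channel or are spread evenly.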
All the other answers or articles I can find focus on infrequent errors, or on scenarios where a single DIMM or slot is clearly failing. Does anyone have any idea what could be causing such a widespread series of failures in a previously-working machine? I do intend to re-seat everything, but given that errors are now reported on several DIMMs, I don't have high hopes for that.