
Multiple uncorrectable ECC errors on multiple DIMMs


I have a Supermicro X8DT6 system that has suddenly developed a high rate of uncorrectable ECC errors. The system was running error-free until just a few days ago, and now it is experiencing uncorrectable ECC errors (and associated spontaneous reboots) many times per day. The errors are not isolated to a single DIMM.

System details: single X5650 CPU, 48 GB DDR3 RAM @ 1333 MHz in 6 DIMMs. Running Debian Linux.

As far as I can tell, NO correctable ECC errors are being detected (rasdaemon shows nothing, and the IPMI event log shows only uncorrectable errors).

The problem first developed a few days ago, and you can see from this log that it initially appeared to be confined to a single DIMM:

  3f | 09/13/2021 | 18:13:02 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  40 | 09/14/2021 | 03:30:49 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  41 | 09/14/2021 | 04:10:28 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  42 | 09/14/2021 | 04:11:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  43 | 09/14/2021 | 04:19:31 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  44 | 09/14/2021 | 04:27:06 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  45 | 09/14/2021 | 04:28:39 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  46 | 09/14/2021 | 04:32:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  47 | 09/14/2021 | 04:35:48 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  48 | 09/14/2021 | 04:39:51 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  49 | 09/14/2021 | 04:41:29 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4a | 09/14/2021 | 04:48:16 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4b | 09/14/2021 | 04:53:43 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4c | 09/14/2021 | 04:54:52 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4d | 09/14/2021 | 05:09:41 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4e | 09/14/2021 | 05:12:04 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  4f | 09/14/2021 | 05:20:51 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  50 | 09/14/2021 | 05:23:42 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  51 | 09/14/2021 | 05:34:12 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  52 | 09/14/2021 | 05:39:44 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  53 | 09/14/2021 | 05:41:24 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  54 | 09/14/2021 | 05:47:19 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  55 | 09/14/2021 | 05:55:46 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  56 | 09/14/2021 | 12:05:32 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  57 | 09/14/2021 | 16:18:36 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  58 | 09/14/2021 | 17:31:57 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  59 | 09/14/2021 | 17:59:21 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5a | 09/14/2021 | 18:09:04 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5b | 09/14/2021 | 18:10:59 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5c | 09/14/2021 | 18:41:11 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5d | 09/14/2021 | 18:43:32 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5e | 09/14/2021 | 18:49:21 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  5f | 09/14/2021 | 21:39:45 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  60 | 09/14/2021 | 21:43:26 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  61 | 09/14/2021 | 21:47:11 | Memory | Uncorrectable ECC (@DIMM1B(CPU1)) | Asserted
  62 | 09/14/2021 | 22:35:41 | Physical Security #0xaa | General Chassis intrusion () | Asserted

I then removed DIMM 1B and powered the system back up with only 5 DIMMs installed. I believe this is a valid configuration -- there are three memory channels, and each can operate with either 1 or 2 DIMMs.

Initially this seemed to solve the problem, but as you can see, the errors came back and are now spread across several of the remaining DIMMs, which is even more confusing:

  63 | 09/15/2021 | 12:21:05 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  64 | 09/15/2021 | 14:15:46 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  65 | 09/15/2021 | 14:22:07 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
  66 | 09/15/2021 | 14:31:22 | Memory | Uncorrectable ECC (@DIMM2B(CPU1)) | Asserted
  67 | 09/16/2021 | 05:02:38 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
  68 | 09/16/2021 | 10:58:01 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  69 | 09/16/2021 | 11:17:37 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
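
For reference, the per-DIMM totals can be tallied from the event log with a short script. This is only a minimal sketch: it assumes the log text uses the pipe-separated layout shown above, and the names `count_uncorrectable` and `SAMPLE` are my own, not part of any tool.

```python
import re
from collections import Counter

# Pulls the DIMM and CPU labels out of a description such as
# "Uncorrectable ECC (@DIMM1B(CPU1))". The pattern is an assumption
# based on the log lines quoted in this post.
DIMM_RE = re.compile(r"Uncorrectable ECC \(@(DIMM\w+)\((CPU\d+)\)\)")


def count_uncorrectable(sel_text):
    """Return a Counter mapping (cpu, dimm) to the number of uncorrectable ECC events."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Expected fields: id, date, time, sensor type, description, state.
        if len(fields) < 6 or fields[3] != "Memory":
            continue  # skip blank lines and non-memory events
        match = DIMM_RE.search(fields[4])
        if match:
            counts[(match.group(2), match.group(1))] += 1
    return counts


# Entries 62 to 69 from the event log above (62 is the chassis intrusion
# event and should be ignored by the tally).
SAMPLE = """\
  62 | 09/14/2021 | 22:35:41 | Physical Security #0xaa | General Chassis intrusion () | Asserted
  63 | 09/15/2021 | 12:21:05 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  64 | 09/15/2021 | 14:15:46 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  65 | 09/15/2021 | 14:22:07 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
  66 | 09/15/2021 | 14:31:22 | Memory | Uncorrectable ECC (@DIMM2B(CPU1)) | Asserted
  67 | 09/16/2021 | 05:02:38 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
  68 | 09/16/2021 | 10:58:01 | Memory | Uncorrectable ECC (@DIMM1A(CPU1)) | Asserted
  69 | 09/16/2021 | 11:17:37 | Memory | Uncorrectable ECC (@DIMM2A(CPU1)) | Asserted
"""

if __name__ == "__main__":
    for (cpu, dimm), n in count_uncorrectable(SAMPLE).most_common():
        print(f"{cpu} {dimm}: {n}")
```

For the entries logged after DIMM 1B was removed, this counts three events each on DIMM1A and DIMM2A and one on DIMM2B.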

All the other answers or articles I can find focus on infrequent errors, or on scenarios where a single DIMM or slot is clearly failing. Does anyone have any idea what could be causing such a widespread series of failures in a previously-working machine? I do intend to re-seat everything, but given the multiple points of failure I don't have high hopes for that.

Zac67
Possible other problem sources are CPU, PSU, mainboard. Test each one in another system to verify proper function.