Score:0

Ubuntu 20.04 crashes: An ECC error or L2 poison was detected

kz flag

Ubuntu 20.04 crashes randomly at different times. I am unable to tie the crashes to a specific event.

uname -a
Linux ubuntu 5.11.0-051100-generic #202102142330 SMP Sun Feb 14 23:33:21 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

It crashes with the following kernel messages:

 kernel:[19849.215258] [Hardware Error]: Uncorrected, software restartable error.

 kernel:[19849.215259] [Hardware Error]: CPU:22 (19:21:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135

 kernel:[19849.215263] [Hardware Error]: Error Addr: 0x000000076bed1c00

 kernel:[19849.215264] [Hardware Error]: IPID: 0x001000b000000000

 kernel:[19849.215266] [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.

 kernel:[19849.215269] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
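
The status word in the log above can be unpacked by hand to confirm what the kernel printed. This is a minimal sketch, not the kernel's decoder: the bit positions are assumptions taken from the Linux kernel's `MCI_STATUS_*` definitions and the standard x86 cache error-code layout, and `decode_mca_status` is a hypothetical helper name.

```python
# Bit positions assumed from the Linux kernel's MCI_STATUS_* definitions.
STATUS_FLAGS = {
    "Val": 63,     # register contains a valid error
    "Over": 62,    # an earlier error was overwritten
    "UC": 61,      # uncorrected error
    "MiscV": 59,   # MISC register is valid
    "AddrV": 58,   # error address is valid
    "PCC": 57,     # processor context corrupt
    "Poison": 43,  # error came from consuming poisoned data
}

# Field names as the kernel prints them (assumed ordering).
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_mca_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    result = {"flags": flags, "ext_code": (status >> 16) & 0x3F}
    code = status & 0xFFFF
    # Cache/memory errors use the layout 0000 0001 RRRR TTLL.
    if (code & 0xFF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        result["mem_tx"] = RRRR[rrrr] if rrrr < len(RRRR) else "?"
        result["tx"] = TT[(code >> 2) & 0x3]
        result["level"] = LL[code & 0x3]
    return result


decoded = decode_mca_status(0xBC00080001010135)
print(decoded)
```

For the value in this question the sketch reports UC and Poison set, PCC clear, extended error code 1, and an L1 / DATA / DRD cache error, which matches the decoded lines in the log.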

Hardware info:

### CPU
  Architecture:                    x86_64
  CPU op-mode(s):                  32-bit, 64-bit
  Byte Order:                      Little Endian
  Address sizes:                   48 bits physical, 48 bits virtual
  CPU(s):                          24
  On-line CPU(s) list:             0-23
  Thread(s) per core:              2
  Core(s) per socket:              12
  Socket(s):                       1
  NUMA node(s):                    1
  Vendor ID:                       AuthenticAMD
  CPU family:                      25
  Model:                           33
  Model name:                      AMD Ryzen 9 5900X 12-Core Processor
  Stepping:                        0
  Frequency boost:                 enabled
  CPU MHz:                         2200.000
  CPU max MHz:                     6442.4800
  CPU min MHz:                     2200.0000

### Base Board Information
  Manufacturer: ASRock
  Product Name: X570 Taichi

### Memory:
G Skill Trident Z Neo DDR4 - 3600MHz 32GB (2 x 16GB)

What are the suggested ways of finding the root cause? How do I enable more logging, or, if logs already exist, where can I find them? Any guidance will be appreciated. Thanks!
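
One way to start on the logging question is to summarise the `[Hardware Error]` records the kernel already writes. The sketch below assumes the messages are available as text lines in the format shown above (for example from `/var/log/kern.log`, or from `journalctl -k -b -1` if the journal is persistent). `summarize_mce` and the two regular expressions are illustrative names, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#  kernel:[19849.215259] [Hardware Error]: CPU:22 (19:21:0) MC0_STATUS[...]: 0x...
MCE_LINE = re.compile(r"\[\s*(\d+\.\d+)\] \[Hardware Error\]: (.*)")
CPU_FIELD = re.compile(r"CPU:(\d+)")


def summarize_mce(lines):
    """Count machine-check reports per logical CPU number."""
    per_cpu = Counter()
    for line in lines:
        m = MCE_LINE.search(line)
        if not m:
            continue
        # Only the status line of each report starts with "CPU:<n>".
        cpu = CPU_FIELD.match(m.group(2))
        if cpu:
            per_cpu[int(cpu.group(1))] += 1
    return per_cpu


sample = [
    " kernel:[19849.215258] [Hardware Error]: Uncorrected, software restartable error.",
    " kernel:[19849.215259] [Hardware Error]: CPU:22 (19:21:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135",
    " kernel:[19849.215263] [Hardware Error]: Error Addr: 0x000000076bed1c00",
]
print(summarize_mce(sample))
```

To run it over a real log, pass an open file instead of `sample`, e.g. `with open("/var/log/kern.log") as f: print(summarize_mce(f))`. If every report names the same CPU number, that may hint at one core rather than the DIMMs, though this is only a heuristic.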

Score:2
in flag

This isn't technically an answer, but ...

The message "An ECC error or L2 poison was detected on a data cache read by a load" points to a memory problem, either with the RAM itself or with the cache on the CPU. Neither is great, but you can test the system RAM with the following process:

  1. Restart your system
  2. Press and hold the Shift key to bring up the GRUB menu
  3. Select "Ubuntu, memtest86+" and press Enter
    The memory test will run until the end of time or until you press the Esc key. Let the machine complete at least one test before escaping.

Based on reports around the web, this issue seems to be seen only with the higher-end AMD Ryzen processors. Reading through this long thread on AMD's community site revealed this interesting bit:

I replaced memory and the computer has been rock solid now for a few days. Hopefully this helps you out as it helped me out. Previous memory was Gskill 3600mhz memory... new memory is 3200 memory from Corsair.

Your question does not state what sort of memory you have installed but, if it's a higher-frequency set of modules, there may be something between the RAM and the CPU that is causing an instability. If the memory test fails and you happen to have some compatible 3200MHz RAM available (even if it's just one DIMM), consider swapping it out and performing the memory test again.

dina avatar
kz flag
Thanks a lot for the answer. My RAM is G Skill Trident Z Neo DDR4 - 3600MHz 32GB (2x16). I did run memtest86+; it took about four and a half hours and PASSED the test.
dina avatar
kz flag
Unfortunately I don't have spare memory; this is a brand new build. I hope a solution comes along for this at the BIOS or OS layer instead of in hardware.
heynnema avatar
ru flag
@dnafication With memtest, did you run only 1 test, or all 4/4? AMD processors are very fussy about RAM. Is your RAM on the compatibility list? Go to the support site for your motherboard and take a look. Also, is your CPU or RAM overclocked?
heynnema avatar
ru flag
@dnafication Also show me `sudo dmidecode -s bios-version`. Have you enabled ECC for your RAM... maybe in the BIOS?
dina avatar
kz flag
Thanks @heynnema, I did run all the tests (I think it showed about 10 tests and it ran for more than 4 hours). CPU and RAM should be at their default settings; I don't remember making any changes or overclocking. BIOS version is `P4.30`. I will have a look at the ECC setting during boot.
dina avatar
kz flag
@heynnema, I also ran memtester: `sudo memtester 4000M 1`. no error reported.
dina avatar
kz flag
BIOS version seems to be the latest. This is the motherboard: https://www.asrock.com/mb/AMD/X570%20Taichi/#Specification
heynnema avatar
ru flag
@dnafication Go to https://www.asrock.com/MB/AMD/X570%20Taichi/index.asp#Download and look at the CPU Support list to determine your CPU name, then look at the appropriate Memory QVL list to determine if your memory is supported. Get the model of your DIMMs with `sudo lshw -C memory`.
dina avatar
kz flag
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/126576/discussion-between-dnafication-and-heynnema).
cn flag
I have ECC RAM and a Ryzen 5900X. I have verified via `edac-util` and `dmesg` that ECC appears to be working. I have never seen any errors in months. However, once every few days my machine freezes up and reboots. `/var/log/kern.log` shows this same MCE (L2 cache poisoning). I'll run `memtest86+` but I doubt it'll find any problems. I've seen others complain about this with the 5900X so I'm suspecting CPU microcode. Trying to collect more data.
Score:1
ru flag

BIOS

ASRock X570 Taichi

The BIOS is current at version P4.30.

MEMORY

G Skill Trident Z Neo DDR4 - 3600Mhz 32GB (2 x 16GB), product: F4-3600C16-16GTZNC

AMD Ryzen 9 5900X 12-Core Processor

Ryzen processors are very fussy about RAM.

These DIMMs don't appear on the memory supported list, as seen here.

memtest passed all tests.

Looking at the output of `sudo lshw -C memory`, the DIMMs may be installed in the wrong slots. When using two DIMMs of equal size, they should be installed in slots A2 and B2. Here's an image of the board layout and the memory slots, taken from the User Manual, so please verify this.

[Image: board layout and memory slots, from the User Manual]

dina avatar
kz flag
I will try this out today thanks a lot! :D
dina avatar
kz flag
I moved the RAM from A1 --> A2 and B1 --> B2. It looks like it still crashes after some time. :( Can you suggest anything else? Are there any tests or diagnostics I can run to see if it's definitely a hardware error? I booted the system into Windows and kept it running for a long time without any crash.
heynnema avatar
ru flag
@dnafication I just noticed that you're running kernel 5.11.0-051100-generic on 20.04. I don't believe that's the stock kernel for 20.04. Did you manually install that, or did a Software Update put it there? Edit your question and show me `ls -al /boot`.
heynnema avatar
ru flag
@dnafication Boot to a Ubuntu Live 21.04 USB/DVD and run the system long enough to see if there are problems.
dina avatar
kz flag
yes, I manually installed the kernel. I will try out 21.04 and let you know.
cn flag
I've got the same board and CPU as you, but I have ECC RAM. No ECC issues indicated with `edac-utils` and it seems to be working fine according to that and `dmesg`. I'm having this issue as well. I'm trying to determine if the issue is the board or the 5900X. I may swap a 3600 in here for a bit. Given that it seems to be affecting a number of people, I'd like to get to the root of this problem.
heynnema avatar
ru flag
@dnafication Status please...
dina avatar
kz flag
I briefly tried the Ubuntu 21.04 live session but quickly gave up because of a graphics driver issue, and it's too much work reinstalling again and again. I moved back to Windows and am not seeing any crashes so far. I'm a bit disappointed that I gave up, but fiddling with all these settings was costing me a lot of time.
heynnema avatar
ru flag
@dnafication The 21.04 test was supposed to see if you still had memory errors. It didn't really surprise me that graphics issues could have arisen, even though you could have installed video drivers during the test. Sorry to see you go to the "other" side.
Score:0
kz flag

Based on the suggestion from @heynnema, I found that the DIMM model installed in my computer is not on ASRock's compatibility list. Here are the steps I followed:

  1. Visit the CPU Support List on the ASRock X570 Taichi website and find the core type. In my case it was Vermeer.
  2. Find the model of the installed DIMMs by running `sudo lshw -C memory` (it was F4-3600C16-16GTZNC).
  3. Open the Memory Supported List for Vermeer and check whether the model is listed. Unfortunately it's not on the list! Perhaps that is the cause of the inconsistent crashes. I will try a supported set of DIMMs to see if the crashes occur again and update this answer accordingly.

Output of `sudo lshw -C memory`:
 *-firmware
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: P4.30
       date: 04/14/2021
       size: 64KiB
       capacity: 16MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: e
       slot: System board or motherboard
       size: 32GiB
     *-bank:0
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: F4-3600C16-16GTZNC
          vendor: Unknown
          physical id: 0
          serial: 00000000
          slot: DIMM 0
          size: 16GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)
     *-bank:1
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 1
          serial: Unknown
          slot: DIMM 1
     *-bank:2
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: F4-3600C16-16GTZNC
          vendor: Unknown
          physical id: 2
          serial: 00000000
          slot: DIMM 0
          size: 16GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)
     *-bank:3
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 3
          serial: Unknown
          slot: DIMM 1
  *-cache:0
       description: L1 cache
       physical id: 11
       slot: L1 - Cache
       size: 768KiB
       capacity: 768KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 12
       slot: L2 - Cache
       size: 6MiB
       capacity: 6MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 13
       slot: L3 - Cache
       size: 64MiB
       capacity: 64MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=3
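
The slot check on the output above can also be scripted. This is a rough sketch that assumes the indented `*-bank:N` text layout shown here and treats a bank as populated when it reports a `size`; `populated_banks` is a hypothetical helper, and how bank numbers map to the board's A1/A2/B1/B2 labels is not something the output states, so check the manual for that.

```python
import re


def populated_banks(lshw_text):
    """Return the *-bank sections of `lshw -C memory` output that report a size."""
    banks = []
    current = None
    for raw in lshw_text.splitlines():
        line = raw.strip()
        header = re.match(r"\*-(\S+)", line)
        if header:
            # A new section starts; only track it if it is a memory bank.
            current = None
            if header.group(1).startswith("bank"):
                current = {"bank": header.group(1)}
                banks.append(current)
            continue
        if current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    # Empty slots have no "size" field in this output.
    return [b for b in banks if "size" in b]


sample = """\
     *-bank:0
          product: F4-3600C16-16GTZNC
          slot: DIMM 0
          size: 16GiB
     *-bank:1
          description: [empty]
          product: Unknown
          slot: DIMM 1
     *-bank:2
          product: F4-3600C16-16GTZNC
          slot: DIMM 0
          size: 16GiB
"""
for bank in populated_banks(sample):
    print(bank["bank"], bank["product"], bank["size"])
```

On a live system the text could come from `subprocess.run(["sudo", "lshw", "-C", "memory"], capture_output=True, text=True).stdout` instead of `sample`.
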
heynnema avatar
ru flag
Show me `sudo lshw -C memory`. I want to check what slots the DIMMs are in. Take out one 16G DIMM and see if the crash situation improves.
dina avatar
kz flag
@heynnema i added the output of the command in the answer above.
cn flag
The board should be able to support DIMMs not on the compatibility list just fine. I've built many Ryzen systems beginning with the 1800X. I've chased after this "compatible RAM" rabbit hole before without any positive results. Your mileage may vary. It is good to try another set of DIMMs in any case.
dina avatar
kz flag
@MishaNasledov Thanks. Unfortunately I don't have the option to replace the DIMMs, and I've decided to move back to Windows. I got Win10 Pro and it's been running okay so far.