Tools to interpret MCE error on Ubuntu 18.04

I have an unstable system (it reboots at random) and am trying to determine the cause of the reboots. My question is whether these machine check exceptions (MCEs) are serious errors that could be causing the reboots. If so, do they mean I should replace my CPU or RAM?

After every reboot (whether random or initiated by `sudo reboot`), the following MCEs are logged:

14:50:45 kernel: [    0.778792] mce: [Hardware Error]: Machine check events logged
14:50:45 kernel: [    0.778793] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: ee2000000004017a
14:50:45 kernel: [    0.778795] mce: [Hardware Error]: TSC 0 ADDR 5f000000 MISC 8cf00031e0000086
14:50:45 kernel: [    0.778797] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
14:50:45 kernel: [    0.778798] mce: [Hardware Error]: Machine check events logged
14:50:45 kernel: [    0.778799] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 18: ee2000000004017a
14:50:45 kernel: [    0.778799] mce: [Hardware Error]: TSC 0 ADDR 5f100040 MISC 1cf00031e0000086
14:50:45 kernel: [    0.778801] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
14:50:45 kernel: [    0.778802] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 19: ee2000000004017a
14:50:45 kernel: [    0.778802] mce: [Hardware Error]: TSC 0 ADDR 5f100000 MISC 54f00031e0000086
14:50:45 kernel: [    0.778804] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
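
Each record above has a fixed layout: one line with the CPU, bank number, and status word, followed by a line with the ADDR and MISC values. As a rough sketch (the function and field names here are illustrative, not part of any tool), the raw values could be pulled out of the kernel log with a short Python script so they can be inspected before rasdaemon is running:

```python
import re

# Matches e.g. "CPU 0: Machine Check: 0 Bank 17: ee2000000004017a"
BANK_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)
# Matches e.g. "ADDR 5f000000 MISC 8cf00031e0000086"
ADDR_RE = re.compile(r"ADDR (?P<addr>[0-9a-fA-F]+) MISC (?P<misc>[0-9a-fA-F]+)")


def extract_mce_records(log_text):
    """Return one dict per 'Bank N: <status>' line found in log_text.

    ADDR/MISC values are attached to the most recent bank record.
    """
    records = []
    for line in log_text.splitlines():
        bank_match = BANK_RE.search(line)
        if bank_match:
            records.append({
                "cpu": int(bank_match["cpu"]),
                "bank": int(bank_match["bank"]),
                "status": int(bank_match["status"], 16),
            })
            continue
        addr_match = ADDR_RE.search(line)
        if addr_match and records:
            records[-1]["addr"] = int(addr_match["addr"], 16)
            records[-1]["misc"] = int(addr_match["misc"], 16)
    return records
```

Feeding it the saved kernel log text (for example, the lines pasted above) returns a list of records, one per bank.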

Unfortunately, these messages are hard to read without rasdaemon or mcelog to decode them. Worse, rasdaemon does not appear to start until after the messages have been logged, so the errors do not show up in `ras-mc-ctl --summary`. Notice the timestamps:

14:50:50 rasdaemon[1023]: rasdaemon: ras:mc_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:mc_event
14:50:50 rasdaemon[1023]: rasdaemon: ras:aer_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:aer_event
14:50:50 rasdaemon[1023]: rasdaemon: Warning: cpu 0 offline?, imc_log not set
14:50:50 rasdaemon[1023]: rasdaemon: mce:mce_record event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event mce:mce_record
14:50:50 rasdaemon[1023]: rasdaemon: ras:extlog_mem_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:extlog_mem_event
14:50:50 rasdaemon[1023]: rasdaemon: Listening to events for cpus 0 to 15
14:50:50 rasdaemon[1025]: rasdaemon: ras:mc_event event enabled
14:50:50 rasdaemon[1025]: rasdaemon: ras:aer_event event enabled
14:50:50 rasdaemon[1025]: rasdaemon: mce:mce_record event enabled
14:50:50 rasdaemon[1025]: rasdaemon: ras:extlog_mem_event event enabled

Is there a better way to solve this problem? Would updating to 20.04 actually help, as suggested in this answer?
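
As a partial workaround, the architectural fields of a logged status word can be decoded by hand. The sketch below assumes the bit layout of the IA32_MCi_STATUS register described in the Intel Software Developer's Manual (VAL at bit 63 down to PCC at bit 57, MCA error code in bits 15:0, model-specific code in bits 31:16) and only recognizes the "cache hierarchy" compound error code pattern. The function names are illustrative, and this is not a substitute for rasdaemon or mcelog, which also know the model-specific details.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

# Sub-fields of the cache hierarchy code "000F 0001 RRRR TTLL".
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def describe_mca_code(code):
    """Describe the 16-bit MCA error code; only cache hierarchy codes are handled."""
    # Ignore bit 12 (the F flag) when matching the pattern.
    if (code & 0xEF00) == 0x0100:
        request = REQUEST.get((code >> 4) & 0xF, "unknown")
        transaction = TRANSACTION.get((code >> 2) & 0x3, "unknown")
        level = LEVEL[code & 0x3]
        return (f"cache hierarchy error: request={request}, "
                f"transaction={transaction}, level={level}")
    return "not decoded by this sketch"


def decode_status(status):
    """Split a 64-bit status word into flags and error codes."""
    mca_code = status & 0xFFFF
    return {
        "flags": {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()},
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "description": describe_mca_code(mca_code),
    }


if __name__ == "__main__":
    result = decode_status(0xEE2000000004017A)
    print(result["flags"])
    print(hex(result["mca_code"]), result["description"])
```

Under those assumptions, the value ee2000000004017a logged for all three banks has VAL, OVER, UC, MISCV, ADDRV and PCC set and EN clear, and its MCA code 0x017a matches the cache hierarchy pattern (an eviction at level L2 in the architectural encoding). The bit fields alone do not settle whether the CPU or the RAM is at fault.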

heynnema
Go to https://www.memtest86.com/ and download and run their free `memtest` to test your memory. Get at least one complete pass of all 4/4 tests to confirm the memory is good. This may take a few hours to complete.