I have an unstable system that reboots at random, and I am trying to determine the cause. My question is whether the machine check events (MCEs) below are serious errors that could be causing the reboots. If so, do they point to replacing the CPU or the RAM?
After every reboot (whether random or initiated by sudo reboot), the following MCEs are logged:
14:50:45 kernel: [ 0.778792] mce: [Hardware Error]: Machine check events logged
14:50:45 kernel: [ 0.778793] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: ee2000000004017a
14:50:45 kernel: [ 0.778795] mce: [Hardware Error]: TSC 0 ADDR 5f000000 MISC 8cf00031e0000086
14:50:45 kernel: [ 0.778797] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
14:50:45 kernel: [ 0.778798] mce: [Hardware Error]: Machine check events logged
14:50:45 kernel: [ 0.778799] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 18: ee2000000004017a
14:50:45 kernel: [ 0.778799] mce: [Hardware Error]: TSC 0 ADDR 5f100040 MISC 1cf00031e0000086
14:50:45 kernel: [ 0.778801] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
14:50:45 kernel: [ 0.778802] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 19: ee2000000004017a
14:50:45 kernel: [ 0.778802] mce: [Hardware Error]: TSC 0 ADDR 5f100000 MISC 54f00031e0000086
14:50:45 kernel: [ 0.778804] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1639083036 SOCKET 0 APIC 0 microcode 46
Unfortunately, these messages are gibberish without rasdaemon or mcelog to interpret them. Worse, rasdaemon does not appear to start until after the messages have been logged, so the errors do not show up in ras-mc-ctl --summary. Notice the timestamps:
14:50:50 rasdaemon[1023]: rasdaemon: ras:mc_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:mc_event
14:50:50 rasdaemon[1023]: rasdaemon: ras:aer_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:aer_event
14:50:50 rasdaemon[1023]: rasdaemon: Warning: cpu 0 offline?, imc_log not set
14:50:50 rasdaemon[1023]: rasdaemon: mce:mce_record event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event mce:mce_record
14:50:50 rasdaemon[1023]: rasdaemon: ras:extlog_mem_event event enabled
14:50:50 rasdaemon[1023]: rasdaemon: Enabled event ras:extlog_mem_event
14:50:50 rasdaemon[1023]: rasdaemon: Listening to events for cpus 0 to 15
14:50:50 rasdaemon[1025]: rasdaemon: ras:mc_event event enabled
14:50:50 rasdaemon[1025]: rasdaemon: ras:aer_event event enabled
14:50:50 rasdaemon[1025]: rasdaemon: mce:mce_record event enabled
14:50:50 rasdaemon[1025]: rasdaemon: ras:extlog_mem_event event enabled
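In the meantime, the status word in the MCE lines above (ee2000000004017a, identical for banks 17, 18 and 19) can be partly decoded by hand. Below is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the compound cache-hierarchy error-code encoding described in the Intel SDM. The function names are my own, the sketch ignores the model-specific fields and the ADDR/MISC registers, and it is not what mcelog or rasdaemon do internally.

```python
# Assumption: architectural flag bits of IA32_MCi_STATUS per the Intel SDM.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}

# Assumption: cache-hierarchy compound code "000F 0001 RRRR TTLL" per the SDM.
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["I", "D", "G"]            # instruction, data, generic
LEVELS = ["L0", "L1", "L2", "LG"]  # LG = generic level


def decode_cache_error(code):
    """Decode a cache-hierarchy MCA error code, or return None if it is another kind."""
    # Mask out the filtering bit (bit 12) before matching the 0000 0001 prefix.
    if (code & 0xEF00) != 0x0100:
        return None
    rrrr = (code >> 4) & 0xF
    tt = (code >> 2) & 0x3
    ll = code & 0x3
    return {
        "request": REQUESTS[rrrr] if rrrr < len(REQUESTS) else "reserved",
        "type": TYPES[tt] if tt < len(TYPES) else "reserved",
        "level": LEVELS[ll],
    }


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    mca_code = status & 0xFFFF
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "cache_error": decode_cache_error(mca_code),
    }


if __name__ == "__main__":
    for key, value in decode_mci_status(0xEE2000000004017A).items():
        print(key, value)
```

Under those assumptions, the value from my log decodes to the flags VAL, OVER, UC, MISCV, ADDRV and PCC (EN is not set), MCA error code 0x17a (a cache-hierarchy code: generic eviction at L2) and model-specific code 0x4. That still does not tell me how the platform treats the error, which is why I would like a proper decoder to see it.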
Is there a better way to get these boot-time messages decoded? Would upgrading to 20.04 actually help, as suggested in this answer?