Score:0

Hardware errors in either CPU or RAM, what to do?


I have a server that every now and then reports hardware errors to the OS but otherwise runs without any noticeable issues.

Today I found this while walking by the monitor attached to it: [image]

Can anyone tell me what this means? Is this something I need to worry about? Are there log files I can look into more deeply? Some weeks ago I noticed that one of the RAM sticks wasn't detected by the system: it was reporting only 112 GB instead of 128 GB. It now shows the correct amount, though.

For more info, this server has the following main components:

  • Supermicro MBD-H11DSi-NT-B
  • 2x AMD Epyc 7301
  • 128GB of Kingston Server Premier KSM26RD8/16HAI DDR4-2666 regECC
  • Unraid as OS
Score:5

Can anyone tell me what this means?

You have a hardware issue that needs to be addressed, most likely memory. If you type MC15_STATUS[Over|CE into Google, the second hit is from the Unraid forums, which may be helpful too.

Is this something I need to worry about?

Absolutely! Ignore hardware errors at your (data's) peril. I would be getting that system out of production without spending time asking the internet if this was an issue I needed to worry about.

Use something like memtest86 to test and diagnose the location of the issue.
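Before taking the machine down, it can help to collect every such report from the kernel log (for example the output of `dmesg`). The following is only a minimal Python sketch: it assumes the kernel lines look like the `MC15_STATUS[Over|CE` fragment quoted above, with flags separated by `|` inside square brackets. The function name `find_mce_status` and the sample text are made up for illustration. As far as I understand the kernel's AMD MCE decoder, `CE` marks a corrected error and `Over` means an earlier error was overwritten before it could be read, but treat that reading as an assumption and check it against your kernel's documentation.

```python
import re

# Assumed line shape (based on the fragment quoted in this answer):
#   ... [Hardware Error]: MC15_STATUS[Over|CE|...]: 0x...
STATUS_RE = re.compile(r"MC(\d+)_STATUS\[([^\]]*)\]")


def find_mce_status(log_text):
    """Return a list of (bank_number, flags) for each MCx_STATUS entry."""
    results = []
    for line in log_text.splitlines():
        # Only look at lines the kernel tagged as hardware errors.
        if "[Hardware Error]" not in line:
            continue
        match = STATUS_RE.search(line)
        if match is None:
            continue
        bank = int(match.group(1))
        # Drop empty fields and "-" placeholders (flag not set).
        flags = [f for f in match.group(2).split("|") if f and f != "-"]
        results.append((bank, flags))
    return results


# Hypothetical sample text, NOT real output from this server; it only
# exercises the parser.
SAMPLE = (
    "[  123.456] [Hardware Error]: MC15_STATUS[Over|CE|-|AddrV]: 0x0\n"
    "[  123.457] some unrelated kernel message\n"
)

if __name__ == "__main__":
    for bank, flags in find_mce_status(SAMPLE):
        print(f"bank {bank}: {', '.join(flags)}")
```

To use it on a real system, you would pass the saved output of `dmesg` (or the relevant syslog file) to `find_mce_status` instead of `SAMPLE`. A bank that keeps showing up points at a recurring fault rather than a one-off event.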

Well, there's no other "production" server, so I need to wait until after Christmas before I can shut it down and run tests on it.
Score:3

In your case I'd read the IPMI BMC event log, e.g. with `ipmiutil sel`. It should show details about the errors; in my case it even showed the particular memory slot where the faulty module resided.
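If the event log is long, a small wrapper can narrow it down. This is a rough Python sketch, not a definitive tool: it assumes `ipmiutil` is installed and on the `PATH`, runs the `ipmiutil sel` command mentioned above, and keeps only the lines containing a keyword. The helper names `read_sel` and `filter_lines` are hypothetical, and the keyword match is a plain substring search because the exact output format varies and is not assumed here.

```python
import shutil
import subprocess


def filter_lines(text, keyword="memory"):
    """Return the lines of text that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


def read_sel():
    """Run `ipmiutil sel` and return its stdout, or None if not installed."""
    if shutil.which("ipmiutil") is None:
        return None
    proc = subprocess.run(
        ["ipmiutil", "sel"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status is not raised as an exception
    )
    return proc.stdout


if __name__ == "__main__":
    output = read_sel()
    if output is None:
        print("ipmiutil not found on PATH")
    else:
        for line in filter_lines(output):
            print(line)
```

Reading the local BMC usually requires root privileges and the IPMI kernel driver to be loaded, so an empty result does not necessarily mean that no events were logged.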

I only have IPMITool, which for me doesn't list any memory-related events.
Nikita Kipriyanov
There is IPMITool from the Supermicro web site, which is very underfeatured. It's a shame it doesn't even know how to connect to the *local* IPMI BMC via SMBus. There is also the [`ipmitool` package](https://github.com/ipmitool/ipmitool), which interprets event log messages incorrectly (it doesn't decode them completely, or even decodes them wrongly). The most accurate information on PSU and other hardware health events I was able to obtain only from [`ipmiutil`](http://ipmiutil.sourceforge.net/) (but, I must admit, ipmitool is easier to use).