Score:0

Server randomly freezes and comes back only after a cold boot

I'm facing an extremely weird issue with one server: it randomly freezes/hangs with no output on the console, does not respond to keyboard shortcuts, and requires a cold boot. After the cold boot, there are no errors on the boot screen at all.

It doesn't freeze under heavy load: CPU usage is around 9-20% when it crashes, the load average is around 2-5 (12-core CPU), and the machine has 128 GB of RAM.

We checked the logs; nothing shows kernel panics or anything else related to the issue itself.

After every freeze and cold boot, when we check the log, we do see the normal OOM reaper killing PHP processes (users reaching their limits). Nothing too abusive, but there is always an OOM event. Sometimes the last log entries before the freeze carry the time of the crash; other times, after the entries from the time of the crash, the log shows a few lines with an older date, and then the server freezes.

Nothing in the logs points to a software problem or to heavy load, just normal operation. This is an upgraded machine that replaced an old one which was stable for years. The freezes are random: they can happen after a week of uptime, or two days, or three weeks, and so on.

We also tried to extract a vmcore dump of a server freeze, but it still didn't catch anything.

It just freezes with no screen output. The server is still powered on but not pingable, SSH is inaccessible, and the KVM, as I said, shows no output at all on the screen.

Could it be related to faulty hardware? My suspicion is faulty RAM.

I'm extremely lost with this issue. Thanks.

Score:0

We just migrated to another server. After a lot of searching and debugging, it looks like a hardware issue with the motherboard: in some forums about ASRock Rack motherboards and Ryzen CPUs I managed to find a few cases of the same issue, even with Windows 10 or Windows Server getting a blue screen of death. The OS support suggested not changing the motherboard brand in this case, since the system might then refuse to boot, and migrating to a new server instead, as we did. After we migrated to the new server, all issues were resolved, so I guess it was a hardware issue and not a software one.

Score:0
  1. Make sure temperatures are good (CPU, RAM, chipset, disks). I assume you are a Linux user because of the OOM; install lm-sensors and check the temperatures with the sensors command.
  2. It's your RAM. Run memtest86, but be aware that a full test on 128 GB can take a week.
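
If you want the temperatures recorded over time rather than read once, here is a minimal sketch of the same check done from a script. It assumes the kernel exposes the sensors under /sys/class/hwmon as temp*_input files in millidegrees Celsius (the same data lm-sensors reads). The function names and the 85 C limit are my own illustrative choices, not anything from lm-sensors.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN/chip/label": degrees_celsius} for each temp*_input file."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # Use the optional tempN_label file if present, else "tempN".
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name[: -len("_input")]
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read; skip them
            temps[f"{chip_dir.name}/{chip}/{label}"] = millidegrees / 1000.0
    return temps


def over_threshold(temps, limit_c=85.0):
    """Return only the sensors at or above limit_c (hypothetical limit)."""
    return {name: value for name, value in temps.items() if value >= limit_c}


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Run it from cron every minute and append the output to a file; after the next freeze, the last recorded lines tell you whether anything was running hot just before the hang.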
Yes, it's Linux based. Do you think it's related to temperature, or to hardware? I was thinking of getting a new server, migrating the data, and then moving it into the old one's rack to rule out a hardware problem.
Egidijus
If there are no clear signs in software, then it is very likely hardware. Temperature is hardware (software can't feel a warm touch).
I really doubt it relates to temperature, since the server is not under heavy load when it freezes. I don't think the CPU can reach 95 degrees at a CPU load of 9% or 20%; it reaches those loads daily and yet nothing happens.