Score:0

ESX hosts crash within the same vCenter cluster

I have a vCenter cluster of 12 ESX hosts (ClusterA) and another cluster of 3 ESX hosts (ClusterB). All of the hosts are a mix of Dell PowerEdge R620s and R630s.

Some of the hosts have hardware errors that can be seen in the iDRAC logs and front LCD screen such as:

  • CPU machine check error
  • Correctable memory error rate exceeded

As expected, this causes those hosts to become unavailable (Not responding) in the cluster.

Fixing these hardware errors usually involves these steps:

  1. power off
  2. remove network cards
  3. power on and wait for successful boot to OS
  4. power off
  5. place the same network cards back in
  6. power on

It's strange to me that this would fix CPU and memory errors, but that's what happens consistently.

ClusterB is fine and has never had any problems. The real problem I'm facing is that when I fix a couple of hosts in ClusterA, 1-3 other random hosts in ClusterA crash within a day or two. After those initial 1-3 crashes, if I leave things alone, no more hosts crash for weeks. This puts me back where I started, and I've observed this behavior several times now.
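To check whether the follow-on crashes really track the repairs, it may help to line up timestamps taken from the iDRAC and vCenter logs. The following is only a minimal sketch: the function name `crashes_after_repairs`, the host names, and all dates are hypothetical placeholders, and the two-day window simply mirrors the "within a day or two" observation above.

```python
from datetime import datetime, timedelta


def crashes_after_repairs(repairs, crashes, window_days=2):
    """For each repair, list crashes of OTHER hosts within window_days after it.

    repairs and crashes are lists of (host_name, datetime) tuples.
    """
    window = timedelta(days=window_days)
    result = {}
    for repaired_host, repaired_at in repairs:
        followers = [
            (host, crashed_at)
            for host, crashed_at in crashes
            if host != repaired_host
            and repaired_at <= crashed_at <= repaired_at + window
        ]
        result[(repaired_host, repaired_at)] = followers
    return result


# Hypothetical placeholder data; replace with real timestamps from the logs.
repairs = [("esx-a03", datetime(2024, 1, 10, 9, 0))]
crashes = [
    ("esx-a07", datetime(2024, 1, 11, 2, 30)),   # inside the 2-day window
    ("esx-a11", datetime(2024, 1, 25, 14, 0)),   # well outside the window
]

for (host, when), followers in crashes_after_repairs(repairs, crashes).items():
    print(host, when.isoformat(), [h for h, _ in followers])
```

If nearly every crash lands inside a window, that points toward something the repair itself triggers (for example, load shifting when a host rejoins the cluster) rather than independent hardware faults.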

Any ideas on what to check?

joeqwerty:
Contact Dell support. That's your best bet.
TLMstack:
@joeqwerty Unfortunately, I've already contacted Dell support several times - that's where the above remediation steps originally came from.