How do you deal with a server that hangs but doesn't stop?

achhainsan

8/28/24, 7:59 AM

We have some Linux servers that hang (get stuck) without actually stopping. How can I deal with them? The cause of the hangs isn't clear. Any guidance will be appreciated.

The problems:

The server hangs from time to time. It doesn't stop; it just hangs. Theoretically it's still up, but in practice it has stopped working. One way to detect this is to monitor the logs: you'll see that nothing is being written anymore.

Cause: Unknown
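The "logs stop being printed" symptom can be checked automatically. Below is a minimal sketch, assuming the application appends to a single log file and that a few minutes of silence is abnormal for it. The function names and the threshold are hypothetical, not something already in use here.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago the log file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def log_is_stale(log_path, max_silence_seconds, now=None):
    """Return True when the log has been silent longer than the allowed window."""
    return seconds_since_last_write(log_path, now) > max_silence_seconds
```

For example, `log_is_stale("/path/to/server.log", 300)` (the path is a placeholder) would return True once the file has gone five minutes without a write. A quiet but healthy server would also trip this, so the threshold needs to match how chatty the application normally is.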

The server goes down from time to time, too frequently on some servers.

Cause: Huge log size

Solution: logrotate

The server goes down from time to time, too frequently on some servers.

Cause: Unknown

Solution: A script that restarts the service automatically at regular intervals. I have little hope that it will work, though.
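An alternative to restarting on a fixed schedule is to restart only after the service has failed a health probe several times in a row. The sketch below assumes the application exposes some HTTP URL that a healthy instance answers quickly, and that it runs as a systemd unit (CentOS 7 uses systemd). The function names, the failure threshold and the probe URL are all hypothetical.

```python
import subprocess
import urllib.request


def http_probe(url, timeout_seconds=5):
    """Return True if the URL answers with a non-error status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 400
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as unhealthy.
        return False


def restart_unit(unit_name):
    """Restart a systemd unit; returns True when systemctl exits with status 0."""
    result = subprocess.run(["systemctl", "restart", unit_name])
    return result.returncode == 0


def watchdog_step(healthy, consecutive_failures, threshold=3):
    """Return (new_failure_count, should_restart) for one probe result."""
    if healthy:
        return 0, False
    consecutive_failures += 1
    if consecutive_failures >= threshold:
        # Reset the counter so the restarted service gets a fresh start.
        return 0, True
    return consecutive_failures, False
```

A small loop, cron job or systemd timer could call `http_probe`, feed the result into `watchdog_step`, and call `restart_unit` when it says so. The timeout matters here: a hung Java process often still accepts connections but never answers, which a plain "is the port open" check would miss. Since the cause is unknown, it may also be worth capturing a thread dump (for example with the JDK's `jstack`) before each restart.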

The clients want to be able to monitor these services themselves and do things like restart them. What's the best monitoring tool that can also restart a service (i.e. something that runs scripts of my choosing)?

Are Nagios, Zabbix, or Monit used for this purpose? What's the best tool for it?

We're using CentOS 7 (yes, it's reaching end of life). The servers run on virtual machines. We only have remote access. The applications are:

Java servers
GlassFish servers
Tomcat servers


linux

tomcat

glassfish

java

centos

Romeo Ninov

8/28/24, 8:04 AM

Provide more information, such as the hardware, OS, and applications, whether you have physical access, etc.

HBruijn

8/28/24, 9:22 AM

In general you have your monitoring and logs, and you can check the console (for things like OOM killer events). Enterprise hardware usually comes with an out-of-band management console that will also give insight into server health and events. You try to perform a root-cause analysis and, based on that, decide on a solution. [STONITH](https://en.wikipedia.org/wiki/STONITH) is the common clustering solution for dealing with hanging servers.
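As an illustration of checking for OOM killer events, here is a minimal sketch that scans kernel log lines (for example from `dmesg` or `/var/log/messages`) for OOM kills. The function name is hypothetical, and the pattern assumes the kernel message begins with "Out of memory: Kill process" or "Out of memory: Killed process"; the exact wording varies between kernel versions, so verify it against a real line from the affected hosts.

```python
import re

# Assumed message shape; confirm against the actual kernel log on the server.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for each OOM kill found in the lines."""
    kills = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

If `java` shows up in the results around the times the service went down, the "Unknown" cause is likely memory pressure on the VM rather than a fault in the application itself.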

John Mahowald

8/28/24, 11:05 PM

In addition to the requested edits for any information at all, add what your organization's uptime objective is and what high-availability design is in place. Sometimes you can put several identical hosts behind a load balancer or in a cluster. Sometimes you do not have the time or the budget and it's just one server.

Greg Askew

8/29/24, 6:18 AM

Requests for products are off topic.
