Score:0

How do you deal with a server fault where the server hangs but doesn't stop?


We have some Linux servers that hang (get stuck) but do not stop. How can I deal with them? The cause of the hangs is not clear. Any guidance will be appreciated.

The problems:

  1. The server hangs from time to time. It doesn't stop; it just hangs. Theoretically it is still up, but in practice it has stopped working. One way to detect this is to monitor the logs: once the server hangs, nothing new is written to them.

Cause: Unknown

  2. The server goes down from time to time, too frequently on some servers.

Cause: Huge log size

Solution: logrotate

  3. The server goes down from time to time, too frequently on some servers.

Cause: Unknown

Solution: A script that auto-restarts the service in a timely manner. I have little hope that it will work, though.

  4. The clients want to be able to monitor these services themselves and do things like restarting them on their own. What's the best monitoring tool that also allows restarting the service (i.e. something that can run scripts of my choosing)?

Are Nagios, Zabbix, or Monit used for this purpose? What's the best tool for it?

We're using CentOS 7 (yes, it's reaching end of life). The servers run on virtual machines. We only have remote access. The applications are:

  • Java servers

  • GlassFish servers

  • Tomcat servers
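
To make the log-based hang detection and the auto-restart idea mentioned above more concrete, here is a minimal sketch in Python. It treats a service as hung when its log file has not been modified for a while, and only then restarts it through `systemctl`. The log path, the unit name `myapp.service`, and the 300-second threshold are hypothetical placeholders, not values from our setup, and the approach assumes the application writes to its log regularly while healthy.

```python
import os
import subprocess
import time


def is_log_stale(log_path, max_age_seconds, now=None):
    """Return True if log_path was last modified more than max_age_seconds ago."""
    if now is None:
        now = time.time()
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # Assumption: a missing or unreadable log counts as stale.
        return True
    return (now - last_write) > max_age_seconds


def restart_if_hung(log_path, service_name, max_age_seconds=300):
    """Restart service_name via systemctl when its log has gone quiet.

    Returns True if a restart was issued, False otherwise.
    """
    if is_log_stale(log_path, max_age_seconds):
        # Needs root privileges (or an equivalent sudo rule) to succeed.
        subprocess.check_call(["systemctl", "restart", service_name])
        return True
    return False


# Example call with hypothetical names, e.g. run from cron every few minutes:
# restart_if_hung("/var/log/myapp/server.log", "myapp.service")
```

One caveat with this design: a healthy but idle server also stops writing logs, so the threshold has to be longer than the longest normal quiet period, or the application needs to emit a periodic heartbeat line.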

Romeo Ninov
Please provide more information: hardware, OS, applications, whether you have physical access, etc.
HBruijn
In general you have your monitoring and logs, and you can check the console (for things like OOM killer events). Enterprise hardware usually comes with an out-of-band management console that will also give insight into server health and events. You try to perform a root-cause analysis and, based on that, decide on a solution. [STONITH](https://en.wikipedia.org/wiki/STONITH) is the common clustering solution to deal with hanging servers.
John Mahowald
In addition to the requested edits for any information at all, add what your organization's uptime objective is and what high-availability design is in place. Sometimes you can put multiple copies of the same host behind a load balancer or in a cluster. Sometimes you do not have the time or the budget and it's just one server.
Requests for products are off topic.