
Random reboot or system crash on my Ubuntu server


I'm new to Ubuntu as well as to askubuntu.com.

I recently ran into a faulty system, so I'm reaching out for help.

//

I have a headless Ubuntu server, which I usually access over ssh/sftp from my laptop.

The machine sits in a server room with air-conditioned cooling and a stable power supply, and it is powered on 24/7. It has a public IP address, and I have set up UFW for safety.

The machine is used for deep learning with GPU acceleration, and it connects to other servers over ssh on the internal network using private IPs.

//

Issue & Symptoms

Four months ago, the machine started to reboot automatically or go down (system crash), after running without problems for the previous 8 months.

By "go down" or "crash" I mean that the machine stops working, all the fans stop, and the power appears to be off.

Once the machine goes down, I have to manually unplug the power cable, wait for the residual power to drain, plug the cable back in, and then turn the machine on by pressing the power button.

Here are the weird things:

(1) The faults are becoming more frequent, and the machine now crashes more often than it reboots itself. Initially the interval between faults was about 2~3 weeks, but now it is less than 1 week, and sometimes even less than 3 days.

(2) The machine crashes/reboots without any additional load. Often it reboots with only the default processes running. I've also run heavy workloads several times that fully utilize the CPU and GPUs, but no reboot/crash happened. (So I do not think it is a thermal issue.)

(3) The machine often crashes even with just a simple ssh/sftp connection. After this happened, I checked with the last -x command, and there was no earlier fault recorded (after the machine had recovered from the last fault).

(4) I've also checked syslog, but there was no suspicious log.

(5) In addition, the ssh connection is often delayed or lost (with a broken pipe) without a system reboot/crash.
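
For the last -x check in (3), one way to separate unclean stops from clean ones is to look for a "reboot" record that has no "shutdown" record before it, since a sudden power loss normally leaves no shutdown record behind. Below is a minimal Python sketch under that assumption. The function name classify_boots and the SAMPLE text are hypothetical illustrations, not output from this machine.

```python
def classify_boots(last_x_output):
    """Label each boot found in `last -x` output.

    Returns one label per "reboot" record, oldest first:
    "unknown" (no earlier history), "clean" (a shutdown record
    precedes the boot) or "unclean" (no shutdown record precedes it).
    """
    # `last` prints the newest record first, so reverse to walk forward in time.
    kinds = [ln.split()[0] for ln in last_x_output.splitlines() if ln.strip()]
    kinds.reverse()

    labels = []
    seen_boot = False
    saw_shutdown = False
    for kind in kinds:
        if kind == "shutdown":
            saw_shutdown = True
        elif kind == "reboot":
            if not seen_boot:
                labels.append("unknown")
            else:
                labels.append("clean" if saw_shutdown else "unclean")
            seen_boot = True
            saw_shutdown = False
        # Other records (user logins, runlevel changes, "wtmp begins") are ignored.
    return labels


# Hypothetical sample in the usual `last -x` layout (newest first).
SAMPLE = """\
reboot   system boot  5.4.0-generic Thu Jun  9 11:40   still running
reboot   system boot  5.4.0-generic Mon Jun  6 09:00 - 11:40 (3+02:40)
shutdown system down  5.4.0-generic Sun Jun  5 22:00 - 09:00  (11:00)
reboot   system boot  5.4.0-generic Wed Jun  1 08:00 - 22:00 (4+14:00)

wtmp begins Wed Jun  1 08:00:00 2022
"""

if __name__ == "__main__":
    print(classify_boots(SAMPLE))
```

On the real machine, the text would come from saving the output of last -x to a file and passing its contents to the function. A boot labelled "unclean" would point to a power-off or hard crash rather than an orderly reboot.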

//

HW Spec

Here is the HW spec:

CPU: Intel i7-11700KF (with additional CPU cooler)

M/B: Intel Z590

RAM: Samsung 32GB * 4EA

SSD: Samsung M2 NVME 1TB * 2EA

GPU: NVIDIA RTX3090 * 2EA

Power: Seasonic PX-1300

+) Ubuntu 20.04.4 LTS

//

I've checked the RAM and the storage with memtest and smartctl, and the results say they are fine.

Could you help me solve this issue? What should I check next? If there is any information I need to provide, I will add it by updating this post.


EDITED : @waltinator

I've checked the log, and everything seems normal except for the UFW BLOCK entries. (I have other machines on private IPs to compare against. Those machines have no UFW BLOCK entries, since they are on the safe network and I did not set up UFW on them.)

There are tons of UFW BLOCK entries (since I set up UFW against unwanted attacks from anonymous sources), but the SRC and DST look fine.

For example, the following:

Jun  9 11:45:51 (removed) kernel: [70349.077829] [UFW BLOCK] IN=(removed) OUT= MAC=(removed) SRC=192.168.0.1 DST=224.0.0.1 LEN=32 TOS=0x00 PREC=0x00 TTL=1 ID=51159 DF PROTO=2 
Jun  9 11:46:21 (removed) kernel: [70379.078710] [UFW BLOCK] IN=(removed) OUT= MAC=(removed) SRC=192.168.0.1 DST=224.0.0.1 LEN=32 TOS=0x00 PREC=0x00 TTL=1 ID=10357 DF PROTO=2 

Another entry has SRC=192.168.163.XXX DST=224.0.0.251 (the SRC is another machine that I use behind the same network router).

If I disconnect the machine from the wireless router and connect it directly with a LAN cable, that last block message (SRC=192.168.163.XXX DST=224.0.0.251) disappears from the log.
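
To compare the blocked traffic more systematically, the entries can be tallied by source, destination, and protocol. Below is a minimal Python sketch, assuming the log lines keep the KEY=VALUE layout of the examples; the helper names parse_ufw_block and summarize are hypothetical. For context, 224.0.0.1 (all hosts) and 224.0.0.251 (mDNS) are multicast addresses and PROTO=2 is IGMP, which would suggest routine local-network chatter rather than an outside attack.

```python
import ipaddress
import re
from collections import Counter

# Matches KEY=VALUE tokens; VALUE may be empty, as in "OUT= ".
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_ufw_block(line):
    """Return the KEY=VALUE fields of a UFW BLOCK line, or None for other lines."""
    if "[UFW BLOCK]" not in line:
        return None
    return dict(FIELD.findall(line))


def summarize(lines):
    """Count blocked packets per (SRC, DST, PROTO, kind of destination)."""
    counts = Counter()
    for line in lines:
        fields = parse_ufw_block(line)
        if not fields or "SRC" not in fields or "DST" not in fields:
            continue
        try:
            dst = ipaddress.ip_address(fields["DST"])
        except ValueError:
            # Redacted or malformed address: keep the entry, mark it unknown.
            kind = "unknown"
        else:
            kind = "multicast" if dst.is_multicast else "unicast"
        key = (fields["SRC"], fields["DST"], fields.get("PROTO", "?"), kind)
        counts[key] += 1
    return counts


# The two example lines from the log, with redactions left in place.
LINES = [
    "Jun  9 11:45:51 (removed) kernel: [70349.077829] [UFW BLOCK] IN=(removed) "
    "OUT= MAC=(removed) SRC=192.168.0.1 DST=224.0.0.1 LEN=32 TOS=0x00 "
    "PREC=0x00 TTL=1 ID=51159 DF PROTO=2",
    "Jun  9 11:46:21 (removed) kernel: [70379.078710] [UFW BLOCK] IN=(removed) "
    "OUT= MAC=(removed) SRC=192.168.0.1 DST=224.0.0.1 LEN=32 TOS=0x00 "
    "PREC=0x00 TTL=1 ID=10357 DF PROTO=2",
]

if __name__ == "__main__":
    for key, n in summarize(LINES).most_common():
        print(n, key)
```

Running the same tally on a boot that ended in a crash and on one that did not would show whether the blocked traffic actually differs between the two.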

Please see below.


EDITED : GENERAL

As far as I've checked, the UFW BLOCK entries do not seem to cause the reboot or system crash directly; the tons of UFW BLOCK logs appear to come from a collision between the internal and external networks.

However, it seems that UFW BLOCK acts as one of the random processes that cause the reboot/crash.

I guess the machine reboots or shuts down with a crash because of random processes, including the UFW BLOCK handling, since symptom (5) (the ssh connection being delayed or lost with a broken pipe, without a system reboot/crash) no longer happens after I connect the machine directly to the LAN without the network router.

Also, the CPU/GPU usage is stable, and as far as I know from the IT team there have been no anonymous attacks from outside.

Could it be due to a HW issue?

waltinator
Read `man journalctl`, then click on my userid and read my profile for `journalctl` hints. Start with `sudo journalctl --list-boots` (Warning: this takes a long time the first time it's run. Be prepared to wait). Then, taking the boot number from the list, `sudo journalctl -b # -ex` (with `#` replaced by that boot number) will show the last several messages before the system went down, with extended messages.
labonteck1
@waltinator Thanks for your comment. I've checked the log, and everything seems normal except for the `UFW BLOCK` entries. (I have other machines on private IPs to compare against. Those machines have no `UFW BLOCK` entries, since they are on the safe network and I did not set up UFW on them.) There are tons of `UFW BLOCK` entries (since I set up **UFW** against unwanted attacks from anonymous sources), but the `SRC` and `DST` look fine.
labonteck1
@waltinator Here are some examples: SRC=192.168.1.1 DST=224.0.0.1 // SRC=192.168.0.1 DST=224.0.0.1 // SRC=192.168.163.XXX DST=224.0.0.251 // These all come from the private network as part of the network setup (right?), and `192.168.163.XXX` is the other machine that I use; I used to jump to that machine with `ssh` from the machine that has the system fault. Or could these multiple `UFW BLOCK` entries cause the system fault? I also have other boot logs with multiple `UFW BLOCK` entries and no system fault.
labonteck1
@waltinator Except for the `UFW BLOCK` entries, the log seems fine. Might it be a HW issue?
waltinator
Please [edit] your Question to add new information, properly formatted. Information added via comments is hard for you to format, hard for us to read and ignored by both current and future readers (who have better answers). Please don't use Add Comment, since that's our way to help you improve your question. Please read https://askubuntu.com/help/how-to-ask and https://askubuntu.com/help/formatting . Help us help you.
labonteck1
@waltinator Thanks for your comment. I've edited the question.