We recently deployed some new hardware and have been experiencing random reboots since day one, and a lot of them. I've even been working at the console when a device rebooted without any warning.
We've gone down a bunch of rabbit holes trying to troubleshoot, but so far nothing has panned out. It's happening on multiple devices, which makes me think it is not a hardware problem with one bad device.
First we thought it might be heat, as these are deployed "in the field," but the reboots happen at all hours of the day and night, not just at the hottest times of the day. Sometimes it's in the middle of the night when it's 50 degrees F in the cabinet and the device is running at its lowest load.
The reboots do, however, seem to cluster around times of heaviest CPU load. Here are recent 'last reboot' entries:
reboot system boot 5.4.0-77-generic Sun Aug 1 17:31 still running
reboot system boot 5.4.0-77-generic Sun Aug 1 15:48 still running
reboot system boot 5.4.0-77-generic Sun Aug 1 15:32 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 19:02 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 17:56 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 17:30 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 17:17 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 16:52 still running
reboot system boot 5.4.0-77-generic Sat Jul 31 16:40 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 23:13 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 22:37 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 22:05 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 21:42 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 21:24 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 20:53 still running
reboot system boot 5.4.0-77-generic Fri Jul 30 20:42 still running
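To make the pattern in the entries above easier to see, here is a rough sketch that tallies them per day. The function name count_reboots_per_day is made up for illustration, and it assumes the column layout shown above (weekday, month, and day in fields 5 to 7), which may differ on other systems.

```shell
# Tally `last reboot`-style lines (read from stdin) per calendar day.
# Assumption: fields 5-7 are weekday, month, day, as in the entries above.
count_reboots_per_day() {
    awk '
        $1 == "reboot" {
            day = $5 " " $6 " " $7
            if (!(day in n)) order[++k] = day   # remember first-seen order
            n[day]++
        }
        END {
            for (i = 1; i <= k; i++) print order[i] ": " n[order[i]]
        }
    '
}

# Usage:
#   last reboot | count_reboots_per_day
```

Run against the entries listed above, this gives 3 reboots on Aug 1, 6 on Jul 31, and 7 on Jul 30.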
dmesg doesn't show anything useful related to the reboots. We've tailed /var/log/kern.log and syslog.log all day, but there's nothing added just before the reboots.
Thinking that it might be heat-related, we ran 'watch -n 1 sensors' around the times when the devices are most likely to reboot. Although the CPU was "warm," it was still below the HIGH limit and 20-30 degrees C below the CRITICAL limit, which, as I understand it, is where it would shut down or reboot.
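Since 'watch' only shows readings on screen, the last values before a reboot are lost. A minimal sketch of how the same readings could be written to disk instead is below. The helper name log_snapshot and the log path are hypothetical, and it assumes that calling sync after each write is enough for the final reading to survive an abrupt reset, which is not guaranteed on every filesystem or storage device.

```shell
# Append a timestamped snapshot of a command's output to a log file and
# flush it to disk, so the most recent reading has a chance of surviving
# an abrupt reboot.
log_snapshot() {
    logfile=$1
    shift
    # Timestamp, then the command's output (stdout and stderr), then a blank line.
    { date '+%Y-%m-%d %H:%M:%S'; "$@" 2>&1; echo; } >> "$logfile"
    # Flushes all filesystems; this adds some I/O load at a 1-second interval.
    sync
}

# Usage (hypothetical path; needs write permission there):
#   while true; do log_snapshot /var/log/sensors-trace.log sensors; sleep 1; done
```

After a reboot, the tail of that file would show the last temperatures recorded before the device went down.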
What can we try next to track down the cause of these reboots?
Thanks.