My Ubuntu server shut itself off today. According to /var/log/kern.log,
the cause was overheating:
Sep 8 07:00:22 ipc2-server kernel: [289498.255583] QNX4 filesystem 0.2.3 registered.
Sep 10 20:04:00 ipc2-server kernel: [509336.574882] thermal thermal_zone1: critical temperature reached (100 C), shutting down
Sep 10 20:04:01 ipc2-server kernel: [509337.601860] thermal thermal_zone1: critical temperature reached (100 C), shutting down
This seems fine, except that it happened out of nowhere. My Netdata logs show that the temperature went from a stable 44° Celsius to 70° within 40 seconds, at which point the server shut down (the red curve sloping down from 70° covers the time the server was off):
As you can see, only two sensors reported this change, and CPU utilisation was at 20% before the server shut down:
Later you can see a normal heat spike from an increase in CPU usage when all temperature sensors report an increase in heat.
This is the first time this has happened to me, and it brings up some questions.
- Are there any further logs I can use to investigate this issue to confirm it was a hardware failure or an actual overheat?
- Is it normal for temperature sensors to fail over time?
- Can they be replaced if that is the case?
- Can I change the behaviour of Ubuntu so that it shuts down the server only if all temperature sensors are reporting high values?
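
For reference, the kernel exposes each thermal zone under /sys/class/thermal, with temperatures in millidegrees Celsius. A minimal sketch like the one below could help identify which sensor thermal_zone1 actually is and what its critical trip point is set to. The function names are my own (hypothetical), and the script only reads values. It does not change how the kernel decides to shut down; the "all sensors high" check is just an illustration of the condition I have in mind.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return one dict per thermal zone found under `base`.

    Each dict has the zone name, its type string, the current
    temperature in degrees Celsius, and the critical trip point
    in degrees Celsius (None if the zone does not define one).
    """
    zones = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            # sysfs reports millidegrees Celsius
            temp_c = int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            # Some zones refuse reads or return garbage; skip them.
            continue

        critical_c = None
        for trip_type in zone.glob("trip_point_*_type"):
            try:
                if trip_type.read_text().strip() == "critical":
                    trip_temp = trip_type.with_name(
                        trip_type.name.replace("_type", "_temp")
                    )
                    critical_c = int(trip_temp.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                pass

        zones.append(
            {
                "zone": zone.name,
                "type": zone_type,
                "temp_c": temp_c,
                "critical_c": critical_c,
            }
        )
    return zones


def all_zones_above(zones, threshold_c):
    """True only if there is at least one zone and every zone is at or above the threshold."""
    return bool(zones) and all(z["temp_c"] >= threshold_c for z in zones)


if __name__ == "__main__":
    for z in read_thermal_zones():
        print(z)
```

Running it while the server is healthy should show whether thermal_zone1 maps to the CPU package, an ACPI zone, or something else, which might narrow down whether the two sensors that spiked share a single source.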