Score:1

How to troubleshoot CPU HW crash in Ubuntu 18.04

cn flag

I bought a new computer a few months ago. I installed Ubuntu 18.04 and it works fine except when I compile C++ code: it freezes hard as soon as there is a spike of high CPU usage (10+ cores).

The only working workaround is to compile with -j8. Going -j10 or above will make the system crash most of the time. -j16 crashes 100% of the time with big projects (and no ccache).

Details about my setup:

  • Asus gaming computer: Asus Strix GT15 - Best Buy link. You've guessed it, I bought it for the GPU... otherwise I would have built it myself with good quality components (especially PSU and heatsink).
  • MB: Asus strix B460-G Gaming
  • CPU: Intel Core i7-10700KF
  • Power supply: Unknown OEM 500W 80 PLUS
  • The crash occurs when the GPU is idle (desktop).
  • I can't install a more recent Ubuntu version due to the required work environment.

What I tried that did not resolve the issue (it's a little less frequent, but still happening):

BIOS:

  • I reduced the Turbo duration to the minimum (1s instead of 60s), as the CPU heatsink seems very inefficient for this furnace of a CPU.
  • Reduced the number of amps AND the maximum wattage the CPU/motherboard is allowed to use, in case the PSU is too weak.
  • Made the fan speed ramp up sooner, when the CPU temp hits 50C (temps are not much better, but now it's very loud when compiling)
  • Replaced the OEM "thermal paste" with a high quality paste (reduced temps by 2-3C)

Crash notes:

  • journalctl -b -1 doesn't have any trace about a crash, so I think it's a HW CPU crash...
  • Ctrl-Alt-F* keys do not work
  • Can't connect via ssh after the crash
  • Audio crashes too when it happens
  • I don't think the PSU is the problem because I can use stress -c 16 and ./gpu_burn 300 at the same time and the system doesn't crash. Stress only uses sqrt()...

Thanks in advance!

Update #1

Temps:

  • Without these BIOS setting changes, temps would easily go up to 90C after sustained 100% CPU usage. At those temps, I did not let it run for long.
  • After the modifications, temps rarely go above 80C.
  • The freeze seems to be related to a sudden spike in CPU usage, not to high CPU temps.
  • room temp is 20-22C
  • idle CPU temp is 27-28C

Current kernel:

uname -a
Linux rog 5.4.0-87-generic #98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hu flag
May need to monitor the CPU temps, or reset the heatsink with new thermal paste.
Doug Smythies avatar
gn flag
@mikewhatever : The OP already changed to high quality thermal paste, and claims a 2-3 degree improvement. I agree with monitoring CPU temps, and suggest (as always) `sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 6`. My i5-10600K is from the same era, and I had to enable HWE on 20.04 server to use a newer kernel. Suggest you try a newer kernel, just as a test.
cn flag
@DougSmythies Thanks for the suggestions, I'll try a newer kernel and see about the turbostat output !
Doug Smythies avatar
gn flag
A spike in CPU utilization could also cause an increase in CPU temperature that occurs so fast that you don't see it on any monitoring program. Are you running any thermal throttling daemon? Like thermald or using TCC offset?
cn flag
@DougSmythies thermald didn't recognize this CPU so it's not running...
Doug Smythies avatar
gn flag
Suggest a simple thermald config file. See [here](https://askubuntu.com/questions/1373324/cpu-temperature-spike-in-90c-only-when-plugged-in) for an example (it is the same as I have suggested before). Note that systemctl status might complain, as it does on my computer, but it actually works fine.
cn flag
@DougSmythies Seems to work! I used the very simple and generic example with temp of 60C. I will come back in a week to tell if it's really working well. If you create an answer with this link, I will accept it. Did not update the kernel yet, still at 5.4 (default latest version).
Score:1
gn flag

Everyone should understand the thermal characteristics of their computer, and provide adequate protection. Often users are not aware of how extremely rapid the processor package temperature can increase with a step function load. An example from my 20.04 test server:

doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 0.1
PkgTmp  PkgWatt
33  1.88    
33  1.69    
33  1.56    
33  1.74    
49  24.99   800 degrees per second
57  133.28  80 degrees per second
61  133.66  40 degrees per second
61  132.58  0 degrees per second
63  133.57  
64  134.12

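The load was applied about 4/5ths of the way through the sample interval: the first loaded sample averaged about 25 watts against a steady-state step of 133.5 - 1.7 watts, so the load was present for only about 20% of that 0.1 second sample (25 / (133.5 - 1.7) ~= 20%). In that roughly 0.02 seconds the temperature already went up 16 degrees, which is 800 degrees per second. The load here was the prime95 torture test, the maximum heat sub-test. The example computer is water cooled, with the water pump always on at maximum rate. The processor is an i5-10600K.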
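The per-second rates annotated in the output above can be reproduced with a small filter. This is only a sketch: `temp_rate` is a hypothetical helper name, and it assumes the package temperature is the first column of each line, as in the turbostat output shown.

```shell
#!/bin/sh
# Hypothetical helper: print each temperature sample with its slope in
# degrees per second relative to the previous sample.
# stdin: lines whose first column is PkgTmp (assumed layout, as shown above)
# $1:    time between samples, in seconds
temp_rate() {
    awk -v dt="$1" '
        $1 !~ /^[0-9]+$/ { next }   # skip header and blank lines
        seen { printf "%d %.0f\n", $1, ($1 - prev) / dt }
        { prev = $1; seen = 1 }
    '
}

# Example with the samples from the output above:
#   printf "49\n57\n61\n61\n" | temp_rate 0.1
# prints "57 80", "61 40" and "61 0", one pair per line.
```

For the first loaded sample, the interval to use is the roughly 0.02 seconds during which the load was actually present, not the full 0.1 second sample: `printf "33\n49\n" | temp_rate 0.02` gives the 800 degrees per second figure.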

For ASUS motherboards, please know that the CPU fan sensor is actually an external thermistor that will lag the actual processor package temperature both in time and value. On my ASUS motherboard, under heavy load, the CPU fan sensor lags the actual processor temperature by 12 degrees.

In the end, it is possible for the processor package temperature to hit the shutdown limit so fast that various monitoring programs or daemons don't even notice. Sometimes thermal protection needs to react sooner to have time to take effect before any overshoot temperature triggers a shutdown.

Method 1: Thermald

For `/etc/thermald/thermal-conf.xml` use the very basic and simple configuration, as per the `man thermal-conf.xml` page:
<?xml version="1.0"?>

<!--
use "man thermal-conf.xml" for details
-->

<!-- BEGIN -->
<ThermalConfiguration>
        <Platform>
                <Name>Override CPU default passive</Name>
                <ProductName>*</ProductName>
                <Preference>QUIET</Preference>
                <ThermalZones>
                        <ThermalZone>
                                <Type>cpu</Type>
                                <TripPoints>
                                        <TripPoint>
                                                <Temperature>41000</Temperature>
                                                <type>passive</type>
                                        </TripPoint>
                                </TripPoints>
                        </ThermalZone>
                </ThermalZones>
        </Platform>
</ThermalConfiguration>
<!-- END -->

Note: I am using a ridiculously low trip point of 41 degrees because my system is water cooled and I cannot reach more realistic example temperatures.

doug@s19:~$ sudo systemctl start thermald
doug@s19:~$ sudo systemctl status thermald
● thermald.service - Thermal Daemon Service
     Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-11-05 07:41:45 PDT; 17s ago
   Main PID: 3461 (thermald)
      Tasks: 2 (limit: 38214)
     Memory: 2.2M
     CGroup: /system.slice/thermald.service
             └─3461 /usr/sbin/thermald --systemd --dbus-enable --adaptive

Nov 05 07:41:45 s19 systemd[1]: Starting Thermal Daemon Service...
Nov 05 07:41:45 s19 systemd[1]: Started Thermal Daemon Service.
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: Polling mode is enabled: 4
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: XML zone: invalid sensor type []

While thermald status shows some complaining, it actually works properly, although a little slow to respond:

doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 1
PkgTmp  PkgWatt
33      1.44
33      1.34
33      1.33
58      63.26
61      114.43
61      114.68
48      86.59
47      55.48
47      55.53
41      42.77
43      33.43
41      34.30
41      28.04
43      33.63
40      34.45
44      33.57
41      34.40
44      33.85
34      14.50
34      1.33
34      1.33

Adjust the trip point as needed to get the most out of your system while still preventing a temperature overshoot from causing a shutdown. Too low a trip point might reduce system performance to undesirable levels.
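When experimenting with different trip points, editing the file by hand each time gets tedious. Below is a minimal sketch of a helper that rewrites the trip point. `set_trip_point` is a hypothetical name, and it assumes GNU sed (for `-i`) and a config with a single `<Temperature>` element, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the <Temperature> value in a thermald config.
# $1: path to the config, e.g. /etc/thermald/thermal-conf.xml
# $2: trip point in whole degrees Celsius
set_trip_point() {
    conf=$1
    celsius=$2
    # Accept only plain whole numbers without leading zeros.
    case $celsius in
        ''|*[!0-9]*|0?*)
            echo "trip point must be a whole number of degrees" >&2
            return 1
            ;;
    esac
    # The config uses millidegrees Celsius, so 60 becomes 60000.
    # Note: this rewrites every <Temperature> element in the file.
    sed -i "s|<Temperature>[0-9]*</Temperature>|<Temperature>$((celsius * 1000))</Temperature>|" "$conf"
}

# Example (run as root, then restart thermald to pick up the change):
#   set_trip_point /etc/thermald/thermal-conf.xml 60
#   systemctl restart thermald
```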

Method 2: TCC Offset

If your kernel is new enough and your processor is supported, TCC offset can be used to have the processor itself do the thermal throttling. Depending on the timing window parameters, the response time can be much faster. For this example, the timing window was set in BIOS to the fastest response time.

First, find which cooling device is the TCC offset:

doug@s19:~$ grep . /sys/devices/virtual/thermal/cooling_device*/type
/sys/devices/virtual/thermal/cooling_device0/type:Fan
/sys/devices/virtual/thermal/cooling_device10/type:Processor
/sys/devices/virtual/thermal/cooling_device11/type:Processor
/sys/devices/virtual/thermal/cooling_device12/type:Processor
/sys/devices/virtual/thermal/cooling_device13/type:Processor
/sys/devices/virtual/thermal/cooling_device14/type:Processor
/sys/devices/virtual/thermal/cooling_device15/type:Processor
/sys/devices/virtual/thermal/cooling_device16/type:Processor
/sys/devices/virtual/thermal/cooling_device17/type:intel_powerclamp
/sys/devices/virtual/thermal/cooling_device18/type:TCC Offset
/sys/devices/virtual/thermal/cooling_device1/type:Fan
/sys/devices/virtual/thermal/cooling_device2/type:Fan
/sys/devices/virtual/thermal/cooling_device3/type:Fan
/sys/devices/virtual/thermal/cooling_device4/type:Fan
/sys/devices/virtual/thermal/cooling_device5/type:Processor
/sys/devices/virtual/thermal/cooling_device6/type:Processor
/sys/devices/virtual/thermal/cooling_device7/type:Processor
/sys/devices/virtual/thermal/cooling_device8/type:Processor
/sys/devices/virtual/thermal/cooling_device9/type:Processor

It is device 18. Set the offset and then check it via turbostat without the --quiet option:

doug@s19:~$ echo 59 | sudo tee /sys/devices/virtual/thermal/cooling_device18/cur_state
59
doug@s19:~$ sudo /home/doug/temp-k-git/linux/tools/power/x86/turbostat/turbostat --Summary --show Bzy_MHz,PkgWatt,PkgTmp --interval 0.1
turbostat version 21.05.04 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0x16 CPUID levels
CPUID(1): family:model:stepping 0x6:a5:5 (6:165:5) microcode 0xec
...
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x3b641422 (41 C) (100 default - 59 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x883f0800 (37 C)
...
Bzy_MHz PkgTmp  PkgWatt
800     33      1.35
800     33      1.34
800     34      1.40
4187    49      86.23
4100    52      91.72
4100    53      91.29
...

Notice the throttling is virtually immediate: 4.8 GHz would have been the un-throttled CPU frequency. Note that the throttling limit for my processor (not all processors) is the non-turbo maximum clock frequency of 4.1 GHz, so it cannot actually hold the temperature at the ridiculously low limit of 41 degrees.
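The two manual steps above (finding the device and picking the offset) can be sketched as shell functions. The names `find_tcc_device` and `tcc_offset_for` are hypothetical, and the default of 100 degrees is an assumption based on the `100 default` shown in the turbostat line above. Check your own processor's value in that line before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers for the TCC offset steps above.

# Print the cooling_device directory whose type is "TCC Offset".
# $1: base directory (defaults to the sysfs path used above)
find_tcc_device() {
    base=${1:-/sys/devices/virtual/thermal}
    for type_file in "$base"/cooling_device*/type; do
        [ -r "$type_file" ] || continue
        if [ "$(cat "$type_file")" = "TCC Offset" ]; then
            dirname "$type_file"
            return 0
        fi
    done
    return 1   # not found: kernel too old or processor not supported
}

# Convert a desired throttle temperature into an offset.
# $1: target temperature in degrees C
# $2: default TCC activation temperature (assumed 100 if omitted)
tcc_offset_for() {
    target=$1
    default_tcc=${2:-100}
    echo $((default_tcc - target))
}

# Example: throttle at 41 degrees, which gives the offset of 59 used above.
#   dev=$(find_tcc_device) && tcc_offset_for 41 | sudo tee "$dev/cur_state"
```

Looking the device up by type rather than hard-coding `cooling_device18` matters because the device number differs between machines.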

cn flag
I used the generic thermald config you suggested and I didn't crash yet (limited to 60C for now). I will do more tests to find out at which temp it crashes... Thanks!
heynnema avatar
ru flag
After our brief discussion about TCC Offset elsewhere (I couldn't find that thread again), my echo command would be `echo 59 | sudo tee /sys/devices/virtual/thermal/cooling_device14/cur_state`, but where does this command go to be executed at system boot time? rc.local no longer exists. And is 59 the desired max temp in Celsius? And what BIOS setting did you change?
Doug Smythies avatar
gn flag
@heynnema : The TCC offset would have to be done per boot. I use a systemd post-boot service to take care of such things in my computers. The "59" is an offset from your processor's TCC. See this line `cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x3b641422 (41 C) (100 default - 59 offset)` above. The timing settings might not be available in your BIOS, they are in mine (ASUS Z490-A Prime). I have asked Intel for the MSR addresses, but didn't get an answer. I also asked for the offset to be a target temperature rather than an offset, and they agreed, but then it got lost somewhere.
heynnema avatar
ru flag
Thanks for your reply. I'd be interested in hearing what *"post-boot service"* that you're using. My "cur_state" starts at 0... so would 59 be a good example offset for 59 Celsius?
Doug Smythies avatar
gn flag
What is your processor? It'll take me a while to reply with a post-boot service example. Typical TCC is 100 degrees, so an offset of 41 would give a trip point of 59 degrees.
heynnema avatar
ru flag
i7 11th generation
Doug Smythies avatar
gn flag
@heynnema : I edited my answer with a post-boot service example. Sorry it took so long.
heynnema avatar
ru flag
@DougSmythies Thanks Doug! I'll take a close look at it.
heynnema avatar
ru flag
@DougSmythies I do a very similar thing, but only use one .service file, that directly calls one .sh script. Since /etc/rc.local file no longer exists, I created a /etc/rc.local folder for all of my .sh scripts.
Doug Smythies avatar
gn flag
@heynnema : Being a server person, I do not normally suspend. However, while working on an unrelated issue, I noticed the TCC offset does not survive a suspend/resume cycle. I posted the fixed post boot service as an edit to this answer.
heynnema avatar
ru flag
@DougSmythies Thanks for the update!