Everyone should understand the thermal characteristics of their computer and provide adequate protection. Users are often unaware of how rapidly the processor package temperature can rise under a step-function load. An example from my 20.04 test server:
doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 0.1
PkgTmp PkgWatt
33 1.88
33 1.69
33 1.56
33 1.74
49 24.99 800 degrees per second
57 133.28 80 degrees per second
61 133.66 40 degrees per second
61 132.58 0 degrees per second
63 133.57
64 134.12
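The load was applied about 4/5ths of the way through the sample interval: that sample averaged only about 25 watts against an idle-to-loaded step of 133.5 - 1.7 watts, so the load was active for roughly the last 20% of the interval (25 / (133.5 - 1.7) ~= 20%). In that final 0.02 seconds the temperature had already gone up 16 degrees, a rate of about 800 degrees per second. The load here was the prime95 torture test, using the maximum heat sub-test. The example computer is water cooled, with the water pump always running at its maximum rate. The processor is an i5-10600K.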
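The arithmetic behind those annotations can be checked with a short sketch. The function names are purely illustrative (they are not part of turbostat), and the 20% active fraction used for the first step is the rounded estimate from the text, not a measured value.

```python
def ramp_rates(temps_c, interval_s):
    """Average degrees per second between consecutive samples."""
    return [(b - a) / interval_s for a, b in zip(temps_c, temps_c[1:])]


def active_fraction(sample_watts, idle_watts, loaded_watts):
    """Estimate the share of a sample interval during which the load ran,
    assuming power steps instantly from the idle level to the loaded level."""
    return (sample_watts - idle_watts) / (loaded_watts - idle_watts)


# Package temperatures from the turbostat run above (0.1 s interval).
temps = [33, 49, 57, 61, 61]
rates = ramp_rates(temps, 0.1)  # each rate is averaged over a whole interval

# Roughly one fifth of the first loaded interval was actually under load.
fraction = active_fraction(24.99, 1.7, 133.5)

# With the rounded 20% from the text: 16 degrees in about 0.02 s.
first_step_rate = (49 - 33) / (0.1 * 0.2)

print([round(r) for r in rates], round(fraction, 2), round(first_step_rate))
```

Averaged over the whole first interval the rise looks much slower than it really was; scaling by the estimated active fraction is what yields the figure of roughly 800 degrees per second.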
For ASUS motherboards, be aware that the CPU fan sensor is actually an external thermistor, which lags the actual processor package temperature in both time and value. On my ASUS motherboard, under heavy load, the CPU fan sensor lags the actual processor temperature by 12 degrees.
In the end, it is possible for the processor package temperature to hit the shutdown limit so fast that various monitoring programs or daemons don't even notice. Thermal protection therefore sometimes needs to start reacting sooner, so that it has time to take effect before any temperature overshoot triggers a shutdown.
Method 1: Thermald
For `/etc/thermald/thermal-conf.xml`, use the basic configuration given in the `man thermal-conf.xml` page:
<?xml version="1.0"?>
<!--
use "man thermal-conf.xml" for details
-->
<!-- BEGIN -->
<ThermalConfiguration>
  <Platform>
    <Name>Override CPU default passive</Name>
    <ProductName>*</ProductName>
    <Preference>QUIET</Preference>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu</Type>
        <TripPoints>
          <TripPoint>
            <Temperature>41000</Temperature>
            <type>passive</type>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>
<!-- END -->
Note: I am using a ridiculously low trip point of 41 degrees because my system is water cooled and I cannot otherwise reach temperatures high enough for this example.
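Since the `Temperature` value is given in thousandths of a degree (41000 for 41 degrees), it can be worth double-checking an edited file before restarting thermald. The following is a minimal sketch using only the Python standard library; the helper name `trip_points_celsius` is hypothetical, and the embedded sample is a trimmed copy of the configuration above so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration above, embedded for a self-contained run.
SAMPLE_CONF = """<?xml version="1.0"?>
<ThermalConfiguration>
  <Platform>
    <Name>Override CPU default passive</Name>
    <ProductName>*</ProductName>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu</Type>
        <TripPoints>
          <TripPoint>
            <Temperature>41000</Temperature>
            <type>passive</type>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>
"""


def trip_points_celsius(xml_text):
    """Return (zone type, trip type, degrees C) for every trip point found."""
    root = ET.fromstring(xml_text)
    found = []
    for zone in root.iter("ThermalZone"):
        zone_type = zone.findtext("Type", default="?").strip()
        for trip in zone.iter("TripPoint"):
            # The file stores millidegrees; convert to whole degrees.
            millidegrees = int(trip.findtext("Temperature"))
            trip_type = trip.findtext("type", default="?").strip()
            found.append((zone_type, trip_type, millidegrees / 1000))
    return found


print(trip_points_celsius(SAMPLE_CONF))
```

To check the real file, the text of `/etc/thermald/thermal-conf.xml` could be read and passed in instead of `SAMPLE_CONF`. A malformed file raises a parse error here, which is a cheap way to catch a broken edit.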
doug@s19:~$ sudo systemctl start thermald
doug@s19:~$ sudo systemctl status thermald
● thermald.service - Thermal Daemon Service
Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2021-11-05 07:41:45 PDT; 17s ago
Main PID: 3461 (thermald)
Tasks: 2 (limit: 38214)
Memory: 2.2M
CGroup: /system.slice/thermald.service
└─3461 /usr/sbin/thermald --systemd --dbus-enable --adaptive
Nov 05 07:41:45 s19 systemd[1]: Starting Thermal Daemon Service...
Nov 05 07:41:45 s19 systemd[1]: Started Thermal Daemon Service.
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: Polling mode is enabled: 4
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: XML zone: invalid sensor type []
Although the thermald status output shows some complaints, it actually works properly, though it is a little slow to respond:
doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 1
PkgTmp PkgWatt
33 1.44
33 1.34
33 1.33
58 63.26
61 114.43
61 114.68
48 86.59
47 55.48
47 55.53
41 42.77
43 33.43
41 34.30
41 28.04
43 33.63
40 34.45
44 33.57
41 34.40
44 33.85
34 14.50
34 1.33
34 1.33
Adjust the trip point as needed to get the most out of your system while still preventing the overshoot peak from causing a shutdown. Too low a trip point might reduce system performance to undesirable levels.
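When tuning the trip point, it helps to put numbers on the overshoot seen in a turbostat log like the one above. This is a rough sketch only: the function name is made up for illustration, and the 5 degree margin is an arbitrary choice to ignore the normal hovering around the trip point once thermald is regulating.

```python
def overshoot_summary(temps_c, trip_c, interval_s, margin_c=0):
    """Summarise how far and for how long samples exceeded the trip point.

    Samples count as "above" only when they exceed trip_c + margin_c.
    """
    above = [t for t in temps_c if t > trip_c + margin_c]
    peak = max(temps_c)
    return {
        "peak_c": peak,
        "overshoot_c": max(0, peak - trip_c),
        "seconds_above": len(above) * interval_s,
    }


# PkgTmp column from the 1 second turbostat run above.
temps = [33, 33, 33, 58, 61, 61, 48, 47, 47, 41, 43,
         41, 41, 43, 40, 44, 41, 44, 34, 34, 34]

summary = overshoot_summary(temps, trip_c=41, interval_s=1, margin_c=5)
print(summary)
```

For this log the peak is 20 degrees over the trip point, and the temperature stays more than 5 degrees over it for about 6 seconds. That overshoot is the amount of headroom to leave between the trip point and the shutdown temperature.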
Method 2: TCC Offset
If your kernel is new enough and your processor is supported, TCC offset can be used to have the processor itself do the thermal throttling. Depending on the timing window parameters, the response time can be much faster. For this example, the timing window was set in the BIOS to the fastest response time.
First, find which cooling device is the TCC offset:
doug@s19:~$ grep . /sys/devices/virtual/thermal/cooling_device*/type
/sys/devices/virtual/thermal/cooling_device0/type:Fan
/sys/devices/virtual/thermal/cooling_device10/type:Processor
/sys/devices/virtual/thermal/cooling_device11/type:Processor
/sys/devices/virtual/thermal/cooling_device12/type:Processor
/sys/devices/virtual/thermal/cooling_device13/type:Processor
/sys/devices/virtual/thermal/cooling_device14/type:Processor
/sys/devices/virtual/thermal/cooling_device15/type:Processor
/sys/devices/virtual/thermal/cooling_device16/type:Processor
/sys/devices/virtual/thermal/cooling_device17/type:intel_powerclamp
/sys/devices/virtual/thermal/cooling_device18/type:TCC Offset
/sys/devices/virtual/thermal/cooling_device1/type:Fan
/sys/devices/virtual/thermal/cooling_device2/type:Fan
/sys/devices/virtual/thermal/cooling_device3/type:Fan
/sys/devices/virtual/thermal/cooling_device4/type:Fan
/sys/devices/virtual/thermal/cooling_device5/type:Processor
/sys/devices/virtual/thermal/cooling_device6/type:Processor
/sys/devices/virtual/thermal/cooling_device7/type:Processor
/sys/devices/virtual/thermal/cooling_device8/type:Processor
/sys/devices/virtual/thermal/cooling_device9/type:Processor
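The cooling device number is not guaranteed to be the same on every machine, so the lookup can be scripted instead of read off by eye. This is a minimal sketch, assuming the standard sysfs cooling device attributes `type`, `cur_state` and `max_state`; the function name is hypothetical, and the `root` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path

THERMAL_ROOT = Path("/sys/devices/virtual/thermal")


def find_cooling_devices(wanted_type, root=THERMAL_ROOT):
    """Return cooling_device directories whose type file matches wanted_type."""
    matches = []
    for type_file in sorted(root.glob("cooling_device*/type")):
        if type_file.read_text().strip() == wanted_type:
            matches.append(type_file.parent)
    return matches


if __name__ == "__main__":
    # Prints nothing if the kernel or processor does not expose a TCC offset.
    for device in find_cooling_devices("TCC Offset"):
        cur_state = (device / "cur_state").read_text().strip()
        max_state = (device / "max_state").read_text().strip()
        print(f"{device}: cur_state={cur_state} max_state={max_state}")
```

The `max_state` value is worth a look before writing to `cur_state`, since it should be the largest offset the device will accept.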
In this listing the TCC offset is cooling_device18. Set the offset and then check it via turbostat without the --quiet option:
doug@s19:~$ echo 59 | sudo tee /sys/devices/virtual/thermal/cooling_device18/cur_state
59
doug@s19:~$ sudo /home/doug/temp-k-git/linux/tools/power/x86/turbostat/turbostat --Summary --show Bzy_MHz,PkgWatt,PkgTmp --interval 0.1
turbostat version 21.05.04 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0x16 CPUID levels
CPUID(1): family:model:stepping 0x6:a5:5 (6:165:5) microcode 0xec
...
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x3b641422 (41 C) (100 default - 59 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x883f0800 (37 C)
...
Bzy_MHz PkgTmp PkgWatt
800 33 1.35
800 33 1.34
800 34 1.40
4187 49 86.23
4100 52 91.72
4100 53 91.29
...
Notice that the throttling is virtually immediate: 4.8 GHz would have been the un-throttled CPU frequency. Note that on my processor (not all processors) throttling does not go below the non-turbo maximum clock frequency of 4.1 GHz, so the temperature cannot actually be held to the ridiculously low limit of 41 degrees.
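The two MSR lines in the turbostat header can be decoded by hand to confirm that the offset took effect. The sketch below assumes a bit layout that is consistent with the values turbostat printed above: the default target in bits 23:16 and the offset in bits 29:24 of MSR_IA32_TEMPERATURE_TARGET, and the digital readout (degrees below the default target) in bits 22:16 of MSR_IA32_PACKAGE_THERM_STATUS. The width of the offset field differs between processor models, so treat `offset_bits=6` as an assumption; the function names are illustrative.

```python
def decode_temperature_target(msr_value, offset_bits=6):
    """Split the register into (default target, offset, effective target) in C."""
    tcc_default = (msr_value >> 16) & 0xFF
    tcc_offset = (msr_value >> 24) & ((1 << offset_bits) - 1)
    return tcc_default, tcc_offset, tcc_default - tcc_offset


def decode_package_temperature(status_value, tcc_default):
    """The readout field counts degrees below the default target."""
    readout = (status_value >> 16) & 0x7F
    return tcc_default - readout


def offset_for_target(tcc_default, desired_c):
    """Offset to write to cur_state for a desired throttle temperature."""
    return tcc_default - desired_c


# Values copied from the turbostat header above.
default_c, offset_c, effective_c = decode_temperature_target(0x3B641422)
package_c = decode_package_temperature(0x883F0800, default_c)

print(default_c, offset_c, effective_c, package_c)
print(offset_for_target(default_c, 41))
```

The decoded values agree with what turbostat reports: a default of 100, an offset of 59, an effective limit of 41 degrees, and a package temperature of 37 degrees at the time of the read.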