Score:1

CPU running hotter at idle when using grub boot parameters


So, my laptop has been "freeze crashing" randomly (system totally unresponsive, mouse frozen, clock not advancing, no keyboard commands have any effect, and the only way out is a hard reboot using the physical power button), anywhere from a few minutes to several hours into using my computer.

So naturally, I investigated the issue and tried to find a fix. Looking at the kernel log, I saw that the last events logged before a freeze are several "Hardware Errors":

kernel: mce: [Hardware Error]: Machine check events logged

So I searched for it and tried to find solutions. And I did: I found this post, which basically tells me to add a few boot parameters. And it does fix the issue; I haven't had any more Hardware Errors logged, or any random freezes, since. These are the boot parameters:

noapic pci=assign-busses apicmaintimer idle=poll reboot=cold,hard

But the issue is, my laptop now idles at a much higher temperature when using these boot parameters: around 70 degrees Celsius, instead of 35-40. Obviously I've checked System Monitor to see if anything is taking up CPU time, but there's nothing. It shows anywhere between 0 and 3% CPU utilization on all 4 threads, nothing out of the ordinary.

And I know it's the boot parameters causing this issue, because I've tried removing them, and after rebooting, the fans aren't spinning as loudly and it's idling at a normal temperature. But the Hardware Errors are back, and so are the random freezes.
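When adding and removing parameters like this, it can help to confirm which ones the running kernel actually booted with. The following is a minimal sketch, assuming the kernel command line is readable from `/proc/cmdline`; the function name `has_boot_param` is made up for illustration.

```shell
#!/bin/sh
# has_boot_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the second argument exists
# only so the check can be tried against a saved copy.
has_boot_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # One parameter per line, then an exact, fixed-string, whole-line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Report on each parameter from the question.
for p in noapic pci=assign-busses apicmaintimer idle=poll reboot=cold,hard; do
    if has_boot_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not set"
    fi
done
```

Running it after each reboot shows whether an edit to the GRUB configuration really took effect before drawing conclusions from the temperatures.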

I am quite a novice at Linux stuff, so I literally have no idea what these boot parameters do. Can someone experienced tell me what it is they're doing, and why they're causing my CPU to idle so much hotter?

EDIT #1

So thanks to the help of matigo and Doug, I was told that the idle=poll parameter disables the idle system for the CPU, which obviously makes the CPU run hotter and creates more waste heat.

When removing that boot parameter, the Hardware Errors are back.

So, my freezes and Hardware Errors seem to have something to do with how the CPU switches between idle states.

My CPU is an Intel Core i7-7500U

This is the output from running grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name:

/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6
/sys/devices/system/cpu/cpu0/cpuidle/state5/name:C7s
/sys/devices/system/cpu/cpu0/cpuidle/state6/name:C8
/sys/devices/system/cpu/cpu0/cpuidle/state7/name:C9
/sys/devices/system/cpu/cpu0/cpuidle/state8/name:C10

So basically, what I need help with is this: getting rid of these Hardware Errors and crashes without completely disabling the CPU idle system, if possible.
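For anyone following along, here is a small sketch that prints each idle state together with its exit latency and whether it is currently disabled. It assumes the usual cpuidle sysfs layout, with `name`, `latency` and `disable` files in each `stateN` directory; the function name `list_idle_states` is only illustrative.

```shell
#!/bin/sh
# list_idle_states [CPUIDLE_DIR]
# Prints one line per idle state: directory name, state name,
# exit latency in microseconds, and the current disable flag.
# CPUIDLE_DIR defaults to cpu0's cpuidle directory.
list_idle_states() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for d in "$base"/state*; do
        # Skip the unexpanded glob if no state directories exist.
        [ -d "$d" ] || continue
        printf '%s\t%s\tlatency_us=%s\tdisabled=%s\n' \
            "${d##*/}" \
            "$(cat "$d/name")" \
            "$(cat "$d/latency")" \
            "$(cat "$d/disable")"
    done
}

list_idle_states
```

A `disabled=1` entry means the governor will not select that state on that CPU; with default settings every state should show `disabled=0`.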

Which version of Ubuntu are you using? I had a similar issue on a Lenovo W541 with 16.04 and 18.04. Upgrading to 20.04 dropped idle temperatures by 30 degrees and improved SSD thermals as well.
B.Tibell avatar
@matigo I'm using Zorin OS 16 based on Ubuntu 20.04.3, and I have a HP 17x115dx. I've tried several Ubuntu based distros but I've had this freezing issue with all of them, including Ubuntu, Lubuntu, Zorin OS, Linux Mint and Pop OS.
Zorin is very much off-topic here, but those boot options are effectively killing your system's ability to manage idle power usage. You may want to [read this](https://www.kernel.org/doc/html/v5.0/admin-guide/pm/cpuidle.html) and decide if the boot parameters are worth it ...
B.Tibell avatar
Okay.. Thank you, any idea what could be causing the hardware errors and why these boot parameters help to stop them?
Doug Smythies avatar
Please edit your question and add the names of your idle states. Do `grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name`. Also add the processor make and model.
Score:0

The boot parameter idle=poll basically disables the idle system, rendering idle as no-op spin cycles. So yes, you would expect a lot more waste heat, because the CPUs never go to sleep.

Here is an example from my test server, using turbostat:

doug@s19:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 15
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
0.01    938     558     36      1.34    0.68    0.00    0.89
0.02    800     455     36      1.33    0.67    0.00    0.89 <<< All idle states enabled
60.14   4799    109298  47      29.48   28.82   0.00    0.89 <<< transition sample
99.76   4800    180297  47      47.24   46.59   0.00    0.89 <<< All idle states disabled, except poll.
99.76   4800    180311  49      47.65   46.99   0.00    0.89
99.76   4800    180305  49      47.82   47.17   0.00    0.89

Note: the intel_pstate CPU frequency scaling driver "sees" the CPUs as busy, but top does not:

top - 19:23:43 up  7:14,  3 users,  load average: 0.00, 0.00, 0.00
Tasks: 214 total,   1 running, 213 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31936.7 total,  31137.0 free,    312.3 used,    487.5 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  31227.9 avail Mem
B.Tibell avatar
After removing the `idle=poll` parameter, the Hardware errors are back and presumably the random freezes. What exactly do these Hardware errors mean? And is there any other way I can get rid of them, without disabling the idle system?
Doug Smythies avatar
How many idle states do you have? Do `grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable`. Then start to disable them one at a time to see if the MCEs go away. Do (say deepest was 7) `echo 1 | sudo tee /sys/devices/system/cpu/cpu*/cpuidle/state7/disable`. The exact meaning of MCEs can be difficult to determine. What CPU make and model?
B.Tibell avatar
I have 8 idle states. The CPU is an Intel Core i7-7500U.
Doug Smythies avatar
I would try: disable idle state 2; if that doesn't help, then disable HWP with the `intel_pstate=no_hwp` boot parameter.
B.Tibell avatar
Tried both, and the errors still show up. I've noticed that the errors tend to show up AFTER I stop a CPU intensive task. And that also aligns with when my freezes happened, for example right after quitting a game, or other resource intensive task.
B.Tibell avatar
So I tested whether disabling the idle state or adding the boot parameter (after rebooting, of course) helped, by running `stress --cpu 4` for a few minutes and then ending it. After checking the log file, the mce Hardware Error still shows up right after I end the stress test.
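That load-then-idle test can be made repeatable with a rough sketch like the one below. It assumes the `stress` tool is installed and that kernel messages are available through `journalctl -k`; the helper names `count_mce_lines` and `run_repro`, the 120 second load and the 30 second settle time are arbitrary choices for illustration.

```shell
#!/bin/sh
# count_mce_lines: reads kernel log text on stdin and prints how many
# lines contain the "mce: [Hardware Error]" marker seen in the question.
count_mce_lines() {
    # grep -c exits non-zero when the count is 0, so force success.
    grep -c 'mce: \[Hardware Error\]' || true
}

# run_repro: count MCE lines, load 4 CPU workers, let the machine go
# idle again, then count once more and report the difference.
run_repro() {
    before=$(journalctl -k --no-pager | count_mce_lines)
    stress --cpu 4 --timeout 120
    sleep 30
    after=$(journalctl -k --no-pager | count_mce_lines)
    echo "new MCE lines after load stopped: $((after - before))"
}

# Call run_repro by hand after each change to the idle-state settings.
```

A result of 0 after several runs would suggest the current combination of disabled states avoids the error; any other number means the errors are still being triggered on the transition back to idle.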
Doug Smythies avatar
Experiment with disabling idle states. For example all of them from some level and deeper.
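Disabling everything from some level and deeper could be scripted roughly as follows. This is only a sketch, assuming the same per-state `disable` files used in the `tee` command earlier in the comments; the function name `set_deep_states` and its arguments are hypothetical. Writing to the real sysfs files needs root, and the settings do not survive a reboot.

```shell
#!/bin/sh
# set_deep_states MIN VALUE [CPU_ROOT]
# Writes VALUE (1 = disable, 0 = enable) to the disable file of every
# idle state whose index is MIN or higher, on every CPU.
# CPU_ROOT defaults to /sys/devices/system/cpu.
set_deep_states() {
    min="$1"
    value="$2"
    root="${3:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpuidle/state[0-9]*/disable; do
        # Skip the unexpanded glob if nothing matched.
        [ -e "$f" ] || continue
        state_dir="${f%/disable}"        # .../cpuN/cpuidle/stateM
        idx="${state_dir##*/state}"      # M
        if [ "$idx" -ge "$min" ]; then
            echo "$value" > "$f"
        fi
    done
}

# Example, run as root: disable state 4 and deeper, test, then undo.
#   set_deep_states 4 1
#   set_deep_states 4 0
```

Starting with a high MIN and lowering it one step at a time, re-running the load test after each step, narrows down the shallowest state that still triggers the MCEs.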