Ubuntu 20.04.4 LTS randomly freezes

The freeze happens regardless of what I'm doing, whether I'm using the computer or away from it. The keyboard becomes unresponsive and the screens go into power-saving mode.

However, I can access it through SSH from another computer. Sadly I don't know how to diagnose the issue. Any help or hint would be greatly appreciated!

Here are a few details that I hope will help, taken from an SSH session while the computer was frozen. Feel free to ask for additional information.

$ free -h

              total        used        free      shared  buff/cache   available
Mem:           31Gi        13Gi       978Mi       237Mi        16Gi        16Gi
Swap:          30Gi       1,9Gi        28Gi

$ grep -i swap /etc/fstab

UUID=2cd379c8-d157-4eee-a667-12271c8607be  none            swap    sw                         0       0

$ ll /dev/disk/by-uuid/2cd379c8-d157-4eee-a667-12271c8607be

lrwxrwxrwx 1 root root 15 janv. 16 06:34 /dev/disk/by-uuid/2cd379c8-d157-4eee-a667-12271c8607be -> ../../nvme0n1p4

$ sysctl vm.swappiness

vm.swappiness = 60

$ ls -al /var/crash

No files from the date of the freeze.

$ sudo lshw -c video

  *-display                 
       description: VGA compatible controller
       product: Advanced Micro Devices, Inc. [AMD/ATI]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:0c:00.0
       version: c7
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: irq:79 memory:d0000000-dfffffff memory:e0000000-e01fffff ioport:e000(size=256) memory:fcc00000-fccfffff memory:fcd00000-fcd1ffff

$ sudo lsmod | grep -i amd

edac_mce_amd           36864  0
amdgpu               9809920  29
iommu_v2               24576  1 amdgpu
gpu_sched              45056  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    86016  2 amdgpu,drm_ttm_helper
drm_kms_helper        307200  1 amdgpu
gpio_amdpt             20480  0
drm                   618496  15 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm
gpio_generic           20480  1 gpio_amdpt

EDIT

Around the time of the freeze, the /var/log/kern.log file shows the following. Is it relevant?

Jan 16 11:28:32 benj-pc kernel: [17630.400119] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 11:28:38 benj-pc kernel: [17630.400121] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 11:28:38 benj-pc kernel: [17635.530047] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1239598, emitted seq=1239600
Jan 16 11:28:38 benj-pc kernel: [17635.530232] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2804 thread Xorg:cs0 pid 2805
Jan 16 11:28:38 benj-pc kernel: [17635.530385] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 16 11:28:38 benj-pc kernel: [17636.140112] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 11:28:38 benj-pc kernel: [17636.140250] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Jan 16 11:28:38 benj-pc kernel: [17636.440685] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 11:28:38 benj-pc kernel: [17636.440820] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jan 16 11:28:39 benj-pc kernel: [17636.740600] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
Jan 16 11:28:39 benj-pc kernel: [17636.754647] [drm] free PSP TMR buffer
Jan 16 11:28:39 benj-pc kernel: [17636.800093] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17400300 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800102] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17d61000 flags=0x0010]
Jan 16 11:28:39 benj-pc kernel: [17636.800107] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d16a02400 flags=0x0010]
Jan 16 11:28:39 benj-pc kernel: [17636.800112] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d16a03500 flags=0x0010]
Jan 16 11:28:39 benj-pc kernel: [17636.800116] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17401300 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800120] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17402300 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800124] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17404200 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800128] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17403300 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800132] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d17405200 flags=0x0030]
Jan 16 11:28:39 benj-pc kernel: [17636.800136] amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d16a03500 flags=0x0010]
Jan 16 11:28:39 benj-pc kernel: [17636.800223] amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
Jan 16 11:28:39 benj-pc kernel: [17636.800227] amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
Jan 16 11:28:39 benj-pc kernel: [17636.800301] amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
Jan 16 11:28:39 benj-pc kernel: [17636.801277] AMD-Vi: IOMMU event log overflow
Jan 16 11:28:39 benj-pc kernel: [17637.312179] amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 16 11:28:39 benj-pc kernel: [17637.312406] [drm] PCIE GART of 512M enabled (table at 0x00000080008CA000).
Jan 16 11:28:39 benj-pc kernel: [17637.312434] [drm] VRAM is lost due to GPU reset!
Jan 16 11:28:39 benj-pc kernel: [17637.313819] [drm] PSP is resuming...
Jan 16 11:28:40 benj-pc kernel: [17637.511892] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
Jan 16 11:28:42 benj-pc kernel: [17639.666741] [drm] psp gfx command LOAD_ASD(0x4) failed and response status is (0x0)
Jan 16 11:28:42 benj-pc kernel: [17639.666747] [drm:psp_resume [amdgpu]] *ERROR* PSP load asd failed!
Jan 16 11:28:42 benj-pc kernel: [17639.666964] [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Jan 16 11:28:42 benj-pc kernel: [17639.667151] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Jan 16 11:28:42 benj-pc kernel: [17639.667270] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667272] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667292] amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) failed
Jan 16 11:28:42 benj-pc kernel: [17639.667309] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667317] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667323] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667327] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667331] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667337] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667344] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667346] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667348] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667351] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667353] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667356] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667359] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667363] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667366] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667369] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667371] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667375] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667377] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667379] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.667381] [drm] Skip scheduling IBs!
Jan 16 11:28:42 benj-pc kernel: [17639.691588] amdgpu 0000:0c:00.0: amdgpu: GPU reset end with ret = -22
Jan 16 11:28:52 benj-pc kernel: [17649.855833] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=62382, emitted seq=62384
Jan 16 11:28:52 benj-pc kernel: [17649.855833] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=41176, emitted seq=41178
Jan 16 11:28:52 benj-pc kernel: [17649.856020] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 16 11:28:52 benj-pc kernel: [17649.856025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 16 11:28:52 benj-pc kernel: [17649.856173] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 16 11:28:52 benj-pc kernel: [17649.856177] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 16 11:28:52 benj-pc kernel: [17649.856179] amdgpu 0000:0c:00.0: amdgpu: Bailing on TDR for s_job:9d99, as another already in progress
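
For anyone comparing against their own logs, here is a minimal sketch of how the amdgpu error and reset lines could be pulled out of a kernel log like the one above. The function name `amdgpu_errors` is hypothetical, the default path is the /var/log/kern.log file quoted in the question, and reading that file may require sudo.

```shell
# Sketch: print amdgpu error and reset lines from a kernel log.
# amdgpu_errors is a hypothetical helper name for this example.
# The default path is the file quoted in the question; reading it
# may require sudo, depending on the file's permissions.
amdgpu_errors() {
    log_file="${1:-/var/log/kern.log}"
    # Keep lines that mention amdgpu followed by *ERROR* or "GPU reset".
    grep -E 'amdgpu.*(\*ERROR\*|GPU reset)' "$log_file"
}
```

Calling `amdgpu_errors | tail -n 20` would show the most recent matches, which makes it easier to line them up with the time of a freeze.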
guiverc
I suggest applying all available security fixes and upgrades to your system, since the details you provided show it is some time behind. Refer to https://fridge.ubuntu.com/2022/09/01/ubuntu-20-04-5-lts-released/ for the 20.04.5 ISO release date, but note that installed systems got those fixes (*including reporting as 20.04.5*) more than a week before the ISO release date. Try again after applying the system fixes.
Benj
@guiverc thank you for your answer. Do you mean an apt update and upgrade? If so, I did that just after posting this message. Now it's wait and see. Cheers
guiverc
There are many reasons why `apt upgrade` won't install all fixes, which is why `apt full-upgrade` exists. Refer to `man apt` for details; on my more modern release the page says "*full-upgrade performs the function of upgrade but will remove currently installed packages if this is needed to upgrade the system as a whole.*"
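
As a rough sketch of what that looks like in practice, the two commands can be wrapped as below. The function name `full_upgrade` and the `DRY_RUN` variable are assumptions made for this example, not part of apt; only `apt update` and `apt full-upgrade` are real commands.

```shell
# Sketch: refresh the package lists, then apply every pending upgrade.
# full_upgrade and DRY_RUN are hypothetical names for this example:
# when DRY_RUN is non-empty, the commands are printed, not executed.
full_upgrade() {
    run="${DRY_RUN:+echo}"
    # full-upgrade may remove installed packages if that is needed
    # to upgrade the system as a whole (see `man apt`).
    $run sudo apt update &&
    $run sudo apt full-upgrade
}
```

Running `DRY_RUN=1 full_upgrade` only prints the two commands; plain `full_upgrade` runs them. Afterwards, `lsb_release -d` shows which point release the system reports.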
Benj
Thanks, appreciated!