Frequent crashes since upgrade to 23.04

Question

Score:2

Ubuntu

Jan Schejbal

6/18/24, 11:12 AM

Since upgrading to 23.04, I've been getting crashes almost daily. Either my GNOME session terminates and throws me back to the login screen, or a GPU-related crash leaves the screen slowly flashing between a black screen and a text-only screen (unresponsive to keyboard input like CTRL+ALT+F1).

The latter happens particularly often if I try to use Google Maps in Firefox. I have an AMD CPU with built-in GPU, and the logs suggest it has something to do with that:

kernel: [198871.116760] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=3351772, emitted seq=3351774
kernel: [198871.117505] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3623 thread gnome-shel:cs0 pid 3668
kernel: [198871.118214] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
kernel: [198871.268814] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
kernel: [198871.295338] amdgpu 0000:07:00.0: amdgpu: MODE2 reset
kernel: [198871.295395] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [198871.295597] [drm] PCIE GART of 1024M enabled.
kernel: [198871.295599] [drm] PTB located at 0x000000F47FC00000
kernel: [198871.295660] [drm] PSP is resuming...
kernel: [198871.996967] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
kernel: [198872.261894] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: [198872.272774] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [198872.278755] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
kernel: [198872.278899] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
kernel: [198872.278906] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
kernel: [198872.278914] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
kernel: [198872.278921] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
kernel: [198872.279350] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
kernel: [198872.279790] [drm] DMUB hardware initialized: version=0x01010026
kernel: [198872.627457] [drm] kiq ring mec 2 pipe 1 q 0
kernel: [198872.810879] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
kernel: [198872.811161] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
kernel: [198872.811379] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
kernel: [198872.811597] amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
kernel: [198872.811649] amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
kernel: [198872.811652] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
rtkit-daemon[2054]: message repeated 3 times: [ Supervising 14 threads of 11 processes of 1 users.]
firefox_firefox.desktop[6647]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
google-chrome.desktop[5953]: [5992:5992:0525/212139.578910:ERROR:shared_context_state.cc(870)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
google-chrome.desktop[5953]: [5992:5992:0525/212139.579172:ERROR:gpu_service_impl.cc(986)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
gnome-shell[3623]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
gnome-shell[3623]: amdgpu: The process will be terminated.
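
For anyone who wants to check whether their own logs show the same pattern, here is a minimal sketch that scans log lines for the amdgpu ring timeout, the process the driver names, and whether GPU recovery failed. The regular expressions are based only on the wording in the excerpt above and may not match other kernel or driver versions; the function name `summarize_gpu_hangs` is just an illustrative choice.

```python
import re

# Patterns copied from the wording in the log excerpt above (an assumption:
# other kernel/driver versions may phrase these messages differently).
TIMEOUT_RE = re.compile(r"\*ERROR\* ring (\S+) timeout")
PROCESS_RE = re.compile(r"Process information: process (\S+) pid (\d+)")
FAILED_RE = re.compile(r"GPU Recovery Failed: (-?\d+)")


def summarize_gpu_hangs(lines):
    """Return one dict per ring timeout found in an iterable of log lines."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # A new timeout starts a new event; later lines are attached to it.
            events.append({
                "ring": match.group(1),
                "process": None,
                "pid": None,
                "recovery_failed": False,
            })
            continue
        if not events:
            continue  # ignore everything before the first timeout
        match = PROCESS_RE.search(line)
        if match and events[-1]["process"] is None:
            events[-1]["process"] = match.group(1)
            events[-1]["pid"] = int(match.group(2))
            continue
        if FAILED_RE.search(line):
            events[-1]["recovery_failed"] = True
    return events
```

It could be fed a log file directly, for example `summarize_gpu_hangs(open("/var/log/syslog", errors="replace"))`. It only counts and labels hangs; it does not say anything about the cause.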

The former happens even (especially?) if I don't touch the computer, and leaves the following in the syslog:

... gnome-shell[118241]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
... gnome-shell[118241]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
[repeats]
... thunderbird[119653]: Couldn't map window 0x7f716cad7f40 as subsurface because its parent is not mapped.
[repeats]
... kernel: [224847.218436] gnome-shell[118241]: segfault at ffffffffffffff48 ip 00007f0fbe6b5ebb sp 00007ffcf07dc3d8 error 5 in libmutter-clutter-12.so.0.0.0[7f0fbe653000+8b000] likely on CPU 14 (core 7, socket 0)

I'm running Wayland/GNOME/PipeWire, and I'm using an external monitor together with the built-in one.

What's the best way to quickly get my computer to be usable again?

gnome

wayland

amd-graphics

23.04

Followed those instructions and have not had a gnome-shell crash all day. Hopefully on our way to stability.