Score:2

Frequent crashes since upgrade to 23.04


Since upgrading to 23.04, I've been getting crashes almost daily. Either my GNOME session terminates and throws me back to the login screen, or some GPU-related crash leaves the screen slowly flashing between a black screen and a text-only screen, unresponsive to keyboard input such as CTRL+ALT+F1.

The latter happens particularly often if I try to use Google Maps in Firefox. I have an AMD CPU with built-in GPU, and the logs suggest it has something to do with that:

kernel: [198871.116760] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=3351772, emitted seq=3351774
kernel: [198871.117505] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3623 thread gnome-shel:cs0 pid 3668
kernel: [198871.118214] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
kernel: [198871.268814] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
kernel: [198871.295338] amdgpu 0000:07:00.0: amdgpu: MODE2 reset
kernel: [198871.295395] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [198871.295597] [drm] PCIE GART of 1024M enabled.
kernel: [198871.295599] [drm] PTB located at 0x000000F47FC00000
kernel: [198871.295660] [drm] PSP is resuming...
kernel: [198871.996967] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
kernel: [198872.261894] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: [198872.272774] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [198872.278755] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
kernel: [198872.278899] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
kernel: [198872.278906] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
kernel: [198872.278914] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
kernel: [198872.278921] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
kernel: [198872.279350] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
kernel: [198872.279790] [drm] DMUB hardware initialized: version=0x01010026
kernel: [198872.627457] [drm] kiq ring mec 2 pipe 1 q 0
kernel: [198872.810879] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
kernel: [198872.811161] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
kernel: [198872.811379] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
kernel: [198872.811597] amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
kernel: [198872.811649] amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
kernel: [198872.811652] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
rtkit-daemon[2054]: message repeated 3 times: [ Supervising 14 threads of 11 processes of 1 users.]
firefox_firefox.desktop[6647]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
google-chrome.desktop[5953]: [5992:5992:0525/212139.578910:ERROR:shared_context_state.cc(870)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
google-chrome.desktop[5953]: [5992:5992:0525/212139.579172:ERROR:gpu_service_impl.cc(986)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
gnome-shell[3623]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
gnome-shell[3623]: amdgpu: The process will be terminated.
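
For anyone comparing their own logs against the excerpt above, here is a minimal sketch that pulls the hung ring and the submitting process out of the amdgpu timeout lines. The function name `summarize_gpu_hangs` and the regular expressions are my own assumptions based only on the two line formats shown above; other kernel versions may word these messages differently.

```python
import re

# Matches the "ring ... timeout" line that amdgpu logs when a job hangs
# (pattern assumed from the log excerpt in this post).
RING_TIMEOUT = re.compile(r"amdgpu_job_timedout.*ring (?P<ring>\S+) timeout")
# Matches the follow-up line naming the process that submitted the job.
PROCESS_INFO = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def summarize_gpu_hangs(lines):
    """Return one (ring, process, pid) tuple per ring timeout found in lines."""
    hangs = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            # Start a new record; process details arrive on a later line.
            hangs.append([match.group("ring"), None, None])
            continue
        match = PROCESS_INFO.search(line)
        if match and hangs and hangs[-1][1] is None:
            hangs[-1][1] = match.group("process")
            hangs[-1][2] = int(match.group("pid"))
    return [tuple(hang) for hang in hangs]


# The first two lines of the log excerpt above.
sample = [
    "kernel: [198871.116760] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=3351772, emitted seq=3351774",
    "kernel: [198871.117505] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3623 thread gnome-shel:cs0 pid 3668",
]
print(summarize_gpu_hangs(sample))
```

Feeding it the lines of /var/log/syslog (or saved `journalctl -k` output) should show whether the same ring and process are involved every time.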

The former happens even (especially?) if I don't touch the computer, and leaves the following in the syslog:

... gnome-shell[118241]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
... gnome-shell[118241]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
[repeats]
... thunderbird[119653]: Couldn't map window 0x7f716cad7f40 as subsurface because its parent is not mapped.
[repeats]
... kernel: [224847.218436] gnome-shell[118241]: segfault at ffffffffffffff48 ip 00007f0fbe6b5ebb sp 00007ffcf07dc3d8 error 5 in libmutter-clutter-12.so.0.0.0[7f0fbe653000+8b000] likely on CPU 14 (core 7, socket 0)

I'm running Wayland/GNOME/PipeWire, and I'm using an external monitor together with the built-in one.

What's the best way to quickly get my computer to be usable again?

lapisdecor: Are you using the Espresso GNOME extension?
Score:1

Edit #3 (out of order for a reason): this may be our issue and our fix: https://bugs.launchpad.net/ubuntu/+source/mutter/+bug/2012230 It appears the fix was backported to Lunar Lobster on June 13th and should be present on any up-to-date system, but we're still seeing crashes in the last 48 hours, so I'm still trying to ascertain whether we have the "fixed" version of mutter.

Edit #4 (out of order, but that's OK): if our bug is indeed the one linked above, it was fixed in Mantic Minotaur first (mutter 44.2-3) as of July 2nd and is triaged (acknowledged) for Lunar Lobster, but no backported fix has been built yet. Assuming (a bad assumption, but let's go with it for a moment) that both Lunar and Mantic started on mutter 44.2-0 and the fix landed in Mantic in 44.2-3, Lunar should move past 44.2-0 once it gets the fix. Checking this:

apt-cache policy mutter

I see I am still on 44.2-0, so I would reckon there is no fix made generally available just yet for us Lunar users.
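
If you want to repeat that check on several machines, a small sketch like the following parses the `Installed:` and `Candidate:` lines that `apt-cache policy` prints. The helper names `parse_policy` and `mutter_versions` are made up for this example, and the sample text uses illustrative version strings, not output from a real system.

```python
import re
import subprocess


def parse_policy(text):
    """Extract the Installed and Candidate versions from `apt-cache policy` output."""
    versions = {}
    for key in ("Installed", "Candidate"):
        match = re.search(rf"^\s*{key}:\s*(\S+)", text, re.MULTILINE)
        versions[key.lower()] = match.group(1) if match else None
    return versions


def mutter_versions():
    """Run `apt-cache policy mutter` and parse it (Debian/Ubuntu systems only)."""
    result = subprocess.run(
        ["apt-cache", "policy", "mutter"],
        capture_output=True, text=True, check=True,
    )
    return parse_policy(result.stdout)


# Hypothetical sample output; the version strings are for illustration only.
sample = """mutter:
  Installed: 44.2-0ubuntu1
  Candidate: 44.2-0ubuntu1
  Version table:
"""
versions = parse_policy(sample)
if versions["installed"] != versions["candidate"]:
    print("a different mutter package is available:", versions)
else:
    print("no newer mutter package offered:", versions)
```

On a real system, call `mutter_versions()` instead of parsing the sample. When Installed and Candidate match, the archive is not offering anything newer than what is already installed.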

Edit #1: I tried to comment and not answer, but I need 50 rep to comment and no rep I guess to answer, because #reasons. So I apologize, this is not an answer.

Edit #2: July 3rd (2 weeks later). Three desktops (two of mine and my oldest's) are now hitting this Wayland/GNOME crash nearly daily. They all run Lunar Lobster in various states of cleanliness and with various kernels, and all have AMD CPUs and AMD GPUs (all from different families). Tailing syslog shows a different binary fingered each time on the gnome-shell segfault (it has been Chrome and LibreOffice (soffice.bin), for instance), so I don't know what the root cause is. On my work PC (the cleanest, most stock Lunar Lobster) I switched to Xorg and haven't had a crash since (~7 days clean). On my personal PC I switched to Xorg and haven't had a crash in 3 days. That is not enough time to conclude that the crash is isolated to Wayland, but it's something.

Original post:

I have a Ryzen 1700 and a discrete AMD GPU, and I've been having a similar issue on 23.04 for the last few weeks. syslog has the same error as yours (sans unique bits):

kernel gnome-shell segfault at ffffffffffffff48 error 5 in libmutter-clutter-12.so.0.0.0 likely on CPU core socket

but after about 5 seconds my desktop "recovers" by exiting everything and returning me to the login screen. Tailing syslog, I can't tell whether the crash report was successfully sent, so I have no idea if the developers know about this or not. My searches turned up only your post here as having anything to do with this.

Prior to the crash the only thing that stands out is chrome + wayland melting down -

google-chrome.desktop ERROR:wayland_frame_manager.cc(521) The server has buggy presentation feedback. Discarding all presentation feedback requests in all frames except the last 3.
Score:1

update 7/3/2023

I found an answer about upgrading drivers, and today, for the first time in a long time, I have not had a single crash.

I am now running the 6.4 kernel from the Ubuntu mainline builds, which by itself did not fix the crash. I then updated the graphics drivers from here:

https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers

I followed the instructions there and have not had a gnome-shell crash all day. Hopefully we're on our way to stability.

[previous answer] Not an answer, unable to comment as well, sorry.

Exact same issue. I am now running Ubuntu 23.04 with kernel 6.4 rc6 in the hopes there had been some fix, but nothing. I have gone through 6.3 kernels up to 6.3.7 and no go.

Here is dmesg, same exact behavior as described above, crash to login.

[Tue Jun 20 12:54:26 2023] show_signal_msg: 49 callbacks suppressed

[Tue Jun 20 12:54:26 2023] gnome-shell[73273]: segfault at ffffffffffffff48 ip 00007fcdde316ebb sp 00007ffd254ea428 error 5 in libmutter-clutter-12.so.0.0.0[7fcdde2b4000+8b000] likely on CPU 6 (core 3, socket 0)

[Tue Jun 20 12:54:26 2023] Code: 30 48 85 c0 74 09 c3 0f 1f 84 00 00 00 00 00 48 8b 47 68 48 85 c0 75 ee 48 8b 47 28 c3 66 90 f3 0f 1e fa 48 63 05 b5 e7 07 00 <48> 8b 44 38 28 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3

Hardware is Lenovo T14s Gen3 AMD Ryzen Pro 5 with Radeon 680M GPU.
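
One way to check whether two reports are really the same crash is to subtract the library's load address (the number in square brackets) from the instruction pointer. A rough sketch follows; the function name `fault_offset` and the regular expression are my own, written against the line format shown in this thread.

```python
import re

# Pattern assumed from the kernel segfault lines quoted in this thread.
SEGFAULT = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp [0-9a-f]+ "
    r"error \d+ in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library, offset of the faulting instruction inside it), or None."""
    match = SEGFAULT.search(line)
    if match is None:
        return None
    offset = int(match.group("ip"), 16) - int(match.group("base"), 16)
    return match.group("lib"), offset


# The segfault line from the question and the one from my dmesg above.
question_line = (
    "gnome-shell[118241]: segfault at ffffffffffffff48 ip 00007f0fbe6b5ebb "
    "sp 00007ffcf07dc3d8 error 5 in libmutter-clutter-12.so.0.0.0"
    "[7f0fbe653000+8b000] likely on CPU 14 (core 7, socket 0)"
)
my_line = (
    "gnome-shell[73273]: segfault at ffffffffffffff48 ip 00007fcdde316ebb "
    "sp 00007ffd254ea428 error 5 in libmutter-clutter-12.so.0.0.0"
    "[7fcdde2b4000+8b000] likely on CPU 6 (core 3, socket 0)"
)
for line in (question_line, my_line):
    lib, offset = fault_offset(line)
    print(lib, hex(offset))
```

Both lines work out to offset 0x62ebb in libmutter-clutter-12.so.0.0.0, so the two machines are faulting at the same instruction in the library, despite the different hardware.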

Score:0

I am on 23.04 with the same issue: random crashes. At first I thought the SSD was about to die, but I checked it with smartmontools and everything was reported as fine. I updated to the latest proprietary NVIDIA driver (535) and upgraded to kernel 6.4.3. No joy...

However, I noticed that the laptop (HP Pavilion 15-csxxx with an NVIDIA GeForce GTX MX150) always had its fans running very fast, with a lot of heat coming from the sides. I installed lm-sensors and, running sensors from the command line, saw the cores sometimes reach 85°C (185°F) and the ACPI reading reach 95°C (203°F). In those events, which happened randomly, the fans could not dissipate all the heat, there was a thermal cut, and the system crashed, time and again...
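
If you would rather log temperatures over time than watch sensors by hand, the kernel also exposes them under /sys/class/thermal, in millidegrees Celsius. This is only a sketch: the function names are mine, and the 85°C warning threshold is just the figure from my own readings above, not a vendor limit.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zone_name:type": degrees Celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            # The kernel reports the temperature in millidegrees Celsius.
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # Some zones cannot be read; skip them.
        readings[f"{zone.name}:{kind}"] = millidegrees / 1000.0
    return readings


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    HOT_C = 85  # Assumed warning threshold, taken from the readings in this post.
    for name, temp in read_thermal_zones().items():
        marker = "  <-- hot" if temp >= HOT_C else ""
        print(f"{name}: {temp:.1f}°C ({c_to_f(temp):.0f}°F){marker}")
```

Running it from cron or a loop just before a crash would show whether the temperature was climbing at the time.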

Going by forum comments about problems with NVIDIA GPUs and the new 535.x drivers, I decided to replace the proprietary driver with the open-source nouveau driver. And presto! All troubles disappeared. No more crashes and no more rising hardware temperatures, even when I run VirtualBox virtual machines or convert video formats. Temps are stable at around 39°C (102°F). I'll do what other forums suggested: not install any NVIDIA drivers until the upgrade looks fine.
