How to diagnose strange intermittent kernel misbehaviors

See the Event Logs heading farther down.

I'm on Ubuntu Server 21.04 running kernel 5.11.0-1015-raspi on aarch64.

What are the most effective things to prepare to move forward on diagnosing this next time it happens?

Occasionally after heavy use I start getting strange issues such as these:

  • some processes that should be doing nothing show 100% usage of a single core in top (this happened recently with bash scripts looping on inotifywait on dev event files)
  • these and a handful of other processes do not terminate with kill -9 (without this, I would have assumed the loops were spinning because inotifywait was simply exiting immediately)
  • the system may keep services running but ttys may halt processing input or output, including the serial tty
  • swapoff /path/to/swap may hang indefinitely even when no swap space is used anymore
  • systemctl shutdown may hang indefinitely, or the system may partly shut down and then hang
  • usb keyboard lights may stop responding
  • login prompts may wait a very long time after a user is entered, and then hang after displaying only part of the password prompt
  • keystrokes may be dropped
  • sometimes repeated kernel messages on a tty indicating the same hung task
  • When indefinitely nonresponsive, I don't see any kernel panic on an open dmesg --follow, journalctl --follow, or tty
  • The caps lock light specifically appears generally nonfunctional on this machine. The caps lock light also appears nonfunctional on my aarch64 olimex teres.
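
One possible explanation for processes that ignore kill -9 is that they are in uninterruptible sleep (state D in ps), where signals are generally not acted on until the blocking kernel operation returns. This is only a sketch of how to check for that next time, not a diagnosis: it assumes the procps version of ps, and `list_blocked` is a hypothetical helper name.

```shell
#!/bin/sh
# list_blocked: hypothetical helper. Filters "ps" output read on stdin and
# keeps the header row plus every row whose STAT column (field 2) starts
# with D (uninterruptible sleep).
list_blocked() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Intended use on the affected machine (not run by this snippet):
#   ps -eo pid,stat,wchan:32,comm | list_blocked
```

The WCHAN column names the kernel function each process is sleeping in, which may be worth saving alongside the dmesg output when a hang starts.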

I have recently updated the system and hope these issues may decrease, but I'd like to know what more I can do that may help in diagnosing or handling them. I took the effort to plug a serial cable in and was very surprised that the serial terminal itself could hang indefinitely mid-output.

This is usually associated with excessive swap allocation, in excess of available RAM. However, some of the issues, like the strange processes that survive kill -9, suggest more than just memory thrashing to me, and the issues don't go away when memory is freed. That said, I'm not experienced with the Linux kernel.

Ideally I'd like to eventually narrow down the issue to a bug in the kernel, a problem with my hardware, or a compromised system.

Event logs:

2021-08-09

After running systemctl isolate graphical and then systemctl isolate multi-user, systemd-journal is using 99% CPU, flooding the journal with messages that org.gnome.Shell@x11 is pending stop. systemctl status says there is no such service. I attempted journalctl | pastebinit, but the interface stopped responding before I got the URL, I'm afraid.

This doesn't appear to be a virtual memory issue this time, but here are the memory outputs I got before it froze:

free -h: https://paste.ubuntu.com/p/3c5tSTgGc4 (this was taken while it was unswapping; it did finish unswapping)

sysctl vm.swappiness: https://paste.ubuntu.com/p/cpvJw4Nd8f

At 10:29 UTC my tmux session froze. I switched to tty3 and tried to log in. The tty hung displaying the password. At 10:32 UTC the fan spun up high for about 1 minute.

I have an offline system connected to the serial terminal with dmesg open. The last lines concern rfkill, hand-copied onto my mobile phone below:

[225366.651144] md: data-check of RAID array md4
[225724.680213] rfkill: input handler enabled
[225745.716506] rfkill: input handler disabled
[225751.439369] rfkill: input handler enabled
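
The bracketed numbers are seconds since boot, so the gaps between events can be read off by subtraction. A small sketch that does this for any dmesg excerpt follows; `dmesg_gaps` is a hypothetical helper name, and it assumes every line starts with a `[seconds]` timestamp like the ones above.

```shell
#!/bin/sh
# dmesg_gaps: hypothetical helper. Reads dmesg lines of the form
# "[seconds.micros] message" on stdin and prefixes each one with the
# seconds elapsed since the previous timestamped line.
dmesg_gaps() {
    awk '{
        op = index($0, "[")          # position of the opening bracket
        cl = index($0, "]")          # position of the closing bracket
        if (op == 0 || cl <= op) next  # skip lines without a timestamp
        t = substr($0, op + 1, cl - op - 1) + 0
        if (seen) printf "+%.1fs %s\n", t - prev, $0
        else      printf "+0.0s %s\n", $0
        prev = t
        seen = 1
    }'
}

# Intended use (not run by this snippet):
#   dmesg | tail -n 20 | dmesg_gaps
```

For the four lines above, the rfkill messages begin about 358 seconds after the md data-check line, and the three rfkill messages fall within roughly 27 seconds of each other.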

At 10:33 tty3 displayed "Login timed out after 60 seconds." without ever displaying a password prompt. It then hung without displaying another login prompt. What followed:

  • around 10:35: I sent a ^C to the serial tty. It was echoed back to me, but no terminal prompt was output to indicate that dmesg was interrupted.
  • 10:36 or 10:37: the serial tty output (or echoed) a carriage return. No new input. The fan spun up again.
  • 10:39: the serial tty showed a prompt, which processed the pending return key, and hung again.
  • 10:42: I had a serial prompt!
  • 11:00: I was still trying to execute any command at the prompt. It was incredibly slow but was not losing keystrokes from its buffer (which sometimes happens for me).
  • 11:01: the system responded on serial and tty3. It had killed pastebinit due to OOM.

lshw -C memory: https://paste.ubuntu.com/p/x5GMkHRktS

heynnema
Edit your question and show me `free -h` and `sysctl vm.swappiness` and `swapon -s` and `sudo lshw -C memory`. Start comments to me with @heynnema or I'll miss them.
fuzzyTew
@heynnema I got only 2 of your requested commands. I'm trying to get more data but the serial tty is taking over a minute per character, and I make a lot of typos. Is the org.gnome.Shell@x11 service helpful at all?
heynnema
It would be helpful to do `tail /var/log/syslog` to see the last few entries, and see if there's something repeating. Do you have access to an Ubuntu Live Desktop DVD/USB? Can you create one on another system? Boot to it and see how the system responds. I suspect that you have a hardware problem. Maybe even with your RAID.
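
Checking for repeating entries can be sketched as a small filter over the log. This is an illustration only: `top_repeats` is a hypothetical helper name, and it assumes the traditional syslog line layout ("Mon DD HH:MM:SS host tag: message"), which may not match every rsyslog configuration.

```shell
#!/bin/sh
# top_repeats: hypothetical helper. Reads syslog lines on stdin, blanks the
# first four fields (month, day, time, host), and prints the most frequent
# remaining messages with their counts. The optional first argument is the
# number of rows to show (default 10).
top_repeats() {
    awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); print }' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Intended use (not run by this snippet):
#   tail -n 1000 /var/log/syslog | top_repeats 5
```

Messages that differ only by a PID inside the tag are counted separately, so a flood from one service may show up as several rows.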