Score:2

Read-only filesystems across multiple devices


The company I work for has about 100 Ubuntu 18.04 server machines scattered across the United States as part of one of our product lines. We hadn't had ANY issues with these machines for 1-2 years, until this past week. In the past 5 days, six units have had critical errors that ultimately left the filesystem read-only.

I finally got physical access to one of the devices. I found the following in `dmesg`:

EXT4-fs (dm-0): Couldn't remount RDWR because of unprocessed orphan inode list. Please umount/remount instead

Running `fsck.ext4 -n /dev/sda2` (the root partition) reports several orphaned inodes. I'm sure an fsck could fix it, but I'm more interested in what is causing this in the first place.
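For units that are only reachable over SSH, a check along the following lines could flag a root filesystem that has already gone read-only. This is a minimal sketch, assuming the standard `/proc/mounts` layout; the function name `root_mount_mode` is made up for illustration.

```shell
#!/bin/sh
# Print "ro" or "rw" for the root filesystem, reading /proc/mounts-style
# text on stdin. Field 2 is the mount point, field 4 the option list.
# Options are compared whole, so "errors=remount-ro" does not count as "ro".
# If "/" appears more than once, the last entry wins.
root_mount_mode() {
    awk '
        $2 == "/" {
            mode = "rw"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") mode = "ro"
        }
        END { if (mode != "") print mode }
    '
}

# Usage on a live unit:
#   root_mount_mode < /proc/mounts
```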

I have found some kernel errors in the syslog too:


Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937302] BUG: unable to handle kernel paging request at ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937348] IP: kmem_cache_alloc+0x7a/0x1c0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937360] PGD 87d99067 P4D 87d99067 PUD 0 
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937383] Oops: 0000 [#3] SMP PTI
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937395] Modules linked in: ccm intel_rapl intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass snd_hda_codec_hdmi punit_atom_debug joydev iwlmvm snd_hda_codec_realtek intel_cstate snd_hda_codec_generic mac80211 snd_hda_intel iwlwifi snd_hda_codec snd_hda_core snd_hwdep hid_multitouch input_leds cfg80211 snd_pcm ftdi_sio lpc_ich serio_raw snd_timer usbserial btusb cdc_acm btrtl snd mei_txe shpchp mei soundcore hci_uart btbcm btqca btintel rfkill_gpio bluetooth ecdh_generic pwm_lpss_platform pwm_lpss mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937571]  raid0 multipath linear hid_generic usbhid i915 crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel cryptd syscopyarea sysfillrect igb sysimgblt psmouse fb_sys_fops dca i2c_algo_bit drm ptp pps_core ahci libahci video i2c_hid hid
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937646] CPU: 0 PID: 1212 Comm: uwsgi Tainted: G      D          4.15.0-151-generic #157-Ubuntu
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937657] Hardware name: Winmate Inc. IB3S/IB32S, BIOS V210 05/06/2019
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937676] RIP: 0010:kmem_cache_alloc+0x7a/0x1c0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937689] RSP: 0018:ffffb7b6c1207d58 EFLAGS: 00010286
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937703] RAX: ffff93cdf5ef2eaa RBX: 0000000000000000 RCX: 0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937715] RDX: 0000000000009791 RSI: 00000000014080c0 RDI: 0000440bc0024800
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937727] RBP: ffffb7b6c1207d88 R08: ffffd7b6bfc24800 R09: ffff93aaf1400c00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937738] R10: 0000000000000010 R11: 0000000000026d00 R12: ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937750] R13: 00000000014080c0 R14: ffff93aafb017800 R15: ffff93aaf1405e00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937765] FS:  00007fe86c207740(0000) GS:ffff93aaffc00000(0000) knlGS:0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937789] CR2: ffff93cdf5ef2eaa CR3: 00000001314ce000 CR4: 00000000001006f0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937800] Call Trace:
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937824]  ? __delayacct_tsk_init+0x1e/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937844]  __delayacct_tsk_init+0x1e/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937868]  copy_process.part.35+0x6d3/0x1c00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937887]  ? __handle_mm_fault+0xa21/0xff0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937911]  _do_fork+0xdf/0x400
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937931]  ? __do_page_fault+0x2a1/0x4b0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937951]  ? get_unused_fd_flags+0x30/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937971]  SyS_clone+0x19/0x20
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937990]  do_syscall_64+0x73/0x130
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938009]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938025] RIP: 0033:0x7fe86a002b7c
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938036] RSP: 002b:00007fff26bfcc60 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938052] RAX: ffffffffffffffda RBX: 00007fff26bfcc60 RCX: 00007fe86a002b7c
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938063] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938075] RBP: 00007fff26bfccd0 R08: 00007fe86c207740 R09: 00007fe86a5cab40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938086] R10: 00007fe86c207a10 R11: 0000000000000246 R12: 0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938098] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000001abacf8
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938113] Code: 50 08 65 4c 03 05 0f d5 1b 4d 49 83 78 10 00 4d 8b 20 0f 84 09 01 00 00 4d 85 e4 0f 84 00 01 00 00 49 63 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 49 33 9f 40 01 00 00 48 89 c1 48 0f c9 4c 89 e0 48 31 
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938259] RIP: kmem_cache_alloc+0x7a/0x1c0 RSP: ffffb7b6c1207d58
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938269] CR2: ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938284] ---[ end trace 5841e09627f12869 ]---
Jul 26 19:46:35 xxxxxxx kernel: [167923.077278] BUG: unable to handle kernel paging request at ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077295] IP: down_write+0x1f/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077298] PGD a0599067 P4D a0599067 PUD 0 
Jul 26 19:46:35 xxxxxxx kernel: [167923.077304] Oops: 0002 [#2] SMP PTI
Jul 26 19:46:35 xxxxxxx kernel: [167923.077308] Modules linked in: ccm arc4 snd_hda_codec_hdmi iwlmvm snd_hda_codec_realtek snd_hda_codec_generic mac80211 intel_rapl intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp kvm_intel joydev kvm irqbypass punit_atom_debug intel_cstate iwlwifi snd_hda_intel snd_hda_codec ftdi_sio serio_raw hid_multitouch snd_hda_core lpc_ich cfg80211 input_leds mei_txe snd_hwdep snd_pcm usbserial btusb btrtl mei snd_timer snd cdc_acm soundcore shpchp hci_uart btbcm btqca btintel bluetooth rfkill_gpio pwm_lpss_platform pwm_lpss ecdh_generic mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Jul 26 19:46:35 xxxxxxx kernel: [167923.077360]  raid0 multipath linear hid_generic usbhid i915 igb drm_kms_helper dca ahci i2c_algo_bit crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt ghash_clmulni_intel ptp cryptd fb_sys_fops psmouse pps_core libahci drm i2c_hid video hid
Jul 26 19:46:35 xxxxxxx kernel: [167923.077381] CPU: 2 PID: 22792 Comm: uwsgi Tainted: G    B D W        4.15.0-151-generic #157-Ubuntu
Jul 26 19:46:35 xxxxxxx kernel: [167923.077384] Hardware name: Winmate Inc. IB3S/IB32S, BIOS V210 05/06/2019
Jul 26 19:46:35 xxxxxxx kernel: [167923.077389] RIP: 0010:down_write+0x1f/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077392] RSP: 0018:ffffb4e7018cfd10 EFLAGS: 00010246
Jul 26 19:46:35 xxxxxxx kernel: [167923.077396] RAX: ffff994c94603766 RBX: ffff994c94603766 RCX: 0000000000027f57
Jul 26 19:46:35 xxxxxxx kernel: [167923.077398] RDX: ffffffff00000001 RSI: 0000000001000200 RDI: ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077401] RBP: ffffb4e7018cfd18 R08: ffffd4e6ffd292c0 R09: ffff996d60d7e4e0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077404] R10: 00007f220ffec000 R11: ffff996d70adde00 R12: ffff994c9460375e
Jul 26 19:46:35 xxxxxxx kernel: [167923.077407] R13: ffff996d54325ec0 R14: ffff994c9460375e R15: ffff996df104f000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077410] FS:  00007f221338d740(0000) GS:ffff996dffd00000(0000) knlGS:0000000000000000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 19:46:35 xxxxxxx kernel: [167923.077416] CR2: ffff994c94603766 CR3: 00000000943ba000 CR4: 00000000001006e0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077419] Call Trace:
Jul 26 19:46:35 xxxxxxx kernel: [167923.077428]  anon_vma_clone+0x8f/0x1c0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077433]  anon_vma_fork+0x32/0x130
Jul 26 19:46:35 xxxxxxx kernel: [167923.077440]  copy_process.part.35+0xfe1/0x1c00
Jul 26 19:46:35 xxxxxxx kernel: [167923.077446]  _do_fork+0xdf/0x400
Jul 26 19:46:35 xxxxxxx kernel: [167923.077454]  ? __do_page_fault+0x2a1/0x4b0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077460]  ? get_unused_fd_flags+0x30/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077465]  SyS_clone+0x19/0x20
Jul 26 19:46:35 xxxxxxx kernel: [167923.077471]  do_syscall_64+0x73/0x130
Jul 26 19:46:35 xxxxxxx kernel: [167923.077475]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
Jul 26 19:46:35 xxxxxxx kernel: [167923.077479] RIP: 0033:0x7f2211188b7c
Jul 26 19:46:35 xxxxxxx kernel: [167923.077482] RSP: 002b:00007fff81411ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Jul 26 19:46:35 xxxxxxx kernel: [167923.077486] RAX: ffffffffffffffda RBX: 00007fff81411ac0 RCX: 00007f2211188b7c
Jul 26 19:46:35 xxxxxxx kernel: [167923.077488] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Jul 26 19:46:35 xxxxxxx kernel: [167923.077491] RBP: 00007fff81411b30 R08: 00007f221338d740 R09: 00007f2211750b40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077494] R10: 00007f221338da10 R11: 0000000000000246 R12: 0000000000000000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077497] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000001735cf8
Jul 26 19:46:35 xxxxxxx kernel: [167923.077500] Code: 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 9e d7 ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 73 b5 fe ff 65 48 8b 04 25 00 5c 
Jul 26 19:46:35 xxxxxxx kernel: [167923.077534] RIP: down_write+0x1f/0x40 RSP: ffffb4e7018cfd10
Jul 26 19:46:35 xxxxxxx kernel: [167923.077537] CR2: ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077541] ---[ end trace 4d3c04fc4bbb2b33 ]---

There are others that I can post too if needed.
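When comparing logs from several units, it may help to reduce each oops to one line. The sketch below assumes the 4.15-era log format shown above, where the `BUG:` line is followed by an `IP:` line; `summarize_oops` is a hypothetical helper name.

```shell
#!/bin/sh
# Read syslog text on stdin and print one line per paging-request oops:
#   <month> <day> <time> <faulting function+offset> <bad address>
summarize_oops() {
    awk '
        # Remember the faulting address from the BUG line.
        /BUG: unable to handle kernel paging request/ { addr = $NF; next }
        # The first " IP: " line after it names the faulting function.
        # (" RIP: " lines do not match because of the leading space.)
        / IP: / && addr != "" { print $1, $2, $3, $NF, addr; addr = "" }
    '
}

# Usage on a unit:
#   summarize_oops < /var/log/syslog
```

For the first trace above, this would print `Jul 27 12:35:09 kmem_cache_alloc+0x7a/0x1c0 ffff93cdf5ef2eaa`.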

I'm also seeing this on boot frequently:

[ FAILED ]Failed to start host name service
See systemctl status systemd-hostnamed.service for details
...
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution

We've seen this all over the country within just the last 5 days, so I don't think it is hardware or environment related. We haven't released any updates to our software in a few weeks (and some of our clients ignore our software updates anyway).

Does anyone have any thoughts on what could be causing this and how to prevent it? Thanks!

Edit 1: results of ls -la /boot

total 143024
drwxr-xr-x  3 root root     4096 Jul 23 06:35 .
drwxr-xr-x 24 root root     4096 Jul 22 06:57 ..
-rw-r--r--  1 root root   217414 Jun 18 16:49 config-4.15.0-147-generic
-rw-r--r--  1 root root   217414 Jul  9 20:19 config-4.15.0-151-generic
drwxr-xr-x  5 root root     4096 Jul 23 06:34 grub
-rw-r--r--  1 root root 60458100 Jul 20 20:08 initrd.img-4.15.0-147-generic
-rw-r--r--  1 root root 60462046 Jul 23 06:35 initrd.img-4.15.0-151-generic
-rw-------  1 root root  4082393 Jun 18 16:49 System.map-4.15.0-147-generic
-rw-------  1 root root  4082629 Jul  9 20:19 System.map-4.15.0-151-generic
-rw-------  1 root root  8449696 Jun 18 18:42 vmlinuz-4.15.0-147-generic
-rw-------  1 root root  8453792 Jul  9 20:23 vmlinuz-4.15.0-151-generic

results of free -h

              total        used        free      shared  buff/cache   available
Mem:           3.7G        165M        3.2G        6.7M        435M        3.4G
Swap:            0B          0B          0B

swapon -s yielded no results

results of sysctl vm.swappiness

vm.swappiness = 60

Edit 2:

Found this bug report pertaining to the -151 kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1938013

I also pulled out an old unit and tested it thoroughly on 4.15.0-142-generic. I then updated it to -151 and was able to induce an error using nmcli to attempt a wifi connection. After a reboot into -142, I could no longer induce the error. I still have more tests to do on the original unit and will post when done.
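To find which field units are actually running the suspect kernel, a small check like this could be run over SSH. It is only a sketch: the function name `is_suspect_kernel` is hypothetical, and the pattern simply matches the `4.15.0-151` release string seen in the traces above.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given kernel release string is a
# 4.15.0-151 build, fail otherwise.
is_suspect_kernel() {
    case "$1" in
        4.15.0-151-*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Usage on a unit:
#   is_suspect_kernel "$(uname -r)" && echo "running -151"
```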

heynnema:
The "BUG: unable to handle kernel paging request at ffff93cdf5ef2eaa" is the problem. It may not be fixable. It **MAY** be a BIOS issue. Check your BIOS version with `sudo dmidecode -s bios-version` and go to the manufacturer's web site to check for a newer version. It **MAY** be a kernel issue. Check if your kernel has recently been updated with `ls -al /boot`. Try booting to an older kernel and see if it helps with the paging error. Run `memtest`. And, of course, do the `fsck`.
heynnema:
Edit your question and show me `ls -al /boot` and `free -h` and `swapon -s` and `sysctl vm.swappiness`. Start comments to me with @heynnema or I'll miss them.
JPetersonVNL:
@heynnema Thanks for the suggestions! I've posted the results of those commands
JPetersonVNL:
@heynnema I booted into the older -147 kernel and it didn't fix anything, but I'm guessing the damage to the filesystem was already done. If -151 is causing the kernel errors that damage the filesystem, then maybe I just need to avoid -151. How can I prevent the -151 update on other units in the field? I have SSH access but not physical access.
heynnema:
Did this problem begin on or about Jul 23? That's when the -151 kernel was installed. It's too early to tell if -151 is the problem, but I'm starting to get a feeling from other reports that it may be. Booting to -147 won't fix file system errors that are already there. Boot to an Ubuntu Live USB/DVD and do the `fsck`, then reboot to -147 and see if you continue to get page fault errors. Did you check your BIOS? Did you run `memtest`?
heynnema:
Also, why no swap?
heynnema:
Actually, it was Jul 9, not Jul 23.
JPetersonVNL:
@heynnema Yes, the problem was first reported to us on the 26th. The BIOS is relatively recent and there are no known issues with its version. Memtest is running now, and then I also want to run `smartctl` to check the SSD. Once those tests are done I'll fsck it, boot into -147, and test. See my latest edit for a test I did tonight. I'm guessing the no swap is because it's mounted as read-only? Thank you for all your help with this!
heynnema:
No swap may be because it's set up as a server. Do `grep -i swap /etc/fstab` to check. That bug report doesn't appear to have any relation to your page faults. Keep me posted.
Score:1

I don't have definitive confirmation, but I do have quite a bit of observational evidence that this was caused by the Ubuntu 4.15.0-151 kernel release. I could easily reproduce the issue while running -151, but after downgrading to any previous version I could not. One unfortunate side effect was the persistence of the damage. The kernel crash itself was not the direct cause of the read-only filesystem; that was the filesystem damage (orphaned inodes and the like) caused by the kernel crash. This means that even after rolling back to a previous kernel, the damage may already have been done, causing the unit to go read-only even after the rollback. To help with this, after rolling back the kernel I also enabled an automatic fsck on boot. It's been months, and the issue seems to have been resolved. Thanks @heynnema for your help and letting me bounce ideas off ya!
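The exact commands used are not recorded here, so the following is only one possible way to do it, written as a dry run that prints the commands instead of executing them. It assumes the units track the stock `linux-generic` metapackages and that the root filesystem's entry in `/etc/fstab` has a nonzero fsck pass number; the function name `plan_mitigation` and the device argument are illustrative.

```shell
#!/bin/sh
# Print (not run) commands that hold the kernel packages and force a
# filesystem check at every boot. Review the output before running it
# as root on a unit.
plan_mitigation() {
    root_dev=$1   # root filesystem device (assumption: confirm with `findmnt /`)

    # Holding the metapackages keeps apt from pulling in newer kernels.
    echo "apt-mark hold linux-generic linux-image-generic linux-headers-generic"

    # A maximum mount count of 1 makes e2fsck check the filesystem on each boot.
    echo "tune2fs -c 1 $root_dev"
}

# Usage:
#   plan_mitigation /dev/sda2
```

A hold only stops future kernel upgrades. A unit that has already installed -151 would still need to be booted into an older kernel, as described above.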
