Score:0

Recurring need to run fsck because system won't boot

cn flag

Once in a while my Linux system won't boot and gives filesystem errors. I can "fix" them by booting with a LiveCD and running:

sudo fsck -y /dev/sda1

The command says it finds bad blocks and fixes them, then the system will boot again. Does the fact that they keep happening indicate hardware failure, or could there be something else wrong?

I note that when I instead run:

sudo fsck -y /dev/sda

I get these errors:

fsck from util-linux 2.34
[/usr/sbin/fsck.ext2 (1) -- /dev/sda] fsck.ext2 /dev/sda
e2fsck 1.45.5 (07-Jan-2020)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sda

The superblock could not be read or does not describe a valid ext2/ext3/ext4 filesystem.  If the device is valid and it really contains an ext2/ext3/ext4 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>  or
    e2fsck -b 32768 <device>

Found a dos partition table in /dev/sda

Is this because it's invalid to run fsck on the whole disk instead of just one partition, or is there something corrupt on my drive? I've seen many places on the internet giving instructions that run fsck on the whole disk. My disk has only one partition, a Linux ext4 one.

Here is a screenshot of the Disks application's SMART Data & Tests window. [image]

The result of grep -i FPDMA /var/log/syslog* is:

adam>grep -i FPDMA /var/log/syslog*
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [  728.921941] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [  729.213899] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:20 adam-gregs-better-computer kernel: [  729.373884] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:42:40 adam-gregs-better-computer kernel: [  870.000879] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:42:40 adam-gregs-better-computer kernel: [  870.000904] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:05 adam-gregs-better-computer kernel: [  895.312734] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:05 adam-gregs-better-computer kernel: [  895.312760] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:06 adam-gregs-better-computer kernel: [  895.476760] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:06 adam-gregs-better-computer kernel: [  895.640724] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [  938.924872] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [  938.924901] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [  938.924924] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [  938.924945] ata3.00: failed command: WRITE FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:53 adam-gregs-better-computer kernel: [  942.878558] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:53 adam-gregs-better-computer kernel: [  942.878583] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog.1:Sep 18 08:30:43 adam-gregs-better-computer kernel: [   33.579255] ata3.00: failed command: READ FPDMA QUEUED
ru flag
With your system constantly needing a file system check, I would suggest that your disk might be failing, especially since you get bad block notices on every single `fsck` run. I would start backing up your data to another drive and prepare to reinstall on a new disk soon, since a dying disk is a fast way to lose your important data.
heynnema avatar
ru flag
Edit your question and show me screenshots of the `Disks` application **SMART Data & Tests** data window. Resize the window to capture all of the data for the screenshot. Start comments to me with @heynnema or I'll miss them.
cn flag
@heynnema I updated the question with the screenshot.
heynnema avatar
ru flag
Is this a SSD or HDD? How old is it?
heynnema avatar
ru flag
Edit your question and show me `grep -i FPDMA /var/log/syslog*`.
cn flag
@heynnema Done.
cn flag
@heynnema It's an SSD. I'm not exactly sure how old it is - I borrowed it about 2 years ago or so from someone who got a better computer. It's 240GB.
Score:3
uz flag
Jos

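To answer your last question first: fsck is a file system check, not a disk check. /dev/sda is the whole disk, which starts with a partition table rather than a file system, so fsck finds no valid superblock there; the "Found a dos partition table in /dev/sda" line in your output says as much. To check a whole disk, fsck has to check, and possibly repair, each file system on it separately, possibly in parallel.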

Encountering bad blocks at each run of fsck does indicate a hardware failure. The contents of a bad block are copied to an available good block, and then the block is marked as "bad", meaning the file system software will no longer use it. So the number of bad blocks on your disk seems to increase. You may want to verify that you have proper backups.
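
As an illustration of the file-system-versus-disk distinction, the sketch below looks for the ext2/ext3/ext4 magic number (0xEF53) that fsck complains about. The function name has_ext_magic is made up for this example, and it needs read access to the device, for example from a root shell.

```shell
# has_ext_magic DEVICE_OR_IMAGE
# Succeeds if the ext2/3/4 magic number 0xEF53 is found where a
# superblock would be: the primary superblock starts at byte 1024 and
# the 16-bit magic field sits 56 bytes into it (offset 1080), stored
# little-endian as the bytes 53 ef.
has_ext_magic() {
    magic=$(dd if="$1" bs=1 skip=1080 count=2 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "53ef" ]
}
```

On a disk laid out like the one in the question, has_ext_magic /dev/sda1 would be expected to succeed and has_ext_magic /dev/sda to fail, because /dev/sda begins with the partition table rather than a file system.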

heynnema avatar
ru flag
OP has an SSD. The SSD possibly needs a firmware update, or a GRUB tweak. Please see "NCQ errors" in my answer.
Score:1
ru flag

fsck

Let's repair your file system (again)...

  • boot to an Ubuntu Live DVD/USB in “Try Ubuntu” mode
  • open a terminal window by pressing Ctrl+Alt+T
  • type sudo fdisk -l
  • identify the /dev/sdXX device name for your "Linux Filesystem"
  • type sudo fsck -f /dev/sdXX, replacing sdXX with the device name you found earlier
  • repeat the fsck command if there were errors
  • type reboot
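
The "identify the device name" step can also be done with lsblk. This is a minimal sketch that assumes lsblk's raw output (-r, no headings, columns NAME, TYPE, FSTYPE); list_ext_partitions is a hypothetical helper name, not a standard command.

```shell
# list_ext_partitions
# Reads `lsblk -rno NAME,TYPE,FSTYPE` output on stdin and prints the
# /dev path of every partition that holds an ext2/ext3/ext4 file system.
# Rows of TYPE "disk" are skipped, since on a partitioned disk they hold
# the partition table rather than a file system.
list_ext_partitions() {
    awk '$2 == "part" && $3 ~ /^ext[234]$/ { print "/dev/" $1 }'
}

# Typical use on the live system:
#   lsblk -rno NAME,TYPE,FSTYPE | list_ext_partitions
```

Each path it prints is a candidate for the sdXX placeholder in the fsck step.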

Bad blocks and SMART Data

The SMART Data indicates what would normally be a failing HDD. However, we have an SSD that's not too old. We'll look at solving NCQ errors first.

Note: Determine the manufacturer and model # of the SSD, and then visit their web site to check for updated firmware.

Note: Maintain good backups, just in case the SSD is failing.
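
To follow the reallocated-sector count from a terminal instead of the Disks window, something like the following could work. It assumes the smartmontools package is installed and that the drive reports the attribute under smartctl's usual name Reallocated_Sector_Ct; some SSDs label it differently. smart_raw_value is a hypothetical helper.

```shell
# smart_raw_value ATTRIBUTE_NAME
# Reads `smartctl -A` output on stdin and prints the raw value
# (the 10th column of the attribute table) for the named attribute.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical use (needs root):
#   sudo smartctl -A /dev/sda | smart_raw_value Reallocated_Sector_Ct
```

Noting that number down with a date every so often makes it easy to tell whether it is still growing.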

NCQ errors

grep -i FPDMA /var/log/syslog*

/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [  728.921941] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [  729.213899] ata3.00: failed command: READ FPDMA QUEUED

Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.

Open the GRUB configuration with sudo -H gedit /etc/default/grub and change the following line to include this extra parameter. Then run sudo update-grub to write the changes to disk, and reboot. Monitor for hangs and similar problems, and check grep -i FPDMA /var/log/syslog* or dmesg for continued error messages.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"
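
To judge whether the libata.force=noncq change helps, it may be easier to count the failures per day than to read the raw grep output. This is a small sketch; count_fpdma is a hypothetical helper, and it assumes the traditional syslog timestamp format shown above (month and day as the first two fields).

```shell
# count_fpdma LOGFILE...
# Prints one line per day: "<month> <day> <number of failed FPDMA commands>".
# Lines are sorted as text, not chronologically.
count_fpdma() {
    grep -h 'failed command: .* FPDMA QUEUED' "$@" \
        | awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' \
        | sort
}

# Typical use:
#   count_fpdma /var/log/syslog /var/log/syslog.1
```

After rebooting, cat /proc/cmdline should list libata.force=noncq if the GRUB change took effect.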
cn flag
The drive is ADATA SU635. I couldn't find a firmware update on their website. Also the Amazon page said it was first available in January 2020, so maybe it's actually newer than I thought (I must have started using it sometime in 2020). In the process of opening the computer to check its model, I also discovered that it was at a slant due to missing some screws that would keep it in its enclosure, which must have made it move when I tilted the computer at some point. I wonder if that was causing the problem? I screwed it in and we'll see if the issues keep happening.
heynnema avatar
ru flag
@user2596667 Go ahead and follow the steps in my answer to try to solve the problem.
cn flag
I'd rather wait to see if screwing in the drive fixed things. So far no NCQ errors have appeared since then. If some do or if it fails again then I'll try your suggested steps.
cn flag
Could you also elaborate on why it's needed to repair the filesystem again with fsck, since I just did run it and fixed errors? Is it because the -f option is important, or because it's necessary to keep re-running it until there are no errors? Also what specifically in my screenshot indicates a failing drive, and what is different about an SSD that makes it potentially fixable where a mechanical drive wouldn't be?
heynnema avatar
ru flag
@user2596667 You need to run `fsck` again because that's been the primary fix, and because it keeps finding errors. The -f just forces the check to occur, even if the file system is marked clean. If you look at the SMART Data, the Reallocated Sector Count, Reported Uncorrectable Errors, Reallocation Count, UDMA CRC Error Rate, and Read Error Retry Rate are all non-zero. An SSD failure is an electronic failure; an HDD failure is usually a physical media error.
cn flag
OK thanks. I'm still not sure I fully understand why it's OK for SSDs to have some errors, but I found [this](https://www.crucial.com/support/articles-faq-ssd/my-ssd-has-bad-sectors) website which says that the important point is not whether there are bad sectors, but rather whether they are increasing over time. So I'll monitor whether there are any new bad sectors that appear now that I've physically secured the drive and run fsck -f.
cn flag
I did get a new NCQ error, and checked the Disks application again and noticed a few more bad sectors (but no crashes or problems, so I wouldn't have noticed it without monitoring, thanks!). So now I've implemented your suggestion of enabling libata.force=noncq. We'll see if any more bad sectors appear now that this option is enabled. I ran fsck again and it found no new errors. The bad sectors are up to 1880 now.
cn flag
I got another boot failure and more bad sectors (up to 1952 now). I also got a weird message when trying to boot: `mount: mounting /run on /root/run failed: Bad message` `[!!!!!!] Failed to mount API filesystems.` I re-ran fsck again to be able to boot again, but since I had libata.force=noncq and still got problems, I must conclude that it is in fact a failing drive.
heynnema avatar
ru flag
@user2596667 Yup, sounds like a bad drive... unless this is a desktop computer, and then the power supply might be suspect too.
cn flag
It is a desktop computer, but it has another SSD drive that has 0 bad sectors.