Score:-1

fsck.mode=force is NOT checking the file system


I have been having hard drive issues on Ubuntu 18.04 where the system randomly remounts my root partition (/dev/sda4) as read-only due to errors.

dmesg | grep 'I/O error' reveals obvious problems with sda4. I don't have the exact output right now, as the box was successfully rebooted and isn't having issues at the moment.

My plan was to run a file system check on the file system. I followed this answer as well as this tutorial carefully. In the latter tutorial I used the section titled: "How to force fsck to check filesystem after system reboot on Linux when using systemd"

After reboot, however, the file system is NOT checked as revealed by the output of this command:

tune2fs -l /dev/sda4 | grep checked             
Last checked:             Sat Nov 21 15:36:56 2020

I have tried these variations of the GRUB CMDLINE but they have been unsuccessful:

GRUB_CMDLINE_LINUX_DEFAULT="maybe-ubiquity fsck.mode=force"

and

GRUB_CMDLINE_LINUX_DEFAULT="maybe-ubiquity fsck.mode=force fsck.repair=yes"

And yes, I did run update-grub. What am I missing?
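For the record, here is the sanity check I use after reboot (a sketch; the journal grep assumes a systemd-based boot):

```shell
# Confirm the flag actually made it onto the kernel command line
grep -o 'fsck\.mode=force' /proc/cmdline || echo 'flag missing!'

# Check whether systemd-fsck ran during this boot
journalctl -b | grep -i systemd-fsck
```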

Output of smartctl -a /dev/sda:

Device Model:     Samsung SSD 860 EVO 250GB
Serial Number:    S59WNG0MA22770K
LU WWN Device Id: 5 002538 e70265a2a
Firmware Version: RVT03B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   Unknown(0x09fc), ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May  3 11:35:14 2023 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  85) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   038   038   010    Pre-fail  Always       -       311
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21420
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       14
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       2041
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   038   038   010    Pre-fail  Always       -       311
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   067   065   000    Old_age   Always       -       33
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       10
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       166281511800

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

UPDATE:

The server crashed again this morning (it's still up but / is mounted as read-only) and here is what I see in dmesg:

dmesg | grep sda

[70547.166349] sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[70547.166354] sd 0:0:0:0: [sda] tag#13 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[70948.441912] sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[70948.441918] sd 0:0:0:0: [sda] tag#15 CDB: Read(10) 28 00 1a cb 1c 00 00 00 08 00
[70948.441922] print_req_error: I/O error, dev sda, sector 449518592
[70948.442312] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[70948.442315] sd 0:0:0:0: [sda] tag#16 CDB: Read(10) 28 00 1a cb 1c 00 00 00 08 00
[70948.442316] print_req_error: I/O error, dev sda, sector 449518592
[70948.442955] sd 0:0:0:0: [sda] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[70948.442960] sd 0:0:0:0: [sda] tag#17 CDB: Read(10) 28 00 0e 0b 03 08 00 00 20 00
[70948.442962] print_req_error: I/O error, dev sda, sector 235602696
[70948.443389] sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[70948.443393] sd 0:0:0:0: [sda] tag#18 CDB: Read(10) 28 00 0e 0b 03 08 00 00 08 00
[70948.443396] print_req_error: I/O error, dev sda, sector 235602696
[72347.211525] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[72347.211531] sd 0:0:0:0: [sda] tag#19 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[74147.255746] sd 0:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[74147.255752] sd 0:0:0:0: [sda] tag#21 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[75947.299631] sd 0:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[75947.299637] sd 0:0:0:0: [sda] tag#23 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[77747.345291] sd 0:0:0:0: [sda] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[77747.345297] sd 0:0:0:0: [sda] tag#25 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[79547.389376] sd 0:0:0:0: [sda] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[79547.389382] sd 0:0:0:0: [sda] tag#27 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[81347.432593] sd 0:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[81347.432598] sd 0:0:0:0: [sda] tag#29 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00

I do realize the drive needs to be replaced, but my goal is simply to run fsck on the root partition.

Nmath:
It should be noted that forcing a fsck at every boot is not going to solve the underlying problem that is causing your file system to become corrupted or read-only. Your description has all the characteristics of a failing hard drive, but similar issues can also be caused by errant/misconfigured/broken software. Be aware that your release has [fewer than 30 days left of community support](https://ubuntu.com//blog/18-04-end-of-standard-support). It may be a good idea to reinstall a newer release. That won't fix a hardware problem, but it could solve software-based issues.
codemonkey:
Unsure how you came to the conclusion that I planned to force an fsck on *EVERY* boot, but thank you for the tip.
Nmath:
If that's incorrect, then it's very unclear what you are asking, because all of your resources, attempts, and narrative indicate that you want it to run when your system reboots. If that isn't what you want, please edit the question to make clear what you actually want to do.
Score:1

I don't know how to force the fsck using the solution you're trying, but I can suggest an alternate solution:

Use tune2fs and set a very low maximum mount count and a very low check interval

# To see current settings
sudo tune2fs -l /dev/sda4
# To alter it
sudo tune2fs -c 1 -i 1d /dev/sda4 

This will force a check after every single mount, or after one day since the last check, whichever happens sooner.
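To confirm the new limits actually took effect, the same tune2fs -l listing can be filtered (a sketch; field names as printed by e2fsprogs):

```shell
sudo tune2fs -l /dev/sda4 | grep -E 'Mount count|Maximum mount count|Check interval'
# Should now show a maximum mount count of 1 and a check interval of 86400 (1 day)
```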

Check SMART

As others have said, this is just a bandaid for HW problems. Sometimes the HDD is dying; at other times it's an unrelated HW problem (perform a memtest); at other times it's just a loose SATA cable (unplug and replug it at both ends; if that doesn't fix it, try another cable).

Beware: in the worst-case scenario the PSU is malfunctioning and damaging the rest of the HW (in that case, replacing the HDD will only fix the problem temporarily, because over time the new HDD will be damaged by the PSU too). Check that the voltages are within acceptable levels.

Posting the output of SMART:

sudo smartctl -a /dev/sda

can help diagnose what might be going on.

Update

I don't know why you can't run the fsck via tune2fs either.

But I saw your SMART. According to it your disk is aging, but appears to be healthy.

The problem may be somewhere else, like the SATA cable.

If you can't make fsck work, then all I can suggest is to boot from a live USB and run the command by hand.
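From the live USB session the check can be run directly, since the root partition is then unmounted (a sketch; assumes ext4 on /dev/sda4, with -y auto-answering the repair prompts):

```shell
# Never run this on a mounted filesystem
sudo e2fsck -f -y /dev/sda4
```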

Update 2

OK, you posted the dmesg messages. We have conflicting bits of information coming from SMART and the OS, so I'll go through it in detail.

Bad blocks

SMART says your drive has bad blocks. This is normal for any SSD as it ages; the drive reallocates the data into spare blocks. Once it runs out of spares, the drive needs to be replaced.

SMART says the amount of bad blocks is within "normal": The most important attributes to see here are Reallocated_Sector_Ct and Runtime_Bad_Block.

It says it detected 311 bad blocks, and reallocated 311 into spare. This is good. If there had been 311 bad blocks but only 310 reallocations, it means the data in one of the blocks was lost.

What matters is the "normalized" value (038): this is how the manufacturer tells you what they consider normal, on a scale where 100 means perfect and 0 means really bad.

Right now it's at 38, which is saying "this is getting bad"; but the manufacturer says it's OK as long as that value stays above 010 (the THRESHold).

Here we have our first piece of conflicting information: Used_Rsvd_Blk_Cnt_Tot says the reserve hasn't been touched at all, despite the bad blocks. It doesn't add up.

But I wouldn't be surprised if the firmware just doesn't track this value despite reporting it, so we'll ignore this for now.

Wear Levelling

This is the most problematic attribute to interpret. Wear_Leveling_Count is at 001, and normally a value of 1 means your drive is dead and must be replaced ASAP: it has run out of spare blocks.

But there have been firmware bugs where this attribute is reported backwards, so that a value of 1 means the drive is at 99% health.

Using a TBW calculator I entered your Total_LBAs_Written with a 512-byte sector size and got that your drive has 77.43 TiB written. According to Google your model should be rated for 150 TBW, so it should still be viable.
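The arithmetic behind that figure, using the Total_LBAs_Written raw value from your SMART output and the reported 512-byte sector size:

```shell
# Total_LBAs_Written × sector size, converted to TiB (2^40 bytes)
awk 'BEGIN { printf "%.2f TiB\n", 166281511800 * 512 / 2^40 }'
# → 77.43 TiB
```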

I'm afraid the best solution here is to spin up a Windows box and run CrystalDiskInfo, which accounts for these firmware bugs (using an internal database) and will give you a much more accurate health assessment.

Given that your SMART says SMART overall-health self-assessment test result: PASSED, I'm inclined to believe it means to say 99%, not 1%.

But if I'm wrong, we can stop here: the disk must be replaced.

Cable problems / Motherboard problems

The errors in Linux's dmesg basically say the kernel sent reads to the disk and they failed.

The kernel even tried to read sector 235602696 twice (first 32 sectors, then just 8, per the transfer-length byte in the CDB) and both attempts failed:

  • 28 00 0e 0b 03 08 00 00 20 00
  • 28 00 0e 0b 03 08 00 00 08 00

If the disk says there are no errors but the OS says there are, then the data was corrupted in transit. Normally this indicates:

  • SATA Cable is loosely plugged
  • SATA Cable is damaged
  • Power Cable is loosely plugged
  • Power Cable is damaged
  • Motherboard bus failure
  • PSU failure
  • RAM failure

But here's where we have our second source of conflicting information: UDMA_CRC_Error_Count is 0.

This means the disk never detected a single error caused by a bad/loose cable or a bad motherboard bus.

This is just very unlikely. SMART says the disk is fine and that the commands arriving from the OS were never corrupted by bad wiring; yet the OS couldn't even read the same sector twice in a row.

The only thing I can think of that would make this possible is bad RAM. Or an extremely unlikely cable problem where all the data going into the disk arrives intact, but the data coming out of it gets corrupted.

Course of action

My gut tells me the disk is bad. But:

  1. Backup all the data to another disk. From a LiveUSB (with an external USB drive big enough), run:
sudo apt install zstd

# To backup
sudo zstd -16v < /dev/sda > /media/external_disk/backup_file.zst

# To restore (don't do this in step 1, see step 6)
sudo zstdcat -v /media/external_disk/backup_file.zst > /dev/sda
  2. Backup the data again, but this time as a regular file copy (if the disk dies, it's much easier to recover from a simple backup than to loop-mount a compressed zstd image of a disk and read the files from that).
  3. Reboot and run a memtest to rule out RAM errors.
  4. Shutdown, open the case, and unplug and replug the SATA and power (to drive) cables. Check that they're not damaged; possibly replace them.
  5. Boot the LiveUSB again and perform a secure wipe of the disk. If there is something buggy going on with your drive, perhaps this will reset it back to a working condition (or, if the disk is beyond salvation, it will be the last command it ever runs). This should take several minutes:
sudo blkdiscard -s /dev/sda
  6. If things went well so far, restore your backup with the sudo zstdcat command from step 1.
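If you ever do need individual files out of the compressed image, a sketch of the loop-mount route mentioned above (paths and partition numbers are illustrative):

```shell
# Decompress to a raw image first (needs free space for the full disk)
zstdcat /media/external_disk/backup_file.zst > /tmp/disk.img

# Attach the image with partition scanning, then mount the root partition
sudo losetup -fP --show /tmp/disk.img    # prints the device, e.g. /dev/loop0
sudo mount /dev/loop0p4 /mnt

# When done:
sudo umount /mnt && sudo losetup -d /dev/loop0
```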

If the disk still has problems and memtest succeeded, then personally I'd just rule the disk as bad.

We can't ignore that a value of 038 in Reallocated_Sector_Ct means things are getting bad, despite the manufacturer saying it's not "that" bad yet.

Ah! Important: if at some point you left the disk powered off for more than 3 months, this scenario is quite possible. Despite popular belief, NAND cells can lose their data if left unpowered for too long ("too long" can be anywhere from 7 days to 7 years, but the most common case is 3 months). Especially if they're old.

If this happened to you, then just perform the above steps: backup the data, secure wipe the disk, restore the backup.

Good luck.

codemonkey:
This has already been tried to no avail as suggested by this answer: https://askubuntu.com/a/1352782/248914 As for the output of the `smartctl` command, I have updated my question.
codemonkey:
Thanks for the update. The box did crash again this morning and I was able to grab some logs from `dmesg`. I did update my question with it. Can you give an opinion on what you see there?
Matias N Goldberg:
Updated my reply with a detailed explanation of what I saw. All I can say is good luck nailing down the culprit and if you're lucky perhaps you can still use the disk for some more time.