I don't know how to force the fsck using the solution you're trying, but I can suggest an alternate solution:
Use tune2fs
Limit the mount count and the check interval to very low values:
# To see current settings
sudo tune2fs -l /dev/sda4
# To alter it
sudo tune2fs -c 1 -i 1d /dev/sda4
This will force a check after every single mount, or after 1 day since the last check, whichever happens sooner.
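To verify the new settings took effect, and to relax them again once the problem is solved, something like the following should work; the 30 mounts / 6 months values are just an example, not a recommendation:
# Confirm the new limits
sudo tune2fs -l /dev/sda4 | grep -iE 'mount count|check interval|next check'
# Later, to relax the checks again (e.g. every 30 mounts or every 6 months)
sudo tune2fs -c 30 -i 6m /dev/sda4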
Check SMART
As others have said, this is just a band-aid for HW problems. Sometimes the HDD is dying, sometimes it's an unrelated HW problem (perform a memtest), and sometimes it's just a loose SATA cable (unplug and replug it at both ends; if that doesn't fix it, try another cable).
Beware of the worst-case scenario: the PSU is malfunctioning and damaging the rest of the HW (in that case, replacing the HDD will only fix the problem temporarily, because over time the new HDD will be damaged by the PSU too).
Check the voltages are within acceptable levels.
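If you want a rough software-side reading of the voltages, lm-sensors can show them on many boards (readings vary a lot by motherboard, so treat them only as a hint; the BIOS hardware monitor or a multimeter is more reliable):
sudo apt install lm-sensors
sudo sensors-detect --auto
sensors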
Posting the output of SMART:
sudo smartctl -a /dev/sda
can help diagnose what might be going on.
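You can also ask the drive to run its own short self-test and then read back the result (assuming the drive supports self-tests, which almost all do):
# Starts a self-test in the background; takes a couple of minutes
sudo smartctl -t short /dev/sda
# Once it's done, show the self-test log
sudo smartctl -l selftest /dev/sda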
Update
I don't know why you can't run the fsck via tune2fs either.
But I saw your SMART. According to it, your disk is aging but appears to be healthy.
The problem may be somewhere else, like the SATA cable.
If you can't make fsck work, then all I can suggest is to boot from a live USB and run the command by hand.
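Roughly, from the live session that would look like this (assuming /dev/sda4 is the partition from before and that it is not mounted):
# The partition must NOT be mounted; -f forces the check even if it looks clean
sudo fsck -f /dev/sda4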
Update 2
OK, you posted the dmesg messages. We have conflicting bits of information coming from SMART and the OS, so I'll go through it in detail.
Bad blocks
SMART says your drive has bad blocks. This is normal for any SSD as it gets old, and the drive will reallocate the data into spare blocks. Once it runs out of spares, the drive needs to be replaced.
SMART says the number of bad blocks is within "normal" limits. The most important attributes to look at here are Reallocated_Sector_Ct and Runtime_Bad_Block.
It says it detected 311 bad blocks and reallocated 311 into spares. This is good. If there had been 311 bad blocks but only 310 reallocations, it would mean the data in one of the blocks was lost.
What is important is the "normalized" value (038). This is how the manufacturer tells you what they consider normal: a value where 100 means perfect and 0 means really bad. Right now it's 38, which is saying "this is getting bad", but the manufacturer says it's OK as long as that value stays above 010 (the THRESHold).
Here we have our first piece of conflicting information: Used_Rsvd_Blk_Cnt_Tot says the reserve hasn't been touched at all, despite there being bad blocks. It doesn't add up.
But I wouldn't be surprised if the firmware just doesn't track this value despite reporting it, so we'll ignore it for now.
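If you want to pull just the attributes discussed in this answer out of the full report, a grep over smartctl's attribute table works (attribute names can differ slightly between drive models):
sudo smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Runtime_Bad_Block|Used_Rsvd_Blk_Cnt_Tot|Wear_Leveling_Count|UDMA_CRC_Error_Count'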
Wear Levelling
This is the most problematic attribute to read. Wear_Leveling_Count says it's 001. Normally a value of 1 means your drive is dead and must be replaced ASAP: it has run out of spare blocks.
But there have been firmware bugs where this attribute is reported backwards, and a value of 1 means the drive is at 99% health.
Using a TBW calculator, I entered your number of LBAs written and a 512-byte sector size, and got that your drive has 77.43 TiB written. According to Google, your model should be rated for 150 TBW, so it should still be viable.
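You don't actually need an external calculator for this; it's just raw LBAs × 512 bytes. Assuming your drive reports Total_LBAs_Written and that the raw value is in the last column of smartctl's output, a one-liner does it:
sudo smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {printf "%.2f TiB written\n", $NF * 512 / 1024^4}'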
I'm afraid the best solution here is to spin up a Windows box and run CrystalDiskInfo, which accounts for these firmware bugs (using an internal database) and will give you a much more accurate health assessment.
Given that your SMART says SMART overall-health self-assessment test result: PASSED
I'm inclined to believe it means 99% rather than 1%.
But if I'm wrong, we can stop here: the disk must be replaced.
Cable problems / Motherboard problems
The errors in Linux's dmesg basically say the kernel tried to read a sector and got bad data.
The kernel even says it tried to read sector 235602696 twice and got different data:
- 28 00 0e 0b 03 08 00 00 20 00
- 28 00 0e 0b 03 08 00 00 08 00.
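(If I'm reading those right, they are SCSI READ(10) command blocks: the four bytes after the leading 28 00 are the LBA in big-endian, which indeed decodes to the sector from the error message.)
# 0x0e0b0308 == 235602696
printf '%d\n' 0x0e0b0308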
If the disk says there are no errors but the OS says there are, then the data was corrupted in transit. Normally this indicates:
- SATA Cable is loosely plugged
- SATA Cable is damaged
- Power Cable is loosely plugged
- Power Cable is damaged
- Motherboard bus failure
- PSU failure
- RAM failure
But here's where we have our second source of conflicting information: UDMA_CRC_Error_Count is 0.
This means the disk never detected a single error caused by a bad/loose cable or a bad motherboard bus.
That is just very unlikely: SMART says the disk is fine and the commands arriving from the OS are never corrupted by bad wiring, yet the OS read the same sector twice and got different data.
The only things I can think of that would make this possible are bad RAM, or an extremely unlikely cable problem where all the data going into the disk arrives intact but the data coming out of it gets corrupted.
Course of action
My gut tells me the disk is bad. But:
- Back up all the data to another disk. From a live USB (and with an external USB drive big enough), run:
sudo apt install zstd
# To back up (the redirections need root access to /dev/sda, hence sh -c)
sudo sh -c 'zstd -16v < /dev/sda > /media/external_disk/backup_file.zst'
# To restore (don't do this in step 1, see the restore step at the end)
sudo sh -c 'zstdcat -v /media/external_disk/backup_file.zst > /dev/sda'
- Back up the data again, but this time with a regular file copy (if the disk dies, it's much easier to recover from a simple file backup than from trying to loop-mount a compressed zstd image of the disk and reading the files from that); see the rsync sketch after this list
- Reboot and run a memtest to rule out RAM errors
- Shut down, open the case, and unplug and replug the SATA and power cables (at the drive end). Check that they're not damaged; possibly replace them.
- Boot from the live USB drive again and perform a secure wipe of the disk. If there is something buggy going on with your drive, perhaps this will reset it back to a working condition (or perhaps it will be the last command it ever runs, if the disk is beyond salvation). This should take several minutes:
sudo blkdiscard -s /dev/sda
- If things went well so far, restore your backup with the sudo zstdcat command in step 1.
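For step 2, a minimal sketch of the file-level backup (assuming the partition is /dev/sda4 and the external drive is already mounted at /media/external_disk; adjust the paths to your setup, and note that -HAX only makes sense if the external drive uses a Linux filesystem):
# Mount the filesystem read-only and copy the files
sudo mkdir -p /mnt/sda4
sudo mount -o ro /dev/sda4 /mnt/sda4
sudo rsync -aHAXv /mnt/sda4/ /media/external_disk/file_backup/
sudo umount /mnt/sda4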
If the disk still has problems and the memtest succeeded, then personally I'd just write the disk off as bad.
We can't ignore that a value of 038 in Reallocated_Sector_Ct means things are getting bad, despite the manufacturer saying it's not "that" bad yet.
Ah! Important: if at some point you left the disk powered off for more than 3 months, this scenario is quite possible. Despite popular belief, NAND cells can lose their data if left unpowered for too long ("too long" can be anywhere from 7 days to 7 years, but the most common figure is 3 months), especially if the cells are old.
If this happened to you, then just perform the steps above: back up the data, secure-wipe the disk, restore the backup.
Good luck.