Today my home server kernel-panicked because something went wrong with its system drive. I swapped the drive and restored the server, and now I'm trying to figure out what happened to the old one. It is quite old, so I expect a hardware failure, but I'd still like to learn something about recovery techniques (and find out why SMART didn't warn me). The old drive now shows up as /dev/sdb and LVM is detected on it, so I renamed its ubuntu-vg to ubuntu-vg-old and activated it.
root@calcium:~# lvs
  LV        VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  ubuntu-lv ubuntu-vg     -wi-ao---- <29.06g
  backups   ubuntu-vg-old -wi-a-----   1.29t
  ubuntu-lv ubuntu-vg-old -wi-a----- 200.00g
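The exact rename commands are not shown above. As a minimal sketch of how such a rename is commonly done when two volume groups share a name: the old VG is addressed by its UUID (which `vgs -o vg_name,vg_uuid` should list) and then activated. The helper names `run` and `rename_and_activate` and the `OLD_VG_UUID` placeholder are illustrative assumptions, not values from this system, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: prints the LVM commands by default instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rename a VG identified by its UUID (needed while two VGs carry the same
# name), then activate all logical volumes in the renamed VG.
rename_and_activate() {
    old_uuid="$1"   # placeholder, e.g. taken from: vgs -o vg_name,vg_uuid
    new_name="$2"
    run vgrename "$old_uuid" "$new_name"
    run vgchange -ay "$new_name"
}
```

Calling `rename_and_activate OLD_VG_UUID ubuntu-vg-old` in dry-run mode just shows the two commands, which makes it easy to check them before touching a failing disk.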
Unfortunately, mounting it doesn't work: after a long timeout the command fails and the drive becomes inaccessible:
root@calcium:~# mount /dev/ubuntu-vg-old/ubuntu-lv /mnt -o ro,user
mount: /mnt: can't read superblock on /dev/mapper/ubuntu--vg--old-ubuntu--lv.
root@calcium:~# pvscan
Error reading device /dev/sdb at 0 length 512.
Error reading device /dev/sdb at 0 length 4096.
Error reading device /dev/sdb1 at 0 length 4096.
Error reading device /dev/sdb2 at 0 length 4096.
Error reading device /dev/sdb3 at 0 length 4096.
PV /dev/sda3 VG ubuntu-vg lvm2 [58.12 GiB / 29.06 GiB free]
Total: 1 [58.12 GiB] / in use: 1 [58.12 GiB] / in no VG: 0 [0 ]
After a reboot (I didn't find another way to make it accessible again) the drive is back. I tried to repair the filesystem:
root@calcium:~# fsck /dev/mapper/ubuntu--vg--old-ubuntu--lv
fsck from util-linux 2.36.1
e2fsck 1.46.3 (27-Jul-2021)
/dev/mapper/ubuntu--vg--old-ubuntu--lv: recovering journal
fsck.ext4: Input/output error while trying to re-open /dev/mapper/ubuntu--vg--old-ubuntu--lv
/dev/mapper/ubuntu--vg--old-ubuntu--lv: ********** WARNING: Filesystem still has errors **********
But this behaves exactly the same as mount: a long timeout, and then the drive is dropped from the system. I ran a SMART offline surface test overnight (smartctl -t offline /dev/sdb); it didn't find any issues, nor did it change any offline SMART attribute. A badblocks read-only test also completes with no errors:
root@calcium:~# badblocks -b 4096 -c 1024 -s -o bb.out /dev/sdb
Checking for bad blocks (read-only test): done
So I tried a non-destructive read-write test with badblocks (badblocks -b 4096 -c 1024 -s -n -v /dev/sdb), and the drive dropped from the system again after about half an hour. I have already replaced the SATA cable and connected the drive to a different port. The issue clearly appears only when writing to particular sector(s).
Is there anything more I could try before a full format (which will most probably fail too, I guess)?
SMART data:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       414
  2 Throughput_Performance  0x0026   055   051   000    Old_age   Always       -       18840
  3 Spin_Up_Time            0x0023   077   066   025    Pre-fail  Always       -       7179
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6274
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       31668
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2286
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       19262840
191 G-Sense_Error_Rate      0x0022   099   099   000    Old_age   Always       -       11132
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   044   000    Old_age   Always       -       35 (Min/Max 14/56)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   087   083   000    Old_age   Always       -       1617
198 Offline_Uncorrectable   0x0030   252   084   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       235
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       2
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6320
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 31656 -
# 2 Short offline Completed without error 00% 31632 -
# 3 Short offline Completed: read failure 10% 31608 2541336840
# 4 Extended offline Completed without error 00% 31587 -
# 5 Short offline Completed without error 00% 31560 -
# 6 Short offline Completed without error 00% 31536 -
# 7 Short offline Completed without error 00% 31512 -
# 8 Short offline Completed without error 00% 31488 -
# 9 Short offline Completed without error 00% 31464 -
#10 Short offline Completed without error 00% 31440 -
#11 Extended offline Completed without error 00% 31419 -
#12 Short offline Completed without error 00% 31392 -
#13 Short offline Completed without error 00% 31368 -
#14 Short offline Completed without error 00% 31344 -
#15 Short offline Completed without error 00% 31320 -
#16 Short offline Completed without error 00% 31296 -
#17 Short offline Completed without error 00% 31272 -
#18 Extended offline Completed without error 00% 31251 -
#19 Short offline Completed without error 00% 31224 -
#20 Short offline Completed without error 00% 31200 -
#21 Short offline Completed without error 00% 31176 -
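
Two values in the output above stand out: the raw Current_Pending_Sector count (1617) and the LBA of the one failed short self-test (2541336840). A minimal sketch of how these could be pulled out of smartctl output and how the LBA maps to a badblocks block number follows. The function names are hypothetical helpers, not part of smartmontools, and the conversion assumes 512-byte logical sectors, which is not confirmed anywhere above (`blockdev --getss /dev/sdb` should report the real value).

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text and do not touch the drive.

# Print the RAW_VALUE column of one SMART attribute, selected by ID,
# from `smartctl -A` style output on stdin.
smart_raw() {
    awk -v id="$1" '$1 == id && $3 ~ /^0x/ { print $10 }'
}

# Print the LBA_of_first_error of every self-test that ended in a read
# failure, from `smartctl -l selftest` style output on stdin.
first_error_lbas() {
    awk '/read failure/ { print $NF }'
}

# Convert an LBA to a block number for a given block size.
# $2 = logical sector size (ASSUMED 512), $3 = block size (4096, as
# passed to badblocks with -b 4096 above).
lba_to_block() {
    echo $(( $1 * ${2:-512} / ${3:-4096} ))
}
```

Typical use would be `smartctl -A /dev/sdb | smart_raw 197` and `smartctl -l selftest /dev/sdb | first_error_lbas`. Under the 512-byte assumption, `lba_to_block 2541336840` gives 317667105, which could serve as the centre of a narrower badblocks range (badblocks accepts an optional last-block and first-block after the device) instead of scanning the whole disk.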