Score:0

xfs superblock corrupted after power cut

Ben

Like others before me, the superblock on my xfs drive has become corrupted. I've tried xfs_repair and xfs_repair -L to restore the drive, but both report the same result:

Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1

fatal error -- Input/output error

mkfs.xfs -Nf /dev/sdb1 reports the following:

meta-data=/dev/sdb1              isize=512    agcount=4, agsize=244188544 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=976754176, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=476930, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Having followed various posts on this subject, which all suggest similar things to what I've tried above, I have the sinking feeling that the drive's contents are lost (thanks, EDF Energy). Does anyone have any further recovery suggestions?

Edit: Results of SMART scan...

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0  1
Serial Number:    PBGJYR4S
LU WWN Device Id: 5 000cca 23dc7b57b
Firmware Version: MJAOA5F0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Jun 11 19:11:40 2021 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 248) Self-test routine in progress...
                                        80% of test remaining.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       80
  3 Spin_Up_Time            0x0007   176   176   024    Pre-fail  Always       -       411 (Average 468)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       6222
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       273
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       273
194 Temperature_Celsius     0x0002   142   142   000    Old_age   Always       -       42 (Min/Max 20/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   174   174   000    Old_age   Always       -       1272

SMART Error Log Version: 1
ATA Error Count: 1272 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1272 occurred at disk power-on lifetime: 6220 hours (259 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 21 5f b7 c0 01  Error: ICRC, ABRT 33 sectors at LBA = 0x01c0b75f = 29407071

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 00 b7 c0 e0 08   1d+02:04:04.303  READ DMA EXT
  25 00 f8 00 b6 c0 e0 08   1d+02:04:04.303  READ DMA EXT
  25 00 08 f8 b5 c0 e0 08   1d+02:04:04.302  READ DMA EXT
  25 00 08 f0 b5 c0 e0 08   1d+02:04:04.302  READ DMA EXT
  25 00 08 e8 b5 c0 e0 08   1d+02:04:04.302  READ DMA EXT

Error 1271 occurred at disk power-on lifetime: 6220 hours (259 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 97 b5 c0 01  Error: ICRC, ABRT 1 sectors at LBA = 0x01c0b597 = 29406615

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 90 b5 c0 e0 08   1d+02:04:04.096  READ DMA EXT
  25 00 08 88 b5 c0 e0 08   1d+02:04:04.095  READ DMA EXT
  25 00 08 80 b5 c0 e0 08   1d+02:04:04.095  READ DMA EXT
  25 00 08 78 b5 c0 e0 08   1d+02:04:04.095  READ DMA EXT
  25 00 08 70 b5 c0 e0 08   1d+02:04:04.095  READ DMA EXT

Error 1270 occurred at disk power-on lifetime: 6220 hours (259 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 81 7f b5 c0 01  Error: ICRC, ABRT 129 sectors at LBA = 0x01c0b57f = 29406591

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 b4 c0 e0 08   1d+02:04:03.858  READ DMA EXT
  25 00 f8 08 0a 00 e0 08   1d+02:04:03.856  READ DMA EXT
  c8 00 08 f8 08 00 e0 08   1d+02:04:03.856  READ DMA
  c8 00 08 f0 08 00 e0 08   1d+02:04:03.856  READ DMA
  c8 00 08 e8 08 00 e0 08   1d+02:04:03.855  READ DMA

Error 1269 occurred at disk power-on lifetime: 6220 hours (259 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 c7 08 00 00  Error: ICRC, ABRT 1 sectors at LBA = 0x000008c7 = 2247

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c0 08 00 e0 08   1d+02:04:03.648  READ DMA
  c8 00 08 b8 08 00 e0 08   1d+02:04:03.641  READ DMA
  27 00 00 00 00 00 e0 08   1d+02:04:03.640  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 08   1d+02:04:03.638  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08   1d+02:04:03.636  SET FEATURES [Set transfer mode]

Error 1268 occurred at disk power-on lifetime: 6220 hours (259 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 bf 08 00 00  Error: ICRC, ABRT 1 sectors at LBA = 0x000008bf = 2239

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b8 08 00 e0 08   1d+02:04:03.440  READ DMA
  c8 00 08 b0 08 00 e0 08   1d+02:04:03.440  READ DMA
  c8 00 08 a8 08 00 e0 08   1d+02:04:03.440  READ DMA
  c8 00 08 a0 08 00 e0 08   1d+02:04:03.440  READ DMA
  c8 00 08 98 08 00 e0 08   1d+02:04:03.440  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xb0)       Completed without error       00%     36443         -
# 2  Vendor (0x71)       Completed without error       00%     36443         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Michael Hampton
Sounds like the drive has failed rather than the filesystem. Test it, and be mentally prepared for the necessity to restore from backup.
Michael Hampton
The SMART output confirms that it's dead, and eligible for RMA. Sorry for your loss.
Score:3

The fatal Input/output error means that your drive has failed, not your filesystem: xfs_repair was unable to read the affected sectors from the disk.
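
To see that the error comes from the device read rather than from XFS, the same read can be attempted without any filesystem tool involved. The following is a minimal Python sketch, assuming a Unix-like system; the function name probe_read is made up for illustration, and the default offset and size are simply the values from the xfs_repair message in the question.

```python
import errno
import os

def probe_read(path, offset=0, size=524288):
    """Attempt one raw, read-only read and return (bytes_read, error_name).

    error_name is None on success, otherwise the symbolic errno name
    (for example "EIO" for an Input/output error).
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return 0, errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        # pread reads at an explicit offset without moving the file position.
        data = os.pread(fd, size, offset)
        return len(data), None
    except OSError as exc:
        return 0, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

Called as probe_read("/dev/sdb1") with sufficient privileges, a result of (0, "EIO") would mean the kernel itself could not fetch the data from the drive, which no filesystem repair tool can work around.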

Your SMART output confirms this: it logs multiple aborted reads at 6220 power-on hours, within two hours of the drive's current Power_On_Hours value of 6222 (i.e., the errors happened just now, not in the distant past).
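
This timing comparison can be done mechanically. Below is a minimal Python sketch, assuming the smartctl output has been saved as text in the format shown in the question; the function name and the 24-hour window are illustrative choices, not part of smartmontools.

```python
import re

def recent_error_hours(smart_text, window_hours=24):
    """Return (current_power_on_hours, recent_errors).

    recent_errors lists the power-on-hour stamps of logged errors that lie
    within window_hours of the drive's current Power_On_Hours raw value.
    """
    # Attribute row 9; the raw value is the last number on the line.
    match = re.search(
        r"^[ \t]*9[ \t]+Power_On_Hours\b.*?(\d+)[ \t]*$",
        smart_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("Power_On_Hours attribute not found")
    current = int(match.group(1))
    # Error log headers: "Error N occurred at disk power-on lifetime: H hours".
    logged = [
        int(h)
        for h in re.findall(r"power-on lifetime:\s*(\d+)\s*hours", smart_text)
    ]
    recent = [h for h in logged if 0 <= current - h <= window_hours]
    return current, recent
```

For the output posted in the question this returns 6222 together with five entries of 6220, one for each error the device log retains.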

XFS keeps backup superblocks (one in each AG), but your disk reports errors for many different sectors, far apart from one another, so I do not recommend trying to zero the affected sectors. Rather, I would use ddrescue to clone your disk to a different device and focus any restore attempt on the cloned image.
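
To make the "one in each AG" layout concrete: every allocation group begins with its own superblock copy, so copy n sits n × agsize × bsize bytes from the start of the filesystem. Here is a minimal Python sketch, assuming the geometry printed by mkfs.xfs -N in the question matches the original filesystem (that command only shows what mkfs would create now, so this is an assumption); the function name is illustrative.

```python
def superblock_offsets(agcount, agsize_blocks, block_size):
    """Byte offsets, relative to the start of the filesystem, of the
    superblock copy at the head of each allocation group (index 0 is
    the primary superblock)."""
    return [ag * agsize_blocks * block_size for ag in range(agcount)]

# Geometry assumed from the question: agcount=4, agsize=244188544, bsize=4096.
for ag, offset in enumerate(superblock_offsets(4, 244188544, 4096)):
    print(f"AG {ag}: byte offset {offset} (512-byte sector {offset // 512})")
```

These offsets are relative to the start of the partition (/dev/sdb1), whereas the LBAs in the SMART error log count from the start of the whole disk, so the two cannot be compared without the partition's starting sector.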

Obviously, if you have working and current backups, you can simply trash the disk and restore your data onto a new one.

Score:1

I have to say this looks like a bad disk or a disk that is very near to total catastrophic failure. I hope you have backups!?!

If you do not have backups, you should power off that system, then either boot from alternate media or move the failing disk to another system, and start copying the data to a new disk. This can be done at the block level with dd (or ddrescue), using multiple retries, if you do not have access to a stand-alone disk repair workstation (still the best gadget I've ever purchased!). ddrescue has a retry option (-r; the long form is --max-retries in older releases and --retry-passes in newer ones), and sometimes several retries will successfully read data from a bad disk. Sometimes... The Trinity Rescue Kit LiveCD includes this GNU tool. I am not sure whether all LiveCD distributions do, but it is worth getting and keeping handy. For reference, here is another article that discusses doing this: https://superuser.com/questions/905811/faster-recovery-from-a-disk-with-bad-sectors
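
The idea behind a block-level copy with retries can be sketched in a few lines of Python. This is only an illustration of the principle, not a substitute for ddrescue (which also keeps a map file and reads in multiple passes); the function names, the 4096-byte block size and the retry count are all assumptions made for the example.

```python
import os

def read_block(src, offset, length):
    """Read exactly length bytes at offset; raise OSError on a short read."""
    src.seek(offset)
    chunk = src.read(length)
    if chunk is None or len(chunk) != length:
        raise OSError(f"short read at offset {offset}")
    return chunk

def copy_with_retries(src_path, dst_path, block_size=4096, max_retries=3,
                      reader=read_block):
    """Copy src to dst block by block and return the offsets that failed.

    A block that still fails after max_retries extra attempts is written
    as zeros so that every later block keeps its original offset.
    """
    bad_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        for offset in range(0, size, block_size):
            length = min(block_size, size - offset)
            data = None
            for _ in range(1 + max_retries):
                try:
                    data = reader(src, offset, length)
                    break
                except OSError:
                    continue  # e.g. EIO from a failing sector; try again
            if data is None:
                bad_offsets.append(offset)
                data = b"\x00" * length
            dst.write(data)
    return bad_offsets
```

The reader parameter exists only so the failure path can be exercised with a simulated fault. On a real failing disk, every extra read adds wear, which is one reason to clone once and work on the image afterwards.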

For everyone else, here is a bit of wisdom I've had to learn the hard way a few times in my career: it is better to have backups and never need them than to need backups and have nothing! Configure backups! This should be the first thing you do after OS install!!! Make it a habit now and you will never be caught without backups!
