Score:14

Unable to mount an XFS filesystem from Linux RAID6 array ("Log inconsistent")

fr flag
Bob

First time poster - my apologies if I don't get the etiquette correct.

I have a ~200TB RAID6 array with 30 disks and I'm unable to mount it - I just get the message:

mount /dev/md125 /export/models
mount: /dev/md125: can't read superblock

If I run mdadm --detail on it, it shows as clean:

/dev/md125:
           Version : 1.2
     Creation Time : Wed Sep 13 15:09:40 2017
        Raid Level : raid6
        Array Size : 218789036032 (203.76 TiB 224.04 TB)
     Used Dev Size : 7813894144 (7.28 TiB 8.00 TB)
      Raid Devices : 30
     Total Devices : 30
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri May 20 23:54:52 2022
             State : clean
    Active Devices : 30
   Working Devices : 30
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : localhost.localdomain:SW-RAID6
              UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
            Events : 1152436

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1      65      161        1      active sync   /dev/sdaa1
       2      65      177        2      active sync   /dev/sdab1
       3      65      193        3      active sync   /dev/sdac1
       4      65      209        4      active sync   /dev/sdad1
       5       8       17        5      active sync   /dev/sdb1
       6       8       33        6      active sync   /dev/sdc1
       7       8       49        7      active sync   /dev/sdd1
       8       8       65        8      active sync   /dev/sde1
       9       8       81        9      active sync   /dev/sdf1
      10       8       97       10      active sync   /dev/sdg1
      11       8      113       11      active sync   /dev/sdh1
      12       8      129       12      active sync   /dev/sdi1
      13       8      145       13      active sync   /dev/sdj1
      14       8      161       14      active sync   /dev/sdk1
      15       8      177       15      active sync   /dev/sdl1
      16       8      193       16      active sync   /dev/sdm1
      17       8      209       17      active sync   /dev/sdn1
      18       8      225       18      active sync   /dev/sdo1
      19       8      241       19      active sync   /dev/sdp1
      20      65        1       20      active sync   /dev/sdq1
      21      65       17       21      active sync   /dev/sdr1
      22      65       33       22      active sync   /dev/sds1
      23      65       49       23      active sync   /dev/sdt1
      24      65       65       24      active sync   /dev/sdu1
      25      65       81       25      active sync   /dev/sdv1
      26      65       97       26      active sync   /dev/sdw1
      27      65      113       27      active sync   /dev/sdx1
      28      65      129       28      active sync   /dev/sdy1
      29      65      145       29      active sync   /dev/sdz1

cat /proc/mdstat shows:

[root@knox ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active raid6 sdo1[18] sdh1[11] sdad1[4] sdd1[7] sdb1[5] sdi1[12] sdt1[23] sdr1[21] sdp1[19] sdx1[27] sdg1[10] sdn1[17] sdm1[16] sdab1[2] sdu1[24] sdl1[15] sde1[8] sdf1[9] sdw1[26] sdc1[6] sdq1[20] sdy1[28] sds1[22] sdv1[25] sdac1[3] sdz1[29] sdaa1[1] sda1[0] sdj1[13] sdk1[14]
      218789036032 blocks super 1.2 level 6, 512k chunk, algorithm 2 [30/30] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

md126 : active raid1 sdae3[0] sdaf2[1]
      976832 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sdaf1[1] sdae1[0]
      100554752 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

mdadm --examine on the individual devices also shows them as healthy (I haven't included the results for them all because it would take up too much space, but they're all the same as this one):

/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
           Name : localhost.localdomain:SW-RAID6
  Creation Time : Wed Sep 13 15:09:40 2017
     Raid Level : raid6
   Raid Devices : 30

 Avail Dev Size : 15627788288 sectors (7.28 TiB 8.00 TB)
     Array Size : 218789036032 KiB (203.76 TiB 224.04 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 917e739e:36fa7cf6:c618d73c:43fb7dec

Internal Bitmap : 8 sectors from superblock
    Update Time : Fri May 20 23:54:52 2022
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 2b5e9556 - correct
         Events : 1152436

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

The relevant entries in dmesg show:

[13297.001208] XFS (md125): Mounting V5 Filesystem
[13297.008854] XFS (md125): Log inconsistent (didn't find previous header)
[13297.008874] XFS (md125): failed to find log head
[13297.008878] XFS (md125): log mount/recovery failed: error -5
[13297.008934] XFS (md125): log mount failed

The background to this is rather long and involved, but the short version is that I tried to grow the array by adding an additional disk and the operation got interrupted. I eventually got the array rebuilt by reshaping it back to the original 30 disks (which took a full two weeks!), but now it doesn't want to mount.

Unfortunately, it's not backed up (I mean, where do you back up 200TB?!?!). Nothing of value was supposed to be stored here but, human beings being what they are, some critical stuff has been stored there.

I've looked at xfs_repair but I'm not sure if I should run it on the RAID array (md125) or on the individual sd* devices.

Thanks

Update (the history behind it all):

The device is a SuperMicro server running CentOS 7 (3.10.0-1160.11.1.el7.x86_64) with mdadm version 4.1 (2018-10-01) and 30 x 8TB disks in a RAID6 configuration. It also has boot and root on two RAID1 arrays, with the RAID6 array being solely for data. It was running out of space, so we decided to add more drives to the array (it can hold a total of 45 drives).

Since the original disks in the array were 4kN drives and the supplied devices were 512e, it was necessary to reformat them with sg_format to convert them (a procedure supported by Western Digital). I started with one disk as a test. Unfortunately the process was interrupted part way through, so I restarted it and it completed OK, sort of: it did convert the disk to 4096-byte sectors, but it threw an I/O error or two. They didn't seem too concerning and I figured that, if there was a problem, it would show up in the next couple of steps. I've since discovered the dmesg log, and it indicated that there were significantly more I/O errors than I thought.
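
For reference, a 512e to 4kN conversion with sg_format looks roughly like this (a sketch only; /dev/sd<x> is a placeholder and the exact options should be checked against the drive's documentation):

     # reformat the drive to 4096-byte logical sectors (destroys everything on it)
     sg_format --format --size=4096 /dev/sd<x>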

Anyway, since sg_format appeared to complete OK, I moved on to the next stage, which was to partition the disk with the following commands:

     parted -a optimal /dev/sd<x>
     (parted) mklabel msdos
     (parted) mkpart primary 2048s 100% (need to check that the start is correct)
     (parted) align-check optimal 1 (verify alignment of partition 1)
     (parted) set 1 raid on (set the FLAG to RAID)
     (parted) print

I then added the new disk to the array:

     mdadm --add /dev/md125 /dev/sd<x>

And it completed without any problems.

I then proceeded to grow the array:

     mdadm --grow --raid-devices=31 --backup-file=/grow_md125.bak /dev/md125

I monitored this with cat /proc/mdstat and it showed that it was reshaping but the speed was 0K/sec and the reshape didn’t progress from 0%.

About 12 hours later, as the reshape still hadn't progressed from 0%, I looked at ways of aborting it, such as mdadm --stop /dev/md125, which didn't work, so I ended up rebooting the server.

The server came up in emergency mode.

I was able to log on as root OK, but the RAID6 array was stuck in the reshape state.

I then tried mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak --verbose --uuid= f9b65f55:5f257add:1140ccc0:46ca6c19 /dev/md125 and this produced:

     mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got <varying numbers>
     mdadm: No RAID super block on /dev/sde
     .
     .
     mdadm: /dev/sde1 is identified as a member of /dev/md125, slot 6
     .
     .
     mdadm: /dev/md125 has an active reshape - checking if critical section needs to be restored
     mdadm: No backup metadata on /grow_md125.back
     mdadm: Failed to find backup of critical section
     mdadm: Failed to restore critical section for reshape, sorry.

I tried different variations on this, including mdadm --assemble --invalid-backup --force, all to no avail.

At this point I have also removed the suspect disk but this hasn't made any difference.

But the closest I've come to fixing this is running mdadm /dev/md125 --assemble --invalid-backup --backup-file=/grow_md125.bak --verbose /dev/sdc1 /dev/sdd1 ....... /dev/sdaf1 and this produces:

     mdadm: /dev/sdaf1 is identified as a member of /dev/md125, slot 4.
     mdadm: /dev/md125 has an active reshape - checking if critical section needs to be restored
     mdadm: No backup metadata on /grow_md125.back
     mdadm: Failed to find backup of critical section
     mdadm: continuing without restoring backup
     mdadm: added /dev/sdac1 to /dev/md125 as 1
     .
     .
     .
     mdadm: failed to RUN_ARRAY /dev/md125: Invalid argument

dmesg has this information:

     md: md125 stopped.
     md/raid:md125: reshape_position too early for auto-recovery - aborting.
     md: pers->run() failed ...
     md: md125 stopped.

Since all of the above, I booted from a rescue CD and was able to reshape it back to the original 30 devices, and I have booted back into the native installation (I did have to comment out that array in fstab to do so).

Nikita Kipriyanov avatar
za flag
Repair should be done on the RAID. Also, if it's *partitionable* (check with `fdisk -l /dev/mdXXX` if there are any partitions), you should work with partitions. Also, **avoid such large arrays**. Better is to have "RAID60" in 3 of 10 form (3 RAID6 arrays of 10 devices each, striped together). Yes, you'd lose some space, but management operations wouldn't last for weeks. // Also, regarding the history of how you got into this state (an interrupted extension and a reshape back): it could easily be that the data is irrecoverable now. Sorry.
fr flag
Bob
Thanks @NikitaKipriyanov. The background to all this is very long. Is there some way to post large slabs of text?
Nikita Kipriyanov avatar
za flag
The ServerFault post limit is 30k characters. If your data is larger than that, it's probably not for ServerFault, because it's unlikely that such a question would be of use to anyone else. Also, regarding RAID, it's quite a deep and popular topic and there are many questions like this on ServerFault, some of them answered, but it is very hard to go beyond the general answer: make a snapshot and try various approaches, or find a paid professional who'll resolve your particular case.
fr flag
Bob
Thanks @NikitaKipriyanov. I've just edited my original post to include background.
djdomi avatar
za flag
I agree with Nikita: stop making any changes and get a professional.
U. Windl avatar
it flag
On "so I ended up rebooting the server": it would have been wise to inspect syslog or dmesg *before* rebooting. I guess there were a lot of I/O errors. Maybe trying to remove the disk again using `mdadm` would have been more clever, or an attempt to "hard fail" the bad drive via software commands (like in https://stackoverflow.com/a/1365156/6607497).
Score:13
za flag

I want to extend the suggestions above.

It is extremely worthwhile to set up an overlay block device, so that any changes you make to the file system while attempting to recover it will not change anything on the RAID. This allows you to reset everything and start from the beginning, giving you an unlimited number of attempts and relieving the psychological pressure.

I did that with Qemu's qemu-nbd, the Linux nbd.ko (Network Block Device driver) and a qcow2 overlay file.

  1. Connect an additional disk where the overlay will be stored. Load the NBD driver. Mount your scratch disk somewhere:
modprobe nbd
mount /dev/sdXXN /tmp/overlay
  2. Create a qcow2 overlay file:
qemu-img create -f qcow2 -b /dev/md125 -F raw /tmp/overlay/attempt1.qcow2
  3. Create a block device out of the overlay file using qemu-nbd:
qemu-nbd -c /dev/nbd0 /tmp/overlay/attempt1.qcow2

Now you have /dev/nbd0, which is a "writeable clone" of your array. You can safely write to this device; any changes will be written to /tmp/overlay/attempt1.qcow2. So, for example, when you attempt @shodanshok's advice, apply it to /dev/nbd0.
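
For instance, a trial run against the overlay might look like this (a sketch; none of it touches the real array, and the mount point is just an example):

xfs_repair -n /dev/nbd0    # dry run, only report what would be fixed
xfs_repair -L /dev/nbd0    # if necessary, zero the log (destructive, but only on the overlay)
mount /dev/nbd0 /mnt/test  # then see what survived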

  4. If you get stuck, disconnect the overlay and remove the overlay file:
qemu-nbd -d /dev/nbd0
rm /tmp/overlay/attempt1.qcow2

Then repeat everything from step (2). Alternatively, you can create as many overlays as space and /dev/nbdX devices permit (I have 16 of them, for instance) and work in parallel. All of them should use different overlay images, of course. This is useful if you happen to recover only part of the data in one attempt and the rest of it in another.

When working with clones of an XFS filesystem, remember that each of them must have a distinct UUID.
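
For example, one way to give a clone a fresh UUID is xfs_admin (a sketch; it refuses to change the UUID while the log is dirty, so the clone's log must have been recovered or zeroed first):

xfs_admin -U generate /dev/nbd0
blkid /dev/nbd0    # verify the new UUID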

When (if) the correct recovery path is found, it can be reapplied to the raw device, "irreversibly recovering the filesystem", or you can rent some space, dump the recovered data there from the overlay NBDs, recreate the RAID and file system, and copy the data back.

I know, this is hard and cumbersome. This is why data recovery organizations charge a lot when they work with RAIDs. When you try it yourself, you'll agree that their bills aren't as inflated as they might appear at first sight.

And I repeat it again: a RAID6 of 30 devices is a pain. Better to have e.g. 3 RAID6 arrays of 10 drives each, then stripe them together using a layered MD RAID 0 or LVM. This will make things more manageable, and your reshape/check operations will not take weeks to complete. Yes, you do RAID consistency checks (scrubbing) regularly, at least every other month, don't you?
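
Such a layered layout could be built roughly like this (a sketch only; the device names, array numbers and scrub schedule are purely illustrative):

# three 10-disk RAID6 legs
mdadm --create /dev/md10 --level=6 --raid-devices=10 /dev/sd[a-j]1
mdadm --create /dev/md11 --level=6 --raid-devices=10 /dev/sd[k-t]1
mdadm --create /dev/md12 --level=6 --raid-devices=10 /dev/sd[u-z]1 /dev/sda[a-d]1
# striped together into a single block device
mdadm --create /dev/md100 --level=0 --raid-devices=3 /dev/md10 /dev/md11 /dev/md12
# and scrub each RAID6 leg regularly, e.g. from a monthly cron job
echo check > /sys/block/md10/md/sync_action   # repeat for md11 and md12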

Update: There is valuable information in comments, which is worth adding here.

  • I doubt qemu stuff will be available in the Synology DSM. But you can connect the disks to an ordinary PC with Linux and proceed. Or try booting the Synology from the network or a LiveUSB: a NAS which can connect 30 disks is basically an ordinary amd64 rack-mountable computer.

  • @Mark suggests another way to create an overlay:

@Bob, there are other options for creating an overlay — I've used a USB thumb drive and the steps at https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID

A nice way, which uses the Device Mapper framework, likely to be present in the DSM! It is also probably faster than my approach. It is the dmsetup command that creates the virtual device with a sparse overlay file. However, since the RAID array itself appears clean in your case and all we are talking about is fixing a file system, I suggest creating the overlay on the assembled array (/dev/md125) rather than on the individual array components.
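
A minimal sketch of such a dmsetup-based overlay, assuming a scratch file at /tmp/overlay/md125-cow.img with enough room (about 50G here) for the changes; all names are illustrative:

# sparse copy-on-write file that will receive any writes
truncate -s 50G /tmp/overlay/md125-cow.img
loopdev=$(losetup -f --show /tmp/overlay/md125-cow.img)
# origin size in 512-byte sectors
size=$(blockdev --getsz /dev/md125)
# create the snapshot; then work on /dev/mapper/md125_overlay instead of /dev/md125
dmsetup create md125_overlay --table "0 $size snapshot /dev/md125 $loopdev P 8"

To throw an attempt away, remove the mapping with dmsetup remove md125_overlay and delete the sparse file.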

fr flag
Bob
Thanks @Nikita, checking with `fdisk -l /dev/md125` gives me: ```Disk /dev/md125: 224040.0 GB, 224039972896768 bytes, 54697259008 sectors Units = sectors of 1 * 4096 = 4096 bytes Sector size (logical/physical): 4096 bytes / 4096 bytes I/O size (minimum/optimal): 524288 bytes / 14680064 bytes ```
fr flag
Bob
`parted` returns: ``` [root@knox ~]# parted GNU Parted 3.1 Using /dev/sda Welcome to GNU Parted! Type 'help' to view a list of commands. (parted) select /dev/md125 Using /dev/md125 (parted) print Error: /dev/md125: unrecognised disk label Model: Linux Software RAID Array (md) Disk /dev/md125: 224TB Sector size (logical/physical): 4096B/4096B Partition Table: unknown Disk Flags: ```
fr flag
Bob
Your description of the overlay process sounds somewhat complicated, and I can understand why you say that data recovery organisations are worth their money. With respect to your question about scrubbing, the answer is a definite yes for our smaller Synology NAS, but with this beast I'm ashamed to admit that I'm not sure. I'm not sure if I mentioned that Linux, and Linux RAID especially, is somewhat new to me, so I'm rather out of my depth on this.
Nikita Kipriyanov avatar
za flag
I doubt qemu stuff will be available in the Synology DSM. But you can connect the disks to an ordinary PC with Linux and proceed. Or try booting the Synology from the network or a LiveUSB: a NAS which can connect 30 disks is basically an ordinary amd64 rack-mountable computer.
Mark avatar
tz flag
@Bob, there are other options for creating an overlay -- I've used a USB thumb drive and the steps at https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
Nikita Kipriyanov avatar
za flag
A nice way, which uses the Device Mapper framework, likely to be present in the DSM! It is also probably faster than my approach. It is the `dmsetup` command that creates the virtual device with a sparse overlay file. However, since the RAID array itself appears clean in your case and all we are talking about is fixing a file system, I suggest creating the overlay on the assembled array (`/dev/md125`) rather than on the individual array components.
Score:10
ca flag

The logs

[13297.001208] XFS (md125): Mounting V5 Filesystem
[13297.008854] XFS (md125): Log inconsistent (didn't find previous header)
[13297.008874] XFS (md125): failed to find log head
[13297.008878] XFS (md125): log mount/recovery failed: error -5
[13297.008934] XFS (md125): log mount failed

make me think that the aborted reshape "shuffled" the LBA numbering, so that XFS is not finding its intent log. This probably means widespread corruption, so, as others have already said, stop here and contact a professional data recovery service.

If this is not possible, I would make a last attempt at ignoring the XFS log with something like mount -o ro,norecovery /dev/md125 /export/models, but in the very improbable case that it works, be prepared for extensive data corruption.
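
If that mount does succeed, the first priority would be to copy anything valuable off while the filesystem is read-only, roughly like this (the destination is just a placeholder for whatever spare storage you can find):

mount -o ro,norecovery /dev/md125 /export/models
# copy the critical data somewhere safe before attempting anything destructive
rsync -aHAX /export/models/ /mnt/rescue/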

Again, if it stored critical data, contact a data recovery firm before doing anything.

fr flag
Bob
Thanks @shodanshok. I'll try talking the boss into it.
Criggie avatar
in flag
@bob if this is work, then you need to document your actions and be careful. "Cover your arse" even if it costs money. If the loss of this data costs the company, they may blame you for it.
U. Windl avatar
it flag
Yes, the issue as it is now seems unrelated to the RAID below; instead the issue is "Log inconsistent". The really interesting question is how this state happened. Syslogs from the past may be helpful.
cn flag
@Bob mounting with the "norecovery" option is safe. If it works, check whether you can access your data and whether it is coherent. However, be prepared to need some place to copy your valuable data to.