First time poster - my apologies if I don't get the etiquette correct.
I have a ~200TB RAID6 array with 30 disks and I'm unable to mount it - I just get the message:
mount /dev/md125 /export/models
mount: /dev/md125: can't read superblock
If I run mdadm --detail /dev/md125, it shows as clean:
/dev/md125:
Version : 1.2
Creation Time : Wed Sep 13 15:09:40 2017
Raid Level : raid6
Array Size : 218789036032 (203.76 TiB 224.04 TB)
Used Dev Size : 7813894144 (7.28 TiB 8.00 TB)
Raid Devices : 30
Total Devices : 30
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri May 20 23:54:52 2022
State : clean
Active Devices : 30
Working Devices : 30
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : localhost.localdomain:SW-RAID6
UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
Events : 1152436
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 65 161 1 active sync /dev/sdaa1
2 65 177 2 active sync /dev/sdab1
3 65 193 3 active sync /dev/sdac1
4 65 209 4 active sync /dev/sdad1
5 8 17 5 active sync /dev/sdb1
6 8 33 6 active sync /dev/sdc1
7 8 49 7 active sync /dev/sdd1
8 8 65 8 active sync /dev/sde1
9 8 81 9 active sync /dev/sdf1
10 8 97 10 active sync /dev/sdg1
11 8 113 11 active sync /dev/sdh1
12 8 129 12 active sync /dev/sdi1
13 8 145 13 active sync /dev/sdj1
14 8 161 14 active sync /dev/sdk1
15 8 177 15 active sync /dev/sdl1
16 8 193 16 active sync /dev/sdm1
17 8 209 17 active sync /dev/sdn1
18 8 225 18 active sync /dev/sdo1
19 8 241 19 active sync /dev/sdp1
20 65 1 20 active sync /dev/sdq1
21 65 17 21 active sync /dev/sdr1
22 65 33 22 active sync /dev/sds1
23 65 49 23 active sync /dev/sdt1
24 65 65 24 active sync /dev/sdu1
25 65 81 25 active sync /dev/sdv1
26 65 97 26 active sync /dev/sdw1
27 65 113 27 active sync /dev/sdx1
28 65 129 28 active sync /dev/sdy1
29 65 145 29 active sync /dev/sdz1
cat /proc/mdstat shows:
[root@knox ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active raid6 sdo1[18] sdh1[11] sdad1[4] sdd1[7] sdb1[5] sdi1[12] sdt1[23] sdr1[21] sdp1[19] sdx1[27] sdg1[10] sdn1[17] sdm1[16] sdab1[2] sdu1[24] sdl1[15] sde1[8] sdf1[9] sdw1[26] sdc1[6] sdq1[20] sdy1[28] sds1[22] sdv1[25] sdac1[3] sdz1[29] sdaa1[1] sda1[0] sdj1[13] sdk1[14]
218789036032 blocks super 1.2 level 6, 512k chunk, algorithm 2 [30/30] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk
md126 : active raid1 sdae3[0] sdaf2[1]
976832 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md127 : active raid1 sdaf1[1] sdae1[0]
100554752 blocks super 1.2 [2/2] [UU]
bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
mdadm --examine on the individual devices also shows them as healthy (I haven't included the output for every disk because it would take up too much space, but they all look the same as this one):
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
Name : localhost.localdomain:SW-RAID6
Creation Time : Wed Sep 13 15:09:40 2017
Raid Level : raid6
Raid Devices : 30
Avail Dev Size : 15627788288 sectors (7.28 TiB 8.00 TB)
Array Size : 218789036032 KiB (203.76 TiB 224.04 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : clean
Device UUID : 917e739e:36fa7cf6:c618d73c:43fb7dec
Internal Bitmap : 8 sectors from superblock
Update Time : Fri May 20 23:54:52 2022
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 2b5e9556 - correct
Events : 1152436
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
The relevant entries in dmesg show:
[13297.001208] XFS (md125): Mounting V5 Filesystem
[13297.008854] XFS (md125): Log inconsistent (didn't find previous header)
[13297.008874] XFS (md125): failed to find log head
[13297.008878] XFS (md125): log mount/recovery failed: error -5
[13297.008934] XFS (md125): log mount failed
The background to this is rather long and involved, but the short version is that I tried to grow the array by adding another disk and the operation got interrupted. I eventually got the array rebuilt by reshaping it back to the original 30 disks (which took a full two weeks!), but now it doesn't want to mount.
Unfortunately, it's not backed up (I mean, where do you back up 200TB to?!). Nothing of value was supposed to be stored on it but, human beings being what they are, some critical data has ended up there.
I've looked at xfs_repair, but I'm not sure whether I should run it on the RAID array (md125) or on the individual sd* devices.
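In case it matters, the first thing I was planning to try is a read-only pass, so nothing gets written while I wait for advice; I'm assuming here that the assembled array device is the right target:
xfs_repair -n /dev/md125
(-n is no-modify mode, so it only reports problems rather than changing anything.)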
Thanks
Update (the history behind it all):
The device is a Supermicro server running CentOS 7 (3.10.0-1160.11.1.el7.x86_64) with mdadm version 4.1 (2018-10-01) and 30 x 8TB disks in a RAID6 configuration. It also has boot and root on two RAID1 arrays; the RAID6 array is solely for data. It was running out of space, so we decided to add more drives to the array (the chassis can hold a total of 45 drives).
Since the original disks in the array were 4Kn drives and the supplied replacements were 512e, it was necessary to reformat them with sg_format to convert them (a procedure supported by Western Digital). I started with one disk as a test. Unfortunately the process was interrupted part way through, so I restarted it and it completed OK, sort of: it did convert the disk to 4096-byte sectors, but it threw an I/O error or two. They didn't seem too concerning and I figured that, if there was a problem, it would show up in the next couple of steps. I've since looked at the dmesg log, and it indicated there were significantly more I/O errors than I thought.
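For reference, the conversion command was along these lines (quoting from memory, so the exact options may have differed slightly):
sg_format --format --size=4096 /dev/sd<x>
which low-level formats the drive to a 4096-byte logical sector size (and destroys its contents in the process).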
Anyway, since sg_format appeared to complete OK, I moved on to the next stage, which was to partition the disk with the following commands:
parted -a optimal /dev/sd<x>
(parted) mklabel msdos
(parted) mkpart primary 2048s 100% (need to check that the start is correct)
(parted) align-check optimal 1 (verify alignment of partition 1)
(parted) set 1 raid on (set the FLAG to RAID)
(parted) print
I then added the new disk to the array:
mdadm --add /dev/md125 /dev/sd<x>
And it completed without any problems.
I then proceeded to grow the array:
mdadm --grow --raid-devices=31 --backup-file=/grow_md125.bak /dev/md125
I monitored this with cat /proc/mdstat; it showed that the array was reshaping, but the speed was 0K/sec and the reshape never progressed past 0%.
About 12 hours later, as the reshape still hadn't moved from 0%, I looked at ways of aborting it, such as mdadm --stop /dev/md125, which didn't work, so I ended up rebooting the server.
The server came up in emergency mode.
I was able to log on as root OK, but the RAID6 array was stuck in the reshape state.
I then tried mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19 /dev/md125
and this produced:
mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got <varying numbers>)
mdadm: No RAID super block on /dev/sde
.
.
mdadm: /dev/sde1 is identified as a member of /dev/md125, slot 6
.
.
mdadm: /dev/md125 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /grow_md125.back
mdadm: Failed to find backup of critical section
mdadm: Failed to restore critical section for reshape, sorry.
I tried different variations on this, including mdadm --assemble --invalid-backup --force, all to no avail.
At this point I have also removed the suspect disk but this hasn't made any difference.
But the closest I've come to fixing this is running mdadm /dev/md125 --assemble --invalid-backup --backup-file=/grow_md125.bak --verbose /dev/sdc1 /dev/sdd1 ....... /dev/sdaf1
and this produces:
mdadm: /dev/sdaf1 is identified as a member of /dev/md125, slot 4.
mdadm: /dev/md125 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /grow_md125.back
mdadm: Failed to find backup of critical section
mdadm: continuing without restoring backup
mdadm: added /dev/sdac1 to /dev/md125 as 1
.
.
.
mdadm: failed to RUN_ARRAY /dev/md125: Invalid argument
dmesg has this information:
md: md125 stopped.
md/raid:md125: reshape_position too early for auto-recovery - aborting.
md: pers->run() failed ...
md: md125 stopped.
Since all of the above, I booted from a rescue CD, managed to reshape the array back to the original 30 devices, and have now booted back into the native installation (I did have to comment the array out of /etc/fstab to do so).
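For completeness, the fstab line I commented out is along these lines (from memory, so the mount options may not be exact):
#/dev/md125    /export/models    xfs    defaults    0 0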