Ok, here is what I did. May it help the next person.
Fact Finding
First, I attached all disks to an HBA. GNU/Linux tried to assemble the raid, but found (at least) two raid volumes (and a bit extra). I then made a backup of the first 32MB and the last 32MB of each disk, indexed by their WWID/WWN.
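Something like the following does such a backup (a sketch; the device glob is hypothetical, adjust it to your disks):

for dev in /dev/sd?; do
  wwn=$(lsblk -dno WWN "$dev")        # index the backup files by WWN
  sectors=$(blockdev --getsz "$dev")  # device size in 512-byte sectors
  dd if="$dev" of="head-$wwn.img" bs=1M count=32
  dd if="$dev" of="tail-$wwn.img" bs=512 skip=$((sectors - 65536)) count=65536  # 65536 sectors = 32MB
done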
I then downloaded the SNIA DDF specification (https://www.snia.org/tech_activities/standards/curr_standards/ddf) because I knew that megaraid/dell (partially) implemented it (the ddf anchor block magic is not de11de11 by chance :), and then wrote a very ugly script to decode the data and make sense of it.
This showed me that the array was, in fact, split into three different configurations: one that included a single disk, another that included that disk plus 4 more, and a third that contained the remaining 2 disks.
The script itself is not very useful without understanding what you are doing, so I didn't include it here.
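The starting point is simple enough to show, though: the spec puts the anchor header into the last block of each disk, so you can check for it with dd (device name hypothetical):

dev=/dev/sdX
last=$(( $(blockdev --getsz "$dev") - 1 ))
dd if="$dev" bs=512 skip="$last" count=1 2>/dev/null | od -An -tx1 | head -1
# the dump should start with the anchor magic: de 11 de 11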
Eventually, this allowed me to eke out the correct original order of the disks. Hint: after creating an array, write down the order of WWNs (perccli /c0/s0 show all | grep WWN) and the strip size, at least. This process also gave me the start offset (always 0) and size of the partitions (19531825152 sectors).
The raid5 variant used by the H740P (and probably all megaraid controllers) is called left-symmetric, or "RAID-5 Rotating Parity N with Data Continuation (PRL=05, RLQ=03)".
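To illustrate, this is how data (D) and parity (P) strips are arranged in that layout on four disks - parity rotates from the last disk towards the first, and data continues on the disk right after the parity strip, wrapping around:

disk0  disk1  disk2  disk3
D0     D1     D2     P
D4     D5     P      D3
D8     P      D6     D7
P      D9     D10    D11

Getting this right matters: mdadm has to be told the very same layout (the -p ddf-N-continue below), or the assembled data will be garbage.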
Re-assembling the disks for testing
I then tried to test-reassemble the raid using mdadm --build. Unfortunately, mdadm refuses to build raid5 arrays - you have to use --create, which writes to the array and destroys data :(
As a workaround, to test out whether the order is correct, I started a kvm
in snapshot mode with some random GNU/Linux boot image as /dev/sda
and the disks as virtio disks:
exec kvm -snapshot -m 16384 \
-drive file=linux.img,snapshot=off \
-drive file=/dev/sdm,if=virtio,snapshot=on \
-drive file=/dev/sdl,if=virtio,snapshot=on \
-drive file=/dev/sdk,if=virtio,snapshot=on \
-drive file=/dev/sdi,if=virtio,snapshot=on \
-drive file=/dev/sdg,if=virtio,snapshot=on \
-drive file=/dev/sdf,if=virtio,snapshot=on \
-drive file=/dev/sdh,if=virtio,snapshot=on
This made the disks appear in the specified order as /dev/vda, /dev/vdb and so on, and allowed me to test out various options easily. The first try inside the VM succeeded:
mdadm --create /dev/md0 -f \
--metadata 1.0 \
--raid-devices 7 \
-z $((19531825152/2))K -c 256K \
-l raid5 -p ddf-N-continue \
--assume-clean -k resync \
/dev/vd?
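For reference: -z is the per-device size (the K suffix means KiB, hence the division by two), -c is the strip size, -p selects the DDF layout described above, -k resync sets the consistency policy, and --assume-clean suppresses the initial resync, so nothing except the (version 1.0, end-of-device) superblocks gets written.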
For raid5, the exact size is uncritical - if you make the volume larger than the original, your GPT partition table will look corrupt (its backup copy is no longer in the last sectors) and you have extra data at the end, but the rest of the disk should still be readable.
I verified the correctness of the data by mounting the partition (which should not throw errors, but might succeed even if the order is wrong), and by running btrfs scrub, which verifies the checksums of metadata and data blocks - the ultimate test, and a major plus of btrfs.
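With hypothetical volume and mount point names (there was LVM between the partition table and the filesystem), the checks look roughly like this:

vgchange -ay                      # activate the LVM volumes found on the array
mount -o ro /dev/vg0/data /mnt    # a read-only mount should not throw errors
btrfs scrub start -Br /mnt        # -B waits for completion, -r scrubs read-only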
I then ran the backup again.
I then wrote down the WWNs of all the disks in order, so I could recreate the array with perccli. I also made a backup of the first and last 1GB of the volume itself, in case the raid controller overwrote those.
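With the array still assembled as /dev/md0, that is again a job for dd (file names made up; the volume size here happens to be a multiple of 1MiB):

bytes=$(blockdev --getsize64 /dev/md0)  # volume size in bytes
dd if=/dev/md0 of=vol-head.img bs=1M count=1024
dd if=/dev/md0 of=vol-tail.img bs=1M skip=$(( bytes / 1048576 - 1024 )) count=1024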
Moving the volume back into the raid controller
Since about 14TB of the data was not backed up (because the data can be retrieved from elsewhere with some effort, and I was too impatient to wait for a copy), a full restore was not an option I looked forward to, so I tried to move the array back into the controller.
My first attempt was to format the disks as a DDF container with the raid5 volume inside, using the same parameters as the controller, but unfortunately the megaraid controller - while using DDF itself - does not support importing "foreign" DDF containers and simply showed the disks as "unconfigured good".
I then tried to recreate the array simply by adding it again, e.g.:
perccli /c0 add vd r5 name=XXX drives=3,6,9,1,2,3,0 pdcache=off wb ra strip=256
Doing this on a booted system with perccli ensures that the raid controller will do a background initialise, which is not destructive and, with RAID5, will not even destroy data when the disk order or strip size is wrong, as long as you use exactly the disks from the original array, in any order, without leaving one out or adding extra ones.
This is where I failed - somehow I bungled the order of disks completely, and also managed to corrupt the first 1.5MB of the volume. I have absolutely no idea what went wrong, but I tried many permutations and didn't see the correct data, to the point where I suspected the raid controller was somehow reordering my disks (it isn't - it takes exactly the order specified).
Long story short, I attached the disks to the HBA again and tried and failed to make sense of it. This is where my original backup came in handy: although I had lost the order of the disks, I took a sharp look at the backup and narrowed the order down to two possible permutations simply by staring at hexdumps. Creating the array with mdadm and testing the data gave me the correct ordering.
I then again wrote down the order of WWNs, attached the disks to the controller, booted and did perccli /c0 add.... I then restored the first 1.5MB of the volume (which included the GPT partition table and LVM labels, and some old leftover garbage data that was very useful when guessing what the order could be). A certain level of confidence in being able to undo mistakes is helpful in this situation.
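Restoring that region is a single dd - assuming the controller exposes the volume as /dev/sda (hypothetical) and vol-head.img is the backup from earlier:

dd if=vol-head.img of=/dev/sda bs=512 count=3072 conv=notrunc  # 3072 sectors = 1.5MB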
Result: the array is back, btrfs is consistent, and the controller is now background-initialising, which makes the whole system slow for a few days, but that is a small price to pay.
Things Learned
I learned a great deal!
The perc controllers (and likely all megaraid controllers) don't cope well with quick, intermittent disk problems. I suspect the disks going away and coming back quickly triggered a race condition in which the controller was trying to write the new configuration to the disks and only partially succeeded on some of them, eventually splitting the raid in two. This is clearly a firmware bug. But then, who would expect power cables to be faulty...
mdadm is not very helpful in understanding or displaying DDF headers - I simply couldn't make sense of the displayed data, and as I found out when decoding the headers myself, this is because a lot of information is missing from the --detail and --examine output. It is also not very helpful for experimenting, as it refuses to do a non-destructive, read-only assemble.
perc/megaraid controllers use the SNIA DDF format internally, and having a publicly accessible specification for it was extremely useful, even though in the end I figured out what I needed without it.
Being able to guess the correct order of raid strips from the data alone is very useful, and leftover garbage and other recognisable data helps a lot with that. I will consider writing "disk 1", "disk 2" and so on into "empty" areas of my RAID volume headers from now on (there are long stretches of 0 bytes in the first 2MB).
It is very easy to fuck up - device names, raid member numbers, WWNs, slot numbers and so on are all different, which means a lot of data to manage, and WWNs are long and my old eyes are not that good anymore. Plus, I am not well-organised and overly self-confident :/
Creating and deleting an array using disks with data on them will not erase the data, at least with RAID5 and background initialisation. Foreground initialisation will almost certainly zero out the disks. That means you can create and delete the array as many times as you wish without risking data loss, with one possible exception: deleting an array sometimes requires the force option, because the RAID controller thinks it is "in use" due to a valid partition label, and this might zero out the GPT label - YMMV, so make sure you have a backup of the first few megabytes just in case.
Perc/megaraid controllers don't understand non-dell/megaraid DDF containers. At least, I didn't find out how to make my controller accept mdadm-created DDF containers. Being able to format the disks in GNU/Linux and move them back into the controller would have helped a lot and avoided many hours of grief on my side.
Summary
I got back everything without restoring from backup, at the expense of a few days of slow background initialisation. I wrote down my solution above in case some of it might be useful to other people in similar situations.