Score:1

RAID arrays failed and will not restart; mdadm --examine shows the drives as healthy, but --assemble fails with two disks missing


This is a Mint 21.1 x64 Linux system which has, over the years, had disks added to its RAID arrays until we now have one array of ten 3 TB drives and one array of five 6 TB drives. Four HDs dropped out of the arrays, two from each, apparently as a result of one controller failing. We've replaced the controllers, but that has not restored the arrays to function. `mdadm --assemble` reports that it is unable to start either array because of insufficient disks (with two failed in each, I'm not surprised); `mdadm --run` reports an I/O error (syslog seems to suggest this is because it can't start all the drives, but there is no indication that it tried to start the two apparently unhappy ones); yet I can still `mdadm --examine` the failed disks and they look absolutely normal. Here's the output from a functional drive:

mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 6e072616:2f7079b0:b336c1a7:f222c711

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:30:27 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 2faf0b93 - correct
         Events : 21397

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 9
   Array State : AAAAAA..AA ('A' == active, '.' == missing, 'R' == replacing)

And here's output from a failed drive:

mdadm --examine /dev/sdk
/dev/sdk:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : d62b85bc:fb108c56:4710850c:477c0c06

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : d53202fe - correct
         Events : 21392

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 6
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

Edit: Here's the `--examine` report from the second failed drive; as you can see, it failed at the same time the entire array went offline.

# mdadm --examine /dev/sdl
/dev/sdl:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 35ebf7d9:55148a4a:e190671d:6db1c2cf

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : c13b7b79 - correct
         Events : 21392

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 7
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

The second array, 5x6TB, went offline two minutes later when two of its disks quit. The two failed disks on this array, and the two on the other array, were all connected to a single 4-port SATA controller card, which of course has now been replaced.

The main thing I find interesting about this is that the failed drives report themselves as alive, but mdadm doesn't agree with them. journalctl doesn't seem to go back as far as 2 April, so I may not be able to find out what happened. Does anyone have any ideas about what I can do to bring this beast back online?
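
For reference, the relevant superblock fields can be compared across all members in one pass with something like the line below; the `/dev/sd[a-l]` range is only illustrative and should be adjusted to the actual member devices:

mdadm --examine /dev/sd[a-l] | grep -E '^/dev/|Update Time|Events|Array State'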

Peter Zhabin: Please show the output of `mdadm --examine` for **both** failed drives within the same array; you need to make sure that all of these drives failed at the same point in time, since there's a chance one drive fell out of the array long ago and the event went unnoticed.
tsc_chazz: Done. See the edited question above.
Peter Zhabin: I believe that in this state it is safe to force the array online with `mdadm --assemble --force /dev/mdX` or `mdadm --assemble --force --scan`. As usual, to be on the safe side, it is recommended to make physical disk images before attempting recovery.
tsc_chazz: `mdadm --assemble --force --scan` resulted in the second array, `/dev/md127` (5x6TB), restarting, but the first array, `/dev/md126` (10x3TB), failed with "cannot re-read metadata from /dev/sdh - aborting". `/dev/sdh` was not one of the failed drives earlier...
tsc_chazz: Stopping the array and using `mdadm --assemble --force /dev/md126 /dev/sda ...` (explicitly naming all the member drives) seems to have gotten things started again. If you post this as an answer I can credit you for it.
Score:0
  1. Always make image-level backups of all drives in the array before attempting any potentially destructive mdadm commands (a command sketch covering these steps follows the list). With these backups at hand you can later attempt recovery in a VM outside the box.
  2. Examine the Update Time field for the failed drives in the output of `mdadm --examine /dev/sdX` to determine the exact sequence of events as the drives fell out of the array. Sometimes the first drive failure goes unnoticed, and bringing that stale drive back online can result in a catastrophic failure when the filesystem is mounted.
  3. In your case both drives failed at once, so it should be safe to force the array online with `mdadm --assemble --force /dev/mdX` or `mdadm --assemble --force --scan`. If that were not the case, you should force online only the last drive that fell off the array by specifying the array member drives explicitly, as in `mdadm --assemble --force /dev/mdX /dev/sda /dev/sdb missing /dev/sdd`; note that the order of the drives is important.
  4. As you were able to get things going only with an explicit device list for assemble, I believe your array is currently in a degraded state with that `/dev/sdh` marked offline. Look at the output of `cat /proc/mdstat` to confirm that, make a backup, troubleshoot your hardware, and then rebuild your array completely.
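
A minimal sketch of that sequence, assuming illustrative device names (`/dev/sda`, `/dev/md126`) and backup paths; adjust everything to the actual array members before running it:

# 1. Image every member disk first (ddrescue is in the gddrescue package; paths are examples)
ddrescue /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map

# 2. Confirm that the failed members share the same Update Time and Events count
#    (compare the mdadm --examine output shown above)

# 3. Both failed drives dropped at the same instant, so forced assembly should be safe
mdadm --stop /dev/md126          # clear any half-assembled state first
mdadm --assemble --force --scan  # or name the array and every member device explicitly

# 4. Verify the result and watch any resync
cat /proc/mdstat
mdadm --detail /dev/md126
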
tsc_chazz: The last step is definitely also useful, but in this case it looks like a transient, as `/proc/mdstat` shows all drives as up.