Many drives in two RAID6 arrays failed simultaneously; they seem to be working after a reboot, except for the SMART long test

In my storage server, I operate three Linux software RAID arrays. Everything was working fine until it was not.

There are two RAID6 arrays and one RAID5 array, all consisting of SATA drives and all connected to an HBA 9500-16i controller. Suddenly, multiple drives in one RAID6 array and the RAID5 array started to show this:

May 15 01:20:07 xxxstor kernel: [42205.209000] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:07 xxxstor kernel: [42205.309428] sd 8:0:6:0: Power-on or device reset occurred
May 15 01:20:19 xxxstor kernel: [42217.044287] sd 8:0:8:0: [sdk] tag#1591 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
May 15 01:20:19 xxxstor kernel: [42217.044294] sd 8:0:8:0: [sdk] tag#1591 CDB: Read(16) 88 00 00 00 00 01 47 85 00 58 00 00 00 08 00 00
May 15 01:20:19 xxxstor kernel: [42217.044297] print_req_error: I/O error, dev sdk, sector 5494866008
May 15 01:20:19 xxxstor kernel: [42217.044361] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:19 xxxstor kernel: [42217.055768] sd 8:0:8:0: Power-on or device reset occurred
May 15 01:20:20 xxxstor kernel: [42217.758365] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:20 xxxstor kernel: [42217.825959] sd 8:0:8:0: Power-on or device reset occurred

After this, several drives in these arrays were marked as failed and automatic replacement by the spares was initiated. However, the freshly deployed spares also started to show I/O errors, were marked as failed, and the recovery stopped. When I found out about the situation in the morning, the majority of the drives were marked as failed and the arrays seemed unrecoverable. The failed HDDs show various errors in their SMART logs:

Error 503 occurred at disk power-on lifetime: 22577 hours (940 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 10 00 ac 3c 40 00      19:33:18.999  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      19:33:18.999  READ LOG EXT
  61 00 30 00 a4 3c 40 00      19:33:18.996  WRITE FPDMA QUEUED
  61 00 28 00 bc 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED
  61 00 20 00 a0 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED

or, in the extended SMART output (SATA Phy Event Counters):

ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

The SMART error log of another drive reads:

Error 2 occurred at disk power-on lifetime: 19503 hours (812 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 00 80 80 00 40 00      18:18:52.230  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      18:18:52.230  READ LOG EXT
  61 80 08 80 d6 6e 40 00      18:18:52.230  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      18:18:52.227  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      18:18:52.224  SET FEATURES [Enable write cache]

and the corresponding extended log looks similar to the previous one.

The only difference I see between the drives that failed and those that did not is SMART attribute 199 (UDMA_CRC_Error_Count): it is non-zero on the drives that failed and zero on the drives that are still fine.
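
For reference, a minimal sketch of how attribute 199 can be compared across all drives at once (it assumes smartmontools is installed, root privileges, and that the disks show up as /dev/sd*):

```python
#!/usr/bin/env python3
# Minimal sketch: print SMART attribute 199 (UDMA_CRC_Error_Count) for every
# /dev/sd* disk so the drives that saw link-level CRC errors stand out.
# Assumes smartmontools is installed and the script is run as root.
import glob
import subprocess

for dev in sorted(glob.glob("/dev/sd[a-z]")):  # widen the glob if there are more than 26 disks
    result = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        # Attribute rows look like:
        # 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
        if "UDMA_CRC_Error_Count" in line:
            print(f"{dev}: UDMA_CRC_Error_Count = {line.split()[-1]}")
```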

After I rebooted the system (there wasn't anything else I was able to do with it), the failed marks on all drives disappeared, I was able to assemble the arrays again, and their automatic reconstruction started.
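
As a side note, a minimal sketch of how the reconstruction progress can be watched by parsing /proc/mdstat (the 60-second polling interval is arbitrary; `mdadm --detail /dev/mdX` reports the same information):

```python
#!/usr/bin/env python3
# Minimal sketch: periodically report md resync/recovery progress from /proc/mdstat.
import re
import time

while True:
    with open("/proc/mdstat") as f:
        mdstat = f.read()
    # Progress lines look like:
    # [==>.......]  recovery = 12.6% (123456/987654) finish=123.4min speed=10000K/sec
    for kind, pct, finish in re.findall(r"(recovery|resync)\s*=\s*([\d.]+%).*?finish=(\S+)", mdstat):
        print(f"{kind}: {pct} done, estimated finish in {finish}")
    time.sleep(60)
```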

So my question is: did such an unlikely event really occur, and did multiple drives just happen to fail at the same time? Or is the HBA controller and/or the backplane faulty, and did it mess up so many drives at once?

In the case of a broken controller, can the drives be trusted in spite of their SMART logs, or should I just save the data and get rid of the drives?

If the controller is broken, should I just replace it, or does it make sense to try updating the firmware/BIOS of the controller card or the Linux driver?

I will be very thankful for any hint. The kernel version is 4.19.181 with the mpt3sas driver version 35.00.00.00. Thank you.

Edit: In the meantime, I realized that all the HDDs that reported SMART issues (UDMA_CRC errors, entries in the error log, etc.) are on the back panel of the server. The drives on the front panel are all fine, with no issues. The same HBA controller controls both backplanes.

GregAskew:
`Did such an unlikely event really occur, and did multiple drives just happen to fail at the same time?` Yes. Same/similar models/batch - almost certainly yes. Dodgy low-cost consumer hardware - a virtual certainty.
Please don't use R5, it's dangerous.
michalt:
@GregAskew Not a similar batch/model. In one failing array, 12TB drives: a combination of WD Gold and HGST Ultrastar Helium, all server/NAS-grade drives. In the other array, 6TB drives: a combination of WD Gold and Seagate Exos 7E8. Different ages, different batches, and all failed at the same time, within a minute of each other. Well, they are not SAS, but I would not call them dodgy. I have used these series for quite a few years and they have worked quite reliably. Their failures were usually predictable and they rarely failed before 40k hours of operation. Never more than one at a time.
Nikita Kipriyanov:
Check the power supply as well.
tsc_chazz:
Echoing Nikita Kipriyanov: power supply. Almost certainly the failing drives all share a single 12v or 5v bus from the PSU, and the bus dropped out briefly.
michalt:
@NikitaKipriyanov and tsc_chazz, thank you for your insight. I definitely cannot exclude this possibility. However, there is not much I can reasonably do to prevent it from happening again any time soon. The server has two redundant PSUs, ca. 1 kW each. Each PSU is connected to a different, independent online UPS. IPMI did not report any voltage drop, nor did either UPS log any issue with the voltage around the time of the incident. It is a fact that the incident occurred right after a regularly scheduled I/O-intensive activity started that involves a set of five HDDs in RAID5.
Nikita Kipriyanov:
The fact that they all failed simultaneously hints that the actual failure lies in something they share: the PSU (note that a brief brownout might not be registered in the logs; I'd specifically test a simultaneous start of all drives while measuring the voltage with some external fast device *on the outlet* near the drives. Who knows, maybe some spike-filtering capacitor near the drives is broken), the controller/HBA, or the enclosure backplane. Also, could there have been some external physical impact ([vibration](https://www.youtube.com/watch?v=tDacjrSCeq4) etc.) that acted on all drives?
michalt:
@NikitaKipriyanov Thanks. I ordered maintenance of the controller, so hopefully I will have a chance to measure the voltages. I also sent an email to our IT staff asking whether anyone shouted at my storage at 1 am that day. No reply so far :-D. Other vibrations are unlikely. The most essential question is whether I can trust HDDs with hundreds of UDMA_CRC errors when their extended self-tests passed. I am currently rebuilding the RAID6 and everything seems all right, even though all the drives are working simultaneously.
Nikita Kipriyanov:
UDMA CRC errors point to problems with the interface, i.e. the platters are likely OK, but the cable, a connection (cable attachment, backplane connector), the drive's controller board, or the HBA is faulty.
michalt:
@NikitaKipriyanov One of the involved drives has shown some pending sectors in the meantime. So now it is all messed up and I cannot distinguish what is the result of the previous failure described in my post from what is the result of natural aging of the HDDs uncovered by the subsequent RAID6 rebuild. I have checked all the cables and connections and everything seems all right. We will see what the maintenance shows. Thank you for all your input; I really appreciate it.
michalt:
The maintenance by the company that supplied the system did not identify any issue with either the HBA or the backplanes. I suspect, however, that they only went through the output of the diagnostic utilities. They suggested that the UDMA_CRC errors were caused by the broad variety of HDDs attached to the backplanes: they said that 11 different types of HDDs is too many, and that because they have different spin-up times and latencies, this may cause issues with the HBA.