Many drives in two RAID6 arrays failed simultaneously; they seem to be working after a reboot, except for the SMART long test

In my storage server, I operate three Linux software RAID arrays. Everything was working fine until it was not.

There are two RAID6 arrays and one RAID5 array, all consisting of SATA drives and all connected to an HBA 9500-16i controller. Suddenly, multiple drives in one RAID6 array and the RAID5 array started to show this:

May 15 01:20:07 xxxstor kernel: [42205.209000] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:07 xxxstor kernel: [42205.309428] sd 8:0:6:0: Power-on or device reset occurred
May 15 01:20:19 xxxstor kernel: [42217.044287] sd 8:0:8:0: [sdk] tag#1591 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
May 15 01:20:19 xxxstor kernel: [42217.044294] sd 8:0:8:0: [sdk] tag#1591 CDB: Read(16) 88 00 00 00 00 01 47 85 00 58 00 00 00 08 00 00
May 15 01:20:19 xxxstor kernel: [42217.044297] print_req_error: I/O error, dev sdk, sector 5494866008
May 15 01:20:19 xxxstor kernel: [42217.044361] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:19 xxxstor kernel: [42217.055768] sd 8:0:8:0: Power-on or device reset occurred
May 15 01:20:20 xxxstor kernel: [42217.758365] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:20 xxxstor kernel: [42217.825959] sd 8:0:8:0: Power-on or device reset occurred

After this, several drives in these arrays were marked as failed and automatic replacement by the spares was initiated. However, the freshly deployed spares also started to show I/O errors, were marked as failed, and the recovery stopped. When I found out about the situation in the morning, the majority of the drives were marked as failed and the arrays seemed unrecoverable. The failed HDDs show various errors in their SMART logs:

Error 503 occurred at disk power-on lifetime: 22577 hours (940 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 10 00 ac 3c 40 00      19:33:18.999  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      19:33:18.999  READ LOG EXT
  61 00 30 00 a4 3c 40 00      19:33:18.996  WRITE FPDMA QUEUED
  61 00 28 00 bc 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED
  61 00 20 00 a0 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED

or, in the extended SMART output (SATA Phy Event Counters):

ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

The SMART error log of another drive reads:

Error 2 occurred at disk power-on lifetime: 19503 hours (812 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 00 80 80 00 40 00      18:18:52.230  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      18:18:52.230  READ LOG EXT
  61 80 08 80 d6 6e 40 00      18:18:52.230  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      18:18:52.227  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      18:18:52.224  SET FEATURES [Enable write cache]

and the corresponding extended log looks similar to the previous one.

The only difference I see between the drives that failed and those that did not is SMART attribute 199 (UDMA_CRC_Error_Count): it is non-zero on the drives that failed and zero on the drives that are still fine.
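
For reference, a minimal sketch of how attribute 199 can be compared across all drives at once (it assumes smartmontools is installed, root privileges, and that the disks show up as /dev/sd*):

```python
#!/usr/bin/env python3
# Minimal sketch: print SMART attribute 199 (UDMA_CRC_Error_Count) for every
# /dev/sd* disk so the drives that saw link-level CRC errors stand out.
# Assumes smartmontools is installed and the script is run as root.
import glob
import subprocess

for dev in sorted(glob.glob("/dev/sd[a-z]")):  # widen the glob if there are more than 26 disks
    result = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        # Attribute rows look like:
        # 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
        if "UDMA_CRC_Error_Count" in line:
            print(f"{dev}: UDMA_CRC_Error_Count = {line.split()[-1]}")
```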

After I rebooted the system (there wasn't anything else I was able to do with it), the failed marks on all drives disappeared, I was able to assemble the arrays again, and their automatic reconstruction started.
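
As a side note, a minimal sketch of how the reconstruction progress can be watched by parsing /proc/mdstat (the 60-second polling interval is arbitrary; `mdadm --detail /dev/mdX` reports the same information):

```python
#!/usr/bin/env python3
# Minimal sketch: periodically report md resync/recovery progress from /proc/mdstat.
import re
import time

while True:
    with open("/proc/mdstat") as f:
        mdstat = f.read()
    # Progress lines look like:
    # [==>.......]  recovery = 12.6% (123456/987654) finish=123.4min speed=10000K/sec
    for kind, pct, finish in re.findall(r"(recovery|resync)\s*=\s*([\d.]+%).*?finish=(\S+)", mdstat):
        print(f"{kind}: {pct} done, estimated finish in {finish}")
    time.sleep(60)
```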

So my question is: did such an unlikely event really occur, and did multiple drives just happen to fail at the same time? Or is the HBA controller and/or the backplane faulty, and did it mess up so many drives at once?

In the case of a broken controller, can the drives be trusted in spite of their SMART logs, or should I just save the data and get rid of the drives?

If the controller is broken, should I just replace it, or does it make sense to try updating the firmware/BIOS of the controller card or the Linux driver?

I will be very thankful for any hint. The kernel version is 4.19.181 with the mpt3sas driver version 35.00.00.00. Thank you.

Edit: In the meantime, I realized that all the HDDs that reported SMART issues (UDMA_CRC errors, entries in the error log, etc.) are on the back panel of the server. The drives on the front panel are all fine, with no issues. The same HBA controller controls both backplanes.

GregAskew:
`Did such an unlikely event really occur, and did multiple drives just happen to fail at the same time?` Yes. Same/similar models/batch - almost certainly yes. Dodgy low-cost consumer hardware - a virtual certainty.
Please don't use R5, it's dangerous.
michalt:
@GregAskew Not a similar batch/model. In one failing array, 12TB drives: a combination of WD Gold and HGST Ultrastar Helium, all server/NAS-grade drives. In the other array, 6TB drives: a combination of WD Gold and Seagate Exos 7E8. Different ages, different batches, and all failed at the same time, within a minute of each other. Well, they are not SAS, but I would not call them dodgy. I have used these series for quite a few years and they have worked quite reliably. Their failures were usually predictable and they rarely failed before 40k hours of operation. Never more than one at a time.
Nikita Kipriyanov:
Check the power supply as well.
tsc_chazz:
Echoing Nikita Kipriyanov: power supply. Almost certainly the failing drives all share a single 12v or 5v bus from the PSU, and the bus dropped out briefly.
michalt:
@NikitaKipriyanov and tsc_chazz, thank you for your insight. I definitely cannot exclude this possibility. However, there is not much I can reasonably do to prevent it from happening again any time soon. The server has two redundant PSUs, ca. 1 kW each. Each PSU is connected to a different, independent online UPS. IPMI did not report any voltage drop, nor did either UPS log any issue with the voltage around the time of the incident. It is a fact that the incident occurred right after a regularly scheduled I/O-intensive activity started that involves a set of five HDDs in RAID5.
Nikita Kipriyanov:
The fact that they all failed simultaneously hints that the actual failure lies in something they share: the PSU (note that a brief brownout might not be registered in the logs; I'd specifically test a simultaneous start of all drives while measuring the voltage with some external fast device *on the outlet* near the drives. Who knows, maybe some spike-filtering capacitor near the drives is broken), the controller/HBA, or the enclosure backplane. Also, could there have been some external physical impact ([vibration](https://www.youtube.com/watch?v=tDacjrSCeq4) etc.) that acted on all drives?
michalt:
@NikitaKipriyanov Thanks. I ordered maintenance of the controller, so hopefully I will have a chance to measure the voltages. I also sent an email to our IT staff asking whether anyone shouted at my storage at 1 am that day. No reply so far :-D. Other vibrations are unlikely. The most essential question is whether I can trust HDDs with hundreds of UDMA_CRC errors when their extended self-tests passed. I am currently rebuilding the RAID6 and everything seems all right, even though all the drives are working simultaneously.
Nikita Kipriyanov:
UDMA CRC errors point to problems with the interface, i.e. the platters are likely OK, but the cable, a connection (cable attachment, backplane connector), the drive's controller board, or the HBA is faulty.
michalt:
@NikitaKipriyanov One of the involved drives has shown some pending sectors in the meantime. So now it is all messed up and I cannot distinguish what is the result of the previous failure described in my post from what is the result of natural aging of the HDDs uncovered by the subsequent RAID6 rebuild. I have checked all the cables and connections and everything seems all right. We will see what the maintenance shows. Thank you for all your input; I really appreciate it.
michalt:
The maintenance by the company that supplied the system did not identify any issue with either the HBA or the backplanes. I suspect, however, that they only went through the output of the diagnostic utilities. They suggested that the UDMA_CRC errors were caused by the broad variety of HDDs attached to the backplanes: they said that 11 different types of HDDs is too many, and that because they have different spin-up times and latencies, this may cause issues with the HBA.