In my storage server, I operate three Linux software RAID arrays: two RAID6 arrays and one RAID5 array, all consisting of SATA drives and all connected to an HBA 9500-16i controller. Everything was working fine until it was not. Suddenly, multiple drives in one of the RAID6 arrays and in the RAID5 array started to show this:
May 15 01:20:07 xxxstor kernel: [42205.209000] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:07 xxxstor kernel: [42205.309428] sd 8:0:6:0: Power-on or device reset occurred
May 15 01:20:19 xxxstor kernel: [42217.044287] sd 8:0:8:0: [sdk] tag#1591 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
May 15 01:20:19 xxxstor kernel: [42217.044294] sd 8:0:8:0: [sdk] tag#1591 CDB: Read(16) 88 00 00 00 00 01 47 85 00 58 00 00 00 08 00 00
May 15 01:20:19 xxxstor kernel: [42217.044297] print_req_error: I/O error, dev sdk, sector 5494866008
May 15 01:20:19 xxxstor kernel: [42217.044361] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:19 xxxstor kernel: [42217.055768] sd 8:0:8:0: Power-on or device reset occurred
May 15 01:20:20 xxxstor kernel: [42217.758365] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:20 xxxstor kernel: [42217.825959] sd 8:0:8:0: Power-on or device reset occurred
After this, several drives in these arrays were marked as failed and an automatic rebuild onto the spares was initiated. However, the freshly deployed spares also started to show I/O errors and were marked as failed, and the recovery stopped. When I found out about the situation in the morning, the majority of the drives were marked as failed and the arrays seemed unrecoverable. The failed HDDs show various errors in their SMART logs:
Error 503 occurred at disk power-on lifetime: 22577 hours (940 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 43 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 10 00 ac 3c 40 00 19:33:18.999 WRITE FPDMA QUEUED
2f 00 01 10 00 00 00 00 19:33:18.999 READ LOG EXT
61 00 30 00 a4 3c 40 00 19:33:18.996 WRITE FPDMA QUEUED
61 00 28 00 bc 3c 40 00 19:33:18.994 WRITE FPDMA QUEUED
61 00 20 00 a0 3c 40 00 19:33:18.994 WRITE FPDMA QUEUED
The extended log (SATA Phy Event Counters) reads:
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 0 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 1 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
The SMART log of another drive reads:
Error 2 occurred at disk power-on lifetime: 19503 hours (812 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 43 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 80 00 80 80 00 40 00 18:18:52.230 WRITE FPDMA QUEUED
2f 00 01 10 00 00 00 00 18:18:52.230 READ LOG EXT
61 80 08 80 d6 6e 40 00 18:18:52.230 WRITE FPDMA QUEUED
ef 10 02 00 00 00 00 00 18:18:52.227 SET FEATURES [Enable SATA feature]
ef 02 00 00 00 00 00 00 18:18:52.224 SET FEATURES [Enable write cache]
and the corresponding extended log looks similar to the previous one.
The only difference I see between the drives that failed and those that did not is SMART attribute 199 (UDMA_CRC_Error_Count): it is non-zero on the drives that failed and zero on the ones that are still fine.
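In case it helps, this is roughly how I compared attribute 199 across the member drives (just a sketch, assuming all of them enumerate as /dev/sd*; smartctl -x on a single drive also prints the extended Phy Event Counters quoted above):

for d in /dev/sd[a-z]; do
    printf '%s: ' "$d"
    smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $10}'   # raw value of attribute 199
done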
After I rebooted the system (there was nothing else I could do with it), the failed flag disappeared from all drives, I was able to reassemble the arrays, and their automatic reconstruction started.
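For completeness, the reassembly after the reboot was essentially just this (a sketch; /dev/md0 stands in for each of the affected arrays):

mdadm --stop /dev/md0        # stop any partial, degraded assembly left over from boot
mdadm --assemble --scan      # re-assemble the arrays from the superblocks / mdadm.conf
mdadm --detail /dev/md0      # check the member states after assembly
cat /proc/mdstat             # watch the recovery/resync progress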
So my question is: did such an unlikely event really occur, and did multiple drives just happen to fail at the same time? Or is the HBA controller and/or backplane faulty, and did it mess up so many drives at once?
If the controller is broken, can the drives be trusted despite their SMART logs, or should I just save the data and get rid of the drives?
If the controller is broken, should I simply replace it, or does it make sense to first try updating the firmware/BIOS of the controller card or the Linux driver?
I will be very thankful for any hint. The kernel version is 4.19.181 with the mpt3sas driver version 35.00.00.00. Thank you.
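For reference, this is how I read the driver and firmware versions on my system (host8 is the SAS host seen in the kernel messages above; the sysfs attribute names are, as far as I can tell, what mpt3sas exposes):

modinfo mpt3sas | grep -i ^version           # version of the loaded mpt3sas driver
cat /sys/class/scsi_host/host8/version_fw    # controller firmware version
cat /sys/class/scsi_host/host8/version_bios  # controller BIOS version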
Edit: In the meantime, I realized that all the HDDs that reported SMART issues (UDMA CRC errors, entries in the error log, etc.) are in the rear backplane of the server. The drives in the front backplane are all fine, with no issues. The same HBA controller drives both backplanes.
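This is roughly how I checked which backplane each disk hangs off (a sketch; the exact by-path names depend on the expander/udev setup, and sdk is just one of the affected drives):

lsscsi -t                               # SAS transport addresses of all disks behind the HBA
ls -l /dev/disk/by-path/ | grep -i sas  # phy/expander path for each block device
smartctl -A /dev/sdk | grep -i crc      # CRC counter of a single drive, to correlate with its position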