Score:0

Disk issues: irq_stat 0x20000000, host bus error


When copying large files (50+ GB) from an NVMe disk to a 7200 rpm SATA HDD, I see the following error in the logs on a fully patched Ubuntu 20.04:

Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 08 00:45:59 host kernel: ata6.00: irq_stat 0x20000000, host bus error
Aug 08 00:45:59 host kernel: ata6.00: failed command: WRITE DMA EXT
Aug 08 00:45:59 host kernel: ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out
                                    res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x20 (host bus error)
Aug 08 00:45:59 host kernel: ata6.00: status: { DRDY }
Aug 08 00:45:59 host kernel: ata6: hard resetting link
Aug 08 00:46:00 host kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 08 00:46:00 host kernel: ata6.00: configured for UDMA/133
Aug 08 00:46:00 host kernel: ata6: EH complete

ata6.00 is the disk which is being written to.
The issue is intermittent. Sometimes it does not appear for 24 hours, sometimes it appears a couple of times per hour. The disk often recovers, but sometimes the filesystem becomes corrupt and has to be unmounted, repaired (if possible) and remounted.
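The `cmd` line in the log above can be unpacked to see what the failed write actually was. The following is a minimal sketch, assuming the field order that libata uses when it prints a failed command (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte logical sectors. The function name `decode_taskfile` is made up for illustration.

```python
import re

# Assumed layout of libata's "cmd" line:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})",
    re.IGNORECASE,
)


def decode_taskfile(line):
    """Decode the taskfile of a 48-bit ATA command from a kernel log line."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no 'cmd' taskfile found in line")

    command = int(match.group(1), 16)
    _feature, nsect, lbal, lbam, lbah = (
        int(x, 16) for x in match.group(2).split(":")
    )
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in match.group(3).split(":")
    )
    device = int(match.group(4), 16)

    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # 16-bit sector count; for 48-bit commands a count of 0 means 65536.
    sectors = (hob_nsect << 8) | nsect
    if sectors == 0:
        sectors = 65536

    return {"command": command, "lba": lba, "sectors": sectors, "device": device}


if __name__ == "__main__":
    sample = "ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out"
    info = decode_taskfile(sample)
    print(hex(info["command"]), info["lba"], info["sectors"], info["sectors"] * 512)
```

For the line in the question, this yields command 0x35 (WRITE DMA EXT) and 8 sectors, which at 512 bytes each matches the `dma 4096 out` reported by the kernel. In other words, the failure was reported on an ordinary 4 KiB write, and the result taskfile shows only DRDY with no error bits set.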

What I tried:

  1. I tried 3 different brands of HDD. All have the same issue.
  2. I suspected a hardware issue, so I replaced the motherboard and the SATA cables. Neither helped.
  3. I have another server with an identical configuration. The issue does not occur there under the same workload.
  4. I have yet another server with a completely different configuration (Intel vs. AMD). The issue does occur there under the same workload.
  5. I disabled NCQ via `echo 1 > /sys/block/sda/device/queue_depth`. It did not help.

I have run out of ideas.
These are all data center grade components. Given the steps I've taken, I suppose it's not a hardware manufacturing defect.
Could this be software/OS/BIOS related?
Any ideas what else I should try?

Michael Hampton: What are data center grade components? What is the HBA you are using? What is the motherboard? What is the RAM?
mike: There is no HBA. The disks connect directly to SATA ports on the MB. The motherboard is Supermicro MBD-X11SPM-F-O. RAM is Samsung DDR4-3200, 8GB, ECC RDIMM, 1Rx8, 288pin.
Michael Hampton: This still looks like a controller or cabling issue, but you might run `smartctl -a` on the disks to see if they have recorded errors.
mike: It does show errors, but they're cryptic to me. Not sure where to go from there. https://gist.github.com/ceecko/c74c2aafc7d0b7fa1f9ad9a71e7d4717. I suspected a controller or cabling issue, but since both were replaced, I think the chances of both being bad are slim...
Michael Hampton: You said you had multiple disks, but that gist shows the results for only one. Where are the rest of them?
mike: I have just updated the gist with all the disks, including the NVMe disk which is used as the source for the copy.
Michael Hampton: Only _one_ of the three disks is showing these errors. You should try replacing this disk.
mike: It does not seem to be the disk though. `/dev/sdc` is connected via `ata6` and is used as a boot disk. This disk has failed even though there's nothing in its SMART log. At that time, the disk with errors was mounted but not used. Do you think `/dev/sda` could have caused `/dev/sdc` to fail in such a way? As mentioned previously, these disks are the 3rd type of disks I tried. It would be a great coincidence to have a 3rd batch of disks with the same issues, I guess.
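Since the kernel log names ports (`ata6`) while tools such as `smartctl` and the `queue_depth` setting use block devices (`/dev/sda`, `/dev/sdc`), it helps to check which is which. A minimal sketch, assuming the usual libata sysfs layout in which the resolved path of `/sys/block/sdX` contains an `ataN` component; the function names are made up for illustration:

```python
import os
import re


def ata_port_for(block_name, sys_block="/sys/block"):
    """Return the 'ataN' port a block device hangs off, or None if not found."""
    # /sys/block/sdX is a symlink into /sys/devices/...; for libata devices
    # the resolved path is assumed to contain a component such as "ata6".
    real_path = os.path.realpath(os.path.join(sys_block, block_name))
    match = re.search(r"/(ata\d+)/", real_path)
    return match.group(1) if match else None


def map_ata_ports(sys_block="/sys/block"):
    """Map block device names (e.g. 'sdc') to ATA ports (e.g. 'ata6')."""
    mapping = {}
    if not os.path.isdir(sys_block):
        return mapping
    for name in sorted(os.listdir(sys_block)):
        port = ata_port_for(name, sys_block)
        if port is not None:
            mapping[name] = port
    return mapping


if __name__ == "__main__":
    for device, port in map_ata_ports().items():
        print(f"/dev/{device} -> {port}")
```

Devices that are not attached through libata (NVMe, for example) simply do not appear in the result. With the mapping in hand, per-disk settings such as `queue_depth` can be applied to the device that actually sits on the port named in the error.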
Score:1

Perhaps this is more a problem of operating temperature? When the disk is under constant load, its physical position in the chassis may mean it gains heat faster than it can lose it, leading to erratic behaviour.

On recent kernels (the drivetemp driver was added in Linux 5.6), drive temperature is exposed in sysfs under this path:

/sys/class/hwmon/*

Make sure the drivetemp module is loaded with `modprobe drivetemp`.

You could monitor the files in there while running a large file copy again. The kernel's hwmon sysfs documentation describes how these files are to be interpreted.

They include useful values such as the minimum and maximum operating temperatures. Some drivers also offer alarm indicators, which are chip-dependent alarms triggered on a fault.
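The monitoring described above could be sketched roughly like this. It is only an illustration: it assumes the standard hwmon convention that `temp1_input` and `temp1_max` hold millidegrees Celsius and that the driver reports its `name` as `drivetemp`. The function names and the 10-second polling interval are arbitrary choices.

```python
import glob
import os
import time


def read_int(path):
    """Read an integer from a sysfs file, or return None if it is unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def drive_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_name: {'temp_c': float, 'max_c': float or None}} for drivetemp sensors."""
    readings = {}
    for hwmon in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        try:
            with open(os.path.join(hwmon, "name")) as handle:
                name = handle.read().strip()
        except OSError:
            continue
        if name != "drivetemp":
            continue  # skip CPU, chipset and other sensors
        milli = read_int(os.path.join(hwmon, "temp1_input"))
        if milli is None:
            continue
        max_milli = read_int(os.path.join(hwmon, "temp1_max"))
        readings[os.path.basename(hwmon)] = {
            "temp_c": milli / 1000.0,
            "max_c": None if max_milli is None else max_milli / 1000.0,
        }
    return readings


def poll(samples=3, interval=10.0):
    """Print a few timestamped readings; run this while the large copy is in progress."""
    for index in range(samples):
        stamp = time.strftime("%H:%M:%S")
        for sensor, values in drive_temperatures().items():
            print(stamp, sensor, values)
        if index + 1 < samples:
            time.sleep(interval)


if __name__ == "__main__":
    poll()
```

Comparing the timestamps of these readings with the timestamps of the `ata6` exceptions in the kernel log would show whether the errors cluster around temperature peaks.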

Score:0

This seems to have been resolved by upgrading to Ubuntu 21.04, though I have no idea why. The server now runs stable without any ATA issues.
