ZFS disk error on write

Our ZFS backup pool is producing strange disk errors when writing data. This pool is replicated via DRBD to a second server with identical hardware, which is experiencing the same errors. This is why I don't think it's a hardware problem.

The setup is the following (on both servers):

  • Debian 10 server with Adaptec ASR 71605 RAID controller card in HBA mode. All disks are exposed as RAW disks.
  • There are two pools (all disks are datacenter SSDs):
    1. RAID-Z3 using eight disks, working without problems
    2. MIRROR using two disks, getting disk errors
  • The pools each have one ZFS volume created on them (compression=lz4)
  • The volumes are synchronized to the second server via DRBD (protocol C)
  • The block device exposed by DRBD has LVM volumes on it which are exposed to our hypervisors via iSCSI. The hypervisors (XCP-ng) manage their disks transparently on the iSCSI volumes.

All disks in the mirrored pool (on both servers) have experienced the following errors, not simultaneously but at different times:

Nov 10 18:00:09 st41 kernel: [240970.603991] sd 0:1:8:0: [sdi] tag#977 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Nov 10 18:00:09 st41 kernel: [240970.603997] sd 0:1:8:0: [sdi] tag#977 CDB: Write(10) 2a 00 a8 20 31 67 00 01 00 00
Nov 10 18:00:09 st41 kernel: [240970.604000] print_req_error: I/O error, dev sdi, sector 2820682087
Nov 10 18:00:09 st41 kernel: [240970.604065] zio pool=tank2 vdev=/dev/disk/by-id/ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0R101407-part1 error=5 type=2 offset=1444188179968 size=131072 flags=180880
Nov 10 18:00:10 st41 kernel: [240970.675209] aacraid: Host bus reset request. SCSI hang ?
Nov 10 18:00:10 st41 kernel: [240970.675272] aacraid 0000:82:00.0: outstanding cmd: midlevel-1
Nov 10 18:00:10 st41 kernel: [240970.675275] aacraid 0000:82:00.0: outstanding cmd: lowlevel-0
Nov 10 18:00:10 st41 kernel: [240970.675278] aacraid 0000:82:00.0: outstanding cmd: error handler-0
Nov 10 18:00:10 st41 kernel: [240970.675280] aacraid 0000:82:00.0: outstanding cmd: firmware-0
Nov 10 18:00:10 st41 kernel: [240970.675283] aacraid 0000:82:00.0: outstanding cmd: kernel-0
Nov 10 18:00:10 st41 kernel: [240970.675317] aacraid 0000:82:00.0: Controller reset type is 3
Nov 10 18:00:10 st41 kernel: [240970.675358] aacraid 0000:82:00.0: Issuing IOP reset
Nov 10 18:00:45 st41 kernel: [241005.856763] aacraid 0000:82:00.0: IOP reset succeeded
Nov 10 18:00:45 st41 kernel: [241005.879733] aacraid: Comm Interface type2 enabled
Nov 10 18:00:54 st41 kernel: [241014.950498] aacraid 0000:82:00.0: Scheduling bus rescan

The first four lines of the above log appear several times with different sectors and CDB Write(10) data, but are otherwise identical. This always occurs at the top of the hour, which is exactly when our backup scripts start writing to this pool.
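One quick way to confirm the hourly correlation is to pull the error timestamps out of the kernel log. This is a sketch, assuming the default Debian log path and the syslog timestamp format shown above (the log path can be overridden via LOG):

```shell
# Sketch: list the timestamps of the I/O write errors to confirm they
# cluster at the top of the hour (log path and format are assumptions).
LOG=${LOG:-/var/log/kern.log}
[ -r "$LOG" ] && grep -h 'print_req_error: I/O error' "$LOG" \
  | awk '{ print $1, $2, $3 }' \
  | sort -u
true
```

If every deduplicated timestamp lands within a few seconds of HH:00, that strengthens the link to the backup cron schedule rather than random disk behavior.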

I have tried updating the ZFS on Linux packages and the RAID controller firmware, and tried plugging the disks into different slots on the backplane. SMART reports for the disks show no errors at all (and the disks are relatively new).

Since this is occurring on both servers and with all four disks, I don't think it is a hardware problem with the disks or the RAID controllers.

The only configuration difference I have found between the disks in the two pools is that ARCCONF reports Write Cache: Enabled (write-back) for the mirrored pool's disks, but Write Cache: Disabled (write-through) for the RAID-Z3 pool's disks. I was unable to change this cache mode because ARCCONF says the disks are in RAW mode and don't support caching, so I'm not sure the reported setting can be trusted.
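Since ARCCONF's report may not apply to RAW-mode disks, the cache setting can also be read from the drives themselves. This is a sketch using sdparm (the device names are placeholders, and whether it works depends on the HBA passing SCSI/ATA commands through):

```shell
# Sketch: query each drive's own write-cache enable (WCE) bit instead of
# trusting the controller's report. Device names below are placeholders.
for d in /dev/sdh /dev/sdi; do
    [ -b "$d" ] && sdparm --get WCE "$d"
done
# If WCE is set, it could in principle be cleared to match the RAID-Z3 disks:
#   sdparm --clear WCE /dev/sdX
# For SATA drives, hdparm -W /dev/sdX reports the same setting.
true
```

Comparing the drives' own WCE bit between the two pools would show whether the write-back/write-through difference ARCCONF reports is real.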

I am not sure what to do next; any help is appreciated.
