Score:1

Replacement disk faults

A disk in my pool faulted (it raised too many errors).

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 52
  class: statechange
  state: FAULTED
   host: databank-a
   time: 2021-12-11 16:36:33-0500
  vpath: /dev/disk02_old
  vphys: pci-0000:00:1f.2-ata-4
  vguid: 0x73F7B0B1D1B45864
  devid: /dev/disk02_old
   pool: 0x47B3E7C1336F1F4F

So I replaced it with a brand-new disk (zpool replace pool /dev/foo /dev/bar), but then the new disk faulted too (my server kept going to sleep because I had stupidly enabled X Windows). I cleared the error (zpool clear pool /dev/bar), but then it happened again.

  pool: DATA01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 15 11:23:57 2021
        6.83T scanned at 256M/s, 5.80T issued at 217M/s, 9.08T total
        232G resilvered, 63.85% done, 0 days 04:24:05 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        DATA01                      DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            /dev/disk01             ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0  insufficient replicas
              8356341911383201892   UNAVAIL      0     0     0  was /dev/disk02_old
              /dev/disk02_new       FAULTED      0    81     0  too many errors  (resilvering)
            /dev/disk03             ONLINE       0     0     0
            /dev/disk04             ONLINE       0     0     0


errors: No known data errors

What are the chances that the drive is not at fault?

Score:0
What are the chances that the drive is not at fault?

It is possible that the drive is faulty. If the error counter is correct, dozens of errors in the first couple of TB of use is worse than expected. And because the errors came back after you cleared them, this is not a one-time transient event.
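One way to keep an eye on the error counters is to scan the config table of `zpool status` for nonzero READ, WRITE, or CKSUM values. The following is only a minimal Python sketch, not part of ZFS: the function name `devices_with_errors` and the regular expression are illustrative assumptions, and it assumes the counters are printed as plain integers, as in the status output from the question.

```python
import re

# Matches one device row of the `zpool status` config table:
#   NAME  STATE  READ  WRITE  CKSUM  [optional note]
# Assumption: counters are plain integers (no abbreviated values).
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+"
    r"(?P<state>ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)\s+"
    r"(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+)"
    r"(?:\s+(?P<note>.*\S))?\s*$"
)

def devices_with_errors(status_text):
    """Return the table rows whose READ, WRITE, or CKSUM counter is nonzero."""
    flagged = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or text outside the table
        counts = {k: int(m.group(k)) for k in ("read", "write", "cksum")}
        if any(counts.values()):
            flagged.append({
                "name": m.group("name"),
                "state": m.group("state"),
                **counts,
                "note": m.group("note") or "",
            })
    return flagged

# Config table copied from the question.
SAMPLE = """\
        NAME                        STATE     READ WRITE CKSUM
        DATA01                      DEGRADED     0     0     0
        raidz1-0                    DEGRADED     0     0     0
            /dev/disk01             ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0  insufficient replicas
            8356341911383201892     UNAVAIL      0     0     0  was /dev/disk02_old
            /dev/disk02_new         FAULTED      0    81     0  too many errors  (resilvering)
            /dev/disk03             ONLINE       0     0     0
            /dev/disk04             ONLINE       0     0     0
"""

for dev in devices_with_errors(SAMPLE):
    print(dev["name"], dev["state"], dev["write"], dev["note"])
```

On the table from the question, only /dev/disk02_new is flagged, with its 81 write errors. Running the same check after each `zpool clear` would show whether the counter starts climbing again.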

While Backblaze's failure data for consumer drives is not an exact match for what you have, it shows that early failures still occur. Even at a low early-failure rate, you could be the unlucky one in a few thousand who gets a less-than-perfect product.

Start a test restore of your important data from backups on separate media, in case you need it in the worst-case scenario. Make sure more spare disks are in stock. When the resilver finishes, examine the disks again, and keep replacing them as necessary.
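To know when the resilver has finished, the "scan:" section of `zpool status` can be checked for progress. This is a rough Python sketch under stated assumptions: the helper name `resilver_progress` is hypothetical, and it only understands the in-progress wording shown in the question's output.

```python
import re

def resilver_progress(status_text):
    """Return percent done as a float while a resilver is running, else None.

    None only means the in-progress wording was not found; it does not by
    itself prove that a resilver completed successfully.
    """
    if "resilver in progress" not in status_text:
        return None
    m = re.search(r"(\d+(?:\.\d+)?)% done", status_text)
    return float(m.group(1)) if m else None

# Scan section copied from the question.
SAMPLE = """\
scan: resilver in progress since Wed Dec 15 11:23:57 2021
        6.83T scanned at 256M/s, 5.80T issued at 217M/s, 9.08T total
        232G resilvered, 63.85% done, 0 days 04:24:05 to go
"""

print(resilver_progress(SAMPLE))
```

Once this returns None, look at the full `zpool status` output by hand before trusting the pool again.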
