TLDR;
My ZFS mirror pool got some checksum errors. I replaced the controller, thinking that was the most likely cause, but the errors won't clear. pool clear temporarily resets them, but they come back the next time I run a scrub. How can I clear them for good?
Full story:
I have had a ZFS mirror-0 set up and running on ubuntu 20.04.2 LTS for some time. When one of the drives died, I took advantage of the failure to replace both drives with larger ones, as well as adding a SATA-III PCI card for the new drives (the old ones had been connected to the on-board SATA II controller, as I had no more SATA III ports available). After running on the new drives and controller for a few weeks, ZFS complained about checksum errors on both new drives, and put the array into a "degraded" state as a result.
Some research led me to the conclusion that since both drives were showing the exact same number of checksum errors, it was much more likely to be an issue with the controller than with the drives themselves. So I pulled the new controller and put the drives back on the onboard SATA II controller for now, intending to replace the controller card once I verify that is the issue. I then deleted the two files that zpool status -v
showed as having permanent errors, issued a zpool clear data
to reset the errors, and ran a scrub.
Unfortunately, after the scrub the errors re-appeared, only now a -v
no longer showed a file, but just the address (inode, I believe), presumably for one of the files I had deleted earlier. I tried again, with the same result. Every time I run a scrub, it comes back with the following result:
root@watchman:~# zpool status -v
pool: data
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 16K in 0 days 09:10:20 with 1 errors on Sat Jul 24 15:48:21 2021
config:
NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-ST8000VE000-2P6101_WSD1M5NW DEGRADED 0 0 15 too many errors
ata-ST8000VE000-2P6101_WSD1HEJX DEGRADED 0 0 15 too many errors
errors: Permanent errors have been detected in the following files:
data:<0x380508>
From what I can tell, this is just the same issue that already existed due, presumably, to the bad controller, but I can't seem to clear it out. How can I restore my mirror to a fully-functioning state?
UPDATE: I finally gave up on the idea of clearing the errors, and instead started over. I created a new pool, stealing one of the drives from the existing mirror. I then ran a rsync
to copy all the data over from the old pool to the new. This did run into a few errors (zfs wasn't lying about data errors), but nothing significant or troubling, and excluding the errored files allowed rsync to complete successfully. I then added the second drive to the new pool, and after a resilver everything now looks good, and a scrub on the new pool completed without error.
So assuming everything continues to look good for the next week or so, I think it's safe to assume the SATA III card was the cause of the issue, and replace it with a better brand/option :)