Score:1

Clear ZFS Checksum errors?


TL;DR: My ZFS mirror pool got some checksum errors. I replaced the controller, thinking that was the most likely cause, but the errors won't clear. zpool clear temporarily resets them, but they come back the next time I run a scrub. How can I clear them for good?

Full story: I have had a ZFS mirror (mirror-0) set up and running on Ubuntu 20.04.2 LTS for some time. When one of the drives died, I took advantage of the failure to replace both drives with larger ones, and added a SATA III PCI card for the new drives (the old ones had been connected to the on-board SATA II controller, as I had no more SATA III ports available). After running on the new drives and controller for a few weeks, ZFS complained about checksum errors on both new drives, and put the array into a "degraded" state as a result.

Some research led me to the conclusion that since both drives were showing the exact same number of checksum errors, it was much more likely to be an issue with the controller than with the drives themselves. So I pulled the new controller and put the drives back on the onboard SATA II controller for now, intending to replace the controller card once I verify that is the issue. I then deleted the two files that zpool status -v showed as having permanent errors, issued a zpool clear data to reset the errors, and ran a scrub.
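The clear-and-rescrub sequence described above can be sketched as a small shell helper. The function name clear_and_rescrub is hypothetical; the pool name is passed as an argument (data in this question). The scrub runs in the background, so the final status call only shows that it has started.

```shell
# Hypothetical helper: reset a pool's error counters, then start a new scrub.
clear_and_rescrub() {
    pool="$1"
    zpool clear "$pool" || return 1    # reset the READ/WRITE/CKSUM counters
    zpool scrub "$pool" || return 1    # start a scrub; it runs in the background
    zpool status -v "$pool"            # show scrub progress and any listed errors
}
```

Usage would be clear_and_rescrub data, followed by another zpool status -v data once the scrub has finished.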

Unfortunately, after the scrub the errors reappeared, only now -v no longer showed a file name, just an address (an inode, I believe), presumably for one of the files I had deleted earlier. I tried again, with the same result. Every time I run a scrub, it comes back with the following result:

root@watchman:~# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 16K in 0 days 09:10:20 with 1 errors on Sat Jul 24 15:48:21 2021
config:

    NAME                                 STATE     READ WRITE CKSUM
    data                                 DEGRADED     0     0     0
      mirror-0                           DEGRADED     0     0     0
        ata-ST8000VE000-2P6101_WSD1M5NW  DEGRADED     0     0    15  too many errors
        ata-ST8000VE000-2P6101_WSD1HEJX  DEGRADED     0     0    15  too many errors

errors: Permanent errors have been detected in the following files:

        data:<0x380508>

From what I can tell, this is just the same issue that already existed due, presumably, to the bad controller, but I can't seem to clear it out. How can I restore my mirror to a fully-functioning state?
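To compare the per-device checksum counts in output like the above, a minimal sketch is to filter the status text with awk. The function name cksum_counts is hypothetical, and the filter assumes the counters are plain integers; it would miss abbreviated values if zpool prints them that way for large counts.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print
# "<name> <CKSUM count>" for every config row whose READ, WRITE and
# CKSUM columns (fields 3 to 5) are plain integers.
cksum_counts() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { print $1, $5 }'
}
```

Running zpool status data | cksum_counts against the output above would list the pool and mirror-0 rows with 0 and both ata-ST8000VE000 devices with 15.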

UPDATE: I finally gave up on the idea of clearing the errors, and instead started over. I created a new pool, stealing one of the drives from the existing mirror. I then ran an rsync to copy all the data over from the old pool to the new one. This did run into a few errors (ZFS wasn't lying about data errors), but nothing significant or troubling, and excluding the errored files allowed rsync to complete successfully. I then added the second drive to the new pool, and after a resilver everything now looks good, and a scrub on the new pool completed without error.
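The migration in this update can be sketched roughly as follows. This is an illustration under assumptions, not the exact commands that were run: the function name migrate_mirror, the new pool name, and the exclude-list file are hypothetical, both pools are assumed to be mounted at their default /poolname mountpoints, and zpool create may additionally need -f if the detached disk still carries an old label. The zpool destroy step is irreversible and should only run once the copy has been verified.

```shell
# Hypothetical sketch: move data from a damaged two-disk mirror to a fresh pool.
# Arguments: old pool, new pool, disk kept in the old pool, disk moved first,
#            file listing paths that rsync should skip.
migrate_mirror() {
    old="$1"; new="$2"; disk_a="$3"; disk_b="$4"; excludes="$5"
    zpool detach "$old" "$disk_b" || return 1       # take one disk out of the old mirror
    zpool create "$new" "$disk_b" || return 1       # build a single-disk pool on it
    rsync -a --exclude-from="$excludes" "/$old/" "/$new/" || return 1  # copy, skipping damaged files
    zpool destroy "$old" || return 1                # DESTRUCTIVE: frees the remaining disk
    zpool attach "$new" "$disk_b" "$disk_a"         # turn the new pool into a mirror; starts a resilver
}
```

Note that turning a single-disk pool into a mirror uses zpool attach; zpool add would instead create a second top-level vdev with no redundancy.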

So assuming everything continues to look good for the next week or so, I think it's safe to assume the SATA III card was the cause of the issue, and replace it with a better brand/option :)

djdomi: I believe it's time for a backup and a check for faulty hardware.
ibrewster: @djdomi Yes, I believe the controller was faulty. I have pulled it, but without being able to clear the current errors, it is somewhat difficult to confirm if that was indeed the case.
Brian Thomas: I'm going through a similar error on a raidz3 pool while upgrading to larger drives. Both new, larger drives now show the same checksum count (26) after the second new drive resilvered successfully. A reboot and a scrub won't clear it. (There is no fault; the pool is only degraded because I have pulled another drive for replacement, and -v shows no errors. I haven't tried to clear it yet, as I'm trying to work out the best course of action.)
Score:0

From time to time I also get checksum errors on a mirror-0 vdev, mostly after a reboot, and the status of the ZFS pool shows as degraded.

zpool status <poolname>


To fix this and clear the errors, I run:

zpool clear <poolname>