Score:2

ZFS Pool Data Backup and Restore

cn flag

I currently have a ZFS raidz2 pool stuck in a resilvering loop while trying to replace its 3TB disks with 8TB disks. After letting the first replacement disk resilver online for over a week, it finally finished, only to immediately start again. After marking the disk OFFLINE, the second resilver completed in about 2 days. I marked the disk online and everything looked good (for a couple of minutes), so I replaced the second disk. Once the resilver started for the second disk, it showed that the first disk was also resilvering again. I'm now on my 3rd or 4th cycle of resilvering for these two drives, and with two disks resilvering I have no fault tolerance.

At this point I would like to back up the zpool to an NFS share and recreate it with the new drives, but I don't want to lose my dataset configuration, which includes all of my jails. Is there a way to export the whole zpool as a backup image that can somehow be restored? The other machine with sufficient disk space to hold all this data already uses a different filesystem, so ZFS replication is probably not an option.

This is a TrueNAS-12.0-U4 installation. The backup machine is running Ubuntu 21.04 with LVM/ext4. Below is the current pool status.


  pool: pool0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 29 00:39:12 2021
    13.8T scanned at 273M/s, 13.0T issued at 256M/s, 13.8T total
    2.17G resilvered, 93.77% done, 00:58:48 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    pool0                                           DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/55bf3ad6-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
        gptid/55c837e3-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
        gptid/55f4786c-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
        gptid/60dcf0b8-eef3-11eb-92f9-3cecef030ab8  OFFLINE      0     0     0  (resilvering)
        gptid/56702d96-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
        gptid/5685b5f7-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
        gptid/8f041954-eef3-11eb-92f9-3cecef030ab8  OFFLINE      0     0     0  (resilvering)
        gptid/56920c3a-3747-11eb-a0da-3cecef030ab8  ONLINE       0     0     0
    cache
      gptid/56256b6a-3747-11eb-a0da-3cecef030ab8    ONLINE       0     0     0

errors: No known data errors
Score:1
ca flag

You can use `zfs snapshot -r pool0@backup` followed by `zfs send -R pool0@backup > zfs.img` to create a replicated send stream, which you can later restore with `zfs recv`.
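Spelled out as a full round trip, that looks roughly like the following sketch. The NFS mount path, the compression step, and receiving back into a pool named `pool0` are illustrative assumptions, not part of the original answer:

```shell
# On the TrueNAS host: take a recursive snapshot of every dataset in the
# pool, then send a replicated stream. -R includes descendant datasets,
# their snapshots, and their properties (so jail datasets come along).
zfs snapshot -r pool0@backup
zfs send -R pool0@backup | gzip > /mnt/nfs-backup/pool0-backup.zfs.gz

# After destroying and recreating the pool on the new disks, restore the
# stream. -F forces a rollback/overwrite of the target; this assumes the
# recreated pool keeps the name pool0.
gunzip -c /mnt/nfs-backup/pool0-backup.zfs.gz | zfs recv -F pool0
```

Writing the stream to a plain file works, but note that a corrupted stream file cannot be partially restored, which is one reason receiving directly into a live pool (as discussed below) is often preferred.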

That said, it seems similar to the issue described here. You can also try disabling deferred resilvering via the `zfs_resilver_disable_defer` tunable.

Jason avatar
cn flag
That's better than what I was going to do. I installed ZFS on the other box and created a pool on a sparse-file vdev that I was going to replicate to, so this will save me space and a step. I'm trying `zfs_resilver_disable_defer` now. The weird thing is that once the resilver finishes with the devices offline, it shows as finished and the message changes to "one or more devices were taken offline...", but bringing them back online restarts the resilver. I'm also testing a scrub while the two drives are still offline, and it will let me scrub.
shodanshok avatar
ca flag
@Jason using a sparse-vdev zpool is a good solution, as it enables you to immediately receive the pool and check that all files are present.
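For reference, a pool backed by a sparse file on the ext4 machine can be sketched like this. The file path, size, and pool name are assumptions for illustration:

```shell
# Create a sparse backing file: it consumes disk space only as data is
# actually written, so it can be nominally larger than the data set.
truncate -s 24T /srv/backup/pool0-backing.img

# Build a single-vdev pool on the file. No redundancy -- this is backup
# staging only, and lz4 compression helps the sparse file stay small.
zpool create -O compression=lz4 backuppool /srv/backup/pool0-backing.img

# The replicated stream can then be received directly:
#   zfs send -R pool0@backup | ssh backuphost zfs recv -F backuppool
```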
Jason avatar
cn flag
Does that tunable do anything if the feature is already enabled for the pool? I tried to enable it in the web UI on the tunables page, but it doesn't seem to have done anything; putting the disks online immediately turned my scrub into another resilver. Maybe I'm doing it wrong. How do I set that tunable properly?
shodanshok avatar
ca flag
It *should* be applicable with the relevant feature enabled for the pool, but I am not sure it will solve your issue. You can check whether it is enabled from the command line by issuing `sysctl -a | grep disable_defer`
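On TrueNAS 12 (FreeBSD-based), OpenZFS tunables are exposed under the `vfs.zfs` sysctl tree, so checking and setting this one from a shell would look roughly like the sketch below. The exact OID name is an assumption and should be confirmed with the grep first:

```shell
# Find the exact name of the tunable on this system.
sysctl -a | grep disable_defer

# If it appears as vfs.zfs.resilver_disable_defer, disable deferred
# resilvering for the running system (not persistent across reboots;
# for persistence, add it as a sysctl-type tunable in the TrueNAS UI).
sysctl vfs.zfs.resilver_disable_defer=1
```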
Jason avatar
cn flag
I think I misunderstood what was happening when a drive showed OFFLINE (resilvering). It looks like it was just scanning the online drives while that was happening, not leaving the disk out of read operations and resilvering it. After further research it looks like that's the intended functionality, so I'm going to let it finish the estimated 10 day online resilver for these drives and see where we land before taking down my main NAS.
Jason avatar
cn flag
I did kick off a snapshot backup over to the sparse file pool on the other machine, which will take a couple of days to complete, so we'll see if I lose my patience trying to replace 6 more disks 10 days at a time.