btrfs - failing disk generated checksum errors, disk replaced, errors remain

I had a pair of 3TB disks in a btrfs raid1 array.

One of these disks started failing (smartd shows bad sectors), and so I bought a pair of new 8TB drives to replace both disks in the array.

I replaced both with btrfs replace and ran a btrfs balance afterwards, which fails with the following messages:

[ 5063.136378] BTRFS error (device sdc): parent transid verify failed on 5153170751488 wanted 1433374 found 1417912
[ 5063.140428] BTRFS error (device sdc): parent transid verify failed on 5153170751488 wanted 1433374 found 1417912
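
For readers who want to tally such messages from a longer log, here is a minimal Python sketch. The regular expression is an assumption derived only from the two lines quoted above (other kernel versions may format the message differently), and the name `parse_transid_errors` is hypothetical, not part of any btrfs tooling.

```python
import re
from collections import Counter

# Pattern for the kernel message quoted above. This regex is an assumption
# based on those two lines, not an official btrfs log format specification.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): "
    r"parent transid verify failed on (?P<bytenr>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_errors(log_text):
    """Return one dict per 'parent transid verify failed' line in log_text."""
    errors = []
    for line in log_text.splitlines():
        match = TRANSID_RE.search(line)
        if match:
            errors.append({
                "device": match.group("device"),
                "bytenr": int(match.group("bytenr")),
                "wanted": int(match.group("wanted")),
                "found": int(match.group("found")),
            })
    return errors


# The two lines from the question, used as sample input.
sample = """\
[ 5063.136378] BTRFS error (device sdc): parent transid verify failed on 5153170751488 wanted 1433374 found 1417912
[ 5063.140428] BTRFS error (device sdc): parent transid verify failed on 5153170751488 wanted 1433374 found 1417912
"""

errors = parse_transid_errors(sample)

# Count how many messages refer to each logical block address.
blocks = Counter(e["bytenr"] for e in errors)
for bytenr, count in blocks.items():
    print(f"block {bytenr}: {count} message(s)")
```

On the sample above, both messages point at the same block, and the generation found on disk is older than the one the parent expects.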

I saw exactly these messages before replacing the disks, but now that both disks have been replaced, I believe the problem lies with btrfs itself rather than the hardware.

My data is fully backed up and the filesystem is online and working properly, but I cannot perform a balance due to this error. Running a scrub produces a small number of uncorrectable errors, just as it did before I replaced the disks.

I was wondering how I could, perhaps:

  1. Find out which files are corrupted and restore them from a backup
  2. Reset the transaction on the filesystem to remove the errors
  3. Ignore the errors while balancing

...or any other reasonable solution.

Thanks!

paladin: It might be a bit late, but I want to explain something about btrfs that you do not seem to know. In contrast to many other filesystems, btrfs checksums not only the metadata but also the data itself. Usually, when btrfs detects a filesystem error, it automatically tries to fix it. Fixing an error means using the second copy from a DUP or RAID1 profile. If no such copy is available, btrfs can only notify the system that a file is corrupt. At that point the system admin should restore the lost data from a real backup. What you have done is ignore data loss.
paladin: Next time you see such an error, remember that it is not a btrfs bug: your data is corrupted and you should recover it from backup, if possible. In contrast, ext4 and other filesystems only try to keep their metadata consistent. It is entirely possible to lose data on ext4 without knowing it. btrfs, on the other hand, knows when it has lost data, and that is a key advantage over ext4.
dkd6: Hi, thanks for clarifying. What I eventually ended up doing was restoring the data from a backup onto a newly formatted filesystem. Looking at similar posts online, I could see that in most cases `dmesg` shows the path of the corrupt files discovered, yet in my case I could only see the `parent transid verify failed` errors, which I find confusing...

I've made a few extra attempts to solve this and eventually only a clean filesystem reformat solved the issue.

Once I had transferred the data off the disks, I tried two dangerous commands, `btrfs check --init-csum-tree` and `btrfs check --repair`. Neither did any harm, but neither solved the issue.

After reformatting, I transferred the data back onto the filesystem, ran a btrfs balance and a btrfs scrub, and now everything is working again.
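
The post-restore check described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the mount point `/mnt/data` and the function names are hypothetical, and it assumes a btrfs-progs version that accepts `btrfs balance start --full-balance` and `btrfs scrub start -B` (run in the foreground). It defaults to a dry run that only prints the commands.

```python
import subprocess


def verification_commands(mount_point):
    """Build the balance and scrub command lines for a mounted btrfs filesystem."""
    return [
        # Full balance; --full-balance is assumed to be supported by the installed btrfs-progs.
        ["btrfs", "balance", "start", "--full-balance", mount_point],
        # -B keeps the scrub in the foreground so the script waits for it to finish.
        ["btrfs", "scrub", "start", "-B", mount_point],
    ]


def run_verification(mount_point, dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    for cmd in verification_commands(mount_point):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a command exits non-zero.
            subprocess.run(cmd, check=True)


# "/mnt/data" is a placeholder mount point, not taken from the original post.
run_verification("/mnt/data")
```

Calling `run_verification("/mnt/data", dry_run=False)` as root would actually run both steps; a full balance can take a long time on large filesystems.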

Cheers!
