Score:0

My ZFS pool seems to be self-destructing, any ideas?

Context

I recently noticed my FreeNAS reporting issues with one drive. It had about 16 bad sectors, and I went through the SMART tests etc. I bought a new drive of the same capacity and went to install it, and for some reason the power adapter for one of the other drives came partly loose. That left me with 4 out of 6 drives in the RAID-Z2 array, or basically no redundancy.

The array started resilvering, never completed, and always told me there were too many errors (14k+). I tracked down the loose power adapter, since it was unlikely that two drives had actually failed, especially with the second one failing right after I opened the case. I plugged it back in, and ZFS couldn't do anything with that drive.

I ended up replacing the old drive with itself (it was the same drive, matched on GPT / smartctl / zpool, but ZFS somehow couldn't recognize it), and ZFS went back to resilvering.

Of course, the pool still has all the same errors, and now a third drive is also resilvering for no apparent reason. I ran a few zpool clears and scrubs, and it's still resilvering all day, every day. It fails, I clear, it resilvers some more, and it's going nowhere.

I'm deeply disappointed in ZFS's inability to recover from this relatively low-risk situation, in which only one drive ever actually failed and was promptly replaced. That said, the NAS and its main and only share are still usable, and I had made a backup after the first disk failure anyway.

Question

Is there any way to make ZFS understand that this pool is just fine, that it should just resilver the two new drives (one of which is an old one that I wiped to help ZFS understand it could use it), and that it should stop telling me about those errors?

Like a resilver -force -scrub_later -everything_is_obviously_fine -or_i_couldnt_possibly_use_the_share -just_mark_it_all_online -lets_get_back_to_actual_work_now ?

Rambling

I'm kind of worried, as right now it claims to be resilvering 3 out of 6 drives in a raidz2 pool that clearly has usable data in it, which I seriously doubt anyone can even do.

I'm expecting it to bump that up to 4 drives soon, or maybe all 6 why not, recreating all my data out of residual magnetic dust buildup from the air surrounding the hard drives.

Any suggestion is appreciated. Thank you!

djdomi:
Did you check the SMART values of all drives? To me this setup looks like home equipment.
Morg.:
SMART values on all drives are normal, no errors. Just the one drive had its 16 bad sectors and that was that. The old drives were Seagate, but not NAS-intended like the IronWolf I used to replace the failing drive. You could call it home equipment. I believe it's the ideal hardware to go with six 4TB drives and FreeNAS, although it does have the drawback of not having ECC RAM.
Score:1

I never got an answer, and things got worse before they got better. Overall, after at least a dozen resilverings, scrubs, clears, removal of files that contained errors, and reboots, it ended up back online.

All in all, I think this mostly means that ZFS likes to give big warnings and that the zpool status output is not exactly clear. For one, resilvering 3 drives out of 6 in a raidz2 is not physically possible.
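
To spell out the counting behind that last point, here is a minimal Python sketch. The function names are made up for illustration, and it only counts drives that are entirely gone. A drive shown as resilvering may still hold most of its data, which this does not model, and it says nothing about how ZFS decides what to print in zpool status.

```python
# Counting sketch only: raidz2 keeps two parity blocks per stripe, so a
# raidz2 vdev can lose at most two member drives and still serve data.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def spare_redundancy(level: str, drives_missing: int) -> int:
    """Return how many more drives the vdev could lose.

    A negative result means more drives are gone than parity can cover.
    """
    return RAIDZ_PARITY[level] - drives_missing


def still_readable(level: str, drives_missing: int) -> bool:
    """True while the number of missing drives is within the parity count."""
    return spare_redundancy(level, drives_missing) >= 0
```

With the 6-drive raidz2 from the question, `spare_redundancy("raidz2", 2)` is 0 (the "basically no redundancy" state), and `still_readable("raidz2", 3)` is false.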

But mostly, as long as your data is still available and everything looks OK from a share usage standpoint, it'll probably end up OK like it did here. Just keep on rebooting, scrubbing, clearing, and dealing with files that have checksum errors.
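
For reference, one pass of that loop maps roughly onto three standard commands: `zpool status -v` (which also lists files with errors), `zpool clear`, and `zpool scrub`. Below is a minimal Python sketch of such a pass. The pool name `tank` and both function names are assumptions for illustration (the pool is never named in this thread), and this is not a guaranteed fix, just the sequence described above.

```python
import subprocess


def recovery_commands(pool: str) -> list[list[str]]:
    """Build the argv lists for one status / clear / scrub pass."""
    return [
        ["zpool", "status", "-v", pool],  # show state and files with errors
        ["zpool", "clear", pool],         # reset the pool's error counters
        ["zpool", "scrub", pool],         # start a scrub of the whole pool
    ]


def run_recovery_pass(pool: str) -> None:
    """Run the commands in order, stopping at the first one that fails."""
    for argv in recovery_commands(pool):
        subprocess.run(argv, check=True)
```

Calling `run_recovery_pass("tank")` needs root on the NAS. Deleting or restoring the files that `zpool status -v` lists is still a manual step.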

ewwhite:
This is clearly low-end equipment in a home environment. You shouldn't make assumptions about the inability of ZFS to recover when you didn't provide a robust or stable computing environment in the first place.
Morg.:
@ewwhite You shouldn't make assumptions about the quality of hardware when the software running on it is designed to make up for hardware inadequacies. The only argument that would make sense is the lack of ECC RAM; anything else is absolutely irrelevant. The messages sent by ZFS were completely wrong and there is no excuse for that, no matter the hardware you're running it on. I have to wonder what you think hardware is made of when you're criticizing a chip that has zero defects and a motherboard that's better than most enterprise servers in SMD quality, with a comparable PCB.