I need to test my Dell SAS controller card. Since last July I have been seeing far more errors on a RAIDZ2 installation than seems plausible. It's as if one drive after another keeps going off the rails.
I have a supposedly "Dell" 9207-8l that I bought on eBay back in Jul/Aug 2020: https://www.ebay.com/itm/132663136462
I have never been able to enter its configuration utility. It says to press Ctrl + C to enter config. I've tried the left and the right Ctrl key plus c, and also with a capital C since that is how it is spelled on screen. It says it will enter configuration after setup, but it never does; it just goes straight to the BIOS if Del was pressed, or boots normally otherwise.
I run ZFS on Linux on RHEL x64. Yesterday took the cake: I had to pull out six 2TB devices and build three 3TB LVM volumes (so far) to support the failing system while I go through a sort of RMA hell.
# zpool status
  pool: nas
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 1 05:41:15 2021
        665G scanned at 24.5M/s, 640G issued at 23.6M/s, 9.78T total
        182G resilvered, 6.40% done, 4 days 16:52:09 to go
config:

        NAME                            STATE     READ WRITE CKSUM
        nas                             DEGRADED     0     0     0
          raidz2-0                      DEGRADED     0     0     0
            scsi-35000c50093a9052f      DEGRADED     0     0    52  too many errors
            replacing-1                 DEGRADED     0     0    52
              scsi-35000c50084818db7    OFFLINE      0     0     0
              lvzfs2-lvzfsvol2          ONLINE       0     0     0  (resilvering)
            scsi-35000c50093a9182b      DEGRADED   235   636    52  too many errors
            scsi-350000c0f01e5dabc      DEGRADED     0     0    60  too many errors
            scsi-35000c5008491a803      DEGRADED     0     0    53  too many errors (resilvering)
            replacing-5                 DEGRADED     0     0    52
              scsi-35000c50084889cf3    OFFLINE      0     0     0
              lvzfs1-lzfsvol1           ONLINE       0     0     0  (resilvering)
            scsi-35000c50093a8dfe7      DEGRADED     0     0    52  too many errors
        spares
          lvzfs3-lvzfsvol3              AVAIL

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
root@merlin ~$
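To read the counters above at a glance, here is a minimal Python sketch. It is only an illustration: the helper names `parse_rows` and `devices_with_io_errors` are made up, the sample rows are copied from the output above, and the regex assumes the counters are plain integers (it would skip abbreviated values such as `1.2K`). It separates disks that report READ/WRITE errors from disks that only report checksum errors.

```python
import re

# A subset of the device rows from the `zpool status` output above.
SAMPLE = """\
scsi-35000c50093a9052f DEGRADED 0 0 52 too many errors
scsi-35000c50093a9182b DEGRADED 235 636 52 too many errors
scsi-350000c0f01e5dabc DEGRADED 0 0 60 too many errors
scsi-35000c5008491a803 DEGRADED 0 0 53 too many errors (resilvering)
scsi-35000c50093a8dfe7 DEGRADED 0 0 52 too many errors
"""

# Device name, state, then READ / WRITE / CKSUM as plain integers.
# Assumption: counters are not abbreviated (e.g. "1.2K" will not match).
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\b")


def parse_rows(text):
    """Return {device: (read, write, cksum)} for every matching row."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, _state, rd, wr, ck = match.groups()
            counters[name] = (int(rd), int(wr), int(ck))
    return counters


def devices_with_io_errors(counters):
    """Devices with nonzero READ or WRITE counts, not just checksum errors."""
    return sorted(n for n, (rd, wr, _ck) in counters.items() if rd or wr)


counters = parse_rows(SAMPLE)
print("READ/WRITE errors:", devices_with_io_errors(counters))
print("CKSUM counts:", sorted(ck for _rd, _wr, ck in counters.values()))
```

On these rows only one disk has READ/WRITE errors, while every disk carries a checksum count between 52 and 60. That near-identical spread is what makes me doubt that the drives are all failing individually.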
This resilvering has been going on for the last month or two, in one way or another. Things actually looked good for short periods, until the next drive failed, or a previously failed drive (wiped with dd from /dev/zero) failed again.
It's literally driving me nuts, and scaring me at the same time, since this data is very important: family photos going back to the 1970s and earlier, etc.
Help please?
EDIT: I added a comment about how I'm actually using the drives here: https://www.reddit.com/r/audiophile/comments/bxw38m/bass_vibrations_and_computer_hard_drives/hnvbyj0/ as I was also concerned that the HardHouse and Tidy Tracks are rocking the drives apart with a few subwoofers. I will consider relocating the server out of the office and into the garage. I've also managed to create a new ZFS pool using the SATA ports and the old 2TB drives, with no issues yet. I'm still in the middle of resilver hell, even though I have tuned things and even moved a few datasets off to the other pool.
root@merlin ~$ zpool status
  pool: bak
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        bak                                           ONLINE       0     0     0
          ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332    ONLINE       0     0     0
          ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P0891973  ONLINE       0     0     0

errors: No known data errors

  pool: nas
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 6 11:08:12 2021
        7.84T scanned at 37.5M/s, 7.84T issued at 37.5M/s, 9.78T total
        3.39T resilvered, 80.16% done, 0 days 15:03:25 to go
config:

        NAME                            STATE     READ WRITE CKSUM
        nas                             DEGRADED     0     0     0
          raidz2-0                      DEGRADED     0     0     0
            scsi-35000c50093a9052f      DEGRADED     0     0     0  too many errors
            replacing-1                 ONLINE       0     0     0
              scsi-35000c50084818db7    ONLINE       0     0     0  (resilvering)
              lvzfs2-lvzfsvol2          ONLINE       0     0     0  (resilvering)
            replacing-2                 DEGRADED     0     0     0
              17084797086424522076      UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-35000c50093a9182b-part1
              scsi-350000c0f012efb7c    ONLINE       0     0     0  (resilvering)
            scsi-350000c0f01e5dabc      DEGRADED     0     0     0  too many errors (resilvering)
            scsi-35000c5008491a803      DEGRADED     0     0     0  too many errors
            replacing-5                 DEGRADED     0     0     0
              scsi-35000c50084889cf3    DEGRADED     0     0     0  too many errors (resilvering)
              lvzfs1-lzfsvol1           DEGRADED     0     0     0  too many errors (resilvering)
            scsi-35000c50093a8dfe7      DEGRADED     0     0     0  too many errors

errors: 2 data errors, use '-v' for a list
FYI, there were checksum errors, but those cleared after a reboot.
The errors are so minimal that I'm seriously impressed by ZFS's robustness, which is why I continue to use it for my main backups. It's better than backing up to a single disk.
Another mistake I made: I have had 5 new drives sitting here for almost a week now, but I can't use them until the LVM volumes finish resilvering. I'm so close to data errors that I want that to finish first. I'm kicking myself for running that replace operation haphazardly instead of waiting for the drives to arrive in the mail, not realizing it would take weeks to finish all these resilver operations. Crazy madness!
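For a sense of where those resilver estimates come from, here is a rough Python sketch that recomputes the "to go" figure from the two `scan:` blocks above. It rests on two assumptions that are not confirmed anywhere in this post: that the estimate is simply (total minus issued) divided by the issue rate, and that the size suffixes are binary (1024-based). The function names are made up for illustration.

```python
# Binary multipliers for zpool-style size suffixes (an assumption).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a size such as '9.78T' or a rate such as '23.6M/s' to bytes."""
    size = size.split("/")[0]  # drop a trailing "/s" if present
    return float(size[:-1]) * UNITS[size[-1]]


def eta_seconds(total, issued, rate):
    """Remaining time, assuming ETA = (total - issued) / issue rate."""
    return (to_bytes(total) - to_bytes(issued)) / to_bytes(rate)


def fmt(seconds):
    """Format seconds as 'D days HH:MM'."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days} days {hours:02d}:{rem // 60:02d}"


# Figures from the Dec 1 and Dec 6 outputs above.
print(fmt(eta_seconds("9.78T", "640G", "23.6M/s")))
print(fmt(eta_seconds("9.78T", "7.84T", "37.5M/s")))
```

Both results land within a few minutes of what `zpool status` printed (4 days 16:52:09 and 0 days 15:03:25), so at roughly 24M/s a full pass over 9.78T takes the better part of five days, and every restarted resilver pays that again.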
I had done some math for Glacier storage ($0.004/GB per month): that would cost $20 a month for 5TB, which is not an option. And if I ever needed that data back, forget it, given the egress costs. Also, I really enjoy using ZFS on a home server for this. The previous array's drives were 8 years old when taken down; I used only known-bad drives for the entire array and still managed to Z2 my way to success. I figured a fresh set of refurbished or renewed drives would solve this issue. Sorry, I guess I'm venting. I would like to hear more about what I may be doing wrong, though...
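The Glacier arithmetic, spelled out as a small Python sketch. It only covers storage at the rate quoted above and assumes 1TB = 1000GB; retrieval and egress charges are deliberately left out because I have no figures for them here. `monthly_storage_cost` is a made-up helper name.

```python
def monthly_storage_cost(size_tb, price_per_gb_month, gb_per_tb=1000):
    """Storage-only monthly cost; retrieval and egress are NOT included."""
    return size_tb * gb_per_tb * price_per_gb_month


# 5TB at the $0.004 per GB per month quoted above.
cost = monthly_storage_cost(5, 0.004)
print(f"${cost:.2f} per month, ${cost * 12:.2f} per year")
```

That works out to about $20 a month, or roughly $240 a year, before a single byte is ever read back.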
By the way, I did get the company (deepdiscountserver in this case) to pay for the drive replacements, which are different models. No more IBM refurbs for me. I'm going to test out the HGST drives, since those have worked well for me in the past.
Rest assured, a new SAS card will be on the way if those show issues too, once this resilver madness ends, if it ever does. If not, I'll have to do another full backup, which takes almost as long as the resilver... At least I already moved the live data off the array, so no loss will occur unless I lose my main drives on another system during all this. I guess I can say the data is majorly important, but I still have a copy of it, so I can stand a loss for now. To clarify the "important" part: it will become URGENT if the ZFS array does start spouting errors, because I'll only have one drive here and there containing the master copies.