
How can I test my SAS controller card?

by flag

I am in need of testing my Dell SAS Controller card. Since last July I have been seeing more errors on a RAIDZ2 installation than can possibly be real. It's as if one drive after another keeps going off the rails.

I have a supposedly "Dell" 9207-8i, which I got from eBay back in Jul/Aug 2020. I have never been able to enter its configuration utility. At boot it says to press Ctrl+C to enter config. I've tried left and right Ctrl plus c, and also capital C, since that is how it is spelled. It says it will enter configuration after setup, but it never does; it just goes straight to the BIOS if Del was pressed, or boots otherwise.

I run ZFS on Linux, on RHEL x64. Yesterday took the cake: I had to pull out six 2TB devices and build (so far) three 3TB LVM volumes to support the failing system, while I go through a sort of RMA hell.

# zpool status
  pool: nas
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec  1 05:41:15 2021
        665G scanned at 24.5M/s, 640G issued at 23.6M/s, 9.78T total
        182G resilvered, 6.40% done, 4 days 16:52:09 to go

        NAME                          STATE     READ WRITE CKSUM
        nas                           DEGRADED     0     0     0
          raidz2-0                    DEGRADED     0     0     0
            scsi-35000c50093a9052f    DEGRADED     0     0    52  too many errors
            replacing-1               DEGRADED     0     0    52
              scsi-35000c50084818db7  OFFLINE      0     0     0
              lvzfs2-lvzfsvol2        ONLINE       0     0     0  (resilvering)
            scsi-35000c50093a9182b    DEGRADED   235   636    52  too many errors
            scsi-350000c0f01e5dabc    DEGRADED     0     0    60  too many errors
            scsi-35000c5008491a803    DEGRADED     0     0    53  too many errors  (resilvering)
            replacing-5               DEGRADED     0     0    52
              scsi-35000c50084889cf3  OFFLINE      0     0     0
              lvzfs1-lzfsvol1         ONLINE       0     0     0  (resilvering)
            scsi-35000c50093a8dfe7    DEGRADED     0     0    52  too many errors
          lvzfs3-lvzfsvol3            AVAIL

errors: Permanent errors have been detected in the following files:

root@merlin ~$
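One thing the output above shows is that the CKSUM column is nearly identical across otherwise unrelated disks. As a rough heuristic (not a diagnosis), similar error counts on many drives at once tend to point at something they share, such as the HBA, a cable, the backplane, or the power supply, rather than at each disk individually. The following is a minimal Python sketch that tabulates this from the leaf-device rows pasted above; the function names `parse_rows` and `cksum_summary` are made up for illustration, and the parser assumes plain integer counters (it does not handle abbreviated values such as `1.2K`).

```python
import re

# Leaf-device rows copied from the `zpool status` output above.
STATUS = """\
scsi-35000c50093a9052f    DEGRADED     0     0    52  too many errors
scsi-35000c50084818db7    OFFLINE      0     0     0
lvzfs2-lvzfsvol2          ONLINE       0     0     0  (resilvering)
scsi-35000c50093a9182b    DEGRADED   235   636    52  too many errors
scsi-350000c0f01e5dabc    DEGRADED     0     0    60  too many errors
scsi-35000c5008491a803    DEGRADED     0     0    53  too many errors  (resilvering)
scsi-35000c50084889cf3    OFFLINE      0     0     0
lvzfs1-lzfsvol1           ONLINE       0     0     0  (resilvering)
scsi-35000c50093a8dfe7    DEGRADED     0     0    52  too many errors
"""

# name, state, then the READ / WRITE / CKSUM counters as plain integers.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_rows(text):
    """Return a list of (name, state, read, write, cksum) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, state, rd, wr, ck = match.groups()
            rows.append((name, state, int(rd), int(wr), int(ck)))
    return rows


def cksum_summary(rows):
    """Map device name -> CKSUM count for devices with checksum errors."""
    return {name: ck for name, _state, _rd, _wr, ck in rows if ck > 0}


if __name__ == "__main__":
    hits = cksum_summary(parse_rows(STATUS))
    for name, ck in hits.items():
        print(f"{name}: {ck} checksum errors")
    print(f"{len(hits)} devices affected, "
          f"counts range {min(hits.values())}-{max(hits.values())}")
```

Only one device in this sample also reports READ and WRITE errors, which may make it worth a separate look (for example with SMART data) independently of the shared-component question.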

This resilvering has been going on for the last month or two, in one way or another. Things actually looked good for short periods, and then the next drive failed, or a previously failed drive (wiped with dd from /dev/zero) failed again.

It's literally driving me nuts, and scaring me at the same time, since this data is most important. It's family photos back to 1970's and before, etc...

Help please?

EDIT: I added a comment about what I'm actually using the drives for, as I was also concerned that the HardHouse and Tidy Tracks, played through a few subwoofers, are rocking the drives apart. I will consider relocating the server out of the office and into the garage. I've also managed to create a new ZFS pool using the SATA ports and the old 2TB drives, with no issues yet. I'm still in the middle of resilver hell, even though I have tuned it and even moved a few datasets off to the other pool.

root@merlin ~$ zpool status
  pool: bak
 state: ONLINE
  scan: none requested

        NAME                                          STATE     READ WRITE CKSUM
        bak                                           ONLINE       0     0     0
          ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332    ONLINE       0     0     0
          ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P0891973  ONLINE       0     0     0

errors: No known data errors

  pool: nas
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec  6 11:08:12 2021
        7.84T scanned at 37.5M/s, 7.84T issued at 37.5M/s, 9.78T total
        3.39T resilvered, 80.16% done, 0 days 15:03:25 to go


        NAME                          STATE     READ WRITE CKSUM
        nas                           DEGRADED     0     0     0
          raidz2-0                    DEGRADED     0     0     0
            scsi-35000c50093a9052f    DEGRADED     0     0     0  too many errors
            replacing-1               ONLINE       0     0     0
              scsi-35000c50084818db7  ONLINE       0     0     0  (resilvering)
              lvzfs2-lvzfsvol2        ONLINE       0     0     0  (resilvering)
            replacing-2               DEGRADED     0     0     0
              17084797086424522076    UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-35000c50093a9182b-part1
              scsi-350000c0f012efb7c  ONLINE       0     0     0  (resilvering)
            scsi-350000c0f01e5dabc    DEGRADED     0     0     0  too many errors  (resilvering)
            scsi-35000c5008491a803    DEGRADED     0     0     0  too many errors
            replacing-5               DEGRADED     0     0     0
              scsi-35000c50084889cf3  DEGRADED     0     0     0  too many errors  (resilvering)
              lvzfs1-lzfsvol1         DEGRADED     0     0     0  too many errors  (resilvering)
            scsi-35000c50093a8dfe7    DEGRADED     0     0     0  too many errors

errors: 2 data errors, use '-v' for a list
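As a sanity check on the "to go" figure in the `scan:` lines above, the remaining time can be estimated as (total minus issued) divided by the issue rate. The snippet below is a small Python sketch of that arithmetic using the numbers from this second status output; it assumes that `T` and `M` in the output are binary units (powers of 1024), and the helper names `eta_seconds` and `fmt_eta` are made up for illustration.

```python
MIB_PER_TIB = 1024 * 1024  # assumption: zpool reports binary units


def eta_seconds(total_tib, issued_tib, rate_mib_s):
    """Estimate the seconds left: remaining data divided by the issue rate."""
    remaining_mib = (total_tib - issued_tib) * MIB_PER_TIB
    return remaining_mib / rate_mib_s


def fmt_eta(seconds):
    """Format seconds in the 'D days HH:MM:SS' style that zpool status uses."""
    seconds = int(seconds)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days} days {hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Figures from the status output: 9.78T total, 7.84T issued at 37.5M/s.
    print(fmt_eta(eta_seconds(9.78, 7.84, 37.5)))
```

With these inputs the estimate lands at roughly 15 hours, in line with the figure reported by `zpool status`. The estimate only holds while the issue rate stays constant, which it has not done during this resilver.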

FYI, there were checksum errors, but those cleared after reboot.

Errors are so minimal that I'm seriously impressed by ZFS's robustness, which is why I continue to use it for my main backups. Better than backing up to a single disk.

Another mistake I made: I have 5 new drives that have been sitting here for almost a week, but I can't use them until the LVM volumes finish resilvering. I'm so close to data errors that I want that to finish first. I'm kicking myself for running that replace operation haphazardly instead of waiting for the drives to arrive in the mail, not realizing it would take weeks to finish all these resilver ops. Crazy madness!

I had done some math for Glacier storage ($0.004/GB): that would cost $20 a month for 5TB, which is not an option. And if I ever needed that data back, forget it, given the egress costs. Also, I really enjoy using ZFS on a home server for this. The previous array's drives were 8 years old when taken down, I only used known-bad drives in the entire array, and I still managed to z2 my way to success. I figured a fresh set of refurbished or renewed drives would solve this issue. Sorry, I guess I'm venting. I would like to hear more about what I may be doing wrong, though...
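The Glacier figure above works out as follows. This is a minimal Python sketch of the arithmetic only, assuming decimal terabytes (1 TB = 1000 GB) and using the per-GB price quoted in the post; egress charges are left out because no price for them is given here, and `monthly_cost` is a name made up for illustration.

```python
PRICE_PER_GB_MONTH = 0.004  # USD per GB per month, the figure quoted in the post
GB_PER_TB = 1000            # assumption: decimal terabytes


def monthly_cost(size_tb, price_per_gb=PRICE_PER_GB_MONTH):
    """Storage-only cost per month in USD for size_tb terabytes."""
    return size_tb * GB_PER_TB * price_per_gb


if __name__ == "__main__":
    per_month = monthly_cost(5)
    print(f"5 TB: ${per_month:.2f}/month, ${per_month * 12:.2f}/year")
```

So 5 TB at that rate is $20 a month, or $240 a year, before any retrieval or egress costs.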

By the way, I did get the company (deepdiscountserver in this case) to pay for the drive replacements, which are different models. No more IBM refurbs for me. I'm going to test out the HGST drives, since those have worked well for me in the past.

Rest assured, a new SAS card will be on the way if those show issues too, once this resilver madness ends, if it ever does. If not, I'll have to do another full backup, which takes almost as long as the resilver. At least I already moved the live data off the array, so no loss will occur unless I lose my main drives on another system during all this. I guess I can say the data is majorly important, but I still have a copy of it, so I can stand a loss for now. To clarify the "important" part: it will become URGENT if the ZFS array does start spouting errors, because I'll only have one drive here and there that contains the master copies.

cn flag

I am in need of testing my Dell SAS Controller card.

Simple. Replace with another one. Then you know whether or not the card has problems.

None available? Can we get back to "professionalism" and "best practices" in the site rules? Ask a company to do it (and pay). Replacement testing is pretty much the only (and definitely the most efficient) way to make sure it is not a part malfunctioning.

since this data is most important. It's family photos back to 1970's and before, etc...

Besides this being off topic here... it is NOT IMPORTANT AT ALL TO YOU. I go by "put your money where your mouth is". If this WERE important to you, it would be backed up. I mean, I learned in school - more than 30 years ago - that backups are a thing and a must. So, do not come with "important" when in the end you refuse to do what people do with important data. Start implementing a backup - there are plenty of quite low cost services around.

Brian Thomas
by flag
thanks, replacing is not entirely helpful, I already considered that. My wife is already riding my butt about Christmas, and we're talking data loss here. Sometimes you just can't afford things, if you know what I mean. By the way, regarding your backups point: this is a backup!! I'm playing it early; I don't want to lose my main data (on a single drive) at the same time I'm fixing the array. It always amazes me when someone says to throw money at it. Already considered, believe me. I appreciate the advice that there is no way other than to replace it, but really? I'm going to have to splurge... ouch...
cn flag
"thanks, replacing is not entirely helpful" - yeah, ok. So, in your world, getting schematics and an electronics lab, testing every single solder joint AND running logic test equipment on all the chips would be helpful? The professional way (required per site rules, whether it helps you or not) is the efficient way: replace, check, then you KNOW where to look. And it is a LOT cheaper than doing a real test for days in a lab, just to then realize a cable is crap. Or your power supply. This is not even throwing money at it - if it is important, have a replacement ready.
cn flag
If that is your backup, get a reality check and start using backup services. AWS, Azure, and Backblaze have quite good services that are WAY more guaranteed than a low-end SAS-based RAID.
Brian Thomas
by flag
Pay a monthly subscription instead? Psshhh. Share it with big data? Psshhh. I'm in the SRE field already. I introduce to you: ZFS on the home server as a backup... I'm pretty convinced it's the card now, since pretty much all drives are now showing bad. I'm going to reseat it. I wasn't talking about lab testing either; I was asking if there are other software-based ways to test it that I may not be aware of.

