Score:0

RAID 0 on sixteen 8 TB HDDs writes at only 3-4 MB/s

it flag

I have fairly modern RAID hardware for this:

  1. Controller: Intel RS3SC008
  2. SAS Expander: Intel RES3FV288
  3. HDDs: Seagate ST8000AS0002-1NA17Z

For the moment, I don't have a BBU (it would be the Intel AXXRMFBU4).

The SAS expander is connected to the controller through the G port, as the manual specifies.

All system parts have proper temperature and ventilation (for example, the temperature at the controller ROC is around 43 °C, which is well within the acceptable range).

The controller and the expander are flashed to the latest firmware.

The HDDs are also on the latest firmware.

My problem is that whatever RAID level I configure (I tried 0 and 6) and whatever cache configuration I choose, I get errors under real load:

  1. In some configurations the VD goes offline, reporting that some HDDs went offline.
  2. Assuming that these HDDs might be faulty, I ran another test without them; it still failed.
  3. In the logs I see warnings about temperature sensors that I don't have, and some PHY device reset warnings. There are no real errors until the VD goes offline because one of the HDDs misbehaved and dropped out. I tried excluding these faulty HDDs in subsequent tests. That seemed to help slightly, but in the end I am back where I started.

Having 4 faulty HDDs in a batch of 20 new HDDs seems rather strange to me.

What would you suggest in this situation?

What could be the problem?

HDD incompatibility?

Is there a way to recover from this situation?

Michael Hampton avatar
cz flag
Looks like a controller, cabling or backplane issue. Start moving things around and see where the errors move to.
it flag
@MichaelHampton I don't have a backplane. The HDDs are simply connected to the SAS expander via SFF-8643 to 4x SATA cables. Do you think the SAS cables might be the problem?
Michael Hampton avatar
cz flag
It's entirely possible!
djdomi avatar
za flag
I would narrow it down using 2 drives, then 4, 6, 8; if the speed is the same then it looks really for like
Nikita Kipriyanov avatar
za flag
Also are you sure you supply enough power to the system?
it flag
I've already validated the power with different PSUs, and it is definitely sufficient.
it flag
@MichaelHampton you mean HDD incompatibility? Do you have any idea how to be 100% sure?
it flag
@djdomi I'm going to test it tomorrow, also with other HDDs. I'll post results.
Score:0
cn flag

Use HD-tune on each drive to see if they have SMART problems (reallocated or bad sectors are the priority).
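If the drives are easier to reach from a command line than from HD-tune, the same check can be scripted. The sketch below is only an illustration: it assumes the attribute table comes from smartmontools' `smartctl -A` (a different tool than HD-tune) in its usual ten-column layout, and the function name `flag_smart_problems` is hypothetical. Drives behind a RAID controller may need a device-type option such as `-d megaraid,N` before smartctl can see them.

```python
# Attributes that most directly point at failing media.
CRITICAL = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def flag_smart_problems(smartctl_output):
    """Return the critical SMART attributes whose raw value is non-zero.

    `smartctl_output` is assumed to be the text printed by `smartctl -A`.
    """
    problems = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        if name in CRITICAL and raw.isdigit() and int(raw) > 0:
            problems[name] = int(raw)
    return problems
```

The text could be captured with something like `subprocess.run(["smartctl", "-A", "/dev/sda"], capture_output=True, text=True).stdout`; an empty result means none of the three attributes is raised, not that the drive is healthy in every respect.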

In a more practical test-like approach:

Test in sets of 4 drives: build RAID 0 arrays of 4 disks each.

Then do copies from one set to the others.

This way you can relatively quickly identify which ones have a problem.

Note: RAID 0'ing that many Seagates is suicide waiting to happen.

The 4-disk arrays that test good can be merged back into a single array if needed (or wait until the end of testing so you can actually use all the good drives).

For the ones not working well, swap some of the drives between them or split them into arrays of 2 disks so you can narrow things down further. Try to identify whether bad cables are at fault by swapping cables from a good 2-disk set to a bad 2-disk set.
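The split-and-retest procedure above can be planned on paper before touching the hardware. This is a minimal sketch with hypothetical names (`make_sets`, `next_round`) and made-up slot labels; it only does the bookkeeping, and the pass/fail result of each set still has to come from a real test on the controller.

```python
def make_sets(drives, size):
    """Split a list of drive labels into consecutive sets of `size`."""
    return [drives[i:i + size] for i in range(0, len(drives), size)]


def next_round(sets, failed_indexes):
    """Halve every set that failed; keep passing sets unchanged."""
    result = []
    for index, drive_set in enumerate(sets):
        if index in failed_indexes and len(drive_set) > 1:
            half = len(drive_set) // 2
            result.extend([drive_set[:half], drive_set[half:]])
        else:
            result.append(drive_set)
    return result


# Example: 16 drives in sets of 4; suppose the second set (index 1) failed.
drives = [f"slot{n}" for n in range(16)]
round_one = make_sets(drives, 4)
round_two = next_round(round_one, {1})
```

Here `round_two` keeps the three passing 4-disk sets and replaces the failing one with two 2-disk sets, which is the point where swapping cables between a good pair and a bad pair becomes informative.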

Also, note that the error identifies the port at fault, so you could start by eliminating the ports flagged in the errors.

"Command timeout" error may imply an inaccessible HDD.

it flag
Thank you for the tips. Of course, I'm not going to use RAID 0 on so many drives. It was only for testing purposes. Initially, I wanted to test full load with all drives. I will run the next tests tomorrow. For now, even the faulty HDDs excluded from the VD do not show any problems in SMART. Can I assume that 4 HDDs working properly in RAID (whatever level) confirms that these HDDs are compatible with this controller?
Score:0
it flag

Final conclusion, unfortunately not a solution.

After several series of tests, I can confirm that the drives mentioned earlier:

  1. HDDs: Seagate ST8000AS0002-1NA17Z
  2. SSDs: Crucial CT1000BX500SSD1

are completely incompatible with RAID configurations and perform very poorly.

As a side note, I find it very strange that both showed the same performance drop after a few seconds of heavy load. I suppose it is due to similarly basic, slow, low-level components.
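For anyone who wants to check whether their own drives show the same drop after a few seconds, recording throughput per time window is more telling than a single average. The following is a rough sketch, not the test I ran: the function name `write_throughput` and its parameters are hypothetical, and the target path should point at a file on the array under test, with a total size well above any controller or drive cache.

```python
import os
import time


def write_throughput(path, total_mb=1024, chunk_mb=8, window_s=1.0):
    """Write `total_mb` sequentially to `path`; return MB/s per time window."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    rates = []
    written_in_window = 0
    window_start = time.monotonic()
    with open(path, "wb") as handle:
        for _ in range(total_mb // chunk_mb):
            handle.write(chunk)
            handle.flush()
            # Push the data to the device instead of leaving it in the page cache.
            os.fsync(handle.fileno())
            written_in_window += chunk_mb
            elapsed = time.monotonic() - window_start
            if elapsed >= window_s and elapsed > 0:
                rates.append(written_in_window / elapsed)
                written_in_window = 0
                window_start = time.monotonic()
    # Account for a final, partial window.
    elapsed = time.monotonic() - window_start
    if written_in_window and elapsed > 0:
        rates.append(written_in_window / elapsed)
    return rates
```

A run whose first few windows are fast and whose later windows collapse to single-digit MB/s would match the behaviour described above; a flat series would point elsewhere.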

I've lost a lot of time on this issue, so maybe this post will help someone.
