In addition to what has already been said, I want to add my two cents, and this one is an important consideration.
What does a drive do when a sector is slow to read?
Drives designed to operate alone, e.g. typical "desktop" drives, supposedly presume there is no other way to retrieve the data stored in a bad sector. They will try to retrieve the data at all costs, retrying again and again for an extended period of time. Of course, they will also mark that sector as failing and remap it the next time you write to it, but you have to write to it for that to happen. Until you rewrite it, they will choke every time you read from that place. In a RAID setting this means that, as far as the RAID is concerned, the drive still works and there is no reason to kick it out, but for the application the array slows down to a crawl.
On the other hand, "enterprise" drives, especially "branded" ones, often assume they are always used in a RAID setting. A "brand" controller that sees a "branded" drive might even notify the drive's firmware that RAID is present. So the drive will give up early and report an I/O error, even if a few more attempts might have read the sector. The controller then gets the chance to respond faster by redirecting the read to a sibling drive (and kicking the bad one out of the array). When you pull out that kicked drive and test it thoroughly, you find no apparent problems: it was just slow for a moment, and according to the controller's logic that was enough to stop using it.
I speculate this may be the only difference between "desktop" drives and "branded"/"enterprise" NL-SAS and SATA ones. This is probably why you pay three times more for an "HPE" drive that was actually made by Toshiba than for the "Toshiba"-branded one.
However, some drives do support generic control over this. It is called SCT ERC, which stands for SMART Command Transport Error Recovery Control. This is how it looks in smartctl:
Unsupported:
# smartctl --all /dev/sda
=== START OF READ SMART DATA SECTION ===
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
Supported:
=== START OF READ SMART DATA SECTION ===
...
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
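The difference between the two listings is a single capability line, so the check is easy to script. Below is a minimal sketch, assuming the wording "SCT Error Recovery Control supported" shown above is what your smartctl version prints; the function name `sct_erc_supported` is made up for illustration.

```shell
# Reads `smartctl --all` output on stdin.
# Exit status 0 if the capability list mentions SCT ERC, non-zero otherwise.
# Assumption: the capability line is worded as in the listings above.
sct_erc_supported() {
    grep -q 'SCT Error Recovery Control supported'
}

# Hypothetical usage (needs root and a real device):
#   smartctl --all /dev/sda | sct_erc_supported && echo "ERC available"
```

Keeping the function on stdin means it can be tried against saved output without touching a drive.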
If you are lucky, you can control this feature with smartctl. You can retrieve or set two timeouts: how long to keep retrying a read and how long to keep retrying a write:
# smartctl -l scterc /dev/sda
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
# smartctl -l scterc /dev/sde
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
# smartctl -l scterc /dev/sdd
Warning: device does not support SCT Error Recovery Control command
# smartctl -l scterc,120,60 /dev/sde
This means 120 tenths of a second (12 seconds) for read retries and 60 tenths of a second (6 seconds) for write retries. Zero means "retry until you die". Different drives have different default settings for this.
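If you have many drives, it may help to boil the three kinds of answer shown above (a value, "Disabled", or the "does not support" warning) down to one line per drive. This is only a sketch: it assumes the output looks like the samples above, and `scterc_summary` is a hypothetical helper name, not part of smartctl.

```shell
# Reads `smartctl -l scterc` output on stdin and prints one summary line:
#   "read=<v> write=<v>"  where <v> is tenths of a second or "disabled"
#   "unsupported"         if the drive rejects the command or nothing was parsed
scterc_summary() {
    awk '
        /does not support SCT Error Recovery Control/ { unsupported = 1 }
        $1 == "Read:"  { r = tolower($2) }
        $1 == "Write:" { w = tolower($2) }
        END {
            if (unsupported || r == "" || w == "") print "unsupported"
            else print "read=" r " write=" w
        }
    '
}

# Hypothetical usage (needs root; sda, sdd, sde are just the example devices):
#   for d in sda sdd sde; do
#       printf '%s: ' "$d"; smartctl -l scterc "/dev/$d" | scterc_summary
#   done
```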
So, if you use a "RAID edition" drive alone, you had better set the ERC timeouts to zero, or you may lose data. On the other hand, if you use drives in a RAID, you had better set a reasonably low non-zero value.
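That rule of thumb can be written down as a tiny helper that picks the smartctl option for you. Treat it as a sketch under assumptions: the 70 (7.0 seconds) used for RAID members is simply the value from the sample output above, not a recommendation for every array, and `scterc_option_for` is a hypothetical name.

```shell
# Prints the `smartctl -l` argument for a given role.
#   raid       -> short, non-zero limits (70 = 7.0 s, an assumed example value)
#   standalone -> 0,0, i.e. no time limit on error recovery
scterc_option_for() {
    case "$1" in
        raid)       echo "scterc,70,70" ;;
        standalone) echo "scterc,0,0" ;;
        *)          echo "usage: scterc_option_for raid|standalone" >&2
                    return 1 ;;
    esac
}

# Hypothetical usage (needs root; /dev/sdX is a placeholder):
#   smartctl -l "$(scterc_option_for raid)" /dev/sdX
```

On some drives the setting may not survive a power cycle, so you may need to re-apply it at boot.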
Source by amarao @ Habrahabr, in Russian.
P.S. And a note about SAS: use sdparm, which supports more controls of this kind.