Score:14

Why is mdadm unable to deal with an "almost failed" disk?

gb flag

Multiple times in my career now I've come across mdadm RAID sets (RAID 1+0, 5, 6, etc.) in various environments (e.g. CentOS/Debian boxes, Synology/QNAP NASes) which appear to be simply unable to handle a failing disk. By that I mean a disk that is not totally dead, but has tens of thousands of bad sectors and is barely able to handle I/O. It isn't totally dead; it's still kind of working. The kernel log is typically full of UNC errors.

Sometimes SMART will identify the disk as failing; other times there are no symptoms other than slow I/O.

The slow I/O actually causes the entire system to freeze up. Connecting via ssh takes forever, the web GUI (if it is a NAS) usually stops working, and running commands over ssh takes forever as well. That is, until I disconnect the disk or purposely "fail" it out of the array; then things go back to "normal", or at least as normal as they can be with a degraded array.

I'm just wondering: if a disk is taking so long to read from or write to, why not just knock it out of the array, drop a message in the log and keep going? Making the whole system grind to a halt because one disk is kinda screwy totally nullifies one of the main benefits of using RAID (fault tolerance: the ability to keep running when a disk fails). I can understand that in a single-disk scenario (e.g. your system has a single SATA disk connected and it is unable to execute reads/writes properly) this is catastrophic, but in a RAID set (especially the fault-tolerant "personalities") it seems not only annoying but also contrary to common sense.
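
For reference, the manual intervention I describe is roughly the following (the device names /dev/md0 and /dev/sdb1 are just examples):

# mdadm --detail /dev/md0            # confirm which member is misbehaving
# mdadm /dev/md0 --fail /dev/sdb1    # mark it as failed
# mdadm /dev/md0 --remove /dev/sdb1  # take it out of the array

Only after that does the box become responsive again, with the array running degraded.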

Is there a very good reason the default behavior of mdadm is to basically cripple the box until someone remotes in and fixes it manually?

in flag
Part of this is not the fault of `mdadm` but of the Linux kernel. Similar freezes are also known to happen with failed NFS mounts, et cetera. The root cause is that Linux, unlike Windows, is mostly synchronous. With asynchronous I/O, the kernel issues a request and can do other things while the hardware performs the I/O. mdadm should not even be in a position where it can influence SSH, let alone freeze it.
Score:13
in flag

In general, a RAID, depending on the chosen RAID level, provides a different balance among the key goals of data redundancy, availability, performance and capacity.

Based on the specific requirements, it is the responsibility of the storage owner to decide which balance of these factors is the right one for the given purpose, in order to create a reliable solution.

The job of the chosen RAID solution (in this case the software mdadm) is to ensure data protection first and foremost. With that in mind, it becomes clear that it is not the job of the RAID solution to weigh business continuity higher than data integrity.

To put it in other words: the job of mdadm is to avoid a failed RAID. As long as a "weirdly behaving disk" is not completely broken, it still contributes to the array.

So why not just knock a weirdly behaving disk out of the array, drop a message in the log and keep going? Because doing so would increase the risk of losing data.

I mean, you are right: for the given moment, from a business perspective, it seems better just to continue. In reality, however, the message that has been dropped into the log may remain unnoticed, so the degraded state of the RAID goes undetected. Some time later another disk in the same RAID eventually fails, and as a result the data stored on the now-failed RAID is gone.
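
As an aside, the risk of a degraded array going unnoticed can at least be reduced by letting mdadm's own monitor send mail. A minimal sketch, assuming working local mail delivery and using admin@example.com as a placeholder address:

# in /etc/mdadm/mdadm.conf (Debian) or /etc/mdadm.conf (CentOS)
MAILADDR admin@example.com

# or start the monitor by hand
mdadm --monitor --scan --daemonise --mail=admin@example.com

This only improves detection of the degraded state; it does not change the trade-off described above.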


In addition to that: it is hard to define exactly what a "weirdly behaving disk" is. Put the other way around: what is still acceptable operating behavior for a single disk operated within a disk array?

Some of us may answer "the disk shows some errors". Others may answer: "As long as the errors can be corrected, all is fine". Others may answer: "As long as the disk answers all commands within a given time, all is fine". Others say "when the disk temperature differs by more than 5°C from the average temperature of all disks within the same array". Another answer could be "as long as a scrub reveals no errors", or "as long as SMART does not show errors".

This is neither a long nor a complete list.

The point is that the definition of "acceptable behavior of a disk" is a matter of interpretation, and therefore also the responsibility of the storage owner, and not something that mdadm is supposed to decide on its own.

Score:7
ca flag

The key issue is that a failing SATA disk drive can sometimes freeze the entire bus for the duration of its internal error recovery procedure. For this reason, one should use only TLER-enabled drives in RAID arrays (and preferably enterprise-grade models).

SAS drives suffer less from this issue, but are not absolutely free from it either.
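
If a drive's internal error recovery cannot be capped (no TLER/ERC support), a common complementary workaround is to raise the Linux SCSI layer's per-device command timeout, so a long recovery does not immediately escalate into a link reset. A rough sketch, with /dev/sdb as an example device:

# the default is usually 30 seconds
cat /sys/block/sdb/device/timeout
# give the drive more time before the kernel gives up and resets the link
echo 180 > /sys/block/sdb/device/timeout

This does not keep the array responsive during recovery, but it can avoid the reset storms that take the whole bus down.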

Michael Hampton avatar
cz flag
Good point about TLER. It is very easy to forget this.
Score:3
za flag

In addition to what was said, I want to add my two cents, because this is an important consideration.

What does a drive do when a sector is slow to read?

Presumably, drives that are designed to operate alone, e.g. typical "desktop" drives, assume there is no other way to retrieve the data stored in a bad sector. They will try to retrieve the data at all costs, retrying again and again for an extended period of time. Of course, they will also mark that sector as failing, so it will be remapped the next time you write to it, but you actually have to write to it for that to happen. Until you rewrite it, they will choke each time you read from that place. In a RAID setting this means that, as far as the RAID is concerned, the drive still works and there is no reason to kick it out, but for the application the array slows down to a crawl.

On the other hand, "enterprise" drives, especially "branded" ones, often assume they are always used in a RAID setting. A "brand" controller, seeing a "branded" drive, might even notify its firmware that a RAID is present. So the drive will give up early and report an I/O error, even if a few more attempts might have read the sector. The controller then has the chance to reply faster by mirroring the read to a sibling drive (and kicking the bad one out of the array). When you pull out that kicked drive and test it thoroughly, you find no apparent problems: it was just slow for a moment, and according to the controller's logic that was enough to stop using it.

I speculate this may be the only difference between "desktop" drives and "branded"/"enterprise" NL-SAS and SATA ones. This is probably why you pay three times more when you buy an "HPE" drive that was actually made by Toshiba, compared to buying the "Toshiba"-branded one.


However, some drives do support a generic control for this. It is called SCT ERC, which stands for SMART Command Transport Error Recovery Control. This is how it looks in smartctl:

unsupported

# smartctl --all /dev/sda
=== START OF READ SMART DATA SECTION ===
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

supported

=== START OF READ SMART DATA SECTION ===
...
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

If you are lucky, you can control this feature with smartctl. You may retrieve or set two timeouts: how long to keep trying to re-read and how long to keep trying to re-write:

# smartctl -l scterc /dev/sda
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -l scterc /dev/sde
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

# smartctl -l scterc /dev/sdd
Warning: device does not support SCT Error Recovery Control command

# smartctl -l scterc,120,60 /dev/sde

Which means: 120 tenths of a second (12 seconds) to retry reads and 60 tenths of a second (6 seconds) to retry writes. Zero means "retry until you die". Different drives have different default settings for this.

So, if you use a "RAID edition" drive alone, it is better to set the ERC timeouts to zero, or you may lose data. On the other hand, if you use drives in a RAID, it is better to set some reasonably low non-zero value.
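
Also note that on many drives the scterc setting is volatile and resets after a power cycle, so it is usually reapplied at boot. A minimal sketch (the device list and the 7-second values are assumptions, adjust them to your array):

#!/bin/sh
# reapply a 7 second read/write error recovery limit to every array member
for dev in /dev/sd[a-d]; do
    smartctl -q errorsonly -l scterc,70,70 "$dev"
done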

Source by amarao @ Habrahabr, in Russian.

P.S. And a note about SAS: use sdparm, which supports more controls for this.

cn flag
This is the right answer. Don't use desktop drives for RAID (and vice versa).
Nikita Kipriyanov avatar
za flag
Actually the answer states almost exactly the opposite. It says you can use (some) desktop drives for RAID and vice versa, *provided* you do certain configuration. And it also suggests which configuration.
cn flag
You may be able to tweak a desktop drive for a RAID job, but you can't be sure until you have tested it: some makers are known to have cheated with their firmware behaviour in the past. The general advice remains to buy the right drive for the right job. Choose IronWolf and not Barracuda, WD Red and not WD Blue for RAIDs, and all will be fine.
Nikita Kipriyanov avatar
za flag
I've tested some. The author of the article I linked to tested *lots* of them. The problem with drives in a "home-made" RAID is not only the firmware in the drives. For example, remember the video https://www.youtube.com/watch?v=tDacjrSCeq4 where the guy shouted at hard disks and they all started missing tracks; so vibration and housing matter. // The RAID idea stemmed from the desire to build a reliable service upon unreliable, inexpensive parts (that's what the "I" stands for). Hardware marketers want money, so they want to spoil the idea of inexpensiveness, but don't help them. Don't advocate for the devil.
cn flag
I advocate for reliability and professionalism. I set up storage arrays for a living. The fact that HP or Dell are actually gouging their customers is quite orthogonal to this question, frankly. The price difference between Barracuda and IronWolf, or WD Blue and WD Red, is about 10%, which is a pretty reasonable amount to pay for peace of mind without additional work. People aren't even doing backups properly, and you want them to test-drive their disks? Be realistic. If people were ready to do their homework, they wouldn't buy Windows-powered Dell servers.
Score:1
in flag

I've had situations where a disk has failed, but has taken out the controller in some way.

Historically this was possible with PATA, where the master and slave drives were on the same cable, and one drive failing could interfere with access to the other, still-good drive. Removing the bad drive could re-enable access to the remaining drive, or a power-cycle might be needed, but the RAID could come up degraded and recovery could then start.

SATA is less vulnerable to this, but it's still possible for the controller to be affected. In my experience with software RAID, more of the gory innards are exposed that would be hidden by a fancier dedicated RAID controller.

I've not experienced this with SAS or NVMe, but SAS often means hardware RAID controllers that have more disk-handling smarts internally.
