Score:3

Adding a disk to an existing RAID fails with "disk doesn't have enough capacity"

cn flag

I have a very old physical server with many disks in a number of RAID6 groups. The disks are not in the server's own chassis; they sit in a JBOD enclosure attached to the RAID controller in this server. In one of the RAID groups, a disk failed. After replacing the failed disk, below is the storcli output:

$ storcli /c0 show
…
---------------------------------------------
DG Arr Row EID:Slot DID Type  State  BT      Size PDC  PI SED DS3  FSpace 
---------------------------------------------
 1 -   -   -        -   RAID6 Pdgd   N  27.285 TB dflt N  N   none N      
 1 0   -   -        -   RAID6 Dgrd   N  27.285 TB dflt N  N   none N      
 1 0   0   34:0     48  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   1   34:1     49  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   2   34:2     50  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   3   34:3     51  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   4   -        -   DRIVE Msng   -   2.728 TB -    -  -   -    -      
 1 0   5   34:5     53  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   6   34:6     55  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   7   34:7     54  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   8   34:8     56  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   9   34:9     57  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   10  34:10    58  DRIVE Onln   N   2.728 TB dflt N  N   none -      
 1 0   11  34:11    59  DRIVE Onln   N   2.728 TB dflt N  N   none -  

From the above, we can see the drive at slot 34:4 is missing, which is confirmed by the following command:

$ MegaCli -PdGetMissing -a0
                                     
    Adapter 0 - Missing Physical drives

    No.   Array   Row   Size Expected
    0     1       4     2861056 MB

When I tried to manually add this disk to its disk group, it threw the following error:

$ MegaCli -PdReplaceMissing -PhysDrv [34:4] -Array1 -Row4 -a0
                                     
Adapter: 0: Failed to replace Missing PD at Array 1, Row 4.

FW error description: 
 The specified physical disk doesn't have enough capacity to complete the requested command.  

Exit Code: 0x0d

As per this old MegaCLI guide, the above exit code (0x0d) means "Drive is too small for requested operation".

If I compare the storcli /c0/e34/s3,4 show all output, the replacement disk has a capacity and sector size identical to a working member:

Drive /c0/e34/s3 Device attributes :                   <<<< working disk
==================================
WWN = 5000c50090fb1146
Firmware Revision = SN06
Raw size = 2.728 TB [0x15d50a3b0 Sectors]
Coerced size = 2.728 TB [0x15d400000 Sectors]
Non Coerced size = 2.728 TB [0x15d40a3b0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Logical Sector Size = 512B


Drive /c0/e34/s4 Device attributes :                  <<<< new disk to add
==================================
WWN = 5000c50074a7f9cf
Firmware Revision = SN04
Raw size = 2.728 TB [0x15d50a3b0 Sectors]
Coerced size = 2.728 TB [0x15d400000 Sectors]
Non Coerced size = 2.728 TB [0x15d40a3b0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Logical Sector Size = 512B
Physical Sector Size = 512B
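
As a sanity check, the "Size Expected" value from MegaCli lines up exactly with the coerced size of the member drives, using the 512 B logical sector size shown above (quick shell arithmetic to illustrate):

$ printf '%d MB\n' $(( 0x15d400000 * 512 / 1024 / 1024 ))
2861056 MB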

Any idea why the controller is complaining about capacity when the two disks appear to be identical?


EDIT (based on suggestions from @djdomi and @U880D)

To see the firmware revision of all the disks, I filtered the output of MegaCli -PDList -a0 as below:

$ MegaCli -PDList -a0 | awk '/^Enclosure Device ID/ {printf "%d", $4; next} /^Slot Number:/ {printf ":%d\t", $3; next} /Device Firmware Level/ {print $4}'
34:0    SN06
34:1    TN02
34:2    TN02
34:3    SN06
34:4    SN04
34:5    TN02
34:6    TN02
34:7    TN02
34:8    TN02
34:9    TN02
34:10   TN02
34:11   SN04

There is a mix of firmware revisions, but disk 34:11 runs the same SN04 firmware as 34:4 and is a healthy member. Why would it only affect disk 34:4?

djdomi avatar
za flag
Sometimes a firmware update can solve an issue, but unrelated or unsupported hardware would make this off-topic.
U880D avatar
ca flag
If the raw size, sector count, etc. are identical, then the older firmware revision (SN04 < SN06) might indeed be the reason.
Heelara avatar
cn flag
Thanks @djdomi and U880D for looking into this and providing this suggestion. I have updated the question with the firmware details of all the disks.
Score:1
cn flag

Make Firmware Versions Match

Ideally, the drives should run a version listed as certified compatible in the controller's compatibility list, but the first thing you want to do is update the firmware so that it is identical across all drives.

Grab the file that's appropriate for the drive/serial and use storcli to download the firmware to each drive.

storcli64 /c0/e34/s3 download src=TN02.LOB
#                                 ^-- your firmware file here
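
Afterwards you can confirm the revision each drive is actually running; the "Firmware Revision" line in the show all output is the one to grep for:

storcli64 /c0/e34/s3 show all | grep 'Firmware Revision'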

Matching the firmware may solve this issue, and from my own humble experience it tends to cure a lot of other weirdness as well. In a previous life I maintained a ridiculous number of servers for a telecom datacenter (back when SANs were exotic and blade servers were only just becoming a thing). We rarely turned up storage on mismatched firmware levels, but whenever "some server's drive array is flaky" came up with no other obvious cause and we noticed the firmware wasn't identical on all drives, bringing the firmware into line, setting the drive back to "Good", and rebuilding the array resolved the flakiness often enough that it became our standard response.

Personally speaking, I once had two drives on older but "certified compatible" firmware and three on a single later version. The two would drop off every few months under heavy load (one after the other), then come back, rebuild, and survive another month. I couldn't find a firmware file matching any of the existing drives' versions, so I updated them all to the latest (not listed in the compatibility document) and the array was reliable until it was replaced.

When Firmware / Drives are All Identical

If you're lucky enough to rule out the above but unlucky enough to still get exit code 0x0d (error code 13) when trying to add a disk, I discovered another setting that was causing that problem for me.

Check the Coercion mode:

storcli64 /c0 show coercion

If you have something other than:

Controller Properties :
=====================

-----------------------
Ctrl_Prop     Value
-----------------------
Coercion Mode Disabled
-----------------------

You may find that setting it to "Disabled" does the job. This command will handle that:

storcli64 /c0 set coercion=0

NOTE: I have no idea whether this will result in data loss, as I'm working with a volume that holds no data I intend to access.
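
If coercion turns out to be the blocker, re-running the original replace-missing command and then kicking off the rebuild should go through (illustrative only, reusing the enclosure/slot, array and row numbers from the question):

MegaCli -PdReplaceMissing -PhysDrv [34:4] -Array1 -Row4 -a0
MegaCli -PDRbld -Start -PhysDrv [34:4] -a0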

Small Tip

If you believe everything is identical and want to confirm what the controller sees, running the following command (zsh) might lead you to differences that identify the culprit:

diff --color=always -y =(storcli64 /c0/e34/s1 show all) =(storcli64 /c0/e34/s6 show all)
           # of drive working in array  ----^  # of drive refusing to add ---^
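
If your shell is bash rather than zsh, <( ) process substitution gives the same side-by-side comparison:

diff --color=always -y <(storcli64 /c0/e34/s1 show all) <(storcli64 /c0/e34/s6 show all)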
Heelara avatar
cn flag
Thank you for your suggestions; good to know. In my case, I ended up upgrading the server OS itself, and after the upgrade the controller was able to add the replacement disk to the RAID group successfully without any change to its firmware.