Score:0

smartctl "Elements in grown defect list" vs. RAID controller "Media error count"


I am using a hardware RAID 50 with a PERC810 controller in my server and recently came across a metric I am not sure about. Until now, I have been using the smartctl metric "Elements in grown defect list" as a hint that a drive is failing and should be removed, but perccli (or storcli/megacli) also reports a metric called "Media Error Count" for the drive. From what I've read about these metrics, they are basically the same thing: both show reallocated sectors or physical defects on a disk. Yet some of my HDDs show a number larger than zero for Elements in grown defect list but zero for Media Error Count, and vice versa. For example, this disk:

perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s7 :
================

---------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
---------------------------------------------
37:7     72 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
---------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s7 - Detailed Information :
=======================================

Drive /c0/e37/s7 State :
======================
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature =  41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s7 Device attributes :
==================================
SN = WMC1F0D41KD5
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01

This shows Media Error Count = 38, but when I use smartctl for the same disk:

smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01f55dd0
Serial number:        WMC1F0D41KD5
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:14:51 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    2538437     9298     76289   2547735       9392     215124.761          94
write:   5550372  5405661   5407707  10956033    5405661     571404.363           0
verify:      184        0         0       184          0        352.277           0

Non-medium error count:   202249

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

It shows Elements in grown defect list: 0

Here is another example on the same server, just different hdd:

perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s4 :
================

---------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
---------------------------------------------
37:4     63 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
---------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s4 - Detailed Information :
=======================================

Drive /c0/e37/s4 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature =  35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s4 Device attributes :
==================================
SN = WMC1F0D222KF
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01 


Drive /c0/e37/s4 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address        
-----------------------------------------
   0 Active 6.0Gb/s   0x50000c0f01352c36 
   1 Active Unknown   0x0                
-----------------------------------------


Inquiry Data = 
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20 
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33 
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32 
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0 
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Here Media Error Count = 0, but smartctl:

smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01352c34
Serial number:        WMC1F0D222KF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:39:52 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 44

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4899063        1         1   4899064          1     215489.217           0
write:   6593514      494       496   6594008        499     571584.348           0
verify:      345        0         0       345          0        349.197           0

Non-medium error count:   202287

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

It shows Elements in grown defect list: 44

Can you please explain the difference between these two metrics and which one to go by in determining a faulty drive? Thank you.

Score:0

The discrepancy arises because the two metrics, while measuring similar things, operate at different layers.

Media Error Count measures the media errors as seen by the RAID card.

Elements in grown defect list shows the size of the grown defect list, i.e. the number of remapped sectors as seen by the drive itself.

There are various reasons why the two values do not match:

  • a RAID array can be created after a disk has accumulated many defects in another array or as a standalone disk;
  • a disk background surface-scan test can detect and remap any number of sectors without the upper layer (i.e. the RAID card) noticing;
  • an incoming write to a defective sector is remapped "on the fly" by the disk itself, without intervention from the RAID card;
  • a RAID patrol scan can stumble on an unreadable sector (notice how many total uncorrected errors you have on the first disk), and a rewrite of the same sector then succeeds, so the RAID array records a media error but the disk does not remap the sector (I would consider disks that do not remap such sectors flawed, but I have seen them in the wild).
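
Since the two counters live at different layers, it can help to look at them side by side for each drive, together with the "total uncorrected errors" column of the read row. The following is only a minimal sketch: it assumes you have already captured the text of `perccli /cX/eY/sZ show all` and `smartctl -a -d megaraid,N ...`, and the function names and hint strings are made up for illustration, not part of either tool. The sample strings are excerpts of the output posted in the question.

```python
import re

def parse_media_error_count(perccli_text):
    """Return 'Media Error Count' from perccli 'show all' text, or None if absent."""
    m = re.search(r"^Media Error Count\s*=\s*(\d+)", perccli_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def parse_smartctl(smartctl_text):
    """Return the grown defect list size and the read row's uncorrected errors."""
    grown = re.search(r"^Elements in grown defect list:\s*(\d+)",
                      smartctl_text, re.MULTILINE)
    # In the SCSI error counter log, the last column of the "read:" row
    # is "Total uncorrected errors".
    read_row = re.search(r"^read:\s+(.+)$", smartctl_text, re.MULTILINE)
    return {
        "grown_defects": int(grown.group(1)) if grown else None,
        "read_uncorrected": int(read_row.group(1).split()[-1]) if read_row else None,
    }

def describe(media_errors, smart):
    """Hypothetical helper: turn the two counters into a rough hint."""
    grown = smart["grown_defects"]
    if media_errors and not grown:
        return "controller saw media errors, drive remapped nothing"
    if grown and not media_errors:
        return "drive remapped sectors the controller never saw fail"
    return "both counters are zero or both are non-zero"

# Excerpts from the two drives in the question (slot 7 and slot 4).
S7_PERCCLI = "Media Error Count = 38\nOther Error Count = 118063\n"
S7_SMARTCTL = (
    "Elements in grown defect list: 0\n"
    "read:    2538437     9298     76289   2547735       9392     215124.761          94\n"
)
S4_PERCCLI = "Media Error Count = 0\nOther Error Count = 118060\n"
S4_SMARTCTL = (
    "Elements in grown defect list: 44\n"
    "read:    4899063        1         1   4899064          1     215489.217           0\n"
)

for slot, p_text, s_text in (("s7", S7_PERCCLI, S7_SMARTCTL),
                             ("s4", S4_PERCCLI, S4_SMARTCTL)):
    media = parse_media_error_count(p_text)
    smart = parse_smartctl(s_text)
    print(slot, media, smart, "->", describe(media, smart))
```

The hint strings are deliberately coarse. They only mirror the cases listed above and are not a verdict on whether a drive should be replaced.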