I am using hardware RAID 50 with a PERC810 controller in my server and recently encountered a metric I am not sure about. Until now, I have used the smartctl metric "Elements in grown defect list" as a hint that a drive is failing and should be removed, but perccli (or storcli/megacli) also reports a metric called "Media Error Count" for each drive.
From what I've read about these metrics, they are basically the same thing: both show reallocated sectors or physical defects on a disk.
However, some of my HDDs show a value greater than zero for "Elements in grown defect list" but zero for "Media Error Count", and vice versa.
For example, this disk:
perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive /c0/e37/s7 :
================
---------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
---------------------------------------------
37:7 72 Onln 1 3.637 TB SAS HDD N N 512B WD4001FYYG-01SL3 U -
---------------------------------------------
EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
Drive /c0/e37/s7 - Detailed Information :
=======================================
Drive /c0/e37/s7 State :
======================
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature = 41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
Drive /c0/e37/s7 Device attributes :
==================================
SN = WMC1F0D41KD5
Manufacturer Id = WD
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01
This shows Media Error Count = 38, but when I use smartctl for the same disk:
smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VR08
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01f55dd0
Serial number: WMC1F0D41KD5
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri Jan 28 14:14:51 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 41 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 18
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 118
Elements in grown defect list: 0
Error counter log:
            Errors Corrected by          Total   Correction     Gigabytes        Total
              ECC          rereads/     errors    algorithm     processed  uncorrected
           fast | delayed  rewrites  corrected  invocations  [10^9 bytes]       errors
read:    2538437     9298     76289    2547735         9392    215124.761           94
write:   5550372  5405661   5407707   10956033      5405661    571404.363            0
verify:      184        0         0        184            0       352.277            0
Non-medium error count: 202249
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Completed - 11 - [- - -]
Long (extended) Self-test duration: 31120 seconds [518.7 minutes]
This shows Elements in grown defect list: 0.
Here is another example from the same server, just a different HDD:
perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive /c0/e37/s4 :
================
---------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
---------------------------------------------
37:4 63 Onln 1 3.637 TB SAS HDD N N 512B WD4001FYYG-01SL3 U -
---------------------------------------------
EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
Drive /c0/e37/s4 - Detailed Information :
=======================================
Drive /c0/e37/s4 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature = 35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
Drive /c0/e37/s4 Device attributes :
==================================
SN = WMC1F0D222KF
Manufacturer Id = WD
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01
Drive /c0/e37/s4 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0)
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No
Port Information :
================
-----------------------------------------
Port Status Linkspeed SAS address
-----------------------------------------
0 Active 6.0Gb/s 0x50000c0f01352c36
1 Active Unknown 0x0
-----------------------------------------
Inquiry Data =
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Here Media Error Count = 0, but smartctl for the same disk:
smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VR08
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01352c34
Serial number: WMC1F0D222KF
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri Jan 28 14:39:52 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 35 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 18
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 118
Elements in grown defect list: 44
Error counter log:
            Errors Corrected by          Total   Correction     Gigabytes        Total
              ECC          rereads/     errors    algorithm     processed  uncorrected
           fast | delayed  rewrites  corrected  invocations  [10^9 bytes]       errors
read:    4899063        1         1    4899064            1    215489.217            0
write:   6593514      494       496    6594008          499    571584.348            0
verify:      345        0         0        345            0       349.197            0
Non-medium error count: 202287
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Completed - 11 - [- - -]
Long (extended) Self-test duration: 31120 seconds [518.7 minutes]
This shows Elements in grown defect list: 44.
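For anyone who wants to reproduce the comparison across many drives, here is a minimal sketch of how both counters could be collected per disk. It assumes the field labels look exactly as in the output pasted above and that the controller/enclosure path (/c0/e37) and /dev/sdg match my setup. The helper names (parse_perccli, parse_smartctl, compare) are made up for illustration. It also picks up the last column of the "read:" row (total uncorrected errors), since that value appears in the same smartctl output.

```python
import re
import subprocess

# Patterns match the field labels as they appear in the pasted output.
MEDIA_ERR = re.compile(r"Media Error Count\s*=\s*(\d+)")
GROWN_DEFECTS = re.compile(r"Elements in grown defect list:\s*(\d+)")
# "read:" row of the error counter log; the last column is total uncorrected errors.
READ_ROW = re.compile(r"^read:\s+(.*)$", re.MULTILINE)


def parse_perccli(text):
    """Return Media Error Count from 'perccli ... show all' output, or None."""
    m = MEDIA_ERR.search(text)
    return int(m.group(1)) if m else None


def parse_smartctl(text):
    """Return (grown defects, uncorrected read errors) from 'smartctl -a' output."""
    m = GROWN_DEFECTS.search(text)
    grown = int(m.group(1)) if m else None
    row = READ_ROW.search(text)
    uncorrected = int(row.group(1).split()[-1]) if row else None
    return grown, uncorrected


def run(cmd):
    """Run a command and return stdout. The exit status is not checked,
    because smartctl uses it as a bit mask and is often non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def compare(slot, did, dev="/dev/sdg"):
    """Collect both counters for one drive (assumes controller 0, enclosure 37)."""
    perc = run(["perccli", f"/c0/e37/s{slot}", "show", "all"])
    smart = run(["smartctl", "-a", "-d", f"megaraid,{did}", dev])
    grown, uncorrected = parse_smartctl(smart)
    return {
        "slot": slot,
        "media_error_count": parse_perccli(perc),
        "grown_defects": grown,
        "uncorrected_read_errors": uncorrected,
    }
```

With this, compare(7, 72) and compare(4, 63) would correspond to the two drives shown above. The sketch only lines the numbers up next to each other; it does not say which of them is the right one to act on.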
Can you please explain the difference between these two metrics, and which one I should go by when deciding whether a drive is faulty?
Thank you.