Score:2

NVMe faulty drive? SMART Error Information Log Entries rapidly increasing

Running Ubuntu 22.04 LTS.

The Error Information Log Entries value shown by smartctl -a /dev/nvme0n1 for my NVMe drive is growing fast, by about 1 per second. Does this indicate a faulty drive?

At the same time, Media and Data Integrity Errors is currently showing a value of 0.

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SKC3000D4096G
Serial Number:                      xxxxx
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282b2ba6c5
Local Time is:                      Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    213,006,510 [109 TB]
Data Units Written:                 549,370,112 [281 TB]
Host Read Commands:                 11,210,192,197
Host Write Commands:                20,687,602,229
Controller Busy Time:               14,055
Power Cycles:                       39
Power On Hours:                     4,204
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               75 Celsius
Thermal Temp. 1 Total Time:         58745

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1479242     0  0x2015  0x4004 0x102c            0     0     -
  1    1479241     0  0x2014  0x4004 0x102c            0     0     -
  2    1479240     0  0xd010  0x4004 0x102c            0     0     -
  3    1479239     0  0xc013  0x4004 0x102c            0     0     -
  4    1479238     0  0xb011  0x4004 0x102c            0     0     -
  5    1479237     0  0x8009  0x4004 0x102c            0     0     -
  6    1479236     0  0x0015  0x4004 0x102c            0     0     -
  7    1479235     0  0x0014  0x4004 0x102c            0     0     -
  8    1479234     0  0xa011  0x4004 0x102c            0     0     -
  9    1479233     0  0xa010  0x4004 0x102c            0     0     -
 10    1479232     0  0x9012  0x4004 0x102c            0     0     -
 11    1479231     0  0x9011  0x4004 0x102c            0     0     -
 12    1479230     0  0x6000  0x4004 0x102c            0     0     -
 13    1479229     0  0x5003  0x4004 0x102c            0     0     -
 14    1479228     0  0x4001  0x4004 0x102c            0     0     -
 15    1479227     0  0x4000  0x4004 0x102c            0     0     -
... (47 entries not read)

I uploaded the output of nvme error-log /dev/nvme0n1 too: https://pastebin.com/SQJM7KhV

Gotenks: Ubuntu 22.04 LTS
David: Clearly the drive is dead or dying. Your question shows Error Information Log Entries: 1,479,242.
Gotenks: In my case, it was caused by Node Exporter (Prometheus). After I stopped the process, `Error Information Log Entries` stopped increasing. It is probably issuing queries that the NVMe driver does not support (I will have to dig deeper).
Gotenks: It is not dying \(^O^)/
Score:3

The drive is OK. The counter that really matters is Media and Data Integrity Errors. On my system, the Error Information Log Entries counter of the NVMe drive increases by one at every startup or reboot. The cause is queries that the drive does not support, and I have seen the same problem in many similar cases.
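To compare the two counters without reading the whole report, they can be pulled out of the smartctl text output. The following is only a minimal sketch: the function name `parse_nvme_counters` is made up for illustration, and it assumes the counter lines look exactly like the ones in the question.

```python
import re

# Counter names as they appear in the `smartctl -a` output quoted in the question.
COUNTERS = (
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_nvme_counters(smartctl_text: str) -> dict[str, int]:
    """Extract the selected SMART counters from `smartctl -a` text output.

    Hypothetical helper: counters that are not found are simply left out.
    """
    values = {}
    for name in COUNTERS:
        # Match e.g. "Error Information Log Entries:      1,479,242"
        match = re.search(
            rf"^{re.escape(name)}:\s*([\d,]+)", smartctl_text, re.MULTILINE
        )
        if match:
            # Strip the thousands separators before converting to int.
            values[name] = int(match.group(1).replace(",", ""))
    return values


# Two lines copied from the SMART data section of the question.
sample = """\
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
"""
counters = parse_nvme_counters(sample)
```

With the values from the question, the media error counter stays at 0 while only the error log counter is large, which matches the explanation above.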

Back on November 12, 2022, I posted a similar question here.

However, the reading Temperature Sensor 2: 75 Celsius is too high: it is only nine degrees below the warning threshold of 84 Celsius. Consider improving your system's ventilation.

Gotenks: In my case, it was caused by Node Exporter (Prometheus), specifically by its hwmon collector. Running node_exporter --no-collector.hwmon stopped the counter from increasing.
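One way to check whether a change such as disabling a collector really stops the growth is to take two readings of Error Information Log Entries some time apart and compute the rate. This is a rough sketch only: `entries_per_second` is a hypothetical helper, and all sample readings except the final count of 1,479,242 are invented for illustration.

```python
def entries_per_second(first, second):
    """Return counter growth per second between two (unix_time, count) samples."""
    (t0, c0), (t1, c1) = first, second
    if t1 <= t0:
        raise ValueError("second sample must be taken after the first")
    return (c1 - c0) / (t1 - t0)


# Hypothetical readings taken 60 seconds apart while the collector is running;
# they reproduce the roughly 1-per-second growth described in the question.
rate_running = entries_per_second((0, 1_479_182), (60, 1_479_242))

# Hypothetical readings taken 60 seconds apart after the collector is disabled.
rate_stopped = entries_per_second((0, 1_479_242), (60, 1_479_242))
```

A rate that drops to zero after the change points to the monitoring process rather than the drive as the source of the entries.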
Gotenks: As for the temperature, I think it is not reported correctly. The SMART data has two entries: Temperature: 56 Celsius and Temperature Sensor 2: 75 Celsius.