RAID array says "critical medium error" but smartctl says disk is healthy - what to do next?

I have a RAID-1 array of two NVMe SSDs (Samsung 970 EVO 2TB), and read errors are showing up in /var/log/syslog, but smartctl reports that both drives are healthy. I've done a bunch of diagnosis (below) and I'm wondering if there's anything else I can do. Is there actually a problem, and if so, what's the best course of action? (On Kubuntu 18.04.6 LTS.)

Here's the array:

$ cat /proc/mdstat
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[2]
      1919724608 blocks super 1.2 [2/2] [UU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

It appears healthy, according to mdadm:

$ sudo mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Sat Feb 29 12:33:09 2020
        Raid Level : raid1
        Array Size : 1919724608 (1830.79 GiB 1965.80 GB)
     Used Dev Size : 1919724608 (1830.79 GiB 1965.80 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Dec 31 14:04:55 2021
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : kubuntu:1
              UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
            Events : 41087

    Number   Major   Minor   RaidDevice State
       0     259        3        0      active sync   /dev/nvme0n1p3
       2     259        7        1      active sync   /dev/nvme1n1p3

However, some read errors have started appearing in /var/log/syslog, in triples:

Dec 31 12:32:56  kernel: [662973.969218] blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Dec 31 12:32:56  kernel: [662973.969222] md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
Dec 31 12:32:56  kernel: [662973.978792] md/raid1:md1: redirecting sector 2702369024 to other mirror: nvme0n1p3

Dec 31 12:43:11  kernel: [663588.474940] blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
Dec 31 12:43:11  kernel: [663588.474943] md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
Dec 31 12:43:11  kernel: [663588.499466] md/raid1:md1: redirecting sector 1747863296 to other mirror: nvme0n1p3

sometimes followed by:

kernel: [313519.337578] md/raid1:md1: read error corrected (8 sectors at 1367197592 on nvme1n1p3)
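
The two sector numbers in each triple describe the same location in different coordinate systems: `blk_update_request` reports a sector on the whole disk, while the `md/raid1` lines report a sector inside the array. A minimal Python sketch of that arithmetic, using only numbers from the log lines above (the helper name is illustrative). Reading the constant difference as the partition start plus the md data offset is an assumption that could be checked against the partition table and `mdadm --examine`:

```python
import re

LOG = """\
blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ)
md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ)
md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
"""

def disk_to_md_offsets(log_text):
    """Pair each medium error with the next md 'rescheduling' line and
    return {device: disk_sector - md_sector}."""
    offsets = {}
    pending = None  # (device, disk_sector) from the most recent medium error
    for line in log_text.splitlines():
        err = re.search(r"critical medium error, dev (\w+), sector (\d+)", line)
        if err:
            pending = (err.group(1), int(err.group(2)))
            continue
        res = re.search(r"rescheduling sector (\d+)", line)
        if res and pending:
            dev, disk_sector = pending
            offsets[dev] = disk_sector - int(res.group(1))
            pending = None
    return offsets

offsets = disk_to_md_offsets(LOG)
print(offsets)
# Both drives give 67579904 sectors (about 32.2 GiB at 512 bytes per sector).
```

The identical offset on both drives suggests the error and rescheduling lines really do refer to the same array location, and that the two partitions are laid out the same way.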

I ran smartctl to look for problems. It indicates that errors have happened in the past, but it also says "SMART overall-health self-assessment test result: PASSED."

For /dev/nvme0n1:

$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M406242D
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,017,558,851,584 [1.01 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec 31 14:01:33 2021 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    73%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    232,548,547 [119 TB]
Data Units Written:                 58,761,625 [30.0 TB]
Host Read Commands:                 1,144,416,417
Host Write Commands:                1,551,430,546
Controller Busy Time:               7,250
Power Cycles:                       114
Power On Hours:                     6,365
Unsafe Shutdowns:                   73
Media and Data Integrity Errors:    694
Error Information Log Entries:      926
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               50 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        926    28  0x0370  0xc502  0x000   3738332404     1     -
  1        925     6  0x015b  0xc502  0x000   2503721366     1     -
  2        924    22  0x0000  0xc502  0x000   1963251598     1     -
  3        923    11  0x038a  0xc502  0x000   1862557082     1     -
  4        922    16  0x00d1  0xc502  0x000   1862557082     1     -
  5        921     6  0x0141  0xc502  0x000   1826459600     1     -
  6        920    20  0x03b5  0xc502  0x000   1815443442     1     -
  7        919     8  0x034d  0xc502  0x000   2588273810     1     -
  8        918    11  0x0315  0xc502  0x000   2583041964     1     -
  9        917     9  0x02e3  0xc502  0x000   2583041964     1     -
 10        916    11  0x030e  0xc502  0x000   2583023500     1     -
 11        915    11  0x0308  0xc502  0x000   2583023468     1     -
 12        914    11  0x033a  0xc502  0x000   2583023500     1     -
 13        913     9  0x02ec  0xc502  0x000   2583023468     1     -
 14        912    14  0x03d2  0xc502  0x000   2472005420     1     -
 15        911    23  0x00cd  0xc502  0x000   2444721868     1     -
... (32 entries not shown)
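
The `Status` column can be decoded by hand. A minimal Python sketch, assuming smartctl prints the raw 16-bit status word from the NVMe error log entry (phase tag in bit 0, status field in bits 15:1) and using the field layout from the NVMe base specification. The lookup tables name only the codes seen in this log and are deliberately incomplete:

```python
# Partial lookup tables: only the values that appear in this question.
SCT_NAMES = {0x2: "Media and Data Integrity Errors"}
SC_NAMES = {(0x2, 0x81): "Unrecovered Read Error"}

def decode_nvme_status(raw):
    """Split a raw 16-bit NVMe status word into its fields."""
    status = raw >> 1              # drop the phase tag (bit 0)
    sc = status & 0xFF             # Status Code
    sct = (status >> 8) & 0x7      # Status Code Type
    return {
        "sct": sct,
        "sc": sc,
        "sct_name": SCT_NAMES.get(sct, "not in this table"),
        "sc_name": SC_NAMES.get((sct, sc), "not in this table"),
        "more": bool(status & 0x2000),          # more info in the error log
        "do_not_retry": bool(status & 0x4000),  # retrying is not expected to help
    }

decoded = decode_nvme_status(0xC502)
print(decoded)
```

Under those assumptions, every entry above is an unrecovered read error reported by the drive itself, which is consistent with the kernel's "critical medium error" wording.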

For /dev/nvme1n1:

$ sudo smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M403333H
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,044,938,612,736 [1.04 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec 31 14:03:07 2021 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    81%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    180,057,901 [92.1 TB]
Data Units Written:                 77,700,415 [39.7 TB]
Host Read Commands:                 801,630,346
Host Write Commands:                1,566,190,001
Controller Busy Time:               6,925
Power Cycles:                       156
Power On Hours:                     6,260
Unsafe Shutdowns:                   86
Media and Data Integrity Errors:    721
Error Information Log Entries:      1,015
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               52 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1015    22  0x0178  0xc502  0x000   2395920012     1     -
  1       1014    31  0x02d6  0xc502  0x000   2065018576     1     -
  2       1013    10  0x004e  0xc502  0x000   1928508102     1     -
  3       1012     6  0x02aa  0xc502  0x000   2769949126     1     -
  4       1011    27  0x0204  0xc502  0x000   2180665946     1     -
  5       1010    27  0x023b  0xc502  0x000   2180598396     1     -
  6       1009    14  0x00ee  0xc502  0x000   2562333810     1     -
  7       1008    13  0x0075  0xc502  0x000   2423243572     1     -
  8       1007    30  0x03bb  0xc502  0x000   2326927278     1     -
  9       1006    24  0x03e6  0xc502  0x000   1775468746     1     -
 10       1005    16  0x0066  0xc502  0x000   1775468746     1     -
 11       1004    23  0x0148  0xc502  0x000   2813092280     1     -
 12       1003    26  0x02fa  0xc502  0x000   2452856518     1     -
 13       1002     5  0x03b1  0xc502  0x000   2119789206     1     -
 14       1001    27  0x009b  0xc502  0x000   3047371772     1     -
 15       1000     5  0x036c  0xc502  0x000   3047371772     1     -
... (5 entries not shown)
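
The LBA column can be matched against the kernel messages quoted earlier. The kernel line gives the first sector of the failed read request, while the drive's log gives the LBA it could not read, so a match means the logged LBA falls a short distance after the request's start sector. A minimal Python sketch using numbers from this question; the 4096-sector window is an assumption (the 512-page maximum transfer size, taking 4 KiB pages and 512-byte sectors), not a value taken from the logs:

```python
def match_kernel_errors(kernel_sectors, logged_lbas, window=4096):
    """For each kernel-reported request start sector, list how far into
    the request each matching logged LBA lies (in sectors)."""
    matches = {}
    for start in kernel_sectors:
        matches[start] = [
            lba - start for lba in logged_lbas if 0 <= lba - start < window
        ]
    return matches

# Kernel "critical medium error" start sectors from syslog.
nvme1_kernel = [2769948928]
nvme0_kernel = [1815443200]

# A few LBAs from each drive's Error Information log.
nvme1_lbas = [2395920012, 2065018576, 1928508102, 2769949126, 2180665946]
nvme0_lbas = [3738332404, 2503721366, 1963251598, 1826459600, 1815443442]

nvme1_matches = match_kernel_errors(nvme1_kernel, nvme1_lbas)
nvme0_matches = match_kernel_errors(nvme0_kernel, nvme0_lbas)
print(nvme1_matches)  # the failing LBA is 198 sectors into the request
print(nvme0_matches)  # the failing LBA is 242 sectors into the request
```

Each kernel message lines up with exactly one entry in the corresponding drive's own error log, so the drives and the kernel appear to be describing the same read failures.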

The two drives do not appear to support self-tests (smartctl -c does not list any self tests at all).

$ sudo smartctl -c /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

Updating my question:

Some of the errors appear to be attributable to the checkarray script, which runs once a month: those errors begin exactly when it is scheduled to run, "on the first Sunday of each month, at 01:06 in the morning". "man md" adds:

[On] RAID1 it is possible for software issues to cause a mismatch to be reported [between the two disks]. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space. The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.
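
The mismatch count this passage refers to is exposed by the md driver in sysfs as `/sys/block/md1/md/mismatch_cnt`, next to `sync_action`, which reports whether a check is currently running. A minimal Python sketch that reads both; the function name and the `sysfs_root` parameter are illustrative, the latter existing only so the function can be pointed at a test directory:

```python
from pathlib import Path

def md_scrub_state(array="md1", sysfs_root="/sys"):
    """Read the scrub state and mismatch count of an md array from sysfs.

    Values are None when the files are missing (for example, when the
    array does not exist on this machine)."""
    md_dir = Path(sysfs_root) / "block" / array / "md"

    def read(name):
        path = md_dir / name
        return path.read_text().strip() if path.exists() else None

    count = read("mismatch_cnt")
    return {
        "sync_action": read("sync_action"),
        "mismatch_cnt": int(count) if count is not None else None,
    }

print(md_scrub_state())
```

Note that a mismatch (the two mirrors hold different data) is a different condition from the medium errors in syslog (a drive could not read a sector at all), so a mismatch count of zero would not by itself explain the read errors.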

What should I do next? Thank you very much.
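
One way to test the checkarray explanation is to classify each error timestamp by whether it falls on the first Sunday of its month. A minimal Python sketch; syslog timestamps carry no year, so the year is passed in explicitly (2021 here, taken from the smartctl "Local Time" line), and the second timestamp is a hypothetical example rather than a line from my logs:

```python
from datetime import datetime

def parse_syslog_time(stamp, year):
    """Parse a syslog timestamp such as 'Dec 31 12:32:56' for a given year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def is_first_sunday(moment):
    """True if `moment` falls on the first Sunday of its month."""
    return moment.weekday() == 6 and moment.day <= 7

logged = parse_syslog_time("Dec 31 12:32:56", 2021)    # from the log above
hypothetical = parse_syslog_time("Dec 5 01:30:00", 2021)  # made-up example

print(is_first_sunday(logged))        # False: 31 Dec 2021 was a Friday
print(is_first_sunday(hypothetical))  # True: 5 Dec 2021 was a Sunday
```

A long check can run past midnight into Monday, so this is only a rough filter. Even so, the errors quoted above from 31 December fall outside the checkarray window.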

Nmath:
Always trust the worst report. Make sure that backups are in order. Remember that [RAID is not a backup](https://www.raidisnotabackup.com/). Plan for replacing the defective drive, sooner or later.
DanB:
Thanks. What does it mean that errors suddenly appeared on *both* SSDs in the array? (Some messages place the error on `/dev/nvme0n1` and others on `/dev/nvme1n1`.)
Nmath:
If they are mirrored it could be inconsistencies between the two disks.
DanB:
They are mirrored. Is there a command to check or correct the situation if the disks are inconsistent? PS: I just discovered (by searching old, archived logs) that these errors have been happening for many months on both drives, usually at the same time of day, when they are being backed up by rsync (to another drive).
Nmath:
The errors that you are seeing are notifying you of the corrections being made. That's what's meant by the "rescheduling" and "redirecting".
DanB:
Thanks! Last question: what does it signify if the error is *not* followed by a scheduling/correcting message, such as this message by itself: "kernel: [905111.122813] blk_update_request: critical medium error, dev nvme1n1, sector 34055424 op 0x0:(READ) ..."?
DanB:
Ah, just discovered something in `/usr/share/doc/mdadm/README.checkarray`! "checkarray will run parity checks across all your redundant arrays. By default, it is configured to run on the first Sunday of each month, at 01:06 in the morning." That date & time period exactly corresponds with most of the error messages in my logs. (Not all the error messages though.)
Nmath:
Not sure about errors that aren't followed by a correction. It makes sense to me that most corrections would take place when the system is checking for them.