I have a RAID-1 array of SSDs (Samsung 970 EVO Plus), and errors are showing up in /var/log/syslog, but smartctl reports that the drives are healthy. I've done a bunch of diagnosis (below) and I'm wondering if there's anything else I can do. Is there actually a problem, and if so, what's the best course of action? (On Kubuntu 18.04.6 LTS.)
Here's the array:
$ cat /proc/mdstat
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[2]
1919724608 blocks super 1.2 [2/2] [UU]
bitmap: 5/15 pages [20KB], 65536KB chunk
It appears healthy, according to mdadm:
$ sudo mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Sat Feb 29 12:33:09 2020
Raid Level : raid1
Array Size : 1919724608 (1830.79 GiB 1965.80 GB)
Used Dev Size : 1919724608 (1830.79 GiB 1965.80 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Dec 31 14:04:55 2021
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : kubuntu:1
UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
Events : 41087
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme0n1p3
2 259 7 1 active sync /dev/nvme1n1p3
However, some read errors have started appearing in /var/log/syslog, in triples:
Dec 31 12:32:56 kernel: [662973.969218] blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Dec 31 12:32:56 kernel: [662973.969222] md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
Dec 31 12:32:56 kernel: [662973.978792] md/raid1:md1: redirecting sector 2702369024 to other mirror: nvme0n1p3
Dec 31 12:43:11 kernel: [663588.474940] blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
Dec 31 12:43:11 kernel: [663588.474943] md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
Dec 31 12:43:11 kernel: [663588.499466] md/raid1:md1: redirecting sector 1747863296 to other mirror: nvme0n1p3
sometimes followed by:
kernel: [313519.337578] md/raid1:md1: read error corrected (8 sectors at 1367197592 on nvme1n1p3)
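As a sanity check on the sector numbers, here is a minimal Python sketch that pairs each block-layer error with the md message that follows it and subtracts the two sector values. The log lines are copied from the excerpt above; the helper name `sector_offsets` is made up for illustration. Both pairs differ by the same constant (67579904 sectors), which is consistent with the two messages describing the same location, once relative to the whole device and once relative to the array. Presumably that constant is the partition start plus the md data offset, but that is an assumption and has not been verified here.

```python
import re

# Kernel log lines copied from the syslog excerpt above (timestamps removed).
LOG = """\
blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
"""

BLK_RE = re.compile(r"critical medium error, dev (\S+), sector (\d+)")
MD_RE = re.compile(r"md/raid1:\S+ (\S+?)p\d+: rescheduling sector (\d+)")


def sector_offsets(log):
    """Return (device, device_sector - md_sector) for each error pair."""
    offsets = []
    pending = None  # (device, sector) from the most recent block-layer error
    for line in log.splitlines():
        blk = BLK_RE.search(line)
        if blk:
            pending = (blk.group(1), int(blk.group(2)))
            continue
        md = MD_RE.search(line)
        # Only pair the md message with an error on the same device.
        if md and pending and md.group(1) == pending[0]:
            offsets.append((pending[0], pending[1] - int(md.group(2))))
            pending = None
    return offsets


for dev, off in sector_offsets(LOG):
    # 512-byte sectors, as reported by "Formatted LBA Size" in smartctl.
    print(dev, off, "sectors =", off * 512 // 2**20, "MiB")
```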
I ran smartctl to look for problems. It indicates that errors have happened in the past, but it also says "SMART overall-health self-assessment test result: PASSED."
For /dev/nvme0n1:
$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 2TB
Serial Number: S464NB0M406242D
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,017,558,851,584 [1.01 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Dec 31 14:01:33 2021 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 46 Celsius
Available Spare: 73%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 232,548,547 [119 TB]
Data Units Written: 58,761,625 [30.0 TB]
Host Read Commands: 1,144,416,417
Host Write Commands: 1,551,430,546
Controller Busy Time: 7,250
Power Cycles: 114
Power On Hours: 6,365
Unsafe Shutdowns: 73
Media and Data Integrity Errors: 694
Error Information Log Entries: 926
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 50 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 926 28 0x0370 0xc502 0x000 3738332404 1 -
1 925 6 0x015b 0xc502 0x000 2503721366 1 -
2 924 22 0x0000 0xc502 0x000 1963251598 1 -
3 923 11 0x038a 0xc502 0x000 1862557082 1 -
4 922 16 0x00d1 0xc502 0x000 1862557082 1 -
5 921 6 0x0141 0xc502 0x000 1826459600 1 -
6 920 20 0x03b5 0xc502 0x000 1815443442 1 -
7 919 8 0x034d 0xc502 0x000 2588273810 1 -
8 918 11 0x0315 0xc502 0x000 2583041964 1 -
9 917 9 0x02e3 0xc502 0x000 2583041964 1 -
10 916 11 0x030e 0xc502 0x000 2583023500 1 -
11 915 11 0x0308 0xc502 0x000 2583023468 1 -
12 914 11 0x033a 0xc502 0x000 2583023500 1 -
13 913 9 0x02ec 0xc502 0x000 2583023468 1 -
14 912 14 0x03d2 0xc502 0x000 2472005420 1 -
15 911 23 0x00cd 0xc502 0x000 2444721868 1 -
... (32 entries not shown)
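One way to tie the drive's own error log to the kernel messages is to look for logged LBAs that sit just past a sector the kernel complained about. The sketch below uses only numbers copied from the output above: the kernel error on nvme0n1 at sector 1815443200 and the 16 LBAs listed in the error log. The 2048-sector window and the function name `nearby_lbas` are arbitrary choices for illustration, not something taken from the logs. Under those assumptions, entry 6 (LBA 1815443442) falls 242 sectors after the start of the failed request.

```python
# LBAs from the "Error Information" table for /dev/nvme0n1 above.
LOGGED_LBAS = [
    3738332404, 2503721366, 1963251598, 1862557082, 1862557082,
    1826459600, 1815443442, 2588273810, 2583041964, 2583041964,
    2583023500, 2583023468, 2583023500, 2583023468, 2472005420,
    2444721868,
]

# Start sector of the failed read reported in syslog for dev nvme0n1.
KERNEL_ERROR_SECTOR = 1815443200


def nearby_lbas(start_sector, lbas, window=2048):
    """Return (lba, distance) for LBAs in [start_sector, start_sector + window).

    The window size is a guess at how far a single request might extend.
    """
    return [
        (lba, lba - start_sector)
        for lba in lbas
        if 0 <= lba - start_sector < window
    ]


print(nearby_lbas(KERNEL_ERROR_SECTOR, LOGGED_LBAS))
```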
For /dev/nvme1n1:
$ sudo smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 2TB
Serial Number: S464NB0M403333H
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,044,938,612,736 [1.04 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Dec 31 14:03:07 2021 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 81%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 180,057,901 [92.1 TB]
Data Units Written: 77,700,415 [39.7 TB]
Host Read Commands: 801,630,346
Host Write Commands: 1,566,190,001
Controller Busy Time: 6,925
Power Cycles: 156
Power On Hours: 6,260
Unsafe Shutdowns: 86
Media and Data Integrity Errors: 721
Error Information Log Entries: 1,015
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 45 Celsius
Temperature Sensor 2: 52 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1015 22 0x0178 0xc502 0x000 2395920012 1 -
1 1014 31 0x02d6 0xc502 0x000 2065018576 1 -
2 1013 10 0x004e 0xc502 0x000 1928508102 1 -
3 1012 6 0x02aa 0xc502 0x000 2769949126 1 -
4 1011 27 0x0204 0xc502 0x000 2180665946 1 -
5 1010 27 0x023b 0xc502 0x000 2180598396 1 -
6 1009 14 0x00ee 0xc502 0x000 2562333810 1 -
7 1008 13 0x0075 0xc502 0x000 2423243572 1 -
8 1007 30 0x03bb 0xc502 0x000 2326927278 1 -
9 1006 24 0x03e6 0xc502 0x000 1775468746 1 -
10 1005 16 0x0066 0xc502 0x000 1775468746 1 -
11 1004 23 0x0148 0xc502 0x000 2813092280 1 -
12 1003 26 0x02fa 0xc502 0x000 2452856518 1 -
13 1002 5 0x03b1 0xc502 0x000 2119789206 1 -
14 1001 27 0x009b 0xc502 0x000 3047371772 1 -
15 1000 5 0x036c 0xc502 0x000 3047371772 1 -
... (5 entries not shown)
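Every entry in both error logs carries the same Status value, 0xc502. The sketch below splits it into fields, assuming that smartctl 6.6 prints the raw 16-bit status field of the NVMe error log entry (phase tag in bit 0, status code in bits 8:1, status code type in bits 11:9, "More" in bit 14, "Do Not Retry" in bit 15). The two lookup tables are deliberately partial and only cover the values that appear here. Under that assumption the value decodes to status code type 2 (media and data integrity errors), status code 0x81 (unrecovered read error), which matches the "critical medium error" wording in syslog.

```python
# Partial lookup tables: only the values relevant to this post.
SCT_NAMES = {0x2: "Media and Data Integrity Errors"}
SC_NAMES = {(0x2, 0x81): "Unrecovered Read Error"}


def decode_nvme_status(raw):
    """Split a raw 16-bit NVMe status field into its components.

    Assumes bit 0 is the phase tag, as in the error log entry layout.
    """
    sct = (raw >> 9) & 0x7
    sc = (raw >> 1) & 0xFF
    return {
        "phase": raw & 0x1,
        "sc": sc,
        "sct": sct,
        "more": bool((raw >> 14) & 0x1),
        "dnr": bool((raw >> 15) & 0x1),
        "sct_name": SCT_NAMES.get(sct, "unknown"),
        "sc_name": SC_NAMES.get((sct, sc), "unknown"),
    }


print(decode_nvme_status(0xC502))
```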
The two drives do not appear to support self-tests (smartctl -c does not list any self-tests at all).
$ sudo smartctl -c /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
Updating my question:
Some of the errors appear to be attributable to the checkarray script that runs once a month, because the errors begin "on the first Sunday of each month, at 01:06 in the morning". "man md" adds:
[On] RAID1 it is possible for software issues to cause a mismatch to be reported [between the two disks]. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space. The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.
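The mismatch count that "man md" refers to is exposed through sysfs, so it can be read after a check finishes. The following is a minimal sketch, assuming the array is md1 and that the standard md sysfs files sync_action and mismatch_cnt are present under /sys/block/md1/md; the function name `md_check_state` and its `sysfs_root` parameter are made up for illustration. It only reads state and does not start or stop a check.

```python
from pathlib import Path


def md_check_state(array="md1", sysfs_root="/sys/block"):
    """Read the current sync action and mismatch count for an md array."""
    md_dir = Path(sysfs_root) / array / "md"
    return {
        # "idle" when nothing is running, "check" during a scrub, etc.
        "sync_action": (md_dir / "sync_action").read_text().strip(),
        # Count of mismatched sectors found by the last check or repair.
        "mismatch_cnt": int((md_dir / "mismatch_cnt").read_text()),
    }


if __name__ == "__main__":
    try:
        print(md_check_state())
    except OSError as exc:
        # Missing array or insufficient permissions.
        print("could not read md state:", exc)
```

A nonzero mismatch_cnt after a check is the condition the quoted man page passage is describing. It is a separate question from the medium errors, which the drives themselves report.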
What should I do next? Thank you very much.