I am managing a server that uses two NVMe SSDs in a RAID 1 array.
At one point I lost access to one of the two drives and got the usual degraded-array mails from mdadm.
So I asked the hosting company to check it out; they said the drive's contacts needed cleaning to make better contact, and once they did that the machine picked up the NVMe again and started rebuilding the array.
When rebuilding finished I went in and checked the results.
The SSDs are not new; they are used, so the SMART readings should reflect that.
When I ran nvme list
I got the following result:
| => nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S************1 SAMSUNG MZVKW512HMJP-00000 1 36.70 GB / 512.11 GB 512 B + 0 B CXA7500Q
/dev/nvme1n1 S************5 SAMSUNG MZVL2512HCJQ-00B00 1 511.95 GB / 512.11 GB 512 B + 0 B GXA7801Q
Now the server is pretty old, but I got it second hand and reformatted it a couple of weeks ago, so it's pretty empty right now. 36.70 GB of used space on member 1 seems about right. The second member is the one that was rebuilt, and it reports 511.95 GB used. That makes no sense to me on a RAID 1 array (or does it?); please correct me if I'm wrong.
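For reference, as far as I understand it, the Usage column in nvme list is not filesystem usage at all: it is the namespace's NUSE (namespace utilization) field multiplied by the LBA size, while the capacity side comes from NSZE (namespace size). A minimal sketch of that arithmetic, assuming an LBA count of 1,000,215,216, which matches the 512.11 GB capacity shown above:

```python
# Sketch: how nvme-cli derives the Usage / capacity columns.
# Usage    = NUSE (namespace utilization, in LBAs) * LBA size
# Capacity = NSZE (namespace size, in LBAs)        * LBA size
# NSZE below is an assumed value, chosen to match the 512.11 GB
# capacity that `nvme list` prints for this 512 GB drive.

LBA_SIZE = 512            # bytes, per the "512 B + 0 B" format column
NSZE = 1_000_215_216      # assumed total LBA count for this capacity

capacity_gb = NSZE * LBA_SIZE / 1e9
print(f"Capacity: {capacity_gb:.2f} GB")   # Capacity: 512.11 GB

# If a rebuild rewrote (almost) every LBA, NUSE approaches NSZE,
# so Usage approaches full capacity regardless of what the
# filesystem thinks it is using.
```

So if Usage tracks how many LBAs the controller considers in use, a full-disk rewrite could plausibly push it to nearly the whole capacity.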
I mean, the system works just fine. When I run:
| => cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
33520640 blocks super 1.2 [2/2] [UU]
md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
1046528 blocks super 1.2 [2/2] [UU]
md2 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
465370432 blocks super 1.2 [2/2] [UU]
bitmap: 4/4 pages [16KB], 65536KB chunk
unused devices: <none>
I see that the software RAID array works just fine, and those two drives should be identical. So what does that 511.95 GB Usage on the second NVMe mean? Is it normal?
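Out of curiosity I also added up the md device sizes from the mdstat output above (mdstat reports sizes in 1 KiB blocks), and they land suspiciously close to that Usage figure:

```python
# The three md arrays' sizes from /proc/mdstat, in 1 KiB blocks.
md_blocks = {
    "md0": 33_520_640,
    "md1": 1_046_528,
    "md2": 465_370_432,
}

total_bytes = sum(md_blocks.values()) * 1024
print(f"{total_bytes / 1e9:.2f} GB")  # 511.94 GB
```

That is within a whisker of the 511.95 GB Usage on the rebuilt drive, so the partitions really do span essentially the whole disk.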
I checked what smartmontools reports and got this:
| => smartctl -A /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 31 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 25,639 [13.1 GB]
Data Units Written: 2,127,320 [1.08 TB]
Host Read Commands: 101,600
Host Write Commands: 8,203,941
Controller Busy Time: 239
Power Cycles: 7
Power On Hours: 26
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 31 Celsius
Temperature Sensor 2: 31 Celsius
(Yes, I know: Power On Hours is 26. This NVMe is brand new; I got confirmation from the hosting company.)
Everything else on the drive seems just fine.
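As a sanity check on those counters: per the NVMe specification, one SMART data unit is 1000 × 512 bytes, so the TB figure smartctl prints can be reproduced from the raw counter (a sketch, using the Data Units Written value from the report above):

```python
DATA_UNIT = 512_000  # bytes: one NVMe SMART data unit = 1000 * 512-byte LBAs

units_written = 2_127_320   # "Data Units Written" from smartctl above
bytes_written = units_written * DATA_UNIT
print(f"{bytes_written / 1e12:.2f} TB")  # 1.09 TB
```

That comes out around 1.09 TB, in line with the [1.08 TB] smartctl shows (the two tools just round/truncate differently).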
The other drive is much older, and its smartmontools report is:
| => smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 26%
Data Units Read: 115,783,912 [59.2 TB]
Data Units Written: 281,087,251 [143 TB]
Host Read Commands: 1,142,872,239
Host Write Commands: 8,039,604,613
Controller Busy Time: 38,359
Power Cycles: 519
Power On Hours: 16,843
Unsafe Shutdowns: 496
Media and Data Integrity Errors: 0
Error Information Log Entries: 154
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 33 Celsius
This also looks fine and as expected.
But for some reason nvme list
shows the rebuilt drive using nearly the full 512 GB. How can this be the case? Was the rebuild not properly completed?
What do you think?