Score:1

Erratic SMART readings on one member of a RAID 1 array

cn flag

I am managing a server that uses two NVMe SSDs in a RAID 1 array. At one point I lost access to one of the two and got the usual degraded-array emails from mdadm.

So I asked the hosting company to check it out. They said that the contacts needed cleaning to make a better connection, and once they did that the machine picked up the NVMe drive and started rebuilding the array.

When rebuilding finished I went in and checked the results. The SSDs are not new; they are used, so the SMART readings should reflect this.

When I ran `nvme list` I got the following result:

| => nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S************1       SAMSUNG MZVKW512HMJP-00000               1          36.70  GB / 512.11  GB    512   B +  0 B   CXA7500Q
/dev/nvme1n1          S************5       SAMSUNG MZVL2512HCJQ-00B00               1         511.95  GB / 512.11  GB    512   B +  0 B   GXA7801Q

Now the server is pretty old, but I got it second hand and reformatted it a couple of weeks ago, so it's pretty empty right now. 36.70 GB of used space on member 1 seems correct. The second member is the one that was rebuilt, and it reports 511.95 GB used. This makes no sense on a RAID 1 array (or does it?). Please correct me if I'm wrong.

I mean, the system works just fine. When I run:

| => cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      465370432 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

I see that the software RAID array works just fine. Those two drives should be identical. What does that 511.95 GB Usage mean on the second NVMe drive? Is it normal?

I tried to see what smartmontools would report and got this:

| => smartctl -A /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    25,639 [13.1 GB]
Data Units Written:                 2,127,320 [1.08 TB]
Host Read Commands:                 101,600
Host Write Commands:                8,203,941
Controller Busy Time:               239
Power Cycles:                       7
Power On Hours:                     26
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius
Temperature Sensor 2:               31 Celsius

(Yes, I know, power-on hours is 26. This NVMe drive is brand new. I got confirmation from the hosting company.)

Everything else on the drive seems just fine. The other drive is much older and its smartmontools report is:

| => smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-52-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    26%
Data Units Read:                    115,783,912 [59.2 TB]
Data Units Written:                 281,087,251 [143 TB]
Host Read Commands:                 1,142,872,239
Host Write Commands:                8,039,604,613
Controller Busy Time:               38,359
Power Cycles:                       519
Power On Hours:                     16,843
Unsafe Shutdowns:                   496
Media and Data Integrity Errors:    0
Error Information Log Entries:      154
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               33 Celsius

Which also seems to be just fine and as expected. But for some reason `nvme list` shows the rebuilt drive (/dev/nvme1n1) using 512 GB. How can this be the case? Was the rebuilding process not properly completed?
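
For reference, the bracketed sizes in the smartctl output follow from the raw counters: NVMe reports "data units" in thousands of 512-byte blocks. A small sketch of that conversion, using the counters from the two reports above (the function name is made up for illustration):

```python
# One NVMe SMART "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

def data_units_to_bytes(units):
    """Convert an NVMe SMART data-unit counter to bytes."""
    return units * BYTES_PER_DATA_UNIT

# "Data Units Written" from the two smartctl reports in this post.
new_drive_written = data_units_to_bytes(2_127_320)
old_drive_written = data_units_to_bytes(281_087_251)

print(f"new drive: {new_drive_written / 1e12:.2f} TB written")
print(f"old drive: {old_drive_written / 1e12:.2f} TB written")
```

The results may differ from smartctl's bracketed values in the last digit, depending on how smartctl rounds.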

What do you think?

br flag
Why are you using consumer-grade SSDs in a server? Is this just the boot drive?
escozul avatar
cn flag
This was offered by Hetzner. I got it really cheap so it's done. Isn't it a bit irrelevant though?
br flag
No, not at all. Server Fault is a site for professionals, who inherently wouldn't use consumer-grade parts, with their much lower MTBFs, in a professional setting.
escozul avatar
cn flag
This is a server meant for a professional installation. It is meant to receive about 50 websites that are currently hosted on a VPS. Hetzner in turn is undeniably a professional hosting provider. The array is RAID 1, which reduces the chance of failure significantly. The server is set up with a Xeon and ECC memory. I myself do this for a living. I don't get your "not professional" comment here. I only asked about the output of one particular command that seemed obscure to me. Please let me clarify again: the server works. I just get 512 GB used when running `nvme list` on one disk.
Score:0
st flag

I now see similar results on my system:

Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S69xxxxxxxxxxxxx      Samsung SSD 980 PRO 2TB                  1           2.00  TB /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme1n1          S69xxxxxxxxxxxxx      Samsung SSD 980 PRO 2TB                  1         381.65  GB /   2.00  TB    512   B +  0 B   5B2QGXA7

And mdstat looks ok:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
      1952279552 blocks super 1.2 [2/2] [UU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

Does anybody know why that is?

escozul avatar
cn flag
Is the /dev/nvme0n1 disk much older than /dev/nvme1n1?
Robert Hrovat avatar
st flag
Both were bought on the same day. Their production dates differ by one month.
escozul avatar
cn flag
Listen, I never got an answer or a suggestion here. Instead, I got an irrelevant comment about what hardware I should be using. I fail to see the point of diverting the discussion from what that Usage column means to whether my SSD is pro or consumer grade... Eventually, I figured out what that "Usage" thingie means. Take it with a grain of salt, please:
escozul avatar
cn flag
The Usage column actually shows how much of the available space on the SSD has been used: how much of the physical NAND has been written at least once(?). It is OK if, at a certain point, these values are not the same for both drives. On my system, they eventually matched. Right now the actual NVMe usage is about 137 GB for both of my drives, but if I do a `du -h` I see that only around 32 GB is occupied. Both of my drives, though, have used only 137 of 512 GB of their NVMe physical address space. That's how I interpreted it.
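
For anyone who wants to check this on their own drives: as far as I can tell, the Usage column comes from the namespace's NUSE (blocks currently allocated) and NSZE (namespace size) fields, multiplied by the logical block size. That would be consistent with a freshly resynced member showing as nearly full, since a rebuild writes every block of the partition. Here is a minimal sketch under those assumptions. The JSON field names (`nuse`, `nsze`, `flbas`, `lbafs`, `ds`) are what I expect from `nvme id-ns /dev/nvme0n1 -o json` and may differ between nvme-cli versions, and the sample values are hypothetical.

```python
def namespace_usage(id_ns):
    """Return (used_bytes, total_bytes) from parsed `nvme id-ns -o json` output.

    Assumes the field names nuse, nsze, flbas and lbafs[].ds; check them
    against the output of your own nvme-cli version.
    """
    fmt_index = id_ns["flbas"] & 0x0F                  # low 4 bits pick the active LBA format
    block_size = 1 << id_ns["lbafs"][fmt_index]["ds"]  # ds is log2 of the block size
    return id_ns["nuse"] * block_size, id_ns["nsze"] * block_size

# Hypothetical sample resembling a 512 GB drive with 512-byte blocks.
sample = {
    "nsze": 1000215216,
    "nuse": 71680000,
    "flbas": 0,
    "lbafs": [{"ms": 0, "ds": 9, "rp": 0}],
}
used, total = namespace_usage(sample)
print(f"{used / 1e9:.2f} GB / {total / 1e9:.2f} GB")
```

To feed it real data, the JSON printed by `nvme id-ns` can be parsed with `json.loads` and passed in as `id_ns`.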
Robert Hrovat avatar
st flag
I also don't think anything is wrong. It's just strange how it's shown. Maybe it's just a reading bug.
escozul avatar
cn flag
I believe we're misinterpreting it. As I understand it, it makes sense.