Both of our servers suffer from

At the start of every month we get this error and have to repair the RAID using:

echo 'repair' >/sys/block/<md id>/md/sync_action

If I'm not mistaken, this check is triggered by mdcheck_start.timer.
The repair takes around 5 hours; after that the array appears to be back in sync, or at least I think so.

My question is whether this is the correct way to fix out-of-sync blocks in the RAID. What causes them, and how can I tell whether it is a hardware/disk error? Thank you!
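
For reference, the kernel exposes the result of the last check in `mismatch_cnt`, next to the `sync_action` file used above. The following is only a minimal sketch for reading both; the function name `md_sync_report` and its optional second argument (an alternative sysfs root, useful for dry runs) are made up for this example.

```shell
#!/bin/sh
# Print the current sync action and mismatch count of an MD array.
# $1 = array name (e.g. md2)
# $2 = sysfs block root (hypothetical override, defaults to /sys/block)
md_sync_report() {
    md_dir="${2:-/sys/block}/$1/md"
    if [ ! -r "$md_dir/mismatch_cnt" ]; then
        echo "no md array at $md_dir" >&2
        return 1
    fi
    printf '%s: sync_action=%s mismatch_cnt=%s\n' \
        "$1" "$(cat "$md_dir/sync_action")" "$(cat "$md_dir/mismatch_cnt")"
}
```

Calling `md_sync_report md2` once the monthly run has finished shows how many sectors were flagged. Writing `check` to `sync_action` only counts mismatches, whereas `repair` also rewrites them.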

EDIT: /etc/fstab contains:

# /etc/fstab: static file system information.

# / was on /dev/md2p1 during curtin installation
/dev/disk/by-id/md-uuid-b0b68adb:353b70e8:fa806910:a78761e9-part1 / ext4 defaults 0 0

# /vol/data was on /dev/md3p1 during curtin installation
/dev/disk/by-id/md-uuid-2360fc63:991922f4:33aae17f:12f23590-part1 /vol/data ext4 defaults 0 0

# /boot was on /dev/md0p1 during curtin installation
/dev/disk/by-id/md-uuid-a76428ff:270597e7:70ed6c91:026d2441-part1 /boot ext4 defaults 0 0

UUID="5c389b41-007d-4893-b81c-5560cb2d6ff9" /vol/backup ext4 defaults 0 0
/vol/shared    nfs    defaults    0 0

Output of lsblk --discard:

loop0              0        4K       4G         0
loop1              0        4K       4G         0
loop2              0        4K       4G         0
loop3              0        4K       4G         0
loop4              0        4K       4G         0
loop5              0        4K       4G         0
loop6              0        4K       4G         0
loop7              0        4K       4G         0
loop8              0        4K       4G         0
sda                0        4K       2G         0
├─sda1             0        4K       2G         0
├─sda2             0        4K       2G         0
│ └─md0            0        4K       2G         0
│   └─md0p1        0        4K       2G         0
├─sda3             0        4K       2G         0
│ └─md1            0        4K       2G         0
│   └─md1p1        0        4K       2G         0
└─sda4             0        4K       2G         0
  └─md2            0        4K       2G         0
    └─md2p1        0        4K       2G         0
sdb                0        4K       2G         0
├─sdb1             0        4K       2G         0
├─sdb2             0        4K       2G         0
│ └─md0            0        4K       2G         0
│   └─md0p1        0        4K       2G         0
├─sdb3             0        4K       2G         0
│ └─md1            0        4K       2G         0
│   └─md1p1        0        4K       2G         0
└─sdb4             0        4K       2G         0
  └─md2            0        4K       2G         0
    └─md2p1        0        4K       2G         0
sdc                0        0B       0B         0
└─sdc1             0        0B       0B         0
nvme1n1            0      512B       2T         0
└─md3              0      512B       2T         0
  └─md3p1          0      512B       2T         0
nvme0n1            0      512B       2T         0
└─md3              0      512B       2T         0
  └─md3p1          0      512B       2T         0

Output of smartctl -i /dev/sd[ab]:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke,

Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KG960G8
Serial Number:    BTYG024601ZC960CGN
LU WWN Device Id: 5 5cd2e4 152b3fddf
Firmware Version: XCV10120
User Capacity:    960,197,124,096 bytes [960 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb  2 07:43:15 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Output of mdadm --detail /dev/md2:

           Version : 1.2
     Creation Time : Tue Nov 24 21:02:34 2020
        Raid Level : raid1
        Array Size : 919731200 (877.12 GiB 941.80 GB)
     Used Dev Size : 919731200 (877.12 GiB 941.80 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Feb  2 07:43:33 2022
             State : active
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : ubuntu-server:2
              UUID : b0b68adb:353b70e8:fa806910:a78761e9
            Events : 24281

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       8       20        1      active sync   /dev/sdb4

Output of smartctl -A -l error /dev/sda:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke,

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10469
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       7
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2591 (8 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   079   075   000    Old_age   Always       -       21 (Min/Max 12/27)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       21
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1006057
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       419
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       52
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       628023
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2591 (8 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1006057
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       1112548
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1730576

SMART Error Log Version: 1
No Errors Logged

Output of smartctl -A -l error /dev/sdb:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke,

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10469
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       7
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2479 (8 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   078   073   000    Old_age   Always       -       22 (Min/Max 12/29)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       22
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1064411
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       440
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       45
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       628005
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2479 (8 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1064411
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       876800
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1801020

SMART Error Log Version: 1
No Errors Logged
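
To make the attribute tables above easier to scan, a small filter can pick out the counters that usually matter when hunting for media or cabling faults. This is a rough sketch, not a diagnostic rule: the function name `smart_suspects` and the choice of attribute IDs (5, 187, 197, 199) are assumptions of this example.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print ID, name and raw value of
# selected attributes whose raw value is non-zero.
# The ID list (5, 187, 197, 199) is an assumption of this sketch.
smart_suspects() {
    awk '$1 ~ /^(5|187|197|199)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}
```

Usage would be `smartctl -A /dev/sda | smart_suspects`. Fed with either of the two tables above it prints nothing, because all four counters are 0 on both drives.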

David Pivoňka

@anx The kernel version is 5.4.0-92-generic. I'm not sure whether the filesystem uses the discard feature; how can I tell? We did not set anything like that during installation. EDIT: Added the /etc/fstab content to the post.
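
One way to answer the discard question is to look for the `discard` mount option among the active mounts. A minimal sketch, assuming an fstab-style input where the fourth field holds the options; the function name `mounts_with_discard` is made up for this example.

```shell
#!/bin/sh
# Print the mount points whose option field contains the exact option "discard".
# $1 = file in fstab/mounts format (defaults to /proc/mounts)
mounts_with_discard() {
    awk '$1 !~ /^#/ && NF >= 4 {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

An empty result means no filesystem is mounted with online discard. Periodic trimming through something like `fstrim.timer` would not show up here and has to be checked separately.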
Nikita Kipriyanov

Please show the output of `lsblk --discard`.
David Pivoňka

@NikitaKipriyanov Added to the main post.
Nikita Kipriyanov

So which array shows this behaviour?
David Pivoňka

We are repairing it using `echo 'repair' >/sys/block/md2/md/sync_action`, so it should be md2, which `cat /proc/mdstat` lists as `md2 : active raid1 sdb4[1] sda4[0]`.
Nikita Kipriyanov

Unfortunately, these MD indices aren't stable; they may change after a reboot. Still, md2 is currently on sda and sdb. What are those devices? Please show the `smartctl` output for them, and also `mdadm --detail /dev/md2`.
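
Since the mdN number can change, the array UUID is the safer handle. The fstab in the question already uses `/dev/disk/by-id/md-uuid-…` links, so a sketch like the following could map a UUID back to whatever device node it currently has. The function name `md_dev_for_uuid` and the optional directory argument are assumptions of this example.

```shell
#!/bin/sh
# Resolve an array UUID to the device node it currently maps to.
# $1 = array UUID as printed by `mdadm --detail`
# $2 = directory holding the md-uuid-* links (defaults to /dev/disk/by-id)
md_dev_for_uuid() {
    link="${2:-/dev/disk/by-id}/md-uuid-$1"
    if [ ! -L "$link" ]; then
        echo "no such link: $link" >&2
        return 1
    fi
    readlink -f "$link"
}
```

For example, `md_dev_for_uuid b0b68adb:353b70e8:fa806910:a78761e9` would print the device node that the root array is using at the moment.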
David Pivoňka

Added. I should also mention that we have a secondary server which is identical to this one and the problem also occurs there.
Nikita Kipriyanov

Nice to see the info about the SSDs. But you posted two identical outputs where only the serial numbers differ; keeping one copy is enough. What I wanted to see is the attributes and the error log: `smartctl -A -l error /dev/sd[ab]`. // I fear MD RAID is not the best technology to use on these SSDs. This is a case where a filesystem with integrated volume management, for instance ZFS or Btrfs, might be more appropriate.
David Pivoňka

Added the output of `smartctl -A -l error`. So you are saying it might help to replace MD RAID with some kind of hardware RAID?
Nikita Kipriyanov

I literally said that it may be better to replace block-level RAID with filesystem-level RAID. I expect HW RAID to show similar or even stranger symptoms. // We ran into a problem like this today with similar SSDs (S4610 series), so now I have the same problem as you. In my case, though, the machine runs Windows, which doesn't have such filesystems, so we are still exploring.

