Diagnosing very slow write speed on RAID / Fedora 14

Question

Score:-1

Server

Diagnosing very slow write speed on RAID / Fedora 14

harry courtice

12/19/23, 6:20 AM

We have a Dell PowerEdge T110 running Fedora 14 which operates as our build server for Embedded Linux and as our Subversion server.

Recently it has become very slow, failing to complete nightly backups before the new day begins.

[EDIT] Thanks User9517 - I've checked the log and there are multiple messages from MRMON (Mega Raid Monitor). Any guidance on interpreting these messages, next steps, and how to determine which drive needs replacing would help.

Dec 20 09:02:32 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2
Dec 20 09:06:44 localhost MR_MONITOR[2153]: <MRMON113> Controller ID:  0   Unexpected sense:   PD  #012    =   -:-:2No defect spare location available,   CDB   =    0x28 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ,   Sense   =    0x70 0x00 0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x32 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Dec 20 09:09:44 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2

[/EDIT]

I'm looking for assistance to track down the fault. I'm definitely no expert on this, the person who set up the system originally is no longer here.

The nightly backup is approximately 6 GByte tgz file, which starts at 8pm. This used to finish around 4am (including copying to an external drive). Weekly backup is around 45 GByte, this used to finish at 11am Saturday from a Friday 8pm start.

In addition to the backup, the machine is noticeably slow to respond, even when the backup process is not running.

Here is what I have gathered so far:

There is a RAID controller DELL PERC H200L with Four Seagate 1TB Drives attached (ST31000424SS). I think it's setup for RAID 10, but I don't know how to access the configuration for this controller. I believe RAID 10 because there are 4 drives, and vgdisplay shows 1.81 Terabytes allocated of the 4 terabytes total on the 4 drives.

[root@fedorabox backup]# vgdisplay
  --- Volume group ---
  VG Name               vg_fedorabox2
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  8
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                5
  Open LV               5
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.81 TiB
  PE Size               32.00 MiB
  Total PE              59263
  Alloc PE / Size       28480 / 890.00 GiB
  Free  PE / Size       30783 / 961.97 GiB

I can't see any other actual drives in the machine, so I guess the boot partition (/dev/sdb1) is somehow partitioned out of the 4 drives.

(/dev/sda is an external hard-drive for backups - but this isn't the problem. The backup is still being generated on the /backup partition when we arrive in the morning. The copy to the USB attached drive hasn't started)

[root@fedorabox backup]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_fedorabox2-LogVol00
                      9.9G  5.2G  4.3G  55% /
tmpfs                 2.0G  932K  2.0G   1% /dev/shm
/dev/sdb1             504M   56M  423M  12% /boot
/dev/mapper/vg_fedorabox2-LogVol03
                      394G  221G  153G  60% /home
/dev/mapper/vg_fedorabox2-LogVol02
                       99G   29G   65G  32% /shared
/dev/mapper/vg_fedorabox2-LogVol01
                       30G   11G   18G  37% /usr
/dev/sda2             5.5T  2.6T  3.0T  47% /mnt/root/usbbackup2
/dev/mapper/vg_fedorabox2-LogVol04
                      345G  363M  327G   1% /backup

like I said in the question, write speed is very slow:

[root@fedorabox backup]# dd if=/dev/zero of=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 40.382 s, 0.4 kB/s
[root@fedorabox backup]# dd of=/dev/null if=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 3.5087e-05 s, 467 MB/s

I can access the four drives using smartctl as /dev/sg2 through /dev/sg5. Output is listed below. I don't know what is normal reading for corrected errors here, but I note that the second and fourth drives (/dev/sg3, sg5) have listed uncorrected errors for reading and verifying.

Any advice on next steps - Are the uncorrected errors normal or worrisome? Is this the cause of the slowness, or is there something else I should be looking at?

Any advice on how to replace a drive and how to access the RAID configuration?

[root@fedorabox /]# smartctl -a /dev/sg2
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST1000NM0001     Version: PS06
Serial number: Z1N2LEDW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:20 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 33 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  71
Elements in grown defect list: 36
Vendor (Seagate) cache information
  Blocks sent to initiator = 2805494200
  Blocks received from initiator = 1072424796
  Blocks read from cache and sent to initiator = 19110177
  Number of read and write commands whose size <= segment size = 826634038
  Number of read and write commands whose size > segment size = 5264167
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 11183.37
  number of minutes until next internal SMART test = 43

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3996823525        0         0  3996823525          0        130.509           0
write:         0        0         0         0          0      62619.327           0
verify: 1594450892        0         0  1594450892          0      51866.259           0

Non-medium error count:        9

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32   11182                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg3
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3JSJV
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:44 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 21
Vendor (Seagate) cache information
  Blocks sent to initiator = 1872227385
  Blocks received from initiator = 3603107317
  Blocks read from cache and sent to initiator = 53905772
  Number of read and write commands whose size <= segment size = 1041622488
  Number of read and write commands whose size > segment size = 5288254
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.02
  number of minutes until next internal SMART test = 16

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1454822558        3         0  1454822561   1454822585       2465.838          21
write:         0        0         0         0          0      64012.923           0
verify: 2113323340      143         0  2113323483   2113323510      49057.393          17

Non-medium error count:        4

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg4
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3H8DW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:02 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  76
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  76
Elements in grown defect list: 1
Vendor (Seagate) cache information
  Blocks sent to initiator = 1437832391
  Blocks received from initiator = 3080050213
  Blocks read from cache and sent to initiator = 2689371046
  Number of read and write commands whose size <= segment size = 3306395247
  Number of read and write commands whose size > segment size = 5018225
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.17
  number of minutes until next internal SMART test = 58

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1514637706     1007         0  1514638713   1514638713    1576907.538           0
write:         0        0         0         0          0      61240.330           0
verify: 1697580124       32         0  1697580156   1697580157      48889.638           0

Non-medium error count:       27

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16      18                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg5
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3FCZ6
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:41 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 4096
Vendor (Seagate) cache information
  Blocks sent to initiator = 923606853
  Blocks received from initiator = 3074269061
  Blocks read from cache and sent to initiator = 3237322768
  Number of read and write commands whose size <= segment size = 3044372010
  Number of read and write commands whose size > segment size = 5024782
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77336.67
  number of minutes until next internal SMART test = 53

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2058067359   277563         0  2058344922   2058345511    1420772.201         555
write:         0        0         0         0          0      62186.800           0
verify: 2750944424     2205         0  2750946629   2750946631      50834.359           1

Non-medium error count:      167

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]

155

1 + 5

raid

fedora

lvm

Diagnosing **very** slow write speed on RAID / Fedora 14