Score:0

What could prevent HDD hot-swap in Linux AHCI?

pk

I'm tearing my hair out over this issue.

I wanted to add a hot-swap bay to my home server so that I can easily add and remove HDDs, for example to rotate off-site backups. The mainboard is an ASRock J4105-ITX with four native SATA ports, split between an ASM1062 and the Intel processor's SATA controller. Both work fine and use the ahci kernel module. There is a hot-swap option in the BIOS, but it seems to have no effect.

If a drive is disconnected (either via echo 1 > /sys/block/sdX/device/delete or by rudely removing the drive), no new device is recognized after reconnecting. I've tried forcing a rescan (echo "- - -" > /sys/class/scsi_host/host<n>/scan), but to no avail: the SATA port is practically unusable until the next reboot. I also tried some more extreme commands without any luck:

echo 1 > /sys/class/scsi_device/2:0:0:0/device/reset
echo 1 > /sys/devices/pci0000:00/0000:00:1f.2/rescan
echo 1 > /sys/devices/pci0000:00/0000:00:1f.2/reset

(taken from How do I make Linux recognize a new SATA /dev/sda drive I hot swapped in without rebooting?)
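For reference, the host<n> in the rescan path can be looked up from the block device before the drive is removed. This is only a minimal sketch: the function name and the optional sysfs-root parameter are made up for illustration, and it assumes the usual layout where /sys/block/sdX is a symlink whose target contains a hostN component.

```shell
#!/bin/sh
# Print the SCSI host (e.g. "host2") that a block device hangs off.
# $1: block device name, e.g. sda (must still exist, so run this BEFORE removal)
# $2: sysfs root, defaults to /sys (hypothetical parameter, only there for testing)
scsi_host_of() {
    dev=$1
    sysfs_root=${2:-/sys}
    # Resolve the symlink and pick the "hostN" path component out of the target.
    readlink -f "$sysfs_root/block/$dev" |
        sed -n 's|.*/\(host[0-9][0-9]*\)/.*|\1|p'
}
```

Calling scsi_host_of sdX while the drive is still attached gives the name to use in /sys/class/scsi_host/host<n>/scan afterwards.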

"Alright, probably the chipset does not support hot swap or the BIOS is messed up." So I ordered two PCIe SATA controllers (one uses an ASM1064, the other the Marvell 88SE9215). Both exhibit the same issue, although other buyers state that hot-swap works for them, so I guess the problem is tied to software (my installation? I'm running Arch Linux, which is kept dutifully up to date).

Some hopefully useful information:

$ uname -a
Linux servername 5.14.14-arch1-1 #1 SMP PREEMPT Wed, 20 Oct 2021 21:35:18 +0000 x86_64 GNU/Linux

$ dmesg | grep ahci
[    0.447450] ahci 0000:00:12.0: version 3.0
[    0.447842] ahci 0000:00:12.0: SSS flag set, parallel bus scan disabled
[    0.457970] ahci 0000:00:12.0: AHCI 0001.0301 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
[    0.457981] ahci 0000:00:12.0: flags: 64bit ncq sntf stag pm clo only pmp pio slum part sxs deso sadm sds apst 
[    0.458750] scsi host0: ahci
[    0.459204] scsi host1: ahci
[    0.469788] ahci 0000:01:00.0: AHCI 0001.0000 32 slots 4 ports 6 Gbps 0xf impl SATA mode
[    0.469801] ahci 0000:01:00.0: flags: 64bit ncq sntf led only pmp fbs pio slum part sxs 
[    0.470767] scsi host2: ahci
[    0.471203] scsi host3: ahci
[    0.471562] scsi host4: ahci
[    0.471904] scsi host5: ahci
[    0.472341] ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
[    0.472376] ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
[    0.472382] ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc 
[    0.472803] scsi host6: ahci
[    0.473011] scsi host7: ahci

$ lspci -v
[...]
01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11) (prog-if 01 [AHCI 1.0])
    Subsystem: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
    Flags: bus master, fast devsel, latency 0, IRQ 127
    I/O ports at e050 [size=8]
    I/O ports at e040 [size=4]
    I/O ports at e030 [size=8]
    I/O ports at e020 [size=4]
    I/O ports at e000 [size=32]
    Memory at a1340000 (32-bit, non-prefetchable) [size=2K]
    Expansion ROM at a1300000 [disabled] [size=256K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [70] Express Legacy Endpoint, MSI 00
    Capabilities: [e0] SATA HBA v0.0
    Capabilities: [100] Advanced Error Reporting
    Kernel driver in use: ahci
[...]
Score:0
pk

I finally found the reason: My powertop-tuning was too aggressive!

Because this server is running 24/7 and electricity is kinda expensive around here, I added a systemd service to automatically tune all powertop options:

$ cat /etc/systemd/system/powertop.service
[Unit]
Description=Powertop tunings

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/powertop --auto-tune

[Install]
WantedBy=multi-user.target

This is the same as opening the powertop TUI and setting all options to 'Good'. The crucial bits are the four lines about Runtime PM for port ataX:

   Good          Runtime PM for port ata3 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Bad           Runtime PM for port ata4 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Good          Runtime PM for port ata5 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
>> Good          Runtime PM for port ata6 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Good          Runtime PM for PCI Device Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller

Setting one of these lines to 'Good' executes echo 'auto' > '/sys/bus/pci/devices/0000:01:00.0/ata4/power/control', which apparently causes the SATA card to never recognize new devices on that port!
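To see which ports are currently affected without opening powertop, the power/control files can be read directly. A minimal sketch, assuming the ataN directories sit directly below the PCI device as in the path above; the function name and the optional sysfs-root parameter are made up for illustration.

```shell
#!/bin/sh
# Print "<port> <setting>" for every ataN port below one PCI SATA controller.
# $1: PCI address, e.g. 0000:01:00.0 (the Marvell card in the lspci output)
# $2: sysfs root, defaults to /sys (hypothetical parameter, only there for testing)
show_ata_runtime_pm() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    for ctl in "$sysfs_root/bus/pci/devices/$pci_addr"/ata*/power/control; do
        # Skips the unexpanded glob when the controller has no ataN entries.
        [ -r "$ctl" ] || continue
        port=${ctl%/power/control}   # strip the trailing file path
        port=${port##*/}             # keep only "ataN"
        printf '%s %s\n' "$port" "$(cat "$ctl")"
    done
}
```

A port that prints "auto" corresponds to powertop's 'Good' setting, and "on" corresponds to 'Bad'.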

Only after setting power/control to on (the 'Bad' option according to powertop) will the card find new devices after executing echo 0 0 0 | sudo tee /sys/class/scsi_host/host*/scan
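The two steps can be wrapped in a small script. This is a sketch under assumptions, not a tested fix for every controller: the function names and the optional sysfs-root parameter are made up for illustration, and it writes the wildcard "- - -" to the scan files (as in the question) instead of "0 0 0". It needs root when run against the real /sys.

```shell
#!/bin/sh
# Set runtime PM to "on" (powertop's 'Bad') for every ataN port of one controller.
# $1: PCI address, e.g. 0000:01:00.0
# $2: sysfs root, defaults to /sys (hypothetical parameter, only there for testing)
disable_ata_runtime_pm() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    for ctl in "$sysfs_root/bus/pci/devices/$pci_addr"/ata*/power/control; do
        [ -w "$ctl" ] || continue
        echo on > "$ctl"
    done
}

# Ask every SCSI host to rescan all channels, targets and LUNs.
# $1: sysfs root, defaults to /sys (same hypothetical parameter as above)
rescan_scsi_hosts() {
    sysfs_root=${1:-/sys}
    for scan in "$sysfs_root"/class/scsi_host/host*/scan; do
        [ -w "$scan" ] || continue
        echo '- - -' > "$scan"
    done
}
```

One option, untested here, is to call disable_ata_runtime_pm 0000:01:00.0 right after powertop --auto-tune (for example from an ExecStartPost= line in the unit above), so the remaining tunings stay in place while the hot-swap ports keep working.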

The only thing I'm still missing is automatic rescanning: my desktop PC finds new devices on its own, without any write to hostX/scan. I can kinda live with this for now. This has been an extremely frustrating experience, so I hope this helps somebody facing the same issue.
