Score:1

iscsid broken mount after recovery

dna

I am experimenting with Open-iSCSI and have run into a problem. When the network link between my initiator and the target fails, iscsid recovers the connection, which is good. But my mount is broken and returns I/O errors until it is remounted.

Is there a clean way to remount the LUN automatically? Something like a post-recovery hook or a config setting that I somehow missed? I am trying to avoid a polling script or something of the sort :)

iscsid log:
Jan 14 08:03:45 localhost iscsid[1415]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Jan 14 08:04:22 localhost iscsid[1415]: iscsid: connect to 10.0.2.100:9999 failed (No route to host)
[...]
Jan 14 08:38:43 localhost iscsid[1415]: iscsid: connect to 10.0.2.100:9999 failed (No route to host)
Jan 14 08:38:47 localhost iscsid[1415]: iscsid: connection1:0 is operational after recovery (195 attempts)
Jan 14 08:39:52 localhost iscsid[1415]: iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jan 14 08:40:11 localhost iscsid[1415]: iscsid: connection1:0 is operational after recovery (2 attempts)

fstab:
UUID=cf3d20cd-a8cd-4a9a-acbf-1c61289a37bb /data xfs defaults,_netdev,x-systemd.requires=iscsid.service 0 0
Score:0

In short, no, there isn't a magically clean solution. The cleanest recovery is a reboot.

The problem is that when the connection is down for longer than iscsid's replacement_timeout, pending SCSI commands are failed back to the caller and the filesystem starts getting I/O errors. Unless you have a very special application, there's typically no coming back from an I/O error. It wreaks all sorts of havoc with services. You're almost always better off rebooting than trying to work out which programs stopped working and why.

Having said that, what you can do is push replacement_timeout out as far as you're comfortable with. You'll find references on the web to applications like databases that recommend an hour timeout (3600 seconds), or even longer. This turns the failure into something like a stuck NFS hard mount: I/O blocks instead of failing. If you need to design a system that rides out outages that need a human to fix them, a much longer timeout is a good thing. The system just hangs until the link comes back.

You can set the default for the entire system in /etc/iscsi/iscsid.conf. Edit this line:

node.session.timeo.replacement_timeout = 120
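
As a rough sketch of that edit, assuming the setting is present and uncommented in the file: the helper name set_replacement_timeout is made up for this example and is not part of open-iscsi, and 3600 is just the one-hour figure mentioned above. As far as I know, iscsid.conf only supplies defaults for node records created at discovery time, so targets you already discovered likely need their records updated with iscsiadm as well (shown in the comments).

```shell
#!/bin/sh
# Hypothetical helper (not part of open-iscsi): rewrite the value of
# node.session.timeo.replacement_timeout in an iscsid.conf-style file.
set_replacement_timeout() {
    conf_file=$1   # e.g. /etc/iscsi/iscsid.conf
    seconds=$2     # e.g. 3600
    # Replace only the value on the existing, uncommented setting line.
    # A backup of the original file is kept as "$conf_file.bak".
    sed -i.bak -E \
        "s/^(node\.session\.timeo\.replacement_timeout[[:space:]]*=[[:space:]]*).*/\1${seconds}/" \
        "$conf_file"
}

# Usage (as root):
#   set_replacement_timeout /etc/iscsi/iscsid.conf 3600
# Then update the records of targets that were already discovered:
#   iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 3600
```

The updated record value should apply the next time the session logs in, so plan for a logout and login (or a reboot) before relying on the longer timeout.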
