Score:2

NFSv4.1 mount is extremely slow until remount


I have an issue that I don't know how to debug. I hope you can help me further with this.

In my group, I administer a Linux compute cluster that consists of multiple compute machines and a Synology NAS server. Since the user homes need to be accessible on all machines, we store them on the NAS and mount them via NFS upon boot. This is the entry in /etc/fstab we use for that:

X.X.X.X:/path/on/nas /home  nfs      defaults,nolock    0       0

Weirdly, ever since we enabled NFSv4.1 on the NAS, access to /home has become extremely slow on some of the compute machines. So slow, in fact, that the mount sometimes times out during boot. When it does not time out, calling ls on /home takes up to 10 seconds.

Now the really strange part is that if I manually umount /home and call mount -a afterward to mount it again, suddenly everything works well again, and the /home directory can be accessed as fast as it used to be. So it seems like the performance issue only occurs if I try to mount the directory during boot. Also, other nodes in the same network do not have this problem at all and function as expected. Since we install all nodes via PXE and FAI, they all have the same OS (Ubuntu 20.04) and configuration.
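One way to narrow this down might be to compare the options the kernel actually negotiated for the boot-time mount with those of the manual remount. The following is only a minimal sketch: the helper names nfs_mount_options and diff_options are hypothetical, and it assumes the mount appears in /proc/mounts with an fstype starting with "nfs".

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return the options of the NFS mount at mount_point as a dict, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, target, fstype, options = fields[:4]
        if target == mount_point and fstype.startswith("nfs"):
            parsed = {}
            for opt in options.split(","):
                # "vers=4.1" -> ("vers", "4.1"); flags like "rw" get an empty value
                key, _, value = opt.partition("=")
                parsed[key] = value
            return parsed
    return None


def diff_options(before, after):
    """Return {option: (before_value, after_value)} for options that differ."""
    keys = set(before) | set(after)
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


if __name__ == "__main__":
    # Run once right after boot and once after the manual remount,
    # then compare the two printed dictionaries (or feed both to diff_options).
    with open("/proc/mounts") as f:
        print(nfs_mount_options(f.read(), "/home"))
```

If the two runs report different values for options such as vers, proto, or sec, that difference would be a natural place to look next. If they are identical, the problem is more likely in the client's session state than in the mount options.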

As I said, I have no idea how to debug this problem further. So if you have any idea where to look, I am happy to provide more information.

Thanks a lot in advance!

Best, Tim

Edit: I just enabled verbose logging with rpcdebug -m nfs -s all on the client and saw that it is full of errors:

[ 4446.566627] nfs4_reset_session: session reset failed with status -10022 for server X.X.X.X!
[ 4446.566628] nfs4_handle_reclaim_lease_error: handled error -10022 for server X.X.X.X
[ 4446.566739] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4446.566739] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4446.566740] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4446.566851] <-- nfs4_proc_create_session
[ 4447.590948] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4447.590949] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4447.590949] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4447.590950] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4447.591324] <-- nfs4_proc_create_session
[ 4448.614939] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4448.614940] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4448.614941] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4448.614941] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4448.615097] <-- nfs4_proc_create_session
[ 4449.638886] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4449.638887] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4449.638887] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4449.638888] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4449.641435] <-- nfs4_proc_create_session
[ 4450.662900] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4450.662901] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4450.662901] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4450.662902] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4450.663063] <-- nfs4_proc_create_session

It seems that the first error is 10022 (NFS4ERR_STALE_CLIENTID), followed by a series of 10008 (NFS4ERR_DELAY). The second error causes the client to wait for one second and then try again (https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/1360102042-10732-74-git-send-email-herton.krzesinski@canonical.com/), which explains the high response time of the NFS mount. The meaning of these errors is not clear to me, though.
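To see how often each error occurs over a longer capture, the kernel log can be tallied by status code. This is a rough sketch, not a complete decoder: the function name summarize_errors is hypothetical, the table contains only the two codes identified above, and the regular expression assumes the "status -N" / "error -N" wording seen in the log excerpt.

```python
import re
from collections import Counter

# Only the two NFSv4 status codes discussed above; extend as needed.
NFS4_ERRORS = {
    10008: "NFS4ERR_DELAY",
    10022: "NFS4ERR_STALE_CLIENTID",
}

# Matches e.g. "failed with status -10022" or "handled error -10008".
ERROR_RE = re.compile(r"(?:status|error) -(\d+)")


def summarize_errors(log_text):
    """Count NFSv4 error names found in rpcdebug/dmesg output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[NFS4_ERRORS.get(code, f"unknown({code})")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    for name, count in summarize_errors(sys.stdin.read()).most_common():
        print(f"{count:6d}  {name}")
```

A count of NFS4ERR_DELAY that keeps growing by roughly one per second would be consistent with the one-second retry loop described above.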

wowbagger:
Perhaps run tcpdump to get an idea of what is going on.
Homan:
I just enabled verbose logging on the client side (see my edit). I know now what leads to the delays, but I am not sure why these errors occur in the first place.
Score:1

There could be so many different reasons!

Here’s a decent and recent “slow NFS” troubleshooting guide by IBM. I’d suggest giving it a shot!

https://www.ibm.com/docs/en/aix/7.2?topic=troubleshooting-causes-slow-access-times-nfs

Homan:
Thanks, I will take a look at that page. Though the weird thing is that my problem only appears if I mount the NFS share during boot. This makes me think it is probably somehow on the client side and not related to the server or network.