I have an issue that I don't know how to debug. I hope you can help me further with this.
In my group, I administer a Linux compute cluster that consists of multiple compute machines and a Synology NAS server. Since the user homes need to be accessible on all machines, we store them on the NAS and mount them via NFS upon boot. This is the entry in /etc/fstab we use for that:
X.X.X.X:/path/on/nas /home nfs defaults,nolock 0 0
Weirdly, ever since we enabled NFSv4.1 on the NAS, access to /home has become extremely slow on some of the compute machines. So slow, in fact, that the mount sometimes times out during the boot process. And when it does not, calling ls on /home takes up to 10 seconds.
Now the really strange part is that if I manually umount /home and call mount -a afterward to mount it again, suddenly everything works well again, and the /home directory can be accessed as fast as it used to be. So it seems like the performance issue only occurs when the directory is mounted during boot. Also, other nodes in the same network do not have this problem at all and function as expected. Since we install all nodes via PXE and FAI, they all have the same OS (Ubuntu 20.04) and configuration.
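One way to check whether the boot-time mount and the manual remount really end up with the same settings would be to compare the options the kernel reports for /home in both states. The following is only a sketch: the helper name nfs_mount_options is made up for illustration, and it assumes the usual /proc/mounts field layout (device, mount point, type, options).

```shell
# Print the options (vers=, proto=, ...) the kernel reports for an NFS mount.
# $1: mount point, $2: mounts table to read (defaults to /proc/mounts).
nfs_mount_options() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { print $4 }' "${2:-/proc/mounts}"
}

# Possible usage on an affected node (run as root):
#   nfs_mount_options /home > /tmp/home-boot.txt     # right after a slow boot
#   umount /home && mount -a
#   nfs_mount_options /home > /tmp/home-manual.txt
#   diff /tmp/home-boot.txt /tmp/home-manual.txt
```

If the two files differ, for example in the vers= value, that would point at the negotiation during boot rather than at the NAS itself.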
As I said, I have no idea how to debug this problem further. So if you have any idea where to look, I am happy to provide more information.
Thanks a lot in advance!
Best,
Tim
Edit: I just enabled verbose logging with rpcdebug -m nfs -s all on the client and saw that the kernel log is full of errors:
[ 4446.566627] nfs4_reset_session: session reset failed with status -10022 for server X.X.X.X!
[ 4446.566628] nfs4_handle_reclaim_lease_error: handled error -10022 for server X.X.X.X
[ 4446.566739] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4446.566739] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4446.566740] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4446.566851] <-- nfs4_proc_create_session
[ 4447.590948] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4447.590949] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4447.590949] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4447.590950] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4447.591324] <-- nfs4_proc_create_session
[ 4448.614939] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4448.614940] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4448.614941] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4448.614941] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4448.615097] <-- nfs4_proc_create_session
[ 4449.638886] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4449.638887] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4449.638887] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4449.638888] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4449.641435] <-- nfs4_proc_create_session
[ 4450.662900] nfs4_handle_reclaim_lease_error: handled error -10008 for server X.X.X.X
[ 4450.662901] --> nfs4_proc_create_session clp=0000000080f0363b session=0000000014a4cf1c
[ 4450.662901] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[ 4450.662902] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[ 4450.663063] <-- nfs4_proc_create_session
It seems that the first error is 10022 (NFS4ERR_STALE_CLIENTID), followed by a series of 10008 (NFS4ERR_DELAY). The second error causes the client to wait for one second and then try again (https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/1360102042-10732-74-git-send-email-herton.krzesinski@canonical.com/), which explains the high response time of the NFS mount. The meaning of these errors is not clear to me, though.
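To get an overview of how often each error occurs, the "handled error" lines can be tallied with a few standard tools. This is a rough sketch, not a definitive tool: the function name summarize_nfs4_errors is hypothetical, and it only names the two codes identified above, printing anything else as unknown.

```shell
# Count "handled error -NNNNN" lines from NFS client debug output on stdin
# and print one line per code: "<code> <name> <count>".
summarize_nfs4_errors() {
    sed -n 's/.*handled error -\([0-9][0-9]*\) for server.*/\1/p' |
    sort | uniq -c |
    while read -r count code; do
        case "$code" in
            10008) name=NFS4ERR_DELAY ;;
            10022) name=NFS4ERR_STALE_CLIENTID ;;
            *)     name=unknown ;;
        esac
        echo "$code $name $count"
    done
}

# Possible usage on the client:
#   dmesg | summarize_nfs4_errors
```

Running this on an affected node and on a healthy one should show whether the NFS4ERR_DELAY loop is specific to the slow machines.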