I set up a NFSv4 server and a client both running Debian 11.3 on Linux 5.10.0-13. It basically works i.e., I see all files with correct permissions and can open, modify, etc. However, opening a file causes a 5 second delay.
The server exports directories like this in /etc/exports
:
/path/to gss/krb5(rw,sync,fsid=0,crossmnt,no_subtree_check)
/path/to/NFS gss/krb5(rw,sync,no_subtree_check)
The NFS4 export is mounted using this kind of /etc/fstab
entry:
server:/NFS /path/to/nfs nfs4 defaults,auto,sec=krb5,proto=tcp,port=2049,nolock 0 3
The following data refers to opening a file in emacs
and close emacs
as soon as it is displayed.
Analyzing strace
I see that the corresponding openat()
syscall takes quite stable 5.1 seconds! Other openat(), stat(), ...
syscalls pertaining to NFS execute within microseconds. This is what I see consistently when doing e.g. ls
, which doesn't feel different from the local file system. This is an excerpt of syscalls relating to the file in question prepended by the time spent in the syscall:
0.000014 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000005 readlinkat(AT_FDCWD, "/path/to/nfs/file", 0x7fff202f6880, 1024) = -1 EINVAL (Das Argument ist ungültig)
0.000006 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 14
0.000005 readlinkat(AT_FDCWD, "/path/to/nfs/file", 0x7fff202f64d0, 1024) = -1 EINVAL (Das Argument ist ungültig)
0.000005 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 14
5.108682 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC) = 14
0.000006 faccessat(AT_FDCWD, "/path/to/nfs/file", W_OK) = 0
0.000005 stat("/path/to/nfs/file", {st_mode=S_IFREG|0640, st_size=2192, ...}) = 0
0.000007 faccessat(AT_FDCWD, "/path/to/nfs/file,v", F_OK) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
0.000009 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000008 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000008 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000008 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000008 openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Ist kein Verzeichnis)
0.000012 stat("/path/to/nfs/file", {st_mode=S_IFREG|0640, st_size=2192, ...}) = 0
0.000015 write(14, "/path/to/nfs/file"..., 90) = 90
In wireshark I see the following:
No. Time Source Destination Protocol Length Info
7 0.623097 client.ip server.ip NFS 276 V4 Call (Reply In 8) GETATTR FH: 0xecf8d891
8 0.624231 server.ip client.ip NFS 376 V4 Reply (Call In 7) GETATTR
9 0.624239 client.ip server.ip TCP 72 951 → 2049 [ACK] Seq=601 Ack=917 Win=4176 Len=0 TSval=1071244404 TSecr=3950562910
10 0.624696 client.ip server.ip NFS 344 V4 Call (Reply In 11) OPEN DH: 0xecf8d891/
11 0.625669 server.ip client.ip NFS 452 V4 Reply (Call In 10) OPEN StateID: 0xb42f
12 0.625693 client.ip server.ip TCP 72 951 → 2049 [ACK] Seq=873 Ack=1297 Win=4176 Len=0 TSval=1071244405 TSecr=3950562911
13 5.742166 client.ip server.ip NFS 340 V4 Call (Reply In 14) CLOSE StateID: 0xb42f
14 5.743331 server.ip client.ip NFS 232 V4 Reply (Call In 13) SEQUENCE | PUTFH Status: NFS4ERR_STALE
15 5.743359 client.ip server.ip TCP 72 951 → 2049 [ACK] Seq=1141 Ack=1457 Win=4176 Len=0 TSval=1071249523 TSecr=3950568029
I do not know whether this NFS4ERR_STALE
condition is a hint to the issue. According to the RFC7530 it indicates that the file system object has been removed. Well, the file in the delayed openat()
definitely is not deleted. So, I'm unsure what it refers to. However, it also shows the 5.1 second delay in between 12 and 13.
I must admit that I do not quite understand what I see here. However, I see the same kind of delays with other programs as well i.e., this is no quirk of emacs
. Saving in libreoffice
can even freeze until the file is accessed by another program.
Since I've seen somewhere that there are issues with krb5p
in some environments, I reduced to krb5
, but this did not change anything.
Both client and server run gssproxy
. On the client I see debugging entries of nfsidmap
, and after setting sysctl sunrpc.rpc_debug=0xFFFF
I see the following for the emacs
scenario:
[423544.865600] RPC: gss_get_mic_v2
[423544.865628] RPC: xs_tcp_send_request(200) = 0
[423544.867049] RPC: xs_data_ready...
[423544.867309] RPC: gss_verify_mic_v2
[423545.373665] RPC: gss_get_mic_v2
[423545.373691] RPC: xs_tcp_send_request(200) = 0
[423545.374692] RPC: xs_data_ready...
[423545.374748] RPC: gss_verify_mic_v2
[423545.375009] RPC: gss_get_mic_v2
[423545.375025] RPC: xs_tcp_send_request(268) = 0
[423545.375909] RPC: xs_data_ready...
[423545.375957] RPC: gss_verify_mic_v2
[423550.467227] RPC: gss_get_mic_v2
[423550.467307] RPC: xs_tcp_send_request(216) = 0
[423550.468395] RPC: xs_data_ready...
[423550.468513] RPC: gss_verify_mic_v2
[423550.468621] RPC: gss_get_mic_v2
[423550.468646] RPC: gss_get_mic_v2
[423550.468689] RPC: xs_tcp_send_request(264) = 0
[423550.469403] RPC: xs_data_ready...
[423550.469541] RPC: gss_verify_mic_v2
[423550.469564] RPC: gss_verify_mic_v2
[423550.472794] RPC: gss_get_mic_v2
[423550.472849] RPC: xs_tcp_send_request(208) = 0
[423550.473758] RPC: xs_data_ready...
[423550.473903] RPC: gss_verify_mic_v2
[423550.474234] RPC: gss_get_mic_v2
[423550.474290] RPC: xs_tcp_send_request(200) = 0
[423550.475147] RPC: xs_data_ready...
[423550.475257] RPC: gss_verify_mic_v2
I have no idea how to exactly interpret this log, but it also clearly shows the 5 second delay, and it seems to me like all RPC calls are processed within 100 ms.
Any ideas what is going on, or at least how to further pin down the issue?