We have an Ubuntu 20.04 host using ZFS and the sharenfs option:
root@host:~# zfs get sharenfs pool/enc/esxi
NAME PROPERTY VALUE SOURCE
pool/enc/esxi sharenfs rw=x.x.x.x,no_subtree_check,async,anonuid=0,anongid=0,all_squash local
root@host:~# exportfs -v | grep esxi
/pool/enc/esxi x.x.x.x(rw,async,wdelay,root_squash,all_squash,no_subtree_check,mountpoint,anonuid=0,anongid=0,sec=sys,rw,secure,root_squash,all_squash)
When attempting to create a new VM from an OVA, the operation fails.
The web UI says "Failed to deploy VM: postNFCData failed: ". It never starts uploading the disk; it seems to fail at the creation stage.
vmkernel.log says:
2022-05-04T09:33:29.859Z cpu7:1051648 opID=85e2477a)NFS41: NFS41_VSIMountSet:405: Mount server: nfshost, port: 2049, path: /pool/enc/esxi, label: NFS, security: 1 user: , options: <none>
2022-05-04T09:33:29.859Z cpu7:1051648 opID=85e2477a)StorageApdHandler: 966: APD Handle Created with lock[StorageApd-0x4313e6003970]
2022-05-04T09:33:29.859Z cpu7:1051648 opID=85e2477a)NFS41: NFS41_ConnectionLookup:804: Created new connection for address tcp nfshost.8.1
2022-05-04T09:33:29.860Z cpu10:1049211)NFS41: NFS41ProcessExidResult:2314: clientid 4f2a53628e14edb1 roles 0x20000
2022-05-04T09:33:29.860Z cpu10:1049213)NFS41: NFS41ProcessSessionUp:2380: Cluster 0x4313e6004a40[2] clidValid:0 clusterAPDState:0 received clientID 4f2a53628e14edb1
2022-05-04T09:33:29.860Z cpu10:1049213)NFS41: NFS41ProcessSessionUp:2393: Cluster 0x4313e6004a40[2] set with new valid clientID 4f2a53628e14edb1
2022-05-04T09:33:29.860Z cpu10:1049213)NFS41: NFS41ProcessClusterProbeResult:4186: Reclaiming state, cluster 0x4313e6004a40 [2]
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSCompleteMount:3966: Lease time: 90
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSCompleteMount:3967: Max read xfer size: 0x3fc00
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSCompleteMount:3968: Max write xfer size: 0x3fc00
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSCompleteMount:3969: Max file size: 0x7fffffffffffffff
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSCompleteMount:3970: Max file name: 255
2022-05-04T09:33:29.872Z cpu7:1051648 opID=85e2477a)WARNING: NFS41: NFS41FSCompleteMount:3975: The max file name size (255) of file system is larger than that of FSS (128)
2022-05-04T09:33:29.873Z cpu7:1051648 opID=85e2477a)NFS41: NFS41FSAPDNotify:6188: Restored connection to the server nfshost mount point NFS, mounted as 507f1811-40137e33-0000-000000000000 ("/pool/enc/esxi")
2022-05-04T09:33:29.873Z cpu7:1051648 opID=85e2477a)NFS41: NFS41_VSIMountSet:417: NFS mounted successfully
2022-05-04T09:35:05.436Z cpu3:1048746)StorageDevice: 7059: End path evaluation for device t10.NVMe____WDC_CL_SN720_XXXXXXXXXXXXXXXXX__________XXXXXX448XXXXXXX
2022-05-04T09:35:05.437Z cpu3:1048746)StorageDevice: 7059: End path evaluation for device t10.NVMe____WDC_CL_SN720_XXXXXXXXXXXXXXXXX__________XXXXXX448XXXXXXX
2022-05-04T09:38:18.355Z cpu3:1051646 opID=a65fad89)World: 12075: VC opID esxui-8e02-4c35 maps to vmkernel opID a65fad89
2022-05-04T09:38:18.355Z cpu3:1051646 opID=a65fad89)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4305bc5cad10 failed: Stale file handle
2022-05-04T09:38:18.355Z cpu3:1051646 opID=a65fad89)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
2022-05-04T09:38:18.411Z cpu3:1051646 opID=a65fad89)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4305bc5aef70 failed: Stale file handle
2022-05-04T09:38:18.411Z cpu3:1051646 opID=a65fad89)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
2022-05-04T09:38:19.909Z cpu1:1054212 opID=6d39243b)World: 12075: VC opID esxui-e417-4c55 maps to vmkernel opID 6d39243b
2022-05-04T09:38:19.909Z cpu1:1054212 opID=6d39243b)VmMemXfer: vm 1054212: 2465: Evicting VM with path:/vmfs/volumes/507f1811-40137e33-0000-000000000000/x/x.vmx
2022-05-04T09:38:19.909Z cpu1:1054212 opID=6d39243b)VmMemXfer: 209: Creating crypto hash
2022-05-04T09:38:19.909Z cpu1:1054212 opID=6d39243b)VmMemXfer: vm 1054212: 2479: Could not find MemXferFS region for /vmfs/volumes/507f1811-40137e33-0000-000000000000/x/x.vmx
2022-05-04T09:38:19.929Z cpu1:1054212 opID=6d39243b)VmMemXfer: vm 1054212: 2465: Evicting VM with path:/vmfs/volumes/507f1811-40137e33-0000-000000000000/x/x.vmx
2022-05-04T09:38:19.929Z cpu1:1054212 opID=6d39243b)VmMemXfer: 209: Creating crypto hash
2022-05-04T09:38:19.930Z cpu1:1054212 opID=6d39243b)VmMemXfer: vm 1054212: 2479: Could not find MemXferFS region for /vmfs/volumes/507f1811-40137e33-0000-000000000000/x/x.vmx
Everything else works; the system has been running multiple VMs on NFS with no issues for a while now. We can work around the OVA breakage by provisioning to the local non-NFS datastore and then copying the resulting VM to the NFS datastore, where it boots without an issue.
I would still like to find the root cause.
So far I've tried the following, rebooting the ESXi host after each change:
- setting the NFS share to sync instead of async
- setting the NFS share to no_wdelay instead of wdelay
- a combination of the above
None fixed the issue.
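For reference, the three variations above can be sketched as the following sharenfs values. This is only an illustration: the base string is copied from the zfs get output earlier, the helper names with_sync and with_no_wdelay are made up for this sketch, and the script only prints the zfs set commands rather than running them.

```shell
#!/bin/sh
# Base value as reported by `zfs get sharenfs pool/enc/esxi` (client address redacted).
BASE='rw=x.x.x.x,no_subtree_check,async,anonuid=0,anongid=0,all_squash'

# Hypothetical helper: swap async for sync (assumes async sits between two commas, as it does in BASE).
with_sync() { printf '%s\n' "$1" | sed 's/,async,/,sync,/'; }

# Hypothetical helper: append no_wdelay to an option string.
with_no_wdelay() { printf '%s\n' "$1,no_wdelay"; }

SYNC_OPTS=$(with_sync "$BASE")
NOWDELAY_OPTS=$(with_no_wdelay "$BASE")
BOTH_OPTS=$(with_no_wdelay "$SYNC_OPTS")

# Dry run: print the commands for each variation instead of executing them.
for opts in "$SYNC_OPTS" "$NOWDELAY_OPTS" "$BOTH_OPTS"; do
    echo "zfs set sharenfs='$opts' pool/enc/esxi"
done
```

Note that, according to exports(5), no_wdelay has no effect while async is also set, so the second variation on its own would not be expected to change server behaviour.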
I then removed the NFS datastore, re-added it with NFS v3 selected instead, and tried provisioning an OVA again. It worked: the upload completed after a short wait and the new VM booted fine too!
I rebooted the ESXi host to verify it wasn't a fluke, and OVA provisioning still worked.
I then removed the NFS datastore and re-added it with v4 selected, as before, and the issue came back.
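The datastore switches described above can also be done from the ESXi shell rather than the web UI. The sketch below is an approximation under assumptions: the server name, share path and label are taken from the vmkernel.log mount line, the function names remount and unmount are made up here, and the script prints the esxcli commands by default instead of executing them.

```shell
#!/bin/sh
# Dry run by default: commands are echoed. Export RUN as an empty string to execute them on the host.
RUN=${RUN-echo}

NFS_HOST=nfshost        # server name as it appears in vmkernel.log
SHARE=/pool/enc/esxi    # exported path
LABEL=NFS               # datastore label

# Hypothetical helper: add the datastore. $1 is "nfs" (NFS v3) or "nfs41" (NFS v4.1).
remount() {
    case "$1" in
        nfs)   $RUN esxcli storage nfs add --host="$NFS_HOST" --share="$SHARE" --volume-name="$LABEL" ;;
        nfs41) $RUN esxcli storage nfs41 add --hosts="$NFS_HOST" --share="$SHARE" --volume-name="$LABEL" ;;
        *)     echo "unknown namespace: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: remove the datastore from the given namespace.
unmount() { $RUN esxcli storage "$1" remove --volume-name="$LABEL"; }

# Switch from NFS v4.1 to NFS v3; reverse the two namespaces to switch back.
unmount nfs41
remount nfs
```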
So OVA provisioning works on NFSv3 but not on NFSv4, for reasons I haven't been able to identify.
How do I get OVA provisioning working on NFSv4 ESXi datastores as it does on NFSv3?