We have a Dell host running several VMs that are all located on a Dell PowerVault. Users have complained of intermittent slowdowns (e.g. a long time to ls a folder, or a long wait for a response after logging in). They ALWAYS blame the network, but these VMs are connected to the host via SAS cables and to a core switch over a 10Gb network. I have seen the slowdowns myself but found no smoking gun and wasn't sure how to start tracking this down. I know I could run top, which I did once while a slowdown was being complained about, and saw four different rsyncs running against a data drive mounted from another VM. They said those shouldn't slow anything down, yet oddly, when the rsync processes were killed, things moved faster.
I set these VMs and the PowerVault pools/volumes up for them, and while I am very new to VMware, vCenter, PowerVaults, etc., I have done two such setups, and the second one runs fine for another group.
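Next time a slowdown gets reported, my rough plan is to log per-process I/O while it is happening so there is something concrete to point at instead of "it feels slow". A sketch of what I would run on the data VM (assuming iotop and sysstat are installed; the log path is just an example):
# only processes actually doing I/O, accumulated totals, batch mode so it can be logged
iotop -oPab -n 12 -d 5 > /tmp/slowdown-io.log
# device-level latency and utilization; watch the await and %util columns
iostat -xz 5 12 >> /tmp/slowdown-io.log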
So some information:
The slow data vm mounts its large drive with this (mount command):
/dev/sdb1 on /data type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
in the fstab:
/dev/sdb1 /data ntfs defaults 0 0
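For completeness, these are the commands I figured I could run on the data VM to double-check how that drive is actually formatted and mounted (just a sketch; I have not pasted the output here):
lsblk -f /dev/sdb        # filesystem type and label on the disk/partition
blkid /dev/sdb1          # confirm the partition really is NTFS
findmnt /data            # effective mount source, type, and options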
We have a hub VM that mounts the data VM's data drive over NFS to get a large amount of space to write to:
mount command:
10.25.x.xxx:/data on /data type nfs4 (rw,noatime,nodiratime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.25.y.yyy,local_lock=none,addr=10.25.x.xxx)
the fstab:
10.25.x.xxx:/data /data nfs4 rw,nodiratime,noatime,rs
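That fstab line got cut off in my copy/paste, but the options the NFS client actually negotiated can be pulled with nfsstat; a sketch of what I would run on each hub VM so the two setups can be compared:
# effective NFS mount options (rsize, wsize, vers, proto, timeo) as negotiated
nfsstat -m
# per-operation NFS client counters; re-run during a slowdown and compare
nfsstat -c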
On the data VM that has the drive directly mounted, an iotop snapshot:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
735 be/4 root 0.00 B/s 308.45 K/s 0.00 % 0.21 % mount.ntfs /dev/sdb1 /data -o rw
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % systemd --switched-root --system --deserialize 22
I wasn't sure what mount.ntfs was at first, but it seems to take up a lot of resources.
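From a bit of searching, it looks like mount.ntfs is the userspace ntfs-3g (FUSE) driver, which would match the fuseblk type in the mount output above, so every read/write to /data funnels through that one process. To see how hard it is working during a slowdown I figured I could run something like this on the data VM (interval and count are arbitrary):
# CPU and I/O usage of the ntfs-3g process, sampled every 5 seconds
pidstat -u -d -p $(pgrep -of mount.ntfs) 5 12
If that process sits near 100% CPU while the rsyncs run, that would point at the NTFS layer rather than the network.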
I tried ioping from the hub VM that accesses the data VM (and thus the data drive), plus dd, to get more information:
(system) [root@hubVM ioping-1.0]# ./ioping -c 10 /data
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=1 time=177.5 us (warmup)
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=2 time=214.0 us
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=3 time=140.7 us
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=4 time=3.39 ms
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=5 time=140.7 us
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=6 time=187.5 us
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=7 time=184.0 us (fast)
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=8 time=154.6 us (fast)
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=9 time=111.8 us (fast)
4 KiB <<< /data (nfs4 10.25.x.xxx:/data): request=10 time=288.4 us
--- /data (nfs4 10.25.x.xxx:/data) ioping statistics ---
9 requests completed in 4.81 ms, 36 KiB read, 1.87 k iops, 7.31 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 111.8 us / 534.2 us / 3.39 ms / 1.01 ms
Also some dd data from the hub VM to the data VM:
time dd if=/dev/zero of=/data/testfile2 bs=16k count=128k
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB) copied, 35.3667 s, 60.7 MB/s
real 0m35.538s
user 0m0.015s
sys 0m1.139s
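I realize the dd above writes zeros through the page cache with no sync, so the number may not be entirely trustworthy; next round I was going to repeat it on both hub VMs like this (test file names are arbitrary):
# force data to storage before dd reports, so the cache doesn't inflate the result
dd if=/dev/zero of=/data/ddtest.bin bs=1M count=2048 conv=fdatasync
# bypass the client page cache entirely (O_DIRECT) for comparison
dd if=/dev/zero of=/data/ddtest2.bin bs=1M count=2048 oflag=direct
rm -f /data/ddtest.bin /data/ddtest2.bin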
I am not sure what all this is telling me.
I have another VM host on its own separate network, set up the same way, doing its thing for another group, and these are the numbers I see from that one (no complaints about speed from its users).
First, how the large drive is mounted from the second group's hub VM:
10.50.x.xxx:/data/archive /archive nfs4 rw,nodiratime,noatime,rsize=8192,wsize=8192,bg,intr,tcp 0 0
Then the ioping data from the second group's hub VM to its data VM:
[root@hubvm ioping-1.0]# ./ioping -c 10 /archive
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=1 time=277.9 us (warmup)
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=2 time=359.7 us
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=3 time=269.3 us
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=4 time=349.6 us
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=5 time=327.5 us
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=6 time=301.6 us
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=7 time=263.7 us (fast)
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=8 time=267.9 us (fast)
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=9 time=259.9 us (fast)
4 KiB <<< /archive (nfs4 10.50.x.xxx:/data/archive): request=10 time=408.5 us (slow)
--- /archive (nfs4 10.50.x.xxx:/data/archive) ioping statistics ---
9 requests completed in 2.81 ms, 36 KiB read, 3.21 k iops, 12.5 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 259.9 us / 312.0 us / 408.5 us / 49.6 us
dd data
time dd if=/dev/zero of=/archive/testfile2 bs=16k count=128k
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB) copied, 4.79677 s, 448 MB/s
real 0m4.798s
user 0m0.012s
sys 0m1.120s
Right off the bat, dd shows 448 MB/s for the host whose VMs run fast, versus 60.7 MB/s for the host that is slow, lags, and times out on writes. I am not sure how to track this down any further, i.e. what is actually causing 60 MB/s vs 448 MB/s.
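My thought for narrowing it down was to test each layer of the slow setup separately, roughly like this (file names are arbitrary, and I'm assuming the data VM's root disk is a native Linux filesystem on the same PowerVault):
# 1. on the data VM itself: write directly to the NTFS-backed /data,
#    taking NFS and the network out of the picture
dd if=/dev/zero of=/data/local-test.bin bs=1M count=2048 conv=fdatasync
# 2. on the data VM: write to the native root filesystem to separate
#    ntfs-3g overhead from the storage back end
dd if=/dev/zero of=/root/local-test.bin bs=1M count=2048 conv=fdatasync
# 3. from the hub VM over NFS (the test already shown above)
If step 1 is already slow, the problem is below NFS (ntfs-3g or the volume/back end); if step 1 is fast and only step 3 is slow, that points at NFS or the network.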
Not sure where to look next. Network configuration? (It should be the same on both vCenters: a 10Gb network connected to each host, carrying the specific internal subnets we use.)
Some configuration I missed in our vCenter? On the Cisco core switch? Inside the VM?
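And if it does come down to the network, I assume I could rule the 10Gb path in or out with iperf3 between the hub VM and the data VM (assuming iperf3 can be installed on both; the interface name below is a guess, check with ip link):
# on the data VM
iperf3 -s
# on the hub VM (10.25.x.xxx is the data VM, as in the mounts above)
iperf3 -c 10.25.x.xxx -t 30
iperf3 -c 10.25.x.xxx -t 30 -R   # reverse direction
# negotiated link speed inside the VM; interface name may differ
ethtool ens192 | grep -i speed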