I have a customer that has an Ubuntu 20.04 VM running inside a QNAP NAS server, providing information management functionality to Windows 10 clients. The VM is joined to their domain but the QNAP server is not. The NAS server is only used as an easy place to run the VM. The VM provides much more functionality than the NAS device can for this client, the details of which are a bit involved and would require more space than this forum really allows. (I would be happy to discuss this with anyone on a side channel...)
This configuration had been running successfully for months without issues until last Friday. These NAS servers have been targeted by a ransomware gang, so QNAP decided to force an upgrade of their NAS OS. Probably not the best approach, but that's what happened. I then remoted into their system and updated the Ubuntu VM. Then the software that runs on top of it was also updated. In hindsight, it was probably not a good idea to update everything at once, but that worked here in our lab. However, that's when the fun started...
The software that runs in the VM has Windows client software that is upgraded via a Samba share that is mapped to a drive letter on the Windows 10 systems. The .exe file that was downloaded had some strange errors and would not run. The file appeared corrupted. I manually copied the file again from the CIFS share on the VM and it failed again, but with a different error. Each time I copied the file, the size was correct but it would fail in a different manner. I then uploaded a copy directly to the Windows machine, bypassing the VM, and all was well. As you can see, there are a lot of places to point fingers, but here is what I've done in an attempt to isolate the problem:
- Did some copy testing with the following results:
- File uploads of just about any size (a few bytes to 1GB) all uploaded to the VM correctly.
- File downloads behaved differently based upon their size. Small files of a few bytes downloaded correctly. Files about the size of the executable (~700k) were corrupted. Files larger than that (a few MB) gave Windows error 0x8007003B, "Unexpected Network Error". Googling that seems to indicate a Windows firewall issue (the firewall is turned off) or an antivirus getting in the way (it was removed). Still getting the error.
- Looked in the logs on the VM and the Windows machine and didn't see any issues.
- Ran a long ping session to see if there might be network congestion. No dropped packets after many hours of running. The problems persisted over the weekend when everyone was gone.
- Did a WinSCP transfer from the VM to the Windows machine, bypassing Samba. Still has the error. That would indicate it is not a Samba issue but a lower-level networking issue on the VM.
- Fired up a share on the NAS device directly. That works perfectly. That would seem to eliminate external networking problems such as bad cables or misconfigured networking devices. It does not, however, eliminate the internal network bridge within the NAS device used by the VM. Looked to see if there was some configuration issue there but found none.
- Made sure jumbo frames were not enabled. My experience with misconfigured jumbo frames is dropped packets, not corruption or errors like this.
- Ran smbstatus on the VM to ensure everyone is running SMB3. They are. They are not running encryption but are using signing with AES-128-CMAC. (NOTE: 52 Windows machines have the share mapped to a drive letter.)
- Made sure Samba was running on TCP port 445. It was.
- I have two QNAP devices here in my lab that are not experiencing these problems. About the only difference that I can see is that their VMs are not on a domain. I have a hard time believing the VM being on the domain could cause data corruption, but I think we have all seen weirder issues. Today's project is to get them configured on my test domain here and see what happens.
- The IT support person at this customer is exceptional. I've worked with him over the years with great success. There are no other reported networking issues at their site.
- He is willing to help here, including firing up Wireshark to see if we can tell what's happening.
- We are both stumped...
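Since the corrupted downloads have the correct size but fail differently each time, it may help to know exactly where the bytes differ rather than just that they do. Below is a minimal sketch of how that comparison could be done, assuming a known-good copy and a downloaded copy are both available on the same machine. The function names and the 4 KiB block size are my own choices for illustration, not anything from the software involved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_blocks(good_path, bad_path, block_size=4096):
    """Return the starting byte offset of every block that differs.

    block_size is an arbitrary choice (assumption); a smaller value
    gives finer resolution at the cost of a longer list.
    """
    offsets = []
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        offset = 0
        while True:
            a = good.read(block_size)
            b = bad.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                offsets.append(offset)
            offset += block_size
    return offsets


def report(good_path, bad_path):
    """Print both digests and the first few differing block offsets."""
    print("known good:", sha256_of(good_path))
    print("downloaded:", sha256_of(bad_path))
    bad_offsets = differing_blocks(good_path, bad_path)
    print(f"{len(bad_offsets)} differing blocks")
    for off in bad_offsets[:20]:
        print(f"  offset {off} (0x{off:x})")
```

Calling `report("setup_good.exe", "setup_downloaded.exe")` (both file names are hypothetical) after each copy would show whether the damage always lands at the same offsets or moves around between attempts, which might help narrow down where it is introduced.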
Here are a few things that I've considered trying, but I thought I would ask for advice first since these are a bit time-consuming...
- Get a VM connected to my test domain in my lab.
- Run Wireshark on their network. See if we can get a look at what is happening on the wire.
- See if we can find an additional Linux machine to talk to the VM in an attempt to eliminate the Windows client part of the configuration.
- Create an additional VM in the NAS device but not update it to the latest Ubuntu updates. See if that can help isolate some sort of upgrade issue.
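For any of these tests, a file with predictable contents might make the results easier to read than a real executable. Here is a rough sketch, assuming it is acceptable to place test files on the share: every 16-byte record stores its own offset, so a damaged region can be located without needing the original for comparison, and the record is easy to spot in a packet capture. The `pattern_<size>.bin` names, the `PATTERN!` marker, and the default sizes (chosen to mirror the "few bytes", "~700k", and "few MB" cases above) are all made up for illustration.

```python
import os
import struct

RECORD = 16  # bytes per record: 8-byte offset + 8-byte marker


def make_pattern(size):
    """Build `size` bytes where each 16-byte record encodes its own offset."""
    records = []
    for offset in range(0, size, RECORD):
        # Big-endian 64-bit offset followed by a fixed 8-byte marker.
        records.append(struct.pack(">Q8s", offset, b"PATTERN!"))
    # Trim the last record if size is not a multiple of RECORD.
    return b"".join(records)[:size]


def bad_records(data):
    """Return offsets of records whose bytes do not match their position."""
    expected = make_pattern(len(data))
    return [
        off
        for off in range(0, len(data), RECORD)
        if data[off:off + RECORD] != expected[off:off + RECORD]
    ]


def write_test_files(directory, sizes=(100, 700 * 1024, 5 * 1024 * 1024)):
    """Write one pattern file per size and return the list of paths."""
    paths = []
    for size in sizes:
        path = os.path.join(directory, f"pattern_{size}.bin")
        with open(path, "wb") as fh:
            fh.write(make_pattern(size))
        paths.append(path)
    return paths
```

The idea would be to run `write_test_files` on the VM, pull the files down from each kind of client, and then run `bad_records(open(path, "rb").read())` on the downloaded copies. Adding more sizes between a few bytes and 700k could show roughly where downloads start to go wrong.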
Can anyone offer other things to look for or do?
Thanks in advance!
Bruce