This one has proven to be a tough nut for me to crack. Too tough. I think this could be a nice challenge for the gurus out there.
We have a physical server in a datacenter somewhere. It has a single NIC with a public IP. It runs Proxmox and has a few VMs on it, some of which are running some services with a web server front-end.
The VMs are segregated into various subnets via bridges (vmbr0, vmbr1 and so on). The main NIC is bridged to vmbr0, which has the public IP.
I've set up IP forwarding and masquerading (gonna change that to SNAT after everything I've learned recently troubleshooting this). All VMs are online and almost everything is working perfectly.
The host has Nginx Proxy Manager installed on it, which forwards requests to the right internal IP and wraps them in SSL. That part works great.
Now one of these web applications, running on one of the Windows Server 2019 machines, comes with a client app for receiving alerts and notifications (this is an RTLS system, for those who know what that is). The app runs on client machines (I'm currently testing on a laptop at home) and connects to the server's port 80. To bypass the reverse proxy I've forwarded port 8081 on the host to port 80 on the VM's internal IP, and the app connects, reporting all servers connected with a green light.
Whenever an alert comes in, the app gets the alert with some metadata, but it is then supposed to download a live map of the event along with some additional information. At that point it hangs for a while and eventually spits out timeout messages.
I thought this was because of the network setup, so I tried connecting the client laptop directly to the VM over a WireGuard VPN. Then it works perfectly, all but proving this is a network issue.
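To reproduce the stall outside the app, I could try a small probe like the one below. It's only a sketch: `probe_http` is my own helper name, it speaks plain HTTP/1.0, and the path would have to be replaced with whatever large resource the server actually serves (I don't know the exact URL the app fetches).

```python
import socket
import time


def probe_http(host, port, path="/", timeout=5.0):
    """Fetch `path` over plain HTTP and report how far the transfer got.

    Returns (bytes_received, completed, seconds). `completed` is False when
    no data arrived for `timeout` seconds, i.e. the transfer stalled.
    Connection errors (refused, connect timeout) are raised to the caller.
    """
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")
    received = 0
    completed = False
    start = time.monotonic()
    # The timeout passed here also applies to every later recv() call.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        try:
            while True:
                chunk = sock.recv(65536)
                if not chunk:  # server closed the connection: transfer done
                    completed = True
                    break
                received += len(chunk)
        except socket.timeout:
            pass  # stalled mid-transfer; keep the byte count so far
    return received, completed, time.monotonic() - start
```

Running it once against the public IP on port 8081 and once against the VM's address over the VPN, with the same path, should show whether large responses stall only on the forwarded path and roughly how many bytes arrive before they do.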
I've run tcpdump on the client laptop, the host and the VM, trying to locate where it goes wrong, but I can't really see anything wrong. It's all TCP, btw. The closest thing to an issue I've seen is that I sometimes see packets with the "F" (FIN) flag set that use a source (return) port different from the usual one. Whenever a packet leaves the client with the usual port, it gets an ACK and back and forth it goes. The packet with the different port and the F flag never seems to get a response, so the client sends it again and again without ever getting a reply. I can see that this packet usually acknowledges the same seq number over and over.
Are packets flagged with F supposed to get an ACK in the first place? I don't know if this is the actual problem or not. All firewalls are currently off, btw, for testing.
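To stop eyeballing the captures, something like this could pick out the flows that keep sending FIN without ever getting anything back. It's a rough sketch that assumes the default one-line text output of `tcpdump -nn` for IPv4 TCP. The function name and the sample lines are made up for illustration (the addresses are documentation ranges, not my real ones).

```python
import re
from collections import defaultdict

# Expected line shape (tcpdump -nn, IPv4 TCP):
# 12:00:01.000000 IP 198.51.100.7.51300 > 203.0.113.5.8081: Flags [F.], seq 10, ...
LINE_RE = re.compile(
    r"IP (?P<src>\S+)\.(?P<sport>\d+) > (?P<dst>\S+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]*)\]"
)


def unanswered_fins(lines):
    """Return {flow: fin_count} for flows that sent FIN and got no reply.

    A flow is (src, sport, dst, dport). A "reply" is any later packet seen
    in the reverse direction of that flow.
    """
    fin_counts = defaultdict(int)
    answered = set()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an IPv4 TCP line in the expected format
        flow = (m["src"], int(m["sport"]), m["dst"], int(m["dport"]))
        reverse = (flow[2], flow[3], flow[0], flow[1])
        if reverse in fin_counts:
            answered.add(reverse)  # the peer sent something after the FIN
        if "F" in m["flags"]:
            fin_counts[flow] += 1
    return {f: n for f, n in fin_counts.items() if f not in answered}


# Hypothetical capture: one healthy exchange, then a FIN from another
# source port that is retransmitted three times with no reply.
sample = [
    "12:00:00.000000 IP 198.51.100.7.51234 > 203.0.113.5.8081: Flags [P.], seq 1:100, ack 1, win 502, length 99",
    "12:00:00.010000 IP 203.0.113.5.8081 > 198.51.100.7.51234: Flags [.], ack 100, win 510, length 0",
    "12:00:01.000000 IP 198.51.100.7.51300 > 203.0.113.5.8081: Flags [F.], seq 10, ack 20, win 502, length 0",
    "12:00:01.200000 IP 198.51.100.7.51300 > 203.0.113.5.8081: Flags [F.], seq 10, ack 20, win 502, length 0",
    "12:00:01.600000 IP 198.51.100.7.51300 > 203.0.113.5.8081: Flags [F.], seq 10, ack 20, win 502, length 0",
]

for flow, count in unanswered_fins(sample).items():
    print(flow, "sent FIN", count, "time(s) with no reply")
```

Feeding it the text capture from each of the three machines (client, host, VM) might show at which hop the FIN on the odd port stops getting an answer.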
Some other weird things that may or may not help illuminate the issue: If I'm connected via the VPN but tell the app to connect via the public IP and the forwarded port, I can see in tcpdump that it does as it's told, but whenever it receives an alert it starts sending traffic through the VPN! And then it works... For some reason tcpdump won't let me see inside the wg interface, so I can't see which ports etc. are being used there.
Another thing is that the app seems to do either a reverse IP lookup or a hostname check, because when I connect via the public IP and hover the mouse over the server icon, it displays the datacenter's domain name for the server (ns-2412dfu.eu.blabla), and when I'm connected via VPN it shows the actual hostname of the VM.
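To check the reverse-lookup theory, I could ask the resolver on the laptop what name it returns for each address and compare that with what the app displays. A minimal sketch, assuming the app uses the ordinary system resolver (which I don't know for sure). `reverse_name` is just my own helper name.

```python
import socket


def reverse_name(ip):
    """Return the name the system resolver gives for `ip`, or None.

    This goes through the normal resolver path, so a hosts-file entry
    can answer as well as a DNS PTR record.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return hostname
```

If calling it with the public IP returns the datacenter's name, and calling it with the VM's VPN address returns the VM's hostname, that would match what the app shows in each case.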
I've been reading up on iptables, trying to find anything I might try. I thought maybe connection tracking, but that only seems to apply when iptables is being used as a firewall (i.e. to allow connections that would otherwise get blocked). Or maybe packet fragmentation, but connection tracking seemingly handles that automatically.
How do I figure out what's going wrong here?