It seems that part of the network drops Ethernet frames larger than 1510 bytes (excluding the checksum), which makes the effective MTU 1496 bytes instead of the standard 1500 bytes. Frames that carry a full 1500-byte payload are therefore dropped.
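To make the numbers concrete, here is a minimal sketch of the arithmetic. It assumes an untagged Ethernet II header of 14 bytes and IPv4 and TCP headers of 20 bytes each with no options; the function names are only illustrative.

```python
# Header sizes assumed for this sketch (no VLAN tag, no IP or TCP options).
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType, checksum excluded
IPV4_HEADER = 20
TCP_HEADER = 20
ICMP_HEADER = 8


def mtu_from_frame_limit(max_frame_bytes):
    """Largest IP packet that fits in a frame of the given size."""
    return max_frame_bytes - ETHERNET_HEADER


def tcp_payload_for_mtu(mtu):
    """Largest TCP segment payload that fits in one unfragmented IP packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def ping_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IP packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


path_mtu = mtu_from_frame_limit(1510)
print(path_mtu)                        # 1496
print(tcp_payload_for_mtu(path_mtu))   # 1456
print(ping_payload_for_mtu(path_mtu))  # 1468
```

If these assumptions hold, a Windows `ping -f -l 1468` (Don't Fragment set, 1468-byte payload) towards one of the clients should get through, while `ping -f -l 1472` should not.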
The exact reason for the reduced MTU is not relevant to this question. The question is why a Windows Server 2019 machine is able to recover from these dropped packets for some connections but not for others. As I understand it, what I'm seeing is "PMTU black hole detection", but why does it kick in for some connections and not for others?
Clients 10.246.54.143 and 10.246.54.157 are both in the same LAN segment at a remote site, where the MTU problem exists. Both of them initiate connections to the server 10.8.4.45.
In both cases, the initial TCP handshake is fine, and the MTU issue is hit once the server sends a few large packets.
When this happens on a connection from 10.246.54.157, after a few retransmissions of the large packets, the server seems to give up and instead tries with an IP payload of just 576 bytes, which works, and everything proceeds from there:
(Interestingly, large packets sent by the client do come through.)
When 10.246.54.143 tries to connect, the large packets are dropped and retransmissions happen as well, but in this case the server never tries a smaller packet size, so the connection can never fully form:
The server application (listening on port 5002) is the same in both cases. It's written in Java.
Why does the server never try with smaller packets for the connection from 10.246.54.143?
Both connections are routed over the same interfaces and routers, except for the part closest to the clients, where they are attached to different switches.