I am running multiple VMs in Azure. The VMs run in a subnet with an NSG attached. The NICs have no NSGs of their own, and we do not use accelerated networking.
I notice that when a VM talks to another VM in the same subnet over TCP, the MSS value in the SYN packets is reduced by 42. That means if I send a TCP SYN with MSS=876 to another VM in the same network, the other VM captures a TCP SYN with MSS=834:
Client:
18:49:27.526527 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 876,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.528398 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1418,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
18:49:27.528430 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [.], ack 1, win 137, options [nop,nop,TS val 2936204425 ecr 390195731], length 0
Server:
18:49:27.527362 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 834,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.527682 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1460,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
18:49:27.529167 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [.], ack 1, win 137, options [nop,nop,TS val 2936204425 ecr 390195731], length 0
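To make the comparison between the two captures explicit, here is a minimal Python sketch that extracts the MSS option from the SYN and SYN-ACK lines of each tcpdump text capture and computes the difference. The function name `mss_by_flag` and the regular expression are my own illustration, not part of any tool; the capture lines are the ones shown above.

```python
import re

# Matches the flags ([S] or [S.]) and the mss option on a tcpdump text line.
MSS_RE = re.compile(r"Flags \[(S\.?)\].*?\bmss (\d+)")

def mss_by_flag(capture):
    """Return {'S': syn_mss, 'S.': synack_mss} for a tcpdump text capture."""
    result = {}
    for line in capture.splitlines():
        match = MSS_RE.search(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result

client_capture = """\
18:49:27.526527 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 876,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.528398 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1418,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
"""

server_capture = """\
18:49:27.527362 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 834,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.527682 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1460,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
"""

client = mss_by_flag(client_capture)
server = mss_by_flag(server_capture)

# SYN: sent by the client, received by the server.
syn_delta = client["S"] - server["S"]
# SYN-ACK: sent by the server, received by the client.
synack_delta = server["S."] - client["S."]

print("SYN reduced by", syn_delta)
print("SYN-ACK reduced by", synack_delta)
```

Both directions show the same reduction, so the rewrite is applied to the SYN and to the SYN-ACK alike.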
We use multiple NVAs, so our SYN packets travel through several hops, and we actually see the MSS being reduced several times. We originally measured a reduction of 84, and in some cases a reduction of 138 (which is indeed not a multiple of 42). That means we lose more than 10% of our network efficiency.
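As a rough way to quantify the cost, the sketch below expresses a reduction as a fraction of the originally advertised MSS. The helper `payload_loss` is hypothetical, and the starting values 1460 and 876 are only examples taken from the captures above; the actual loss depends on which MSS the hosts start from.

```python
def payload_loss(original_mss, reduction):
    """Fraction of per-segment payload lost when the MSS shrinks by `reduction`."""
    return reduction / original_mss

for original in (1460, 876):
    for reduction in (42, 84, 138):
        loss = payload_loss(original, reduction)
        print(f"MSS {original} - {reduction}: {loss:.1%} less payload per segment")
```

With a starting MSS of 876, a reduction of 138 costs more than 10% of the payload per segment; with a starting MSS of 1460 the same reduction stays just under 10%.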
I have spent some time looking at how various network appliances play with the MSS. In most cases, the MSS is set to a fixed amount, clamped either to a static value or to the path MTU. Palo Alto uses an "adjustment" that is relative to the MTU of a network interface, which is a fixed value. Arista lets you put a ceiling value on ingress or egress traffic, again an absolute value. Some firewall vendors, like Palo Alto, will reduce the MSS when SYN cookies are activated during a DoS attack, but the MSS will be one of 8 possible values in that case.
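The difference between an absolute ceiling and a relative reduction shows up as soon as a packet crosses more than one device. The following sketch models both behaviors; `clamp`, `subtract`, and `through` are hypothetical helpers, and the ceiling of 1360 is an arbitrary example value, not a setting from any of the vendors mentioned.

```python
def clamp(ceiling):
    """Hop that caps the MSS at an absolute ceiling (classic MSS clamping)."""
    return lambda mss: min(mss, ceiling)

def subtract(delta):
    """Hop that lowers the MSS by a fixed amount, whatever its current value."""
    return lambda mss: mss - delta

def through(hops, mss):
    """Apply each hop's MSS rewrite in order and return the final value."""
    for hop in hops:
        mss = hop(mss)
    return mss

# Three hops with the same ceiling: the result is the ceiling itself.
clamped = through([clamp(1360)] * 3, 1460)
# Three hops that each subtract 42: the reductions accumulate.
subtracted = through([subtract(42)] * 3, 1460)

print("clamped:", clamped)
print("subtracted:", subtracted)
```

Clamping gives the same result no matter how many hops apply it, whereas the subtraction makes the final MSS depend on the number of hops on the path.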
I believe this MSS -= 42 mechanism breaks TCP: if a client supports jumbo frames and sends an MSS of 8960, the server in Azure receives 8876 (a reduction of 84). The server itself replies with 1330, but the client receives 1246. The client will then agree that packets should carry 1246 bytes of payload, while the server will send 1330 bytes of payload.
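The asymmetry in this example can be sketched as follows. This is only a model under stated assumptions: the client advertises 8960 (the usual MSS for a 9000-byte MTU), the path removes 84 in each direction, and each host never sends segments larger than either the MSS it received or the MSS it advertised itself. The function `send_limit` is hypothetical.

```python
PATH_REDUCTION = 84  # assumption: two hops, each subtracting 42

def send_limit(own_advertised_mss, peer_mss_as_received):
    """Largest payload a host would send, under the assumptions above."""
    return min(own_advertised_mss, peer_mss_as_received)

client_advertises = 8960
server_advertises = 1330

# What each side actually sees after the path has rewritten the option.
server_receives = client_advertises - PATH_REDUCTION
client_receives = server_advertises - PATH_REDUCTION

client_limit = send_limit(client_advertises, client_receives)
server_limit = send_limit(server_advertises, server_receives)

print("server sees MSS", server_receives, "and sends up to", server_limit)
print("client sees MSS", client_receives, "and sends up to", client_limit)
```

Under these assumptions the two ends settle on different segment sizes, 1246 from the client and 1330 from the server.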
The biggest issue is that we have cases where traffic works "by chance". The clamping is not done properly on the ExpressRoute side, yet because of this -42 here and there, the MSS actually gets reduced to a value that "fits". Then some slight change in the way packets are routed reveals, all of a sudden, that there was a misconfiguration somewhere.
Any idea how to explain this reduction? I believe this behavior is not documented anywhere.
EDIT
I was just reading RFC 879:
"The MSS can be used completely independently in each direction of data flow. The result may be quite different maximum sizes in the two directions."
So it looks legit as per RFC. Still, a weird behavior.