Score:2

What is reducing the MSS by 42?

jp flag

I am running multiple VMs in Azure. The VMs run in a subnet with an NSG; the NICs do not have NSGs attached, and we do not use accelerated networking.

I notice that when a VM talks to another VM in the same subnet using TCP, the MSS value in the SYN packets is reduced by 42. That means if I send a TCP SYN with MSS=876 to another VM of the same network, the other VM will capture a TCP SYN with MSS=834:

Client:

18:49:27.526527 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 876,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.528398 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1418,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
18:49:27.528430 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [.], ack 1, win 137, options [nop,nop,TS val 2936204425 ecr 390195731], length 0

Server:

18:49:27.527362 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [S], seq 3092614737, win 17520, options [mss 834,sackOK,TS val 2936204423 ecr 0,nop,wscale 7], length 0
18:49:27.527682 IP 10.56.142.108.ssh > 10.56.142.25.49614: Flags [S.], seq 1710658781, ack 3092614738, win 28960, options [mss 1460,sackOK,TS val 390195731 ecr 2936204423,nop,wscale 7], length 0
18:49:27.529167 IP 10.56.142.25.49614 > 10.56.142.108.ssh: Flags [.], ack 1, win 137, options [nop,nop,TS val 2936204425 ecr 390195731], length 0
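
For reference, one way to reproduce this from a Linux client (a sketch only; the peer address, port and forced MSS are simply the values from the captures above, with tcpdump running on both VMs):

```python
# Minimal reproduction sketch (assumes a Linux client): force the MSS our SYN
# advertises, then compare what tcpdump shows on each VM. Peer address, port
# and MSS are just the values from the captures above.
import socket

PEER = "10.56.142.108"
PORT = 22
FORCED_MSS = 876

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_MAXSEG set before connect() caps the MSS option sent in our SYN.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, FORCED_MSS)
s.connect((PEER, PORT))
print("MSS seen by this socket:",
      s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))
s.close()
```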

We are using multiple NVAs, so our SYN packets travel through multiple hops, and we actually see the MSS being reduced multiple times: we originally measured a reduction of 84, and in some cases a reduction of 138 (which is indeed not a multiple of 42). That means we reduce the efficiency of our network by more than 10%.

I have spent some time looking at how various network appliances play with the MSS. In most cases, the MSS is set to a fixed amount, clamped either to a static value or to the path MTU. Palo Alto will use an "adjustment" relative to the MTU of a network interface, which is a fixed value. Arista lets you put a ceiling on ingress or egress traffic, again an absolute value. Some firewall vendors, like Palo Alto, will reduce the MSS when a DoS attack is detected and SYN cookies are activated, but in that case the MSS will be one of 8 possible values.

I believe this MSS -= 42 mechanism breaks TCP: if a client supports jumbo frames and sends an MSS of 8860, the server in Azure receives 8818; the server itself replies with 1330, but the client receives 1246. The client will agree that packets should carry 1246 bytes of payload, while the server will send 1330 bytes of payload.

The biggest issue is that we have cases where traffic works "by chance". The clamping is not done properly on the ExpressRoute side, yet because of this -42 here and there the MSS actually gets reduced to a value that "fits", until there is some slight change in the way packets are routed, and you suddenly discover that there was a misconfiguration somewhere.

Any idea how to explain this reduction? I believe this behavior is not documented anywhere.


EDIT

Just reading RFC 879:

> The MSS can be used completely independently in each direction of data flow. The result may be quite different maximum sizes in the two directions.

So it looks legit as per the RFC. Still, a weird behavior.

Massimo
ng flag
This intrigued me enough to actually test it by creating a new VNet and various VMs with different Windows OSes (2012 R2, 2016, 2019) and Wireshark. Same VNet, same subnet, direct communication, no routing, no firewalls. I can confirm that this actually happens: a VM sends a SYN with MSS 1460 and the other one receives a SYN with MSS 1418; the second one replies with a SYN-ACK advertising MSS 1460 and the first one receives it with MSS 1418.
Massimo
ng flag
Azure networking is well known to be quite... *peculiar*. If you want a real shock, have a look at the ARP table on an Azure VM.
cn flag
`I believe this MSS -= 42 mechanism breaks TCP`. I don't. But this would be easy to test and provide actual data.
Ken W MSFT
gb flag
This may help https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-tcpip-performance-tuning#tcp-mss-window-scaling-and-pmtud
Score:4
in flag

As opposed to physical networking, SDN networking consumes additional bytes for encapsulation headers (GRE). The visible IPs are CAs (customer addresses), but there is also a PA (provider address) that the cloud provider needs to route on. Hence you see less MSS available, since the cloud provider applies additional TCP MSS clamping in the infrastructure for the backend routing to happen.

CA/PA explanation (Hyper-V SDN):

https://docs.microsoft.com/en-us/windows-server/networking/sdn/technologies/hyper-v-network-virtualization/hyperv-network-virtualization-technical-details-windows-server

frigo
jp flag
Thanks. For packets that already fit, why clamp again and again? And for packets that don't fit, why would reducing by 42 help? I think this reduction by 42 is a workaround to account for the incorrect MTU (1500). Maybe an MSS of 1418 is what the SDN prefers; in that case it should set it to 1418 when needed and no lower.
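
To make that distinction concrete, a purely illustrative sketch of conventional clamping to a ceiling versus the flat subtraction the captures suggest (the 1418 ceiling and the sample MSS values come from this thread; this is not Azure's actual logic):

```python
# Illustrative only: conventional MSS clamping to a ceiling vs. the flat
# 42-byte subtraction observed in the captures. Not Azure's actual logic.
def clamp_to_ceiling(mss, ceiling=1418):
    return min(mss, ceiling)      # only touches packets that would not "fit"

def flat_reduction(mss, overhead=42):
    return mss - overhead         # reduces every SYN, even ones that already fit

for advertised in (876, 1460, 8860):
    print(advertised, clamp_to_ceiling(advertised), flat_reduction(advertised))
```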
frigo
jp flag
I accept this answer because it does answer the "what" question, but I still believe it's broken.
fr flag
It may be: 20 bytes of IPv4 header, 8 bytes of GRE header (4 bytes mandatory + 4 bytes optional with checksum), 14 bytes of inner Ethernet header => giving exactly 42 bytes.
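
That breakdown does add up to the observed numbers; a quick sketch of the arithmetic, with the header sizes assumed as in the comment above (they are not Azure-documented values):

```python
# Arithmetic behind the 42-byte figure, assuming outer IPv4 + GRE + inner
# Ethernet encapsulation as suggested above. Header sizes are assumptions,
# not Azure-documented values.
OUTER_IPV4 = 20       # outer IPv4 header
GRE = 8               # 4 bytes mandatory + 4 bytes optional (checksum/key)
INNER_ETHERNET = 14   # inner Ethernet header
overhead = OUTER_IPV4 + GRE + INNER_ETHERNET   # 42

UNDERLAY_MTU = 1500
plain_mss = UNDERLAY_MTU - 20 - 20             # minus IPv4 + TCP headers -> 1460
clamped_mss = plain_mss - overhead             # 1418, as seen in the captures

print(overhead, plain_mss, clamped_mss)        # 42 1460 1418
```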