Score:0

Causes of packet loss on multiple persistent tcp connections simultaneously?


The issue was detected while analyzing application logs, which showed spikes lasting a few seconds during which messages from multiple clients reach the server with a substantial delay (up to a couple of seconds). The application itself uses persistent connections, over which clients and server are exchanging short messages (much less than MTU) a couple of dozen times per second (think voice data/gaming traffic).

To dig deeper, I recorded a tcpdump and found that random segments from multiple (but not all) clients get lost during those spikes (so the server sends out a lot of SACKs). The retransmissions arrive after about 300 ms at best, hence the delay at the application level while the server waits for the missing segments. For a particular affected client, it's not just one retransmission per spike but a series of retransmissions. Commands like ifconfig -a don't report any packet loss, and /var/log/syslog is clean. The link is 10 Gbit/s, while incoming/outgoing traffic barely reaches 10 Mbit/s at peak hours.

The question is: what may cause this, which tools can help spot the problem, and where should I look? Could this be related to the server provider?

Steffen Ullrich
Packet loss can happen at any device between client and server (e.g. router, firewall, load balancer) and also on the server itself. It is often caused by overload of the intermediary or end device, but might also be caused by bugs. To find out where the loss happens, you need to capture packets at the individual devices and see where exactly the packets disappear. Self-reported statistics on these devices about packet load and packet loss might help too.
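As a rough illustration of the "self-reported statistics" idea on a Linux server, the sketch below parses the per-interface error and drop counters that the kernel exposes in /proc/net/dev and reports which ones grew between two snapshots. The function names (`parse_net_dev`, `counter_deltas`) are made up for this example, and it only covers the server's own interface counters; loss on intermediate devices would not show up here.

```python
from typing import Dict

Counters = Dict[str, Dict[str, int]]


def parse_net_dev(text: str) -> Counters:
    """Extract error/drop/fifo counters per interface from /proc/net/dev text."""
    counters: Counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.rsplit(":", 1)
        fields = [int(value) for value in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "rx_fifo": fields[4],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
            "tx_fifo": fields[12],
        }
    return counters


def counter_deltas(before: Counters, after: Counters) -> Counters:
    """Return only the counters that increased between two snapshots."""
    deltas: Counters = {}
    for iface, values in after.items():
        previous = before.get(iface, {})
        grown = {
            key: value - previous.get(key, 0)
            for key, value in values.items()
            if value > previous.get(key, 0)
        }
        if grown:
            deltas[iface] = grown
    return deltas


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as handle:
            print(parse_net_dev(handle.read()))
    except OSError:
        print("/proc/net/dev is not available on this system")
```

Taking a snapshot every few seconds and logging the deltas with timestamps would let you check whether any local counter moves during the delay spikes. If nothing moves, that points away from the server's own interface and toward the path in front of it.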
Greg Askew
This is far too broad. However, in my experience, the most common cause of packet loss is insufficient capacity, specifically one pipe connecting to a smaller pipe.
tonso
@SteffenUllrich The fact that it happens for multiple random clients at the same time suggests that it is not an issue with individual clients' devices/routers, imo...
Steffen Ullrich
@tonso: *"not an issue with individual clients' devices/routers"* - I agree. But many clients usually share at least some part of the network path. For example, clients using the same ISP will share most of the path, and several ISPs might use the same upstream. And even if all of them come from different ISPs and upstreams, they will share the last part of the path through the infrastructure where the server is located.
tonso
@GregAskew Ok, but how to narrow down the search then?
tonso
@SteffenUllrich I analyzed the IPs of the clients affected during one spike, and not only do they use various unrelated ISPs, but some of them come from outside of the US. So it seems like it should be a data center infrastructure problem, though it's unclear how to even approach the server provider regarding this... Probably some built-in DDoS protection?
Steffen Ullrich
There might be an overload due to traffic spikes, which might be caused by DDoS but might also be non-malicious traffic spikes hitting the limited capacity of the provider.
Greg Askew
`how to narrow down the search then?` There is insufficient information. The only thing we know is the application design uses long-lived connections using a provider that probably doesn't offer an SLA. You could start by defining 'long-lived'.
tonso
@GregAskew By long-lived I meant the ones that persist, without idling, not the typical one-time bulk download (this is not my definition of long-lived TCP connections). Mmm, but I don't understand why you are saying that this is the only information provided. I clearly wrote 'over which clients and server are exchanging short messages (much less than MTU) a couple of dozen times per second (think voice data/gaming traffic).' I truly believe this contains *some* information about the application design. What else should be described?
Peter Zhabin
Two suggestions: 1) Use UDP for this type of traffic. 2) If you really need to troubleshoot this, implement a few independent probes that continuously send some data back and forth (like an echo ping) and monitor their statistics.
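One possible shape for such a probe is sketched below: a tiny UDP echo responder plus a sender that tags each datagram with a sequence number and records round-trip time and loss. Everything here (the names `start_echo_server` and `run_probe`, the payload layout, the intervals and timeouts) is an assumption made for illustration, not something taken from the application in question. In practice the responder would run on the server and the sender on a few hosts in different networks.

```python
import socket
import struct
import threading
import time


def start_echo_server(host="127.0.0.1", port=0):
    """Start a UDP echo responder in a daemon thread and return its socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))  # port 0 lets the OS pick a free port

    def serve():
        while True:
            try:
                data, addr = sock.recvfrom(2048)
                sock.sendto(data, addr)
            except OSError:
                break  # socket was closed

    threading.Thread(target=serve, daemon=True).start()
    return sock


def run_probe(addr, count=20, interval=0.05, timeout=0.5):
    """Send `count` numbered datagrams to `addr` and collect loss/RTT statistics."""
    rtts = []
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            sent_at = time.monotonic()
            sock.sendto(struct.pack("!I", seq), addr)
            try:
                while True:
                    data, _ = sock.recvfrom(2048)
                    if len(data) < 4:
                        continue  # not one of our probes
                    (echoed_seq,) = struct.unpack("!I", data[:4])
                    if echoed_seq == seq:
                        rtts.append(time.monotonic() - sent_at)
                        break
                    # A late reply to an earlier probe: ignore and keep waiting.
            except socket.timeout:
                lost += 1
            time.sleep(interval)
    return {
        "sent": count,
        "received": len(rtts),
        "lost": lost,
        "max_rtt": max(rtts) if rtts else None,
    }


if __name__ == "__main__":
    # Loopback demo; a real probe would target the remote responder instead.
    server = start_echo_server()
    print(run_probe(server.getsockname(), count=5))
    server.close()
```

Running the sender from several vantage points and logging the results with timestamps would show whether probe loss lines up with the application-level spikes, and whether all vantage points are affected at once.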