The client and server nodes run CentOS 7.9 on x86_64. When the HTTP POST requests are sent directly to the server, about 0.2% of them time out. When the HTTP POST requests are sent through an NGINX proxy on the client node, about 20% of them time out. I've confirmed that only one backend node has this problem. All other nodes succeed 100% of the time, even at higher throughput.
I ran tcpdump on the backend nodes and analyzed the captures with Wireshark. For a successful request, the TCP packets are received normally, as below:
That is to say, the TCP receiver sends an ACK for each large TCP payload.
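For a failed request, the TCP receiver only acknowledges 1398 bytes for each TCP packet. 1398 is the payload of a full-size TCP segment: presumably the MSS minus 12 bytes of TCP timestamp options (1410 - 12 = 1398), which would match a 1464-byte frame minus 66 bytes of Ethernet, IP, and TCP headers (1464 - 66 = 1398), as below: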
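The 1398 figure can be checked with a small sketch. It shows one consistent reading of the numbers, assuming an MSS of 1410 and that TCP timestamps are enabled (12 bytes of options on every data segment). The option length and the function names are assumptions for illustration, not values read from the capture.

```python
# Header sizes in bytes. The 12-byte option length is an assumption
# (timestamp option of 10 bytes padded with two NOPs), not a measured value.
ETHERNET_HEADER = 14
IPV4_HEADER = 20           # IPv4 header without IP options
TCP_BASE_HEADER = 20       # TCP header without options
TCP_TIMESTAMP_OPTION = 12


def payload_per_segment(mss: int, tcp_option_bytes: int = TCP_TIMESTAMP_OPTION) -> int:
    """Largest TCP payload that fits in one segment for a given MSS."""
    return mss - tcp_option_bytes


def frame_size(payload: int, tcp_option_bytes: int = TCP_TIMESTAMP_OPTION) -> int:
    """Size of the Ethernet frame (without FCS) that carries the payload."""
    headers = ETHERNET_HEADER + IPV4_HEADER + TCP_BASE_HEADER + tcp_option_bytes
    return headers + payload


if __name__ == "__main__":
    payload = payload_per_segment(1410)
    print("payload per segment:", payload)
    print("frame size on the wire:", frame_size(payload))
```

Under these assumptions the 66 bytes are the combined Ethernet, IP, and TCP headers of one frame, and the 12 bytes are the part of that overhead that counts against the MSS.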
The TCP sender retransmits 8 times in 60 seconds, but the TCP receiver never sends an ACK again. The HTTP server then closes the connection after its 60-second read timeout.
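Eight retransmissions within 60 seconds is what plain exponential backoff would produce. The sketch below assumes an initial retransmission timeout of 200 ms (the Linux minimum RTO) that doubles after every retry and is capped at 120 seconds. The real initial value depends on the measured round-trip time, so the exact timestamps in the capture may differ.

```python
def retransmit_schedule(initial_rto: float, retries: int, max_rto: float = 120.0) -> list:
    """Return the time in seconds, measured from the first send, of each retransmission.

    Models simple exponential backoff: the timeout doubles after every
    retransmission and never exceeds max_rto.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto
        times.append(round(elapsed, 3))
        rto = min(rto * 2, max_rto)
    return times


if __name__ == "__main__":
    # Assumed initial RTO of 0.2 s; adjust to the value seen in the capture.
    schedule = retransmit_schedule(0.2, 9)
    within_timeout = [t for t in schedule if t <= 60.0]
    print("retransmission times:", schedule)
    print("retransmissions before the 60 s read timeout:", len(within_timeout))
```

With these assumed values the eighth retransmission falls at about 51 seconds and the ninth would come after the 60-second read timeout, which is consistent with seeing exactly 8 retransmissions before the server closes the connection.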
It looks like the packets were lost in the kernel TCP stack rather than on the network, since they show up in the capture taken on the server-side node. On the client node, tcpdump shows that the client received each ACK from the server quickly.
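One way to test the kernel-drop theory is to compare the TCP counters in /proc/net/netstat on the affected backend node before and after a failing request. The following is a minimal sketch of that idea: the helper names are hypothetical, and the counter list is only a guess at the relevant ones, since the set of available counters depends on the kernel version.

```python
from pathlib import Path

# Counters that often point at receive-side drops. Which of them exist
# depends on the kernel version, so missing names are simply skipped.
DROP_COUNTERS = (
    "ListenOverflows", "ListenDrops", "TCPBacklogDrop",
    "PruneCalled", "RcvPruned", "OfoPruned", "TCPRcvCollapsed",
)


def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"TcpExt": {name: value}, ...}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    result = {}
    # The file alternates a line of counter names with a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        result[prefix] = {n: int(v) for n, v in zip(names[1:], values[1:])}
    return result


def drop_counters(text: str) -> dict:
    """Return only the drop-related TcpExt counters present in the text."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return {name: tcp_ext[name] for name in DROP_COUNTERS if name in tcp_ext}


if __name__ == "__main__":
    path = Path("/proc/net/netstat")
    if path.exists():
        for name, value in drop_counters(path.read_text()).items():
            print(f"{name}: {value}")
```

A counter that increases only on the problematic node while requests are timing out would support the idea that the kernel is discarding segments after they reach the interface.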
Can anyone help with this? Thanks in advance.