Thanks for reading :)
This is a really difficult issue, and I would appreciate any ideas or suggestions for tracking it down.
Problem: When a user logs in, the application initiates ~20 API requests in parallel. The first request performs the SSL handshake. Then, around the 10th to 13th request, I see two requests initiate an SSL handshake at the same time, with each handshake getting stuck and taking over 25 seconds to complete. For users, the issue manifests as a 30-second login.
Setup: A hardware-based load balancer sits in front of about 8 nginx nodes, each of which reverse proxies to a Java application running on the same node. The FE is a SPA, and all traffic flowing through nginx is dynamic content.
Additional Details
- Tweaking the keepalive from 65s to 10s reduced the total SSL handshake time from >30s (which is the FE timeout) to 25s, so the issue is related to keepalive in some way.
- This issue was initially present only in Firefox, and now also appears in Safari.
- Upgraded nginx to latest LTS
- Load balancer is distributing requests round robin.
- The nginx logs do not contain any mention of the issue.
- The API requests are issued in a fixed order, and the issue usually affects 2 of the same 3 requests.
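
To take the browser out of the picture, something like the sketch below could be used to open a batch of new TLS connections in parallel and time each handshake separately from the TCP connect. This is only a rough approximation of the login burst: a browser normally reuses a handful of connections rather than opening 20 new ones, so this checks whether concurrent handshakes stall at all, not the exact browser behaviour. The hostname `app.example.com`, the function names, and the 5 second threshold are all placeholders I made up for illustration.

```python
import socket
import ssl
import time
from concurrent.futures import ThreadPoolExecutor


def time_handshake(host, port=443, timeout=40.0):
    """Open one new connection; return (tcp_connect_seconds, tls_handshake_seconds)."""
    ctx = ssl.create_default_context()
    t0 = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        t1 = time.monotonic()
        # wrap_socket() performs the TLS handshake before it returns.
        with ctx.wrap_socket(raw, server_hostname=host):
            t2 = time.monotonic()
    return t1 - t0, t2 - t1


def run_parallel(n, probe):
    """Call probe(i) for i in range(n) on n threads; results come back in index order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(probe, range(n)))


def slow_indices(timings, threshold_s=5.0):
    """Indices of probes whose TLS handshake took longer than threshold_s."""
    return [i for i, (_connect, handshake) in enumerate(timings) if handshake > threshold_s]


def main(host, n=20):
    """Print per-connection timings for n parallel handshakes against host."""
    timings = run_parallel(n, lambda _i: time_handshake(host))
    for i, (connect, handshake) in enumerate(timings):
        print(f"{i:2d} connect={connect:.3f}s handshake={handshake:.3f}s")
    print("slow handshakes at:", slow_indices(timings))


# Example call (placeholder hostname, not the real one):
# main("app.example.com")
```

Running this against the load balancer address and then against a single nginx node directly might help show whether the stall is tied to the load balancer, to one node, or to neither. A connection that exceeds the socket timeout raises an exception instead of returning a timing, so the timeout is set above the 30s FE timeout here.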