Our production environment contains two ALBs: a public-facing one and a private one. Both ALBs support HTTP/2.
I have a target group that uses HTTP/1.1 and contains an ECS service. The very strange thing I'm observing is:
- When requests are made to this service via either of the ALBs, approximately 1 out of 5 requests fails with a 504 Gateway Timeout.
- When I make requests to the service's IP address directly (from an EC2 instance in the same VPC), I don't get any such timeouts.
- An older version of the same application works without 504s via either ALB.
The idle timeout on the ALBs is set to 30s. In the application it is set to 60s (nginx), and the proxied service has the same value.
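For reference, this is roughly where those 60s values live on the nginx side; treat it as a sketch, since the exact directives and values in our real config may differ slightly:

```
# Sketch of the nginx timeout directives (names/values approximate,
# not copied verbatim from the production config).
http {
    keepalive_timeout  60s;   # idle client (ALB-facing) connections
    proxy_read_timeout 60s;   # waiting for a response from the upstream application
    proxy_send_timeout 60s;   # sending the request to the upstream application
}
```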
I've compared the response headers in both servers, but they are identical.
My question here is: what should I be looking at as the potential culprit? I know the usual keep-alive caveats can be a big problem, but again, two different versions of the same application behave differently behind the same ALBs, and I've found very little to help me debug this.
The current architecture is:
Client -> [ AWS ALB ] -> [ AWS ECS: Docker container ]

Within the Docker container I have:

[ nginx ] -> [ application ]
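To be concrete about that last hop, the in-container nginx is just a reverse proxy in front of the application process. A simplified sketch of that setup (the upstream port and keep-alive values here are placeholders, not our exact config):

```
# Simplified sketch of the in-container reverse proxy.
# Port and keepalive values are placeholders, not the real config.
upstream app {
    server 127.0.0.1:8080;    # the application process in the same container
    keepalive 32;             # pool of idle upstream connections
}

server {
    listen 80;

    location / {
        proxy_pass         http://app;
        proxy_http_version 1.1;           # needed for upstream keep-alive
        proxy_set_header   Connection "";
    }
}
```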
Another notable point: I cannot reproduce the issue in our staging environment, which uses the same architecture; the only difference is that it hosts the Docker container directly on AWS EC2 instead of ECS.
ECS CPU/memory usage looks nominal.