I have read the other posts on this; they do not shed much light.
Situation:
- A Kubernetes cluster, with an ingress that points to
- several nginx containers, which proxy_pass to a
- Node application on a specific URI, via location /app/ (a rough sketch of the config shape is below).
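Our config is roughly of this shape - the upstream name, host, port and headers here are simplified placeholders rather than the exact values:

```nginx
# Illustrative shape of the nginx -> node proxying (names and ports are
# placeholders for this question, not the real values).
upstream node_app {
    server node-app.default.svc.cluster.local:3000;
}

server {
    listen 80;

    location /app/ {
        proxy_pass http://node_app/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```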
What we see:
After days of working without problems, all three nginx containers start reporting upstream issues to the node app at the same time - specifically, that the connection is unexpectedly closed by the client.
However, going direct to the node app (via a direct ingress route), or even curling it from inside the nginx containers, gives a 100% success rate. In other words, the issue does not seem to be in the app itself; if it were, we would expect a similar failure rate there, for the same reasons as when going via nginx.
- CPU: well below the max
- Memory: well below the max
- 1024 sockets configured
- 100k file descriptors (hard and soft limits) - see the config sketch below for where I believe these are set.
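My reading is that the "1024 sockets" figure is nginx's worker_connections and the 100k figure is the per-worker file descriptor limit; in nginx terms that would sit roughly here (an assumption on my part, not a dump of our actual config):

```nginx
# Where I believe the limits quoted above live, assuming "1024 sockets"
# means worker_connections.
worker_processes auto;
worker_rlimit_nofile 100000;    # the 100k file descriptor limit

events {
    worker_connections 1024;    # the "1024 sockets" figure
}
```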
This is all I have, but it is a pressing issue, and it is not clear what the cause is, or why going direct yields such different behavior. More to the point, why does exec'ing into the nginx container (docker exec) and curling from there not reproduce the issue?
Right now the working hypothesis is that some form of resource is being exhausted, but it is not at all clear which resource that would be.
We are not maintaining keep-alive connections to the upstream, but if socket/port exhaustion were the issue, surely we would see the same behavior when logging into the container and curling directly.
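That said, if it does turn out to be ephemeral port / TIME_WAIT pressure from opening a brand-new upstream connection per request, the standard mitigation would be upstream keepalive, roughly like this (the pool size is a guess, not a tested value):

```nginx
# Sketch of upstream keepalive - not something we currently run.
upstream node_app {
    server node-app.default.svc.cluster.local:3000;
    keepalive 32;                    # idle connections kept open per worker
}

location /app/ {
    proxy_pass http://node_app/;
    proxy_http_version 1.1;          # keepalive to the upstream needs HTTP/1.1
    proxy_set_header Connection "";  # drop "Connection: close" so connections are reused
}
```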
I am starting to run out of ideas - so any help is hugely appreciated.
Right now, I have a ticking time bomb: flawless service for days, and then suddenly - BOOM - clients experience latencies of either 60 or 120 seconds (the request appears to fail over from one nginx to another on the first failure) about 30% of the time.
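It may or may not be relevant, but 60 seconds happens to be nginx's default for the proxy timeouts, so my guess is one 60s timeout on the first nginx, then a failover and a second 60s timeout producing the 120s case. For reference, the defaults in question:

```nginx
# nginx defaults, written out explicitly - a request that hangs while
# connecting to or reading from the upstream sits here for 60s before nginx
# gives up, which lines up with the 60s / 120s latencies we see.
location /app/ {
    proxy_pass http://node_app/;
    proxy_connect_timeout 60s;   # default
    proxy_send_timeout    60s;   # default
    proxy_read_timeout    60s;   # default
}
```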
Lastly: restarting the containers makes the problem go away.
Well, until it comes back, days or sometimes weeks later.
This is why we do not see it as an issue with the node app itself:
if it were the node app, why does hitting it directly work 100% of the time, and how would restarting the nginx containers, or even an nginx reload (the master process stays up, but new worker processes are spawned with the new config), fix an issue in the node app?
Because of this, the issue is believed to be in nginx, but it is very unclear where. Resources don't appear to be wildly different post-restart compared with pre-restart, yet the facts that a restart completely solves the issue and that it takes days to reappear make us feel it is somehow resource related. I couldn't offer a decent suggestion as to what that resource is, though.