HTTP 502 typically means that one server (the one originating the HTTP 502 response) tried to talk to another server and failed.
You mention that rebooting the "first" server (the one eventually handing out the 502) fixes the issue, which probably means there's some kind of non-persistent problem on that server.
Possible reasons:
- memory exhaustion: if your frontend server has to spawn a new process or thread to talk to the backend, it may not be able to do this.
Check RAM utilization (free -m, top) and resource limits, both the configured ones (/etc/security/limits.conf) and those actually in effect for the running process (cat /proc/PID/limits, where PID is the PID of your frontend process).
- number of open connections: maybe your frontend has a lot of open connections to the backend server, which means at some point it can't open a new one, and restarting closes those connections.
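Run ss -tlpnao | grep <backend server IP> (optionally matching the backend port as well) and compare the number of connections with the values of sysctl net.ipv4.ip_local_port_range and sysctl net.ipv4.tcp_fin_timeout. The first bounds how many outgoing connections to the same backend address and port can exist at once; the second controls how long orphaned connections linger in FIN-WAIT-2.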
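The connection check above could be scripted roughly like this. It is only a sketch: BACKEND_IP is a placeholder you set yourself, the helper names port_range_size and count_states are made up for illustration, and it assumes a Linux host with ss from iproute2.

```shell
#!/bin/sh
# Sketch: compare connections to the backend with the ephemeral port range.
# BACKEND_IP is a placeholder, e.g. BACKEND_IP=192.0.2.10 sh this-script.sh

# Print the number of ports in an inclusive range "low high".
port_range_size() {
    echo $(( $2 - $1 + 1 ))
}

# Read `ss -tan` output on stdin and print "STATE COUNT" per TCP state,
# skipping the header line and any blank lines.
count_states() {
    awk '$1 != "State" && NF > 0 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Only touch the live system when a backend address was supplied.
if [ -n "${BACKEND_IP:-}" ]; then
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    echo "ephemeral ports available: $(port_range_size "$low" "$high")"
    echo "connections to $BACKEND_IP by state:"
    ss -tan dst "$BACKEND_IP" | count_states
fi
```

If the per-state counts add up to something close to the number of available ports, port exhaustion becomes a plausible explanation; if they are nowhere near it, look elsewhere.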
I would also run tcpdump -nni any -v host <backend ip> to check what's going on from a packet perspective. Do you get a reply? If so, what kind of reply? Or does the frontend simply never get a reply from the backend? This may help you find the root cause.
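One way to make the packet capture easier to read is to save a bounded sample to a file and then replay only the interesting packets. The following is a sketch under assumptions: BACKEND_IP, BACKEND_PORT and PCAP are placeholders, backend_filter is a hypothetical helper, and tcpdump usually has to run as root.

```shell
#!/bin/sh
# Sketch: capture a bounded sample of backend traffic, then look for resets.
# BACKEND_IP, BACKEND_PORT and PCAP are placeholders you set yourself.

# Build the capture filter for one backend host, optionally one TCP port.
backend_filter() {
    if [ -n "${2:-}" ]; then
        echo "host $1 and tcp port $2"
    else
        echo "host $1"
    fi
}

# Only capture when a backend address was supplied.
if [ -n "${BACKEND_IP:-}" ]; then
    PCAP="${PCAP:-/tmp/backend.pcap}"
    # Stop after 500 packets so the capture file cannot grow without bound.
    tcpdump -nn -i any -c 500 -w "$PCAP" "$(backend_filter "$BACKEND_IP" "${BACKEND_PORT:-}")"
    # Replay only packets that have the RST flag set.
    tcpdump -nn -r "$PCAP" 'tcp[tcpflags] & tcp-rst != 0'
fi
```

RST packets coming back from the backend usually point to a refused or reset connection, while outgoing SYNs with no answer at all suggest the traffic is being dropped somewhere on the way.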