Background: a Debian Stretch amd64 server on Google Cloud running Apache 2.4.25. It serves a PHP-based website via proxy_fcgi to PHP-FPM. The backend database is PostgreSQL 10. The PostgreSQL packages were installed from the official PostgreSQL apt repo; everything else is vanilla from the Debian repos. Port 80 redirects to 443, with Let's Encrypt certificates. HTTP/2 and Brotli are enabled. There is also a reverse proxy to a Server-Sent Events daemon on the same server (https://github.com/vgno/ssehub).
The server has been up for over 2 years, but in the last few months there has been an intermittent fault where the site stops responding to requests. It usually clears up after a couple of minutes. I've done a lot of log analysis, and it doesn't seem to be related to the server processes: CPU usage is nominal, memory usage is low, and no errors appear in the logs for Apache, PostgreSQL, FPM, syslog, or ssehub. The server also has fail2ban installed, but there are no log entries for that either. I've added extra diagnostic logging in Apache and FPM to check for requests that take a long time to process, but that hasn't turned anything up.
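As an illustration of that kind of slow-request check, here is a minimal Python sketch that scans an Apache access log for requests above a time threshold. It assumes a LogFormat where %D (request duration in microseconds) has been appended as the last field of the standard combined format; the function name, the threshold, and the log layout are assumptions for the example, not details of the actual configuration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed LogFormat (hypothetical "timed" nickname): the combined format
# with %D, the request duration in microseconds, appended at the end:
#   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" timed
LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" .* (?P<usec>\d+)$')


def slow_requests(
    lines: Iterable[str], threshold_s: float = 5.0
) -> List[Tuple[str, str, float]]:
    """Return (timestamp, request line, seconds) for requests slower than threshold_s."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line.rstrip())
        if not match:
            continue  # skip lines that do not follow the assumed format
        seconds = int(match.group("usec")) / 1_000_000
        if seconds > threshold_s:
            hits.append((match.group("time"), match.group("request"), seconds))
    return hits
```

Passing an open log file to slow_requests() and printing the result is enough to use it. Note that a check like this only sees requests that actually reached Apache, so an empty result during an outage would be consistent with the requests never arriving.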
Here's the output from iptables -L:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
f2b-sshd   tcp  --  anywhere             anywhere             multiport dports ssh
DROP       udp  --  anywhere             anywhere             udp dpt:l2f policy match dir in pol none
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     udp  --  anywhere             anywhere             multiport dports isakmp,ipsec-nat-t
ACCEPT     udp  --  anywhere             anywhere             udp dpt:l2f policy match dir in pol ipsec
DROP       udp  --  anywhere             anywhere             udp dpt:l2f

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  192.168.42.0/24      192.168.42.0/24
ACCEPT     all  --  anywhere             192.168.43.0/24      ctstate RELATED,ESTABLISHED
ACCEPT     all  --  192.168.43.0/24      anywhere
DROP       all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain f2b-sshd (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
Any suggestions for possible causes or things I should check? At the moment the only cause I can think of is network congestion, but that's very difficult to prove as it's an intermittent issue and usually clears up by the time I'm aware of it and start doing some tests. Plus it seems surprising that Google Cloud would have such frequent network issues. Do Google have some kind of traffic shaping policies that I'm not aware of? It's a very low traffic server and the problem frequently occurs out of hours when virtually no one is using the site.
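Since the fault usually clears before any manual testing can start, one way to gather evidence is a probe that runs continuously from a second machine and records how far each request gets. The following is a minimal Python sketch under stated assumptions: the function names, the 15-second interval, and the 10-second timeout are all placeholders chosen for the example, and the host passed to run() would be the site's own hostname. It times the TCP connect, the TLS handshake, and the first response byte separately, so a hang in the connect phase (which would point at the network or firewall) can be told apart from a hang after the connection is established (which would point at Apache or the backends).

```python
import socket
import ssl
import time
from typing import Dict, Optional


def probe(host: str, port: int = 443, timeout: float = 10.0) -> Dict[str, object]:
    """Time TCP connect, TLS handshake, and first response byte separately.

    A phase that is never reached stays None; "error" names the phase that failed.
    """
    result: Dict[str, object] = {
        "connect_s": None, "tls_s": None, "first_byte_s": None, "error": None,
    }
    stage = "connect"
    sock = None
    try:
        t0 = time.monotonic()
        sock = socket.create_connection((host, port), timeout=timeout)
        result["connect_s"] = time.monotonic() - t0

        stage = "tls"
        t1 = time.monotonic()
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)
        result["tls_s"] = time.monotonic() - t1

        stage = "http"
        t2 = time.monotonic()
        request = f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        if not sock.recv(1):
            raise ConnectionError("connection closed before any response")
        result["first_byte_s"] = time.monotonic() - t2
    except OSError as exc:  # covers timeouts, refused connections, and TLS errors
        result["error"] = f"{stage}: {exc!r}"
    finally:
        if sock is not None:
            sock.close()
    return result


def format_line(result: Dict[str, object], now: Optional[float] = None) -> str:
    """Render one probe result as a single timestamped log line."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S%z", time.localtime(now))

    def fmt(value: object) -> str:
        return "-" if value is None else f"{value:.3f}"

    return (
        f"{stamp} connect={fmt(result['connect_s'])} tls={fmt(result['tls_s'])} "
        f"first_byte={fmt(result['first_byte_s'])} error={result['error'] or '-'}"
    )


def run(host: str, interval_s: float = 15.0) -> None:
    """Probe forever, printing one line per attempt (redirect stdout to a file)."""
    while True:
        print(format_line(probe(host)), flush=True)
        time.sleep(interval_s)
```

Running the same probe both from outside Google Cloud and from the server itself against localhost would, in principle, show whether the two views fail at the same moments. The timestamps can then be lined up against the Apache and FPM logs.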