My server is running Ubuntu 20.04 with a pure LAMP stack and Apache 2.4.41.
In the last few weeks there have been two occasions where Apache2 was unresponsive (users couldn't load our website) and we couldn't figure out why, but it started working again after I restarted Apache2 (systemctl restart apache2). I checked and MySQL was up the whole time, so I feel it's purely Apache2 hitting some limit and becoming unresponsive.
So I started tracing around and logging the process count, namely, logging the output of the command below
ps aux | grep apache | wc -l
into a text file every 5 seconds.
The command returns the number of processes containing the word "apache", which gives a rough count of the Apache processes active at that moment.
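For reference, the logging is roughly equivalent to a loop like this (apache_proc_count.log is just a placeholder name; grepping for '[a]pache' instead of apache keeps the grep process itself out of the count):

while true; do
    echo "$(date '+%F %T') $(ps aux | grep '[a]pache' | wc -l)" >> apache_proc_count.log
    sleep 5
done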
The usual process count ranges from 90 (off peak) to 250-300 (peak). But occasionally (twice now, since we started logging) it shoots up to 700; the trend goes 90 > 180 > 400 > 700, nearly doubling every 5 seconds.
I have checked the Apache error logs, syslog, access logs and so on, and failed to find anything useful. Initially I suspected a DDoS, but I couldn't find any information to "prove" that it is one.
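The only check I could think of was counting requests per client IP in the access log around the spike window, something like this (assuming the default Ubuntu log path of /var/log/apache2/access.log):

awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

which would list the 20 busiest client IPs, but I'm not sure that alone proves or disproves a DDoS.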
A little info about my server config (the relevant prefork snippet is reproduced after this list) -
- uses the default mpm_prefork
- MaxKeepAliveRequests 100
- KeepAliveTimeout 5
- ServerLimit 1000
- MaxRequestWorkers 1000 (recently increased from 600 to try to "solve" the spikes)
- MaxConnectionsPerChild 0
- MaxSpareServers 10
- No firewall (ufw) or mod_evasive enabled.
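For completeness, the prefork settings above sit in the usual Ubuntu file, roughly like this (StartServers and MinSpareServers are shown at the stock Ubuntu defaults of 5 just to make the snippet complete; the KeepAlive directives live in apache2.conf):

# /etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
    StartServers              5
    MinSpareServers           5
    MaxSpareServers           10
    ServerLimit               1000
    MaxRequestWorkers         1000
    MaxConnectionsPerChild    0
</IfModule>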
Here are my questions:
1. Is there any way I can find out what is causing the spike when there is nothing useful in the logs? I feel it's due to certain Apache processes getting stuck while new child processes keep getting spawned, if that's how it works (sorry, not very familiar with server stuff).
2. I noticed that after a spike the number of processes doesn't go down immediately; instead it seems to decrease by 3-5 processes every 5 seconds, and it took around 9-10 minutes to get from 700 processes back down to 100. I'm not sure why, but which config should I tweak to make the processes "die" faster? I was hoping that if the processes "die" fast enough, then even during a sudden spike my server would only be "down" for around 5-10 seconds max. But from what I've read, my setting of KeepAliveTimeout 5 should kill them fast enough, so why do they linger for up to 10 minutes? Should I set MaxConnectionsPerChild to something other than 0 (unlimited)?
My current approach is to find a way to implement #2 and a way to "prove" that processes are dying faster than they used to during a spike. Secondly, maybe put a firewall in place to mitigate a DDoS, if that is really what this is; a rough idea of that is sketched below.
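If I do go the firewall route, the rough idea I have in mind is ufw's built-in rate limiting, something like:

sudo ufw allow 22/tcp     # so I don't lock myself out of SSH
sudo ufw limit 80/tcp     # denies an IP that attempts 6+ connections within 30 seconds
sudo ufw limit 443/tcp
sudo ufw enable

but I'm not sure whether that's the right tool here or whether mod_evasive would be a better fit.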
Thanks