I'm writing here after weeks spent fighting an issue that causes Apache to stop responding until it is restarted.
It happens 3-4 times a day, sometimes after several hours, sometimes after a few minutes, sometimes after a day.
There's no relation (or at least no evidence of one) with the number of concurrent connections to the server: it happens both during heavy-traffic periods (between 8:00 and 18:00) and during the night, when traffic is very low.
Configuration:
VM on VMware ESXi Rel. 7 - OS: Ubuntu 20.04, Apache 2.4.41, PHP 8.0.15, MSSQL Drivers 17.8.1.1-1.
6 CPUs "Xeon(R) Gold 5218", 12 GB RAM.
3 websites running in "pure" PHP (no CMS or framework such as WordPress, Drupal, Ruby on Rails, etc.).
AWStats shows that the intranet site, which has no external access, serves < 10k pages a day; the others serve about 200k pages a day.
Most of the time CPU usage sits at about 1% and memory usage at about 2 GB. When the issue happens, no CPU/memory/network "spikes" are detected.
At the moment I have Monit installed and configured; every 20 seconds it uses curl to test this minimal PHP page:
<?php
echo "ok";
?>
Normally it prints "ok". During the "freeze", even this simple page isn't served; curl ends with a timeout error, which triggers Monit to run "service apache2 restart". After 2-3 seconds the websites return to normal operation (until the next freeze).
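To make the check concrete, here is a minimal sketch of the same probe logic in Python. It is only an illustration: my real setup uses Monit and curl, and the function names, the 10-second timeout and the example URL are assumptions, not values from my configuration.

```python
import http.client
import urllib.request


def body_is_ok(status: int, body: str, expected: str = "ok") -> bool:
    """A response counts as healthy only if it is a 200 with the expected body."""
    return status == 200 and body.strip() == expected


def probe(url: str, timeout: float = 10.0, expected: str = "ok") -> bool:
    """Fetch `url` and return True if it answers in time with the expected body.

    Any network-level failure (refused connection, timeout, reset, truncated
    response) is reported as False, which is the case that triggers a restart.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return body_is_ok(response.status, body, expected)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False


# Hypothetical usage (the URL is a placeholder for the test page):
#   healthy = probe("http://localhost/ok.php")
```

During a freeze, a probe like this returns False because the request times out, not because Apache sends back an error page.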
Here is a list of unsuccessful remediation attempts (not in chronological order):
- Removed certbot/Let's Encrypt and switched to a purchased Sectigo SSL certificate
- Switched Apache from mpm_worker to mpm_event
- Disabled a bunch of unused Apache modules
- Disabled a bunch of unused PHP modules
- Disabled most non-critical cron jobs (even though there's no evidence that the freeze happens during cron job execution).
- Changed virtual network adapter from VMXNET3 to E1000
- Enabled verbose logging: no useful information or errors are recorded; there is simply a 25-30 second gap between the last page served just before the hang and the first one served once the restart completes.
- Enabled mod_log_forensic for some days: no (!) errors are reported by the check_forensic utility
- Double-checked the few rewrite rules in the .conf files and in .htaccess
- Changed Apache's configuration; relevant values are:
StartServers 10
MinSpareThreads 40
MaxSpareThreads 120
ThreadLimit 100
ThreadsPerChild 75
MaxRequestWorkers 450
MaxConnectionsPerChild 1000
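As a sanity check on the values above, here is a small sketch of the arithmetic they imply. It relies on the documented rule for threaded MPMs that the maximum number of child processes is MaxRequestWorkers divided by ThreadsPerChild; the function name and the dictionary layout are mine, and I have not verified how Apache behaves when StartServers asks for more threads than MaxRequestWorkers allows.

```python
# Values copied from the configuration listed above.
config = {
    "StartServers": 10,
    "ThreadLimit": 100,
    "ThreadsPerChild": 75,
    "MaxRequestWorkers": 450,
}


def summarize(cfg: dict) -> dict:
    """Derive process and thread counts implied by the MPM settings."""
    max_children = cfg["MaxRequestWorkers"] // cfg["ThreadsPerChild"]
    startup_threads = cfg["StartServers"] * cfg["ThreadsPerChild"]
    return {
        # Upper bound on child processes under a threaded MPM.
        "max_children": max_children,
        # Threads requested at startup by StartServers.
        "startup_threads": startup_threads,
        # ThreadsPerChild must not exceed ThreadLimit.
        "threads_within_limit": cfg["ThreadsPerChild"] <= cfg["ThreadLimit"],
        # True when StartServers alone asks for more than MaxRequestWorkers.
        "startup_exceeds_max": startup_threads > cfg["MaxRequestWorkers"],
    }


print(summarize(config))
```

With these numbers the limit works out to 6 child processes, while StartServers 10 would correspond to 750 threads, more than the 450 allowed by MaxRequestWorkers. I don't know whether this mismatch has anything to do with the freeze.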
There's no evident pattern in the "last" page/file served before the issue: sometimes it is a PHP page (not always the same one), sometimes a PNG/JPEG image.
Reading the logs, I cannot find abnormal, malformed or excessive client requests.
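For anyone who wants to look for the same gaps in their own logs, this is a rough sketch of how the silent periods can be located. It assumes the default common/combined access log format, where the timestamp looks like [12/Feb/2022:10:00:00 +0100], and an English/C locale for the month names; the function name and the 20-second threshold are my own choices.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed timestamp of the common/combined log format.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, threshold=timedelta(seconds=20)):
    """Yield (previous, current) timestamp pairs separated by at least `threshold`."""
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and current - previous >= threshold:
            yield previous, current
        previous = current


# Hypothetical usage (path is the Ubuntu default, adjust as needed):
#   with open("/var/log/apache2/access.log") as log:
#       for start, end in find_gaps(log):
#           print(start, "->", end, (end - start).total_seconds(), "s")
```

On a busy site a gap of this size during the day is suspicious; at night, quiet periods are normal, so the results need to be compared with the times of the Monit restarts.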
The issue is 99.99% Apache related: the PHP-FPM service keeps working perfectly and does not need to be restarted after a freeze. None of the other services running on the server are affected.
Before writing here, I read tons of web pages but didn't find any hint that was useful for me.
Thanks in advance
Ciao
JYD