Score:0

What's causing a single Apache2 worker (using mod_jk) to not reload for weeks?

I've got a Debian 10 server running Apache2 2.4.38. Recently, I replaced the SSL certificate file used by all of the configured HTTPS vhosts and ran systemctl reload apache2.service, which runs /usr/sbin/apachectl graceful via the systemd unit file.

According to the Apache 2 docs,

The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.

On a busy server, it's understandable that there can be delays before a child process is replaced. However, today I noticed that the server occasionally still responds with the old, pre-replacement SSL cert. I went and had a look at the processes:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     13559  0.0  0.2  15640 10356 ?        Ss    2022  18:53 /usr/sbin/apache2 -k start
www-data 16834  0.7  0.6 1232452 27780 ?       Sl   06:00   3:47 /usr/sbin/apache2 -k start
www-data 17415  0.9  0.6 1231844 26532 ?       Sl   10:22   2:32 /usr/sbin/apache2 -k start
www-data 17552  0.7  0.6 1231736 26376 ?       Sl   10:53   1:47 /usr/sbin/apache2 -k start
www-data 17612  0.6  0.6 1232000 26840 ?       Sl   10:54   1:34 /usr/sbin/apache2 -k start
www-data 17641  0.6  0.5 1231980 22732 ?       Sl   10:54   1:36 /usr/sbin/apache2 -k start
www-data 17642  0.8  0.6 1231848 24728 ?       Sl   10:54   1:59 /usr/sbin/apache2 -k start
www-data 26704  0.5  0.6 1232216 24748 ?       Sl   Jan18  89:53 /usr/sbin/apache2 -k start

The last process in the list sticks out like a sore thumb on the START and TIME columns. I ran the reload command on Jan 24, but now, six days later, this one process is still going. The server is responding to requests just fine, though – it's unknown if this one worker is actually serving new requests or not, but the others are.
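A leftover child like this can also be spotted mechanically by comparing each worker's elapsed run time with the time since the last reload. The following is a minimal Python sketch, not a tested tool: it assumes the text comes from `ps -eo pid,ppid,etimes,args` (elapsed time in seconds, as provided by procps), and the function name `find_stale_workers` and its parameters are hypothetical.

```python
from typing import List, Tuple


def find_stale_workers(ps_output: str, parent_pid: int,
                       seconds_since_reload: int) -> List[Tuple[int, int]]:
    """Return (pid, elapsed_seconds) for children older than the last reload.

    ps_output is assumed to be the text printed by
    `ps -eo pid,ppid,etimes,args`.
    """
    stale = []
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 3)
        # Skip the header row and anything that does not start with three numbers.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        pid, ppid, elapsed = int(fields[0]), int(fields[1]), int(fields[2])
        # Only children of the Apache parent that predate the reload are of interest.
        if ppid == parent_pid and elapsed > seconds_since_reload:
            stale.append((pid, elapsed))
    return stale
```

With the parent PID 13559 from the listing above and a reload six days ago (518400 seconds), any child reported by this function has survived the graceful restart and is a candidate for closer inspection.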

The server configuration is simple and default enough to not be included here (for now – if you need any specific information, do ask). The only interesting feature about it is that it's running mod_jk, i.e. the various VirtualHost directives have JkMount /* workername (where workername is defined in /etc/libapache2-mod-jk/workers.properties). mod_jk uses ajp13 to connect to one of two load balanced application servers running Tomcat.

This is not the first time that we've had a stuck Apache2 worker, but I've never been able to determine the reason it ended up that way. It might be something to do with mod_jk and the (very) legacy Java applications, with there possibly being some request that caused a rare edge case bug at the Java/mod_jk level and is preventing the worker from ever being able to exit via the USR1 signal. The Java application logs are not under my control; they're extremely verbose with pointless information, often missing timestamps, and are practically useless for troubleshooting purposes, unless perhaps if you happened to be looking at them at the exact moment when the error occurred.

I'll have to do a non-graceful restart even though it causes a minor production outage, but I'd be interested in further methods to debug this issue in the future so that we could get the workers to always reliably perform a graceful restart when needed. How can we better analyze what makes a worker stuck like this?
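To confirm whether an old-generation worker is still accepting new connections, one option is to open many fresh TLS connections and tally the certificate fingerprints that come back. This is a minimal sketch using only the Python standard library; the function names `fetch_cert_fingerprint` and `sample_fingerprints` are hypothetical, and certificate verification is deliberately disabled because only the fingerprint is of interest here.

```python
import hashlib
import socket
import ssl
from collections import Counter
from typing import Callable


def fetch_cert_fingerprint(host: str, port: int = 443,
                           timeout: float = 5.0) -> str:
    """Open one new TLS connection and return the SHA-256 of the server cert."""
    ctx = ssl.create_default_context()
    # Verification is turned off on purpose: an expired or replaced cert
    # should still be fingerprinted rather than rejected.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def sample_fingerprints(fetch: Callable[[], str],
                        attempts: int = 200) -> Counter:
    """Call fetch() repeatedly and count how often each fingerprint appears."""
    return Counter(fetch() for _ in range(attempts))
```

For example, `sample_fingerprints(lambda: fetch_cert_fingerprint("vhost1.domain.example"))` returning more than one distinct fingerprint would suggest that a process from the old configuration generation is still accepting connections.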


The results of apachectl -V:

Server version: Apache/2.4.38 (Debian)
Server built:   2021-12-21T16:50:43
Server's Module Magic Number: 20120211:84
Server loaded:  APR 1.6.5, APR-UTIL 1.6.1
Compiled using: APR 1.6.5, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

apachectl -S with production subdomains redacted:

VirtualHost configuration:
*:443                  is a NameVirtualHost
         default server vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example-ssl.conf:2)
         port 443 namevhost vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example-ssl.conf:2)
         port 443 namevhost vhost2.domain.example (/etc/apache2/sites-enabled/01_vhost2.domain.example-ssl.conf:2)
         port 443 namevhost vhost3.domain.example (/etc/apache2/sites-enabled/02_vhost3.domain.example-ssl.conf:2)
         port 443 namevhost vhost4.domain.example (/etc/apache2/sites-enabled/03_vhost4.domain.example-ssl.conf:2)
                 alias alias1.domain.example
*:80                   is a NameVirtualHost
         default server vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example.conf:1)
         port 80 namevhost vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example.conf:1)
         port 80 namevhost vhost2.domain.example (/etc/apache2/sites-enabled/01_vhost2.domain.example.conf:1)
                 alias 172.16.33.63
         port 80 namevhost vhost3.domain.example (/etc/apache2/sites-enabled/02_vhost3.domain.example.conf:1)
         port 80 namevhost vhost4.domain.example (/etc/apache2/sites-enabled/03_vhost4.domain.example.conf:1)
                 alias alias1.domain.example
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www/html"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex default: dir="/var/run/apache2/" mechanism=default 
Mutex watchdog-callback: using_defaults
Mutex rewrite-map: using_defaults
Mutex ssl-stapling-refresh: using_defaults
Mutex ssl-stapling: using_defaults
Mutex proxy: using_defaults
Mutex ssl-cache: using_defaults
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33
Score:1

There have been several issues with third-party modules preventing httpd from gracefully restarting child processes, sometimes to the point of making the server unresponsive.

I saw similar behaviour with mod_weblogic (mod_wl_24.so), and others have reported something similar with mod_security. The Apache HTTPD developers have tried to work around this apparently faulty behaviour in third-party modules.

For example, the 2.4.4x releases contain several fixes for this. I would upgrade to at least 2.4.49 and try again. Check https://downloads.apache.org/httpd/CHANGES_2.4 and search for "mpm_event".

In 2.4.49
 *) mpm_event: Fix children processes possibly not stopped on graceful
     restart.  PR 63169.  [Joel Self <joelself gmail.com>]
 *) mpm_event: Fix graceful stop/restart of children processes if connections
     are in lingering close for too long.  [Yann Ylavic]

In 2.4.47
 *) mpm_event: Don't reset connections after lingering close, restoring prior
     to 2.4.28 behaviour.  [Yann Ylavic]

It is also possible that you would get better results with mod_proxy_ajp if you can't upgrade right away.

In short: if you are using third-party modules, try upgrading to at least 2.4.49.
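A quick way to check a host against that threshold is to parse the first line of `apachectl -V`. This is a minimal Python sketch with hypothetical helper names; it only compares version numbers, so treat the result as a rough indicator, since a distribution package may carry backported patches that the upstream version number does not reflect.

```python
import re
from typing import Optional, Tuple


def parse_httpd_version(apachectl_v_output: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from text such as 'Apache/2.4.38 (Debian)'."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", apachectl_v_output)
    if match is None:
        return None
    major, minor, patch = (int(g) for g in match.groups())
    return (major, minor, patch)


def has_graceful_restart_fixes(version: Tuple[int, int, int],
                               minimum: Tuple[int, int, int] = (2, 4, 49)) -> bool:
    """True if the version is at least the 2.4.49 threshold named above."""
    return version >= minimum
```

For the server in the question, `parse_httpd_version("Server version: Apache/2.4.38 (Debian)")` yields `(2, 4, 38)`, which is below the threshold.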


Edit:

I previously forgot to mention a workaround to tide you over until you can upgrade or change the module: force httpd to never gracefully restart any child process. How?

Set MaxSpareThreads to the same value as MaxRequestWorkers, and set MaxConnectionsPerChild to 0 to allow unlimited connections per child, so that httpd won't try to gracefully shut children down.

Example:

MaxSpareThreads 500
MaxRequestWorkers 500
MaxConnectionsPerChild 0

And never issue a graceful restart (a "reload", as some distros call it); use only a restart or a full stop/start.

JK Laiho
Thanks, good to know. We're using the stock Apache version from Debian 10, but a Debian 12 upgrade is looming with Apache 2.4.55 included, so we might be able to make do until then.