What's causing a single Apache2 worker (using mod_jk) to not reload for weeks?

Question

Score:0

Server

What's causing a single Apache2 worker (using mod_jk) to not reload for weeks?

JK Laiho

1/30/24, 1:24 PM

I've got a Debian 10 server running Apache2 2.4.38. Recently, I replaced the SSL certificate file used by all of the configured HTTPS vhosts and ran systemctl reload apache2.service, which runs /usr/sbin/apachectl graceful via the systemd unit file.

According to the Apache 2 docs,

The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.

On a busy server, it's understandable that delays in a child process being replaced will happen. However, today I noticed that rarely the server will still respond with the old, pre-replacement SSL cert. I went and had a look at the processes:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     13559  0.0  0.2  15640 10356 ?        Ss    2022  18:53 /usr/sbin/apache2 -k start
www-data 16834  0.7  0.6 1232452 27780 ?       Sl   06:00   3:47 /usr/sbin/apache2 -k start
www-data 17415  0.9  0.6 1231844 26532 ?       Sl   10:22   2:32 /usr/sbin/apache2 -k start
www-data 17552  0.7  0.6 1231736 26376 ?       Sl   10:53   1:47 /usr/sbin/apache2 -k start
www-data 17612  0.6  0.6 1232000 26840 ?       Sl   10:54   1:34 /usr/sbin/apache2 -k start
www-data 17641  0.6  0.5 1231980 22732 ?       Sl   10:54   1:36 /usr/sbin/apache2 -k start
www-data 17642  0.8  0.6 1231848 24728 ?       Sl   10:54   1:59 /usr/sbin/apache2 -k start
www-data 26704  0.5  0.6 1232216 24748 ?       Sl   Jan18  89:53 /usr/sbin/apache2 -k start

The last process in the list sticks out like a sore thumb on the START and TIME columns. I ran the reload command on Jan 24, but now, six days later, this one process is still going. The server is responding to requests just fine, though – it's unknown if this one worker is actually serving new requests or not, but the others are.

The server configuration is simple and default enough to not be included here (for now – if you need any specific information, do ask). The only interesting feature about it is that it's running mod_jk, i.e. the various VirtualHost directives have JkMount /* workername (where workername is defined in /etc/libapache2-mod-jk/workers.properties). mod_jk uses ajp13 to connect to one of two load balanced application servers running Tomcat.

This is not the first time that we've had a stuck Apache2 worker, but I've never been able to determine the reason it ended up that way. It might be something to do with mod_jk and the (very) legacy Java applications, with there possibly being some request that caused a rare edge case bug at the Java/mod_jk level and is preventing the worker from ever being able to exit via the USR1 signal. The Java application logs are not under my control; they're extremely verbose with pointless information, often missing timestamps, and are practically useless for troubleshooting purposes, unless perhaps if you happened to be looking at them at the exact moment when the error occurred.

I'll have to do a non-graceful restart even though it causes a minor production outage, but I'd be interested in further methods to debug this issue in the future so that we could get the workers to always reliably perform a graceful restart when needed. How can we better analyze what makes a worker stuck like this?

The results of apachectl -V:

Server version: Apache/2.4.38 (Debian)
Server built:   2021-12-21T16:50:43
Server's Module Magic Number: 20120211:84
Server loaded:  APR 1.6.5, APR-UTIL 1.6.1
Compiled using: APR 1.6.5, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

apachectl -S with production subdomains redacted:

VirtualHost configuration:
*:443                  is a NameVirtualHost
         default server vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example-ssl.conf:2)
         port 443 namevhost vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example-ssl.conf:2)
         port 443 namevhost vhost2.domain.example (/etc/apache2/sites-enabled/01_vhost2.domain.example-ssl.conf:2)
         port 443 namevhost vhost3.domain.example (/etc/apache2/sites-enabled/02_vhost3.domain.example-ssl.conf:2)
         port 443 namevhost vhost4.domain.example (/etc/apache2/sites-enabled/03_vhost4.domain.example-ssl.conf:2)
                 alias alias1.domain.example
*:80                   is a NameVirtualHost
         default server vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example.conf:1)
         port 80 namevhost vhost1.domain.example (/etc/apache2/sites-enabled/00_vhost1.domain.example.conf:1)
         port 80 namevhost vhost2.domain.example (/etc/apache2/sites-enabled/01_vhost2.domain.example.conf:1)
                 alias 172.16.33.63
         port 80 namevhost vhost3.domain.example (/etc/apache2/sites-enabled/02_vhost3.domain.example.conf:1)
         port 80 namevhost vhost4.domain.example (/etc/apache2/sites-enabled/03_vhost4.domain.example.conf:1)
                 alias alias1.domain.example
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www/html"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex default: dir="/var/run/apache2/" mechanism=default 
Mutex watchdog-callback: using_defaults
Mutex rewrite-map: using_defaults
Mutex ssl-stapling-refresh: using_defaults
Mutex ssl-stapling: using_defaults
Mutex proxy: using_defaults
Mutex ssl-cache: using_defaults
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33

103

1 + 0

mod-jk

apache-2.4

What's causing a single Apache2 worker (using mod_jk) to not reload for weeks?

Post an answer