Score:0

systemd oneshot service stops working after a few weeks

I have a systemd service that runs on shutdown. It runs a script that purges all of the Docker images stored on the Ubuntu 22.04 AWS VM if the host is running low on disk space.
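
The script itself isn't really part of the question. Roughly, it behaves like this hypothetical sketch (the threshold and the handling of the "lowdisk" argument are illustrative, not the real code):

#!/bin/bash
# /usr/local/sbin/purge-docker.sh (illustrative sketch only)
# With the "lowdisk" argument, purge Docker data only when free space on / is low.
MODE="$1"
FREE_KB=$(df --output=avail / | tail -n 1 | tr -d '[:space:]')
THRESHOLD_KB=$((5 * 1024 * 1024))   # 5 GiB, made-up threshold

if [ "$MODE" = "lowdisk" ] && [ "$FREE_KB" -ge "$THRESHOLD_KB" ]; then
    echo "Plenty of free space, not purging docker files"
    exit 0
fi

# Remove all images, stopped containers, unused volumes and the build cache.
docker system prune --all --volumes --force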

# cat /lib/systemd/system/purge-docker.service
[Unit]
Description=Purge all Docker files on reboot if we're running low on disk space
After=syslog.service network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/local/sbin/purge-docker.sh lowdisk
Restart=on-failure
RestartSec=1s

[Install]
WantedBy=multi-user.target

The service is started and enabled. It runs for a few weeks, successfully logging output to the system journal:

-- Boot c39eb40835574e229dddde806da40539 --
Jun 11 15:30:43 i-023416ba5deadbeef systemd[1]: Finished Purge all Docker files on reboot if we're running low on disk space.
Jun 11 15:33:25 i-023416ba5deadbeef systemd[1]: Stopping Purge all Docker files on reboot if we're running low on disk space...
Jun 11 15:33:25 i-023416ba5deadbeef purge-docker.sh[35434]: Plenty of free space, not purging docker files
Jun 11 15:33:27 i-023416ba5deadbeef systemd[1]: purge-docker.service: Deactivated successfully.
Jun 11 15:33:27 i-023416ba5deadbeef systemd[1]: Stopped Purge all Docker files on reboot if we're running low on disk space.

It works fine for a few weeks, doing exactly what it's supposed to do when the AWS instance is rebooted or powered off, then stops working and stops logging anything to the system journal. systemctl status shows:

# systemctl status purge-docker
○ purge-docker.service - Purge all Docker files on reboot if we're running low on disk space
     Loaded: loaded (/lib/systemd/system/purge-docker.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

So it's enabled but inactive. Since it's enabled, I'd expect it to run again the next time the VM is booted, except that it never starts again. The symlink for enabled services, /etc/systemd/system/multi-user.target.wants/purge-docker.service, has disappeared, and that symlink is what causes systemd to start enabled services at boot.
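
(For reference, comparing what systemd reports against what is actually on disk looks something like this; is-enabled is derived from the symlinks, ls shows the link itself:)

# systemctl is-enabled purge-docker
# ls -l /etc/systemd/system/multi-user.target.wants/purge-docker.service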

If I type systemctl enable purge-docker it re-adds the missing link to the service file:

# systemctl enable purge-docker
Created symlink /etc/systemd/system/multi-user.target.wants/purge-docker.service → /lib/systemd/system/purge-docker.service.

Afterwards status still shows the same thing (that it's enabled), except now it actually is enabled:

# systemctl status purge-docker
○ purge-docker.service - Purge all Docker files on reboot if we're running low on disk space
     Loaded: loaded (/lib/systemd/system/purge-docker.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

More weirdness: If I disable then enable the service, disable deletes a different symlink (shutdown.target.wants) than enable creates (multi-user.target.wants):

# systemctl disable purge-docker
Removed /etc/systemd/system/shutdown.target.wants/purge-docker.service.
# systemctl enable purge-docker
Created symlink /etc/systemd/system/multi-user.target.wants/purge-docker.service → /lib/systemd/system/purge-docker.service.

My questions are:

  • Why is the symlink disappearing?
  • Why does systemd say that the service is enabled even though the symlink is missing?
  • Why does a shutdown.target.wants symlink appear on VMs when this breaks?
Score:0

Why is the symlink disappearing?

Something is deleting it.

If Ubuntu has kernel audit support, add an audit rule to track any changes to the file – either using auditctl directly (non-persistent) or adding it to /etc/audit/audit.rules (persistent):

-a always,exit -F path=/etc/systemd/system/multi-user.target.wants/purge-docker.service -S all

Your dmesg (or the journal, under _TRANSPORT=audit) will then have entries for every syscall that touches this file, including the offending program's command line. The following one-liner is useful for decoding the hex-encoded proctitle= field in those messages, though:

perl -pe 's/(?<=proctitle=)([0-9A-F]+)/join(" ", map {"\"$_\""} split("\0", $1 =~ s![0-9A-F]{2}!chr hex $&!ger))/ge'
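
For example, to load the rule for the current boot only and later pull out whatever touched the link (the -k purge-docker key is just an arbitrary label added here so the records are easy to search for):

# auditctl -a always,exit -F path=/etc/systemd/system/multi-user.target.wants/purge-docker.service -S all -k purge-docker
# ausearch -k purge-docker -i

ausearch -i already interprets the hex-encoded fields (including proctitle), so the perl one-liner above is mainly needed when reading the raw dmesg/journal output.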

Why does systemd say that the service is enabled even though the symlink is missing?

The check is based on whether any unit (including targets) has a symlink for that service under its .wants/ or .requires/ directory in /etc/systemd/system, not necessarily on whether those links exactly match the [Install] section.
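
That is also why the stray shutdown.target.wants link is enough to keep the unit reporting "enabled". To see every link systemd is counting, and to reset them to exactly what the [Install] section asks for, something like this works (reenable is just disable followed by enable, and disable removes all existing symlinks for the unit first):

# ls -l /etc/systemd/system/*/purge-docker.service
# systemctl reenable purge-docker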

Why does a shutdown.target.wants symlink appear on VMs when this breaks?

I don't know, but my wild guess would be that a coworker moves it there.


(Also, your custom unit files should preferably live in /etc/systemd/system, not in /lib/systemd/system. Generally /lib is "package manager" territory.)
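
(Roughly, and assuming nothing else references the unit at its old path:)

# systemctl disable purge-docker
# mv /lib/systemd/system/purge-docker.service /etc/systemd/system/
# systemctl daemon-reload
# systemctl enable purge-docker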

Good idea on `auditctl` and using `/etc` not `/lib`. Still haven't found the root cause but the problem seems to have stopped happening.

Unlikely that a coworker moved the file. The VMs in question are about 80 AWS VMs used for automated testing; they're built from the same image and maintained/updated via Ansible playbooks. The problem was happening on many machines, almost no one logs into them other than the automated tests, and the test code doesn't run with enough sudo privileges to make system changes. Since the tests fail if the disks fill up with Docker images, moving the file would just break the tests of whoever logged in to move it.