Score:0

Assign systemd slice to a specific cset automatically


On Debian under systemd, by default KVM virtual machines under libvirt get assigned to the "machine.slice" slice.

If I then add a cpuset for this slice with cset and some custom set of CPUs, and start a VM, the VM is added to the proper cpuset, i.e.

user@host ~ $ sudo cset set --list --recurse
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-31 y       0 y   610    1 /
 machine.slice 2-15,18-31 n       0 n     0    1 /machine.slice
 machine-qemu\x2d1\x2dweb1.scope 2-15,18-31 n       0 n     0    5 /ma....scope
        vcpu1 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu1
        vcpu2 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu2
        vcpu0 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu0
     emulator 2-15,18-31 n       0 n    82    0 /machine.sli...1.scope/emulator
        vcpu3 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu3

What I'm trying to do is replicate this behaviour with a separate slice and cpuset. However, it doesn't seem to work.

First I create the cset:

user@host ~ $ sudo cset set -c 0-1,16-17 osd.slice
cset: --> created cpuset "osd.slice"

Then I configure the service to use the slice:

user@host ~ $ diff -u /lib/systemd/system/ceph-osd@.service /etc/systemd/system/ceph-osd@.service
--- /lib/systemd/system/ceph-osd@.service       2021-05-27 06:04:21.000000000 -0400
+++ /etc/systemd/system/ceph-osd@.service       2022-11-08 17:20:32.515087642 -0500
@@ -6,6 +6,7 @@
 Wants=network-online.target local-fs.target time-sync.target remote-fs-pre.target ceph-osd.target
 
 [Service]
+Slice=osd.slice
 LimitNOFILE=1048576
 LimitNPROC=1048576
 EnvironmentFile=-/etc/default/ceph

Then I start one of the services. If I check the service status, I do see that it's in the right slice/cgroup:

user@host ~ $ systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/etc/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-11-08 17:22:32 EST; 1s ago
    Process: 251238 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
   Main PID: 251245 (ceph-osd)
      Tasks: 25
     Memory: 29.5M
        CPU: 611ms
     CGroup: /osd.slice/ceph-osd@0.service
             └─251245 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

And just as a sanity check, if I look at the VM's transient scope, it looks basically the same:

$ systemctl status machine-qemu\\x2d1\\x2dweb1.scope 
● machine-qemu\x2d1\x2dweb1.scope - Virtual Machine qemu-1-web1
     Loaded: loaded (/run/systemd/transient/machine-qemu\x2d1\x2dweb1.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2022-11-08 17:03:57 EST; 22min ago
      Tasks: 87 (limit: 16384)
     Memory: 1.7G
        CPU: 4min 33.514s
     CGroup: /machine.slice/machine-qemu\x2d1\x2dweb1.scope
             └─234638 /usr/bin/kvm -name guest=web1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-web1/master-key.aes -machine pc-i440fx-2.7,accel=kvm,usb=off,dump-guest-core=off,memory-ba>

However, and this is where I'm stuck: if I then check cset again, the tasks are not assigned to the slice's cset as I would expect; they remain in the root cset instead, and the slice's cset has 0 tasks and 0 subs:

user@host ~ $ sudo cset set --list --recurse
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-31 y       0 y   622    2 /
    osd.slice  0-1,16-17 n       0 n     0    0 /osd.slice

I can see nothing obvious about how machine.slice achieves this: there is no reference to the cpuset in the actual machine.slice unit file, nor anything in the transient scope units.

How can I get this new, custom slice/cgroup to emulate what machine.slice is doing, and force anything under it into this cpuset?

As an addendum for the "why"/X-to-my-Y: I've tried spawning the ceph-osd process inside the cset manually with cset proc --exec, but this doesn't work reliably (sometimes it fails outright with "cannot move"), and even when it does work, the process's threads end up stuck in the root cset afterwards even though the main process is moved. So it seems I need a way to make systemd treat the entire unit as part of the cset before the actual process ever starts (unlike cset proc, which spawns the process, forks it, then moves it), which looks like what machine.slice is doing here.
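The thread problem above comes from the fact that, in the Linux process model, each thread is a separate task with its own TID visible under /proc/&lt;pid&gt;/task, and migrating a single TID into a v1 cpuset (via the per-task "tasks" file) does not drag the other threads along. A small, self-contained Python sketch (hypothetical, just to show the per-thread view; the thread count is arbitrary):

```python
# Each thread is its own kernel task: moving only the main PID into a
# cgroup-v1 cpuset leaves the remaining TIDs where they were.
import os
import threading

def worker(stop):
    stop.wait()  # park the thread so its TID stays visible

stop = threading.Event()
threads = [threading.Thread(target=worker, args=(stop,)) for _ in range(3)]
for t in threads:
    t.start()

# Every thread of this process shows up as a directory under /proc/<pid>/task.
tids = sorted(int(d) for d in os.listdir(f"/proc/{os.getpid()}/task"))
print(len(tids))  # main thread + 3 workers

stop.set()
for t in threads:
    t.join()
```

This is why moving a unit's cgroup membership before the process starts (as systemd does) sidesteps the problem entirely: every task the process later creates inherits the cgroup.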

Score:0
cn flag

I ended up abandoning cset as the way to do this. The fact that it requires the old v1 cgroup hierarchy and hasn't been significantly updated in years played a major part in that, as did this bug in particular, which pushed me to look more closely at systemd's own options.

I then found systemd's built-in AllowedCPUs directive (a cpuset control on the cgroup v2 hierarchy), which does exactly what I wanted, especially when applied at the slice level.

Going this way, I created drop-in slice units in /etc/systemd/system for each of the subsystems I wanted to isolate (system.slice for the majority of tasks, osd.slice for my OSD processes, and machine.slice for the VMs), each setting AllowedCPUs to the desired CPU range, and enabled Delegate to be sure. One reboot later, as far as I can tell it's working exactly as intended.
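For illustration, a minimal drop-in along these lines (the file name is an assumption; the CPU range mirrors the cset from the question, and this relies on the unified cgroup v2 hierarchy):

```
# /etc/systemd/system/osd.slice.d/override.conf
[Slice]
AllowedCPUs=0-1,16-17
```

After a systemctl daemon-reload and restarting the affected units (or a reboot), the result can be checked with systemctl show -p EffectiveCPUs osd.slice, or by reading /sys/fs/cgroup/osd.slice/cpuset.cpus.effective directly.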
