On Debian under systemd, KVM virtual machines managed by libvirt are assigned to the "machine.slice" slice by default.
If I then add a cpuset for this slice with cset and some custom set of CPUs, and start a VM, the VM is added to the proper cpuset, i.e.
user@host ~ $ sudo cset set --list --recurse
cset:
Name                            CPUs-X       MEMs-X    Tasks Subs Path
------------------------------- ---------- - ------- - ----- ---- ----------
root                            0-31       y 0       y   610    1 /
machine.slice                   2-15,18-31 n 0       n     0    1 /machine.slice
machine-qemu\x2d1\x2dweb1.scope 2-15,18-31 n 0       n     0    5 /ma....scope
vcpu1                           2-15,18-31 n 0       n     1    0 /machine.sli...web1.scope/vcpu1
vcpu2                           2-15,18-31 n 0       n     1    0 /machine.sli...web1.scope/vcpu2
vcpu0                           2-15,18-31 n 0       n     1    0 /machine.sli...web1.scope/vcpu0
emulator                        2-15,18-31 n 0       n    82    0 /machine.sli...1.scope/emulator
vcpu3                           2-15,18-31 n 0       n     1    0 /machine.sli...web1.scope/vcpu3
What I'm trying to do is replicate this behaviour with a separate slice and cpuset. However, it doesn't seem to work.
First I create the cset:
user@host ~ $ sudo cset set -c 0-1,16-17 osd.slice
cset: --> created cpuset "osd.slice"
Then I configure the service I want to run in the slice:
user@host ~ $ diff -u /lib/systemd/system/ceph-osd@.service /etc/systemd/system/ceph-osd@.service
--- /lib/systemd/system/ceph-osd@.service 2021-05-27 06:04:21.000000000 -0400
+++ /etc/systemd/system/ceph-osd@.service 2022-11-08 17:20:32.515087642 -0500
@@ -6,6 +6,7 @@
Wants=network-online.target local-fs.target time-sync.target remote-fs-pre.target ceph-osd.target
[Service]
+Slice=osd.slice
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/default/ceph
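For reference, the same Slice= setting could also be applied with a drop-in instead of a full copy of the unit file. The following is only a sketch: the helper name write_slice_dropin, the file name slice.conf, and the optional third argument (a config root, used so the helper can be tried outside /etc) are my own assumptions for illustration, not anything Ceph or systemd ships.

```shell
# Hypothetical helper: write a drop-in that sets Slice= for a unit.
#   $1 = unit name (a template such as ceph-osd@.service covers all instances)
#   $2 = slice name
#   $3 = systemd config root (assumed default: /etc/systemd/system)
write_slice_dropin() {
    unit="$1"
    slice="$2"
    root="${3:-/etc/systemd/system}"
    dir="$root/$unit.d"

    # Drop-ins live in a "<unit>.d" directory next to the unit files.
    mkdir -p "$dir" || return 1
    printf '[Service]\nSlice=%s\n' "$slice" > "$dir/slice.conf"
}

# Usage (as root), followed by a reload so systemd rereads its units:
#   write_slice_dropin ceph-osd@.service osd.slice && systemctl daemon-reload
```

This only changes how the setting is delivered; it should not behave any differently from the edited copy shown in the diff.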
Then I start one of the services. If I check the service status, I do see that it's in the right slice/cgroup:
user@host ~ $ systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/etc/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2022-11-08 17:22:32 EST; 1s ago
Process: 251238 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Main PID: 251245 (ceph-osd)
Tasks: 25
Memory: 29.5M
CPU: 611ms
CGroup: /osd.slice/ceph-osd@0.service
└─251245 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
And just for sanity, if I check the VM transient service, it looks basically the same:
$ systemctl status machine-qemu\\x2d1\\x2dweb1.scope
● machine-qemu\x2d1\x2dweb1.scope - Virtual Machine qemu-1-web1
Loaded: loaded (/run/systemd/transient/machine-qemu\x2d1\x2dweb1.scope; transient)
Transient: yes
Active: active (running) since Tue 2022-11-08 17:03:57 EST; 22min ago
Tasks: 87 (limit: 16384)
Memory: 1.7G
CPU: 4min 33.514s
CGroup: /machine.slice/machine-qemu\x2d1\x2dweb1.scope
└─234638 /usr/bin/kvm -name guest=web1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-web1/master-key.aes -machine pc-i440fx-2.7,accel=kvm,usb=off,dump-guest-core=off,memory-ba>
However, and this is where I'm stuck: if I then check cset again, the "tasks" are not assigned to the slice cset as I would expect; they are part of the root cset instead, and the slice cset has 0 tasks and 0 subs:
user@host ~ $ sudo cset set --list --recurse
cset:
Name         CPUs-X       MEMs-X    Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root         0-31       y 0       y   622    2 /
osd.slice    0-1,16-17  n 0       n     0    0 /osd.slice
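One way to cross-check this independently of cset is to read the kernel's own record of the process in /proc/<pid>/cgroup. On cgroup v1, the path that systemctl status prints comes from systemd's own hierarchy, while cset works on the separate cpuset hierarchy, so the two can disagree. The helper below is a minimal sketch (the function name cpuset_path is made up for illustration); it reads lines in the "id:controllers:path" format and prints the cpuset path, falling back to the unified cgroup v2 entry if there is no cpuset line.

```shell
# Read /proc/<pid>/cgroup-formatted lines ("id:controllers:path") on stdin
# and print the path for the cpuset controller (cgroup v1). If there is no
# cpuset line, fall back to the unified "0::path" entry (cgroup v2).
cpuset_path() {
    unified=""
    while IFS=: read -r id controllers path; do
        # controllers is a comma-separated list, e.g. "cpu,cpuacct".
        case ",$controllers," in
            *,cpuset,*)
                printf '%s\n' "$path"
                return 0
                ;;
        esac
        if [ "$id" = "0" ] && [ -z "$controllers" ]; then
            unified="$path"
        fi
    done
    if [ -n "$unified" ]; then
        printf '%s\n' "$unified"
        return 0
    fi
    return 1
}

# Usage, with the ceph-osd main PID from the status output:
#   cpuset_path < /proc/251245/cgroup
```

If this prints "/" for the OSD process while systemctl status shows /osd.slice/..., that would match what cset is reporting.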
I can see nothing obvious about how machine.slice is doing this: there is no reference to it in the actual machine.slice unit file, nor anything in the transient scope units.
How can I get this new, custom slice/cgroup to emulate what machine.slice is doing, and force anything under it into this cpuset?
As an addendum for the "why"/X-to-my-Y: I've tried spawning the ceph-osd process in the cset manually using the cset proc --exec command, but this doesn't work reliably (sometimes it just fails entirely with "cannot move"), and even when it does work, its threads end up stuck in the root cset afterwards even though the main process is moved. So it seems that I need a way to make systemd treat the entire unit as part of the cset before the actual process ever starts (unlike the cset proc command, which spawns it, forks it, then alters it), which looks like what is done with machine.slice here.
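For anyone wanting to reproduce the thread observation, the per-thread CPU affinity can be read from the Cpus_allowed_list field in /proc/<pid>/task/<tid>/status. This is a rough sketch of how I would check it; the function name thread_affinity and its optional second argument (an alternative proc root, only there to make the helper easy to try against a fake directory tree) are assumptions for illustration.

```shell
# Print "<tid> <allowed CPU list>" for every thread of a process.
#   $1 = PID
#   $2 = proc root (assumed default: /proc)
thread_affinity() {
    pid="$1"
    proc="${2:-/proc}"
    for status in "$proc/$pid"/task/*/status; do
        # Skip the unexpanded glob or threads that exited meanwhile.
        [ -r "$status" ] || continue
        tid="${status%/status}"
        tid="${tid##*/}"
        cpus="$(awk '/^Cpus_allowed_list:/ { print $2 }' "$status")"
        printf '%s %s\n' "$tid" "$cpus"
    done
}

# Usage, with the ceph-osd main PID:
#   thread_affinity 251245
```

Threads that were left behind in the root cpuset would be expected to show the full CPU range here rather than the restricted set.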