Score:2

Isolating CPUs on AWS/GCP: error mounting cpuset

ng flag

I have two 32 vCPU instances on AWS/GCP. I'm trying to set up cpu shielding so that CPUs 0, 1 are used by the system, and cpus 2-31 are shielded and only used explicitly by userspace threads.

System info:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:    22.04
Codename:   jammy
$ cat /proc/filesystems | grep cpuset
nodev   cpuset

However, when I try to run cset shield, I get an error to do with mounts:

mount: /cpusets: none already mounted on /run/credentials/systemd-sysusers.service.
cset: **> mount of cpuset filesystem failed, do you have permission?

I've dug a bit into the cset code, and it seems like the failing call is one to

$ sudo mount -t cpuset cpuset /cpusets
mount: /cpusets: cpuset already mounted or mount point busy.

/cpusets is a newly created folder, and $ cat /proc/mounts | grep cpuset is empty -- so cpuset doesn't seem to be mounted elsewhere.

Maybe relevant:

$ cat /proc/mounts | grep cgroup
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

My guess is that AWS/GCP use cpuset for the hypervisor, or something like that. Is it possible to isolate cpus on AWS/GCP? How can I go about it?

br flag
Out of interest - "I'm trying to set up cpu shielding so that CPUs 0, 1 are used by the system, and cpus 2-31 are shielded and only used explicitly by userspace threads" - why?
Score:2
fr flag
anx

You are using systemd, which has already mounted the v2 ("unified") cgroup hierarchy, so it's not you managing the control groups, it's systemd. Tell it what you want via CPUAffinity= and related options in the [Manager] section of a /etc/systemd/system.conf.d/50-my-cpuset-options.conf file. You can then set CPUAffinity= (empty to reset, non-empty to add) in the specific unit.service files you wish to exempt from the global default.
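As a rough sketch of that drop-in (assuming the CPU list 0,1 from the question; the helper name write_affinity_dropin is made up for illustration and is not part of systemd):

```shell
# Write a [Manager] drop-in that sets the default CPU affinity for systemd
# itself and, by inheritance, for the processes it spawns.
#   $1 = target directory, $2 = CPU list such as "0,1"
# Hypothetical helper, named here only for illustration.
write_affinity_dropin() {
    dir=$1
    cpus=$2
    mkdir -p "$dir" || return 1
    printf '[Manager]\nCPUAffinity=%s\n' "$cpus" > "$dir/50-my-cpuset-options.conf"
}
```

On a real machine this would be run as root, for example `write_affinity_dropin /etc/systemd/system.conf.d 0,1`, followed by a reboot. A unit that should escape the default would carry its own CPUAffinity= line in its [Service] section.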

You can even use systemd APIs to transiently (until reboot) modify resource options on already-running services via the systemctl --runtime set-property example.service ExampleOption=Value command. Use that to confirm the resulting cgroup settings and measure how they affect your system's performance. I imagine you will see measurably better reliability under congestion if, instead of setting global defaults that damage the scheduler's ability to utilize 100% of the CPU, you let it use its full abilities. Match your priorities more closely using Nice= and IOSchedulingClass= on those specific low-priority asynchronous background tasks that you do want running but do not want to impact the rest of the system, and leave the affinity sledgehammer unused.
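A minimal sketch of such a transient change, assuming AllowedCPUs= (a cgroup v2 cpuset setting described in systemd.resource-control) as the option, since the answer leaves it unnamed. The restrict_unit helper and the SYSTEMCTL override are hypothetical, added only so the command can be dry-run:

```shell
# Restrict a unit or slice to a CPU list until the next reboot.
#   $1 = unit name (e.g. system.slice), $2 = CPU list (e.g. 0-1)
# AllowedCPUs= is an assumed choice of property; it needs the unified
# cgroup hierarchy and a reasonably recent systemd.
# Set SYSTEMCTL=echo to print the arguments instead of running systemctl.
restrict_unit() {
    unit=$1
    cpus=$2
    ${SYSTEMCTL:-systemctl} --runtime set-property "$unit" "AllowedCPUs=$cpus"
}
```

Run as root, `restrict_unit system.slice 0-1` would confine system services to CPUs 0 and 1. On a cgroup v2 system the result should be visible in /sys/fs/cgroup/system.slice/cpuset.cpus.effective.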


In theory, utilities like cset could be updated to interface with such cgroup2 system managers instead and offer effectively the same abstractions as before, while modifying systemd's system.slice and unit defaults in the background, but this discussion sounds like nobody has done that so far. And since the all-encompassing giant-chunk-of-C offers much richer, well-documented and arguably more versatile control of all the neat things the kernel has learned to do, there may no longer be a need to.

anx
Elaborating on the rant: the invention of isolcpus= and "setting affinity" was, is, and always has been a very strong smell of a serious lack of context separation and/or resource accounting. NUMA was not an excuse because the scheduler should know about it, L3 cache was not an excuse because the scheduler should know how it works, branch prediction was not an excuse because it *should not* work across contexts. Leaving part of the CPU unused is a dirty workaround; resorting to it was proof of missing kernel features back then, and is proof of missing userland software (or of not using those features) now.
ng flag
My use case is that I'm optimizing for the latency of a specific hotpath (and its p99) rather than general throughput. From what I've gathered, "niceness" and cpushares are evaluated every 100ms by default. That can be set to happen more frequently, but then you're dealing with more kernel interrupts. 100ms is orders of magnitude too slow to matter for our use case, so our workaround is to have a core dedicated to processing only our hotpath. The other background compute (both system and userspace work that is not latency-sensitive) can use the full CPUs.
ng flag
In this case, if we want to run all of the system processes on cores 0 & 1 for example, the idea would be to create a file in `/etc/systemd/system.conf.d/whatever_filename.conf`, and just have the `[manager] CPUAffinity=0,1` entry. Would this prevent tasks from being assigned to the other cores?
ng flag
(+ reboot the machine)
anx
Check the uppercase spelling in [Manager], and do call `systemctl daemon-reload` if it won't update live. You can confirm its effect on init by seeing it published in `/proc/1/status`, and its effect as the inherited unit default after restarting the units (or rebooting).
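One way to read that value back, as a small sketch: the kernel publishes a task's affinity in the Cpus_allowed_list field of its status file. The allowed_cpus helper name is made up for illustration.

```shell
# Print the Cpus_allowed_list field from a /proc/<pid>/status-style file.
# Hypothetical helper; $1 is the path to the status file.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "$1"
}

# PID 1 is systemd itself. With CPUAffinity=0,1 in effect, this would be
# expected to show 0-1 rather than the full CPU range.
if [ -r /proc/1/status ]; then
    echo "init: $(allowed_cpus /proc/1/status)"
fi
```

The same function can be pointed at /proc/<pid>/status of any service's main process to check that the default was inherited.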
simonhf
James, did you get this to work in the end somehow? How? And did you try the old 'isolcpus' too? They say they want to deprecate it, but it's still around... :-)
Score:1
bf flag

As mentioned, systemd creates its own cgroup2 hierarchies which, based on my needs, don't play well with cset.

I prepended a kernel parameter to the GRUB_CMDLINE_LINUX_DEFAULT value in /etc/default/grub to disable this behavior on Ubuntu 22.10.

The line will look something like this after you're done:

GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=false <your_other_params>"

After you're done you'll need to run as root:

update-grub
grub-install

and then reboot. After the reboot I was able to successfully shield the CCXs on my Ryzen CPU using cset, as well as migrate all userland and system tasks.
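To check after the reboot that the parameter took effect, a small sketch: the filesystem type mounted at /sys/fs/cgroup distinguishes the unified layout from the legacy one. The cgroup_mode helper name is made up for illustration.

```shell
# Map the filesystem type of /sys/fs/cgroup to the cgroup layout in use.
# Hypothetical helper; $1 is the type string reported by stat.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified (v2)" ;;
        tmpfs)     echo "legacy or hybrid (v1)" ;;
        *)         echo "unknown" ;;
    esac
}

# With systemd.unified_cgroup_hierarchy=false on the kernel command line,
# this is expected to report the legacy or hybrid layout.
if [ -d /sys/fs/cgroup ]; then
    cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
fi
```

`cat /proc/cmdline` shows whether the parameter actually reached the kernel.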

Here are more references to this issue that I came across, in case you want further background:

https://github.com/systemd/systemd/issues/13477#issuecomment-528113009
https://github.com/lxc/lxd/issues/10441

anx
While this still works, early cgroup support was redone in a backwards-incompatible fashion for good reasons. Known deficiencies of the legacy system, while not justifying removal, are going to make this the less preferred way for most new deployments. Both distribution and systemd maintainers have in the past declined, and likely will again decline, to take action (not even updating docs) on **known malfunctions**, because they are just not realistically resolvable in user space and are unlikely to receive future attention from the more limited pool of people qualified to resolve them in the kernel.
Kash
I would definitely like to know more about this. Is there any documentation anywhere you'd recommend? My use case is limited to just making sure my KVM guest has optimal performance, so I don't know too much above and beyond whacking it to make it work :)
anx
You'll find some issues hinted at in the [Microsoft issue tracker](https://github.com/systemd/systemd/issues?q=is%3Aissue+cgroupsv1). The [history section of this man page](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html) specifically notes *some* options that won't work as expected, but it will not state, for every option added later, that it will plainly not work or will behave differently. AFAIK there are no deal-breakers for purely starting single-process privileged processes, yet. Canonical needs to retain basic support until old LTS releases go EoL anyway.