Score:1

Memory ballooning doesn't try to reclaim page cache


I have a host running Proxmox with some VMs on it. Due to some unpredictability with the memory usage of some applications, and wanting to give some VMs (like a DB) memory for the page cache when there was available memory, I have overprovisioned memory.

I tested the reliability of this setup by trying to use more memory in total than the host has, and the VM was OOM killed instead of the page cache of the other VMs being reclaimed.

Is there anything I can do to allow the balloon driver to reclaim memory, or am I misunderstanding how memory ballooning works?

Score:1

Over provisioning memory to VM guests always has the risk of becoming a very bad problem in the event of shifting loads. On top of that, ballooning makes capacity planning more complicated.

First, the balloon itself. A balloon-aware guest reserves the difference between its maximum and current memory, which leaves that memory available to the host. Confirm your guests have the necessary drivers; server Linux distros probably do. Plain Linux KVM requires the user to change the current size, and I assume you have not been adjusting the balloon size manually.

Proxmox is different. pvestatd is capable of automatic ballooning, where it adjusts guest sizes based on their configured memory shares and the available host memory. Find out what the configured shares of the guests are, and read the logs to see what balloon events happened.
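To make the idea of share-based sizing concrete, here is a minimal sketch. It is not Proxmox's actual algorithm; the `Guest` class, the `balloon_targets` function, and the numbers are all hypothetical. It only assumes that each guest has a minimum, a maximum, and a shares weight, and that spare memory is split in proportion to shares.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    # Hypothetical description of a guest; not a Proxmox data structure.
    name: str
    min_mib: int   # memory the guest always keeps
    max_mib: int   # configured maximum
    shares: int    # relative weight for spare memory


def balloon_targets(guests, budget_mib):
    """Give each guest its minimum plus a share-weighted slice of the spare budget."""
    floor = sum(g.min_mib for g in guests)
    spare = max(0, budget_mib - floor)
    total_shares = sum(g.shares for g in guests)
    targets = {}
    for g in guests:
        extra = spare * g.shares // total_shares if total_shares else 0
        # Never hand out more than the configured maximum.
        targets[g.name] = min(g.max_mib, g.min_mib + extra)
    return targets


guests = [Guest("db", 4096, 16384, 2000), Guest("app", 2048, 8192, 1000)]
print(balloon_targets(guests, 12288))  # {'db': 8192, 'app': 4096}
```

With a 12 GiB budget, the guest with twice the shares receives twice the spare memory on top of its minimum. A real implementation also has to react to changing host load over time, which is where the delay discussed in this answer comes from.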

Say a guest is started while the host is at the top end of its memory capacity. The host allocates some GB of memory in the process. Even though Linux memory management is lazy about using physical memory pages, it won't be long until a significant amount of guest memory is referenced. Meanwhile, it will be some time before Proxmox notices and automatically adjusts the balloons. Linux guests in particular can give up caches very quickly, but this ballooning moves at the speed of monitoring tools, not at the speed of the kernel. It is not surprising that the host OS memory management can exhaust its reclaim options and OOM kill.
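To get a feel for how much memory a Linux guest could give back, you can look at the cache-related fields in its /proc/meminfo. The sketch below parses text in that format and produces a rough estimate. The sample values are made up, the helper names are hypothetical, and the formula (Buffers + Cached - Shmem + SReclaimable) is only an approximation of what the kernel can actually drop.

```python
# Made-up sample in /proc/meminfo format; on a real guest, read the file instead.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9000000 kB
Buffers:          256000 kB
Cached:          8192000 kB
Shmem:            192000 kB
SReclaimable:     400000 kB
"""


def parse_meminfo(text):
    """Return a dict mapping each field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def cache_estimate_kb(fields):
    """Rough estimate of droppable cache; Shmem is counted in Cached but cannot simply be dropped."""
    return (fields["Buffers"] + fields["Cached"]
            - fields["Shmem"] + fields["SReclaimable"])


info = parse_meminfo(SAMPLE)
print(cache_estimate_kb(info))  # 8656000
```

On a real guest, replace `SAMPLE` with the contents of `/proc/meminfo`. The kernel's own `MemAvailable` field is usually the better single number, since it also accounts for watermarks.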

A safe option that you won't like: do not over provision guest memory, which also means no balloon. Size each DB according to its fixed shared memory or whatever other algorithm applies. Size the app servers to approximately the maximum expected or observed memory. The extra spend on memory buys predictable performance.

Target both guest and host utilization somewhere below maximum. Maybe 80% utilized, although workloads vary enormously, so what you can get away with will differ. This buffer leaves space for admin things like the kernel's system memory, and the remainder for caches.
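As a quick sanity check of that kind of target, the arithmetic can be sketched as follows. The function name, the 64 GiB host, and the guest sizes are hypothetical, and 0.80 is just the example figure from above, not a recommendation for every workload.

```python
def headroom_report(host_mib, guest_max_mib, target=0.80):
    """Compare the sum of guest maximums against a utilization budget on the host."""
    committed = sum(guest_max_mib)
    budget = int(host_mib * target)
    return {
        "committed_mib": committed,
        "budget_mib": budget,
        "overcommitted": committed > budget,
        "ratio": round(committed / host_mib, 2),
    }


# Hypothetical 64 GiB host with four guests sized at their maximums.
report = headroom_report(65536, [16384, 8192, 8192, 32768])
print(report)
# {'committed_mib': 65536, 'budget_mib': 52428, 'overcommitted': True, 'ratio': 1.0}
```

Here the guest maximums add up to exactly the host's memory, which already exceeds an 80% budget before the host's own needs are counted.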

If you want to over provision, your capacity planning needs to be more sophisticated. Back off the guest sizes until you get a load that will not OOM the host. Adjust guest balloons before you start new guests at the edge of capacity. Study and tune Proxmox's automatic ballooning and test whether it helps.

Nikita Kipriyanov
What I thought when I first saw the question is that it is better not to balloon the database. It reportedly hurts performance.