Answering my own question as I found a resolution.
The OOM kills were happening even when the free stats looked fine: on a host with 256G of RAM only about 140G was used and around 100G still showed up as free.
[root@serverxx ~]# free -g
total used free shared buff/cache available
Mem: 251 140 108 0 2 108
Swap: 19 6 13
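For anyone hitting the same symptom: free can look healthy while the kernel is already heavily overcommitted, because %commit tracks memory that processes have been promised, not what is resident right now. A quick sanity check, assuming a standard /proc/meminfo, is to compare Committed_AS (what sar reports as kbcommit) against CommitLimit:

grep -E 'CommitLimit|Committed_AS' /proc/meminfo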
The OOM kills were triggered by the high %commit seen in the sar stats, where the kernel starts targeting processes with a high memory footprint to free memory up.
To avoid OOM kills of the guest instances with higher memory footprints, I set the following:
vm.oom_kill_allocating_task=1
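A minimal sketch of how this sysctl can be applied and persisted (the drop-in file name here is just an example, any name under /etc/sysctl.d works):

sysctl -w vm.oom_kill_allocating_task=1
echo 'vm.oom_kill_allocating_task = 1' > /etc/sysctl.d/99-oom.conf
sysctl -p /etc/sysctl.d/99-oom.conf

With this set to 1, the kernel kills the task that triggered the failing allocation instead of scanning for the process with the highest OOM score, so a large qemu-kvm guest is less likely to be picked simply because of its footprint.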
When I ran sar -r, the %commit was far higher than the system could actually satisfy, and from ps I traced it to the cinder-backup container, which kolla-ansible deployments create by default. I had never configured the Cinder backup service; it was just running, and that unconfigured container was gradually eating up all the memory over time, as can be seen in the VSZ column of the ps output.
ps -eo args,comm,pid,ppid,rss,vsz --sort vsz
The VSZ for cinder-backup is extremely high:
COMMAND COMMAND PID PPID RSS VSZ
/usr/libexec/qemu-kvm -name qemu-kvm 1916998 47324 8094744 13747664
/var/lib/kolla/venv/bin/pyt cinder-backup 43689 43544 170999912 870274784
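Since the backup service wasn't actually in use here, the fix was simply to stop the container. Roughly, assuming the usual kolla container name (check docker ps first, the name can differ between releases):

docker ps | grep cinder_backup
docker stop cinder_backup

To keep kolla-ansible from recreating it on the next deploy/reconfigure, the service can also be disabled in /etc/kolla/globals.yml, e.g. enable_cinder_backup: "no" (verify the exact variable for your release).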
Here are the sar stats showing %commit coming back to normal after the backup container was stopped; %commit dropped from 1083.46 to 14.21 after the changes.
02:00:37 PM kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
03:00:37 PM 48843576 49998184 82890508 62.92 9576 5949348 1427280428 1083.46 75646888 2797388 324
03:10:37 PM 48829248 49991284 82904836 62.93 9576 5956544 1427343664 1083.50 75653556 2804592 116
03:20:22 PM 120198612 121445516 11535472 8.76 9576 6042892 18733688 14.22 4887688 2854704 80
03:30:37 PM 120189464 121444176 11544620 8.76 9576 6050200 18725820 14.21 4887752 2862248 88
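To keep an eye on %commit going forward, the historical data can be pulled back out of sysstat's daily files, e.g. on a RHEL-style layout (path and file naming may differ on other distributions):

sar -r -f /var/log/sa/sa$(date +%d)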