I'm testing my application on a bare-metal Kubernetes cluster (version 1.22.1) and having an issue when launching my application as a Job.
My cluster has two nodes (master and worker) but the worker is cordoned. On the master node, 21GB of memory is available for the application.
I tried to launch my application as three different Jobs at the same time. Since I set 16GB of memory as both the resource request and the limit, only a single Job started and the remaining two stayed in a Pending state. I have set backoffLimit: 0 on the Jobs.
app1--1-8pp6l 0/1 Pending 0 42s
app2--1-42ssl 0/1 Pending 0 45s
app3--1-gxgwr 0/1 Running 0 46s
After the first Pod completes, I would expect only one of the two Pending Pods to start while the other stays Pending. Instead, one started and the other went into an OutOfmemory status, even though no container had been started in that Pod.
app1--1-8pp6l 0/1 Running 0 90s
app2--1-42ssl 0/1 OutOfmemory 0 93s
app3--1-gxgwr 0/1 Completed 0 94s
The events of the OutOfmemory Pod are as follows:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m41s (x2 over 5m2s) default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable.
Normal Scheduled 3m38s default-scheduler Successfully assigned test/app2--1-42ssl to master
Warning OutOfmemory 3m38s kubelet Node didn't have enough resource: memory, requested: 16000000000, used: 31946743808, capacity: 37634150400
It seems that the Pod was assigned to the node even though the node did not have enough memory for it, because the other Pod had just been started.
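To make the numbers in the kubelet event concrete, here is a minimal sketch of the arithmetic. The three byte values are copied from the OutOfmemory event; the `fits` helper is hypothetical and only illustrates the comparison the message implies. It is not the kubelet's actual admission code.

```python
# Byte values copied from the kubelet's OutOfmemory event.
requested = 16_000_000_000
used = 31_946_743_808
capacity = 37_634_150_400


def fits(requested: int, used: int, capacity: int) -> bool:
    """Hypothetical check: does the request fit in what is left on the node?"""
    return used + requested <= capacity


# Memory the kubelet considered free when it rejected the Pod.
free = capacity - used

print(f"free:  {free} bytes")
print(f"fits:  {fits(requested, used, capacity)}")
print(f"used / requested: {used / requested:.2f}")
```

The free amount comes out to 5687406592 bytes, well below the 16000000000 requested. The reported `used` value is also close to twice a single Pod's request, although I can only guess at which Pods the kubelet was counting at that moment.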
I don't think this is the expected behavior of Kubernetes. Does anyone know the cause of this issue?