To deploy the ML models, we are use the AKS cluster. To deploy the models inside the AKS cluster, we are using airflow.
Two days prior, we manually renewed the AKS cluster's certificate. Nodes, disks, and scale sets are redeployed as per this doc(https://learn.microsoft.com/en-us/azure/aks/certificate-rotation)
After the certificate renewal, two deployments create pods indefinitely, and then all the created pods are evicted. when describing the pods it shows The node had condition: [MemoryPressure]
. I am aware that there is a memory issue, however before the certificate renewal, we were using the same number of nodes and configurations. Because the deployment is producing limitless pods, it is happening right now.
Cluster configuration
AKS cluster version: 1.19.11
Node pool count: 1
Node configuration as below
node count: 7
Node size: Standard_B2s
OS disk size: 128 GB
Node image: ubuntu
I've been looking for this for the past two days, but I haven't been able to figure out why a specific deployment creates infinite pods when ML models are deployed using airflow.
I want to know why the pods are created infinitely from particular deployments and pods are evicted. can anyone give me an idea to troubleshoot the issue? any guidance is helpful for me as it is a production environment.
I am not sure whether I provided all details. if anyone needs any details please let me know