Pod stuck in pending state due to pod affinity/anti-affinity

One of my Deployment's replicas is stuck in the Pending state.

Problem: after the latest deployment, one of the new replicas got stuck, even though there is an empty node that satisfies all the necessary requirements.

The Deployment contains nodeSelector and affinity requirements:

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - vision-api-extract
            topologyKey: "kubernetes.io/hostname"
      nodeSelector:
        insttype: gpu

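To make the rule above concrete, here is a minimal Python sketch of how a required pod anti-affinity term with `topologyKey: kubernetes.io/hostname` filters nodes. It is an illustration, not the scheduler's actual implementation: the node names and the pod placement in `pods_by_node` are hypothetical, and it models only the incoming pod's own rule, not anti-affinity rules declared by pods that are already running.

```python
# Simplified model of requiredDuringSchedulingIgnoredDuringExecution
# pod anti-affinity with topologyKey "kubernetes.io/hostname".
# All node names and pod placements below are hypothetical.

def matches_in(pod_labels, key, values):
    """Emulate a matchExpressions entry with operator 'In'."""
    return pod_labels.get(key) in values

def feasible_nodes(pods_by_node, key, values):
    """Return the nodes that host no pod matching the selector.

    With topologyKey = hostname, each node is its own topology domain,
    so a single matching pod on a node is enough to exclude that node.
    """
    return sorted(
        node
        for node, pods in pods_by_node.items()
        if not any(matches_in(labels, key, values) for labels in pods)
    )

pods_by_node = {
    "gpu-node-a": [{"app": "vision-api-extract"}],
    "gpu-node-b": [{"app": "vision-api-extract"}, {"app": "node-exporter"}],
    "gpu-node-c": [{"app": "node-exporter"}],
}

candidates = feasible_nodes(pods_by_node, "app", ["vision-api-extract"])
print(candidates)  # ['gpu-node-c']
```

Only a node without any `app=vision-api-extract` pod passes the filter. Pods from a previous rollout that still carry the label count as matches too.
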
There are 3 nodes with the required label:

ip-10-0-11-16.ec2.internal                Ready    <none>   114d    v1.18.3    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g3.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,insttype=gpu,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-11-16,kubernetes.io/os=linux,node.kubernetes.io/instance-type=g3.4xlarge,topology.ebs.csi.aws.com/zone=us-east-1b,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b
ip-10-0-11-206.ec2.internal               Ready    <none>   342d    v1.18.3    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g3.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,insttype=gpu,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-11-206,kubernetes.io/os=linux,node.kubernetes.io/instance-type=g3.4xlarge,topology.ebs.csi.aws.com/zone=us-east-1b,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b
ip-10-0-11-44.ec2.internal                Ready    <none>   114d    v1.18.3    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g3.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,insttype=gpu,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-11-44,kubernetes.io/os=linux,node.kubernetes.io/instance-type=g3.4xlarge,topology.ebs.csi.aws.com/zone=us-east-1b,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b

Here is the scheduling event from the description of the pending pod:

 Warning  FailedScheduling  <unknown>  default-scheduler  0/13 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 10 node(s) didn't match node selector.

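The event message can be broken down per reason to see how the 13 nodes are accounted for. The following is a small Python sketch that does this; `tally_reasons` is a hypothetical helper written for this post, and the parsing assumes the message format shown above rather than any documented, stable format.

```python
import re

# Scheduler event message copied from the pending pod's description.
message = (
    "0/13 nodes are available: "
    "1 node(s) didn't match pod affinity/anti-affinity, "
    "1 node(s) didn't satisfy existing pods anti-affinity rules, "
    "1 node(s) had taint {node.kubernetes.io/disk-pressure: }, "
    "that the pod didn't tolerate, "
    "10 node(s) didn't match node selector."
)

def tally_reasons(msg):
    """Split a FailedScheduling message into (total_nodes, {reason: count})."""
    header, _, body = msg.partition(": ")
    total = int(header.split("/")[1].split()[0])
    # Each reason starts with "<count> node(s) "; split there rather than on
    # every comma, because the taint reason itself contains a comma.
    parts = re.split(r",\s*(?=\d+ node\(s\) )", body.rstrip("."))
    reasons = {}
    for part in parts:
        count, reason = re.match(r"(\d+) node\(s\) (.*)", part).groups()
        reasons[reason] = int(count)
    return total, reasons

total, reasons = tally_reasons(message)
for reason, count in reasons.items():
    print(f"{count:>2}  {reason}")
print("sum of counts:", sum(reasons.values()), "of", total, "nodes")
```

The counts add up to 13, and 13 - 10 = 3 nodes passed the node selector, which matches the three `insttype=gpu` nodes. If each node is reported under a single reason, each of those three was rejected for a different cause: the pod's own anti-affinity rule, an existing pod's anti-affinity rule, and a disk-pressure taint.
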
And here is the description of the empty node:

Name:               ip-10-0-11-44.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g3.4xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    insttype=gpu
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-11-44
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=g3.4xlarge
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-00919faca1e45926f","efs.csi.aws.com":"i-00919faca1e45926f"}
                    flannel.alpha.coreos.com/backend-data: {"VtepMAC":"ce:02:a2:a2:5e:a7"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.0.11.44
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 26 Mar 2021 08:54:41 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-11-44.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Sun, 18 Jul 2021 11:52:59 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 18 Jul 2021 11:51:26 +0000   Sat, 17 Jul 2021 14:00:36 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 18 Jul 2021 11:51:26 +0000   Sat, 17 Jul 2021 14:00:36 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 18 Jul 2021 11:51:26 +0000   Sat, 17 Jul 2021 14:00:36 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 18 Jul 2021 11:51:26 +0000   Sat, 17 Jul 2021 14:00:38 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.0.11.44
  Hostname:     ip-10-0-11-44.ec2.internal
  InternalDNS:  ip-10-0-11-44.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         16
  ephemeral-storage:           60923672Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      125709124Ki
  pods:                        110
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         16
  ephemeral-storage:           56147256023
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      125606724Ki
  pods:                        110
System Info:
  Machine ID:                 94c328b1fcaf4999b5de9f749ac998b8
  System UUID:                ec2c3806-d842-c53f-e93f-cf9059701bdd
  Boot ID:                    469aa16e-80f3-470b-9451-06078a78fa96
  Kernel Version:             5.4.0-1051-aws
  OS Image:                   Ubuntu 18.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://18.9.7
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
PodCIDR:                      10.244.8.0/24
PodCIDRs:                     10.244.8.0/24
ProviderID:                   aws:///us-east-1b/i-00919faca1e45926f
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 ebs-csi-controller-5b64f64f64-x97ng     0 (0%)        0 (0%)      0 (0%)           0 (0%)         24d
  kube-system                 ebs-csi-node-2rwm4                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         114d
  kube-system                 efs-csi-node-9dhb2                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         114d
  kube-system                 kube-flannel-ds-amd64-9xkjg             100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      114d
  kube-system                 kube-proxy-nrjmh                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         114d
  kube-system                 traefik-9mpzr                           500m (3%)     1 (6%)      500Mi (0%)       800Mi (0%)     24d
  monitoring                  node-exporter-gj2qw                     112m (0%)     270m (1%)   200Mi (0%)       220Mi (0%)     114d
  monitoring                  prometheus-operator-6f98f66b89-dnjqd    100m (0%)     200m (1%)   100Mi (0%)       200Mi (0%)     24d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         812m (5%)   1570m (9%)
  memory                      850Mi (0%)  1270Mi (1%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0

p10l: What do you mean by an empty node? There are pods running on the node you included.
Thunderbird: I mean that only system DaemonSets are running on this node. None of my services are scheduled on it.
p10l: Do any of the pods currently running on this node have the label `app: vision-api-extract`?
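
One way to answer that question is to ask the API server for pods that both run on the node and carry the label. Below is a minimal Python sketch that builds such a `kubectl` command. The helper name `pods_with_label_on_node` is hypothetical; the flags used (`--all-namespaces`, `-l`, `--field-selector spec.nodeName=...`, `-o wide`) are standard `kubectl get` options.

```python
import shlex

def pods_with_label_on_node(node, selector):
    """Build a kubectl command listing pods on `node` that match `selector`."""
    return [
        "kubectl", "get", "pods", "--all-namespaces",
        "-l", selector,
        "--field-selector", f"spec.nodeName={node}",
        "-o", "wide",
    ]

cmd = pods_with_label_on_node(
    "ip-10-0-11-44.ec2.internal", "app=vision-api-extract"
)
# Print the command so it can be reviewed or pasted into a shell.
print(shlex.join(cmd))
```

To execute it directly instead of printing it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `kubectl` is configured for the cluster. An empty result would suggest that no pod with that label is currently on the node.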