This is the kind of problem one runs into when pods are schedulable anywhere. You're on the right track with affinity rules.
You could make the pods within a Deployment's ReplicaSet express anti-affinity for each other, so that they spread across nodes. This makes scheduling somewhat heavier, but it does keep the loss of a single node from turning into a cascading failure. It also does a reasonable job of spreading pods across failure domains, though that's more of a side effect. A minimal sketch of that approach follows (the Deployment name and the app: myapp label are placeholders; kubernetes.io/hostname is the standard per-node topology key):
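kind: Deployment
apiVersion: apps/v1
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          # hard rule: never co-locate two pods carrying app: myapp on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
Note that with the required (hard) form, any replicas beyond the number of eligible nodes stay Pending, which is one reason the spread constraints below are often the better fit.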
However, there is a better way to accomplish this: pod topology spread constraints. With a spread constraint, the scheduler keeps pods balanced across failure domains (whether those are AZs or nodes), and with whenUnsatisfiable set to DoNotSchedule, a pod that would break the balance stays Pending instead of being placed.
You can write the constraints so that pods are guaranteed to be spread among nodes, and so that a node failure won't cause "bunching". Take a look at this example pod:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # pod counts may differ by at most 1 between zones
    topologyKey: zone                 # assumes nodes carry a "zone" label (the well-known label is topology.kubernetes.io/zone)
    whenUnsatisfiable: DoNotSchedule  # leave the pod Pending rather than violate the constraint
    labelSelector:
      matchLabels:
        foo: bar
  - maxSkew: 1                        # and by at most 1 between nodes
    topologyKey: node                 # assumes nodes carry a "node" label (kubernetes.io/hostname is the well-known equivalent)
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
This can be combined with affinity rules if you also do not want a Deployment's ReplicaSet to schedule alongside other Deployments on the same node, further reducing the "bunching" effect. A soft anti-affinity is typically appropriate there, so the scheduler will try not to colocate those workloads but will still schedule them when there's no better option.
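If you go that route, a sketch of the soft anti-affinity, added under the pod template's spec alongside the topologySpreadConstraints above, might look like this (the app: other-workload label is a placeholder for whatever labels the other deployment's pods carry):
spec:
  affinity:
    podAntiAffinity:
      # soft rule: prefer nodes not already running the other workload,
      # but schedule anyway if no such node is available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: other-workload
          topologyKey: kubernetes.io/hostname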