How to handle a Kubernetes workload crash-looping under load

Here is what I consider a fairly generic situation, and I'm not sure what the best solution to it is.

Consider a k8s workload where pods need 10-30 seconds to become Ready. If at some point a load spike starts to crash your pods for some reason (OOMKills, a thread-pool overload making the probe unresponsive, whatever), then even though you have an HPA configured, the traffic may keep growing from client retries alone, and eventually all your pods will crash as soon as they become Ready, because the Service sends a significant portion, if not all, of the requests to a single pod while all the others are still restarting.

EDIT: at this point, I assume the pods already have correctly defined liveness and readiness probes. But if the ingress traffic requires at least N pods, and the number of Ready pods stays below N because each pod crashes as soon as it receives traffic (there is simply too much of it), what do you do?

Besides asking all clients to set up a circuit breaker / exponential backoff on their side, is there a way to ask Kubernetes to stop sending traffic to your deployment until there are "enough" pods Ready? ("Enough" being either a static number or dynamic, depending on the ingress traffic.)

Today our solution is to have a circuit breaker on our side: we manually stop the traffic until the workload is healthy enough, then manually turn the traffic back on. But I'm wondering whether there is a better way to respond to that situation automatically, when you are unable to prevent it from happening.
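
For what it's worth, by "circuit breaker on our side" I mean something that could eventually be expressed in a service mesh instead of in each client. If you run Istio, a minimal sketch would be a DestinationRule like the one below (the name, host, and thresholds are all made up):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app-circuit-breaker        # hypothetical name
spec:
  host: my-app.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100  # queue limit before excess requests are rejected
        http2MaxRequests: 1000        # cap on concurrent requests to the backend
    outlierDetection:
      consecutive5xxErrors: 5         # eject a pod after 5 consecutive 5xx responses
      interval: 10s                   # how often ejection analysis runs
      baseEjectionTime: 30s           # how long an ejected pod stays out of the pool
      maxEjectionPercent: 100         # allow ejecting every pod if all are failing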

Thanks

I reckon one possible solution would be a throttling ingress controller or reverse proxy that never sends more than a certain number of requests per second per pod and answers the rest with a 5xx code, but that sounds sub-optimal.
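
If you use ingress-nginx, something close to that can be done with its rate-limiting annotations; a sketch below (names and numbers are made up, and note that limit-rps throttles per client IP rather than per pod, so it only approximates what I describe):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                # hypothetical name
  annotations:
    # ingress-nginx: limit each client IP to N requests per second;
    # excess requests are rejected with a 503 by default
    nginx.ingress.kubernetes.io/limit-rps: "10"
    # allow short bursts above the limit (multiplier of limit-rps)
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
spec:
  ingressClassName: nginx
  rules:
  - host: my-app.example.com  # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80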

The Kubernetes health probes (startup and readiness probes) are what you need. Your pods will start receiving traffic only once their health probes succeed. See the Kubernetes documentation on probes for more info.
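
A minimal sketch of what that looks like for a container that needs 10-30 seconds to boot (the image, endpoint paths, and timings are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest # placeholder image
        ports:
        - containerPort: 8080
        startupProbe:        # gives the app up to 5s x 12 = 60s to boot
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 12
        readinessProbe:      # pod only receives Service traffic while this passes
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        livenessProbe:       # restarts the container if it hangs
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 3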

Yeah, I know that. But if the amount of traffic requires 2-3 healthy pods at the same time, and the situation is such that they start, then crash one by one, without there ever being enough Ready at the same time, what do you do?

You should always mention complete information, such as the fact that you already have probes set up, to give the person answering your question a better picture. Coming back to your question: if you know that your traffic requires N healthy pods, then you need a combination of a PodDisruptionBudget and an HPA to better handle your traffic in case of any disruption caused by it.
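
Something along these lines, with illustrative names and numbers (minAvailable and minReplicas standing in for your N):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb      # hypothetical name
spec:
  minAvailable: 3       # never let voluntary disruptions drop below N available pods
  selector:
    matchLabels:
      app: my-app
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3        # keep headroom so a spike doesn't land on a single pod
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
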
I guess I explained my problem incorrectly then. I'm not talking about normal behaviour where pods could get evicted for maintenance/node rollouts/whatever. I'm talking about an event that makes the container inside the pod crash outside our control, and from which you need to recover. In my opinion, not even a PDB would help, given that the event I'm describing is neither an involuntary nor a voluntary disruption, but an instability caused by crash loops, themselves sustained by an overload of traffic.