
GKE autoscaler sometimes doesn't scale pods


We have a deployment configured with an HPA based on the CPU metric. It works fine for days, scaling pods up and down, and then at some point it appears to ignore the metric and scales down to a small number of pods. We usually resolve this by manually setting a minimum number of pods that can handle the traffic, and after an hour or two it starts scaling again. Here is the output of `kubectl describe hpa` at a moment when the autoscaler is not working for us:

Name:                                                  my-router-hpa
Namespace:                                             default
Labels:                                                label1=label1
                                                       label2=label2
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 15 Sep 2021 12:19:16 +0000
Reference:                                             Deployment/my-router-v001
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  188% (943m) / 85%
Min replicas:                                          10
Max replicas:                                          100
Deployment pods:                                       10 current / 10 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type    Reason             Age                  From                       Message
  ----    ------             ----                 ----                       -------
  Normal  SuccessfulRescale  60m                  horizontal-pod-autoscaler  New size: 15; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  50m (x2 over 158m)   horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  48m                  horizontal-pod-autoscaler  New size: 7; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  43m (x2 over 105m)   horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  43m                  horizontal-pod-autoscaler  New size: 12; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  37m (x2 over 48m)    horizontal-pod-autoscaler  New size: 6; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  34m (x2 over 47m)    horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  29m (x2 over 46m)    horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 2; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  16m (x2 over 106m)   horizontal-pod-autoscaler  New size: 1; reason: cpu resource utilization (percentage of request) below target
  Normal  SuccessfulRescale  15m                  horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  13m (x2 over 148m)   horizontal-pod-autoscaler  New size: 10; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  13m (x3 over 123m)   horizontal-pod-autoscaler  New size: 16; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  8m3s (x2 over 129m)  horizontal-pod-autoscaler  New size: 10; reason: cpu resource utilization (percentage of request) below target

It reports the metric as "188% (943m) / 85%", yet the last event says "below target".
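As I understand it from the Kubernetes documentation, the HPA recommendation is `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`. A quick sanity check with the numbers above (this is my own arithmetic, not the HPA source):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Replica count the HPA should recommend, per the documented
    formula: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# With the values from the describe output above:
# 10 pods at 188% CPU against an 85% target.
print(hpa_desired_replicas(10, 188, 85))  # 23, i.e. a scale-up, not a scale-down
```

So with 10 pods at 188% utilization the recommendation should be 23 replicas, which makes the "below target" event even more confusing to me.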

Could you help me understand the behavior of the GKE autoscaler, or suggest a way to debug it?
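In case it's relevant: one thing I'm considering as a workaround (an assumption on my part, not something confirmed to fix this) is the `behavior` field available in the `autoscaling/v2beta2` API, which can slow down scale-down. A sketch using the names from the output above:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-router-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-router-v001
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 85
  behavior:
    scaleDown:
      # Only scale down on the highest recommendation seen
      # over the last 10 minutes, to dampen flapping.
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

But I'd still like to understand why the HPA reports "below target" while the current metric is above the target.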

Comment from mario: Could you provide a way of reproducing it on a test GKE cluster?
Comment from Oleksandr Bushkovskyi (OP): @mario I don't know how to reproduce this in a test environment. I've only observed this issue in production, and not too frequently, maybe a couple of times per month.
