
How do I troubleshoot intermittent node/kubelet reboots on a GKE spot GPU node pool?


I am running workloads on a spot GPU node pool and am intermittently getting 'NodeNotReady' followed by a reboot/restart of the node (and loss of the workload pod). However, the node does not go away: it reboots, the kubelet restarts, and the node becomes Ready again after a few minutes (see attached).
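To make these transitions easier to inspect, the node's event stream and conditions can be pulled with kubectl. A minimal sketch; the node name below is a placeholder:

```shell
# Show recent Node events (Ready/NotReady transitions, reboots), newest last
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Node \
  --sort-by=.lastTimestamp

# Inspect the node's conditions and their last transition times
# (node name is hypothetical)
kubectl describe node gke-gpu-pool-node-1
```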

I am new to using the spot GPU node types, so I was wondering if this is to be expected?

If the underlying node is being preempted, how can I surface the termination event? https://cloud.google.com/compute/docs/instances/spot#preemption

event log
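One way to surface preemptions (a sketch, assuming the gcloud CLI is authenticated against the project): Compute Engine records a `compute.instances.preempted` operation for each preempted VM, and a running instance can poll its metadata server for the termination notice:

```shell
# List preemption operations recorded for the project
gcloud compute operations list \
  --filter="operationType=compute.instances.preempted"

# From inside the VM: returns TRUE once the instance has been preempted
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted" \
  -H "Metadata-Flavor: Google"
```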

[EDIT]

After trawling through the logs, it looks like the underlying VM is preempted and immediately replaced with a new instance, while the k8s node identity remains the same:

pre-emption log
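To confirm that the VM really is replaced while the Kubernetes node object persists, the numeric GCE instance ID can be compared before and after an event; it changes when the managed instance group recreates the VM, even though the instance (and node) name stays the same. A sketch with placeholder node name and zone:

```shell
# Kubernetes side: the node's providerID (stable across the replacement)
kubectl get node gke-gpu-pool-node-1 -o jsonpath='{.spec.providerID}'

# GCE side: the numeric instance ID, which changes on recreation
gcloud compute instances describe gke-gpu-pool-node-1 \
  --zone=us-central1-a --format="value(id)"
```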

So it looks like I answered my own question above. However, I am wondering how often I can expect these preemption events to occur. I have used the same spot instances outside of GKE (just as basic VMs) and didn't experience hourly preemption like this; in fact, I have run workloads there for days without a preemption event. Perhaps it works differently for GKE?


