I have a Kubernetes cluster running MicroK8s, and the cluster has started to misbehave: it is unresponsive or returns stale state when scheduling pods, managing nodes, etc.
I have gone through and rebooted all the manager nodes, and removed all the worker nodes, to reduce the noise and in the hope of evicting a problem node (as I was seeing long response times / timeouts in /var/log/syslog).
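For context, the worker removal was done with the standard MicroK8s node-removal flow, roughly as below (node names are placeholders):

# On the departing worker: leave the cluster
microk8s leave

# On one of the remaining manager nodes: forget the departed worker
microk8s remove-node <worker-node>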
On restart, if I look at kubectl get events, I see the kubelet service being constantly restarted:
14m Normal Starting node/node1 Starting kubelet.
14m Warning InvalidDiskCapacity node/node1 invalid capacity 0 on image filesystem
14m Normal Starting node/node3 Starting kubelet.
14m Normal NodeHasSufficientMemory node/node1 Node node1 status is now: NodeHasSufficientMemory
14m Normal NodeHasNoDiskPressure node/node1 Node node1 status is now: NodeHasNoDiskPressure
14m Normal Starting node/node3
14m Normal NodeHasSufficientPID node/node1 Node node1 status is now: NodeHasSufficientPID
14m Warning InvalidDiskCapacity node/node3 invalid capacity 0 on image filesystem
14m Normal NodeHasSufficientMemory node/node3 Node node3 status is now: NodeHasSufficientMemory
14m Normal NodeAllocatableEnforced node/node1 Updated Node Allocatable limit across pods
14m Normal NodeHasNoDiskPressure node/node3 Node node3 status is now: NodeHasNoDiskPressure
14m Normal NodeHasSufficientPID node/node3 Node node3 status is now: NodeHasSufficientPID
14m Normal NodeAllocatableEnforced node/node3 Updated Node Allocatable limit across pods
11m Normal Starting node/node1
11m Normal Starting node/node1 Starting kubelet.
11m Warning InvalidDiskCapacity node/node1 invalid capacity 0 on image filesystem
11m Normal NodeHasSufficientMemory node/node1 Node node1 status is now: NodeHasSufficientMemory
11m Normal NodeHasNoDiskPressure node/node1 Node node1 status is now: NodeHasNoDiskPressure
11m Normal NodeHasSufficientPID node/node1 Node node1 status is now: NodeHasSufficientPID
11m Normal NodeAllocatableEnforced node/node1 Updated Node Allocatable limit across pods
10m Normal Starting node/node3 Starting kubelet.
10m Warning InvalidDiskCapacity node/node3 invalid capacity 0 on image filesystem
10m Normal NodeAllocatableEnforced node/node3 Updated Node Allocatable limit across pods
10m Normal NodeHasSufficientMemory node/node3 Node node3 status is now: NodeHasSufficientMemory
10m Normal NodeHasNoDiskPressure node/node3 Node node3 status is now: NodeHasNoDiskPressure
10m Normal NodeHasSufficientPID node/node3 Node node3 status is now: NodeHasSufficientPID
8m1s Normal Starting node/node1
7m57s Normal Starting node/node1 Starting kubelet.
7m57s Warning InvalidDiskCapacity node/node1 invalid capacity 0 on image filesystem
7m57s Normal NodeHasSufficientMemory node/node1 Node node1 status is now: NodeHasSufficientMemory
7m9s Normal Starting node/node3 Starting kubelet.
7m57s Normal NodeHasNoDiskPressure node/node1 Node node1 status is now: NodeHasNoDiskPressure
7m8s Normal Starting node/node3
7m57s Normal NodeHasSufficientPID node/node1 Node node1 status is now: NodeHasSufficientPID
7m9s Warning InvalidDiskCapacity node/node3 invalid capacity 0 on image filesystem
7m8s Normal NodeHasSufficientMemory node/node3 Node node3 status is now: NodeHasSufficientMemory
7m57s Normal NodeAllocatableEnforced node/node1 Updated Node Allocatable limit across pods
7m8s Normal NodeHasNoDiskPressure node/node3 Node node3 status is now: NodeHasNoDiskPressure
7m8s Normal NodeHasSufficientPID node/node3 Node node3 status is now: NodeHasSufficientPID
7m8s Normal NodeAllocatableEnforced node/node3 Updated Node Allocatable limit across pods
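(If it helps, I can also pull a time-sorted, all-namespace view of these events with something like the following, assuming that's a more useful way to look at them:)

# All-namespace event listing, sorted by last-seen time
microk8s kubectl get events -A --sort-by=.lastTimestamp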
I'm not sure where to find the logs that would give me the reason for these restarts. Googling the invalid capacity 0 warning, people seem to say it can be ignored; it's also a warning rather than an error, and it isn't the last log line before a restart, so I would assume it isn't what's preventing startup. Though I'm not sure.
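In case it's relevant, this is roughly how I've been sanity-checking the image filesystem that the warning refers to (the mountpoint is taken from the kubelite log further down):

# Check capacity of the containerd overlayfs image filesystem mentioned in the warning
df -h /var/snap/microk8s/common/var/lib/containerd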
I'm looking for logs that will give me more detail on why this service is failing on the nodes. I've looked in journalctl -f -u snap.microk8s.daemon-kubelite, since the docs say the kubelet logs have been consolidated there.
I can't include the full logs as they are past the character limit on Server Fault, and you can't seem to upload files. I can provide snippets, or search for specific things.
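For example, this is roughly how I've been filtering the kubelite journal for anything that looks fatal around the restarts (the grep pattern is just my guess at useful keywords):

# Kubelite journal for the current boot, filtered for likely-fatal lines
journalctl -u snap.microk8s.daemon-kubelite -b --no-pager \
  | grep -iE "fatal|panic|exit status|failed to run kubelet"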
Here are some things that stand out, though I don't know whether any of them would lead to the kubelet not starting:
...
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.025847 459922 server.go:1251] "Started kubelet"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.025894 459922 server.go:177] "Starting to listen read-only" address="0.0.0.0" port=10255
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.025926 459922 server.go:150] "Starting to listen" address="0.0.0.0" port=10250
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.026588 459922 server.go:410] "Adding debug handlers to kubelet server"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.027415 459922 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.027521 459922 volume_manager.go:294] "Starting Kubelet Volume Manager"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.027975 459922 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: E0818 12:25:46.028881 459922 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/snap/microk8s/common/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: E0818 12:25:46.029007 459922 kubelet.go:1351] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.061667 459922 kubelet_network_linux.go:57] "Initialized protocol iptables rules." protocol=IPv4
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.072440 459922 kubelet_network_linux.go:57] "Initialized protocol iptables rules." protocol=IPv6
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.072452 459922 status_manager.go:161] "Starting to sync pod status with apiserver"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.072461 459922 kubelet.go:2031] "Starting kubelet main sync loop"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: E0818 12:25:46.072492 459922 kubelet.go:2055] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.122991 459922 cpu_manager.go:213] "Starting CPU manager" policy="none"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.123003 459922 cpu_manager.go:214] "Reconciling" reconcilePeriod="10s"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.123016 459922 state_mem.go:36] "Initialized new in-memory state store"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.123144 459922 state_mem.go:88] "Updated default CPUSet" cpuSet=""
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.123156 459922 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.123162 459922 policy_none.go:49] "None policy: Start"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.124388 459922 memory_manager.go:168] "Starting memorymanager" policy="None"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.124403 459922 state_mem.go:35] "Initializing new in-memory state store"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.124496 459922 state_mem.go:75] "Updated machine memory state"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.125518 459922 manager.go:611] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.125689 459922 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.126270 459922 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: cstor.csi.openebs.io endpoint: /var/snap/microk8s/common/var/lib/kubelet/plugins/cstor.csi.openebs.io/csi.sock versions: 1.0.0
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.126286 459922 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: cstor.csi.openebs.io at endpoint: /var/snap/microk8s/common/var/lib/kubelet/plugins/cstor.csi.openebs.io/csi.sock
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.128371 459922 kubelet_node_status.go:70] "Attempting to register node" node="node3"
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.159655 459922 serving.go:348] Generated self-signed cert in-memory
Aug 18 12:25:46 node3 microk8s.daemon-kubelite[459922]: I0818 12:25:46.173125 459922 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="7d5dfde734a4022cf57816fb8fd2bfd9b30dc4c44fefb262d866854e9905fedd"
...
I would appreciate any help in finding the error that matters here.
Thanks.