Score:0

containerd fails to start after NVIDIA config

Asked by XPLOT1ON:

I've followed this official tutorial to give a bare-metal k8s cluster GPU access, but I ran into errors while doing so.

Versions: Kubernetes 1.21, containerd 1.4.11, Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The NVIDIA driver (version 495, headless) is preinstalled on the host OS.

After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit code 1.

containerd `config.toml`:

systemd log here.

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]

# NVIDIA CONFIG START HERE

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

# NVIDIA CONFIG ENDS HERE

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[plugins.linux]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"

I can confirm that the NVIDIA driver detects the GPU (NVIDIA GTX 750 Ti); running nvidia-smi gives the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Update: modified `config.toml` that got it to work.
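In case that link ever goes dead, here is a minimal sketch of what the working file plausibly looks like once the fixes from the answer below are applied (empty `disabled_plugins`, and every plugin addressed by its fully-qualified URI, as `version = 2` requires). Treat it as an approximation, not a copy of the linked file:

version = 2

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# v2 rejects bare plugin names such as "restart"; disable nothing instead
disabled_plugins = []

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

  # the old bare [plugins.linux] table, now under its qualified v1 URI
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "/usr/bin/containerd-shim"
    runtime = "/usr/bin/runc"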

mdaniel: Two things. First, you'll likely get better help if you post the logs from the process that exited non-zero, since the details matter. Second, don't use 1.4.11; there was a security fix in [1.4.12](https://github.com/containerd/containerd/releases/tag/v1.4.12)
XPLOT1ON: @mdaniel thank you for flagging that vulnerability; I've updated all nodes. I've also updated the post above with the system log.
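For anyone repeating this upgrade, something like the following should do it. The package name is an assumption here: on stock Ubuntu it is `containerd`, but it may be `containerd.io` if you installed from Docker's repository.

# confirm the currently running version
containerd --version

# pull in the patched package and restart the daemon
sudo apt-get update
sudo apt-get install --only-upgrade containerd
sudo systemctl restart containerd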
Score:2
Answer from mdaniel:

As best I can tell, it's this:

Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx

Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

So if you know that the restart-ish plugin is in fact enabled and needed, you'll need to track down its new URI syntax. But I'd actually recommend just commenting out that stanza, or going with `disabled_plugins = []`, since the containerd ansible role we use doesn't mention anything about "restart" and does use the `= []` flavor
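Concretely, that change is just the following (a sketch, assuming you have no other plugins you want disabled):

# v2 configs expect fully-qualified URIs like "io.containerd.x.vx",
# so the bare name "restart" is rejected at startup
disabled_plugins = []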


Tangentially, you may want to restrict your journalctl invocation in the future to just the containerd unit, since otherwise it emits a lot of distracting text: `journalctl -u containerd.service`. You can even restrict it to just the most recent lines, which sometimes helps further: `journalctl -u containerd.service --lines=250`
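For example (the unit name matches the stock Ubuntu containerd packaging; adjust if yours differs):

# only entries from the containerd unit, not the whole journal
journalctl -u containerd.service

# same, but only the most recent 250 lines
journalctl -u containerd.service --lines=250

# or follow the log live while restarting the service in another shell
journalctl -u containerd.service --follow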

XPLOT1ON: Thanks for the extensive reply. I've tried setting `disabled_plugins` to an empty list, and it gave me a different error: `containerd: invalid plugin key URI "linux" expect io.containerd.x.vx`. I've attached the complete containerd `config.toml` in the original post. If you could have a look, that would be great.
mdaniel: Yes, it seems to be the same problem; `linux` as an unqualified name is evidently the old style, so what you'll likely want is `[plugins."io.containerd.runtime.v1.linux"]`, just like you see with the `[plugins]` members at the top of the file and [as shown in the template I linked to](https://github.com/particuleio/symplegma-containerd/blob/v1.4.3-rel.0/templates/config.toml.j2#L132)
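That is, something along these lines (a sketch; keep your own existing shim and runtime paths):

# the qualified v1 URI replaces the old bare [plugins.linux] table
[plugins."io.containerd.runtime.v1.linux"]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"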
XPLOT1ON: Thanks for the help; I can now boot up containerd with the integrated config based on the NVIDIA docs. For future reference, I've updated my original post with the working config.toml.
mdaniel: I'm glad to hear it, and I'm always glad when it's something simple. Good luck on your journey running GPUs in k8s! Please consider putting the config inline in your question, since linking to external sites runs the risk of them being 404 for future generations.