Bind mount one container FS tree into another for debug or ephemeral containers?

Craig Ringer

7/10/23, 3:39 AM

I'm testing out k8s debugging features including debug pods and ephemeral containers, and I just can't work out how to properly map a "target" pod's file system into the debug container.

I want to link two disjoint mount namespaces with a recursive bind mount, so that container A sees container B's root as /containerB or vice versa, including all volumes and other mounts.

Goal: Access to both debug and target container file systems at the same time

The goal is to have the target pod's full filesystem tree, including volumes and other mounts, mapped to a subdirectory of the debug container, e.g. /run/target. If the target container mounts persistent volumes, those mount points should be mapped too: if the target container has /data, then the debug container should have a mounted /run/target/data.

Alternatively, it'd be OK to "inject" the debug container file system tree into the target container, so that there's e.g. a /run/debug exposing the debug container root when nsentering the target container. That should include its mounts, like procfs, so it's fully functional.

I want to be able to e.g. gdb -p $target_pid where gdb is provided by the debug container. gdb has to be able to find the process executables from the target container for this.

I've explored a few workaround approaches, but what I really want to do is mount --rbind the target container FS tree into the debug container or vice versa. Given a custom-built privileged debug container like:

apiVersion: v1
kind: Pod
metadata:
  name: debugcontainer
  namespace: default
spec:
  nodeName: TARGET_NODE_NAME_HERE
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  hostPID: true
  restartPolicy: Never
  containers:
  - image: DIAG_CONTAINER_IMAGE_HERE # you can experiment using something like ubuntu:20.04 
    name: debugger
    stdin: true
    tty: true
    volumeMounts:
    - mountPath: /target
      name: target
    #- mountPath: /host
    #  mountPropagation: None
    #  name: host-root
    securityContext:
      privileged: true
      runAsGroup: 0
      runAsUser: 0
  volumes:
  - emptyDir: {}
    name: target
  #- hostPath:
  #    path: "/"
  #    type: ""
  #  name: host-root

where the debug container is launched into the same node as the target container, I can:

See target container processes in ps.
Attach to processes with strace, gdb etc., because the privileged debug container has CAP_SYS_PTRACE.
Run nsenter -t $some_target_container_pid --all to "become" a process in the target container, as if I'd done kubectl exec. I can then no longer "see" or access the debug container files/tools.
Run nsenter -t $some_target_container_pid -m --root=/ --wd=/ to enter the target process's mount namespace while retaining the privileges of the debug container. I can then no longer "see" or access the debug container files/tools.

But I cannot:

See files in the target container at the same time as having access to the tools in the debug container - e.g. gdb can't find the executables being debugged
See contents of volumes in the target container and apply debug container tools to them

Is there any recognised way to do this?

It's not totally k8s specific: the same issue applies with Docker, containerd, runc, etc.

You might expect this to be possible by using mount --rbind to "inject" the debug container into the target container via the host container namespace using a hostPath volume with mountPropagation: Bidirectional. But containerd mounts the container root image, sets mount propagation to private then mounts inner volumes. So the host mount namespace doesn't see the mounts made inside the container root image, and procs in the container don't see new mounts added by the host after the container's first process starts. See https://man7.org/linux/man-pages/man7/mount_namespaces.7.html for details.
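For reference, the propagation state described above can be checked per mount from the optional fields of /proc/$pid/mountinfo (tags such as shared:N or master:N; a mount with no tags is private, per mount_namespaces(7)). The following is a minimal sketch, not a definitive tool; show_propagation is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/sh
# show_propagation FILE: for each line of a mountinfo-format file
# (e.g. /proc/$some_target_container_pid/mountinfo), print the mount point
# followed by its propagation tags.
# Hypothetical helper name, for illustration only.
show_propagation() {
    awk '{
        prop = ""
        # Optional fields start at field 7 and end at a lone "-".
        for (i = 7; i <= NF && $i != "-"; i++) {
            prop = prop (prop == "" ? "" : ",") $i
        }
        # No optional fields means the mount is private.
        if (prop == "") prop = "private"
        print $5, prop
    }' "$1"
}
```

Running show_propagation /proc/self/mountinfo in the debug container, and again against a target container pid, should show which mounts are shared and which are private in each namespace.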

I've tried using nsenter to "cross" mount namespaces, but I can't get a bind mount to work. E.g. in the debug container I can

nsenter -t $some_target_container_pid --root=/ -m /bin/bash

which gives me a shell in which . (CWD) is the debug container rootfs, and / is the target container rootfs. But I can't seem to bind-mount them:

$ mkdir /run/debug
$ mount --rbind . /run/debug
mount: /run/debug: wrong fs type, bad option, bad superblock on ., missing codepage or helper program, or other error.

The same occurs if I use nsenter --wd=/ without --root, and try to mount --rbind / ./run/debug.

I've tried using unshare -m to create a new inner mount namespace first. And I've tried mount --make-rprivate / on the debug container tree before the bind mount. Same deal.

I can't work out why: there's nothing in dmesg and the error is very generic. I'm guessing it's due to the disjoint roots and/or disjoint mount namespaces. It doesn't seem to be due to the kernel's protection against bind mount circularity. And I'm using recursive binds, so it shouldn't be due to the protection against mount tree escapes in linux user namespaces.

An alternative to --rbinding a FS tree would be if I had a way to mount --bind by mount ID as shown in /proc/$target_pid/mountinfo. I could then clone all the mounts from the target pid into the debug container's mount namespace. But I can't mount --bind using a normal absolute path, because the target and debug containers' mount namespaces are disjoint, and both have subtrees of mounts with private propagation.
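To make the mount IDs concrete, here is a minimal sketch that lists them from a mountinfo file. It assumes the field layout documented in proc(5): mount ID, parent ID, major:minor, root, mount point, options, optional fields, a lone "-", then the filesystem type. list_mounts is a hypothetical helper name, and the sketch only enumerates mounts; it does not bind-mount by ID.

```shell
#!/bin/sh
# list_mounts FILE: print "mount_id parent_id fstype mount_point" for each
# line of a mountinfo-format file (e.g. /proc/$target_pid/mountinfo).
# Hypothetical helper name, for illustration only.
list_mounts() {
    awk '{
        # Fields 1, 2 and 5 are fixed: mount ID, parent ID, mount point.
        # Optional fields follow field 6 and end at a lone "-";
        # the filesystem type is the field right after that separator.
        fstype = "?"
        for (i = 7; i <= NF; i++) {
            if ($i == "-") { fstype = $(i + 1); break }
        }
        print $1, $2, fstype, $5
    }' "$1"
}
```

For example, list_mounts /proc/$target_pid/mountinfo would print one line per mount visible to the target process, which is the set of mounts that would need to be cloned.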

I've tried using a target process's /proc/$pid/ns/mnt mount namespace, as I've seen references to bind-mounting using it. But on my kernel (5.16) the entries under ns/ are fake symlinks, not an FS tree:

$ readlink /proc/self/ns/mnt
mnt:[4026531840]
$ ls /proc/self/ns/mnt/
ls: cannot access '/proc/self/ns/mnt/': Not a directory
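Although the link target can't be traversed as a directory, it does identify the namespace, so it can at least confirm whether two processes share a mount namespace. A minimal sketch, assuming /proc is mounted and readable for both pids; same_mntns is a hypothetical helper name:

```shell
#!/bin/sh
# same_mntns PID_A PID_B: return 0 if both processes are in the same mount
# namespace, 1 if they differ, 2 if either link cannot be read.
# Hypothetical helper name, for illustration only.
same_mntns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    # The link text looks like "mnt:[4026531840]"; equal text means the
    # same namespace.
    [ "$a" = "$b" ]
}
```

For example, same_mntns $$ 1055 from the debug container's shell would be expected to fail when pid 1055 belongs to the target container, since the two are in disjoint mount namespaces.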

The closest thing I have to a workaround at the moment is the nsenter hack with the working directory. This offers very limited tooling injection into the target container. Where pid 1055 is a pid in the target container:

# nsenter -t 1055 -p -m --wd=/ /bin/bash
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

# ls /
...target container rootfs contents here...

# ls .
...debug container rootfs here...

# ls ..
...debug container rootfs here too because . is a root...

# pwd
pwd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

# ls usr/bin/gdb
usr/bin/gdb

# ls /usr/bin/gdb
ls: cannot access '/usr/bin/gdb': No such file or directory

but I can't bind mount like I want, from within the same nsenter session:

# mkdir /run/debug
# mount --rbind . /run/debug
mount: /run/debug: wrong fs type, bad option, bad superblock on ., missing codepage or helper program, or other error.

Hints?

mount

namespaces

linux-kernel

docker

kubernetes
