The scenario: you have a running Kubernetes cluster, but suddenly some of your containers start to have problems.

Now you need a way to find out what exactly is causing these problems.

In this post I’ll highlight two ways to debug a container running on Kubernetes:

  • ephemeral-containers (which is likely what you should use most of the time)
  • nsenter (which is a last resort debugging option that requires access to the node the Pod is running on)

Want to follow along? Set up a test cluster and run the provided commands to get your very own failing container:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: myconfig
  namespace: default
data:
  config.toml: |
    [default]
  additional_file: ""
---
apiVersion: v1
kind: Pod
metadata:
  name: my-broken-pod
  namespace: default
spec:
  containers:
  - name: working-container
    image: gcr.io/google_containers/pause-amd64:3.0
  - name: broken-container
    image: quay.io/fbergman/broken-pod:latest
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
      - containerPort: 3000
    readinessProbe:
      httpGet:
        path: /readyz
        port: 3000
    volumeMounts:
      - mountPath: /opt/config/
        name: configuration
  volumes:
    - name: configuration
      configMap:
        name: myconfig
minikube start
kubectl apply -f /some-web-url

In this case the ReadinessProbe of one of the containers starts failing.

One of the containers (working-container) cannot be debugged from the inside at all, because it does not contain an executable shell. This is a pretty common scenario when running minimal container images such as distroless images.

The other container allows spawning a shell, but is missing most utilities that would make debugging easier: there is no strace, no gdb, and none of the usual networking tools.

How can we now figure out what is breaking the container?

Obviously we can hope that kubectl describe pod or the logs of our application contain some useful information, but let’s say this is not the case and running kubectl logs my-broken-pod returns nothing useful.
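For the example pod above, those first stops would be (the -c flag selects one container of a multi-container pod):

kubectl describe pod my-broken-pod
kubectl logs my-broken-pod -c broken-container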

This is where the two approaches mentioned above come in - and since it should be your first stop, let’s start with ephemeral containers:

Using ephemeral containers

When you have to debug a pod, ephemeral containers are the first tool to use, because in general access to cluster nodes might not be available.

When using ephemeral containers it is important to have some container images ready that contain tools you want to use for debugging.

A very good network debugging container is the netshoot container.

But if all you need is a basic shell (in case of a distroless container) - busybox might already be enough.
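As a sketch, attaching netshoot to the broken container could look like this (nicolaka/netshoot is the image name the netshoot project publishes - substitute whatever debug image you have ready):

kubectl debug -it my-broken-pod --image=nicolaka/netshoot --target=broken-container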

Starting an ephemeral container

Ephemeral containers are handled as an additional field in the spec of a Pod: see the apidocs.

Ephemeral containers can only be added to an already running pod - trying to add them to a Pod during creation will result in an error:

---
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-container-on-start
  labels:
    name: ephemeral-container-on-start
spec:
  containers:
  - name: main-container
    image: busybox:1.28
    resources:
      limits:
        memory: "64Mi"
        cpu: "250m"
  ephemeralContainers:
  - name: debug-container
    image: busybox:1.28
    resources:
      limits:
        memory: "64Mi"
        cpu: "250m"
The Pod "ephemeral-container-on-start" is invalid:
* spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container
* spec.ephemeralContainers: Forbidden: cannot be set on create

That leaves two options to add an ephemeral container to a Pod:

  1. The kubectl debug command:

     kubectl debug -it [pod-name] --image=busybox:1.28 --target=[container-name]
    
  2. Patching the Pod directly through the API, using the dedicated ephemeralcontainers subresource:

     # Just for this experiment, give the default serviceaccount all permissions
     kubectl create clusterrolebinding --clusterrole cluster-admin --serviceaccount default:default default-all-permissions
     # Determine the API server URL
     APIURL=$(kubectl config view --minify --output jsonpath="{.clusters[*].cluster.server}")
     # Create a token for the (now cluster-admin) default serviceaccount
     TOKEN=$(kubectl create token default)
     # Call the API using curl as this serviceaccount
     curl --header "Authorization: Bearer $TOKEN" \
         --header "Content-Type: application/strategic-merge-patch+json" \
         --request PATCH \
         -d '
     {
         "spec":
         {
             "ephemeralContainers":
             [
                 {
                     "name": "debugger",
                     "command": ["sh"],
                     "image": "busybox",
                     "targetContainerName": "broken-container",
                     "stdin": true,
                     "tty": true,
                     "volumeMounts": []
                 }
             ]
         }
     }' \
         "${APIURL}/api/v1/namespaces/default/pods/my-broken-pod/ephemeralcontainers" -k
     Once the container has been added like this, it can be entered using kubectl exec -ti my-broken-pod -c debugger -- sh. To get rid of the new container, kill the process that was already running when execing into it: kill -9 <pid_of_initial_sh>.

What will not work is using kubectl edit, as this is explicitly not supported:

Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so it’s not possible to add an ephemeral container using kubectl edit.

Once the container is launched (and you are either already inside the shell when using kubectl debug, or have attached via kubectl exec) you can run whatever commands might be necessary.
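For example, in a busybox ephemeral container targeting broken-container, a first look around could be the following (busybox ships both ps and wget):

# --target shares the process namespace, so the target's processes are visible
ps
# the network namespace is shared pod-wide, so the probe endpoint is reachable
wget -qO- http://localhost:3000/readyz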

Cleaning up ephemeral containers

Checking the apidocs also shows what might be an issue for some users:

Ephemeral containers may not be removed or restarted.

So - once an ephemeral container has been added to a Pod, the only way to completely get rid of it from the Pod spec is by recreating the Pod without it.
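A minimal sketch of that, assuming the original manifest is still around as my-broken-pod.yaml (a hypothetical filename):

kubectl delete pod my-broken-pod
kubectl apply -f my-broken-pod.yaml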

It’s also strongly recommended not to keep a process running in the ephemeral container, because it will count against the requests and limits of the entire pod:

The kubelet may evict a Pod if an ephemeral container causes the Pod to exceed its resource allocation.

Using nsenter

nsenter does not only work for Kubernetes pods - it is something you can also use for Docker or Podman containers.

Generally this is useful if you want to pick-and-choose which namespaces of the container you want to access.

In Kubernetes this will only work if there is a way to gain access to the cluster node the Pod you want to debug is running on.

So either via SSH or using an OpenShift oc debug node/NODE container.
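For the minikube test cluster from the beginning, node access is a single command; on OpenShift the debug-node variant is the rough equivalent:

# minikube: SSH into the (single) node
minikube ssh
# OpenShift: spawn a debug container on the node
oc debug node/<node-name>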

Once access to the node is established, the process running inside the pod needs to be found. Many of these commands depend on the container runtime in use, so for the remainder I will assume the cluster is using CRI-O.1

  1. Find the Pod ID of our my-broken-pod pod:

     POD_ID=$(crictl pods --name my-broken-pod -o json | jq -r '.items[] | select(.state == "SANDBOX_READY") | .id')

  2. Find all containers inside this pod:

     CONTAINERS=$(crictl ps --pod "$POD_ID" -o json | jq -r '.containers[].id')

  3. Find the PID of the main process of each of these containers (or just of the one container that is running the crashing process):

     for CONTAINER in $CONTAINERS; do crictl inspect "$CONTAINER" | jq -r ".info.pid"; done
    
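Putting the three steps together, here is a sketch that grabs the PID of the broken container directly (assuming broken-container is the only match for crictl's --name filter):

CONTAINER_ID=$(crictl ps --pod "$POD_ID" --name broken-container -o json | jq -r '.containers[0].id')
PID=$(crictl inspect "$CONTAINER_ID" | jq -r '.info.pid')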

With the PID it is now possible to enter some of the namespaces of this process while maintaining access to all binaries installed on the machine we SSHed to - as long as the mount namespace (the -m flag) is not entered with nsenter:

PID=1234
nsenter -t $PID -n -p -u

Often it can be useful to add namespaces incrementally to see if one of the namespaces might have an impact on the container’s behaviour.
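A hypothetical incremental sequence, starting with only the network namespace:

nsenter -t $PID -n ip addr   # network namespace only
nsenter -t $PID -n -p        # network + PID namespace (spawns a shell)
nsenter -t $PID -n -p -u     # additionally the UTS namespace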

In this case it’s now possible to inspect the process that is refusing to work with all the tools available on the host - in the case of a minikube cluster this includes tools like lsof and strace.

But first let’s check if the ReadinessProbe works right now:

curl -I localhost:3000/readyz

HTTP/1.1 500 Internal Server Error
content-type: text/plain; charset=utf-8
content-length: 7
date: Fri, 01 Dec 2023 11:16:12 GMT

So it is returning a 500 error - maybe we can get some more information about what the process is actually doing using strace:

strace -f -p "$PID"

[pid 128904] epoll_wait(3,  <unfinished ...>
[pid 128893] futex(0x7f6985695940, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 128904] <... epoll_wait resumed>[], 1024, 202) = 0
[pid 128904] epoll_wait(3, [], 1024, 17) = 0
[pid 128904] statx(AT_FDCWD, "/opt/config/additional_file", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=0, ...}) = 0
[pid 128904] write(1, "app_state bad - configuration at"..., 61) = 61
[pid 128904] write(4, "\1\0\0\0\0\0\0\0", 8) = 8

It seems to stat a configuration file at /opt/config/additional_file - that’s strange, as the configuration should live in another file (config.toml)2.

Time to actually enter the mount namespace and see what’s in /opt/config:

nsenter -t $PID -n -p -u -m
ls /opt/config

additional_file  config.toml

Well, the file is there - so let’s update our configuration to not contain it and see if that unbreaks our application:

kubectl patch -n default configmaps myconfig --type=json -p '[{"op": "remove", "path": "/data/additional_file"}]'

After removing that key from the ConfigMap and restarting the Pod, it seems our application is finally happy!
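Checking the pod status again, for example with:

kubectl get pod my-broken-pod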

my-broken-pod   2/2     Running   0          10s

Obviously in this case the strace command could also have been run directly from the node, but knowing how to incrementally add container namespaces can still be useful - e.g. to check what a mounted filesystem really looks like inside the container.

References


  1. If the process is easy to identify, running ps and grepping for the process command line will also work. ↩︎

  2. Obviously in this example this file was mounted from the ConfigMap - in reality this might indicate problems with other software that injects libraries into all processes via LD_PRELOAD modifications. ↩︎