The scenario: you have a running Kubernetes cluster, but suddenly some of your containers start to have problems. Now you need a way to find out what exactly is causing them.
In this post I’ll highlight two ways to debug a container running on Kubernetes:

- ephemeral containers (which is likely what you should use most of the time)
- `nsenter` (a last-resort debugging option that requires access to the node the Pod is running on)
Want to follow along? Set up a test cluster and run the provided commands to get your very own failing container:
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: myconfig
  namespace: default
data:
  config.toml: |
    [default]
    additional_file: ""
---
apiVersion: v1
kind: Pod
metadata:
  name: my-broken-pod
  namespace: default
spec:
  containers:
  - name: working-container
    image: gcr.io/google_containers/pause-amd64:3.0
  - name: broken-container
    image: quay.io/fbergman/broken-pod:latest
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
    - containerPort: 3000
    readinessProbe:
      httpGet:
        path: /readyz
        port: 3000
    volumeMounts:
    - mountPath: /opt/
      name: configuration
  volumes:
  - name: configuration
    configMap:
      name: myconfig
```
```shell
minikube start
kubectl apply -f /some-web-url
```
In this case one of the `ReadinessProbe`s starts failing.
In the case of one container (`working-container`), there is no way to debug into it, because it does not contain any executable shell. This is a pretty common scenario when running minimal container images like distroless containers. The other container allows spawning a shell, but is missing most utilities that would make debugging easier: there is no `strace`, no `gdb` and no default networking tools.
How can we now figure out what is breaking the container on startup? We can hope that `kubectl describe pod` or the logs of our application contain some useful information, but let’s say this is not the case and running `kubectl logs my-broken-pod` returns nothing useful. That leaves the two approaches mentioned above - since it should be your first stop, let’s start with ephemeral containers:
Using ephemeral containers
When you have to debug a Pod, ephemeral containers are the first tool to reach for, because in general you might not have access to the cluster nodes.
When using ephemeral containers it is important to have some container images ready that contain the tools you want to use for debugging.
A very good network debugging image is the netshoot container.
But if all you need is a basic shell (in the case of a distroless container), busybox might already be enough.
Starting an ephemeral container
Ephemeral containers are handled as an additional field in the `spec` of a Pod: see the apidocs.
They can only be added to an already running Pod - trying to add them to a Pod during creation will result in an error:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-container-on-start
  labels:
    name: ephemeral-container-on-start
spec:
  containers:
  - name: main-container
    image: busybox:1.28
    resources:
      limits:
        memory: "64Mi"
        cpu: "250m"
  ephemeralContainers:
  - name: debug-container
    image: busybox:1.28
    resources:
      limits:
        memory: "64Mi"
        cpu: "250m"
```
```
The Pod "ephemeral-container-on-start" is invalid:
* spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container
* spec.ephemeralContainers: Forbidden: cannot be set on create
```
That leaves two options to add an ephemeral container to a Pod:

- The `kubectl debug` command:

  ```shell
  kubectl debug -it [pod-name] --image=busybox:1.28 --target=[container-name]
  ```

- Patching the Pod via the API directly, using the new `ephemeralcontainers` subresource:
```shell
# Just for this experiment, give the default serviceaccount all permissions
kubectl create clusterrolebinding --clusterrole cluster-admin --serviceaccount default:default default-all-permissions
# Get the API URL
APIURL=$(kubectl config view --minify --output jsonpath="{.clusters[*].cluster.server}")
# Impersonate a serviceaccount with the required privileges
TOKEN=$(kubectl create token default)
# Call the API using curl as this serviceaccount
curl --header "Authorization: Bearer $TOKEN" \
  --header "Content-Type: application/strategic-merge-patch+json" \
  --request PATCH \
  -d '
{
  "spec":
  {
    "ephemeralContainers":
    [
      {
        "name": "debugger",
        "command": ["sh"],
        "image": "busybox",
        "targetContainerName": "broken-container",
        "stdin": true,
        "tty": true,
        "volumeMounts": []
      }
    ]
  }
}' \
  "${APIURL}/api/v1/namespaces/default/pods/my-broken-pod/ephemeralcontainers" -k
```

Once the container has been added like that, it can be used via `kubectl exec -ti my-broken-pod -c debugger sh`. Getting rid of the new container can be done by killing the process that was started when `exec`ing into the container: `kill -9 <pid_of_initial_sh>`.
What will not work is using `kubectl edit`, as this is explicitly not supported:

> Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so it’s not possible to add an ephemeral container using kubectl edit.
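Whichever way the container was added, it ends up as an ordinary entry under `spec.ephemeralContainers` (with its runtime state under `status.ephemeralContainerStatuses`), so you can confirm it landed with `kubectl get pod my-broken-pod -o json`. A minimal sketch with jq, run against a trimmed, hypothetical version of that output:

```shell
# Trimmed, hypothetical output of `kubectl get pod my-broken-pod -o json`
POD='{"spec":{"ephemeralContainers":[{"name":"debugger","image":"busybox"}]}}'
# The ephemeral container shows up as an ordinary entry in the spec
printf '%s' "$POD" | jq -r '.spec.ephemeralContainers[].name'
```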
Once the container is launched (and you either are already inside the shell when using `kubectl debug` or have attached via `kubectl exec`) you can run whatever commands might be necessary.
Cleaning up ephemeral containers
Checking the apidocs also shows what might be an issue for some users:

> Ephemeral containers may not be removed or restarted.
So - once an ephemeral container has been added to a Pod, the only way to completely remove it from the Pod spec is to recreate the Pod without it.
It’s also strongly recommended not to keep a process running in the ephemeral container, because it counts against the requests and limits of the entire Pod:

> The kubelet may evict a Pod if an ephemeral container causes the Pod to exceed its resource allocation.
Using nsenter
`nsenter` does not only work for Kubernetes Pods, so this is something you can also use for docker or podman containers.
Generally it is useful if you want to pick and choose which namespaces of the container you want to access.
In Kubernetes this will only work if there is a way to gain access to the cluster node the Pod you want to debug is running on - either via SSH or using an OpenShift `debug node/NODE` container.
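As background for the pick-and-choose part: a process’s namespaces are exposed as symlinks under `/proc/<pid>/ns`, and `nsenter` works by joining a chosen subset of exactly these. You can inspect them for any process you can see, e.g. your current shell:

```shell
# Every process's namespaces are exposed as symlinks under /proc/<pid>/ns;
# nsenter -t <pid> joins a chosen subset of these (network, pid, mount, ...).
ls -l /proc/$$/ns
```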
Once access to the node is established, the process running inside the Pod needs to be found. Many of these commands depend on the container runtime in use, so for the remainder I will assume the cluster is using CRI-O.¹

- Find the Pod ID for our `my-broken-pod` Pod:

  ```shell
  POD_ID=$(crictl pods --name my-broken-pod -o json | jq -r '.items[] | select(.state == "SANDBOX_READY") | .id')
  ```
- Find all containers inside this Pod:

  ```shell
  CONTAINERS=$(crictl ps --pod "$POD_ID" -o json | jq -r '.containers[].id')
  ```

- Find the PID of the process inside a container (or just pick the one container that is running the crashing process):

  ```shell
  crictl inspect 154c67e44fcc843201aa582214395c82eb0e40acfda1ef1a9e12567f371aa13d | jq -r ".info.pid"
  ```
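As an offline sketch of that last step - the PID sits at `.info.pid` of the inspect output, so the jq filter boils down to this (the JSON is a hypothetical, trimmed `crictl inspect` result):

```shell
# Trimmed, hypothetical `crictl inspect` output - only the field we need
SAMPLE='{"info":{"pid":128893}}'
PID=$(printf '%s' "$SAMPLE" | jq -r '.info.pid')
echo "$PID"
```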
With the PID it is now possible to enter some of the namespaces of this process, while maintaining access to all binaries installed on the machine we SSHed to (as long as the mount namespace - selected by the `-m` flag - is not entered with `nsenter`):

```shell
PID=1234
nsenter -t $PID -n -p -u
```
Often it can be useful to add namespaces incrementally to see if one of the namespaces might have an impact on the container’s behaviour.
In this case it’s now possible to inspect the process that is refusing to work with all tools available on the host - in the case of a minikube cluster this includes tools like `lsof` and `strace`.
But first let’s check if the `ReadinessProbe` works right now:
```shell
curl -I localhost:3000/readyz
```
```
HTTP/1.1 500 Internal Server Error
content-type: text/plain; charset=utf-8
content-length: 7
date: Fri, 01 Dec 2023 11:16:12 GMT
```
So it is returning a 500 error - maybe we can get some more information about what the process is actually doing using strace:
```shell
strace -f -p "$PID"
```
```
[pid 128904] epoll_wait(3, <unfinished ...>
[pid 128893] futex(0x7f6985695940, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 128904] <... epoll_wait resumed>[], 1024, 202) = 0
[pid 128904] epoll_wait(3, [], 1024, 17) = 0
[pid 128904] statx(AT_FDCWD, "/opt/config/additional_file", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=0, ...}) = 0
[pid 128904] write(1, "app_state bad - configuration at"..., 61) = 61
[pid 128904] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
```
It seems to `stat` a configuration file at `/opt/config/additional_file` - that’s strange, as the configuration should live in another file (`config.toml`)².
Time to actually use the mount namespace and see what’s in `/opt/config`:

```shell
nsenter -t $PID -n -p -u -m
ls /opt/config
```

```
additional_file  config.toml
```
Well, the file is there - so let’s update our configuration to not contain it and see if that unbreaks our application:
```shell
kubectl patch -n default configmaps myconfig --type=json -p '[{"op": "remove", "path": "/data/additional_file"}]'
```
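The JSON-patch `remove` operation above simply deletes that key from the ConfigMap’s `data` map; the effect can be sketched offline with jq on a hypothetical `data` section:

```shell
# Hypothetical ConfigMap data with the offending key present
CM='{"data":{"config.toml":"[default]","additional_file":""}}'
# A JSON-patch "remove" of /data/additional_file boils down to deleting the key
printf '%s' "$CM" | jq -c 'del(.data.additional_file)'
```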
After removing that key from the `ConfigMap` and a Pod restart, it seems our application is finally happy!
```
my-broken-pod   2/2   Running   0   10s
```
Obviously in this case the `strace` command could also have been run directly from the node, but knowing how to incrementally add container namespaces can still be useful - e.g. to check what a mounted filesystem really looks like inside the container.
References
- Ephemeral containers (official documentation)
- Ephemeral containers (great blog post)
- nsenter man page
1. If the process is easy to identify, running `ps` and grepping for the process commandline will also work. ↩︎
2. In this example this file was mounted from the `ConfigMap` - in reality this might indicate problems with other software that injects libraries into all processes via `LD_PRELOAD` modifications. ↩︎