Debugging Kubernetes postStart Hooks

I recently ran into an issue where a postStart hook in a Kubernetes Pod was failing. Kubernetes is normally pretty good about making errors obvious, but lifecycle hooks are a bit of a rough edge. My normal method for debugging them is to run the script locally in a Docker container, but this script was being injected into a Volume via a ConfigMap, so it seemed like the "run it on your laptop" approach wouldn't quite simulate reality well enough. Here's what I did instead.
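
For context, the failing setup was shaped roughly like this; every name in the sketch (the image, the script path, the ConfigMap) is a hypothetical stand-in:

  apiVersion: v1
  kind: Pod
  metadata:
    name: example-app
  spec:
    containers:
      - name: app
        image: example/app:latest
        lifecycle:
          postStart:
            exec:
              # run the script that the ConfigMap volume mounts below
              command: ["sh", "/opt/hooks/post-start.sh"]
        volumeMounts:
          - name: hook-script
            mountPath: /opt/hooks
    volumes:
      - name: hook-script
        configMap:
          name: post-start-script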

Assumptions

The debugging steps here assume that you have a fairly high level of permissions in your Kubernetes cluster. You'll need to be able to run these commands against your Namespace: kubectl get pods, kubectl exec, and kubectl apply. You'll also need to be using a container image that has a shell in it.
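
If you're not sure where you stand, kubectl auth can-i will tell you before you get halfway in and hit a wall:

  kubectl auth can-i get pods          # needed for kubectl get pods
  kubectl auth can-i create pods       # needed for kubectl apply
  kubectl auth can-i create pods/exec  # needed for kubectl exec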

Create a workspace

Debugging is largely a process of forming and testing hypotheses, and whenever I'm experimenting I like to start with a blank directory. I have an alias in my Bash profile for creating them: cd "$(mktemp -d)". This step is 100% optional, because hey, it's your filesystem. The other thing I like to do when debugging is set some environment variables to save me some typing. I prefix them with "THE" because it's fewer characters to type than "DEBUG": export THEPOD=$(kubectl get pod -oname <pod-that-is-failing>).
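
Put together, the workspace setup is two lines, with the pod name being whatever has been failing to start:

  cd "$(mktemp -d)"
  export THEPOD=$(kubectl get pod -oname <pod-that-is-failing>)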

Make a test copy of the failing Pod

This debugging method tests hypotheses about failing postStart hooks by execing into a Pod and running them, which means the Pod has to be running in the first place. A failing postStart hook prevents the Pod from starting, so the first thing to do is create a modified version of the Pod that actually starts. There are a few ways of doing this. If you're working in a true sandbox, you can run kubectl edit on the Pod or its controlling Deployment, remove the postStart hook, and move on to the next step. If you're working in an environment where you don't want to edit things directly, do this instead:

  1. Make a copy of the Pod definition: kubectl get pod $THEPOD -o yaml > pod.yaml
  2. Edit the Pod definition to remove the postStart hook and any Kubernetes-generated fields such as creationTimestamp, ownerReferences, and status.
  3. Make your modified Pod easily deletable by giving it a memorable name and adding a metadata.labels entry to search on (sketched just after this list). I personally use jturner-delete-me-<four-random-letters> for a name and cleanup: "true" as a label.
  4. Apply the modified Pod definition: kubectl apply -f pod.yaml
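
After steps 2 and 3, the trimmed pod.yaml looks something like this sketch; the container and volume details are hypothetical stand-ins for whatever your real Pod carries:

  apiVersion: v1
  kind: Pod
  metadata:
    name: jturner-delete-me-xyza   # memorable and obviously disposable
    labels:
      cleanup: "true"              # lets you bulk-delete test Pods later
  spec:
    containers:
      - name: app
        image: example/app:latest  # same image as the failing Pod
        # lifecycle.postStart deleted so the Pod actually starts
        volumeMounts:
          - name: hook-script
            mountPath: /opt/hooks
    volumes:
      - name: hook-script
        configMap:
          name: post-start-script  # keep the mount so you can still run the script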

Debug the postStart hook

Unfortunately, this is the "draw the rest of the owl" portion of the post. Every debugging session will start with kubectl exec -ti jturner-delete-me-xyza -- sh and running the postStart hook, but what's actually wrong will vary from hook to hook. In my case, I was trying to be too clever (any amount of clever in shell scripts is too clever) in the postStart script. Kubernetes treats any non-zero exit from a postStart hook's command as a failure, even when that non-zero exit is expected.
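
As a starting point, a session against the test Pod looks roughly like this; the script path is a hypothetical stand-in for wherever your ConfigMap mounts the hook:

  kubectl exec -ti jturner-delete-me-xyza -- sh

  # inside the container: run the hook by hand, tracing every command
  sh -x /opt/hooks/post-start.sh
  echo $?   # any non-zero exit here would have failed the real hook

If a command in the script is allowed to fail, say so explicitly with something like command-that-may-fail || true, so the script as a whole still exits zero.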

Clean up

Be kind to your fellow cluster users and clean up after yourself: kubectl delete pod -l cleanup=true.

Takeaways

  • Resist the urge to be too clever in your postStart scripts. postStart scripts can fill in feature gaps around container topology, but they're one of the most blunt instruments in the Kubernetes toolchain. Keep them simple.
  • Making test copies of cluster resources is cheap and easy for the most part. Reach for it as an alternative to editing things directly on the cluster.

Hopefully this is helpful the next time you find yourself staring at cryptic kubectl describe output about a failing postStart hook.
