Cheveo Blog

Fix CrashLoopBackOff Systematically: 7 Causes, 1 Workflow

The 7 most common causes of CrashLoopBackOff in Kubernetes - with kubectl commands, real outputs and a decision tree that finds each one in under 5 minutes.

Clemens Christen · Certified Kubernetes Administrator (CKA)

TL;DR - CrashLoopBackOff isn’t a bug, it’s a state: the container started, crashed and is waiting for the next restart attempt. The cause is never in Kubernetes - always in the container, the manifest, or the environment. This workflow finds 90% of root causes in under 5 minutes: read the exit code, look at previous logs, check events, then test the most common of the 7 causes.

What CrashLoopBackOff really means

The term confuses more engineers than necessary. CrashLoopBackOff means:

  1. Kubernetes started the container
  2. The container terminated with a non-zero exit code
  3. Kubelet tried to restart it
  4. Crash again
  5. Kubelet now waits with exponential backoff until the next attempt (10s, 20s, 40s, 80s, … up to 5 minutes)

So the container isn’t running. The current attempt’s logs are empty because no restart has happened yet. You need the logs from the previous crash - this is the most important insight for everything that follows.
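The backoff schedule can be sketched in a few lines of shell (assuming the documented 10s base and 5-minute cap):

```shell
# Sketch of the backoff schedule: the delay starts at 10s, doubles after
# every crash, and is capped at 300s (5 minutes).
backoff() {
  local crashes=$1 delay=10
  while [ "$crashes" -gt 1 ]; do
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    crashes=$((crashes - 1))
  done
  echo "${delay}s"
}
backoff 1   # 10s
backoff 4   # 80s
backoff 7   # 300s
```

Kubelet resets this backoff once the container has run cleanly for 10 minutes - which is why a flaky pod can oscillate between Running and CrashLoopBackOff for hours.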

The 3-step workflow

Whatever the cause, these three commands always run first.

Step 1: Read the exit code

kubectl describe pod <name>

Scroll to the Last State section:

Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Mon, 04 May 2026 14:20:15 +0200
  Finished:     Mon, 04 May 2026 14:20:18 +0200

The exit code is the most important piece of information:

Exit code   Meaning
0           Clean exit (shouldn't end up in CrashLoop, check restartPolicy)
1           Generic application error - read the logs
2           Misuse of shell builtins - usually a typo in the command
126         Command not executable - permission issue
127         Command not found - wrong path or missing binary
137         SIGKILL - usually OOMKilled (check the Reason)
139         SIGSEGV - segmentation fault, native code crashed
143         SIGTERM - cleanly terminated from outside, check liveness probe
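The codes above 128 follow the standard Unix convention - exit code = 128 + the number of the fatal signal - so you can decode them mechanically. A small sketch:

```shell
# Decode an exit code: values above 128 mean the process was killed by a
# signal (code - 128); anything else is the application's own exit status.
explain_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application exit code $code"
  fi
}
explain_exit 137   # killed by signal 9
explain_exit 143   # killed by signal 15
```

Signal 9 is SIGKILL, signal 15 is SIGTERM - which is exactly why 137 and 143 dominate the table above.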

Step 2: Read previous logs

kubectl logs <pod> --previous

The --previous switch isn’t optional. Without it you see the logs of the current (not yet started) container - i.e. nothing. With it you see the logs of the last crash.

Step 3: Check system events

kubectl get events --sort-by=.lastTimestamp -n <namespace> | tail -20

System events show things the container itself can’t log: OOMKilled by memory limit, FailedMount on volumes, BackOff counts.

The 7 most common causes

1. Application error (exit code 1)

By far the most common cause. The code throws an exception at startup. Reasons:

  • Missing or wrong environment variable
  • Database not reachable (typical race-condition symptom)
  • Migration script fails
  • Config file doesn’t exist or has the wrong format

Fix workflow: kubectl logs <pod> --previous, read the exception, fix it. For DB race conditions: an initContainer with wait-for-db in front of the app.

2. OOMKilled (exit code 137, Reason: OOMKilled)

Container exceeded its memory limit. Kernel killed it with SIGKILL.

kubectl describe pod <name> | grep -A 2 "Last State"
# Reason: OOMKilled
# Exit Code: 137

Fix: Raise the memory limit or find the memory leak in the code. Blindly doubling the limit only treats the symptom - a real leak just crashes the pod again later. kubectl top pod shows real-time consumption. For the full diagnosis including JVM/Node/Go gotchas, see the OOMKilled cheatsheet and the OOMKilled article.

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # based on real usage, not gut feeling
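One JVM-specific trap worth a line here: without container-aware flags, HotSpot sizes its heap from the node's memory, not the container limit. A hedged sketch (the flags are standard HotSpot options, available since JDK 10 / 8u191):

```yaml
# Assumption: HotSpot JVM. JAVA_TOOL_OPTIONS is picked up automatically;
# MaxRAMPercentage caps the heap relative to the container's memory limit.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"   # heap <= 75% of the 512Mi limit
```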

3. Liveness probe fails (exit code 143, often with empty logs)

The liveness probe fails, kubelet terminates the container with SIGTERM. If the probe fails on the very first run, you usually see no logs at all.

Classic fail: initialDelaySeconds is too low. A Spring Boot app needs 30-60s to start, the probe fires after 10s and kills the container.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # not 10
  periodSeconds: 10
  failureThreshold: 3
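For slow starters there is a cleaner alternative to inflating initialDelaySeconds: a startupProbe (Kubernetes 1.18+), which suppresses the liveness probe until it succeeds. A sketch, assuming the same /healthz endpoint:

```yaml
# The liveness probe only takes over once the startup probe has passed.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # allows up to 120s of startup before the first kill
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```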

Diagnosis:

kubectl describe pod <name> | grep -A 5 "Liveness"
kubectl get events | grep -i "liveness"

4. Command not found (exit code 127)

The container image doesn’t have the command specified in the manifest. Or the path is wrong:

command: ["/usr/local/bin/myapp"]   # does this actually exist?

Quick test:

kubectl debug -it <pod> --image=busybox --target=<container>
# in the debug container:
ls /proc/1/root/usr/local/bin/

The ephemeral debug container shares the filesystem namespace, so you can see exactly what paths the original container had.

5. ConfigMap or Secret missing (exit code 1, often “no such file”)

The pod manifest mounts a ConfigMap that doesn’t exist. The container starts, can’t find the file, crashes.

kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>

Common cause: namespace confusion. The pod is in prod, the ConfigMap is in default.

6. Wrong volume permissions (exit code 1, “permission denied”)

A mounted PV is owned by a UID/GID the container user can't read. Frequently happens when migrating from Docker Compose to Kubernetes.

Fix: set securityContext.fsGroup so kubelet adjusts permissions on mount:

spec:
  securityContext:
    fsGroup: 1000
  containers:
    - name: app
      image: ...

7. Race condition with the database (exit code 1, “connection refused”)

The app starts before the database, can’t connect, crashes. The restart might catch the DB - or not.

Fix: an initContainer with a wait script:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -zv postgres 5432; do echo waiting for db; sleep 2; done']

Or better: application-level retry with exponential backoff in the code itself. initContainer is the bandage, not the cure.
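A minimal sketch of such a retry wrapper in shell - though the real fix belongs in the application's DB client, not the entrypoint:

```shell
# Hypothetical retry helper: runs a command until it succeeds, sleeping with
# exponential backoff between attempts, and gives up after $1 tries.
retry() {
  local max=$1; shift
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then return 1; fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
retry 3 true && echo "succeeded"      # succeeded
# retry 5 nc -z "$DB_HOST" 5432      # the wait-for-db case from above
```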

Decision tree

Exit code 137 + Reason OOMKilled?  ->  Cause 2 (memory)
                                  v no
Logs --previous completely empty?  ->  Cause 3 (liveness probe)
                                  v no
Logs say "no such file"?           ->  Cause 5 (ConfigMap/Secret)
                                  v no
Logs say "permission denied"?      ->  Cause 6 (volume permissions)
                                  v no
Logs say "command not found"?      ->  Cause 4 (command/path)
                                  v no
Logs say "connection refused"?     ->  Cause 7 (race condition)
                                  v no
                                       Cause 1 (application error)
                                       -> Read logs carefully, check stack trace
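The same tree as a shell sketch you could drop into a triage script - inputs are the exit code, the Reason from kubectl describe, and the --previous log text:

```shell
# Map (exit code, Reason, previous logs) to the most likely cause above.
diagnose() {
  local code=$1 reason=$2 logs=$3
  if [ "$code" = "137" ] && [ "$reason" = "OOMKilled" ]; then
    echo "cause 2: memory"; return
  fi
  if [ -z "$logs" ]; then echo "cause 3: liveness probe"; return; fi
  case "$logs" in
    *"no such file"*)       echo "cause 5: ConfigMap/Secret" ;;
    *"permission denied"*)  echo "cause 6: volume permissions" ;;
    *"command not found"*)  echo "cause 4: command/path" ;;
    *"connection refused"*) echo "cause 7: race condition" ;;
    *)                      echo "cause 1: application error" ;;
  esac
}
diagnose 137 OOMKilled "java heap dump"   # cause 2: memory
diagnose 1 Error "connection refused"     # cause 7: race condition
```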

What the workshops cover that this article doesn’t

This workflow handles the most common cases. What isn’t in the 7 causes:

  • JVM container with hidden OOM: the Java heap explodes but the kernel doesn’t see it as OOM. You only see exit code 1 with Killed.
  • Network policy blocking init egress: the app wants to fetch configs from Vault at startup, NetworkPolicy only allows inbound, app crashes without a log.
  • Cluster autoscaler killing the node during pod start: pod gets evicted before liveness probe runs, restart on a new node.

These patterns take more than a memorised command sequence - they take an understanding of the system. That's the difference between "guess and try" and "find the root cause in 3 minutes".

What’s next

In our Kubernetes Debugging Workshop we replay 8 real production incidents - including the three edge cases above - and drill the workflow until it sticks. 1 day, 8 hours, after which you solve CrashLoopBackOff systematically instead of by guessing.

Before you book: also have a look at our kubectl Debugging Cheatsheet for the 12 most important commands as a complete workflow. Also in the debugging series: OOMKilled in Kubernetes for exit-137 cases and Pod Pending: 23 causes for stuck pods.

1-Day Intensive Workshop

Kubernetes Debugging - systematic, not guesswork

Replay real production incidents, internalise kubectl workflows, find root causes in minutes.

View workshop details
Free · 30 minutes

Need a second opinion on your cluster?

Book a free 30-minute Kubernetes health check. We review your setup and give concrete recommendations, no sales pitch.

Book a slot