OOMKilled in Kubernetes: 6 Causes, kubectl Workflow, Right-Sizing

Pod dying with exit 137? Six causes of OOMKilled in Kubernetes, with kubectl commands, JVM gotchas, and a decision tree for right-sizing in under 10 minutes.

Clemens Christen · Certified Kubernetes Administrator (CKA)

TL;DR - OOMKilled isn’t a bug, it’s a diagnosis: the kernel’s OOM killer terminated the container because it crossed its memory limit. Doubling the limit is the most expensive fix - and usually wrong. This workflow finds the real cause in under 10 minutes: confirm, measure, compare against the limit, then fix the right thing.

🔖 Just want the commands? Here’s the interactive OOMKilled cheatsheet - with copy buttons, JVM/Node/Go/Python runtime limits, and a printable view. Bookmark recommended.

What OOMKilled really means

OOMKilled is the diagnosis, not the bug. The sequence:

  1. Container allocates memory until it hits its cgroup memory limit
  2. Kernel sends SIGKILL to the main container process
  3. Container runtime reports exit code 137 (128 + signal 9)
  4. Kubelet writes Reason: OOMKilled into pod status
  5. Restart policy decides what happens next: Always → restart, OnFailure → restart, Never → pod stays Failed

Important: the OOM killer is a Linux kernel decision, not a Kubernetes feature. Kubernetes only sets the cgroup limit; the kernel enforces it. That also means an OOMKill is always hard and immediate. No graceful shutdown, no SIGTERM, no pre-stop hook. The process is dead in the same microsecond the limit gets crossed.
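
Steps 4 and 5 are visible directly in the pod status. A scriptable way to read them, assuming a single-container pod (adjust the [0] index otherwise):

kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}: exit {.status.containerStatuses[0].lastState.terminated.exitCode}, restarts {.status.containerStatuses[0].restartCount}{"\n"}'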

The 4-step workflow

Whatever the cause, these four steps always apply.

Step 1: Confirm it really was OOMKilled

kubectl describe pod <name> | grep -A 5 "Last State"

Output for a real OOMKill:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Thu, 08 May 2026 09:14:22 +0200
  Finished:     Thu, 08 May 2026 09:18:47 +0200

If you see Reason: Error plus Exit Code: 137, it was not the cgroup OOM killer - it was an external SIGKILL (liveness probe death spiral, node eviction under memory pressure, or a sidecar). Different cause, different workflow.
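
To pin down which of those it was, the pod's events usually name the culprit - a quick check, again with <name> as a placeholder:

# liveness-probe kills and evictions show up as events
kubectl get events --field-selector involvedObject.name=<name> --sort-by=.lastTimestamp
kubectl describe pod <name> | grep -i -A 2 liveness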

Step 2: Measure real usage

kubectl top pod <name> --containers

That’s the live value. The peak matters more:

# cgroups v2 (Kubernetes 1.25+, all modern distros)
kubectl exec <pod> -- cat /sys/fs/cgroup/memory.peak

# cgroups v1 (legacy)
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes

memory.peak shows the maximum since container start in bytes. That’s the only number that matters for limit calculations.

Step 3: Compare against the limit

kubectl describe pod <name> | grep -A 3 "Limits\|Requests"

Or structured via JSON:

kubectl get pod <name> -o jsonpath='{range .spec.containers[*]}{.name}: {.resources}{"\n"}{end}'

Three scenarios:

  • Peak ≈ limit → limit too small or real spike. Causes 1, 3, or 4.
  • Peak ≪ limit, but still OOM → node OOM. Cause 6.
  • Peak grows monotonically over days → memory leak. Cause 2.
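
To put peak and limit side by side in one shot - a minimal sketch, assuming a single-container pod and cgroups v2:

POD=my-app   # replace with your pod name
LIMIT=$(kubectl get pod "$POD" -o jsonpath='{.spec.containers[0].resources.limits.memory}')
PEAK=$(kubectl exec "$POD" -- cat /sys/fs/cgroup/memory.peak)
echo "limit: $LIMIT, peak: $((PEAK / 1024 / 1024))Mi"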

Step 4: Fix the right thing

Only after steps 1-3 are done. Tweaking the limit without measuring either costs you unused node capacity or buys you the next crash in six hours.

The 6 most common causes

1. Limit set too low

The most common case in practice. Someone set the limit by gut feeling, the app needs more.

Diagnosis: peak just below or at the limit, usage is stable (no growth), no memory leak.

Fix: set limit to peak × 1.3, set request to average usage.

resources:
  requests:
    memory: "512Mi"   # average usage
  limits:
    memory: "1Gi"     # peak * 1.3, no more

2. Memory leak in the application

Usage grows monotonically over hours or days, then OOMKill, restart, the curve starts over.

Diagnosis: capture kubectl top pod over several hours - the sawtooth pattern is unmistakable.

# sample every 10 seconds; let it run as long as the pattern needs (Ctrl-C to stop)
while true; do kubectl top pod <name> --no-headers; sleep 10; done | tee mem.log

Fix: profile inside the code (pprof for Go, jmap/heap dump for Java, --inspect for Node.js). Raising the limit is the most expensive workaround - the pod still crashes, just later.
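
For a JVM workload, a first heap snapshot can be pulled straight from the running pod - a sketch that assumes the image ships the JDK tools and the JVM runs as PID 1:

# trigger a heap dump inside the container, then copy it out for offline analysis
kubectl exec <pod> -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1
kubectl cp <pod>:/tmp/heap.hprof ./heap.hprof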

3. Burst load from large requests

App is normally well under the limit, but a 100MB upload, a bulk import, or a heavy query overshoots briefly and dies.

Diagnosis: peak hits the limit, average is only 30-40% of the limit. Correlates with specific endpoints or cron jobs.

Fix: stream instead of in-memory buffering, paginate, or simply size the limit for the worst case. Caveat: if the worst case is 10x the average, the manifest is wrong - split heavy operations into a separate worker deployment.
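
If the split is the answer, the heavy path gets its own Deployment and its own limit, so a worst-case import can no longer take the API pods down with it. A minimal sketch with hypothetical names and image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bulk-import-worker            # hypothetical: heavy operations only
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bulk-import-worker
  template:
    metadata:
      labels:
        app: bulk-import-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/import-worker:latest   # hypothetical image
          resources:
            requests:
              memory: "1Gi"
            limits:
              memory: "2Gi"           # sized for the worst case, isolated from the API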

4. JVM, Node.js, Go without a runtime limit

Java classic: without -Xmx, an older JVM (before container support arrived in Java 10 / 8u191) sizes its default heap at 25% of host RAM - not of the container limit. On a 64GB node that's a 16GB heap; in a 1GB container that's an instant OOMKill.

The right values per runtime:

Runtime         Setting                              Example for 1Gi limit
Java 11+        -XX:MaxRAMPercentage=75.0            768Mi heap
Java (legacy)   -Xmx<size>                           -Xmx768m
Node.js         --max-old-space-size=<MB>            --max-old-space-size=768
Go 1.19+        GOMEMLIMIT (env)                     GOMEMLIMIT=900MiB
Python          resource.setrlimit(RLIMIT_AS, ...)   in init code

Rule of thumb: runtime heap = container limit × 0.75. The remaining 25% is for stack, native, JIT, metaspace, threads.
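
In a pod spec this usually lands as an env var. A sketch for a Node.js container with a 1Gi limit (GOMEMLIMIT for Go and JAVA_TOOL_OPTIONS for the JVM are wired in the same way - the JVM variant appears in the right-sizing example below); names and image are illustrative:

containers:
  - name: api                               # hypothetical Node.js service
    image: registry.example.com/api:latest  # hypothetical image
    resources:
      limits:
        memory: "1Gi"
    env:
      - name: NODE_OPTIONS                  # read by Node.js at startup
        value: "--max-old-space-size=768"   # 1Gi limit * 0.75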

5. Off-heap or native memory

Heap dump shows 200MB usage, the container is at 1GB. The missing memory lives outside the managed heap:

  • JVM: direct ByteBuffers (Netty, Kafka clients), metaspace with many class loaders, JNI
  • Node.js: Buffer allocations outside V8, native addons (sharp, node-canvas)
  • Python: numpy/pandas, any C-extension code calling malloc

Diagnosis: the gap between kubectl top pod and the heap dump size is your off-heap usage.

Fix: enable Native Memory Tracking (-XX:NativeMemoryTracking=summary for JVM), cap off-heap caches, or just plan for it and size the limit accordingly.
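
For the JVM case, Native Memory Tracking has to be switched on at startup; after that, jcmd breaks the native side down by category. A sketch assuming the JDK tools are in the image and the JVM is PID 1:

# enable at startup, e.g. via env in the pod spec:
#   JAVA_TOOL_OPTIONS: "-XX:NativeMemoryTracking=summary"
# then read the breakdown from the running container:
kubectl exec <pod> -- jcmd 1 VM.native_memory summary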

6. Node OOM (the pod isn’t the culprit)

Pod limit is 2GB, pod uses 800MB, OOMKill anyway. What happened?

The node ran out of memory overall. The kubelet flags the node with MemoryPressure and starts evicting pods. Which pod gets killed first depends on its QoS class:

  1. BestEffort (no requests/limits) first
  2. Burstable (limits > requests) second - by OOM score (are they using more than their request?)
  3. Guaranteed (limit == request) last

Diagnosis: another pod on the same node has a memory leak, or the node is generally overcommitted.

kubectl describe node <node> | grep -A 10 "Conditions"
kubectl get events -A --field-selector reason=Evicted

Fix: Guaranteed QoS for critical workloads (limits == requests), priorityClassName: system-cluster-critical for infra pods, and node sizing based on real demand instead of the maximum.
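
To see which pods on a node are first in line, the QoS class sits right in the pod status:

# BestEffort goes first, Guaranteed last
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass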

Decision tree

Is the reason really OOMKilled (not just exit 137)?
  ↓ yes
Peak ≈ limit?
  ↓ yes                          ↓ no
Peak grows monotonically?        Other pods on the node also affected?
  ↓ yes            ↓ no            ↓ yes → cause 6 (node pressure)
Cause 2 (leak)     Correlates with   ↓ no
                   specific reqs?    Container is JVM/Node/Go?
                   ↓ yes     ↓ no      ↓ yes → cause 4 (heap limit missing)
                   Cause 3   Cause 1           or cause 5 (off-heap)
                   (spike)   (too small)

Right-sizing formula

After measuring:

memory request = ⌈avg_usage⌉                # for scheduler accuracy
memory limit   = ⌈peak_usage × 1.3⌉         # 30% safety buffer

runtime heap   = memory limit × 0.75         # 25% for stack, native, JIT

Example: peak is 700MB, average 400MB:

resources:
  requests:
    memory: "400Mi"
  limits:
    memory: "910Mi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"   # → 682Mi heap

For very stable workloads (stateful sets, databases), Guaranteed QoS makes more sense:

resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"   # limit == request → Guaranteed

Where VPA fits

The Vertical Pod Autoscaler automates steps 2-4. It watches real usage over days and recommends limits/requests. Three modes:

  • Off - recommendations only, no enforcement (good starting position)
  • Initial - sets values at pod creation, immutable after
  • Auto - recreates pods with new values (only for non-critical workloads)

VPA is not a substitute for cause analysis - on memory leaks (cause 2) it just delays the crash. But for causes 1 and 3 it’s the right answer.
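
A recommendation-only VPA looks roughly like this - a sketch that assumes the VPA components are installed in the cluster and targets a hypothetical Deployment named my-app:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa                  # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                    # hypothetical workload
  updatePolicy:
    updateMode: "Off"               # recommendations only, no enforcement

The recommended requests then show up under kubectl describe vpa my-app-vpa.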

What the workshops cover that this post doesn’t

This workflow handles 80% of OOMKill cases. What’s not in the 6 causes:

  • OOMKill from a sidecar in the same pod - resources are aggregated at pod level; a sidecar with a leak kills the main container
  • Eviction from wrong evictionHard thresholds in kubelet - pods die at 90% node memory while their own limit isn’t even close
  • Kernel page cache counts towards working_set - containers with large mmaped files show much higher usage than their actual heap

These patterns need systems thinking - exactly the difference between “double the limit and hope” and “find the root cause in 10 minutes”.

Where to go from here

In the Kubernetes Debugging Workshop we replay 8 real production incidents - including two OOMKill edge cases (JVM off-heap and node eviction under load) - and drill the workflow until it sticks. One day, eight hours, after which you fix OOMKill systematically rather than by guessing.

1-Day Intensive Workshop

Kubernetes Debugging - systematic, not guesswork

Replay real production incidents, internalise kubectl workflows, find root causes in minutes.

View workshop details

Need a second opinion on your cluster?

Book a free 30-minute Kubernetes health check. We review your setup and give concrete recommendations, no sales pitch.

Book a slot