The first GPU OOM I ever debugged made no sense to me: the error said the allocator failed to find 2 GiB, while nvidia-smi showed several gigabytes free. Nothing about that is a bug. It's just how CUDA memory management works once frameworks and serving engines start lying to you — productively — about what "free" means. This post is the mental model I wish I'd had.
cudaMalloc is slow, so everyone pools
Raw cudaMalloc / cudaFree calls are expensive — they can synchronize the device and take milliseconds, which is an eternity when a decode step takes tens of milliseconds total. So no serious framework calls them per-tensor. PyTorch's caching allocator grabs large segments from CUDA, carves tensors out of them, and when a tensor dies it returns the block to PyTorch's own free lists, not to CUDA.
This creates the two numbers you must learn to read:
- torch.cuda.memory_allocated(): bytes in live tensors right now.
- torch.cuda.memory_reserved(): bytes PyTorch has claimed from CUDA, including cached free blocks.
nvidia-smi sees reserved (plus the CUDA context itself, a few hundred MB). Your OOM happens when neither the free lists nor CUDA can satisfy a request. When the numbers look contradictory, torch.cuda.memory_summary() is the first thing I print — it breaks down reserved vs. allocated vs. inactive blocks and usually tells you which of the failure modes below you're in.
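To make the allocated/reserved split concrete, here is a toy model of a caching allocator. It is a sketch for intuition only, not PyTorch's implementation: the class name, the MiB units, and the best-fit policy are assumptions, and unlike the real allocator it never splits a block.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator; sizes are in MiB (illustrative only)."""

    def __init__(self, device_capacity):
        self.device_free = device_capacity  # what the driver could still hand out
        self.cached = []                    # free blocks kept by the pool
        self.live = {}                      # handle -> size of the block backing it
        self._next_handle = 0

    def malloc(self, size):
        # 1) Prefer a cached block that is large enough (best fit, no splitting).
        fits = [block for block in self.cached if block >= size]
        if fits:
            block = min(fits)
            self.cached.remove(block)
        # 2) Otherwise fall back to the slow driver allocation.
        elif size <= self.device_free:
            self.device_free -= size
            block = size
        else:
            raise MemoryError(f"out of memory: tried to allocate {size} MiB")
        handle = self._next_handle
        self._next_handle += 1
        self.live[handle] = block
        return handle

    def free(self, handle):
        # The block returns to the pool's free list, not to the driver.
        self.cached.append(self.live.pop(handle))

    def memory_allocated(self):
        return sum(self.live.values())

    def memory_reserved(self):
        return sum(self.live.values()) + sum(self.cached)


pool = ToyCachingAllocator(device_capacity=1024)
a = pool.malloc(256)
b = pool.malloc(512)
pool.free(a)
print(pool.memory_allocated(), pool.memory_reserved())  # 512 768
```

After the free, the toy's "allocated" drops to 512 while "reserved" stays at 768, which is the number an outside observer would see. A 300 MiB request would now fail in this model even though 256 MiB sits cached and another 256 MiB is still free on the device: neither source can satisfy it alone.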
Fragmentation, the silent tax
The caching allocator can only reuse a free block if the new request fits in it. Serve traffic with wildly varying sequence lengths and you end up with reserved memory shredded into odd-sized free blocks — 14 GiB "free" in pieces of 200 MB, and a 2 GiB allocation fails. That's the classic reserved but unallocated OOM:
CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total
capacity of 23.6 GiB of which 1.2 GiB is free. Of the allocated
memory 14.1 GiB is allocated by PyTorch, and 6.9 GiB is reserved
by PyTorch but unallocated.
That last clause is the diagnosis: the memory exists, but no single block is big enough. Two practical levers:
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets PyTorch grow segments using CUDA's virtual memory APIs instead of hunting for new contiguous physical ranges. For workloads with variable shapes this alone has rescued me more than once.
- torch.cuda.empty_cache() returns cached blocks to CUDA. It's a bandaid (the next allocation pays the cudaMalloc cost again), but it's a useful experiment: if OOMs disappear after sprinkling it in, you've confirmed fragmentation and should fix the allocation pattern instead.
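The "read the last clause" check can be automated. The sketch below assumes the message wording shown above (it varies between PyTorch versions), and the function name and the "reserved-but-unallocated is at least the request" rule are a heuristic of this example, not something PyTorch provides.

```python
import re

_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"


def _to_bytes(number, unit):
    return float(number) * _UNITS[unit]


def diagnose_oom(message):
    """Rough diagnosis from the text of a CUDA OOM error (heuristic)."""
    text = " ".join(message.split())  # undo line wrapping in logs
    requested = re.search(r"Tried to allocate " + _SIZE, text)
    cached = re.search(_SIZE + r" is reserved by PyTorch but unallocated", text)
    if not (requested and cached):
        return "unrecognized message"
    # If the pool already holds at least as much idle memory as was requested,
    # the bytes existed but no single block was large enough.
    if _to_bytes(*cached.groups()) >= _to_bytes(*requested.groups()):
        return "likely fragmentation"
    return "likely genuine exhaustion"
```

Fed the error text above (2.00 GiB requested, 6.9 GiB reserved but unallocated), this returns "likely fragmentation". Treat the result as a prompt to look at memory_summary(), not as a verdict.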
Why serving engines pre-plan instead
Serving engines like SGLang and vLLM mostly opt out of this fight. Dynamic allocation is what fragments; so they allocate the big consumer — the KV cache — as one static pool at startup and manage it themselves in fixed-size pages. SGLang exposes this as --mem-fraction-static: the fraction of GPU memory claimed up front for weights plus the KV pool.
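The fixed-size-page idea can be sketched in a few lines. This is a toy model, not SGLang's or vLLM's actual data structure: the class, method names, and page size are hypothetical, and a real engine tracks GPU tensor slots rather than Python integers.

```python
class PagedKVPool:
    """Toy fixed-size-page KV pool; names and sizes are illustrative."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size                # tokens per page
        self.free_pages = list(range(num_pages))  # every page starts free
        self.tables = {}                          # request id -> list of page ids

    def reserve(self, request_id, num_tokens):
        """Make sure the request owns enough pages for num_tokens tokens in total."""
        needed = -(-num_tokens // self.page_size)  # ceiling division
        table = self.tables.get(request_id, [])
        extra = max(needed - len(table), 0)
        if extra > len(self.free_pages):
            raise MemoryError("KV pool exhausted")
        # Any free page will do: pages need not be adjacent.
        table = table + [self.free_pages.pop() for _ in range(extra)]
        self.tables[request_id] = table
        return table

    def release(self, request_id):
        # All pages of a finished request become reusable immediately.
        self.free_pages.extend(self.tables.pop(request_id, []))


pool = PagedKVPool(num_pages=8, page_size=16)
pool.reserve("a", 40)  # 3 pages
pool.reserve("b", 70)  # 5 pages, pool is now full
pool.release("a")
print(len(pool.reserve("c", 33)))  # 3
```

Because every page has the same size, a released page can serve any later request, so the only waste is the unused tail of each request's last page. That is the property a general-purpose allocator with variable block sizes cannot offer.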
The budget arithmetic for Llama-3-8B on a 24 GiB card (an L4 or RTX 4090 class GPU) looks like this:
total:                        24.0 GiB
CUDA context + NCCL + misc:   −1.0 GiB
weights (8B params, BF16):   −15.0 GiB
activations + CUDA graphs:    −2.0 GiB
────────────────────────────────────────
KV cache pool:                ~6.0 GiB
Llama-3-8B uses GQA with 8 KV heads, which works out to 128 KiB of cache per token in FP16. A 6 GiB pool is therefore:
6 GiB ÷ 128 KiB/token ≈ 49,000 tokens
That's the engine's real capacity: six concurrent 8K-context requests, or twelve at 4K, shared among however many users are connected. When I size a deployment, I start from this number, not from tokens/sec. Set --mem-fraction-static too high and the slice left for activations is too thin: you OOM during a long prefill, the one allocation that still happens dynamically. Too low and you've donated KV capacity, which is concurrency, which is throughput.
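The arithmetic above is short enough to script. In this sketch the layer count (32) and head dimension (128) come from the published Llama-3-8B configuration rather than from the text, and the budget lines are the rough estimates from the table, so swap in your own measurements.

```python
GIB = 2**30


def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, 2-byte cache entries.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

# Budget from the table: total minus context, weights, activations/graphs.
pool_bytes = (24.0 - 1.0 - 15.0 - 2.0) * GIB
capacity_tokens = int(pool_bytes // per_token)

print(per_token // 1024, "KiB per token")               # 128 KiB per token
print(capacity_tokens, "tokens")                        # 49152 tokens
print(capacity_tokens // 8192, "requests at 8K context")  # 6 requests at 8K context
```

Changing one input at a time (an FP8 cache with dtype_bytes=1, or a 1 GiB smaller activation reserve) shows quickly how much concurrency each decision buys.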
Field-debugging OOMs that make no sense
The ones that took me longest to learn, in rough order of frequency:
- Reserved-but-unallocated. Covered above. Read the error text carefully — it literally tells you.
- The graph you didn't know you kept. Run forward passes without torch.no_grad() (or inference_mode()) and every activation is retained for a backward pass that never comes. Keep a reference to the outputs (a total_loss += loss accumulator, a list of logits) and the graphs pile up too; memory grows linearly with steps until it doesn't. Any eval or scoring loop that creeps upward over hours gets checked for this first.
- CUDA graphs holding their pools. Engines capture decode steps as CUDA graphs per batch size; each capture pins workspace memory for replay. The cost shows up at startup and looks like a leak in nvidia-smi. It isn't one, but it's why "it fit yesterday" can fail after enabling more capture batch sizes.
- Multiprocess contention. PyTorch's allocator only knows about its own process. A stray Jupyter kernel holding 3 GiB, a second engine instance, even a crashed run whose process didn't die: your process just sees CUDA refusing allocations. nvidia-smi lists per-process usage; believe it over your framework's numbers.
- Allocations outside PyTorch's view. NCCL communicators and cuBLAS workspaces can allocate directly from CUDA, bypassing the caching allocator. The gap between memory_reserved() and what nvidia-smi reports for your process is them, plus the CUDA context.
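For the multiprocess case, the per-process numbers can be pulled out of nvidia-smi programmatically. This is a sketch under assumptions: the helper names are made up for illustration, and it relies on nvidia-smi's CSV query mode with the pid and used_memory fields, where values are reported in MiB once units are suppressed.

```python
import subprocess

# Query per-process GPU memory as bare CSV: "<pid>, <MiB>" per line.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Map pid -> used MiB, summed across GPUs, from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, used = [field.strip() for field in line.split(",", 1)]
        if used.isdigit():  # skip placeholders such as "[N/A]"
            usage[int(pid)] = usage.get(int(pid), 0) + int(used)
    return usage


def foreign_usage_mib(csv_text, my_pid):
    """Total MiB held by processes other than my_pid."""
    apps = parse_compute_apps(csv_text)
    return sum(mib for pid, mib in apps.items() if pid != my_pid)


def query_compute_apps():
    """Run nvidia-smi and return its raw CSV text (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout
```

A typical use would be foreign_usage_mib(query_compute_apps(), os.getpid()) at startup, logging a warning when the result is not zero. Inside containers nvidia-smi may not see other processes' PIDs, so an empty result there is not proof the GPU is yours alone.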
Rules of thumb
- Know your per-token KV cost (2 × layers × kv_heads × head_dim × bytes per element). Capacity questions reduce to it.
- Weights are the floor, KV cache is the budget, activations are the spike. Plan in that order.
- Read OOM messages to the end. PyTorch's allocator tells you whether it's fragmentation.
- memory_summary() before theories. nvidia-smi before blaming your own process.
- Static pools beat clever dynamic allocation in serving. If you're fighting the allocator per-request, you're playing the wrong game.
None of this is exotic. It's bookkeeping — but at 24 GiB, bookkeeping is the job.