Understanding Transformer Inference Optimization

LLM inference optimization looks like a zoo of unrelated tricks until you sort them by which bottleneck they attack. There are really only two bottlenecks, and every technique on this map — batching, quantization, speculative decoding, CUDA graphs — is a bet about which one you're hitting. This post is the map I keep in my head.

Two regimes: prefill and decode

A generation request has two phases with opposite performance characters.

Prefill processes the whole prompt at once. Thousands of tokens flow through every layer in parallel, so the GPU runs big, dense matmuls — high arithmetic intensity, tensor cores busy. Prefill is compute-bound.

Decode generates one token at a time. Each step reads every weight of the model (and the request's whole KV cache) to produce a single token's worth of math. In roofline terms: a BF16 model does roughly 2 FLOPs per parameter per generated token while reading 2 bytes per parameter — about 1 FLOP per byte. An A100 needs ~150 FLOPs per byte to be compute-limited; an H100 closer to 300. Decode at small batch isn't 10% under that line, it's two orders of magnitude under it. Decode is bandwidth-bound, and it's not close.
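Here is that roofline arithmetic as a small sketch. The peak figures (roughly 312 TFLOP/s and 2.0 TB/s for an A100, 989 TFLOP/s and 3.35 TB/s for an H100) are approximate dense BF16 spec numbers I'm assuming, the function names are mine, and KV cache traffic is ignored.

```python
# Rough roofline check for decode. Hardware peaks are approximate
# dense BF16 spec figures (assumptions, not measurements).

def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """FLOPs per byte a kernel needs before it becomes compute-bound."""
    return peak_flops / mem_bw_bytes_per_s


def decode_intensity(batch_size: int = 1, bytes_per_param: float = 2.0) -> float:
    """FLOPs per weight byte read in one decode step (KV cache ignored)."""
    flops_per_param = 2.0 * batch_size  # one multiply-add per parameter per token
    return flops_per_param / bytes_per_param


a100_ridge = ridge_point(312e12, 2.0e12)    # about 156 FLOPs/byte
h100_ridge = ridge_point(989e12, 3.35e12)   # about 295 FLOPs/byte
batch1 = decode_intensity(batch_size=1)     # 1 FLOP/byte in BF16

print(f"A100 ridge {a100_ridge:.0f}, H100 ridge {h100_ridge:.0f}, decode {batch1:.0f}")
```

Under this estimate the intensity grows linearly with batch size, because the same weight bytes serve more tokens.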

That gives you the speed-of-light calculation I do before profiling anything: an 8B model in BF16 is ~16 GB of weights; on a 2 TB/s A100, a batch-of-one decode can't exceed ~125 tokens/sec no matter how good the kernels are. If you're at 100, stop optimizing kernels — you're nearly done. If you're at 30, something other than bandwidth is broken (probably launch overhead; see CUDA graphs below).
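The same speed-of-light estimate as a few lines of Python. This is a minimal sketch that counts only weight reads (no KV cache, no activations), and the helper names are my own.

```python
def decode_ceiling_tok_per_s(n_params: float, bytes_per_param: float,
                             mem_bw_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed: one full weight read per token."""
    weight_bytes = n_params * bytes_per_param
    return mem_bw_bytes_per_s / weight_bytes


def fraction_of_ceiling(measured_tok_per_s: float, ceiling_tok_per_s: float) -> float:
    """How close a measured decode rate is to the bandwidth bound."""
    return measured_tok_per_s / ceiling_tok_per_s


# 8B parameters, BF16 (2 bytes each), 2 TB/s of memory bandwidth.
ceiling = decode_ceiling_tok_per_s(8e9, 2, 2e12)

print(f"ceiling: {ceiling:.0f} tok/s")
print(f"at 100 tok/s: {fraction_of_ceiling(100, ceiling):.0%} of the bound")
print(f"at 30 tok/s:  {fraction_of_ceiling(30, ceiling):.0%} of the bound")
```

A measured rate near 80% of the bound leaves little for kernel work to recover; one near 24% points at something other than bandwidth.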

Batching: from static to continuous

The bandwidth bound has a beautiful corollary: if one decode step must read all 16 GB of weights anyway, computing 8 tokens for 8 different requests during that same read costs almost nothing extra. Until you approach the compute roofline, decode batching is nearly free throughput.

The naive version — static batching — forms a batch, runs all requests to completion, then admits the next batch. One 2,000-token response holds the whole batch hostage while finished requests' slots sit idle. Continuous batching fixes this by making scheduling decisions every step: a request that finishes leaves the batch at the next token boundary, a queued request joins immediately, and new requests' prefills are interleaved with everyone else's decodes. This is the single biggest serving win of the last few years — typically several-fold throughput over static batching, from scheduling alone.
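To make the scheduling difference concrete, here is a toy simulation. It is a sketch under strong assumptions: every decode step costs one time unit regardless of how full the batch is, prefill is ignored, and requests are described only by their output length. The function names and the example workload are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], slots: int) -> int:
    """Run fixed batches to completion; each batch lasts as long as its longest request."""
    steps = 0
    for i in range(0, len(output_lens), slots):
        steps += max(output_lens[i:i + slots])
    return steps


def continuous_batching_steps(output_lens: list[int], slots: int) -> int:
    """Refill free slots from the queue at every step (token boundary)."""
    queue = deque(output_lens)
    running: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Admit queued requests into any free slots before the step runs.
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# One long response per group of four, the "hostage" pattern described above.
lens = [2000, 50, 50, 50] * 4
static_steps = static_batching_steps(lens, slots=4)
continuous_steps = continuous_batching_steps(lens, slots=4)
print(static_steps, continuous_steps)
```

With this workload the static scheduler pays for the 2,000-token response once per batch, while the continuous one overlaps the long responses with each other and backfills the short ones.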

What stops you from batching forever isn't compute. Each concurrent request carries its KV cache (128 KiB per token for Llama-3-8B), and the per-step bandwidth bill grows: weights amortize across the batch, but every request's cache must be read individually. Batch size in practice is capped by KV memory — which is why the previous two posts in this series obsess over cache size.
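The 128 KiB figure falls out of the model shape. A minimal sketch, assuming Llama-3-8B's published configuration (32 layers, 8 KV heads, head dimension 128) and a BF16 cache; the 40 GiB pool and 8,192-token context in the example are hypothetical numbers, not a recommendation.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(pool_bytes: int, tokens_per_request: int,
                            per_token_bytes: int) -> int:
    """How many requests of a given context length fit in the KV pool."""
    return pool_bytes // (tokens_per_request * per_token_bytes)


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB per token")

# Hypothetical: 40 GiB left for KV cache after weights, 8,192 tokens per request.
pool = 40 * 1024**3
print(max_concurrent_requests(pool, 8192, per_token), "concurrent requests")
```

At that context length each request holds exactly 1 GiB of cache, so the pool size in GiB is the batch cap.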

Quantization: three different decisions

"Quantize the model" is actually three separate decisions, attacking different terms of the bill:

Weights (INT8/INT4 via GPTQ, AWQ, or similar). Decode reads weights every step, so weight-only quantization cuts the dominant bandwidth term — INT4 turns a 16 GB read into 4 GB, and small-batch decode speeds up nearly proportionally. The dequantization math is extra compute, which is fine while you're bandwidth-bound and a real cost at large batch.
KV cache (FP8/INT8 cache). Halving cache bytes doubles the tokens that fit in the pool — i.e., doubles your achievable concurrency — and cuts decode's per-request bandwidth. For long-context serving this is often worth more than weight quantization.
Activations (W8A8, FP8 on Hopper-class hardware). This is the only tier that speeds up compute, because the matmuls themselves run on faster low-precision tensor cores. It's what helps prefill, where the other two tiers do little.

Accuracy cost runs in the same order: weight-only is mostly safe, KV cache quantization is usually fine but shows up on long-context retrieval tasks, aggressive activation quantization needs real evals.
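A rough way to see which of the first two tiers pays more is to add up the bytes one decode step has to read. This sketch assumes an 8B-parameter model and the 128 KiB-per-token BF16 cache from earlier, and it ignores quantization metadata (scales, zero points) and activation traffic. The batch and context sizes are illustrative.

```python
def decode_bytes_per_step(n_params: int, weight_bytes_per_param: float,
                          batch: int, context_tokens: int,
                          kv_bytes_per_token: float) -> float:
    """Bytes read per decode step: weights once, plus every request's KV cache."""
    weights = n_params * weight_bytes_per_param
    kv = batch * context_tokens * kv_bytes_per_token
    return weights + kv


N = 8_000_000_000      # parameters
KV_BF16 = 128 * 1024   # bytes per token, BF16 cache

# Long-context serving: 32 requests at 8,192 tokens each.
baseline = decode_bytes_per_step(N, 2.0, 32, 8192, KV_BF16)
int4_weights = decode_bytes_per_step(N, 0.5, 32, 8192, KV_BF16)
fp8_kv = decode_bytes_per_step(N, 2.0, 32, 8192, KV_BF16 / 2)

# Batch of one, short context: weights dominate.
small_baseline = decode_bytes_per_step(N, 2.0, 1, 512, KV_BF16)
small_int4 = decode_bytes_per_step(N, 0.5, 1, 512, KV_BF16)

for name, b in [("baseline", baseline), ("INT4 weights", int4_weights), ("FP8 KV", fp8_kv)]:
    print(f"{name:13s} {b / 1e9:6.1f} GB per step")
print(f"small-batch INT4 speedup bound: {small_baseline / small_int4:.2f}x")
```

In the long-context case the cache is the larger term, so halving it removes more bytes than quartering the weights; at batch one the picture flips and weight quantization gets close to its full 4x.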

Speculative decoding

Decode leaves the GPU's compute idle, as established above. Speculative decoding spends that idle compute to buy latency: a small draft model proposes the next k tokens, and the target model verifies all k in a single parallel pass that costs about the same as one normal decode step (the bandwidth is the same weight-read either way). Accepted tokens are free speedup; the first rejection truncates the proposal and the target supplies its own token at that position (under sampling, one drawn from a corrected residual distribution), so the output distribution is exactly the target model's. With a well-matched draft and ~70% acceptance you typically see 2–3× on interactive workloads.
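Two small sketches of the mechanics. The first shows verification in the greedy case only, where "accept" simply means the draft token equals the target's argmax; the sampling case uses a rejection-sampling rule that is not shown here. The second estimates tokens per target pass, assuming each draft token is accepted independently with the same probability, which real traffic only approximates. The names and the 5% relative draft cost are my assumptions.

```python
def verify_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Greedy verification of a draft proposal.

    target_tokens[i] is the target model's greedy choice after the prefix plus
    draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    """
    out: list[int] = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            out.append(t)  # first mismatch: keep the target's token and stop
            return out
        out.append(d)
    # Every draft token matched; the target's extra prediction comes for free.
    out.append(target_tokens[len(draft_tokens)])
    return out


def expected_tokens_per_pass(accept_prob: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens."""
    return sum(accept_prob ** i for i in range(k + 1))


def estimated_speedup(accept_prob: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the pass cost including k draft steps."""
    return expected_tokens_per_pass(accept_prob, k) / (1 + k * draft_cost_ratio)


print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))   # third draft token rejected
print(f"{expected_tokens_per_pass(0.7, 4):.2f} tokens per pass")
print(f"{estimated_speedup(0.7, 4, 0.05):.2f}x estimated speedup")
```

Under these assumptions, 70% acceptance with four drafted tokens lands inside the 2–3× range quoted above.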

When does it lose? When the spare compute it spends isn't actually spare. At high batch sizes the verification passes compete with other requests' work, and the same GPU produces more total tokens by just batching normally. It also degrades when the draft disagrees with the target — heavy code or domain-specific traffic with a generic draft model can see acceptance fall low enough that you're paying draft cost for nothing. Speculation is a latency optimization for under-utilized GPUs, not a throughput feature.

Kernel-level wins

Two items at the kernel level pay off out of proportion to their glamour:

CUDA graphs. A decode step is hundreds of small kernel launches, each costing several microseconds of CPU-side overhead. At batch 1, the kernels themselves are so fast that launch overhead can be a third of step time — the GPU literally waits for the CPU. Capturing the decode step once and replaying it as a single graph launch removes that, which is why engines maintain captured graphs per batch size and why startup takes that extra moment.
Fused kernels. Every separate elementwise op is a round trip through HBM. Fusing the residual add into RMSNorm, rotary embeddings into attention, and the SwiGLU activation into one kernel removes whole reads and writes of activation tensors. FlashAttention is the same idea applied to attention's score matrix. None of these change the math; they change how many times the math touches memory.
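Both items reduce to simple accounting. The sketch below is deliberately crude: it assumes CPU launches and GPU work do not overlap at all (a worst case), and that each unfused elementwise op reads one tensor and writes one tensor. The launch count, per-launch cost, GPU time, and tensor size are illustrative numbers, not measurements.

```python
def launch_overhead_fraction(n_launches: int, launch_us: float,
                             gpu_busy_us: float) -> float:
    """Share of step time spent launching kernels, with no CPU/GPU overlap."""
    overhead_us = n_launches * launch_us
    return overhead_us / (overhead_us + gpu_busy_us)


def elementwise_hbm_bytes(n_elems: int, bytes_per_elem: int,
                          n_ops: int, fused: bool) -> int:
    """HBM traffic for a chain of elementwise ops over one tensor.

    Unfused, every op reads its input and writes its output. Fused, the chain
    reads once and writes once, with intermediates kept on-chip.
    """
    passes = 1 if fused else n_ops
    return passes * 2 * n_elems * bytes_per_elem


# 500 launches at 5 microseconds each against 5 ms of actual GPU work.
frac = launch_overhead_fraction(500, 5.0, 5000.0)
print(f"launch overhead: {frac:.0%} of the step")

# Three chained elementwise ops over a 4096-wide BF16 activation.
unfused = elementwise_hbm_bytes(4096, 2, n_ops=3, fused=False)
fused = elementwise_hbm_bytes(4096, 2, n_ops=3, fused=True)
print(f"unfused {unfused} bytes, fused {fused} bytes")
```

Replaying a captured graph shrinks the first number toward a single launch; fusion shrinks the second by the length of the chain.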

The metrics that keep you honest

Three numbers describe a serving system, and they fight each other:

TTFT (time to first token) — prefill latency; what "feels instant" measures.
TPOT / ITL (time per output token / inter-token latency) — decode cadence; what "feels smooth" measures.
Throughput — aggregate tokens/sec; what your cost per million tokens measures.

Almost every technique above moves two of these in opposite directions. Bigger batches raise throughput and worsen ITL. Speculative decoding cuts ITL at low load and costs throughput at high load. Interleaving a long prefill helps that request's TTFT and stalls everyone else's ITL — which is why engines chunk prefills into pieces and schedule them between decode steps. The production metric that resolves the fight is goodput: requests per second that meet your latency SLO. Optimizing raw throughput while goodput falls is the classic serving mistake, and I've made it.
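Here is one way to compute these numbers from per-request token timestamps. It is a sketch: the `RequestTrace` type is hypothetical, and using mean inter-token latency for the SLO check is my simplification (a production SLO would more likely use a tail percentile).

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival_s: float
    token_times_s: list[float]  # wall-clock time each output token was emitted


def ttft(r: RequestTrace) -> float:
    """Time from arrival to the first output token."""
    return r.token_times_s[0] - r.arrival_s


def mean_itl(r: RequestTrace) -> float:
    """Average gap between consecutive output tokens."""
    t = r.token_times_s
    if len(t) < 2:
        return 0.0
    return (t[-1] - t[0]) / (len(t) - 1)


def throughput(traces: list[RequestTrace], window_s: float) -> float:
    """Aggregate output tokens per second over the window."""
    return sum(len(r.token_times_s) for r in traces) / window_s


def goodput(traces: list[RequestTrace], window_s: float,
            ttft_slo_s: float, itl_slo_s: float) -> float:
    """Requests per second that met both latency SLOs."""
    ok = sum(1 for r in traces
             if ttft(r) <= ttft_slo_s and mean_itl(r) <= itl_slo_s)
    return ok / window_s


traces = [
    RequestTrace(0.0, [0.20, 0.25, 0.30]),  # fast first token, smooth cadence
    RequestTrace(0.0, [1.50, 1.55, 1.60]),  # stuck behind a long prefill
]
print(throughput(traces, window_s=2.0), "tokens/s")
print(goodput(traces, window_s=2.0, ttft_slo_s=0.5, itl_slo_s=0.1), "good requests/s")
```

Both requests count toward throughput, but only the first counts toward goodput, which is the gap the paragraph above warns about.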

The map, compressed: figure out which regime you're in (prefill/compute vs decode/bandwidth), batch until KV memory stops you, quantize the tier that matches your bottleneck, speculate only with idle compute, and let goodput — not any single number — decide if you actually improved things.
