Understanding vLLM Performance Issues Under Load: A Deep Dive
When deploying large language models (LLMs) at scale, you quickly discover that inference performance isn't just about GPU speed or model size. It's about how efficiently you manage memory, schedule requests, and handle the complex dance of concurrent requests competing for limited resources. If you've ever deployed vLLM in production and watched request latencies balloon from milliseconds to hours during peak traffic, you're not alone.
In this post, we'll dissect exactly why vLLM requests can take 2-3 hours under heavy load, explore the brilliant KV cache architecture that usually makes vLLM so fast, and understand what happens when that architecture reaches its breaking point.
Problem Statement: The 2-3 Hour Request Mystery
Picture this scenario: You've deployed vLLM to serve your LLM application. Under normal load, requests complete in seconds. Your monitoring looks healthy, GPU utilization is high, and everything seems great. Then traffic spikes and suddenly some requests start taking minutes. Then tens of minutes. Eventually, you see requests stuck for hours.
What's happening here isn't a bug. It's a consequence of resource exhaustion combined with fair scheduling. The root causes typically fall into three categories:
Block Exhaustion: vLLM runs out of KV cache blocks, which are the fundamental memory units for storing attention states. When blocks are exhausted, new requests can't be scheduled until existing requests release their blocks.
Queue Starvation: Even when blocks are available, the scheduler must decide which of potentially hundreds of waiting requests to execute next. With certain scheduling policies and request patterns, some requests can be repeatedly deprioritized, leading to extreme wait times.
Memory Fragmentation: As requests of varying lengths complete and release blocks, the free block pool churns. Paging eliminates most external fragmentation, but the last block of each request is usually only partially filled, so some capacity is still lost to internal fragmentation.
KV Cache Architecture: The Foundation of Efficient Inference
To understand vLLM's performance characteristics, we must first understand the Key-Value (KV) cache and why it exists.
Why We Need KV Cache
Transformer models use attention mechanisms that compute relationships between all tokens in a sequence. During autoregressive generation, we face a choice:
- Recompute from scratch: Every time we generate a new token, recompute attention across all previous tokens
- Cache previous computations: Store the key and value vectors from previous tokens and reuse them
Without caching, generating a 1000-token response would require computing attention over 1 token, then 2 tokens, then 3 tokens—resulting in O(n²) computational complexity. With KV caching, we compute each token's keys and values once and reuse them, reducing this to O(n) complexity.
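The difference is easy to see by counting key/value projection computations. This is a toy count of work done, not a benchmark:

```python
def kv_projections(n: int, cached: bool) -> int:
    """Count key/value projection computations to generate n tokens."""
    if cached:
        return n  # each token's keys and values are computed exactly once
    # without a cache, step t recomputes K/V for all t tokens seen so far
    return sum(t for t in range(1, n + 1))

print(kv_projections(1000, cached=False))  # 500500
print(kv_projections(1000, cached=True))   # 1000
```

For a 1000-token response, caching turns half a million projection computations into a thousand.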
The catch? KV cache consumes substantial memory. For a model like LLaMA-2 70B, each token's KV cache occupies roughly 0.5 MB.
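The footprint can be estimated directly from the model's configuration. A back-of-the-envelope sketch; the exact figure depends on the storage dtype and on whether the model uses grouped-query attention, and the parameter values below are published model-architecture numbers, not vLLM internals:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-token KV cache size: keys + values at every layer, fp16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# LLaMA-2 70B: 80 layers, grouped-query attention with 8 KV heads of dim 128
print(kv_bytes_per_token(80, 8, 128) / 2**20)  # 0.3125 MiB per token
```

Without grouped-query attention (64 KV heads), the same model would need 2.5 MiB per token; real deployments land somewhere in this range depending on configuration and precision.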
vLLM's Block-Based Solution
vLLM solves these problems with a brilliant insight borrowed from operating systems: paged memory management. Instead of allocating memory contiguously, vLLM divides KV cache into fixed-size blocks.
```python
class KVBlock:
    """A fixed-size block storing KV cache for multiple tokens."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size        # tokens stored per block
        self.num_layers = 80                # LLaMA-2 70B has 80 layers
        self.data = allocate_gpu_memory()   # placeholder for the block's GPU tensor
        self.ref_count = 0                  # for sharing blocks between requests
```
Each request is allocated blocks dynamically as it generates tokens. A request needing 100 tokens will be allocated 7 blocks (100 ÷ 16 = 6.25, rounded up).
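That block count is just a ceiling division:

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Every partially filled block still occupies a whole block."""
    return -(-num_tokens // block_size)  # ceiling division

print(blocks_needed(100))  # 7
print(blocks_needed(96))   # 6 -- exact multiples of the block size waste nothing
```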
Key Benefits:
- Near-zero waste: Memory is allocated as needed, not upfront
- High concurrency: Memory is shared efficiently across many requests
- Block sharing: Multiple requests with identical prefixes can share blocks
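Block sharing is what the ref_count field in KVBlock enables. A minimal sketch of the bookkeeping, using a hypothetical API rather than vLLM's actual block manager:

```python
class BlockManager:
    """Toy free-list allocator with reference-counted block sharing."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block ids
        self.ref_count = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second request with the same prefix points at the same physical block
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:       # last reference gone: block is reusable
            del self.ref_count[block]
            self.free.append(block)
```

A block is returned to the free pool only when the last request referencing it finishes, so shared prefixes cost memory once.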
PagedAttention: Memory-Efficient Attention at Scale
PagedAttention is the kernel-level innovation that makes block-based KV cache practical. Traditional attention implementations assume contiguous memory; PagedAttention is specifically designed to work with scattered, block-based memory.
The key insight is that attention is fundamentally a series of independent dot products between query vectors and key vectors, followed by a weighted sum over value vectors. PagedAttention directly operates on block-based storage without expensive memory copies.
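A toy, pure-Python rendition of that idea (single head, no scaling or masking; the real PagedAttention is a fused CUDA kernel):

```python
import math

def paged_attention(query, block_table, kv_blocks):
    """Toy single-head attention over block-scattered KV storage.

    kv_blocks: the physical pool, kv_blocks[b] = list of (key, value) pairs
    block_table: the sequence's logical-to-physical block mapping
    """
    scores, values = [], []
    for physical in block_table:             # walk blocks in logical order
        for key, value in kv_blocks[physical]:
            scores.append(sum(q * k for q, k in zip(query, key)))
            values.append(value)
    m = max(scores)                          # softmax over all cached tokens
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w / z * v[d] for w, v in zip(weights, values))
            for d in range(dim)]
```

The attention result is identical to the contiguous case; only the lookup path changes, which is why scattering the KV cache across blocks costs almost nothing.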
Request Scheduling: The Brain of vLLM
vLLM divides request processing into two phases:
Prefill Phase: Process the input prompt in one shot, computing KV cache for all prompt tokens. This is compute-bound.
Decode Phase: Generate output tokens one at a time, each requiring a forward pass with just one new token plus the cached KV states. This is memory-bound.
```python
class Scheduler:
    def schedule(self) -> Tuple[List[SequenceGroup], List[SequenceGroup], List[SequenceGroup]]:
        """Main scheduling decision each iteration."""
        num_free_gpu_blocks = self.block_manager.get_num_free_blocks()
        # Phase 1: Schedule running requests (decode)
        decodes = self._schedule_running(num_free_gpu_blocks)
        # Phase 2: Schedule swapped requests (restore from CPU)
        restored = self._schedule_swapped(num_free_gpu_blocks)
        # Phase 3: Schedule waiting requests (prefill)
        prefills = self._schedule_waiting(num_free_gpu_blocks)
        return prefills, decodes, restored
```
Queue Starvation: When Fair Scheduling Becomes Unfair
Queue starvation occurs when a request waits for an extremely long time—hours, in severe cases—despite the scheduler continuously making progress. The key insight: even though the system is making progress, the waiting queue grows when arrivals exceed completions.
Factors Amplifying Starvation
- Long-running requests: If running requests generate thousands of tokens, they hold blocks for extended periods
- Bursty arrivals: Traffic spikes fill the waiting queue faster than it can drain
- Large prompts: Requests with very long prompts consume many blocks when prefilled
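A deliberately simple FIFO overload model makes the arithmetic concrete. The rates below are illustrative, not measurements:

```python
def wait_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Approximate FIFO wait for a request arriving `seconds` into an overload.

    Rates are in requests per second; assumes the queue started empty.
    """
    backlog = max(0.0, (arrival_rate - service_rate) * seconds)
    return backlog / service_rate  # time to drain the backlog ahead of you

# 10 req/s arriving, 8 req/s completing: after one hour of sustained overload,
# a new arrival sits behind a 7,200-request backlog.
print(wait_after(3600, 10, 8))  # 900.0 seconds -- and climbing
```

The system is never idle, yet wait times grow without bound; that is exactly the "healthy dashboards, multi-hour requests" symptom from the problem statement.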
Block Exhaustion: Running Out of Memory
Block exhaustion is the other major cause of extreme latency. When blocks are exhausted:
- No new prefills: Waiting requests cannot be scheduled
- Running requests continue: Already-running requests keep generating tokens
- Preemption may occur: If running requests need more blocks and none are available, they may be preempted
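The interaction between the last two points can be sketched like this (simplified; vLLM's actual preemption can also swap KV blocks to CPU memory instead of discarding them):

```python
from dataclasses import dataclass

@dataclass
class Pool:
    free_blocks: int

@dataclass
class Seq:
    tokens: int
    block_size: int = 16

    def needs_new_block(self) -> bool:
        # The next token would cross a block boundary
        return self.tokens % self.block_size == 0

def step(seq: Seq, pool: Pool) -> str:
    """One decode step: allocate a block if needed, else preempt the request."""
    if seq.needs_new_block():
        if pool.free_blocks == 0:
            return "preempt"   # release blocks; recompute or swap in later
        pool.free_blocks -= 1
    seq.tokens += 1
    return "continue"
```

Note the asymmetry: a request can run for hundreds of tokens without touching the allocator, then get preempted the moment it crosses a block boundary during exhaustion.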
Solutions and Mitigations
1. Request Prioritization
Implement priority scheduling to ensure latency-sensitive requests aren't stuck behind bulk jobs:
```python
class PriorityScheduler(Scheduler):
    def _sort_by_priority(self, waiting: List[SequenceGroup]) -> List[SequenceGroup]:
        # Higher priority first; FIFO (arrival time) within a priority level
        return sorted(waiting, key=lambda x: (-x.priority, x.arrival_time))
```
2. Request Admission Control
Prevent queue buildup by rejecting requests when the system is overloaded:
```python
def should_admit(scheduler: Scheduler, estimated_tokens: int,
                 max_queue_length: int = 256) -> Tuple[bool, Optional[str]]:
    queue_length = len(scheduler.waiting)
    if queue_length >= max_queue_length:
        return False, f"Queue full ({queue_length} requests waiting)"
    blocks_needed = -(-estimated_tokens // scheduler.block_size)  # ceiling division
    if blocks_needed > scheduler.get_num_free_blocks():
        return False, f"Not enough free KV cache blocks ({blocks_needed} needed)"
    return True, None
```
3. Aggressive Timeouts
Set reasonable timeouts to prevent requests from hogging resources indefinitely.
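With an async serving layer, for example, you can cap end-to-end latency and ensure a timed-out request is cancelled so it releases its blocks. Here engine.generate is a hypothetical coroutine standing in for your serving API:

```python
import asyncio

async def generate_with_timeout(engine, prompt: str, timeout_s: float = 120.0) -> str:
    """Abort any request that exceeds timeout_s seconds end to end."""
    try:
        return await asyncio.wait_for(engine.generate(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task, letting the engine free KV blocks
        raise RuntimeError(f"request exceeded {timeout_s:.0f}s and was aborted")
```

The right timeout depends on your longest legitimate generation; set it above your p99 latency under normal load, not above your worst overload observation.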
4. Monitoring and Alerting
Implement comprehensive monitoring to detect issues early—track queue length, wait times, and block utilization.
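A minimal sketch of the signals worth watching; the thresholds below are placeholders to tune against your own traffic:

```python
from dataclasses import dataclass

@dataclass
class SchedulerStats:
    waiting: int              # requests in the waiting queue
    oldest_wait_s: float      # age of the oldest waiting request, seconds
    block_utilization: float  # fraction of KV cache blocks in use

def check_alerts(stats: SchedulerStats) -> list:
    """Return human-readable alerts for each threshold breached."""
    alerts = []
    if stats.waiting > 100:
        alerts.append(f"queue depth {stats.waiting} > 100")
    if stats.oldest_wait_s > 60:
        alerts.append(f"oldest request waiting {stats.oldest_wait_s:.0f}s")
    if stats.block_utilization > 0.95:
        alerts.append(f"KV cache {stats.block_utilization:.0%} full")
    return alerts
```

The oldest-wait metric is the one that catches starvation: queue depth alone looks fine while the same unlucky request is deprioritized over and over.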
5. Horizontal Scaling
Scale out to multiple vLLM instances with intelligent load balancing.
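Even a simple least-queue policy beats round-robin when request costs vary widely. A sketch, assuming each replica exposes its queue depth (e.g., via a metrics endpoint):

```python
def pick_instance(queue_depths: dict) -> str:
    """Route the next request to the replica with the shortest waiting queue."""
    return min(queue_depths, key=queue_depths.get)

print(pick_instance({"vllm-0": 12, "vllm-1": 3, "vllm-2": 40}))  # vllm-1
```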
Conclusion
vLLM's architecture is a masterpiece of systems engineering—PagedAttention and block-based memory management enable unprecedented GPU memory efficiency. But under heavy load, that complexity can manifest as multi-hour request latencies.
The key takeaways:
- Understand your bottleneck: Is it block exhaustion or queue starvation?
- Plan for overload: Use admission control, timeouts, and prioritization
- Right-size your hardware: More GPU memory means more concurrent requests
- Scale horizontally: Don't rely on a single instance
- Monitor relentlessly: Set up alerts before problems escalate
By combining these strategies, you can build vLLM deployments that maintain low latencies even under extreme load.