Understanding vLLM Performance Issues Under Load: A Deep Dive
When deploying large language models (LLMs) at scale, you quickly discover that inference performance isn't just about GPU speed or model size. It's about how efficiently you manage memory, schedule requests, and handle the complex dance of concurrent requests competing for limited resources. If you've ever deployed vLLM in production and watched request latencies balloon from milliseconds to hours during peak traffic, you're not alone.
In this post, we'll dissect exactly why vLLM requests can take 2-3 hours under heavy load, explore the brilliant KV cache architecture that usually makes vLLM so fast, and understand what happens when that architecture reaches its breaking point.
Problem Statement: The 2-3 Hour Request Mystery
Picture this scenario: You've deployed vLLM to serve your LLM application. Under normal load, requests complete in seconds. Your monitoring looks healthy, GPU utilization is high, and everything seems great. Then traffic spikes and suddenly some requests start taking minutes. Then tens of minutes. Eventually, you see requests stuck for hours.
What's happening here isn't a bug. It's a consequence of resource exhaustion combined with fair scheduling. The root causes typically fall into three categories:
Block Exhaustion: vLLM runs out of KV cache blocks, which are the fundamental memory units for storing attention states. When blocks are exhausted, new requests can't be scheduled until existing requests release their blocks.
Queue Starvation: Even when blocks are available, the scheduler must decide which of potentially hundreds of waiting requests to execute next. With certain scheduling policies and request patterns, some requests can be repeatedly deprioritized, leading to extreme wait times.
Memory Fragmentation: As requests of varying lengths complete and release blocks, the free block pool churns. Paging eliminates most external fragmentation, but the last block of each request is usually only partially filled, so some capacity is still lost to internal fragmentation.
KV Cache Architecture: The Foundation of Efficient Inference
To understand vLLM's performance characteristics, we must first understand the Key-Value (KV) cache and why it exists.
Why We Need KV Cache
Transformer models use attention mechanisms that compute relationships between all tokens in a sequence. During autoregressive generation, we face a choice:
- Recompute from scratch: Every time we generate a new token, recompute attention across all previous tokens
- Cache previous computations: Store the key and value vectors from previous tokens and reuse them
Without caching, generating a 1000-token response would require computing attention over 1 token, then 2 tokens, then 3 tokens—resulting in O(n²) computational complexity. With KV caching, we compute each token's keys and values once and reuse them, reducing this to O(n) complexity.
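The difference is easy to see by counting key/value projection computations. This is a toy count of work done, not a benchmark:

```python
def kv_projections(n: int, cached: bool) -> int:
    """Count key/value projection computations to generate n tokens."""
    if cached:
        return n  # each token's keys and values are computed exactly once
    # without a cache, step t recomputes K/V for all t tokens seen so far
    return sum(t for t in range(1, n + 1))

print(kv_projections(1000, cached=False))  # 500500
print(kv_projections(1000, cached=True))   # 1000
```

For a 1000-token response, caching turns half a million projection computations into a thousand.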
The catch? KV cache consumes substantial memory. For a model like LLaMA-2 70B, each token's KV cache occupies roughly 0.5 MB.
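The footprint can be estimated directly from the model's configuration. A back-of-the-envelope sketch; the exact figure depends on the storage dtype and on whether the model uses grouped-query attention, and the parameter values below are published model-architecture numbers, not vLLM internals:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-token KV cache size: keys + values at every layer, fp16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# LLaMA-2 70B: 80 layers, grouped-query attention with 8 KV heads of dim 128
print(kv_bytes_per_token(80, 8, 128) / 2**20)  # 0.3125 MiB per token
```

Without grouped-query attention (64 KV heads), the same model would need 2.5 MiB per token; real deployments land somewhere in this range depending on configuration and precision.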
vLLM's Block-Based Solution
vLLM solves these problems with a brilliant insight borrowed from operating systems: paged memory management. Instead of allocating memory contiguously, vLLM divides KV cache into fixed-size blocks.
```python
class KVBlock:
    """A fixed-size block storing KV cache for multiple tokens."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size        # tokens stored per block
        self.num_layers = 80                # LLaMA-2 70B has 80 layers
        self.data = allocate_gpu_memory()   # placeholder for the block's GPU tensor
        self.ref_count = 0                  # for sharing blocks between requests
```
Each request is allocated blocks dynamically as it generates tokens. A request needing 100 tokens will be allocated 7 blocks (100 ÷ 16 = 6.25, rounded up).
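That block count is just a ceiling division:

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Every partially filled block still occupies a whole block."""
    return -(-num_tokens // block_size)  # ceiling division

print(blocks_needed(100))  # 7
print(blocks_needed(96))   # 6 -- exact multiples of the block size waste nothing
```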
Key Benefits:
- Near-zero waste: Memory is allocated as needed, not upfront
- High concurrency: Memory is shared efficiently across many requests
- Block sharing: Multiple requests with identical prefixes can share blocks
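Block sharing is what the ref_count field in KVBlock enables. A minimal sketch of the bookkeeping, using a hypothetical API rather than vLLM's actual block manager:

```python
class BlockManager:
    """Toy free-list allocator with reference-counted block sharing."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block ids
        self.ref_count = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second request with the same prefix points at the same physical block
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:       # last reference gone: block is reusable
            del self.ref_count[block]
            self.free.append(block)
```

A block is returned to the free pool only when the last request referencing it finishes, so shared prefixes cost memory once.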
PagedAttention: Memory-Efficient Attention at Scale
PagedAttention is the kernel-level innovation that makes block-based KV cache practical. Traditional attention implementations assume contiguous memory; PagedAttention is specifically designed to work with scattered, block-based memory.
The key insight is that attention is fundamentally a series of independent dot products between query vectors and key vectors, followed by a weighted sum over value vectors. PagedAttention directly operates on block-based storage without expensive memory copies.
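A toy, pure-Python rendition of that idea (single head, no scaling or masking; the real PagedAttention is a fused CUDA kernel):

```python
import math

def paged_attention(query, block_table, kv_blocks):
    """Toy single-head attention over block-scattered KV storage.

    kv_blocks: the physical pool, kv_blocks[b] = list of (key, value) pairs
    block_table: the sequence's logical-to-physical block mapping
    """
    scores, values = [], []
    for physical in block_table:             # walk blocks in logical order
        for key, value in kv_blocks[physical]:
            scores.append(sum(q * k for q, k in zip(query, key)))
            values.append(value)
    m = max(scores)                          # softmax over all cached tokens
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w / z * v[d] for w, v in zip(weights, values))
            for d in range(dim)]
```

The attention result is identical to the contiguous case; only the lookup path changes, which is why scattering the KV cache across blocks costs almost nothing.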
Request Scheduling: The Brain of vLLM
vLLM divides request processing into two phases:
Prefill Phase: Process the input prompt in one shot, computing KV cache for all prompt tokens. This is compute-bound.
Decode Phase: Generate output tokens one at a time, each requiring a forward pass with just one new token plus the cached KV states. This is memory-bound.
```python
class Scheduler:
    def schedule(self) -> Tuple[List[SequenceGroup], List[SequenceGroup], List[SequenceGroup]]:
        """Main scheduling decision each iteration."""
        num_free_gpu_blocks = self.block_manager.get_num_free_blocks()
        # Phase 1: Schedule running requests (decode)
        decodes = self._schedule_running(num_free_gpu_blocks)
        # Phase 2: Schedule swapped requests (restore from CPU)
        restored = self._schedule_swapped(num_free_gpu_blocks)
        # Phase 3: Schedule waiting requests (prefill)
        prefills = self._schedule_waiting(num_free_gpu_blocks)
        return prefills, decodes, restored
```
Queue Starvation: When Fair Scheduling Becomes Unfair
Queue starvation occurs when a request waits for an extremely long time—hours, in severe cases—despite the scheduler continuously making progress. The key insight: even though the system is making progress, the waiting queue grows when arrivals exceed completions.
Factors Amplifying Starvation
- Long-running requests: If running requests generate thousands of tokens, they hold blocks for extended periods
- Bursty arrivals: Traffic spikes fill the waiting queue faster than it can drain
- Large prompts: Requests with very long prompts consume many blocks when prefilled
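A deliberately simple FIFO overload model makes the arithmetic concrete. The rates below are illustrative, not measurements:

```python
def wait_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Approximate FIFO wait for a request arriving `seconds` into an overload.

    Rates are in requests per second; assumes the queue started empty.
    """
    backlog = max(0.0, (arrival_rate - service_rate) * seconds)
    return backlog / service_rate  # time to drain the backlog ahead of you

# 10 req/s arriving, 8 req/s completing: after one hour of sustained overload,
# a new arrival sits behind a 7,200-request backlog.
print(wait_after(3600, 10, 8))  # 900.0 seconds -- and climbing
```

The system is never idle, yet wait times grow without bound; that is exactly the "healthy dashboards, multi-hour requests" symptom from the problem statement.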
Block Exhaustion: Running Out of Memory
Block exhaustion is the other major cause of extreme latency. When blocks are exhausted:
- No new prefills: Waiting requests cannot be scheduled
- Running requests continue: Already-running requests keep generating tokens
- Preemption may occur: If running requests need more blocks and none are available, they may be preempted
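The interaction between the last two points can be sketched like this (simplified; vLLM's actual preemption can also swap KV blocks to CPU memory instead of discarding them):

```python
from dataclasses import dataclass

@dataclass
class Pool:
    free_blocks: int

@dataclass
class Seq:
    tokens: int
    block_size: int = 16

    def needs_new_block(self) -> bool:
        # The next token would cross a block boundary
        return self.tokens % self.block_size == 0

def step(seq: Seq, pool: Pool) -> str:
    """One decode step: allocate a block if needed, else preempt the request."""
    if seq.needs_new_block():
        if pool.free_blocks == 0:
            return "preempt"   # release blocks; recompute or swap in later
        pool.free_blocks -= 1
    seq.tokens += 1
    return "continue"
```

Note the asymmetry: a request can run for hundreds of tokens without touching the allocator, then get preempted the moment it crosses a block boundary during exhaustion.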
Solutions and Mitigations
1. Request Prioritization
Implement priority scheduling to ensure latency-sensitive requests aren't stuck behind bulk jobs:
```python
class PriorityScheduler(Scheduler):
    def _sort_by_priority(self, waiting: List[SequenceGroup]) -> List[SequenceGroup]:
        # Higher priority first; FIFO (arrival time) within a priority level
        return sorted(waiting, key=lambda x: (-x.priority, x.arrival_time))
```
2. Request Admission Control
Prevent queue buildup by rejecting requests when the system is overloaded:
```python
def should_admit(scheduler: Scheduler, estimated_tokens: int,
                 max_queue_length: int = 256) -> Tuple[bool, Optional[str]]:
    queue_length = len(scheduler.waiting)
    if queue_length >= max_queue_length:
        return False, f"Queue full ({queue_length} requests waiting)"
    blocks_needed = -(-estimated_tokens // scheduler.block_size)  # ceiling division
    if blocks_needed > scheduler.get_num_free_blocks():
        return False, f"Not enough free KV cache blocks ({blocks_needed} needed)"
    return True, None
```
3. Aggressive Timeouts
Set reasonable timeouts to prevent requests from hogging resources indefinitely.
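With an async serving layer, for example, you can cap end-to-end latency and ensure a timed-out request is cancelled so it releases its blocks. Here engine.generate is a hypothetical coroutine standing in for your serving API:

```python
import asyncio

async def generate_with_timeout(engine, prompt: str, timeout_s: float = 120.0) -> str:
    """Abort any request that exceeds timeout_s seconds end to end."""
    try:
        return await asyncio.wait_for(engine.generate(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task, letting the engine free KV blocks
        raise RuntimeError(f"request exceeded {timeout_s:.0f}s and was aborted")
```

The right timeout depends on your longest legitimate generation; set it above your p99 latency under normal load, not above your worst overload observation.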
4. Monitoring and Alerting
Implement comprehensive monitoring to detect issues early—track queue length, wait times, and block utilization.
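A minimal sketch of the signals worth watching; the thresholds below are placeholders to tune against your own traffic:

```python
from dataclasses import dataclass

@dataclass
class SchedulerStats:
    waiting: int              # requests in the waiting queue
    oldest_wait_s: float      # age of the oldest waiting request, seconds
    block_utilization: float  # fraction of KV cache blocks in use

def check_alerts(stats: SchedulerStats) -> list:
    """Return human-readable alerts for each threshold breached."""
    alerts = []
    if stats.waiting > 100:
        alerts.append(f"queue depth {stats.waiting} > 100")
    if stats.oldest_wait_s > 60:
        alerts.append(f"oldest request waiting {stats.oldest_wait_s:.0f}s")
    if stats.block_utilization > 0.95:
        alerts.append(f"KV cache {stats.block_utilization:.0%} full")
    return alerts
```

The oldest-wait metric is the one that catches starvation: queue depth alone looks fine while the same unlucky request is deprioritized over and over.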
5. Horizontal Scaling
Scale out to multiple vLLM instances with intelligent load balancing.
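Even a simple least-queue policy beats round-robin when request costs vary widely. A sketch, assuming each replica exposes its queue depth (e.g., via a metrics endpoint):

```python
def pick_instance(queue_depths: dict) -> str:
    """Route the next request to the replica with the shortest waiting queue."""
    return min(queue_depths, key=queue_depths.get)

print(pick_instance({"vllm-0": 12, "vllm-1": 3, "vllm-2": 40}))  # vllm-1
```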
Conclusion
vLLM's architecture is a masterpiece of systems engineering—PagedAttention and block-based memory management enable unprecedented GPU memory efficiency. But under heavy load, that complexity can manifest as multi-hour request latencies.
The key takeaways:
- Understand your bottleneck: Is it block exhaustion or queue starvation?
- Plan for overload: Use admission control, timeouts, and prioritization
- Right-size your hardware: More GPU memory means more concurrent requests
- Scale horizontally: Don't rely on a single instance
- Monitor relentlessly: Set up alerts before problems escalate
By combining these strategies, you can build vLLM deployments that maintain low latencies even under extreme load.