vLLM - Why Requests Take Hours Under Load
Why vLLM requests can take 2-3 hours under heavy load: analyzing KV cache block exhaustion and queue starvation.
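The failure mode named above can be illustrated with a toy scheduler. This is a minimal sketch, not vLLM's implementation: the pool size, per-request block count, and names (`TOTAL_BLOCKS`, `BLOCKS_PER_REQUEST`, `schedule`) are all hypothetical, chosen only to show how a fixed KV cache block pool starves queued requests once running requests hold every block.

```python
from collections import deque

# Hypothetical numbers for illustration only.
TOTAL_BLOCKS = 8          # size of the KV cache block pool
BLOCKS_PER_REQUEST = 4    # blocks a request must acquire to be scheduled

free_blocks = TOTAL_BLOCKS
running = []
waiting = deque(f"req{i}" for i in range(5))

def schedule():
    """Admit waiting requests while free blocks remain; the rest stay queued."""
    global free_blocks
    while waiting and free_blocks >= BLOCKS_PER_REQUEST:
        req = waiting.popleft()
        free_blocks -= BLOCKS_PER_REQUEST
        running.append(req)

schedule()
# Only 2 of 5 requests fit in the pool; the other 3 are starved until a
# running request finishes and releases its blocks. Under sustained load,
# a request at the back of this queue can wait for hours.
print(running, list(waiting), free_blocks)
```

Long-generation requests make this worse: they hold their blocks for the full decode, so the queue drains at the rate of the slowest running requests rather than the arrival rate.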