Ming-Omni on SGLang: Architecture and the Optimizations Behind Fast Omni Serving
How we serve Ming-Omni in SGLang: the unified multimodal architecture, and the optimizations — an encoder kernel fix, tensor parallelism, CFM CUDA-graph capture, and streaming TTS — behind fast omni serving.
Higgs Audio v3 TTS on SGLang-Omni
A short pointer to the LMSYS article on work I contributed to: serving Higgs Audio v3 TTS with SGLang-Omni for real-time, controllable voice agents.
Omni Serving Design Through the Lens of Ming-Omni
Using Ming-flash-omni-2.0 as a case study to compare vllm-omni and sglang-omni design tradeoffs across thinker, talker, streaming, and audio output.
Model Support in a VLM Serving Stack Is Not a Checkbox - It Is a Six-Layer Systems Contract
Why real multimodal model support is a six-layer serving-stack contract, from API extraction to decoder reentry.
Designing a Production-Minded RAG Chatbot for a Personal Website
How I designed, implemented, and hardened a cost-efficient RAG chatbot for my personal site with citations, streaming, and build-time indexing.
Why Context is Everything: Building Your Personal AI Knowledge Base
How gathering your information in one place transforms AI from a generic assistant into your personal superpower.
Chatboxes, Agents, and Automatic Workflow
Why this feels like the shift from horsepower to a real vehicle—and why I am excited about the future.
vLLM - Why Requests Take Hours Under Load
Why vLLM requests can take 2-3 hours under heavy load: an analysis of KV cache block exhaustion and queue starvation.
Mini SGLang (Part 2) - Batching & Advanced Scheduling
Continuing the Mini SGLang deep dive - covering request batching, overlap scheduling, and tensor parallelism.
Mini SGLang (Part 1) - Architecture, Engine & Request Flow
A deep dive into the Mini SGLang architecture, covering system design, engine initialization, the KV cache, and the lifecycle of a single request.
Understanding Transformer Inference Optimization
Prefill vs decode, continuous batching, quantization, speculative decoding, and CUDA graphs — a map of the LLM inference optimization landscape and what each technique actually buys.
CUDA Memory Management for LLM Serving
How GPU memory actually gets spent serving an LLM: the PyTorch caching allocator, fragmentation, static KV pools, and debugging OOMs that make no sense.
Attention, From the Inference Side
Self-attention from an inference engineer's perspective: where the FLOPs go, why the KV cache exists, and how GQA and FlashAttention change the memory math.