Posts

All the articles I've posted.

Ming-Omni on SGLang: Architecture and the Optimizations Behind Fast Omni Serving

Jun 8, 2026

How we serve Ming-Omni in SGLang: the unified multimodal architecture, and the optimizations — an encoder kernel fix, tensor parallelism, CFM CUDA-graph capture, and streaming TTS — behind fast omni serving.

Higgs Audio v3 TTS on SGLang-Omni

Jun 4, 2026

A short pointer to the LMSYS article on work I contributed to: serving Higgs Audio v3 TTS with SGLang-Omni for real-time, controllable voice agents.

Omni Serving Design Through the Lens of Ming-Omni

May 29, 2026

Using Ming-flash-omni-2.0 as a case study to compare vllm-omni and sglang-omni design tradeoffs across thinker, talker, streaming, and audio output.

Model Support in a VLM Serving Stack Is Not a Checkbox - It Is a Six-Layer Systems Contract

Apr 1, 2026

Why real multimodal model support is a six-layer serving-stack contract, from API extraction to decoder reentry.

Designing a Production-Minded RAG Chatbot for a Personal Website

Feb 21, 2026

How I designed, implemented, and hardened a cost-efficient RAG chatbot for my personal site with citations, streaming, and build-time indexing.

Why Context is Everything: Building Your Personal AI Knowledge Base

Feb 5, 2026

How gathering your information in one place transforms AI from a generic assistant into your personal superpower.

Chatboxes, Agents, and automatic workflow

Jan 25, 2026

Why this feels like the shift from horsepower to a real vehicle—and why I am excited about the future.

vLLM - Why Requests Take Hours Under Load

Jan 14, 2026

Why vLLM requests can take 2-3 hours under heavy load: an analysis of KV cache block exhaustion and queue starvation.

Mini SGLang (Part 2) - Batching & Advanced Scheduling

Jan 6, 2026

Continuing the Mini SGLang deep dive - covering request batching, overlap scheduling, and tensor parallelism.

Mini SGLang (Part 1) - Architecture, Engine & Request Flow

Jan 4, 2026

A deep dive into the Mini SGLang architecture - covering system design, engine initialization, the KV cache, and the lifecycle of a single request.

Understanding Transformer Inference Optimization

Dec 28, 2025

Prefill vs decode, continuous batching, quantization, speculative decoding, and CUDA graphs — a map of the LLM inference optimization landscape and what each technique actually buys.

CUDA Memory Management for LLM Serving

Dec 20, 2025

How GPU memory actually gets spent serving an LLM: the PyTorch caching allocator, fragmentation, static KV pools, and debugging OOMs that make no sense.

Attention, From the Inference Side

Dec 15, 2025

Self-attention from an inference engineer's perspective: where the FLOPs go, why the KV cache exists, and how GQA and FlashAttention change the memory math.