Posts

All the articles I've posted.

Higgs Audio v3 TTS on SGLang-Omni

A short pointer to the LMSYS article on work I contributed to: serving Higgs Audio v3 TTS with SGLang-Omni for real-time, controllable voice agents.

Understanding Transformer Inference Optimization

Prefill vs decode, continuous batching, quantization, speculative decoding, and CUDA graphs: a map of the LLM inference optimization landscape and what each technique actually buys.

CUDA Memory Management for LLM Serving

How GPU memory actually gets spent serving an LLM: the PyTorch caching allocator, fragmentation, static KV pools, and debugging OOMs that make no sense.

Attention, From the Inference Side

Self-attention from an inference engineer's perspective: where the FLOPs go, why the KV cache exists, and how GQA and FlashAttention change the memory math.