The article tackles a real serving problem: how to run MOSS-TTS-Local Transformer v1.5 — an open 48 kHz stereo TTS model built on a Qwen3-4B backbone with a lightweight Local Transformer, with zero-shot voice cloning, 31 languages, and token-level duration control — efficiently and with low latency in production. A model like this does not fit a single LLM decode loop: one request crosses reference-audio encoding, an autoregressive backbone, a frame-local 12-codebook sampling loop, and a stateful codec decoder. SGLang-Omni serves it as a three-stage pipeline — reference encoding -> AR engine -> streaming vocoder — with frame-level CUDA graphs keeping the per-step overhead low. For TTS and omni models, this is the recurring lesson: the serving path looks more like a heterogeneous pipeline than "one model generates tokens until done," which makes CUDA-graph-friendly runners, streaming vocoder scheduling, and low-overhead stage communication first-class parts of the serving story.
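To make the three-stage shape concrete, here is a toy sketch of reference encoding -> AR engine -> streaming vocoder as chained Python generators. It is not SGLang-Omni code: every function name and all the arithmetic are hypothetical stand-ins, and only the stage order and the 12 codebooks per frame come from the description above.

```python
from typing import Iterator, List

NUM_CODEBOOKS = 12  # from the text: each audio frame carries 12 codebook tokens


def encode_reference(ref_audio: List[float]) -> List[int]:
    """Stage 1 (toy stand-in): map reference audio to conditioning tokens."""
    return [int(abs(x) * 100) % 1024 for x in ref_audio]


def ar_engine(cond: List[int], num_frames: int) -> Iterator[List[int]]:
    """Stage 2 (toy stand-in): emit one frame at a time, 12 tokens per frame."""
    state = sum(cond)
    for _ in range(num_frames):
        frame = []
        for cb in range(NUM_CODEBOOKS):  # frame-local loop over codebooks
            state = (state * 31 + cb + 7) % 1024  # placeholder for sampling
            frame.append(state)
        yield frame


def streaming_vocoder(
    frames: Iterator[List[int]], chunk_frames: int
) -> Iterator[List[float]]:
    """Stage 3 (toy stand-in): emit audio as soon as a chunk of frames is full."""
    pending: List[List[int]] = []
    for frame in frames:
        pending.append(frame)
        if len(pending) == chunk_frames:
            yield [sum(f) / (1024.0 * NUM_CODEBOOKS) for f in pending]
            pending = []
    if pending:  # flush the final partial chunk
        yield [sum(f) / (1024.0 * NUM_CODEBOOKS) for f in pending]


def synthesize(
    ref_audio: List[float], num_frames: int, chunk_frames: int
) -> Iterator[List[float]]:
    """Chain the three stages; audio chunks are produced lazily."""
    cond = encode_reference(ref_audio)
    return streaming_vocoder(ar_engine(cond, num_frames), chunk_frames)


chunks = list(synthesize([0.1, -0.2, 0.3], num_frames=10, chunk_frames=4))
print([len(c) for c in chunks])
```

Because the stages are generators, the first audio chunk is available after `chunk_frames` frames rather than after the whole utterance, which is the behavior the pipeline framing is meant to highlight.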
I implemented the native-streaming support (sglang-omni#753): a streaming vocoder scheduler that drives stateful codec-decoding sessions, plus per-request streaming metadata (chunk sizes and first-audio latency targets). The key property I leaned on is that the codec is natively streamable: it uses causal convolutions plus bounded-context attention with per-slot state, so chunked output stays bit-identical to offline decode. Streaming therefore cuts latency without changing the audio: for a single request, time-to-first-audio drops about 5.3x (0.13 s vs 0.70 s).
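The bit-identical claim follows from causality: each output sample depends only on current and past inputs, so carrying the last few inputs across chunk boundaries reproduces the offline computation exactly. A minimal pure-Python sketch of that idea, assuming a single causal convolution with made-up filter taps (the real codec also has bounded-context attention and per-slot state, which this does not model):

```python
from typing import List, Tuple

KERNEL = [0.5, 0.3, 0.2]  # hypothetical causal filter taps, oldest input first


def causal_conv(
    samples: List[float], history: List[float]
) -> Tuple[List[float], List[float]]:
    """Causal convolution over `samples` with `history` as left context.

    Returns the outputs and the updated history (the last len(KERNEL) - 1
    inputs), which is the only state a streaming session has to carry.
    """
    k = len(KERNEL)
    padded = history + samples
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for w, x in zip(KERNEL, padded[i:i + k]):
            acc += w * x  # same operation order in both modes
        out.append(acc)
    return out, padded[len(padded) - (k - 1):]


def offline_decode(signal: List[float]) -> List[float]:
    """Process the whole signal in one call, starting from zero history."""
    out, _ = causal_conv(signal, [0.0] * (len(KERNEL) - 1))
    return out


def stream_decode(signal: List[float], chunk_size: int) -> List[float]:
    """Process the signal chunk by chunk, carrying history between chunks."""
    history = [0.0] * (len(KERNEL) - 1)
    out: List[float] = []
    for start in range(0, len(signal), chunk_size):
        chunk_out, history = causal_conv(signal[start:start + chunk_size], history)
        out.extend(chunk_out)
    return out


signal = [((i * 37) % 11) / 10.0 for i in range(20)]
print(stream_decode(signal, 4) == offline_decode(signal))
```

The equality check is exact rather than approximate because both paths apply the same floating-point operations in the same order to the same input windows. Chunk size then only affects when audio becomes available, not what it contains.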
Read the full article on LMSYS.