MOSS-TTS Local Transformer v1.5 on SGLang-Omni

The article tackles a real serving problem: how to run MOSS-TTS-Local Transformer v1.5 efficiently and with low latency in production. The model is an open 48 kHz stereo TTS model built on a Qwen3-4B backbone with a lightweight Local Transformer, and it supports zero-shot voice cloning, 31 languages, and token-level duration control. A model like this does not fit a single LLM decode loop: one request crosses reference-audio encoding, an autoregressive backbone, a frame-local 12-codebook sampling loop, and a stateful codec decoder. SGLang-Omni serves it as a three-stage pipeline (reference encoding -> AR engine -> streaming vocoder), with frame-level CUDA graphs keeping the per-step overhead low. For TTS and omni models, this is the recurring lesson: the serving path looks more like a heterogeneous pipeline than "one model generates tokens until done," which makes CUDA-graph-friendly runners, streaming vocoder scheduling, and low-overhead stage communication first-class parts of the serving story.
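To make the three-stage shape concrete, here is a minimal toy sketch in plain Python. It is not SGLang-Omni's API: the function names (`encode_reference`, `ar_engine`, `streaming_vocoder`, `synthesize`), the token arithmetic, and the chunk size are all hypothetical stand-ins. The only details taken from the article are the stage order and the 12 codebooks per frame.

```python
from typing import Iterator, List

NUM_CODEBOOKS = 12  # from the article: 12 codebooks sampled per audio frame


def encode_reference(ref_audio: List[float]) -> List[int]:
    """Stage 1 (toy): map reference audio to conditioning tokens."""
    return [int(abs(x) * 100) % 1024 for x in ref_audio]


def ar_engine(prompt: List[int], num_frames: int) -> Iterator[List[int]]:
    """Stage 2 (toy): emit one frame of NUM_CODEBOOKS codes per step."""
    state = sum(prompt)
    for _ in range(num_frames):
        frame = []
        # Frame-local loop: one code per codebook before moving to the next frame.
        for cb in range(NUM_CODEBOOKS):
            state = (state * 31 + cb + 7) % 1024  # placeholder for real sampling
            frame.append(state)
        yield frame


def streaming_vocoder(frames: Iterator[List[int]],
                      chunk_frames: int) -> Iterator[List[float]]:
    """Stage 3 (toy): buffer frames and emit one audio chunk per chunk_frames."""
    buf: List[List[int]] = []
    for frame in frames:
        buf.append(frame)
        if len(buf) == chunk_frames:
            yield [sum(f) / (1024.0 * NUM_CODEBOOKS) for f in buf]
            buf = []
    if buf:  # flush the final partial chunk
        yield [sum(f) / (1024.0 * NUM_CODEBOOKS) for f in buf]


def synthesize(ref_audio: List[float], num_frames: int,
               chunk_frames: int) -> Iterator[List[float]]:
    """Chain the three stages; audio chunks are produced as frames arrive."""
    prompt = encode_reference(ref_audio)
    return streaming_vocoder(ar_engine(prompt, num_frames), chunk_frames)


chunks = list(synthesize([0.1, -0.2, 0.3], num_frames=10, chunk_frames=4))
chunk_sizes = [len(c) for c in chunks]
```

Because the stages are chained as generators, the vocoder can hand out the first chunk after `chunk_frames` frames instead of waiting for the whole utterance, which is the property the pipeline design is after.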

I implemented the native-streaming support (sglang-omni#753): a streaming vocoder scheduler that drives stateful codec-decoding sessions, plus per-request streaming metadata (chunk sizes and first-audio latency targets). The key property I leaned on is that the codec is natively streamable: it uses causal convolutions plus bounded-context attention with per-slot state, so the chunked output stays bit-identical to offline decode. That means streaming buys latency without changing the audio: for a single request, time-to-first-audio drops about 5.3x (0.13 s with streaming vs 0.70 s without).
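The "chunked equals offline" property can be illustrated with a toy causal filter that carries its left context between calls. This is only a sketch under simplifying assumptions: `CausalConvSession` is a hypothetical class, not the codec or the scheduler from the PR, and it models only the causal-convolution part of the decoder, not the bounded-context attention.

```python
from typing import List


class CausalConvSession:
    """Toy stateful decoder: a causal FIR filter that keeps its left context."""

    def __init__(self, kernel: List[float]):
        self.kernel = kernel
        # Per-session state: the last (kernel_size - 1) inputs seen so far.
        self.history = [0.0] * (len(kernel) - 1)

    def decode(self, chunk: List[float]) -> List[float]:
        k = len(self.kernel)
        padded = self.history + chunk
        out = []
        for i in range(len(chunk)):
            # Each output sees only the current input and the k - 1 before it.
            acc = 0.0
            for w, x in zip(self.kernel, padded[i:i + k]):
                acc += w * x
            out.append(acc)
        if k > 1:
            self.history = padded[-(k - 1):]  # carry context into the next chunk
        return out


codes = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.1, -0.3]
kernel = [0.2, 0.3, 0.5]

# Offline: decode the whole sequence in one call.
offline = CausalConvSession(kernel).decode(codes)

# Streaming: decode 3 samples at a time through one stateful session.
session = CausalConvSession(kernel)
streamed: List[float] = []
for start in range(0, len(codes), 3):
    streamed.extend(session.decode(codes[start:start + 3]))

identical = streamed == offline
```

Every output sample is computed from the same inputs in the same order whether or not the sequence is chunked, so the two results compare equal exactly rather than approximately. A non-causal filter, or a session that dropped its history between chunks, would not have this property.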

Read the full article on LMSYS.

Higgs Audio v3 TTS on SGLang-Omni

Jun 4, 2026

A short pointer to the LMSYS article on work I contributed to: serving Higgs Audio v3 TTS with SGLang-Omni for real-time, controllable voice agents.

Ming-Omni on SGLang: Architecture and the Optimizations Behind Fast Omni Serving

Jun 8, 2026

How we serve Ming-Omni in SGLang: the unified multimodal architecture, and the optimizations — an encoder kernel fix, tensor parallelism, CFM CUDA-graph capture, and streaming TTS — behind fast omni serving.

Omni Serving Design Through the Lens of Ming-Omni

May 29, 2026

Using Ming-flash-omni-2.0 as a case study to compare vllm-omni and sglang-omni design tradeoffs across thinker, talker, streaming, and audio output.


Related posts

Higgs Audio v3 TTS on SGLang-Omni

Ming-Omni on SGLang: Architecture and the Optimizations Behind Fast Omni Serving

Omni Serving Design Through the Lens of Ming-Omni
