Ming-Omni on SGLang: Architecture and the Optimizations Behind Fast Omni Serving

Over the past few months I've been working on and optimizing the omni and TTS models in sglang-omni. The one I've poured the most into is Ming-Omni, and it's been a genuinely rewarding process. I've consolidated some of that work into this article to illustrate our journey optimizing Ming-Omni — pushing it from a vanilla, preview-grade version into a real serving model that runs in production. I hope you enjoy it.

TL;DR — headline before/after, all SGLang-internal:

MMMU end-to-end latency: 14.99 s → 6.70 s (2.2×), from one patch-embed kernel fix.
Patch-embed op: 3.7 s → 0.02 ms (~185,000×).
Talker CFM CUDA-graph: 773 ms → 263 ms e2e p50 (2.94×), throughput 2.93×.
Streaming TTS time-to-first-audio: 509 ms → 236 ms (2.16×).
Talker single-stream throughput plateaus near ~3 req/s.

1. What is Ming-Omni?#

Ming-Omni is a unified multimodal model that perceives text, image, video, and audio in a single token space and responds in both text and speech. One model takes a spoken question about an image and answers out loud—no separate ASR, VLM, and TTS services stitched together.

The architecture is described in Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation (Inclusion AI / Ant Group, arXiv:2510.24821). Ming-Flash-Omni is built on a sparser MoE variant of Ling-Flash-2.0: 100 billion total parameters, of which only 6.1 billion are active per token. It unifies vision, speech, and language perception with joint, single-channel generation of speech, sound effects, and music—and replaces discrete speech tokens with continuous acoustic latents, avoiding quantization loss.

Where Ming sits in the omni landscape #

Any-to-any models extend LLMs to perceive and generate across modalities, and current systems cluster into a few patterns. Encoder–adapter–LLM stacks bolt frozen modality encoders onto a backbone LLM through a lightweight projection—strong at perception, but text-output-only. Thinker–talker designs add generation by separating an understanding LLM (the thinker) from a dedicated generation stack (the talker) that consumes the thinker's hidden states to synthesize speech. On the talker side, discrete codec-LM TTS predicts neural-codec tokens autoregressively, while continuous flow-matching / diffusion vocoders iteratively denoise continuous acoustic features for higher naturalness at greater inference cost.

Ming-Omni instantiates the thinker–talker paradigm with a continuous flow-matching talker. Reasoning stays in a high-capacity sparse-MoE thinker; high-rate acoustic modeling is delegated to a talker whose audio VAE synthesizes continuous latents rather than quantized codes.

The served architecture #

The configuration we serve in SGLang:

Stage	Component	Key config
Thinker	`BailingMoeV2ForCausalLM` (sparse MoE)	32 layers, hidden 4096, 256 experts (8 routed/token + 1 shared), GQA 32 heads / 4 KV, partial RoPE 50%, vocab 157,184
Vision encoder	SGLang ViT (Qwen3-style) + deepstack mergers	→ 4096-dim
Audio encoder	Whisper backbone (1280-dim) + Conv1d(k3,s2) + MLP	mel `[B,T,128]` → 4096-dim
Talker	dense Qwen2 (hidden 896) + CFM/DiT (28L, 1024H) + aggregator (28L, 1152H) + stop_head + spk_head	Audio VAE vocoder → 44.1 kHz waveform

The thinker's MoE block carries intermediate_size=9216, moe_intermediate_size=1024, grouped routing (n_group=8, topk_group=4), and first_k_dense_replace=1 (the first layer is dense). Head dim is 4096 / 32 = 128.

Note: the paper's headline numbers (100B total / 6.1B active) describe the full Ming-Flash-Omni family. The config in the table above is the specific instance we serve—they are not the same number and should not be conflated.

End-to-end flow #

Ming-Omni end-to-end architecture: text, image, and audio flow through their encoders into the BailingMoeV2 MoE thinker, which emits text and hands hidden states to the talker (Qwen2 + CFM/DiT + Audio VAE) to synthesize 44.1 kHz speech.

Design principles #

Sparse MoE efficiency. 256 experts but only 8 routed (+1 shared) per token—capacity scales with the expert pool while per-token FLOPs stay bounded.
Unified perception + generation. Vision, audio, and text collapse into one 4096-dim token space the thinker reasons over jointly, then a single stack emits both text and speech.
Decoupled understanding/speech stacks. The thinker owns reasoning; the talker owns acoustics. Each specializes, and the boundary is exactly where streaming synthesis begins.

Modalities: in = text / image / video / audio; out = text (always) + speech (on request).

2. What SGLang Supports for Ming-Omni #

SGLang exposes Ming-Omni through three request pipelines, selected by the modalities a client asks for. The text-only pipeline runs 6 stages; the speech pipeline adds a talker synthesis path for 7 stages; the streaming-speech pipeline adds a per-segment text segmenter for 8 stages. All three share the same thinker forward pass and diverge only at the generation tail.

Parallelism support is asymmetric. The thinker (BailingMoeV2, 256-expert MoE) and the vision encoder both shard across GPUs via tensor parallelism. The audio encoder and the talker do not — each runs single-GPU today.

Capability	Notes
Thinker TP	MoE backbone shards across GPUs
Image-encoder TP	`--image-encoder-tp`, `--gpu-image-encoder`
Text streaming (per-token, PR #656)	per-token SSE deltas for `stream=true`; talker fed via a segmenter (min 8 / max 40 tok, first-segment 450 ms wait)
Audio streaming (per-segment)	truly chunked; per-segment synthesis path

Text segmentation routes thinker output through a segmenter (stages.py:132-135) that emits a TTS segment once it accumulates between segment_min_tokens=8 and segment_max_tokens=40 tokens, with a first_segment_max_wait_ms=450 timeout so the first segment never stalls behind a slow accumulation. This feeds the talker; per-token text streaming to the client is added separately in PR #656 (earlier builds returned a single aggregate SSE chunk — see §5). Audio streaming then synthesizes and emits each segment independently, so the client hears audio before the full reply is decoded — and this per-segment chunking is real for audio.

Key Finding: the three chat pipelines are served through one OpenAI-compatible /v1/chat/completions endpoint. Clients pick text, speech, or streaming-speech via the modalities parameter and select a speaker through voice presets — no separate route per modality among the chat pipelines. The server additionally exposes dedicated /v1/audio/speech and /v1/realtime endpoints alongside the shared chat route.

The streaming and TP work referenced throughout this post lands in PRs #438, #513, #553, and #506 (see References).

3. The Serving Challenge #

Omni serving is a pipeline problem, not a single-model problem. A request to Ming-Flash-Omni traverses a chain of specialized stages—encoders, a sparse MoE thinker, a flow-matching talker, an audio vocoder, and a streaming segmenter—and the end-to-end latency a user feels is the sum of every stage, not the slowest one. Each stage runs on the critical path. None can be hidden behind another.

The chain looks like this:

audio/image/text in
   -> modality encoders        (vision ViT, speech encoder)
   -> MoE thinker              (sparse Ling-Flash-2.0-family backbone; only a few experts fire per token)
   -> talker CFM               (10-step flow-matching ODE over continuous acoustic latents)
   -> Audio VAE vocoder        (latent -> waveform)
   -> streaming segmenter       (per-segment chunk emission)
audio out

Two properties make this latency-sensitive end to end. First, the thinker is sparse but large: only a small fraction of its experts fire per token, yet every modality encoder upstream and every acoustic-synthesis step downstream still gate the first audible sample. Second, the talker replaces discrete speech tokens with continuous acoustic latents and synthesizes them through a 10-step ODE—higher fidelity, but ten denoising passes that sit squarely between the thinker's last token and the first waveform a user hears. Shaving milliseconds anywhere in the chain shifts time-to-first-audio directly.

While instrumenting this pipeline, we walked into three measurement traps. Each one is a lesson about trusting numbers before understanding what produced them.

Trap 1: Kernel dispatch. We measured a single "convolution" in the vision patch-embed taking ~3.7 s. The op was not expensive—it was degenerate. The patch-embed Conv3d has exactly one output cell per patch, making it mathematically a linear projection, but the Conv3d kernel hit a slow cuDNN path and paid the full convolution dispatch cost. The arithmetic was trivial; the dispatch was the bill.

Trap 2: Audio packaging. A measured audio duration came back inflated by roughly 1.84×. The waveform was fine. The WAV header was wrong: 24 kHz declared on 44.1 kHz PCM data. Every downstream duration and throughput figure computed from that header was off by the sample-rate ratio (44100 / 24000 ≈ 1.84). The bug was in the container metadata, not the audio.

Note: A wrong sample rate corrupts derived metrics silently—the waveform plays, the file opens, and only the numbers lie.

Trap 3: Streaming semantics. stream=true is a contract, not a flag. A streaming endpoint must deliver real per-segment chunks as they are synthesized—not buffer the full response and emit one trailing blob with a streaming label on it. We held the streaming path to the stricter definition: each segment leaves the segmenter as its own chunk, the moment it is ready.

Key Finding: Every trap shared one root cause—a metric we trusted before we understood the mechanism that produced it. The fixes in the following sections all start by re-deriving the number from first principles.

4. Optimization Journey: From 15s to Interactive Speech #

The first time we ran Ming-Flash-Omni end-to-end in SGLang, a single MMMU image-and-text request took roughly 15 seconds. Most of that time was not spent generating tokens. It was spent inside the vision encoder, doing one thing that looked cheap and was catastrophically slow. The journey from 15 seconds to interactive speech is a sequence of such discoveries — each one a place where the math said "this is free" and the hardware said otherwise. We start where the cost was largest.

Benchmark setup. Unless noted, every number is mean ± sample-std over n=3 repeats on 8×H100-80GB, with the thinker at TP=4 (the only feasible degree — see §4.2), MMMU at 50 samples and 16 warmup. The talker concurrency figures come from an earlier campaign; the encoder-fix, TP, CFM-graph, and streaming numbers are from the measured runs reported here. Every before/after is SGLang-internal — no cross-engine comparison.

4.1 Patch-embedding kernel fix — From 3.7s to 0.02ms#

The patch-embedding projection in Qwen3VLVisionPatchEmbed is a nn.Conv3d whose kernel size equals its input spatial dimension — so it produces exactly one output cell per patch, which makes it mathematically a linear projection, but on H100/bf16 it dispatches to a cuDNN slow path that costs ~3.7 s per call.

The Conv3d is convolution in name only. Each patch is a (in_channels, temporal_patch_size, patch_size, patch_size) volume, the kernel covers the entire volume, and the output is a single 1×1×1 cell. There is no sliding, no spatial reuse — every output is a dot product of the flattened patch against a flattened weight. That is F.linear. But cuDNN does not know the intent; it sees a 3D convolution over bf16 on H100 and selects a generic algorithm that spends almost four seconds doing what a GEMM does in microseconds.

Solution #

We keep the real Qwen3VLVisionPatchEmbed module — including its Conv3d weight — but bypass the convolution in the forward path, reshaping pixels into flat patch vectors and applying the identical weight as a linear layer (PR #539).

# Before: cuDNN 3D-convolution slow path (~3.7 s on H100/bf16)
hidden_states = self.patch_embed.proj(pixel_values)  # nn.Conv3d, output 1x1x1
 
# After: mathematically identical GEMM (sglang_omni/.../components/vision_encoder.py)
patch_dim = in_channels * temporal_patch_size * patch_size * patch_size
x = pixel_values.view(-1, patch_dim)
hidden_states = F.linear(x, proj.weight.view(embed_dim, -1), proj.bias)

The weight is reused as-is (proj.weight.view(embed_dim, -1)), so outputs are bit-for-bit the same projection. No retraining, no tolerance fuzzing — just the right kernel.

Key Finding: the substitution is exact, not approximate. The Conv3d and the F.linear share the same weight tensor; only the dispatched kernel changes.

Performance Impact #

The op itself collapses from ~3.7 s to ~0.02 ms — a ~185,000× speedup at the operator level. That single op dominated the encoder stage, so the whole encoder stage drops from ~8.3 s to ~0.02 s, a ~400× stage-level speedup. End to end, MMMU at concurrency 1 improves from 14.99 s to 6.70 s mean latency — a 2.2× speedup — with no observed accuracy regression within run variance (0.62 → 0.60 on the pre/post pair; the later TP=4 sweep in §4.5 reads 0.58–0.63, a separate campaign).

Note: a secondary fix in the same region — capturing the ViT block-forward into a CUDA graph — takes per-replay time from ~4 s down to ~5 ms. We do not separate the two contributions; only the combined post-fix numbers are reported.

Phase	Duration	%	Main Activities
Patch-embed (Conv3d, before)	~3.7 s	~45%	cuDNN 3D-conv slow path on bf16
Patch-embed (`F.linear`, after)	~0.02 ms	~0%	single GEMM, weight reused as-is
Encoder stage (before)	~8.3 s	100%	patch-embed + ViT blocks
Encoder stage (after)	~0.02 s	100%	patch-embed negligible; ViT blocks CUDA-graph replayed

Key Achievement: one kernel substitution turns the single largest term in the request budget into a rounding error, taking MMMU end-to-end from 14.99 s to 6.70 s with no accuracy regression (within run variance).

4.2 Tensor parallelism across the omni pipeline #

Tensor parallelism (TP) in the Ming-Omni pipeline shards each stage's weights across multiple GPUs, splitting attention heads, MLP projections, and MoE experts so that the thinker and image encoder can scale beyond a single device. The pipeline reached its current TP-everywhere state in three steps: an initial plumbing pass, the shortcomings that surfaced, and two optimization PRs that hardened the thinker and extended sharding to the encoder.

Where TP started — PR #438 ("Add omni serve to Ming omni v1"). This PR brought Ming-Omni V1 onto the generic sgl-omni serve --version v1 entrypoint and introduced the first thinker TP plumbing: --thinker-tp-size, --thinker-gpus, and disable_custom_all_reduce=True applied whenever TP > 1. For the first time the thinker could shard across GPUs through the shared CLI.

Shortcomings of that initial TP. Two problems remained.

The thinker text tower used a simplified TP pattern not aligned with upstream SGLang MoE/text models. Attention sharding, MoE routed/shared-expert reduction, and fused-weight loading were hard to reason about and fragile.
The image encoder (ViT + projector, ~7B) hardcoded TP=1 via _init_sglang_tp() with world_size=1. It could not be sharded, and it bottlenecked high-resolution and multi-image inputs.

Optimization PR #513 ("[Perf] Optimize Ming-Omni thinker tensor parallelism"). This PR realigned the thinker with SGLang's canonical TP patterns:

Dense MLP → MergedColumnParallelLinear + RowParallelLinear, with fused gate/up weight loading.
MoE routed-expert and shared-expert outputs combined before the TP all-reduce, so the reduction runs once per layer instead of twice.
LayerCommunicator integration in the decoder layers.
Hardened weight loading: track loaded vs skipped tensors, and reject incomplete gate/up shard pairs.
Fail-fast config so only the thinker stage may set tp_size > 1.

Optimization PR #553 ("[Feat] Add tensor parallelism support to Ming-Omni image encoder"). This PR extended sharding to the encoder that #513 left at TP=1:

Plumb real tp_rank/tp_size/nccl_port into the encoder process.
VisionProjector nn.Linear → ColumnParallelLinear / RowParallelLinear.
New --image-encoder-tp / --gpu-image-encoder CLI flags (functionally identical at TP=1).

Performance Impact. The thinker now shards cleanly, and TP=4 is the baseline for every benchmark in this post. The ~7B image encoder, previously pinned at TP=1, is now shardable across the encoder process. Correctness was verified by a deterministic-output parity test comparing TP=1 against TP=N: identical outputs across configurations.

Key Achievement: Every stage of the omni pipeline — thinker and image encoder — can now shard across GPUs through canonical SGLang TP primitives, with the thinker's per-layer MoE all-reduce halved from two reductions to one.

Note: TP=4 is the floor, not a tuning choice — a TP=1/2 sweep is infeasible. The served Ming model is ~200 GB; at 80 GB/H100 it simply does not fit below TP=4 without ~300×-slower CPU offload. We measured TP=1 climbing to 75 GB before OOM, while TP=4 lands at ~50 GB/GPU — comfortably resident. A full TP=8 pipeline would demand 10 GPUs. So TP=4 is the only valid operating point, and every section-4 number is reported at TP=4.

Image-encoder TP: does its own shard help? With the thinker fixed at TP=4, we toggled the dedicated image-encoder shard on and off at single-image c=1 (MMMU, mean ± sample-std, n=3, 8×H100):

Config	Mean latency (s)	QPS
encoder-TP OFF (single GPU)	6.094 ± 0.011	0.157 ± 0.000
encoder-TP ON (tp=2, dedicated)	6.263 ± 0.009	0.153 ± 0.000

Key Finding: turning the encoder shard on costs +0.169 s (+2.8%) latency and −2.5% QPS — small but real (the gap is ~15× the larger stddev). At single-image c=1 the ViT forward is already short relative to the per-request floor, so TP only adds all-reduce/sync overhead with no compute to amortize; encoder-TP is a knob for high-res / multi-image inputs, not single-image c=1.

4.3 CFM CUDA-graph capture — From 773ms to 263ms#

The talker synthesizes speech by integrating a conditional flow-matching (CFM) ODE: starting from noise, it denoises a continuous acoustic latent over a fixed number of solver steps. In SGLang the talker runs this CFM ODE for 10 steps (cfg=2.0, sigma=0.25, bf16) per request, and in eager mode every one of those steps dispatches its own kernels through the Python interpreter — a per-step launch tax that the GPU pays on the critical path of every utterance.

Problem. Each CFM step is a small, latency-bound block: the DiT denoiser forward, the aggregator, and the stop head. Run eagerly, the 10 steps serialize 10 rounds of host-side kernel launches. At c=1 the GPU is idle between launches, waiting on the CPU to enqueue the next step. The arithmetic is cheap; the launch overhead is not.

Solution. SGLang captures the entire 10-step CFM ODE — cfm.sample → aggregator → stop_head — into a single torch.cuda.CUDAGraph(), then replays that graph per request. Capture happens once into pre-allocated placeholder tensors; each subsequent request copies inputs in, calls graph.replay(), and copies outputs out. This ships by default.

The capture lives in _initialize_graph (sglang_omni/models/ming_omni/talker/modeling_ming_omni_talker.py:192-209):

self.graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(self.graph):
    self.gen_lat_placeholder = self.cfm.sample(
        self.last_hidden_state_placeholder,
        self.his_lat_placeholder,
        self.randn_like_placeholder,
        self.t_placeholder,
        self.sde_args_placeholder,
        self.sde_rnd_placeholder,
        abort_event=None,
    )
    self.inputs_embeds_placeholder = self.aggregator(self.gen_lat_placeholder)
    self.stop_out_placeholder = self.stop_head(
        self.last_hidden_state_placeholder[:, -1, :]
    ).softmax(dim=-1)

Replay is a single call (:160), bracketed by explicit abort checks since the in-graph abort path is disabled during capture (abort_event=None) — aborting cfm.sample mid-capture would corrupt the partial graph. Captured graphs are reused through a CFMGraphExecutorPool (default pool_size=5).

Note: the torch.cuda.graph(...) capture is at lines 192-209; replay is a single call at 160. The surrounding 166-179 region is the post-replay output copy-out (gen_lat/inputs_embeds/stop_out empty_like+copy_) plus the _initialize_graph signature and placeholder allocation.

Performance Impact. We ran a single-stream CFM microbenchmark (talker_cfm campaign) toggling only enforce_eager between arms — both arms are the same single-stream SGLang talker, so the A/B isolates the CFM-graph capture as the mechanism. Values are mean ± sample-std over n=3:

Mode	Throughput (req/s)	e2e p50 (ms)	e2e p99 (ms)	TTFA mean (ms)	RTF	Audio × RT
graph	2.987 ± 0.036	263.2 ± 5.4	651.1 ± 8.2	25.5 ± 2.3	0.0513 ± 0.0008	19.65 ± 0.24
eager	1.018 ± 0.027	772.5 ± 22.9	1829.5 ± 21.8	51.9 ± 2.8	0.1458 ± 0.0030	6.92 ± 0.13

Key Finding: avoiding the per-step kernel-launch overhead is the root cause of the talker's c=1 latency profile — capturing all 10 CFM steps into one replayable graph collapses 10 rounds of host-side dispatch into a single replay. Against the eager baseline this is a 2.94× e2e-p50 speedup (772.5 → 263.2 ms) and a 2.93× throughput gain (1.018 → 2.987 req/s).

Key Achievement: graph capture cuts TTFA 2.03× faster (51.9 → 25.5 ms) and RTF 2.84× lower (0.1458 → 0.0513), lifting the talker from 6.92× to 19.65× real-time on the same single-stream build.

4.4 Real streaming TTS — From 509ms to 236ms TTFA#

Real streaming TTS means the talker emits audio per text segment as it is synthesized, rather than buffering one monolithic blob and flushing it only when the full utterance is done. That distinction is the whole game for time-to-first-audio (TTFA): the latency a caller waits before the first sample of speech arrives.

Problem. The non-streaming talker path synthesizes the entire utterance, then returns a single audio response. Nothing reaches the caller until the last segment is decoded. TTFA is therefore bounded below by the cost of the whole synthesis — every segment's CFM ODE, every aggregation step — even though the first segment was ready long before. The caller hears silence for the full duration.

Solution (PR #506). We restructured the talker to relay audio per segment, and fixed two defects exposed once chunks started flowing:

Per-segment streaming. Each text segment's synthesized audio is emitted as its own chunk the moment it is decoded, so the first chunk leaves the engine as soon as the first segment finishes — not after the last.
Sample-rate fix. We forward chunk.sample_rate from the synthesis output instead of stamping a hardcoded header. The talker's audio runs at 44.1 kHz, but the relay had been emitting a 24 kHz header — so streamed chunks decoded at the wrong rate (pitch/duration skew). Forwarding the true per-chunk sample rate fixes playback.
Empty-tensor relay short-circuit (SHM backend). The shared-memory relay now short-circuits empty audio tensors instead of marshaling a zero-length payload across the boundary, avoiding wasted relay work on segments that produce no samples.

Performance Impact. Streaming cuts TTFA by more than half. Mean TTFA drops from 509 ms to 236 ms — a 2.16× speedup (p95: 553 ms → 234 ms, 2.36×). Once flowing, a request emits ~19.8 chunks at ~19 ms p50 intervals, each carrying ~325 ms of audio — roughly 17× faster than realtime.

The win is not free. Streaming trades aggregate throughput for latency: at c=1, throughput drops from 1.956 req/s to 1.206 req/s (~38%). This is an explicit UX/throughput trade — the caller starts hearing speech twice as fast, the engine serves fewer requests per second.

Key Finding: TTFA and throughput pull in opposite directions here. Per-segment emission front-loads work that the non-streaming path could batch, so the streaming engine does strictly more relay and dispatch work per request.

Mode	TTFA (mean)	Throughput (c=1)	Chunks/req	Chunk interval
Non-streaming	509 ms	1.956 req/s	1 (single blob)	—
Streaming	236 ms	1.206 req/s	~19.8	~19 ms (p50)

4.5 Concurrency scaling #

We sweep concurrency separately for the two halves of the pipeline. The thinker is compute-bound and TP-sharded; the talker runs a single-stream CFM scheduler.

Thinker concurrency (TP=4, MMMU, post-fix). Decode throughput holds at 138 tok/s per request at c=1 and degrades gracefully under load, falling 2.25× to 61 tok/s at c=16, while aggregate QPS scales 6.36× (0.157 → 0.998) and accuracy stays flat-to-up (0.58 → 0.63).

Concurrency	tok/s (per-req)	QPS	Mean latency (s)	p95 latency (s)	Accuracy
1	138.2 ± 0.3	0.157 ± 0.000	6.094 ± 0.011	13.937 ± 0.003	0.580 ± 0.000
2	123.1 ± 0.2	0.266 ± 0.001	7.219 ± 0.030	15.536 ± 0.021	0.607 ± 0.012
4	107.3 ± 0.7	0.465 ± 0.009	8.192 ± 0.119	17.473 ± 0.084	0.600 ± 0.000
8	83.4 ± 1.3	0.747 ± 0.026	10.169 ± 0.365	23.688 ± 0.301	0.620 ± 0.020
16	61.3 ± 0.3	0.998 ± 0.033	14.214 ± 0.302	28.873 ± 0.195	0.633 ± 0.023

Note: measured sweep (n=3, 8×H100, TP=4, MMMU 50 samples, warmup 16, mean ± sample-std). These supersede the earlier published thinker tok/s (120.4 → 56.1, older build); the c=16 QPS endpoint (0.998) agrees with the prior sweep (0.996) within run variance.

Talker concurrency (non-stream /v1/chat/completions). Throughput plateaus early — the CFM scheduler runs max_concurrency=1, so added concurrency mostly fills the queue rather than widening the stream.

Concurrency	Throughput (req/s)	Mean wall (s)	p95 wall (s)
1	2.02	0.493	0.522
2	2.87	0.68	0.74
4	3.01	1.25	1.39
8	2.92	2.35	2.77
16	3.03	3.70	5.01

Key Finding: the talker saturates near ~3.0 req/s by c=4 and holds flat through c=16 — the plateau is the single-stream SimpleScheduler ceiling, not a hardware limit. Mean wall grows linearly with concurrency (0.493 s → 3.70 s from c=1 to c=16), confirming added load queues behind the single stream rather than parallelizing.

Provenance: the talker sweep is from an earlier talker_compare.md campaign (c=1 an equalized rerun, c=2–16 the original batch) — distinct from the blog_data thinker/CFM runs above. Its ~3.0 req/s plateau is consistent with the CFM A/B's measured graph throughput (2.99 req/s, §4.3).

5. Known Limitations & Future Work #

The work above is in flight. Several rough edges remain.

Long-form speaker drift (Ming talker). On extended TTS generations, speaker timbre can drift away from the reference. Under investigation; not yet root-caused.
Talker / audio-encoder TP. The talker and the audio encoder do not yet support tensor parallelism. Only the thinker and image encoder are TP-sharded.
Duration-capped talker generation. We are exploring a text-length-derived max-step guard to prevent runaway TTS on short segments; it is not included in the benchmarks above.

Resolved since:

Per-token text streaming. Landed in PR #656: per-token text deltas now stream through _thinker_stage for stream=true requests. Earlier builds returned a single aggregate SSE chunk — that is what we measured (streaming_tpot, n=3: max_text_chunks_seen=1, text TPOT structurally N/A), so those text-cadence numbers are a pre-#656 baseline. Audio was already a real chunked stream (p50 inter-chunk 17.9 ± 0.4 ms, TTFA 268.9 ± 90.2 ms); a post-#656 text-TPOT measurement is the natural follow-up.
Eager-vs-graph CFM benchmark. Single-stream A/B (talker_cfm, n=3), only enforce_eager toggled. CUDA-graph capture beats eager by 2.94× e2e p50 (263 vs 773 ms), 2.93× throughput (2.99 vs 1.02 req/s), and 2.03× TTFA.

Key Finding: the CFM CUDA-graph win is real in isolation, not just an end-to-end artifact.

6. Acknowledgments #

This work was carried out by the SGLang-omni team and collaborators, who built and reviewed the Ming-Omni serving path, thinker/encoder TP, and the streaming TTS pipeline. We thank the Inclusion AI team for releasing Ming-Flash-Omni.

7. References #

SGLang-omni — serving framework and Ming-Omni implementation.
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation — Inclusion AI (Ant Group). arXiv:2510.24821.
PR #437 — Ming-Omni V1 pipeline migration (the V1 serving path this post describes).
PR #438 — Ming-Omni V1 serve entrypoint + initial tensor parallelism (§4.2).
PR #513 — thinker TP optimization (§4.2).
PR #553 — image-encoder TP (§4.2).
PR #687 — expose image-encoder TP via CLI (§4.2; in progress).
PR #539 — vision-encoder patch-embed Conv3d → F.linear fix (§4.1).
PR #506 — streaming TTS (§4.4).
PR #556 — thread per-chunk sample rate into WAV encoding (§4.4 sample-rate fix; in review).
PR #656 — per-token text streaming for stream=true (§5; merged).
PR #500 — align Ming benchmark plumbing.
PR #348 — Ming-Omni multi-stage CI: thinker / TTS / MMMU / MMSU.
PR #624 — Ming-flash-omni-2.0 cookbook (docs).