For the past two months, I have been benchmarking Ming-flash-omni-2.0 across two serving stacks: vllm-omni and sglang-omni.
At first, this looked like a normal backend comparison. Run the same model, same prompts, same hardware, then report which serving stack is faster.
That framing broke almost immediately.
Ming-Omni is not a text-only LLM. A request can pass through image encoders, audio encoders, a thinker LLM, a talker LM, a CFM diffusion head, an AudioVAE decoder, a streaming segmenter, and an OpenAI-compatible HTTP layer. Each part can change what "latency" means. A single number can hide a YAML default, a cuDNN kernel-selection trap, a CUDA graph boundary, a pseudo-streaming endpoint, or even a bad WAV header.
So this first post is not a winner table. It is the map I wish I had before I started: what I was actually benchmarking, how vllm-omni and sglang-omni structure Ming differently, and why the thinker and talker numbers looked so misleading before the mechanisms were isolated.
What Is Being Compared?
The model is Ming-flash-omni-2.0. The serving stacks are vllm-omni and sglang-omni.
The benchmark covers four user-visible paths:
| Path | User-visible question | Main systems surface |
|---|---|---|
| Image-text thinker | How fast does the model answer image+text questions? | Vision encoder, multimodal projection, LLM prefill/decode |
| Text-only thinker | How fast is the LLM path when encoders are off the critical path? | Thinker engine, batching, TP=4 decode |
| Talker | How fast does text-to-speech generation finish? | Talker LM, CFM, AudioVAE, CUDA graph capture, scheduler |
| Streaming talker | When does the user hear the first audio? | Endpoint semantics, chunking, SSE behavior |
There is also an output-validation layer. For text, I check task accuracy. For audio, character error rate (CER) tells me whether the speech is intelligible, but not whether two backends produce spectrally similar audio. For that I also look at mel-L2 distance, loudness, and spectral centroid.
That distinction matters because "same model" is not the same as "same serving behavior."
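To make the CER check concrete, here is a minimal sketch of the metric itself: Levenshtein distance over characters, divided by the reference length. The function names are illustrative, and any text normalization (case folding, punctuation stripping) applied before scoring is left out because it is an assumption that varies by pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A CER near zero only says the ASR transcript matches the prompt text. Two waveforms with very different loudness or timbre can both score zero, which is why the spectral metrics are tracked separately.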
Ming-Omni Has Two Main Runtime Stories
The first story is the thinker.
The thinker takes text, images, and sometimes audio, turns modality features into embeddings, and feeds them into the LLM. For image-text reasoning, the critical path includes the vision encoder and the LLM. For text-only reasoning, the critical path is mostly the LLM engine.
The second story is the talker.
The talker turns the thinker's text or hidden states into speech. On Ming, this path is not a simple vocoder call. It includes a talker LM, a diffusion-style CFM process, and an AudioVAE decoder. That means low-level runtime details like CUDA graph capture can dominate latency, and API-level details like streaming vs pseudo-streaming can dominate user experience.
So a useful Ming benchmark has to separate at least these questions:
- Is the thinker slow, or is the encoder slow?
- Is the talker slow, or is the talker not graph-captured?
- Is the endpoint actually streaming, or does it only return one final blob?
- Is the audio intelligible, and is it spectrally equivalent?
Those are different questions. The first mistake is treating them as one benchmark.
vLLM-Omni vs SGLang-Omni: The Architectural Difference
The cleanest high-level difference is that vllm-omni is more fused, while sglang-omni is more staged.
For the Ming thinker, vllm-omni loads the LLM and modality encoders inside one fused model stage. The request enters one engine, and the model runner invokes one forward path that includes the vision tower, projector, audio tower, and language model.
sglang-omni splits the same work into multiple stages: preprocessing, image encoder, audio encoder, multimodal aggregation, thinker, and detokenization. On an image-text request, the request moves through preprocessing, image encoding, aggregation, thinker decode, and final text decoding.
That split has two consequences:
| Architecture choice | What it helps | What it can cost |
|---|---|---|
| Fused stage | Fewer process boundaries; encoder and LLM live together | Harder to see which subcomponent caused a latency spike |
| Staged pipeline | Per-stage timing is visible; fixes can be localized | Extra stage transitions and separate scheduling surfaces |
This is why I do not think "fused vs pipelined" is the right final answer by itself. It is a useful map of where to look. The benchmark still has to identify the actual mechanism.
Thinker: Why SGLang-Omni Looked Much Slower
The first headline result was from MMMU, an image-text reasoning workload.
At concurrency 1, vllm-omni finished requests in roughly 5.7 seconds on average, while sglang-omni took roughly 15 seconds. That looked like a 2.7x win for the fused architecture.
The server-side timing told a more specific story. On the sglang path, almost none of the time was IPC:
| Stage component | Mean time in the instrumented run |
|---|---|
| preprocessing | ~0.06 s |
| image encoder | ~7-8 s |
| multimodal aggregate | ~0.001 s |
| thinker LLM | ~6 s |
| detokenize | ~0.001 s |
| all IPC gaps combined | ~0.1 s |
That table changed the question. The latency gap was not primarily "five stages are slow." It was "why is the image encoder taking several seconds?"
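A per-stage breakdown like the one in the table can be collected with very little machinery. The sketch below is a generic wall-clock accumulator, not the instrumentation either serving stack ships; the class name and the stage names in the usage comment are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of all measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}


# Usage (stage names are placeholders):
#   timer = StageTimer()
#   with timer.stage("image_encoder"):
#       run_encoder(...)
#   print(timer.shares())
```

One caveat for GPU stages: kernels launch asynchronously, so the stage needs a device synchronization before the clock is read, otherwise the timer measures launch time rather than execution time.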
The next narrowing step found the surprising part: the dominant cost was a single patch-embedding operation.
Ming's vision path used an nn.Conv3d with a kernel that covered the full input spatial extent. For the actual input shape, that convolution is mathematically just a linear projection over a flattened patch. But cuDNN did not choose an efficient algorithm for this degenerate 3D convolution shape on H100/bf16. The operation fell onto a slow path.
Replacing that operation with the equivalent F.linear removed most of the gap. The sglang c=1 mean dropped from about 15 seconds to about 6.7 seconds, close to vllm-omni's 5.7 seconds. At c=16, post-fix sglang reached roughly 0.996 qps versus vllm's 1.121 qps, about 11% lower.
So the original thinker result had two layers:
- The staged architecture made the bottleneck visible quickly.
- The actual bottleneck was not the stage boundary. It was a degenerate Conv3d kernel dispatch.
That is the kind of distinction benchmark tables usually erase.
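The equivalence behind that fix can be checked without a GPU. The sketch below uses plain Python lists and a deliberately naive convolution: when the kernel covers the whole input, there is exactly one output position per channel, and it equals a dot product between the flattened patch and the flattened kernel. All sizes and helper names are hypothetical. In PyTorch the analogous substitution would be reshaping the Conv3d weight to (out_channels, -1) and calling F.linear on the flattened patch.

```python
import math
import random


def rand_tensor(shape):
    """Nested lists of the given shape filled with uniform random floats."""
    if len(shape) == 1:
        return [random.uniform(-1, 1) for _ in range(shape[0])]
    return [rand_tensor(shape[1:]) for _ in range(shape[0])]


def flatten(a):
    """Row-major flattening of nested lists."""
    if isinstance(a, list):
        return [v for sub in a for v in flatten(sub)]
    return [a]


def conv3d_valid(x, w):
    """Naive stride-1, no-padding 3D cross-correlation.
    x: [C][T][H][W], w: [O][C][kT][kH][kW] -> out: [O][T'][H'][W']"""
    C, T, H, Wd = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    kT, kH, kW = len(w[0][0]), len(w[0][0][0]), len(w[0][0][0][0])
    out = []
    for kern in w:
        vol = []
        for t in range(T - kT + 1):
            plane = []
            for h in range(H - kH + 1):
                row = []
                for v in range(Wd - kW + 1):
                    acc = 0.0
                    for c in range(C):
                        for dt in range(kT):
                            for dh in range(kH):
                                for dw in range(kW):
                                    acc += x[c][t + dt][h + dh][v + dw] * kern[c][dt][dh][dw]
                    row.append(acc)
                plane.append(row)
            vol.append(plane)
        out.append(vol)
    return out


def linear(x_flat, weight_rows):
    """y[o] = dot(x_flat, weight_rows[o]), i.e. a bias-free linear layer."""
    return [sum(xi * wi for xi, wi in zip(x_flat, row)) for row in weight_rows]


random.seed(0)
patch = rand_tensor([3, 2, 4, 4])       # C=3, T=2, H=4, W=4
kernel = rand_tensor([5, 3, 2, 4, 4])   # 5 output channels, kernel == full extent
conv_out = conv3d_valid(patch, kernel)  # shape [5][1][1][1]
lin_out = linear(flatten(patch), [flatten(k) for k in kernel])
assert all(math.isclose(conv_out[o][0][0][0], lin_out[o], abs_tol=1e-12) for o in range(5))
```

The math is identical in both forms. What changes in practice is which kernel the library dispatches, which is exactly the part a benchmark table cannot show.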
Text-Only Thinker: The LLM Engines Were Basically At Parity
To remove the encoder from the path, I also ran a text-only GSM8K benchmark.
That benchmark exposed a separate trap: vllm-omni's shipped Ming deploy YAML capped the thinker at max_num_seqs: 1. With that default, concurrent requests serialize and vllm looks artificially flat under load.
Once vllm was launched with --max-num-seqs 32, the text-only thinker numbers lined up:
| Workload | vllm-omni | sglang-omni | Reading |
|---|---|---|---|
| GSM8K c=16 throughput | ~4.82 qps | ~4.61 qps | within ~5% |
That result is important because it removes a tempting but wrong explanation. The thinker LLM engine itself was not wildly different. Same model, same TP=4 sharding, same broad kernel set, same hardware class. Once batching caps matched, the engines were essentially a wash on text-only serving.
The thinker lesson is: do not turn a multimodal latency gap into an LLM-engine story until the encoder path, launch config, and batching cap have all been audited.
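Why a batching cap of 1 makes a backend look flat under load can be shown with a toy model. The sketch below assumes an idealized engine in which one "wave" serves up to max_num_seqs requests in a fixed time regardless of batch size. That assumption is far too generous for a real GPU, and the one-second step time is a placeholder rather than a measurement. The second helper just reproduces the "within ~5%" reading from the table.

```python
import math


def toy_throughput(concurrency: int, max_num_seqs: int, step_time_s: float) -> float:
    """Requests/second if each wave of up to max_num_seqs requests
    takes step_time_s, independent of how full the wave is (idealized)."""
    waves = math.ceil(concurrency / max_num_seqs)
    return concurrency / (waves * step_time_s)


def relative_gap(a: float, b: float) -> float:
    """Gap between two throughputs as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


for c in (1, 4, 16):
    capped = toy_throughput(c, max_num_seqs=1, step_time_s=1.0)
    uncapped = toy_throughput(c, max_num_seqs=32, step_time_s=1.0)
    print(f"c={c:2d}  cap=1: {capped:.1f} req/s   cap=32: {uncapped:.1f} req/s")

print(f"GSM8K c=16 gap: {relative_gap(4.82, 4.61):.1%}")
```

With the cap at 1, the toy throughput stays constant as concurrency grows, which is the same flat signature the shipped default produced. A flat curve is a reason to audit the launch config before blaming the engine.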
Talker: The Direction Depends On Concurrency
The talker story is different from the thinker story.
After fixing the measurement setup, sglang-omni was faster at low concurrency, while vllm-omni was better at high concurrency.
The cleaned-up non-streaming talker benchmark looked like this:
| Concurrency | sglang-omni | vllm-omni | Reading |
|---|---|---|---|
| c=1 throughput | ~2.02 req/s | ~0.68 req/s | sglang ~3x faster single-stream |
| c=16 throughput | ~3.03 req/s | ~4.33 req/s | vllm ~1.4x higher aggregate throughput |
| c=16 p95 wall time | ~5.01 s | ~3.00 s | vllm ~40% lower tail latency at high c |
The c=1 result is not a scheduler story. I tested vllm with max_num_seqs=1 and max_num_seqs=32; c=1 performance was essentially unchanged. The bottleneck was not continuous-batching overhead.
The better explanation is CUDA graph capture.
sglang's Ming talker captures the full 10-step CFM ODE into a CUDA graph. That removes a lot of kernel-launch overhead from the single-stream path. vllm-omni ships the Ming talker in eager mode by default because its full graph-capture path hits stream-capture-incompatible operations during input construction. A PIECEWISE CUDA graph mode works and improves vllm standalone TTS throughput by about 1.7x, but it still does not capture the whole CFM loop the way sglang does.
At high concurrency, the picture flips. sglang's Ming talker is effectively single-stream and plateaus around 3 req/s. vllm's talker can keep admitting more in-flight requests through the engine scheduler, so throughput keeps climbing. The crossover is around c=10.
The talker lesson is: c=1 latency and c=16 throughput are not the same product requirement. A single-user voice agent and a batch synthesis service want different parts of the curve.
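The crossover can be sanity-checked from the two measured vllm points alone. The sketch below assumes vllm throughput grows linearly between c=1 and c=16 and that sglang already sits on its plateau of about 3.03 req/s. Both are simplifying assumptions, not measured curve shapes, and the function name is illustrative.

```python
def linear_crossing(c_lo, y_lo, c_hi, y_hi, target):
    """Concurrency at which the straight line through (c_lo, y_lo) and
    (c_hi, y_hi) reaches the target throughput."""
    if y_hi == y_lo:
        raise ValueError("a flat line never reaches a different target")
    return c_lo + (target - y_lo) * (c_hi - c_lo) / (y_hi - y_lo)


# Measured points from the non-streaming talker table.
vllm_c1, vllm_c16 = 0.68, 4.33
sglang_plateau = 3.03

estimate = linear_crossing(1, vllm_c1, 16, vllm_c16, sglang_plateau)
print(f"estimated crossover: c = {estimate:.1f}")  # prints 10.7
```

That lands in the same neighborhood as the observed crossover around c=10. A denser concurrency sweep is still the only way to pin it down, since neither curve is guaranteed to be linear.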
Streaming: stream=true Is Not A Contract
The most user-visible result was not final latency. It was time to first audio (TTFA).
sglang PR #506 added real chunked streaming TTS for Ming. In the c=1 run, it emitted roughly 20 audio chunks per request, with about 19 ms between chunks, and first audio around 236 ms.
The vllm-omni Ming endpoint accepted stream=true, but on the tested path it emitted one trailing SSE frame containing the full audio blob. That means first audio was effectively final wall time: about 3.93 seconds in the c=1 comparison.
That is a huge UX difference, but it is not the same as saying one backend's model is faster. It says the endpoint semantics were different:
| Endpoint behavior | What the client experiences |
|---|---|
| Real chunked streaming | Audio can start before the whole utterance finishes |
| Pseudo-streaming | The client waits for the final blob |
For an interactive voice product, this may matter more than total request time. For offline batch synthesis, it may matter less than aggregate throughput.
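The difference between the two behaviors is easy to measure from the client side: record when the first non-empty chunk arrives, not only when the response ends. The sketch below works on any iterable of byte chunks. The two generators are stand-ins that simulate a chunked endpoint and a single-blob endpoint with sleeps; they are not real clients for either backend, and the chunk sizes and delays are arbitrary.

```python
import time
from typing import Iterable, NamedTuple, Optional


class StreamStats(NamedTuple):
    ttfa_s: Optional[float]  # seconds until the first non-empty chunk, None if none arrived
    total_s: float           # seconds until the stream ended
    n_chunks: int            # number of non-empty chunks received


def measure_stream(chunks: Iterable[bytes]) -> StreamStats:
    start = time.perf_counter()
    ttfa = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore keep-alive or empty frames
        count += 1
        if ttfa is None:
            ttfa = time.perf_counter() - start
    return StreamStats(ttfa, time.perf_counter() - start, count)


def fake_chunked(n=5, gap_s=0.02):
    """Simulated real streaming: one small chunk every gap_s seconds."""
    for _ in range(n):
        time.sleep(gap_s)
        yield b"\x00" * 320


def fake_blob(n=5, gap_s=0.02):
    """Simulated pseudo-streaming: same total work, one trailing frame."""
    time.sleep(n * gap_s)
    yield b"\x00" * (320 * n)


print(measure_stream(fake_chunked()))
print(measure_stream(fake_blob()))
```

Both simulated endpoints finish in about the same total time. Only the chunked one has a time to first audio that is much smaller than its total, and only a chunk count above one confirms that stream=true did anything.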
Audio Output: Intelligible Is Not Equivalent
One more trap: CER alone is not enough to claim audio equivalence.
Both backends produced clean, intelligible speech in the talker benchmark. Whisper-large-v3 CER was near zero on both sides. But the audio was not spectrally equivalent.
The audit found cross-backend mel-L2 distance around 3.4x the within-backend baseline. vllm was also roughly 6 dB louder and about 25% brighter by spectral centroid on the measured sample.
So the honest statement is:
- The outputs were intelligible.
- The outputs were not spectrally equivalent.
- A production deployment should still run A/B audio evaluation.
This is a good example of why omni benchmarks are harder than text-only benchmarks. The output is not just a string. It is a waveform with timing, loudness, spectral shape, and perceptual quality.
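Two of the spectral checks are simple enough to sketch with the standard library. The code below computes RMS level in dB and a magnitude-weighted spectral centroid using a naive DFT, which is only practical for short frames. It is an illustration of the metrics, not the audit tooling, and it omits mel-L2 entirely. The test tones and the 8 kHz sample rate are arbitrary choices.

```python
import cmath
import math


def rms_db(samples):
    """RMS level in dB relative to a full-scale amplitude of 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def spectral_centroid_hz(samples, sample_rate):
    """Magnitude-weighted mean frequency over bins 0..n/2 of a naive O(n^2) DFT."""
    n = len(samples)
    weighted = total = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        mag = abs(coeff)
        weighted += mag * (k * sample_rate / n)
        total += mag
    return weighted / total


def sine(freq_hz, sample_rate=8000, n=256, amp=0.25):
    """Short test tone; 500 Hz and 1000 Hz fall exactly on DFT bins here."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


quiet = sine(500)
loud = [2.0 * s for s in quiet]
print(f"level difference: {rms_db(loud) - rms_db(quiet):.2f} dB")
print(f"centroid of 500 Hz tone:  {spectral_centroid_hz(quiet, 8000):.0f} Hz")
print(f"centroid of 1000 Hz tone: {spectral_centroid_hz(sine(1000), 8000):.0f} Hz")
```

Doubling the amplitude raises the level by about 6 dB, so a 6 dB gap between backends is roughly a factor of two in amplitude. That is large enough to hear, yet invisible to CER.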
The Benchmark Matrix
Here is the compact version of the current read:
| Workload | Current reading | Mechanism |
|---|---|---|
| Text-only thinker | near parity | batching cap corrected; same TP=4 LLM path |
| Image-text thinker | near parity after fix | original gap was a Conv3d slow path |
| Low-c talker | sglang currently wins latency | full-CFM CUDA graph capture works there today |
| High-c talker | vllm currently wins throughput | continuous batching keeps scaling |
| Streaming TTS | sglang PR #506 wins TTFA | real chunks vs pseudo-streaming |
| Audio output | intelligible, not equivalent | CER is low; mel/RMS/centroid differ |
The word "currently" is doing work here. Some rows are architecture. Some are model-port maturity. Some are launch configuration. Some are endpoint semantics. A backend can move columns after a YAML change, a CUDA graph fix, or a streaming pipeline patch.
The main conclusion is that Ming-Omni serving cannot be evaluated as a single backend-level number. The thinker path, talker path, streaming path, and audio-output path are each governed by a different bottleneck, so each needs its own measurement.
Conclusion
For the thinker, the large initial gap was not an inherent property of fused serving versus staged serving. The staged layout made the bottleneck visible, but the dominant mechanism was a degenerate Conv3d patch-embedding path. Once batching was configured correctly and the patch embedder was fixed, the text-only and image-text thinker results were close enough that the serving framework itself was no longer the main story.
For the talker, the result is workload-dependent. SGLang-Omni currently has the stronger low-concurrency latency story because the Ming CFM path is captured more completely. vLLM-Omni currently has the stronger high-concurrency throughput story because its batched scheduler continues scaling after the single-stream talker path plateaus. The practical question is not which backend is faster in isolation, but which part of the latency-throughput curve the product needs.
For streaming and audio output, endpoint semantics and validation matter as much as model runtime. A stream=true response that emits one final SSE frame is not equivalent to chunked audio streaming, and a low CER does not imply spectral equivalence. These are product-visible differences, not just benchmark bookkeeping.
This post is the map for the rest of the series. The follow-up posts will separate the mechanisms in more detail:
- Thinker path: matched batching, the Conv3d slow path, and why the post-fix results converge.
- Text streaming: TTFT, TPOT, and why total latency is not the same as interactive chat UX.
- Talker latency: CFM CUDA graph capture, PIECEWISE vs full graph capture, and the c≈10 crossover.
- Streaming TTS: real chunked audio, pseudo-streaming APIs, and TTFA as a first-class metric.
- Audio output: CER, mel distance, loudness, spectral centroid, and long-form speaker drift.
- Reproducibility: how the benchmark artifacts, rerun recipes, and correction history were organized.
The goal of the series is to turn backend comparisons into mechanism-level claims. That is the only level where the numbers are actionable.