Model Support in a VLM Serving Stack Is Not a Checkbox - It Is a Six-Layer Systems Contract

11 min read · llm · vlm · systems · architecture

When people say a serving system "supports" a multimodal model, they usually mean something shallow - the model loads, one demo request runs, the API accepts an image field without crashing. That bar is too low to be useful.

After tracing the VLM path through SGLang layer by layer, I think the more honest framing is: model support is a cross-layer contract. A model isn't truly supported because one processor exists or because one encoder path works. It's supported when the entire serving stack preserves multimodal meaning from the public API boundary all the way into the standard decoder runtime - while keeping batching, caching, transport, and scheduling coherent.

What makes this especially interesting is that the stack is not "multimodal everywhere." It's actually the opposite. The design tries hard to isolate multimodal complexity into a bounded overlay, then hand the request back to the normal LLM serving path as early as possible. That design choice is the real story.

Multimodal support is an overlay, not a replacement runtime

A VLM request enters through a multimodal boundary, gets converted into structured media items, then into placeholder-bearing token structure, then into scheduler-owned state, then into projected encoder features, and finally gets fused back into input_embeds before continuing through the same decoder stack, attention path, KV cache, and streaming logic as a text-only request.

This matters because it answers a deceptively simple question: where should multimodal special cases live?

The answer: not in the decoder if you can avoid it.

By the time the request reaches decoder reentry, the model shouldn't need to "know" it came from pixels, audio, or video. It should see a fused embedding tensor and a ForwardBatch that already encodes the necessary state. Multimodal support succeeds when the decoder becomes boring again.

Public chat/API request
        |
        v
[Layer 5] Extract multimodal payload from messages
        |
        v
[Layer 4] Load media with model-specific processors
        |
        v
[Layer 3] Turn media into stable placeholder spans
        |
        v
[Layer 2] Transport and retain multimodal state in scheduler
        |
        v
[Layer 1] Encode + project + optionally cache multimodal features
        |
        v
[Layer 0] Fuse into input_embeds and re-enter decoder
        |
        v
Normal LLM runtime:
prefill -> decode -> KV cache -> logits -> streaming

This separation is what makes the architecture work. Concern one: what the media means for this model family. Concern two: how the serving system keeps the request schedulable and cacheable. These two concerns are split across layers instead of being collapsed into one monolithic "multimodal path."
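The layered handoff above can be sketched as a staged pipeline, where each layer consumes one representation and emits the next. Everything below - the function names, the data shapes, the fixed span length - is an illustrative toy, not SGLang's actual API:

```python
# Toy sketch of the six-layer flow. All names and shapes are hypothetical.

def extract(req):                      # Layer 5: API boundary
    return {"text": req["text"], "media": req.get("media", [])}

def process(state):                    # Layer 4: model-specific processors
    state["mm_items"] = [{"blob": m, "span_len": 3} for m in state["media"]]
    return state

def materialize(state):                # Layer 3: placeholder spans
    toks = state["text"].split()
    for it in state["mm_items"]:
        it["offset"] = len(toks)       # where the span lives in the sequence
        toks += ["<pad>"] * it["span_len"]
    state["tokens"] = toks
    return state

def transport(state):                  # Layer 2: scheduler-owned state (elided)
    return state

def encode(state):                     # Layer 1: encoder + projector
    for it in state["mm_items"]:
        it["embeds"] = [f"emb({it['blob']},{i})" for i in range(it["span_len"])]
    return state

def fuse(state):                       # Layer 0: scatter into input_embeds
    embeds = [f"txt({t})" for t in state["tokens"]]
    for it in state["mm_items"]:
        embeds[it["offset"]:it["offset"] + it["span_len"]] = it["embeds"]
    state["input_embeds"] = embeds     # the decoder sees only this
    return state

def serve(req):
    state = req
    for stage in (extract, process, materialize, transport, encode, fuse):
        state = stage(state)
    return state["input_embeds"]
```

The point of the sketch is the final line of `fuse`: after six conversions, the decoder receives a plain embedding sequence with no memory of where the media came from.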

Layer 5: the API boundary

Most discussions about model support start too late - at the processor or encoder. But the first multimodal problem appears before any media is decoded: the serving system has to interpret a chat request as a structured internal representation that preserves text, images, videos, audio, tool-call constraints, continuation prefixes, batching shape, and routing metadata together.

If the boundary layer normalizes the request incorrectly, every downstream layer inherits ambiguity. The processor won't know the intended modality ordering. The tokenizer can't align rewritten sequences with the right media items. The scheduler can't tell whether the request should wait on encoder-side work.

The takeaway: supporting a VLM means supporting a richer request ontology. Text-only systems get away with treating prompts as strings or token lists. VLM systems can't. They need an internal request object - like GenerateReqInput - that explicitly carries multimodal payloads and keeps them aligned with the textual prompt. That's not plumbing trivia. It's foundational.
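A minimal sketch of what such a request object has to guarantee - media items explicitly aligned to prompt positions, with the alignment checkable. The field names here are hypothetical; SGLang's real GenerateReqInput carries much more (sampling parameters, routing metadata, session handles):

```python
# Hypothetical request object; not SGLang's actual GenerateReqInput.
from dataclasses import dataclass, field

@dataclass
class MMItem:
    modality: str          # "image" | "video" | "audio"
    data: bytes            # raw media payload
    prompt_index: int      # which placeholder slot in the prompt this fills

@dataclass
class RequestInput:
    text: str
    mm_items: list[MMItem] = field(default_factory=list)

    def validate(self) -> None:
        # Invariant: media count fits the placeholder slots, and every
        # item maps to a distinct slot - ambiguity here poisons all
        # downstream layers.
        n_slots = self.text.count("<media>")
        assert len(self.mm_items) <= n_slots, "more media than placeholders"
        assert len({i.prompt_index for i in self.mm_items}) == len(self.mm_items)
```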

Layer 4: model-specific processors

This is where the processor registry picks the right multimodal processor for the architecture, loads and transforms images or audio according to model-specific rules, and produces structured mm_inputs rather than raw media blobs.

The key design decision: the system does not force all VLMs through one monolithic universal processor. It defines a common processor contract and lets individual model families specialize under it.

That's exactly the right tradeoff. Too much centralization makes every new model family painful to integrate - every special case leaks into shared preprocessing code. Too little centralization makes the rest of the runtime impossible to reason about - every model emits a different downstream shape. The processor layer standardizes the interface but not the internals.
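The shape of that tradeoff can be sketched as a registry plus a base contract - a standardized interface with model-specific internals. The registry mechanism and class names below are illustrative assumptions, not SGLang's actual registration code:

```python
# Illustrative processor registry; names are hypothetical.
_PROCESSORS: dict[str, type] = {}

def register(arch: str):
    """Decorator: map an architecture name to its processor class."""
    def deco(cls):
        _PROCESSORS[arch] = cls
        return cls
    return deco

class BaseProcessor:
    def __call__(self, media: list) -> dict:
        # The contract: every processor returns structured mm_inputs,
        # never raw media blobs. Internals are free to differ.
        raise NotImplementedError

@register("llava")
class LlavaProcessor(BaseProcessor):
    def __call__(self, media):
        # Model-specific rules live here (patching, resizing, layout).
        return {"mm_items": [{"kind": "image_patches", "src": m} for m in media]}

def get_processor(arch: str) -> BaseProcessor:
    return _PROCESSORS[arch]()
```

Adding a model family means adding one registered class; the rest of the runtime only ever sees the contract.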

This is also where the "just add a preprocessor" myth breaks down. The preprocessor is necessary, but it's only one stage in a much longer systems commitment. Layer 4 gives you model-specific media understanding, but it still doesn't tell the scheduler how long the placeholder span should be, how to hash the media for reuse, or how the request survives transport across TP ranks.

Layer 3: placeholder materialization

This might be the most underrated layer in the whole pipeline.

Multimodal serving doesn't work unless raw media becomes deterministic token-visible structure. The scheduler can't schedule "an image" as an abstract object. It needs a request whose sequence shape already contains stable multimodal spans - with hashes, pad values, offsets, and model-specific placeholder expansion rules.

The scheduler, radix matcher, chunked prefill logic, and prefix cache all operate on sequence structure. Multimodal content has to show up as something sequence-like long before the final embeddings are fused. The placeholder hash and pad-value mechanism does two jobs simultaneously: it gives the media a stable identity for reuse, and it turns that identity into a token-space stand-in that text-serving machinery can carry without needing the encoder result yet.

This is a kind of temporary fiction. The system pretends media is represented by synthetic token spans - not because those spans are semantically sufficient, but because they're operationally schedulable. Layer 0 will eventually erase that fiction by scattering real multimodal embeddings into the right positions. But until then, the fiction is what lets the entire LLM runtime keep functioning.
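That fiction can be made concrete with a small sketch: a model-specific rule decides how many pad tokens a media item occupies, and the runtime records only the resulting span. The token ids and the per-image rule below are illustrative assumptions, not any model's real values:

```python
# Hypothetical placeholder expansion; token ids are made up.
PAD_TOKEN_ID = -1        # synthetic stand-in token
IMAGE_TOKEN_ID = 32000   # hypothetical "<image>" id

def expand_placeholders(input_ids, images, patches_per_image):
    """Replace each image token with a model-sized span of pad tokens,
    recording (offset, length) so Layer 0 can scatter real embeddings."""
    out, spans, img_i = [], [], 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN_ID:
            n = patches_per_image(images[img_i])   # model-specific rule
            spans.append((len(out), n))
            out.extend([PAD_TOKEN_ID] * n)
            img_i += 1
        else:
            out.append(tok)
    return out, spans
```

The scheduler, radix matcher, and chunked prefill logic can now treat the request as ordinary sequence structure - the spans are schedulable long before any encoder runs.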

Stable hashing matters a lot here. If the same image doesn't map to the same placeholder identity, reuse breaks down upstream of the actual multimodal cache. Prefix matching weakens, request materialization becomes less reproducible, and the scheduler loses one of the main reasons to treat multimodal spans as first-class sequence citizens. Falling back to UUIDs when hash computation is disabled is a practical escape hatch, but it explicitly sacrifices reuse quality - a revealing tradeoff.
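The identity tradeoff fits in a few lines. Content hashing makes identity a function of the bytes, so the same image always reuses the same placeholder identity; the UUID fallback keeps the system running but makes every occurrence look new. A minimal sketch, with hypothetical function names:

```python
# Illustrative media-identity sketch; not SGLang's actual hashing code.
import hashlib
import uuid

def media_identity(data: bytes, hashing_enabled: bool = True) -> str:
    if hashing_enabled:
        # Deterministic: same bytes -> same identity -> prefix/cache reuse.
        return hashlib.sha256(data).hexdigest()
    # Escape hatch: always unique, so reuse is explicitly sacrificed.
    return uuid.uuid4().hex
```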

Layer 2: distributed state transport

This layer exposes the gap between a research demo and a production serving system.

Once multimodal state exists, the stack still has to move it across process boundaries, broadcast it across tensor-parallel ranks, retain it long enough for chunked prefill or session reuse, and decide whether encoding happens locally or through a separate dispatch path.

The scheduler reconstructs MultimodalInputs from dictionary payloads, centralizes materialization on the entry rank, broadcasts structured state instead of forcing every rank to redo the work, and preserves request-local multimodal handles across the relevant execution lifetime. Clean separation: the tokenizer side decides what the request is; the scheduler decides how that state lives and moves.
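The transport pattern can be sketched abstractly: state crosses the process boundary as a plain dict, is materialized once on the entry rank, and the structured form is shared. The class and the simulated broadcast below are illustrative stand-ins, not SGLang's MultimodalInputs or its TP plumbing:

```python
# Hypothetical transport sketch; broadcast is simulated, names are made up.
from dataclasses import dataclass, asdict

@dataclass
class MMState:
    item_hash: str       # stable identity from Layer 3
    span: tuple          # (offset, length) in the token sequence

def to_wire(state: MMState) -> dict:
    # Serializable form for crossing a process boundary.
    return asdict(state)

def from_wire(payload: dict) -> MMState:
    # Materialized once, on the entry rank.
    return MMState(item_hash=payload["item_hash"], span=tuple(payload["span"]))

def broadcast(entry_rank_state: MMState, world_size: int) -> list[MMState]:
    # Stand-in for a TP broadcast: every rank receives the already
    # materialized state instead of redoing the work.
    return [entry_rank_state] * world_size
```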

A model is not "supported" in any serious sense if it only works in a single-process path but breaks when encoder work is disaggregated, when TP ranks need synchronized multimodal state, or when session-backed requests must retain media context across turns.

No stable multimodal transport
    -> no reliable scheduler-owned state
    -> no safe TP broadcast / retention
    -> no robust encoder dispatch
    -> fragile or duplicated multimodal execution
    -> "supported" model that only works in narrow conditions

Layer 2 deserves more attention than it usually gets. This is the point where multimodal support becomes a systems property rather than a model-loading property.

Layer 1: encoders and projectors

Different model classes define their own multimodal forward paths: get_image_feature, get_video_feature, get_audio_feature, encode_images, model-specific pad_input_ids behavior. The stack doesn't force these families into one fake universal encoder API. Instead, it lets them differ while insisting they all eventually produce projected hidden states aligned with the decoder hidden size.

That's the right invariant. Not "all models must preprocess the same way," but "all models must eventually hand Layer 0 embeddings that can be scattered into the decoder input stream." It preserves architectural flexibility while keeping the rest of the stack stable.

The multimodal cache lives here too, and its placement is telling. Caching happens after model-specific encoding and projection - that's where the expensive, reusable artifact exists. But the cache depends on earlier layers having already created stable feature identity and request structure. Caching isn't an isolated optimization; it's downstream of the correctness of Layers 3 and 2.
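A minimal sketch of that placement: the cache keys on the stable identity from Layer 3 and stores the post-projection embeddings, because that is the expensive artifact worth reusing. The class and its eviction policy are illustrative assumptions, not SGLang's actual cache:

```python
# Hypothetical multimodal embedding cache; not the real implementation.
class MMEmbeddingCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._store: dict[str, list] = {}
        self.hits = 0

    def get_or_compute(self, item_hash: str, encode_and_project):
        # item_hash comes from Layer 3; if identity is unstable (e.g. the
        # UUID fallback), this lookup never hits.
        if item_hash in self._store:
            self.hits += 1
            return self._store[item_hash]
        embeds = encode_and_project()      # the expensive encoder+projector pass
        if len(self._store) < self.capacity:
            self._store[item_hash] = embeds
        return embeds
```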

Stable request extraction (L5)
    -> correct processor output (L4)
    -> stable hashes / pad values / spans (L3)
    -> scheduler-owned transport and lifetime (L2)
    -> reusable encoder embeddings + cacheability (L1)
    -> correct fusion and decoder reentry (L0)

Layer 0: decoder reentry

Layer 0 is where the multimodal overlay ends and the standard LLM path resumes. It takes multimodal items, computes or retrieves their embeddings, aligns them to placeholder spans, scatters them into input_embeds, handles mixed cases like precomputed embeddings, and clears multimodal-specific state so the request can continue as a normal decoder batch.

The key achievement is containment.

The decoder doesn't want to reason about images, videos, or audio. It wants a tensor of embeddings and a forward batch. Once forward_batch.mm_input_embeds is prepared and the model is called with input_ids=None and input_embeds=..., the serving system regains normal invariants: transformer blocks run normally, attention reads KV cache normally, logits and sampling proceed normally, streaming is unchanged.
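The fusion step itself is conceptually simple: walk the token sequence, and wherever a pad token marks a placeholder position, substitute the next projected multimodal row. A pure-Python stand-in for the tensor scatter, with hypothetical names:

```python
# Illustrative fusion sketch; a stand-in for the real tensor scatter.
PAD_TOKEN_ID = -1  # hypothetical pad id from placeholder materialization

def fuse_input_embeds(input_ids, text_embed, mm_embeds):
    """text_embed: maps a token id to its embedding.
    mm_embeds: projected multimodal rows, in placeholder order."""
    rows = iter(mm_embeds)
    fused = []
    for tok in input_ids:
        if tok == PAD_TOKEN_ID:
            fused.append(next(rows))       # real embedding erases the fiction
        else:
            fused.append(text_embed(tok))
    # Decoder reentry: the model is called with input_ids=None and
    # input_embeds=fused - it never learns the media existed.
    return fused
```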

A weaker design would sprinkle multimodal conditionals deep into the decoder stack. This design pushes multimodal complexity upward and resolves it before reentry. That makes the system easier to extend, optimize, and reason about.

Invariant preservation across layers

The serving stack constantly converts the request between representations. At the boundary, it's a chat message with mixed content parts. In Layer 4, it becomes model-specific media features. In Layer 3, placeholder-bearing sequence structure. In Layer 2, scheduler-owned distributed state. In Layer 1, projected hidden states. In Layer 0, a fused embedding tensor. After that, it's just a decoder input.

The hard part isn't any single conversion. It's preserving the right invariants across all of them. Text/media ordering must survive extraction. Model-specific preprocessing must preserve enough metadata for later span materialization. Hashes must stay stable enough for reuse. Placeholder spans must stay deterministic enough for scheduling and radix matching. Scheduler transport must preserve item ordering and lifetime. Encoder output must match decoder hidden-size requirements. Fusion must restore exact positional alignment before decoder reentry.

A model is supported only when every representation change preserves the invariants needed by the next layer. That's a stronger and more operational definition than "the model loads" or "the processor exists."
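One of those invariants is directly checkable at the fusion boundary: every placeholder span must be matched, in order, by exactly as many projected rows as the span reserved. A small illustrative guard, with hypothetical names:

```python
# Hypothetical fusion-time invariant check.
def check_fusion_invariants(spans, mm_rows):
    """spans: [(offset, length)] from Layer 3; mm_rows: per-item rows from Layer 1."""
    assert len(spans) == len(mm_rows), "media item count drifted across layers"
    for (offset, length), rows in zip(spans, mm_rows):
        assert length == len(rows), (
            f"span at {offset} reserves {length} positions, got {len(rows)} rows"
        )
```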

Strengths

The first major strength is that it isolates multimodal complexity before decoder execution. The core LLM runtime stays reusable, and multimodal branching doesn't leak into performance-critical paths.

The second is that it respects model-family diversity without giving up system-wide structure. Processor and encoder layers can vary significantly, but the downstream interface stays disciplined.

Third, it makes multimodal spans visible to the scheduler through placeholder materialization instead of hiding them as opaque side data. Multimodal requests can participate in batching, scheduling, and reuse naturally.

Finally, it treats multimodal support as a memory-lifetime problem. Features are retained when needed, offloaded or cleared after fusion, cached when repetition justifies reuse.

Open complexity

Processor/model diversity. Every new architecture potentially introduces different image layouts, patch rules, temporal semantics, or placeholder conventions. The stack contains that diversity but can't eliminate it. Layers 4 and 1 are where support work accumulates fastest.

Stable placeholder identity. The more the system depends on hashes and deterministic materialization for prefix reuse and cacheability, the more sensitive it becomes to anything that weakens identity stability. The UUID fallback is operationally useful but architecturally signals a reuse-quality tradeoff.

Distributed execution. Multimodal support gets substantially harder once encoder work can move across process boundaries or when session state persists across turns. The layering handles this cleanly, but these remain the operationally fragile zones.

Takeaway

VLM model support should be evaluated as a serving-stack property, not a model-integration anecdote.

A model isn't meaningfully supported because you added a processor, or because get_image_feature() returns a tensor, or because one sample request cleared the API. It's supported when the stack can extract mixed-content requests, load architecture-specific media correctly, convert that media into deterministic sequence-visible structure, transport and retain it across scheduler boundaries, compute or reuse projected encoder embeddings, and fuse them back into the decoder path without contaminating the ordinary LLM runtime.

"Model support" is not a checkbox. It's a six-layer systems contract. And that's probably the clearest marker of a mature serving design: the multimodal request takes a long, specialized journey, but by the time it reaches the decoder, the system has made it look ordinary again.