⚡vLLM

Describes the vLLM V1 engine · current as of mid-2026 · unofficial

A design doc for the open-source LLM inference engine — written for systems engineers (no LLM expertise required). We'll cover what makes serving large models hard and the OS-paging analogy at vLLM's core.

TL;DR

vLLM treats LLM serving as a memory-management problem. Each request's working state (the "KV cache") grows as the request runs, and older systems wasted 60–80% of GPU memory holding it. PagedAttention applies OS-style virtual memory and paging to this state and pushes utilization to ~96%, which is where the 2–4× throughput win comes from.
The runtime makes a fresh scheduling decision every ~10–80 ms. Requests join, finish, or get evicted at every model step, so the GPU never idles waiting for the slowest in-progress request. The scheduler can also mix expensive "process the prompt" work for one request with cheap "generate the next token" work for others in the same step (chunked prefill) to keep the GPU utilization high.
The architecture is built so its features compose. The HTTP/API process is separate from the engine process, and the scheduler is built around a single uniform primitive — which is why chunked prefill, prefix caching, speculative decoding, and multimodal inputs all work together cleanly instead of fighting each other.

Background: how LLM serving works

If you've never run an LLM in production, four ideas are enough to follow the rest of this doc.

Models speak in tokens, not characters. A tokenizer breaks the input string into ~3–4-character chunks, each represented by an integer ID. "Hello, world!" might be five tokens. Everything below works at the token level.
Generation is autoregressive: one token at a time, in a loop. The model takes the prompt, predicts the next token, appends it to the input, and repeats. A 500-token response is 500 model forward passes. The autoregressive loop is what makes inference fundamentally serial inside a single request.
Every model layer revisits every previous token. The mechanism is called "attention", and what matters for this doc is the consequence: to predict token 501 you have to look at all 500 tokens before it. To avoid recomputing those 500 tokens' per-layer state on every step, the system caches it — that's the KV cache (named after the two matrices the cache holds, "keys" and "values"). The cache is per-request, grows linearly with sequence length, and lives in GPU memory.
GPU memory is the binding constraint. An A100 GPU has 80 GB of fast on-package memory (HBM). Loading a 13B-parameter model takes 26 GB before you serve a single request. The KV cache competes for what remains. More concurrent requests means more KV cache; ceiling out the KV cache means refusing to admit more requests.

These four facts together explain why "LLM serving" is mostly "GPU memory management." vLLM exists because the pre-2023 systems handled that memory naively and threw away most of it.

Why vLLM exists

The two facts from the background section — the KV cache is the dominant memory consumer, and requests in the same batch finish at different times — interact badly under a naive design.

The fragmentation problem. Pre-vLLM systems allocated each request's KV cache as one contiguous chunk of GPU memory, sized for the maximum response length the request might produce. A request that could generate 4096 tokens got 4096 tokens' worth of slots, even if it actually only generated 200. The leftover slots couldn't be given to other requests because the memory was already claimed. The paper measures that real systems used only 20–38% of the KV cache budget for actual token state. The rest was waste.

The batching problem. A user asking for a one-line answer and a user asking for a 4,000-token essay live in the same batch for a while, then one of them should leave. Naive batching pads every request to the longest length and waits for the slowest one — wasting compute, latency, and memory.

By 2022 the second problem had a known fix: iteration-level scheduling, where the server admits and retires requests at every model step instead of every batch. The Orca paper (OSDI '22) introduced it. But Orca and contemporary systems still allocated KV cache as one contiguous slab per request — the first problem was untouched.

vLLM is the synthesis: keep iteration-level scheduling, and treat the KV cache the way an operating system treats process memory — paged, allocated lazily as the request grows, reference-counted, copy-on-write. The headline result is 2–4× higher throughput at equivalent latency. The deeper claim is that LLM serving is mostly a memory-allocator problem: once you accept that frame, almost every other feature (prefix caching, chunked prefill, parallel sampling) falls out of the same block-allocator interface.

Origin

vLLM came out of the Sky Computing Lab at UC Berkeley (Woosuk Kwon, Zhuohan Li, et al.), published at SOSP '23, and was deployed as the backend for LMSYS Chatbot Arena and Vicuna in mid-2023. The project is now Apache-2.0 and the dominant open-source LLM inference engine, maintained by a community of 2000+ contributors across Berkeley, the LF AI & Data foundation, and most major model labs.

Mental model in 60 seconds

A running vLLM server looks like this:

A running vLLM server. Two host processes, then N worker processes (one per GPU on a node).

A few things worth noting up front:

The HTTP server and the engine run as separate OS processes. The "API server" handles HTTP, tokenization (string → token IDs), and other preprocessing. The "EngineCore" handles scheduling and the GPU forward pass. They talk over ZMQ sockets. The split matters because tokenization is CPU-bound and Python's GIL would otherwise make it contend with the engine; keeping it in its own process keeps the GPU-driving loop off the critical path.
The scheduler and KV cache manager are centralized — one of each, no matter how many GPUs. Even with 8 GPUs running the same model, a single scheduler decides each step what runs, and a single KV cache manager hands out block IDs. The workers (one per GPU) receive the resulting block tables and execute.
The worker is where the model actually runs on a GPU. Each worker owns one GPU, holds a copy of the model weights (or its share of them — see Distributed inference), and runs the forward pass plus the attention math.
There is no separate "control plane" daemon. The HTTP server is the orchestrator, and the engine is its only dependency. Multi-node deployments add a coordinator (Ray) on top, but the single-node case is just two processes.

The KV cache problem

To understand the size of the problem, look at where GPU memory actually goes on a representative setup: a 40 GB A100 running a 13-billion-parameter model.

Pre-vLLM KV cache utilization, even with an oracle, never crosses 40%. vLLM eliminates reservation and external fragmentation.

Two numbers tell the story:

~30% — the fraction of GPU memory left for KV cache after model weights are loaded. Weights are static and non-negotiable; the KV cache is where the design has any choice at all.
20–38% — the fraction of that 30% that pre-vLLM systems used for actual token state. The rest was waste, split across three flavors:

Where the waste came from	Why it happened
Slots reserved for tokens that never get generated	The system allocates room for the worst-case response length up front. A 200-token reply on a model that supports 4096-token contexts is paying for 4096 slots.
Padding inside an allocation	If the allocator rounds up to a power of two, a 257-token request takes 512 slots.
Gaps between requests' allocations	Classic memory-allocator fragmentation. Request A finishes, leaving a 600-token-sized hole between B and C that can't accept the 800-token request now arriving.

The PagedAttention design follows from one observation: all three of these waste categories vanish if KV cache is allocated in small, uniform blocks instead of one contiguous chunk per request. Reserved-but-unused slots go away because you allocate one block at a time, only when a request needs it. Padding shrinks to at most one block per request (the partially-filled last one). Gaps go away because any free block fits anywhere — they're all the same size.

PagedAttention

The technique is directly modeled on OS virtual memory. The vLLM paper labels its own diagrams with the OS terms, and the correspondence is one-to-one:

OS virtual memory	vLLM PagedAttention
Process	Request (one user's in-progress generation)
Page	KV block — a small fixed-size slot holding the cached state for ~16 tokens
Page table	Block table — one per request, mapping "logical block N" → "physical block address X" in GPU memory
Physical frame	Physical KV block in GPU memory
Lazy allocation	The allocator hands out a new physical block only when a request fills up its current one
Copy-on-write	Same: when one request needs to mutate a block another is reading, the block is copied and the writer's table updated
Page sharing (shared libraries)	Prefix caching — two requests with the same prompt prefix can share the same physical blocks for that prefix

A KV block typically holds the cached state for 16 tokens — small enough that the leftover space in a partially-filled block is negligible, large enough that the block-table lookups don't dominate.

The cost of the design is that the GPU code computing attention now has to walk a per-request lookup table instead of reading one contiguous region of memory. The paper measures this at about 20% slower on the attention math itself — but attention is only one slice of the total forward pass, and the memory savings let you batch 2–4× more requests through, so end-to-end throughput goes up sharply.

Two requests sharing a prompt prefix. Their block tables point into the same physical pool; when one diverges, copy-on-write splits the affected block.

Block-level memory management

Sitting underneath PagedAttention is a small allocator that looks a lot like a slab allocator: a pool of identically-sized blocks, three operations on them, and a reference count on each. The whole interface is:

Operation	What it does	Used by
`allocate`	Hand out a free block — unless an existing block already contains the same content (prefix cache hit), in which case just bump that block's reference count.	New requests starting up; running requests that fill their current tail block and need another.
`fork`	Create a second logical reference to a block. Both holders share it read-only.	Generating multiple alternative completions for the same prompt.
`free`	Drop a reference. When the count reaches zero, the block goes onto an LRU list for later reuse. The block isn't zeroed — its content hash is retained so a future request can rediscover it via prefix caching.	Requests finishing or being evicted.

Reference counting and copy-on-write

Every physical block carries a reference count. If two requests share a block (because they had the same prompt prefix, or because one was forked from the other), the count is 2. Writes to a shared block trigger copy-on-write — exactly the same pattern Linux uses for fork(): allocate a new block, copy the contents over, decrement the old block's count, and update the writer's lookup table to point at the new block.

This pays off whenever a workload involves sequences that share a prefix. The paper reports up to 55% memory savings on workloads that ask the model to generate several alternative completions of the same prompt — the prompt's blocks are all shared until one continuation diverges from the others.

Copy-on-write only triggers on the diverging block; all earlier prefix blocks stay shared.

Eviction

When a new request needs a block but the pool is empty, the allocator evicts. Free blocks (reference count zero) sit on an LRU list; the least-recently-used one gets recycled first. Crucially the block isn't zeroed on free — its content hash is kept, so if the same prompt prefix shows up later, the block can be resurrected without recomputing it.

Subtle detail

The very last block of a request — the one only partially filled with that request's specific final tokens — is least likely to be reused by anyone else. The eviction policy is biased to throw tail blocks away first, so blocks belonging to shared prefixes (system prompts, document Q&A contexts) tend to stay cached longer.

Continuous batching & the scheduler

"Continuous batching" is the technique of revisiting the batch on every model step instead of every request. vLLM didn't invent it (the Orca paper, OSDI '22 did), but it's what vLLM's block allocator enables: because allocating and freeing memory is now cheap, the scheduler can afford to make a fresh decision every step.

What "every step" means

A model forward pass takes roughly 10–80 ms depending on model size and batch. That is the scheduling unit. Every step:

The scheduler picks a set of in-flight requests and decides how many tokens each one will process this step. The output is essentially a dict of {request_id: num_tokens}.
The allocator hands out any blocks those requests need.
The worker runs one model forward pass, producing the next token for each request.
The scheduler updates state: requests that hit a stop condition are removed; new requests waiting in the queue are admitted if memory allows.

The unified {request_id: num_tokens} representation is a deliberate design point that matters more than it looks. The next section explains why.

What if memory runs out?

What happens when a request mid-generation needs another block but the pool is full? The answer is recompute: drop the request's KV cache, put the request back on the waiting queue, and re-run its prefix from scratch when it's rescheduled. The recompute is cheaper than it sounds — the prefix is processed in one shot, in parallel, which the GPU is fast at — and combined with prefix caching, an evicted request's blocks often survive on the LRU list anyway, so the "recompute" is mostly a cache hit.

A continuous-batching timeline. Step 1 runs R2's first prefill chunk alongside R1's and R4's decodes; R3 joins at step 3; R4 is evicted at step 4 and recomputed at step 6.

Prefill, decode, and chunked prefill

Serving a single request is not one workload — it's two, with very different shapes. Understanding them is the foundation for most LLM-serving optimization.

Prefill: processing the prompt

When a request first arrives, the system has to build up its KV cache by running the model once over the entire prompt. Because all the prompt's tokens are known up front, this single forward pass processes them in parallel. The GPU does a lot of computation per byte of model weight it loads — which is what the GPU is good at — so prefill is compute-bound: the bottleneck is how many math operations the GPU can do per second.

Decode: generating one token at a time

Once the prompt is processed, the request enters the autoregressive loop: produce one new token, append it, produce the next, repeat. Each new-token forward pass does only a tiny amount of math per byte of weight loaded — there's just one new token to process at this layer — so the bottleneck shifts to memory bandwidth: how fast the GPU can read its own weights from HBM. The compute units sit mostly idle, waiting for weights to arrive.

These two regimes shape the two latency metrics users see:

	Prefill	Decode
When it runs	Once, when the request arrives	Many times, once per output token
Bottleneck	GPU math throughput	HBM-to-compute memory bandwidth
Latency it affects	Time to first token (TTFT) — how long the user waits before any output appears	Time between output tokens — how fast the response "types out"
Sensitive to	Prompt length	Batch size (more requests in the same step amortize the weight loads)

Chunked prefill: mixing the two

The naive approach runs prefill and decode in separate steps. But then a new arriving request triggers a prefill step that pauses every in-flight decode — the bigger the prompt, the longer existing users wait for their next token — and conversely, decode-only steps leave the GPU's math units mostly idle.

Chunked prefill solves both problems. A long prefill (say, 4096 tokens) is split into pieces (say, 1024 tokens each), and each piece runs in the same model step as the ongoing decodes for other requests. Because prefill is compute-heavy and decode is memory-bandwidth-heavy, mixing them in one step uses both kinds of GPU resource — prefill fills the compute units that decode leaves idle, and decode rides along on the weight loads prefill is already paying for. End result: the GPU stays busy on both axes at once, and no user is starved.

This works cleanly because of the unified {request_id: num_tokens} scheduler. To the scheduler, a request processing the first 1024 tokens of its prompt and a request generating its 50th output token look identical — both are just numbers in the dict. Chunked prefill is on by default.

Automatic prefix caching (APC)

If two requests start with the same prompt prefix, the KV cache for that prefix is bit-for-bit identical between them. The block design exists in part to share it. Automatic prefix caching is the feature that makes that sharing happen with no help from the caller.

How blocks are identified

Each block is identified by a hash that depends on (a) its tokens and (b) the hash of the block before it. Because of the chain, two requests with the same first N tokens produce the same first N block hashes; the (N+1)th hashes diverge as soon as a token differs. A global hash → block lookup table makes the check O(1): when a new request arrives, the allocator walks the prompt's blocks, looks each one up by hash, and reuses any block that already exists.

The hash also incorporates a few "extras" — the active LoRA adapter ID, image/audio input hashes, an optional per-tenant "salt" string — so two requests that differ only on those don't accidentally share cache.

When it pays

Three workload shapes get most of the benefit:

System prompts. Every request to a chat endpoint shares the same multi-thousand-token "you are a helpful assistant…" prefix. Cache it once, share it across every request.
Few-shot prompting. Requests that include several example input/output pairs before the actual user input. The paper reports up to 3.58× throughput over Orca-Oracle on a translation benchmark with a 5-example shared prefix.
Multi-turn chat. Turn N reuses the KV cache built up over turns 1…N−1. Without prefix caching, every turn pays to re-process the entire conversation. With it, only the new user message gets processed.

The largest single case by sheer KV volume is document Q&A: a 50,000-token document is processed once, then every subsequent question against it reuses the same cache.

In practice

The bookkeeping is designed so the per-block hash lookup costs effectively nothing even when nothing matches — the official docs put it at "near-zero, even when the cache hit rate is 0%." So prefix caching is on by default, and most users never need to think about it.

Distributed inference

An H100 GPU has 80 GB of HBM. A LLaMA-3-70B model is about 140 GB before any cache. For big models, you must split the model across multiple GPUs. The two standard ways to split a dense model:

Tensor parallelism (TP) — every GPU runs every layer, but each one does only a slice of the work, and they synchronize at every layer boundary. Analogy: a team of cooks each preparing one quadrant of every dish and combining at every step. Each GPU has 1/N of the model weights and computes 1/N of each layer's output, then they share their partial results so everyone can continue.
Pipeline parallelism (PP) — different GPUs hold different layers, and a request flows through them in sequence. Analogy: an assembly line — GPU 1 runs layers 1–10, GPU 2 runs layers 11–20, and so on. Only the activations between layers travel between GPUs, so the bandwidth requirements are much lower.

Where each one wins: TP needs a fast interconnect between GPUs (because they sync often) but keeps end-to-end latency low. PP tolerates slower interconnects but adds latency proportional to the number of stages. The rule of thumb from the official distributed-inference blog: "use pipeline parallelism across nodes and tensor parallelism within nodes for slow interconnects; extend tensor parallelism across nodes if interconnects are efficient." In practice: TP within a server (typically 8-way over NVLink), PP across servers when one server isn't enough.

Expert parallelism (for Mixture-of-Experts models)

Most frontier models today are Mixture-of-Experts (MoE) models — DeepSeek-V3, Mixtral, Qwen-MoE, Llama 4. Instead of one big feed-forward block that every token passes through, an MoE layer holds many smaller "experts" (8, 128, even 256 of them) plus a tiny router that sends each token to just a few — often the top 2 to 8. The model's total parameter count is enormous, but only a small slice actually runs for any given token. That's what makes these models affordable to serve at their size, and it changes how you split them across GPUs.

Analogy: a hospital with hundreds of specialists. Triage (the router) sends each patient (token) to the two relevant specialists rather than parading them past every doctor. The hospital is huge, but each patient consumes only two specialists' time.

Expert parallelism (EP) spreads the experts across GPUs — GPU 0 owns experts 0–63, GPU 1 owns 64–127, and so on. Because the router can send any token to any expert, each MoE layer needs an all-to-all exchange: every GPU ships each of its tokens to whichever GPU owns the expert it was routed to, the experts run, then a second all-to-all sends the results back. That all-to-all shuffle is a different communication pattern from tensor parallelism's all-reduce (a sum), and it is usually the bottleneck for MoE serving — which is why vLLM invests in fast all-to-all kernels and "wide EP" deployments that fan a model's experts across dozens of GPUs.

Expert parallelism. Each token is routed to a few of the many experts, shuffled to whichever GPU owns them via an all-to-all, processed, then shuffled back.

EP is combined with the other strategies rather than replacing them: in a typical large MoE deployment, the attention layers (which are dense) run data- or tensor-parallel while the MoE layers run expert-parallel, all in one server group. The flag is --enable-expert-parallel.

One scheduler, many workers

One architectural choice deserves attention. Even with 8 GPUs serving the same model, vLLM keeps a single scheduler and a single KV cache manager. Every step:

The scheduler decides which requests run and computes their block tables.
It broadcasts the resulting block tables to all GPU workers.
Each worker holds the same set of block IDs but only its slice of each block's data. The forward pass runs, the workers share partial results at the synchronization points, and the step ends.

Workers don't negotiate among themselves about which blocks to use — they all trust the central scheduler's assignment. This is what keeps the design tractable. Coordinating block allocation between 8 independent caches per node would be a distributed-systems problem; centralizing it is a single-process bookkeeping problem.

Tensor parallelism in vLLM. One central scheduler hands block tables to all GPUs; each GPU holds the same block IDs but only its share of each block's contents.

Executor backends

Spawning and managing worker processes is the job of the "executor." vLLM has three:

uniproc — one process, one GPU. Simplest case, used for single-GPU deployments.
multiproc — Python multiprocessing, one worker process per GPU on a single machine. The default for tensor parallelism within one server.
Ray — uses the Ray distributed runtime to spawn and coordinate workers across multiple machines. Required when one server isn't enough GPU.

Speculative decoding

Decoding spends most of its time loading weights from memory, not computing. Speculative decoding uses that slack: a cheap predictor proposes several next tokens at once, and the main model verifies them in one forward pass. If the proposal was right, multiple tokens come out of the same step. If not, the first wrong token is corrected. The main model runs one pass per step either way — but the average tokens-per-step is higher than 1.

The predictor can be:

A smaller "draft" model (e.g., a 7B model drafting for a 70B model). Up to 1.5× speedup.
A pure pattern match against the prompt (no model at all). Surprisingly strong when the response repeats material from the input, like summarization or code editing — up to 2.8× speedup on summarization.
Extra "heads" bolted onto the main model that predict multiple next tokens directly (EAGLE, MEDUSA in the literature).

Where it stops helping

The technique only pays when the GPU has spare compute. At high request rates, the GPU is already saturated by the existing batch, and the extra work of verifying speculative tokens competes for the same compute. The official blog reports 1.4× to 1.8× slowdowns at high request rates. Speculative decoding is a latency optimization for low-throughput regimes; it can hurt at high throughput.

Why the design composes

The features described above — chunked prefill, prefix caching, speculative decoding, plus multimodal inputs and multi-LoRA serving — are not bolted-on extras. They all fall out of one design decision, and it's worth making explicit because it's the kind of thing system-design interviews reward.

The scheduler represents every step as a single uniform primitive: a {request_id: num_tokens} dictionary. There is no separate "prefill mode" and "decode mode," no special batch shape for one or the other. A request processing the first 1024 tokens of its prompt and a request generating its 50th output token are both just an entry in that dict.

Because every request is the same kind of object to the scheduler, the features stack instead of fighting:

Chunked prefill is just allowing a request's entry to be a partial token count.
Prefix caching is just letting a request start with some of its blocks already populated.
Speculative decoding is just letting a request advance by more than one token in a step.

Each is a small variation on the same uniform unit, which is why they combine without a combinatorial explosion of special cases. The general lesson: the right core abstraction for the scheduler determines what can later become a feature. A system whose central primitive makes new features fight each other accumulates complexity that no amount of later cleanup can fully undo — so the primitive is worth getting right early.

vLLM vs TensorRT-LLM, SGLang, TGI

An honest comparison is harder than it looks. The vLLM team published throughput numbers against TGI and FasterTransformer in the 2023 paper, but does not publish current head-to-head benchmarks against TensorRT-LLM or SGLang in any primary source. So what follows is the architectural differences, not relative speed numbers.

	vLLM	TensorRT-LLM	SGLang	HF TGI	llama.cpp
KV cache scheme	PagedAttention (blocks)	Paged KV (NVIDIA's implementation)	RadixAttention (radix tree of prefix sharing)	Paged KV	One buffer per sequence
Hardware	NVIDIA + AMD + Intel + TPU + AWS Inferentia	NVIDIA only	NVIDIA + AMD	NVIDIA + AMD + Gaudi	CPU + most consumer GPUs
Code generation	JIT-compiled with PyTorch tooling	Ahead-of-time compiled into a TensorRT engine	JIT-compiled with custom kernels	Eager or JIT-compiled	Hand-written C++
Scheduler granularity	Per-step, unified prefill/decode	Per-step	Per-step, radix-tree-aware	Per-step	Single request at a time
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0	MIT

A few framings of the choices:

vLLM vs TensorRT-LLM: vendor-neutral vs NVIDIA-only. TensorRT-LLM compiles each model into a hardware-specific binary ahead of time, which lets it optimize aggressively for a specific GPU but locks you to NVIDIA and adds a build step. vLLM trades some peak throughput on NVIDIA for portability across vendors and a single binary that loads any model from HuggingFace at runtime.
vLLM vs SGLang: different prefix-sharing data structures. vLLM uses a flat pool of blocks with hash-based lookup; SGLang organizes the same idea as a radix tree, which can be more efficient for workloads with many branching shared prefixes (LLM agents, programmatic prompting). Both are in active development.
vLLM vs TGI: TGI was the prior open-source state of the art (the 2023 vLLM paper reports 3.5× higher throughput than TGI). HuggingFace's strength is tight integration with the HuggingFace model hub; TGI is still the easiest way to serve a model directly from a HuggingFace repo. vLLM has since overtaken on throughput and feature surface.
vLLM vs llama.cpp: different problems. llama.cpp targets local/laptop inference on CPU and consumer GPUs with aggressive quantization; vLLM targets multi-tenant datacenter serving. They share almost no design decisions.

Where vLLM fits

The primary sources name a few production deployments and several partnerships. The picture is incomplete by design — most large production users haven't published.

LMSYS Chatbot Arena and Vicuna. The 2023 vLLM blog reports vLLM as the backbone behind LMSYS Vicuna and Chatbot Arena starting mid-April 2023, handling "60K daily requests" while "reducing the GPU usage by half" vs the prior HuggingFace-based setup. The first real production-scale deployment.
NVIDIA. Co-optimization partnership; recent blogs report vLLM-on-Blackwell numbers from joint work.
DeepSeek. The wide-expert-parallel blog reports serving DeepSeek-V3 (a 256-expert MoE model) at 2,200 tokens/sec on H200 using vLLM's wide expert-parallel deployment — experts fanned across many GPUs, as described above.
Sky Lab origin. Born in the Sky Computing Lab at UC Berkeley (the lab that also produced Ray, Mesos, and Spark).
Community. The README cites 2,000+ contributors. It is the default OpenAI-API-compatible serving stack for a huge number of self-hosted deployments.

The OpenAI-compatibility layer matters more than any single name. vLLM exposes the same HTTP endpoints as OpenAI's API (/v1/chat/completions, etc.), so the OpenAI client libraries work against a vLLM endpoint with no code changes — only the base URL changes. This is the protocol most LLM applications were built against; vLLM inherits that ecosystem for free.