Large language models are transforming every corner of software, but serving them in production is brutally expensive. A single LLaMA-13B model can consume over 26 GB of GPU memory just for its weights, and that is before accounting for the memory needed to actually process requests. When dozens or hundreds of users are hitting your model simultaneously, memory becomes the bottleneck that determines how many requests you can serve — and how much each one costs.
vLLM tackled this head-on. By rethinking how GPU memory is managed during inference, vLLM achieved 2–4x higher throughput than state-of-the-art serving systems such as FasterTransformer and Orca, and up to 24x over the stock Hugging Face Transformers text-generation pipeline, without changing the model weights or sacrificing output quality. The core insight? The same virtual memory and paging techniques that operating systems have used for decades can be applied to LLM serving.
In this first part, we will break down why LLM serving is so memory-hungry, what exactly gets wasted, and how PagedAttention — the algorithm at the heart of vLLM — fixes it.
This is Part 1 of a 2-part series on vLLM:
- Part 1: PagedAttention & the LLM Serving Problem (You are here)
- Part 2: Practical Serving & Walkthrough
- The LLM Inference Bottleneck
- The Waste Problem
- PagedAttention: Virtual Memory for KV Caches
- Copy-on-Write and KV Cache Sharing
- Continuous Batching vs Static Batching
- Putting It All Together
- Useful Resources
The LLM Inference Bottleneck
Unlike classification models that process an input and return a single answer, large language models generate text one token at a time. Each new token depends on every token that came before it. This autoregressive generation loop is inherently sequential and creates a unique memory challenge.
How Autoregressive Generation Works
At each step of generation, the model performs attention over all previous tokens. Naively, this would mean recomputing representations for the entire sequence at every step — an \( O(n^2) \) cost that grows quadratically with sequence length. To avoid this, modern implementations cache the key and value tensors from previous steps. This is the KV cache.
The KV cache allows each new token to attend to all previous tokens by simply looking up stored key-value pairs rather than recomputing them. This trades memory for compute — a worthwhile tradeoff, until memory runs out.
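To make the tradeoff concrete, here is a minimal single-head decode loop in plain NumPy. It is an illustrative sketch, not code from any inference framework: the dimensions and random projections are made up, but the pattern of appending one key/value pair per step and reusing the rest is exactly what the KV cache does.

```python
import numpy as np

d = 64                        # dimension per head (illustrative)
k_cache, v_cache = [], []     # the KV cache: one key and one value per past token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(q, k, v):
    """Attend the new token's query over all cached keys/values."""
    k_cache.append(k)          # this step's key/value is computed once and stored
    v_cache.append(v)
    K = np.stack(k_cache)      # (seq_len, d): looked up, not recomputed
    V = np.stack(v_cache)
    weights = softmax(K @ q / np.sqrt(d))
    return weights @ V         # attention output for the new token only

# Each iteration pays for one new K/V projection; everything else is a cache lookup.
for _ in range(5):
    q, k, v = (np.random.randn(d) for _ in range(3))
    out = decode_step(q, k, v)
```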
How Large is the KV Cache?
For a single request, the KV cache memory grows as:
\( \text{KV cache size} = 2 \times L \times H \times d \times s \times \text{sizeof(dtype)} \)
Where:
- \( L \) = number of layers
- \( H \) = number of attention heads
- \( d \) = dimension per head
- \( s \) = sequence length
- Factor of 2 accounts for both keys and values
For a 13B parameter model (like LLaMA-13B) with 40 layers, 40 heads, dimension 128, and a sequence length of 2048 tokens in FP16:
\( 2 \times 40 \times 40 \times 128 \times 2048 \times 2 \text{ bytes} \approx 1.6 \text{ GB} \)
That is 1.6 GB per request. Serving 10 concurrent requests needs 16 GB for KV caches alone, on top of the 26 GB for model weights. Even on an 80 GB A100, only around 30 full-length requests fit in theory, and naive systems manage far fewer in practice once activation memory and the fragmentation discussed below are factored in.
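The arithmetic is easy to reproduce. Below is a small Python helper that evaluates the formula above and estimates how many full-length requests fit next to the weights; it is illustrative only and not part of vLLM's API.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 for keys and values; FP16 means 2 bytes per element.
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# LLaMA-13B-like configuration from the text
per_request = kv_cache_bytes(layers=40, heads=40, head_dim=128, seq_len=2048)
print(f"KV cache per full-length request: {per_request / 1e9:.2f} GB")   # ~1.68 GB

gpu_gb, weights_gb = 80, 26
max_requests = int((gpu_gb - weights_gb) * 1e9 // per_request)
print(f"Full-length requests that fit (ignoring activations): {max_requests}")  # ~32
```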
This is the fundamental bottleneck: the number of requests you can serve simultaneously is limited by how efficiently you manage KV cache memory.
The Waste Problem
If the KV cache is already large, you might expect that systems would manage it carefully. In practice, most serving systems before vLLM were shockingly wasteful.
How Naive Systems Allocate KV Cache
The standard approach (used by Hugging Face Transformers, FasterTransformer, and others) is to pre-allocate a contiguous block of memory for each request’s KV cache based on the maximum possible sequence length. Think of it like a restaurant that reserves the largest table in the house for every guest, even if they are dining alone.
This creates three types of waste:
1. Internal Fragmentation
When a request’s actual output is shorter than the maximum allocated length, the unused memory at the end of its allocation is wasted. If you allocate 2048 token slots but the request only generates 347 tokens, 83% of that allocation sits empty.
2. External Fragmentation
As requests of different sizes come and go, the free memory becomes scattered into small non-contiguous chunks. Even if the total free memory is sufficient for a new request, there may be no single contiguous region large enough to hold it.
3. Reservation Waste
Since the output length is unknown in advance, systems must reserve memory for the worst case. A request that might generate 50 tokens gets the same 2048-token allocation as one that generates 2000. This reservation happens up front, blocking other requests from using that memory.
Naive KV Cache Allocation (per request):
┌──────────────────────────────────────────────────────────┐
│  Prompt KV   │  Generated KV  │   Reserved but unused    │
│   (used)     │     (used)     │        (wasted)          │
├──────────────┼────────────────┼──────────────────────────┤
│  ██████████  │    ████████    │  ░░░░░░░░░░░░░░░░░░░░░░  │
│  256 tokens  │   200 tokens   │   1592 tokens reserved   │
└──────────────────────────────────────────────────────────┘
← max_seq_len = 2048 →
█ = actively used memory ░ = allocated but wasted
The vLLM paper measured that in typical serving scenarios, 60–80% of KV cache memory is wasted due to these three types of fragmentation. This means you are paying for 3–5x more GPU memory than you actually need, directly limiting your serving throughput and driving up costs.
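To see how quickly that waste accumulates, here is a back-of-the-envelope script with made-up sequence lengths (an illustration, not a measurement from the paper):

```python
# Hypothetical actual lengths (prompt + generated tokens) for six requests
actual_lengths = [347, 512, 1800, 96, 640, 220]
max_seq_len = 2048                        # every request pre-reserves this many slots

reserved = len(actual_lengths) * max_seq_len
used = sum(actual_lengths)
print(f"Token slots reserved: {reserved}")            # 12288
print(f"Token slots used:     {used}")                # 3615
print(f"Wasted: {100 * (1 - used / reserved):.0f}%")  # ~71%, squarely in the 60-80% range
```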
PagedAttention: Virtual Memory for KV Caches
The central innovation of vLLM is PagedAttention, an attention algorithm that borrows the idea of virtual memory and paging from operating systems.
The OS Paging Analogy
In an operating system, processes do not get large contiguous blocks of physical RAM. Instead:
- Memory is divided into fixed-size pages (typically 4 KB).
- Each process has a page table that maps virtual addresses to physical page locations.
- Pages can be stored anywhere in physical memory — they do not need to be contiguous.
- Pages are allocated on demand as the process needs them, not all up front.
This is exactly what PagedAttention does for KV caches.
How PagedAttention Works
In PagedAttention, the KV cache for each request is divided into fixed-size blocks, where each block holds the key-value pairs for a fixed number of tokens (e.g., 16 tokens per block).
The key data structure is the block table — one per request — which maps logical block indices to physical block locations in GPU memory:
Block Table for Request A:
┌──────────────┬──────────────────────┐
│ Logical Block│ Physical Block │
├──────────────┼──────────────────────┤
│ 0 │ Physical Block 7 │
│ 1 │ Physical Block 1 │
│ 2 │ Physical Block 3 │
└──────────────┴──────────────────────┘
GPU Memory (Physical Blocks):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ... │
│free │ A-1 │free │ A-2 │free │free │free │ A-0 │ │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
When computing attention, the algorithm looks up the block table to find where each sequence’s KV pairs physically reside, fetches them from potentially scattered locations, and computes attention as usual. The blocks do not need to be contiguous in memory.
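The following toy model shows the bookkeeping in plain Python. It is a conceptual sketch with illustrative names, far simpler than vLLM's real block manager and PagedAttention kernels.

```python
BLOCK_SIZE = 16                            # tokens per KV block

class BlockAllocator:
    """Hands out fixed-size physical blocks from a global free pool."""
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self):
        return self.free_blocks.pop()      # any free block will do: no contiguity needed

    def free(self, block_id):
        self.free_blocks.append(block_id)

class Sequence:
    """Per-request block table: logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []              # position in this list is the logical index
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_idx):
        # Where an attention kernel would fetch this token's K/V from.
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                        # a 40-token sequence occupies only 3 blocks
    seq.append_token()
print(seq.block_table)                     # three scattered physical blocks
print(seq.physical_location(20))           # (physical block id, slot within that block)
```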
Why This Eliminates Waste
PagedAttention addresses all three types of fragmentation:
- No internal fragmentation (except the last block): Memory is allocated in small, fixed-size blocks. Only the very last block of a sequence can have unused slots, wasting at most \( \text{block\_size} - 1 \) token slots per request (typically under 4%, versus 60%+ with contiguous allocation).
- No external fragmentation: Since blocks are a fixed size and can be placed anywhere, there is no need for large contiguous regions. Any free block can serve any request.
- No reservation waste: Blocks are allocated on demand as the sequence grows, not pre-allocated for the maximum possible length. A request that generates 50 tokens only uses 4 blocks (at 16 tokens/block), not 128 blocks.
PagedAttention KV Cache (same 456-token request as before):
┌────────┬────────┬────────┬─────────┬───────────────────────────┐
│Block 0 │Block 1 │  ...   │Block 28 │ Not allocated: free for   │
│(full)  │(full)  │        │(partial │ other requests!           │
│16 tok  │16 tok  │        │ 8 tok)  │                           │
└────────┴────────┴────────┴─────────┴───────────────────────────┘
Used: 456 tokens across 29 blocks
Wasted: only 8 unused slots in the last block
The result: vLLM achieves near-optimal memory utilization — typically over 96% — compared to 20–40% with naive approaches. This means you can fit 2–4x more concurrent requests in the same GPU memory.
Copy-on-Write and KV Cache Sharing
PagedAttention unlocks another powerful optimization: memory sharing between requests that have common prefixes.
Parallel Sampling
When you generate multiple outputs from the same prompt (e.g., n=4 in the API), all outputs share the same prompt. With contiguous allocation, each output gets its own full copy of the prompt’s KV cache. With PagedAttention, all outputs can share the same physical blocks for the prompt portion by pointing their block tables to the same physical blocks.
Parallel Sampling (4 outputs from same prompt):
Prompt Blocks (shared) Output Blocks (independent)
Request 1: [P0][P1][P2][P3] → [O1a][O1b]
Request 2: [P0][P1][P2][P3] → [O2a]
Request 3: [P0][P1][P2][P3] → [O3a][O3b][O3c]
Request 4: [P0][P1][P2][P3] → [O4a]
Physical memory for prompt: stored ONCE, referenced 4 times
Memory saved: 3 × (prompt KV cache size)
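In terms of the toy model above, sharing is nothing more than several block tables pointing at the same physical block ids, plus a reference count so the allocator knows when a block is genuinely free. Again an illustrative sketch, not vLLM's internals:

```python
ref_count = {}                              # physical block id -> number of sequences using it

def fork(parent_block_table):
    """Create a child sequence that shares all of the parent's physical blocks."""
    for block_id in parent_block_table:
        ref_count[block_id] = ref_count.get(block_id, 1) + 1   # the parent counts as 1
    return list(parent_block_table)         # copy the table, not the blocks

prompt_blocks = [7, 1, 3, 9]                # prompt KV stored once in these physical blocks
samples = [fork(prompt_blocks) for _ in range(4)]
# Four block tables, one physical copy of the prompt's KV cache.
```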
Copy-on-Write (CoW)
What happens when generation proceeds and different outputs diverge? vLLM uses copy-on-write (CoW), the same technique the OS uses for fork(). When multiple sequences share a block and one of them needs to write to it (because its newly generated tokens land in that shared block), only then is the block copied. Until that point, the physical memory stays shared.
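Continuing the same sketch, copy-on-write is a reference-count check at write time: copy the block into a fresh one only if someone else still shares it, then redirect this sequence's block table entry. The helper below is illustrative; the commented-out copy_kv stands in for the actual on-GPU copy.

```python
def write_token(block_table, logical_idx, allocator):
    """Append a token into a sequence's last block, which may still be shared."""
    block_id = block_table[logical_idx]
    if ref_count.get(block_id, 1) > 1:      # shared: copy before writing
        new_block = allocator.allocate()
        # copy_kv(src=block_id, dst=new_block)   # hypothetical: small device-side memcpy
        ref_count[block_id] -= 1            # the other sequences keep the original
        ref_count[new_block] = 1
        block_table[logical_idx] = new_block
    # ...then write the new token's K/V into block_table[logical_idx]
```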
This is particularly effective for:
- Beam search: All beams share prompt blocks and progressively diverge, with CoW minimizing duplication.
- Parallel sampling: Multiple completions from the same prompt share most of their KV cache.
- System prompts: When many requests share the same system prompt, the KV cache for that prefix can be computed once and shared.
The vLLM paper showed that for beam search with a beam width of 4, PagedAttention with CoW reduces KV cache memory usage by up to 55% compared to even an ideal contiguous allocation scheme.
Continuous Batching vs Static Batching
PagedAttention handles memory allocation, but there is another source of waste in LLM serving: how requests are batched together.
The Problem with Static Batching
In traditional (static) batching, a batch of requests is processed together, and the entire batch waits until the longest sequence finishes generating:
Static Batching:
Request A: ████████████░░░░░░░░░░░░░░ (done at step 12, waits)
Request B: ████████████████████████░░ (done at step 24, waits)
Request C: ██████████████████████████ (done at step 26)
Request D: ████████░░░░░░░░░░░░░░░░░░ (done at step 8, waits)
──────────────────────────→ time
█ = GPU actively generating ░ = GPU idle, waiting for batch to finish
Request A finishes at step 12, Request D at step 8, but both sit idle until Request C completes at step 26. During this time, those GPU resources are wasted and no new requests can enter the batch.
How Continuous Batching Works
Continuous batching (also called iteration-level scheduling) removes this constraint. The scheduler checks the batch at every iteration and:
- Removes sequences that have finished generating.
- Inserts new sequences from the waiting queue into the freed slots.
Continuous Batching:
Request A: ████████████
Request D: ████████
Request E:         ██████████████     (enters when D finishes)
Request B: ████████████████████████
Request F:             ████████████   (enters when A finishes)
Request C: ██████████████████████████
──────────────────────────→ time
No idle GPU cycles — finished requests are immediately replaced.
The GPU stays fully utilized because there is no waiting for the slowest request. In practice, continuous batching improves throughput by up to 2x over static batching on its own.
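The core of iteration-level scheduling fits in a few lines. This is a deliberately simplified sketch with hypothetical callbacks (has_free_blocks, is_finished, step_engine); vLLM's actual scheduler also juggles prefill versus decode, priorities, and preemption.

```python
from collections import deque

waiting = deque()              # requests not yet admitted to the batch
running = []                   # requests currently being decoded

def step_engine(batch):
    """Placeholder for one forward pass that generates one token per sequence."""
    ...

def serve(has_free_blocks, is_finished):
    while running or waiting:
        # 1. Admit waiting requests while enough KV-cache blocks are available.
        while waiting and has_free_blocks(waiting[0]):
            running.append(waiting.popleft())

        # 2. Run one decode iteration for the whole batch.
        step_engine(running)

        # 3. Retire finished sequences immediately; their blocks go back to the pool
        #    and the freed slots are refilled at the very next iteration.
        for seq in list(running):
            if is_finished(seq):
                running.remove(seq)
```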
vLLM Combines Both
vLLM uses continuous batching together with PagedAttention. The combination is powerful:
- PagedAttention ensures that adding or removing requests from the batch is cheap (just allocate or free blocks — no memory reshuffling).
- Continuous batching ensures that the GPU is never idle waiting for slow requests.
- Together, they enable larger effective batch sizes with higher GPU utilization, multiplying the throughput gains.
Putting It All Together
Let us trace through what happens when vLLM handles a burst of requests:
1. Requests arrive. Each is queued by the scheduler.
2. Scheduler selects a batch of requests to process together, based on available GPU memory (tracked at block granularity).
3. Prompt processing (prefill): For new requests, the model processes the entire prompt in one forward pass. KV cache blocks are allocated on demand as needed.
4. Token generation (decode): The model generates one token per iteration per request. New blocks are allocated as sequences grow.
5. Request completes: When a request hits its stop token or max length, its blocks are freed immediately. The scheduler can instantly slot in a waiting request.
6. Memory pressure: If GPU memory is full, vLLM can preempt lower-priority requests by swapping their KV cache blocks to CPU memory, making room for higher-priority requests. When resources free up, the preempted requests are swapped back in and resume exactly where they left off.
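To make the preemption step concrete, here is a rough model of block-granularity swapping. The dictionaries stand in for GPU and CPU block pools; the real engine moves tensor contents with device-to-host copies, so treat every name here as illustrative.

```python
import itertools

gpu_blocks = {}                # physical block id -> KV data (stand-in for GPU memory)
cpu_blocks = {}                # swap space in host memory
_next_cpu_slot = itertools.count()

def swap_out(block_table):
    """Preempt a sequence: move its KV blocks to CPU and remember where they went."""
    mapping = {}
    for block_id in block_table:
        cpu_slot = next(_next_cpu_slot)
        cpu_blocks[cpu_slot] = gpu_blocks.pop(block_id)   # frees the GPU block
        mapping[block_id] = cpu_slot
    return mapping             # kept around so the sequence can be restored later

def swap_in(mapping, allocate_gpu_block):
    """Resume a preempted sequence: copy blocks back and rebuild its block table."""
    new_table = []
    for cpu_slot in mapping.values():
        block_id = allocate_gpu_block()                   # any free block, in any order
        gpu_blocks[block_id] = cpu_blocks.pop(cpu_slot)
        new_table.append(block_id)
    return new_table           # generation resumes exactly where it left off
```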
The net effect:
- Memory utilization: 96%+ (vs 20–40% with contiguous allocation)
- GPU utilization: Near 100% (vs 50–70% with static batching)
- Throughput: 2–4x over state-of-the-art systems such as FasterTransformer and Orca, up to 24x over Hugging Face Transformers
All of this happens transparently. From the user’s perspective, you send requests and get responses — vLLM handles the scheduling, memory management, and batching under the hood.
In Part 2, we will get hands-on: installing vLLM, running batch inference, launching an OpenAI-compatible API server, and benchmarking throughput against Hugging Face.
This is Part 1 of a 2-part series on vLLM:
- Part 1: PagedAttention & the LLM Serving Problem (You are here)
- Part 2: Practical Serving & Walkthrough