In Part 1, we explored why LLM serving is expensive, how naive KV cache allocation wastes 60–80% of GPU memory, and how PagedAttention eliminates that waste through OS-style paging. Now it is time to put theory into practice.
This post is a hands-on walkthrough. We will install vLLM, run offline batch inference, launch an OpenAI-compatible API server, tune key configuration flags, and benchmark throughput against vanilla Hugging Face Transformers — all with complete, runnable code.
This is Part 2 of a 2-part series on vLLM:
- Part 1: PagedAttention & the LLM Serving Problem
- Part 2: Practical Serving & Walkthrough (You are here)
- Installation and Setup
- Offline Batch Inference
- OpenAI-Compatible API Server
- Key Server Configuration
- Throughput Comparison: HuggingFace vs vLLM
- When to Use vLLM (and When Not To)
- Summary
- Useful Resources
Installation and Setup
Hardware Requirements
vLLM requires a CUDA-capable NVIDIA GPU. The minimum practical setup is:
- GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, RTX 30xx/40xx)
- VRAM: Depends on your model — 7B models need ~16 GB, 13B models need ~28 GB (in FP16)
- CUDA: CUDA 12.1 or later
- Python: 3.9–3.12
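If you are not sure whether your GPU clears these bars, a quick PyTorch check reports the compute capability and total VRAM. This is a minimal sketch and assumes PyTorch is already importable (it ships as a vLLM dependency anyway):

```python
# Quick sanity check: verify a CUDA GPU is visible and meets the
# compute capability / VRAM requirements listed above.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"

major, minor = torch.cuda.get_device_capability(0)
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3

print(f"GPU: {props.name}")
print(f"Compute capability: {major}.{minor} (need 7.0+)")
print(f"VRAM: {vram_gb:.0f} GB")
```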
Installation
The simplest installation path is via pip:
```bash
pip install vllm
```
This installs vLLM along with its dependencies (PyTorch, Triton, etc.). For the latest features, you can install from the nightly build:
```bash
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
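To confirm the installation, import the package and print its version; if this runs without error, vLLM is ready to use:

```python
# Verify the install: a version string means vLLM is importable.
import vllm

print(vllm.__version__)
```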
Supported Models
vLLM supports a wide range of architectures including LLaMA, Mistral, Mixtral, Qwen, Phi, GPT-NeoX, Falcon, Gemma, and many more. The full list is maintained in the vLLM documentation. If a model uses a standard transformer architecture, vLLM likely supports it.
Offline Batch Inference
The simplest way to use vLLM is the LLM class for offline batch inference — processing a list of prompts all at once without running a server.
```python
from vllm import LLM, SamplingParams

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

# Initialize the model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Define prompts
prompts = [
    "Explain the concept of PagedAttention in two sentences.",
    "Write a Python function that checks if a number is prime.",
    "What are the main differences between TCP and UDP?",
    "Describe the process of photosynthesis for a 10-year-old.",
]

# Generate completions
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated_text!r}")
    print("-" * 60)
```
Key Parameters in SamplingParams
| Parameter | Description | Default |
|---|---|---|
| `temperature` | Controls randomness. 0 = deterministic, higher = more random. | 1.0 |
| `top_p` | Nucleus sampling: only consider tokens within this cumulative probability. | 1.0 |
| `top_k` | Only consider the top-k most likely tokens. -1 = disabled. | -1 |
| `max_tokens` | Maximum number of tokens to generate per prompt. | 16 |
| `stop` | List of strings that stop generation when encountered. | None |
| `presence_penalty` | Penalizes tokens that have appeared, encouraging diversity. | 0.0 |
| `frequency_penalty` | Penalizes tokens proportional to their frequency. | 0.0 |
| `n` | Number of output sequences to generate per prompt. | 1 |
When n > 1, vLLM uses the KV cache sharing and copy-on-write mechanism from Part 1 — the prompt’s KV cache is computed once and shared across all output sequences, only copying blocks when they diverge.
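As a quick illustration, here is a minimal sketch (reusing the `llm` object from the example above, with a hypothetical prompt) that requests three samples for a single prompt:

```python
# Request 3 completions for one prompt; the prompt's KV cache blocks are
# computed once and shared, with copy-on-write only where the samples diverge.
multi_params = SamplingParams(n=3, temperature=0.9, top_p=0.95, max_tokens=64)

multi_outputs = llm.generate(["Suggest a creative name for a coffee shop."], multi_params)
for i, candidate in enumerate(multi_outputs[0].outputs):
    print(f"Sample {i}: {candidate.text.strip()}")
```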
What happens under the hood when you call llm.generate():
- All prompts are tokenized and submitted to the vLLM engine.
- The scheduler batches them together using continuous batching.
- PagedAttention manages KV cache memory dynamically as tokens are generated.
- Results are returned once all prompts have completed generation.
Because vLLM handles batching internally, you get high throughput even if your prompts vary in length — there is no padding waste.
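One practical note before moving on: the example above passes raw text prompts. Instruct-tuned models generally respond better when prompts are wrapped in their chat template. A minimal sketch of doing that with the model's tokenizer (assuming the `transformers` tokenizer for the same checkpoint) before calling `llm.generate`:

```python
# Optional: format raw questions with the model's chat template first.
# transformers is installed alongside vLLM, so the tokenizer is available.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

chat_prompts = [
    tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for question in prompts
]

outputs = llm.generate(chat_prompts, sampling_params)
```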
OpenAI-Compatible API Server
For production use, vLLM provides an HTTP server that implements the OpenAI Chat Completions and Completions API format. This means you can use vLLM as a drop-in replacement for OpenAI’s API.
Starting the Server
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```
The server starts and logs the available endpoints. By default, it exposes:
- `/v1/completions` – Text completions
- `/v1/chat/completions` – Chat completions
- `/v1/models` – List available models
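A quick way to confirm the server is reachable is to hit `/v1/models`. A minimal sketch using the `requests` package (an assumption here; install it separately if needed):

```python
# Health check: ask the running vLLM server which models it is serving.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```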
Querying with the OpenAI Python Client
Since the API is OpenAI-compatible, you can use the official OpenAI Python library:
```python
from openai import OpenAI

# Point the client at your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention and why does it matter?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```
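The server also supports streaming. A minimal sketch with the same client, passing `stream=True` and printing text as it arrives:

```python
# Stream the completion token-by-token instead of waiting for the full response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```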
Querying with curl
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Summarize vLLM in one paragraph."}
        ],
        "temperature": 0.7,
        "max_tokens": 256
    }'
```
The server handles concurrent requests automatically, applying continuous batching and PagedAttention behind the scenes. You do not need to manage batching yourself.
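To see this in action, you can fire a burst of requests from a client-side thread pool and let the server interleave all of them in one continuous batch. A minimal sketch with hypothetical questions and 16 workers:

```python
# Send 16 concurrent chat requests; vLLM batches them on the GPU automatically.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": question}],
        max_tokens=128,
    )
    return response.choices[0].message.content

questions = [f"Give me interesting fact #{i} about GPU computing." for i in range(16)]
with ThreadPoolExecutor(max_workers=16) as pool:
    answers = list(pool.map(ask, questions))

print(f"Received {len(answers)} responses")
```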
Key Server Configuration
vLLM exposes many flags to tune the server for your hardware and workload. Here are the most important ones:
| Flag | Description | Example |
|---|---|---|
| `--model` | HuggingFace model ID or path to a local model directory. | `meta-llama/Llama-3.1-8B-Instruct` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism. Set to your GPU count. | `--tensor-parallel-size 4` |
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM can use (0.0–1.0). Lower values leave room for other processes. | `--gpu-memory-utilization 0.90` |
| `--max-model-len` | Maximum sequence length (prompt + output). Reduces memory pre-allocation if your use case has shorter sequences. | `--max-model-len 4096` |
| `--dtype` | Data type for model weights. `auto` uses the model's native dtype; `half` forces FP16; `bfloat16` forces BF16. | `--dtype auto` |
| `--quantization` | Quantization method: `awq`, `gptq`, `squeezellm`, etc. Used for pre-quantized models. | `--quantization awq` |
| `--max-num-seqs` | Maximum number of sequences in a batch. Controls the concurrency vs. memory tradeoff. | `--max-num-seqs 256` |
| `--enforce-eager` | Disables CUDA graph optimization. Useful for debugging or unsupported model architectures. | `--enforce-eager` |
Example: Multi-GPU Serving with Tuned Memory
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000
```
This serves a 70B model across 4 GPUs, using 95% of each GPU’s memory, capping context at 8192 tokens, and computing in BF16.
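If you prefer the offline `LLM` API for batch jobs, the same settings are available as constructor arguments. A sketch mirroring the command above (the keyword names correspond to the CLI flags):

```python
from vllm import LLM

# Offline equivalent of the server command above: same model, parallelism,
# memory budget, context cap, and dtype, driven from Python instead of HTTP.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    dtype="bfloat16",
)
```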
Throughput Comparison: HuggingFace vs vLLM
Let us compare throughput directly. We will generate outputs for 40 prompts using both Hugging Face Transformers (sequential) and vLLM (batched), then compare total time.
HuggingFace Transformers (Sequential Baseline)
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list using merge sort.",
    "What are the benefits of renewable energy?",
    "Describe the water cycle step by step.",
    "How does a blockchain work?",
] * 8  # 40 prompts total

max_new_tokens = 256
generated_texts = []

start_time = time.time()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
        )
    # Keep only the newly generated tokens so the token count below matches
    # what the vLLM script measures (output tokens, not prompt + output).
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    generated_texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
hf_time = time.time() - start_time

total_output_tokens = sum(
    len(tokenizer.encode(t)) for t in generated_texts
)
print(f"HuggingFace: {hf_time:.1f}s for {len(prompts)} prompts")
print(f"Throughput: {total_output_tokens / hf_time:.1f} tokens/sec")
```
Each prompt is tokenized, sent through the model, and decoded one at a time. The GPU processes a single sequence per forward pass, leaving most of its parallel capacity idle.
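For completeness, Transformers can also batch statically by padding every prompt to the longest one in the batch, which helps throughput but still spends memory and compute on padding tokens, as discussed in Part 1. A rough sketch of that variant, reusing the objects above (note the pad-token and left-padding setup that LLaMA-style tokenizers need for generation):

```python
# Static batching in Transformers: pad all prompts to a common length.
# Padding tokens still occupy KV cache memory and compute for the whole batch.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models should be left-padded

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    batch_outputs = model.generate(
        **batch,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
texts = tokenizer.batch_decode(batch_outputs, skip_special_tokens=True)
```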
vLLM (Batched Inference)
```python
import time

from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-3.1-8B-Instruct"

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

llm = LLM(model=model_name)

prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list using merge sort.",
    "What are the benefits of renewable energy?",
    "Describe the water cycle step by step.",
    "How does a blockchain work?",
] * 8  # 40 prompts total

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
vllm_time = time.time() - start_time

total_output_tokens = sum(
    len(output.outputs[0].token_ids) for output in outputs
)
print(f"vLLM: {vllm_time:.1f}s for {len(prompts)} prompts")
print(f"Throughput: {total_output_tokens / vllm_time:.1f} tokens/sec")
```
vLLM processes all 40 prompts together. The scheduler manages them as a continuous batch — some prompts finish early and free their KV cache blocks, while the remaining prompts continue generating without pause.
Expected Results
On a single A100 (80 GB), with LLaMA-3.1-8B and max_tokens=256:
| Method | Time (40 prompts) | Throughput (tokens/sec) |
|---|---|---|
| HuggingFace sequential | ~180–240s | ~50–80 |
| vLLM batch | ~30–60s | ~300–500 |
| Speedup | ~4–6x | ~4–6x |
The exact numbers vary with GPU, model, prompt length, and output length. The key observation: the speedup comes from vLLM keeping the GPU fully utilized through continuous batching and fitting more concurrent sequences through efficient memory management. On shorter outputs where the HF baseline wastes less memory, the gap narrows; on longer outputs and larger batches, it widens.
When to Use vLLM (and When Not To)
Use vLLM When:
- Serving models in production: vLLM’s API server, continuous batching, and memory efficiency make it ideal for serving endpoints.
- Batch processing: Processing large datasets of prompts benefits enormously from vLLM’s throughput optimizations.
- High concurrency: When multiple users or processes send requests simultaneously, vLLM’s scheduling shines.
- Throughput matters more than latency: vLLM optimizes for maximum tokens per second across all requests.
Consider Alternatives When:
- Fine-tuning: vLLM is an inference engine. For training or fine-tuning, use Hugging Face Transformers, Axolotl, or similar frameworks.
- CPU-only deployment: vLLM requires NVIDIA GPUs. For CPU inference, look at llama.cpp, CTranslate2, or ONNX Runtime.
- Unsupported architectures: If your model architecture is not yet supported by vLLM, you may need to use the model’s native inference code.
- Single-request latency: For a single request with no concurrency, the latency difference between vLLM and Hugging Face is small. vLLM’s advantages appear at scale.
- Edge/mobile deployment: vLLM targets server GPUs. For edge devices, look at MLC-LLM or ExecuTorch.
Summary
- Install vLLM with a single `pip install vllm`; it handles PyTorch and the CUDA-related dependencies.
- Offline batch inference uses the `LLM` class and `SamplingParams` to process many prompts efficiently in a few lines of code.
- The OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`) provides a production-ready HTTP endpoint that works with the OpenAI Python client and curl.
- Key configuration flags like `--tensor-parallel-size`, `--gpu-memory-utilization`, and `--max-model-len` let you tune vLLM for your specific hardware and workload.
- Throughput gains of roughly 4–6x over sequential Hugging Face inference come from continuous batching, PagedAttention memory management, and optimized CUDA kernels.
- Use vLLM for serving and batch processing; use other tools for fine-tuning, CPU-only deployment, or unsupported model architectures.
This is Part 2 of a 2-part series on vLLM:
- Part 1: PagedAttention & the LLM Serving Problem
- Part 2: Practical Serving & Walkthrough (You are here)