In Part 1, we explored why LLM serving is expensive, how naive KV cache allocation wastes 60–80% of GPU memory, and how PagedAttention eliminates that waste through OS-style paging. Now it is time to put theory into practice.

This post is a hands-on walkthrough. We will install vLLM, run offline batch inference, launch an OpenAI-compatible API server, tune key configuration flags, and benchmark throughput against vanilla Hugging Face Transformers — all with complete, runnable code.

This is Part 2 of a 2-part series on vLLM; Part 1 covers the theory behind PagedAttention and continuous batching.

Installation and Setup

Hardware Requirements

vLLM requires a CUDA-capable NVIDIA GPU. The minimum practical setup is:

  • GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, RTX 30xx/40xx)
  • VRAM: Depends on your model — 7B models need ~16 GB, 13B models need ~28 GB (in FP16)
  • CUDA: CUDA 12.1 or later
  • Python: 3.9–3.12
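
If you are not sure what your GPU supports, a quick check with PyTorch (installed alongside vLLM, or on its own) reports the compute capability and total VRAM of each visible device:

import torch

# Print name, compute capability, and total memory for each visible GPU.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    total_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, "
          f"compute capability {major}.{minor}, {total_gb:.0f} GB VRAM")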

Installation

The simplest installation path is via pip:

pip install vllm

This installs vLLM along with its dependencies (PyTorch, Triton, etc.). For the latest features, you can install from the nightly build:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
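
Either way, a quick sanity check confirms that vLLM imports cleanly and shows which version you ended up with:

python -c "import vllm; print(vllm.__version__)"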

Supported Models

vLLM supports a wide range of architectures including LLaMA, Mistral, Mixtral, Qwen, Phi, GPT-NeoX, Falcon, Gemma, and many more. The full list is maintained in the vLLM documentation. If a model uses a standard transformer architecture, vLLM likely supports it.

Offline Batch Inference

The simplest way to use vLLM is through the LLM class for offline batch inference: processing a list of prompts in a single call, without running a server.

from vllm import LLM, SamplingParams

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

# Initialize the model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Define prompts
prompts = [
    "Explain the concept of PagedAttention in two sentences.",
    "Write a Python function that checks if a number is prime.",
    "What are the main differences between TCP and UDP?",
    "Describe the process of photosynthesis for a 10-year-old.",
]

# Generate completions
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated_text!r}")
    print("-" * 60)

Key Parameters in SamplingParams

  • temperature: controls randomness; 0 = deterministic, higher = more random. Default: 1.0
  • top_p: nucleus sampling; only consider tokens within this cumulative probability mass. Default: 1.0
  • top_k: only consider the top-k most likely tokens; -1 disables it. Default: -1
  • max_tokens: maximum number of tokens to generate per prompt. Default: 16
  • stop: list of strings that stop generation when encountered. Default: None
  • presence_penalty: penalizes tokens that have already appeared, encouraging diversity. Default: 0.0
  • frequency_penalty: penalizes tokens in proportion to how often they have appeared. Default: 0.0
  • n: number of output sequences to generate per prompt. Default: 1

When n > 1, vLLM uses the KV cache sharing and copy-on-write mechanism from Part 1 — the prompt’s KV cache is computed once and shared across all output sequences, only copying blocks when they diverge.
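
For example, asking for three samples from a single prompt exercises exactly that path. The snippet below is a minimal sketch that reuses the same model as above:

from vllm import LLM, SamplingParams

# Three sampled completions per prompt: the prompt's KV cache blocks are
# computed once, shared by all three sequences, and copied only when they diverge.
sampling_params = SamplingParams(n=3, temperature=0.9, top_p=0.95, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["Write a tagline for a coffee shop."], sampling_params)

for candidate in outputs[0].outputs:  # one CompletionOutput per requested sequence
    print(candidate.text.strip())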

What happens under the hood when you call llm.generate():

  1. All prompts are tokenized and submitted to the vLLM engine.
  2. The scheduler batches them together using continuous batching.
  3. PagedAttention manages KV cache memory dynamically as tokens are generated.
  4. Results are returned once all prompts have completed generation.

Because vLLM handles batching internally, you get high throughput even if your prompts vary in length — there is no padding waste.

OpenAI-Compatible API Server

For production use, vLLM provides an HTTP server that implements the OpenAI Chat Completions and Completions API format. This means you can use vLLM as a drop-in replacement for OpenAI’s API.

Starting the Server

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

The server starts and logs the available endpoints. By default, it exposes:

  • /v1/completions — Text completions
  • /v1/chat/completions — Chat completions
  • /v1/models — List available models
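
A quick way to verify the server is up is to list the models it serves:

curl http://localhost:8000/v1/models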

Querying with the OpenAI Python Client

Since the API is OpenAI-compatible, you can use the official OpenAI Python library:

from openai import OpenAI

# Point the client at your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention and why does it matter?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
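
The same endpoint also supports streaming. Reusing the client above, pass stream=True and print the chunks as they arrive (a minimal sketch):

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in two sentences."}],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()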

Querying with curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Summarize vLLM in one paragraph."}
        ],
        "temperature": 0.7,
        "max_tokens": 256
    }'

The server handles concurrent requests automatically, applying continuous batching and PagedAttention behind the scenes. You do not need to manage batching yourself.
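
If you want to see that from the client side, one simple approach is to fire many requests at once with the async variant of the OpenAI client. The sketch below assumes the same server as above; the prompts are just placeholders:

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def ask(question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": question}],
        max_tokens=128,
    )
    return response.choices[0].message.content

async def main() -> None:
    questions = [f"Give me one interesting fact about topic #{i}." for i in range(32)]
    # All 32 requests are in flight at once; the server batches them internally.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    print(f"Received {len(answers)} answers")

asyncio.run(main())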

Key Server Configuration

vLLM exposes many flags to tune the server for your hardware and workload. Here are the most important ones:

  • --model: Hugging Face model ID or path to a local model directory. Example: --model meta-llama/Llama-3.1-8B-Instruct
  • --tensor-parallel-size: number of GPUs to use for tensor parallelism; set it to your GPU count. Example: --tensor-parallel-size 4
  • --gpu-memory-utilization: fraction of GPU memory vLLM may use (0.0–1.0); lower values leave room for other processes. Example: --gpu-memory-utilization 0.90
  • --max-model-len: maximum sequence length (prompt + output); lowering it reduces memory pre-allocation when your sequences are short. Example: --max-model-len 4096
  • --dtype: data type for model weights; auto uses the model’s native dtype, half forces FP16, bfloat16 forces BF16. Example: --dtype auto
  • --quantization: quantization method (awq, gptq, squeezellm, etc.) for pre-quantized models. Example: --quantization awq
  • --max-num-seqs: maximum number of sequences in a batch; controls the concurrency vs. memory tradeoff. Example: --max-num-seqs 256
  • --enforce-eager: disables CUDA graph optimization; useful for debugging or unsupported model architectures. Example: --enforce-eager
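
As a point of comparison before the multi-GPU example below, a single-GPU launch on a memory-constrained card might combine several of these flags (the values here are illustrative, not recommendations):

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --dtype half \
    --port 8000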

Example: Multi-GPU Serving with Tuned Memory

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000

This serves a 70B model across 4 GPUs, using 95% of each GPU’s memory, capping context at 8192 tokens, and computing in BF16.
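
A rough back-of-the-envelope check shows why a single GPU is not enough here (assuming 2 bytes per parameter for BF16 and ignoring KV cache and activations):

params = 70e9               # 70B parameters
bytes_per_param = 2         # BF16
weight_gb = params * bytes_per_param / 1e9
per_gpu_gb = weight_gb / 4  # tensor parallelism shards the weights across 4 GPUs
print(f"~{weight_gb:.0f} GB of weights in total, ~{per_gpu_gb:.0f} GB per GPU")
# Roughly 140 GB of weights in total, about 35 GB per GPU; the rest of each
# card's VRAM is available for the KV cache.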

Throughput Comparison: HuggingFace vs vLLM

Let us compare throughput directly. We will generate outputs for 40 prompts using both Hugging Face Transformers (sequential) and vLLM (batched), then compare total time.

HuggingFace Transformers (Sequential Baseline)

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list using merge sort.",
    "What are the benefits of renewable energy?",
    "Describe the water cycle step by step.",
    "How does a blockchain work?",
] * 8  # 40 prompts total

max_new_tokens = 256
generated_texts = []
generated_token_counts = []

start_time = time.time()

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
        )
    # Keep only the newly generated tokens (exclude the prompt), so the count
    # is comparable to the vLLM measurement below, which counts output tokens only.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    generated_texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    generated_token_counts.append(len(new_tokens))

hf_time = time.time() - start_time

total_output_tokens = sum(generated_token_counts)
print(f"HuggingFace: {hf_time:.1f}s for {len(prompts)} prompts")
print(f"Throughput: {total_output_tokens / hf_time:.1f} tokens/sec")

Each prompt is tokenized, sent through the model, and decoded one at a time. The GPU processes a single sequence per forward pass, leaving most of its parallel capacity idle.

vLLM (Batched Inference)

import time
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-3.1-8B-Instruct"

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

llm = LLM(model=model_name)

prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list using merge sort.",
    "What are the benefits of renewable energy?",
    "Describe the water cycle step by step.",
    "How does a blockchain work?",
] * 8  # 40 prompts total

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
vllm_time = time.time() - start_time

total_output_tokens = sum(
    len(output.outputs[0].token_ids) for output in outputs
)
print(f"vLLM: {vllm_time:.1f}s for {len(prompts)} prompts")
print(f"Throughput: {total_output_tokens / vllm_time:.1f} tokens/sec")

vLLM processes all 40 prompts together. The scheduler manages them as a continuous batch — some prompts finish early and free their KV cache blocks, while the remaining prompts continue generating without pause.
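
You can see this variation directly in the returned objects: each completion records how many tokens it produced and why it stopped. Continuing from the benchmark above:

# Sequences end either at a stop condition ("stop") or at the max_tokens cap ("length").
for output in outputs[:5]:
    completion = output.outputs[0]
    print(f"{len(completion.token_ids):4d} tokens, finish_reason={completion.finish_reason}")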

Expected Results

On a single A100 (80 GB), with LLaMA-3.1-8B and max_tokens=256:

  • HuggingFace sequential: ~180–240 s for 40 prompts, ~50–80 tokens/sec
  • vLLM batch: ~30–60 s for 40 prompts, ~300–500 tokens/sec
  • Speedup: roughly 4–6x in both total time and throughput

The exact numbers vary with GPU, model, prompt length, and output length. The key observation: the speedup comes from vLLM keeping the GPU fully utilized through continuous batching and fitting more concurrent sequences through efficient memory management. On shorter outputs where the HF baseline wastes less memory, the gap narrows; on longer outputs and larger batches, it widens.

When to Use vLLM (and When Not To)

Use vLLM When:

  • Serving models in production: vLLM’s API server, continuous batching, and memory efficiency make it ideal for serving endpoints.
  • Batch processing: Processing large datasets of prompts benefits enormously from vLLM’s throughput optimizations.
  • High concurrency: When multiple users or processes send requests simultaneously, vLLM’s scheduling shines.
  • Throughput matters more than latency: vLLM optimizes for maximum tokens per second across all requests.

Consider Alternatives When:

  • Fine-tuning: vLLM is an inference engine. For training or fine-tuning, use Hugging Face Transformers, Axolotl, or similar frameworks.
  • CPU-only deployment: vLLM requires NVIDIA GPUs. For CPU inference, look at llama.cpp, CTranslate2, or ONNX Runtime.
  • Unsupported architectures: If your model architecture is not yet supported by vLLM, you may need to use the model’s native inference code.
  • Single-request latency: For a single request with no concurrency, the latency difference between vLLM and Hugging Face is small. vLLM’s advantages appear at scale.
  • Edge/mobile deployment: vLLM targets server GPUs. For edge devices, look at MLC-LLM or ExecuTorch.

Summary

  1. Install vLLM with a single pip install vllm — it handles PyTorch and CUDA dependencies.
  2. Offline batch inference uses the LLM class and SamplingParams for processing multiple prompts efficiently with a few lines of code.
  3. The OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server) provides a production-ready HTTP endpoint that works with the OpenAI Python client and curl.
  4. Key configuration flags like --tensor-parallel-size, --gpu-memory-utilization, and --max-model-len let you tune vLLM for your specific hardware and workload.
  5. Throughput gains of roughly 4–6x over sequential Hugging Face inference come from continuous batching, PagedAttention memory management, and optimized CUDA kernels.
  6. Use vLLM for serving and batch processing; use other tools for fine-tuning, CPU-only, or unsupported model architectures.

This is Part 2 of a 2-part series on vLLM; if you missed Part 1, it covers the theory behind PagedAttention and continuous batching.

Useful Resources