vLLM Part 1: PagedAttention & the LLM Serving Problem

How vLLM Rethinks Memory Management to Serve LLMs at Scale

Large language models are transforming every corner of software, but serving them in production is brutally expensive. A single LLaMA-13B model can consume over 26 GB of GPU memory just for its weights, and that is before accounting for the memory needed to actually process requests. When dozens or hundreds... [Read More]
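
To make those numbers concrete, here is a rough back-of-the-envelope sketch of the memory arithmetic (not code from the post; the LLaMA-13B shape used below, 40 layers with hidden size 5120, is the model's published configuration):

```python
# Back-of-the-envelope GPU memory for serving LLaMA-13B in FP16.
N_PARAMS = 13e9
N_LAYERS, HIDDEN = 40, 5120  # LLaMA-13B's published configuration
BYTES_FP16 = 2

# Weights alone: 13e9 params * 2 bytes each.
weights_gb = N_PARAMS * BYTES_FP16 / 1e9

# KV cache per generated token: one key and one value vector per layer.
kv_per_token = 2 * N_LAYERS * HIDDEN * BYTES_FP16   # ~0.82 MB per token
kv_per_request_gb = 2048 * kv_per_token / 1e9       # one 2048-token request

print(f"weights: {weights_gb:.0f} GB")                                 # 26 GB
print(f"KV cache per 2048-token request: {kv_per_request_gb:.2f} GB")  # ~1.68 GB
```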

Quantization with BitsAndBytes: Running Large Models on Consumer Hardware

A practical guide to model quantization using Hugging Face, bitsandbytes, and accelerate

Large language models have grown at a staggering pace. Models like LLaMA-2 70B carry 70 billion parameters, which in full 32-bit floating point precision would require roughly 280 GB of GPU memory just to load the weights. Even a 7B parameter model needs around 28 GB in FP32. For most... [Read More]
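
As a preview of what the guide walks through, loading a model in 4-bit with transformers, bitsandbytes, and accelerate can be sketched roughly as follows (the model id is a placeholder, and exact flags may vary with your library versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model id; substitute any causal LM you have access to.
model_id = "meta-llama/Llama-2-7b-hf"

# NF4 4-bit quantization: ~0.5 bytes per parameter, so a 7B model fits
# in roughly 4 GB instead of ~28 GB in FP32.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # accelerate places layers across available devices
)
```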

A/B Testing Part 3: Execution & Decision-Making

From Running Experiments to Making Confident Deployment Decisions

In Part 1 we covered experiment design fundamentals, and in Part 2 we explored the statistical framework and metric selection. In this final part, we tackle the practical realities of running experiments — the pitfalls that can invalidate your results, the infrastructure needed to run experiments reliably, and the decision-making... [Read More]
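
To give a flavor of the decision-making step, the final ship/no-ship call often starts with a two-proportion z-test on the primary metric; a minimal sketch with statsmodels, using made-up numbers:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes, control vs. treatment.
conversions = [1_205, 1_320]
samples = [24_000, 24_000]

z_stat, p_value = proportions_ztest(conversions, samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# Ship only if p < alpha (e.g. 0.05) AND the lift clears your practical
# significance threshold; statistical significance alone is not enough.
```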

A/B Testing Part 2: Statistical Framework & Metrics

Choosing the Right Metrics and Statistical Foundations for A/B Tests

In Part 1 we covered the foundations of A/B testing: what it is, why it matters, and how to design experiments with proper user segmentation and traffic allocation. Now we turn to the statistical machinery that makes A/B testing rigorous — how to determine sample sizes, choose the right metrics,... [Read More]
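
As a taste of the sample-size machinery, the classic two-proportion power calculation can be done in a few lines (illustrative numbers, not from the series):

```python
import math
from scipy.stats import norm

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group n for a two-sided, two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a lift from a 5.0% to a 5.5% conversion rate:
print(sample_size_per_group(0.05, 0.055))  # ~31,231 users per group
```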