Quantization

Quick Answer: A technique that reduces model size and memory usage by representing weights with fewer bits — for example, converting 32-bit floating point numbers to 8-bit or 4-bit integers.
Because quantized weights occupy less memory and require less data movement, quantized models run faster and on cheaper hardware, usually with minimal quality loss.

Example

A 70B parameter model at full precision (FP16) requires ~140GB of memory. With 4-bit quantization (GPTQ/AWQ), it fits in ~35GB — runnable on a single GPU instead of requiring multi-GPU setups.
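The arithmetic above is just parameters times bits per weight. A minimal sketch (weights only; it ignores KV cache, activations, and runtime overhead, so real requirements are somewhat higher):

```python
def model_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

print(model_memory_gb(70e9, 16))  # 140.0 -- FP16: ~140 GB
print(model_memory_gb(70e9, 4))   # 35.0  -- 4-bit: ~35 GB
```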

Why It Matters

Quantization makes it possible to run large models locally or on affordable cloud instances. It's essential knowledge for anyone deploying open-source models in production, where compute cost is a primary concern.

How It Works

Quantization reduces a model's memory footprint by representing weights with fewer bits. A standard model uses 16-bit floating point (FP16) weights. 8-bit quantization halves the memory requirement; 4-bit quantization quarters it. This makes it possible to run large models on smaller GPUs.
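The core idea can be sketched with simple symmetric per-tensor quantization: map each float to the nearest multiple of a shared scale, store only the small integers, and multiply back at inference time. This is a toy illustration, not any production scheme:

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers via one shared (per-tensor) scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.42, -1.27, 0.08, 0.9])
restored = dequantize(q, scale)  # close to the originals; error is at most scale / 2
```

Each stored value shrinks from 16 bits to 8 (plus one shared scale per tensor), which is where the memory halving comes from.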

Common quantization methods include GPTQ (post-training quantization that uses calibration data to minimize reconstruction error), AWQ (activation-aware quantization that protects the weights that matter most to activations), and bitsandbytes (a library that quantizes weights on the fly as the model is loaded). Each method trades off compression ratio, inference speed, and output quality.
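The real methods above use much more sophisticated machinery, but the calibration idea behind GPTQ-style post-training quantization can be sketched in a few lines: try candidate quantization parameters and keep whichever best reconstructs sample weights. Everything here (the candidate list, the error metric) is a simplified stand-in:

```python
def reconstruction_error(weights, scale):
    """Squared error after a quantize -> dequantize round trip at a given scale."""
    return sum((w - round(w / scale) * scale) ** 2 for w in weights)

def pick_scale(calibration_weights, candidate_scales):
    """Choose the candidate scale that best preserves the calibration weights."""
    return min(candidate_scales, key=lambda s: reconstruction_error(calibration_weights, s))

best = pick_scale([0.1, 0.2, 0.3, 0.4], [0.07, 0.1, 0.13])  # 0.1 reconstructs these exactly
```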

The quality impact of quantization depends on the model size. Large models (70B+) can typically be quantized to 4-bit with minimal quality loss. Smaller models (7B-13B) show more noticeable degradation at aggressive quantization levels. The sweet spot for most deployments is 8-bit quantization, which typically preserves 99%+ of the original model's performance.
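Why aggressive quantization costs more quality is easy to see numerically: with fewer integer levels, the rounding step per weight gets coarser. A sketch using the same symmetric scheme on random Gaussian weights (illustrative only; real model error depends on the weight distribution and method):

```python
import random

def mean_quant_error(weights, bits):
    """Mean absolute error after a symmetric quantization round trip at a bit width."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return sum(abs(w - round(w / scale) * scale) for w in weights) / len(weights)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(10_000)]
err8 = mean_quant_error(w, 8)
err4 = mean_quant_error(w, 4)   # roughly an order of magnitude larger than err8
```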

Common Mistakes

Common mistake: Quantizing a small model (7B) to 4-bit and expecting full quality

Smaller models lose more quality per bit of quantization. Use 8-bit for models under 13B, and only go to 4-bit for 70B+ models where memory is the binding constraint.

Common mistake: Not benchmarking quantized models on your specific task before deploying

Quantization affects different capabilities unevenly. A model might retain strong reasoning but lose coding accuracy. Test on your actual use case, not just general benchmarks.
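A task-specific regression check can be as simple as comparing accuracy before and after quantization against a tolerance. The `predict` callables below are hypothetical stand-ins; in practice they would call the full-precision and quantized deployments:

```python
def accuracy(predict, examples):
    """Fraction of (prompt, expected) pairs a predict(prompt) callable gets right."""
    return sum(predict(p) == expected for p, expected in examples) / len(examples)

def quantization_ok(predict_full, predict_quant, examples, max_drop=0.02):
    """Accept the quantized model only if task accuracy drops by at most max_drop."""
    return accuracy(predict_full, examples) - accuracy(predict_quant, examples) <= max_drop

# Toy stand-in "models" for illustration:
examples = [("capital of France?", "Paris"), ("2 + 2 =", "4")]
full = {"capital of France?": "Paris", "2 + 2 =": "4"}.get
quant = {"capital of France?": "Paris", "2 + 2 =": "5"}.get
quantization_ok(full, quant, examples)  # False: a 50-point drop on this toy task
```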

Career Relevance

Quantization skills are essential for ML engineers deploying models in production, especially on edge devices or cost-constrained infrastructure. Understanding quantization trade-offs helps AI engineers choose the right model size and format for their deployment constraints.
