What is Quantization?
A technique for reducing numeric precision in AI models so they run faster and use less RAM — at the cost of a bit of accuracy.
Quantization is a technique that reduces the precision of numbers inside an AI model — from 32-bit floats down to 16, 8, or even 4 bits — so the model becomes smaller, runs faster, and uses less RAM/GPU memory, in exchange for a 1-3% drop in accuracy.
Why does quantization matter?
Llama 3.3 70B at float32: ~280GB of weights → too big even for a single H200 GPU (141GB, >$30k); you'd need a multi-GPU server.
The same model at 4-bit (Q4): ~40GB → runs on a 64GB Mac M-series, or on an RTX 4090 ($1.6k) with part of the model offloaded to CPU RAM.
Quantization makes large models runnable on consumer hardware — especially important for self-hosted open-source LLMs.
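Where do those numbers come from? A quick back-of-envelope check in Python (weights only, ignoring the KV cache and activations):

```python
# Rough memory footprint of a 70B-parameter model at different precisions (weights only).
params = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("Q4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>5}: ~{gigabytes:.0f} GB")
# FP32: ~280 GB / FP16: ~140 GB / INT8: ~70 GB / Q4: ~35 GB
# Real Q4_K_M files land closer to 40 GB because some layers keep higher precision
# and the per-block scale factors add overhead.
```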
Common quantization levels
| Bits | Name | Size reduction | Accuracy loss |
|---|---|---|---|
| 32 | FP32 | 1× (original) | 0% |
| 16 | FP16 / BF16 | 2× | ~0% |
| 8 | INT8 | 4× | < 1% |
| 4 | Q4_K_M, NF4 | 8× | 1-3% |
| 2 | Q2 | 16× | 5-15% (risky) |
| 1 | BitNet | 32× | still being researched |
→ The current “sweet spot”: Q4 (especially the Q4_K_M format from llama.cpp).
How quantization works (simply)
Each parameter in a model is a real number (e.g., 0.0327891).
Float32 stores all 32 bits → high precision, high memory cost.
Quantizing to int8: each value is mapped, via a scale factor, to one of only 256 levels (-128 to 127). 0.0327891 is rounded to the nearest available level.
At inference time: the values are temporarily dequantized for computation → results come out close to the float32 version, but much faster.
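A minimal sketch of the idea in NumPy, using symmetric "absmax" per-tensor scaling (real quantizers work per-block or per-channel and use smarter schemes, but the principle is the same):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 using a single absmax scale factor."""
    scale = np.abs(weights).max() / 127.0                      # one float32 scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values for computation."""
    return q.astype(np.float32) * scale

w = np.array([0.0327891, -0.51, 0.003, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
print(dequantize(q, scale))  # ≈ [0.0354, -0.5102, 0.0, 0.9]
# Each value snaps to the nearest of 255 levels: 0.0327891 comes back as ~0.0354,
# and the tiny 0.003 rounds all the way down to 0 -- that rounding error is the accuracy you trade away.
```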
Quantization vs Distillation
| | Quantization | Distillation |
|---|---|---|
| Method | Reduce precision | Train a small model to learn from a large one |
| Effort | A few hours | Several days to weeks |
| Quality loss | 1-3% | 5-15% (depending on ratio) |
| When to use | Faster inference | Need a SMALLER + faster model |
The two techniques complement each other and are often combined: for example, distill a 70B model down to 8B, then quantize the 8B result to Q4.
Popular tools
For self-hosted LLMs
- llama.cpp — quantize to Q4/Q5/Q8, GGUF format
- bitsandbytes — on-the-fly 8-bit and 4-bit (NF4) quantization for HuggingFace Transformers (see the sketch after this list)
- GPTQ, AWQ — calibration-based post-training quantization methods, better accuracy at 4-bit
- MLX — optimized for Apple Silicon
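For example, loading a HuggingFace model in 4-bit with bitsandbytes looks roughly like this (a sketch assuming the transformers + bitsandbytes stack is installed; the model id is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example id; any causal LM on the Hub works the same way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                          # quantize the linear layers to 4-bit at load time
    bnb_4bit_quant_type="nf4",                  # the NF4 format from the table above
    bnb_4bit_compute_dtype=torch.bfloat16,      # dequantize to bf16 for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```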
Typical workflow
# Pull the 70B model, pre-quantized to Q4_K_M
ollama pull llama3.3:70b-instruct-q4_K_M
# Or download a ready-made GGUF from HuggingFace (e.g., TheBloke's repos)
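Once you have a GGUF file, running it from Python takes a few lines with llama-cpp-python (a sketch; the file name is whatever you downloaded, and `n_gpu_layers=-1` assumes the whole model fits in VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file name
    n_ctx=8192,          # context window
    n_gpu_layers=-1,     # offload all layers to the GPU; lower this if VRAM runs out
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```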
When NOT to use quantization
- The model is already small (< 7B): at FP16 it needs only ~14GB, which fits on a 16GB consumer GPU without quantizing
- You need absolute accuracy (e.g., medical use): keep at least FP16
- You’re fine-tuning: train at higher precision, quantize afterwards