What is Quantization?
A technique for reducing numeric precision in AI models so they run faster and use less RAM — at the cost of a bit of accuracy.
Quantization is a technique that reduces the precision of numbers inside an AI model — from 32-bit floats down to 16, 8, or even 4 bits — so the model becomes smaller, runs faster, and uses less RAM/GPU memory, in exchange for a 1-3% drop in accuracy.
Why does quantization matter?
Llama 3.3 70B at float32: ~280GB of weights → too big even for a single H200 GPU (141GB, >$30k); you'd need a multi-GPU server.
The same model at 4-bit (Q4): ~40GB → runs on a 64GB Mac M-series, or on an RTX 4090 ($1.6k) with part of the model offloaded to CPU RAM.
Quantization makes large models runnable on consumer hardware — especially important for self-hosted open-source LLMs.
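Where do those numbers come from? A quick back-of-envelope check in Python (weights only, ignoring the KV cache and activations):

```python
# Rough memory footprint of a 70B-parameter model at different precisions (weights only).
params = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("Q4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>5}: ~{gigabytes:.0f} GB")
# FP32: ~280 GB / FP16: ~140 GB / INT8: ~70 GB / Q4: ~35 GB
# Real Q4_K_M files land closer to 40 GB because some layers keep higher precision
# and the per-block scale factors add overhead.
```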
Common quantization levels
| Bits | Name | Size reduction | Accuracy loss |
|---|---|---|---|
| 32 | FP32 | 1× (original) | 0% |
| 16 | FP16 / BF16 | 2× | ~0% |
| 8 | INT8 | 4× | < 1% |
| 4 | Q4_K_M, NF4 | 8× | 1-3% |
| 2 | Q2 | 16× | 5-15% (risky) |
| 1 | BitNet | 32× | still being researched |
→ The current “sweet spot”: Q4 (especially the Q4_K_M format from llama.cpp).
How quantization works (simply)
Each parameter in a model is a real number (e.g., 0.0327891).
Float32 stores all 32 bits → high precision, high memory cost.
Quantizing to int8: each value is mapped, via a scale factor, to one of only 256 levels (-128 to 127). 0.0327891 is rounded to the nearest available level.
At inference time: the values are temporarily dequantized for computation → results come out close to the float32 version, but much faster.
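A minimal sketch of the idea in NumPy, using symmetric "absmax" per-tensor scaling (real quantizers work per-block or per-channel and use smarter schemes, but the principle is the same):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 using a single absmax scale factor."""
    scale = np.abs(weights).max() / 127.0                      # one float32 scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values for computation."""
    return q.astype(np.float32) * scale

w = np.array([0.0327891, -0.51, 0.003, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
print(dequantize(q, scale))  # ≈ [0.0354, -0.5102, 0.0, 0.9]
# Each value snaps to the nearest of 255 levels: 0.0327891 comes back as ~0.0354,
# and the tiny 0.003 rounds all the way down to 0 -- that rounding error is the accuracy you trade away.
```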
Quantization vs Distillation
| | Quantization | Distillation |
|---|---|---|
| Method | Reduce precision | Train a small model to learn from a large one |
| Effort | A few hours | Several days to weeks |
| Quality loss | 1-3% | 5-15% (depending on ratio) |
| When to use | Faster inference | Need a SMALLER + faster model |
The two techniques complement each other and are often combined: for example, distill a 70B model down to 8B, then quantize the 8B result to Q4.
Popular tools
For self-hosted LLMs
- llama.cpp — quantize to Q4/Q5/Q8, GGUF format
- bitsandbytes — on-the-fly 8-bit and 4-bit (NF4) quantization for HuggingFace Transformers (see the sketch after this list)
- GPTQ, AWQ — calibration-based post-training quantization methods, better accuracy at 4-bit
- MLX — optimized for Apple Silicon
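For example, loading a HuggingFace model in 4-bit with bitsandbytes looks roughly like this (a sketch assuming the transformers + bitsandbytes stack is installed; the model id is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example id; any causal LM on the Hub works the same way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                          # quantize the linear layers to 4-bit at load time
    bnb_4bit_quant_type="nf4",                  # the NF4 format from the table above
    bnb_4bit_compute_dtype=torch.bfloat16,      # dequantize to bf16 for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```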
Typical workflow
# Pull the 70B model, pre-quantized to Q4_K_M
ollama pull llama3.3:70b-instruct-q4_K_M
# Or download a ready-made GGUF from HuggingFace (e.g., TheBloke's repos)
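Once you have a GGUF file, running it from Python takes a few lines with llama-cpp-python (a sketch; the file name is whatever you downloaded, and `n_gpu_layers=-1` assumes the whole model fits in VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file name
    n_ctx=8192,          # context window
    n_gpu_layers=-1,     # offload all layers to the GPU; lower this if VRAM runs out
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```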
When NOT to use quantization
- The model is already small (< 7B): at FP16 it needs only ~14GB, which fits on a 16GB consumer GPU without quantizing
- You need absolute accuracy (e.g., medical use): keep at least FP16
- You’re fine-tuning: train at higher precision, quantize afterwards