
What is Quantization?

A technique for reducing numeric precision in AI models so they run faster and use less RAM — at the cost of a bit of accuracy.

Updated: May 5, 2026 · 2 min read

Quantization is a technique that reduces the precision of numbers inside an AI model — from 32-bit floats down to 16, 8, or even 4 bits — so the model becomes smaller, runs faster, and uses less RAM/GPU memory, in exchange for a 1-3% drop in accuracy.

Why does quantization matter?

Llama 3.3 70B at float32: ~280GB of weights → far more than any single GPU holds; you need several data-center cards such as the H200 (>$30k each).

The same model at 4-bit (Q4): ~40GB → fits in the unified memory of a 64GB Mac M-series, or runs on an RTX 4090 ($1.6k) with some layers offloaded to CPU.

Quantization makes large models runnable on consumer hardware — especially important for self-hosted open-source LLMs.
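
The arithmetic behind those numbers is simple: model size is roughly parameter count × bits per weight. A minimal Python sketch (the helper name is made up for illustration; real checkpoint files add a few GB of overhead for metadata and scales, which is why the figures above are a bit higher than the raw math):

def model_size_gb(params_billion: float, bits: int) -> float:
    # parameters × bits per weight, converted from bytes to GB
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"Llama 70B at {bits:>2}-bit: ~{model_size_gb(70, bits):.0f} GB")
# 32-bit: ~280 GB · 16-bit: ~140 GB · 8-bit: ~70 GB · 4-bit: ~35 GB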

Common quantization levels

Bits   Name           Size reduction   Accuracy loss
32     FP32           1× (original)    0%
16     FP16 / BF16    2×               ~0%
8      INT8           4×               < 1%
4      Q4_K_M, NF4    8×               1-3%
2      Q2             16×              5-15% (risky)
1      BitNet         32×              still being researched

→ The current “sweet spot”: Q4 (especially the Q4_K_M format from llama.cpp).

How quantization works (simply)

Each parameter in a model is a real number (e.g., 0.0327891).

Float32 stores all 32 bits → high precision, high memory cost.

Quantizing to int8: each weight becomes one of only 256 possible values (-128 to 127), stored together with a scale factor that maps those integers back to the original range. 0.0327891 is rounded to the nearest available level.

At inference time: the values are temporarily dequantized for computation → results come out close to the float32 version, but much faster.
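
To make this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in Python with NumPy. The function names are invented for the example; real libraries quantize per group of weights and use more careful rounding and calibration:

import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats to integers in [-127, 127]
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float values for computation at inference time
    return q.astype(np.float32) * scale

w = np.array([0.0327891, -0.51, 0.002, 0.73], dtype=np.float32)
q, scale = quantize_int8(w)
print(q)                          # [  6 -89   0 127]
print(dequantize_int8(q, scale))  # close to w, but snapped to a coarse grid of levels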

Quantization vs Distillation

               Quantization        Distillation
Method         Reduce precision    Train a small model to learn from a large one
Effort         A few hours         Several days to weeks
Quality loss   1-3%                5-15% (depending on ratio)
When to use    Faster inference    Need a SMALLER + faster model

The two techniques complement each other and are often used together.

For self-hosted LLMs

  • llama.cpp — quantize to Q4/Q5/Q8, GGUF format
  • bitsandbytes — on-the-fly 8-bit and 4-bit quantization for HuggingFace Transformers models
  • GPTQ, AWQ — post-training methods that calibrate on sample data, better accuracy than naive rounding
  • MLX — optimized for Apple Silicon
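
As a quick illustration of the bitsandbytes route, the sketch below loads a model directly in 4-bit NF4 through HuggingFace Transformers. It assumes transformers, bitsandbytes, and enough GPU/CPU memory are available; the model id is only an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # example id; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available devices
)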

Typical workflow

# Pull the 70B instruct model, already quantized to Q4_K_M
ollama pull llama3.3:70b-instruct-q4_K_M
# Or download a ready-made GGUF file from HuggingFace (e.g., TheBloke's repos)

When NOT to use quantization

  • The model is already small (< 7B): no need to quantize, runs fine on a regular GPU
  • You need absolute accuracy (e.g., medical use): keep at least FP16
  • You’re fine-tuning: train at higher precision, quantize afterwards
Tags
#quantization #optimization #inference