What is a Transformer?
A Transformer is a neural network architecture introduced in the paper “Attention Is All You Need” (Google, 2017). It is the foundation behind every modern LLM: GPT, Claude, Gemini, and Llama are all Transformer variants.
Why was the Transformer revolutionary?
Before Transformers, language models used RNNs/LSTMs to process sequences token by token (each step had to wait for the previous one to finish). Problems:
- Slow to train (couldn’t be parallelized)
- Forgot distant context
- Hard to scale to large datasets
Transformers solved this with the attention mechanism: each token looks at ALL other tokens at once and decides for itself “who do I need to pay attention to?”.
→ Training became parallelizable on GPUs → scaling to hundreds of billions of parameters became feasible in a reasonable time.
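To make the contrast concrete, here is a minimal NumPy sketch (toy shapes, random weights, not taken from any real model): the RNN update is a forced sequential loop, while self-attention covers all token pairs in one matrix product. The attention here is simplified to use the embeddings directly as queries, keys, and values; the full Q/K/V form appears in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))          # one embedding per token

# RNN: each step depends on the previous hidden state, a forced sequential loop
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                   # these iterations cannot run in parallel
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Self-attention (simplified: x plays Q, K, and V): all pairwise scores come
# from a single matrix product, so every position is processed at once
scores = x @ x.T / np.sqrt(d)              # (seq_len, seq_len) token-to-token scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # softmax rows: attention distributions
out = weights @ x                          # the whole sequence, computed together
```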
What is attention?
Take the sentence: "The cat sat on the mat because it was tired". When processing the word "it", attention lets the model learn to look back at "cat" (not "mat") to understand that "it" refers to the cat.
In detail: each token produces three vectors, Query, Key, and Value (Q, K, V). The attention score between two tokens is the dot product of one token’s Query with the other token’s Key, scaled and normalized: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the Key dimension. Tokens with higher scores contribute their Value more strongly to the output.
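A minimal single-head sketch of this formula in NumPy, assuming random placeholder weights (real models learn W_q, W_k, W_v, run many heads in parallel, and add masking for generation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # every token emits a Query, Key, Value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each Query matches each Key
    weights = softmax(scores)                  # each row answers "who do I attend to?"
    return weights @ V                         # output = attention-weighted mix of Values

rng = np.random.default_rng(0)
d_model = 16
x = rng.normal(size=(8, d_model))              # 8 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)       # -> (8, 16)
```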
Simple structure
Input tokens
↓
[Embedding + Positional Encoding]
↓
[Multi-Head Self-Attention] ← the "revolutionary" part
↓
[Feed-Forward Network]
↓
... repeated N times (12, 24, 96, ...)
↓
Output token
GPT-4 is estimated to have ~120 stacked Transformer layers.
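The diagram maps to surprisingly little code. Below is a toy NumPy sketch of the stacked blocks, assuming a pre-norm residual layout (one common convention) and random untrained weights; embeddings plus positional encodings are stubbed out as random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean / unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def block(x, p):
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])  # attention + residual
    h = np.maximum(layer_norm(x) @ p["W1"], 0)                        # feed-forward, ReLU
    return x + h @ p["W2"]                                            # second residual

d, d_ff, n_layers = 16, 64, 4                 # toy sizes; real models are far larger
x = rng.normal(size=(8, d))                   # stands in for embedding + positional encoding
layers = [{k: rng.normal(size=s) for k, s in
           [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
            ("W1", (d, d_ff)), ("W2", (d_ff, d))]} for _ in range(n_layers)]
for p in layers:                              # "... repeated N times"
    x = block(x, p)
print(x.shape)                                # -> (8, 16)
```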
Transformer variants
- Encoder-only (BERT): understands text, doesn’t generate
- Decoder-only (GPT, Claude, Llama): text generation — the architecture of most modern LLMs
- Encoder-Decoder (T5, the original Transformer): machine translation, summarization
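Mechanically, the biggest difference between encoder- and decoder-style attention is the mask. A small sketch (random scores, illustrative only): a decoder-only model blocks attention to future positions so it can generate left to right, while an encoder skips the mask and attends in both directions.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))            # raw attention scores

# Causal mask: True above the diagonal marks "future" positions to hide
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                           # decoder-only: block the future

weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)               # softmax rows
# Row i now puts zero weight on positions j > i; an encoder (BERT-style)
# simply skips the masking step and attends in both directions.
print(np.round(weights, 2))
```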
Where else are Transformers used, besides LLMs?
- Vision Transformer (ViT): image processing
- AlphaFold: predicting protein structures
- Whisper: speech-to-text
- Stable Diffusion: the text-encoder component
→ The Transformer has become the “universal architecture” of modern deep learning.
Further reading
- Original paper: “Attention Is All You Need” — Vaswani et al., 2017
- Related: LLM, Deep Learning