What is a Transformer?
A Transformer is a neural network architecture introduced in the paper “Attention Is All You Need” (Google, 2017). It is the foundation behind every modern LLM: GPT, Claude, Gemini, and Llama are all Transformer variants.
Why was the Transformer revolutionary?
Before Transformers, language models used RNNs/LSTMs to process sequences token by token (each step had to wait for the previous one to finish). Problems:
- Slow to train (couldn’t be parallelized)
- Forgot distant context
- Hard to scale to large datasets
Transformers solved this with the attention mechanism: each token looks at ALL other tokens at once and decides for itself “who do I need to pay attention to?”.
→ Training became parallelizable on GPUs → scaling to hundreds of billions of parameters became feasible in a reasonable time.
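To make the contrast concrete, here is a minimal NumPy sketch (toy shapes, random weights, not taken from any real model): the RNN update is a forced sequential loop, while self-attention covers all token pairs in one matrix product. The attention here is simplified to use the embeddings directly as queries, keys, and values; the full Q/K/V form appears in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))          # one embedding per token

# RNN: each step depends on the previous hidden state, a forced sequential loop
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                   # these iterations cannot run in parallel
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Self-attention (simplified: x plays Q, K, and V): all pairwise scores come
# from a single matrix product, so every position is processed at once
scores = x @ x.T / np.sqrt(d)              # (seq_len, seq_len) token-to-token scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # softmax rows: attention distributions
out = weights @ x                          # the whole sequence, computed together
```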
What is attention?
Take the sentence: "The cat sat on the mat because it was tired". When processing the word "it", attention lets the model learn to look back at "cat" (not "mat") to understand that "it" refers to the cat.
In detail: each token produces three vectors, Query, Key, and Value (Q, K, V). The attention score between two tokens is the dot product of one token’s Query with the other token’s Key, scaled and normalized: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the Key dimension. Tokens with higher scores contribute their Value more strongly to the output.
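A minimal single-head sketch of this formula in NumPy, assuming random placeholder weights (real models learn W_q, W_k, W_v, run many heads in parallel, and add masking for generation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # every token emits a Query, Key, Value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each Query matches each Key
    weights = softmax(scores)                  # each row answers "who do I attend to?"
    return weights @ V                         # output = attention-weighted mix of Values

rng = np.random.default_rng(0)
d_model = 16
x = rng.normal(size=(8, d_model))              # 8 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)       # -> (8, 16)
```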
Simple structure
Input tokens
↓
[Embedding + Positional Encoding]
↓
[Multi-Head Self-Attention] ← the "revolutionary" part
↓
[Feed-Forward Network]
↓
... repeated N times (12, 24, 96, ...)
↓
Output token
GPT-4 is estimated to have ~120 stacked Transformer layers.
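The diagram maps to surprisingly little code. Below is a toy NumPy sketch of the stacked blocks, assuming a pre-norm residual layout (one common convention) and random untrained weights; embeddings plus positional encodings are stubbed out as random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean / unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def block(x, p):
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])  # attention + residual
    h = np.maximum(layer_norm(x) @ p["W1"], 0)                        # feed-forward, ReLU
    return x + h @ p["W2"]                                            # second residual

d, d_ff, n_layers = 16, 64, 4                 # toy sizes; real models are far larger
x = rng.normal(size=(8, d))                   # stands in for embedding + positional encoding
layers = [{k: rng.normal(size=s) for k, s in
           [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
            ("W1", (d, d_ff)), ("W2", (d_ff, d))]} for _ in range(n_layers)]
for p in layers:                              # "... repeated N times"
    x = block(x, p)
print(x.shape)                                # -> (8, 16)
```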
Transformer variants
- Encoder-only (BERT): understands text, doesn’t generate
- Decoder-only (GPT, Claude, Llama): text generation — the architecture of most modern LLMs
- Encoder-Decoder (T5, the original Transformer): machine translation, summarization
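Mechanically, the biggest difference between encoder- and decoder-style attention is the mask. A small sketch (random scores, illustrative only): a decoder-only model blocks attention to future positions so it can generate left to right, while an encoder skips the mask and attends in both directions.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))            # raw attention scores

# Causal mask: True above the diagonal marks "future" positions to hide
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                           # decoder-only: block the future

weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)               # softmax rows
# Row i now puts zero weight on positions j > i; an encoder (BERT-style)
# simply skips the masking step and attends in both directions.
print(np.round(weights, 2))
```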
Where else are Transformers used, besides LLMs?
- Vision Transformer (ViT): image processing
- AlphaFold: predicting protein structures
- Whisper: speech-to-text
- Stable Diffusion: the text-encoder component
→ The Transformer has become the “universal architecture” of modern deep learning.
Further reading
- Original paper: “Attention Is All You Need” — Vaswani et al., 2017
- Related: LLM, Deep Learning