What is RLHF?
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human ratings to teach an LLM to respond in ways people actually want — helpful, safe, and polite.
Why is RLHF needed?
A raw LLM (straight out of pre-training) speaks fluent English, but:
- It completes text rather than following instructions, so replies come out blunt and unstructured
- It won't refuse dangerous requests (e.g., "how to build a bomb")
- It tends to ramble and lose focus
RLHF turns an LLM into a “helpful assistant” like ChatGPT or Claude.
The 3-step process
Step 1: Pre-training
Train the LLM on hundreds of billions (nowadays often trillions) of words of raw text → it learns language and general knowledge.
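The objective at this stage is plain next-token prediction. A minimal sketch in PyTorch; `model` here is a hypothetical decoder that maps token IDs to next-token logits, not any specific library's API:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Next-token prediction: each position predicts the token after it."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq, vocab)
        targets.reshape(-1),                  # flatten to (batch*seq,)
    )
```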
Step 2: Supervised Fine-Tuning (SFT)
Show the LLM pairs of (question → high-quality sample answer) written by humans.
→ It learns how to structure good responses.
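Mechanically, SFT reuses the same next-token loss, just restricted to the human-written answer tokens. A sketch under the same assumptions as above (`model` and `prompt_len` are illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(model, token_ids, prompt_len):
    """Same loss as pre-training, but only the human-written answer
    tokens contribute; prompt positions are masked out of the loss."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100       # positions predicting prompt tokens are ignored
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```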
Step 3: RLHF
- Have the LLM generate multiple answers to the same prompt
- Human raters pick which answer is better
- Train a Reward Model that learns to score answers like a human would
- Use RL (the PPO algorithm, Proximal Policy Optimization) to fine-tune the LLM to maximize the reward score (see the sketch after this list)
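The reward model is typically trained with a pairwise (Bradley-Terry) loss, and the PPO stage maximizes its score minus a KL penalty that tethers the policy to the SFT model. A minimal sketch, assuming a hypothetical `reward_model` that maps an answer's token IDs to a scalar score (names and signatures are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: the human-preferred answer
    should score higher than the rejected one."""
    r_chosen = reward_model(chosen_ids)       # (batch,) scalar score per answer
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def ppo_shaped_reward(reward_score, logprob_policy, logprob_ref, kl_coef=0.1):
    """Reward fed to PPO: reward-model score minus a KL penalty that
    keeps the policy close to the SFT (reference) model."""
    return reward_score - kl_coef * (logprob_policy - logprob_ref)
```

The KL term matters: without it, PPO tends to drift toward degenerate text that games the reward model, which is exactly the reward hacking noted under limitations below.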
Example
Prompt: “Explain blockchain to a 5-year-old”
Raw LLM: “Blockchain is a distributed ledger technology utilizing cryptographic hash functions…”
LLM after RLHF: “Imagine a notebook that the whole class shares. Everyone keeps an identical copy…”
Limitations
- Labor-intensive: needs thousands of human raters → expensive
- Bias: the raters' cultural and national background gets baked into the model's notion of a "good" answer
- Reward hacking: the LLM learns to game the reward model instead of genuinely doing well
- Sycophancy: agreeing with the user reliably earns high scores, so the LLM learns to be dishonestly agreeable
Newer variants
- DPO (Direct Preference Optimization): drops the reward model and learns directly from preference pairs (sketched after this list) → simpler, with similar effectiveness
- RLAIF (RL from AI Feedback): uses another AI to rate responses instead of humans → cheaper but can amplify bias
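To make the DPO idea concrete, here is a hedged sketch of its loss. The inputs are summed log-probabilities of each full answer under the policy and under a frozen reference (SFT) model; all names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: prefer the chosen answer over the rejected one, measured
    relative to a frozen reference model. Inputs are summed
    log-probabilities of whole answers (tensors of shape (batch,))."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Because the reference model stays frozen, no reward model and no RL loop are needed: the preference data is consumed directly by ordinary gradient descent.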
Who uses RLHF?
- OpenAI, Anthropic, Google: all flagship models go through RLHF
- Most startups don’t run RLHF from scratch — too expensive. They fine-tune on top of models that have already been RLHF’d.