TopDev
Technical · Advanced

What is RLHF?

Reinforcement Learning from Human Feedback — a technique that uses human feedback to teach LLMs to answer in a way that matches human preferences.

Updated: May 5, 2026 · 2 min read

RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human ratings to teach an LLM to respond in ways people actually want — helpful, safe, and polite.

Why is RLHF needed?

A raw LLM (after pre-training) speaks fluent English, but:

  • Replies are blunt and unstructured
  • It won’t refuse dangerous requests (e.g., “how to build a bomb”)
  • It tends to ramble and lose focus

RLHF turns an LLM into a “helpful assistant” like ChatGPT or Claude.

The 3-step process

Step 1: Pre-training

Train the LLM on hundreds of billions of words → it learns language and general knowledge.
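Under the hood, pre-training is next-token prediction: given a prefix, the model must assign high probability to the token that actually comes next, and it is penalized with a cross-entropy loss when it doesn't. A minimal sketch with toy numbers (a 4-token vocabulary, hand-picked logits — not a real model):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for predicting the next token.
    Pre-training minimizes this over hundreds of billions of tokens."""
    # Softmax over the vocabulary (max-subtraction for numerical stability)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target_id] / sum(exps)
    return -math.log(prob_target)

# The model strongly predicts token 2 (logit 3.0).
loss_good = next_token_loss([0.1, 0.2, 3.0, -1.0], target_id=2)
loss_bad = next_token_loss([0.1, 0.2, 3.0, -1.0], target_id=3)
print(loss_good < loss_bad)  # True — a confident correct prediction costs less
```

That single objective, repeated at massive scale, is what gives the model its language and world knowledge — but notice it says nothing about being helpful or safe.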

Step 2: Supervised Fine-Tuning (SFT)

Show the LLM pairs of (question → high-quality sample answer) written by humans. → It learns how to structure good responses.
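SFT uses the same next-token loss as pre-training, with one common twist: the loss is typically averaged only over the answer tokens, with the prompt tokens masked out, so the model learns to *answer* rather than to reproduce the question. A tiny sketch of that masking (the per-token losses here are made-up numbers):

```python
def sft_loss(token_losses, loss_mask):
    """Supervised fine-tuning loss: average per-token loss over
    answer tokens only; prompt tokens (mask=0) don't contribute."""
    kept = [loss for loss, keep in zip(token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

# Per-token losses for "Q: ... A: ..."; mask is 0 on the prompt, 1 on the answer.
losses = [2.1, 1.8, 0.9, 0.4, 0.3]
mask   = [0,   0,   1,   1,   1]
print(sft_loss(losses, mask))  # (0.9 + 0.4 + 0.3) / 3 ≈ 0.533
```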

Step 3: RLHF

  1. Have the LLM generate multiple answers to the same prompt
  2. Human raters pick which answer is better
  3. Train a Reward Model that learns to score answers like a human would
  4. Use RL (typically the PPO algorithm) to fine-tune the LLM to maximize the reward score
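The two math pieces in steps 3 and 4 can be sketched in a few lines. The reward model is usually trained with a pairwise (Bradley-Terry) loss on the rater's choice, and the RL step maximizes the reward-model score minus a KL-style penalty that keeps the tuned model close to the SFT model. Function names and the `beta` value below are illustrative, not from any specific library:

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss for training the reward model: pushes the score
    of the human-preferred answer above the rejected one.
    loss = -log(sigmoid(r_chosen - r_rejected))"""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def rl_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward maximized in the RL step: reward-model score minus a
    KL-style penalty for drifting away from the SFT model — a standard
    guard against reward hacking."""
    return rm_score - beta * (logprob_policy - logprob_ref)

# A wider margin between chosen and rejected → smaller loss.
print(reward_model_loss(2.0, 0.5) < reward_model_loss(0.6, 0.5))  # True
# Drifting far from the SFT model eats into the reward.
print(rl_reward(1.0, -5.0, -5.0) > rl_reward(1.0, -2.0, -5.0))  # True
```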

Example

Prompt: “Explain blockchain to a 5-year-old”

Raw LLM: “Blockchain is a distributed ledger technology utilizing cryptographic hash functions…”

LLM after RLHF: “Imagine a notebook that the whole class shares. Everyone keeps an identical copy…”

Limitations

  • Labor-intensive: needs thousands of human raters → expensive
  • Bias: raters bring the cultural assumptions of wherever they come from, and the model absorbs them
  • Reward hacking: the LLM learns to game the reward model instead of genuinely doing well
  • Sycophancy: the LLM learns that agreeing with the user → high score → it becomes dishonestly agreeable

Newer variants

  • DPO (Direct Preference Optimization): drops the reward model, learns directly from preferences → simpler, similar effectiveness
  • RLAIF (RL from AI Feedback): uses another AI to rate responses instead of humans → cheaper but can amplify bias
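DPO's trick is a closed-form loss that trains the policy directly on each preference pair, no reward model in the loop. A sketch with toy log-probabilities (the `pi_*`/`ref_*` names and `beta` are illustrative): the loss falls as the policy raises the chosen answer's probability relative to a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on one preference pair. Inputs are log-probabilities of
    the chosen/rejected answers under the policy (pi_*) and the frozen
    reference model (ref_*). loss = -log(sigmoid(beta * margin))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen answer more than the reference does → lower loss
# than a policy that hasn't moved at all.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0))  # True
```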

Who uses RLHF?

  • OpenAI, Anthropic, Google: all flagship models go through RLHF
  • Most startups don’t run RLHF from scratch — too expensive. They fine-tune on top of models that have already been RLHF’d.

Tags

#rlhf #training #alignment