TopDev
Technical · Advanced

What is RLHF?

Reinforcement Learning from Human Feedback — a technique that uses human feedback to teach LLMs to answer in a way that matches human preferences.

Updated: May 5, 2026 · 2 min read

RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human ratings to teach an LLM to respond in ways people actually want — helpful, safe, and polite.

Why is RLHF needed?

A raw LLM (after pre-training) speaks fluent English, but:

  • Replies are blunt and unstructured
  • It won’t refuse dangerous requests (e.g., “how to build a bomb”)
  • It tends to ramble and lose focus

RLHF turns an LLM into a “helpful assistant” like ChatGPT or Claude.

The 3-step process

Step 1: Pre-training

Train the LLM on hundreds of billions of words → it learns language and general knowledge.
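Under the hood, pre-training is next-token prediction: given a prefix, the model must assign high probability to the token that actually comes next, and it is penalized with a cross-entropy loss when it doesn't. A minimal sketch with toy numbers (a 4-token vocabulary, hand-picked logits — not a real model):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for predicting the next token.
    Pre-training minimizes this over hundreds of billions of tokens."""
    # Softmax over the vocabulary (max-subtraction for numerical stability)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target_id] / sum(exps)
    return -math.log(prob_target)

# The model strongly predicts token 2 (logit 3.0).
loss_good = next_token_loss([0.1, 0.2, 3.0, -1.0], target_id=2)
loss_bad = next_token_loss([0.1, 0.2, 3.0, -1.0], target_id=3)
print(loss_good < loss_bad)  # True — a confident correct prediction costs less
```

That single objective, repeated at massive scale, is what gives the model its language and world knowledge — but notice it says nothing about being helpful or safe.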

Step 2: Supervised Fine-Tuning (SFT)

Show the LLM pairs of (question → high-quality sample answer) written by humans. → It learns how to structure good responses.
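SFT uses the same next-token loss as pre-training, with one common twist: the loss is typically averaged only over the answer tokens, with the prompt tokens masked out, so the model learns to *answer* rather than to reproduce the question. A tiny sketch of that masking (the per-token losses here are made-up numbers):

```python
def sft_loss(token_losses, loss_mask):
    """Supervised fine-tuning loss: average per-token loss over
    answer tokens only; prompt tokens (mask=0) don't contribute."""
    kept = [loss for loss, keep in zip(token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

# Per-token losses for "Q: ... A: ..."; mask is 0 on the prompt, 1 on the answer.
losses = [2.1, 1.8, 0.9, 0.4, 0.3]
mask   = [0,   0,   1,   1,   1]
print(sft_loss(losses, mask))  # (0.9 + 0.4 + 0.3) / 3 ≈ 0.533
```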

Step 3: RLHF

  1. Have the LLM generate multiple answers to the same prompt
  2. Human raters pick which answer is better
  3. Train a Reward Model that learns to score answers like a human would
  4. Use RL (typically the PPO algorithm) to fine-tune the LLM to maximize the reward score
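The two math pieces in steps 3 and 4 can be sketched in a few lines. The reward model is usually trained with a pairwise (Bradley-Terry) loss on the rater's choice, and the RL step maximizes the reward-model score minus a KL-style penalty that keeps the tuned model close to the SFT model. Function names and the `beta` value below are illustrative, not from any specific library:

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss for training the reward model: pushes the score
    of the human-preferred answer above the rejected one.
    loss = -log(sigmoid(r_chosen - r_rejected))"""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def rl_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward maximized in the RL step: reward-model score minus a
    KL-style penalty for drifting away from the SFT model — a standard
    guard against reward hacking."""
    return rm_score - beta * (logprob_policy - logprob_ref)

# A wider margin between chosen and rejected → smaller loss.
print(reward_model_loss(2.0, 0.5) < reward_model_loss(0.6, 0.5))  # True
# Drifting far from the SFT model eats into the reward.
print(rl_reward(1.0, -5.0, -5.0) > rl_reward(1.0, -2.0, -5.0))  # True
```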

Example

Prompt: “Explain blockchain to a 5-year-old”

Raw LLM: “Blockchain is a distributed ledger technology utilizing cryptographic hash functions…”

LLM after RLHF: “Imagine a notebook that the whole class shares. Everyone keeps an identical copy…”

Limitations

  • Labor-intensive: needs thousands of human raters → expensive
  • Bias: raters bring the cultural assumptions of wherever they come from, and the model absorbs them
  • Reward hacking: the LLM learns to game the reward model instead of genuinely doing well
  • Sycophancy: the LLM learns that agreeing with the user → high score → it becomes dishonestly agreeable

Newer variants

  • DPO (Direct Preference Optimization): drops the reward model, learns directly from preferences → simpler, similar effectiveness
  • RLAIF (RL from AI Feedback): uses another AI to rate responses instead of humans → cheaper but can amplify bias
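DPO's trick is a closed-form loss that trains the policy directly on each preference pair, no reward model in the loop. A sketch with toy log-probabilities (the `pi_*`/`ref_*` names and `beta` are illustrative): the loss falls as the policy raises the chosen answer's probability relative to a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on one preference pair. Inputs are log-probabilities of
    the chosen/rejected answers under the policy (pi_*) and the frozen
    reference model (ref_*). loss = -log(sigmoid(beta * margin))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen answer more than the reference does → lower loss
# than a policy that hasn't moved at all.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0))  # True
```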

Who uses RLHF?

  • OpenAI, Anthropic, Google: all flagship models go through RLHF
  • Most startups don’t run RLHF from scratch — too expensive. They fine-tune on top of models that have already been RLHF’d.

Tags

#rlhf #training #alignment