What is a Diffusion Model?
A class of AI model that generates images by gradually denoising — the technology behind Midjourney, Stable Diffusion, and DALL-E.
A Diffusion Model is a type of neural network that generates images (and more recently, video) by starting from pure noise and gradually denoising it into a meaningful image. This is the architecture behind Midjourney, Stable Diffusion, DALL-E, and FLUX.
The intuition
Start with the forward process (the part the model will later reverse):
- Take a photo of a cat
- Add a little noise → the image gets slightly grainy
- Add more → grainier still
- Repeat 1000 times → the image becomes pure noise (like a TV with no signal)
A diffusion model learns to do the opposite of this: starting from pure noise → denoising step by step → recovering an image of a cat.
When generating a new image: start with random noise plus a text prompt → the model gradually denoises into an image that matches the prompt.
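To make the forward "add noise" half concrete, here is a minimal sketch in Python (NumPy only, with an illustrative DDPM-style linear noise schedule; the reverse, denoising half is what a real model learns with a neural network):

```python
import numpy as np

# Minimal sketch of the forward (noising) process, DDPM-style.
# Schedule values are illustrative, not tuned.
T = 1000                                  # number of noising steps
betas = np.linspace(1e-4, 0.02, T)        # per-step noise variance
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal retention

def add_noise(x0, t, rng=np.random.default_rng(0)):
    """Return a version of x0 noised to step t (0 = clean, T-1 = static)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

# Stand-in for the cat photo: a 64x64 RGB array with values in [0, 1]
image = np.random.rand(64, 64, 3)
slightly_noisy = add_noise(image, t=50)   # still mostly recognizable
pure_noise = add_noise(image, t=T - 1)    # essentially TV static
```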
Why does this architecture work so well?
- Stable: easier to train than GANs (the previous contender)
- High quality: good detail, few artifacts
- Diverse: the same prompt with different starting noise → different images
- Conditioning: easy to guide with text, reference images, depth maps, and more
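As an illustration of the "Diverse" and "Conditioning" points, here is a minimal text-to-image sketch using the Hugging Face diffusers library (the model ID, GPU setup, and file names are assumptions; any compatible checkpoint works the same way):

```python
import torch
from diffusers import StableDiffusionPipeline

# Model ID is a placeholder: substitute any Stable Diffusion
# text-to-image checkpoint available to you.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a cat wearing a tiny wizard hat"

# Same prompt, different starting noise (seed) -> different images
for seed in (0, 1, 2):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save(f"cat_{seed}.png")
```

Each call starts from a different random noise tensor, which is why the three outputs differ even though the prompt is identical.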
Popular diffusion models (2026)
| Model | Closed/Open | Highlights |
|---|---|---|
| Midjourney v7 | Closed (web/Discord) | Best-in-class aesthetics |
| Stable Diffusion 3.5 | Open source | Massive modding community |
| FLUX.1 Pro | Closed/Open variants | Photo-realistic, excellent prompt adherence |
| DALL-E 4 | Closed (OpenAI) | ChatGPT integration, handles complex prompts |
| Imagen 4 | Closed (Google) | Inside Gemini |
| Ideogram 3 | Closed | Strong with text inside images |
Diffusion for video
Same principle, extended along a temporal dimension:
- Sora (OpenAI) — 60-second clips
- Veo 3 (Google) — cinematic quality
- Kling 2 (China) — strong character consistency
- Runway Gen-4 — fine-grained control for professional creators
Video generation costs 100-1000× more compute than still images → still expensive and slow.
Diffusion vs LLM — how is image gen different from text gen?
| | LLM (text gen) | Diffusion (image gen) |
|---|---|---|
| Output | Sequential tokens | Complete image after N steps |
| Speed | 30-100 tokens/sec | 1-10 seconds/image |
| Conditioning | Text prompt | Text prompt + reference image + ControlNet |
| Partial editing | Hard | Easy (inpainting, outpainting) |
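The "Partial editing" row is worth a concrete example. Below is a minimal inpainting sketch with diffusers (model ID, file names, and GPU setup are assumptions):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Model ID is a placeholder: substitute any inpainting-capable checkpoint.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("cat.png").convert("RGB")  # original photo
mask = Image.open("mask.png").convert("L")         # white = region to repaint

# Only the masked region is regenerated; the rest of the image is kept.
result = pipe(
    prompt="a cat wearing sunglasses",
    image=init_image,
    mask_image=mask,
).images[0]
result.save("cat_edited.png")
```

Outpainting works the same way, with the mask covering a blank border added around the original image.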
When NOT to use diffusion?
- Need precise text inside images → still tricky (much improved but not perfect)
- Need 3D model rendering → use dedicated 3D AI tools (Meshy, Tripo)
- Need the SAME asset across many images (character consistency) → hard to control fully