co-ban Beginner

What is Computer Vision?

The branch of AI that lets machines 'see' and interpret images and video — from face recognition to self-driving cars.

Updated: May 5, 2026 · 2 min read

Computer Vision is the branch of AI that lets machines “see” and interpret images and video — distinguishing objects, counting them, reading text, detecting motion, and generating new images.

What can computer vision do?

Classic (already widely deployed)

Face recognition (Face ID, CCTV)
OCR (reading IDs, invoices, license plates)
Object detection (self-driving cars spotting pedestrians)
Image classification (Google Photos auto-tagging)
Pose estimation (games, fitness apps)
Medical imaging (reading X-rays, MRIs)

New (2023-26)

Image generation (Midjourney, Stable Diffusion)
Video generation (Sora, Veo)
Visual Question Answering (send an image + question to GPT-4o/Claude)
3D reconstruction from 2D images
Visual agents (Computer Use controlling a GUI)

Common model architectures

Type	Used for	Examples
CNN (Convolutional)	Classic classification, detection	ResNet, EfficientNet, YOLO
Vision Transformer (ViT)	Modern SOTA across most tasks	ViT, Swin
Diffusion	Image generation	Stable Diffusion, FLUX
CLIP	Text-image bridging	OpenAI CLIP
SAM	Image segmentation	Meta SAM 2

Real-world applications

Business

eKYC (customer verification): snap an ID + selfie → for example, Vietnamese banks and e-wallets verify in seconds
Security camera AI: intrusion detection, customer counting
Logistics: reading license plates entering/exiting warehouses, counting items
Healthcare: AI-assisted diagnosis on lung X-rays and eye scans

Personal

Google Photos / iCloud Photos auto-tagging
Snapchat / Instagram filters
Social apps recognizing friends in photos

Computer Vision vs Image Generation

Computer Vision (classic): UNDERSTANDS existing images
Image Generation: CREATES new images
Multimodal LLM: both — understands, generates, and combines with text

By 2026, the line is blurring. GPT-4o can both “see” an image and “draw” one.

Popular tools / frameworks

Production

OpenCV — classic library, every language
YOLO v10/v11 — fast, easy-to-deploy object detection
Mediapipe (Google) — realtime face, pose, hands
Roboflow — end-to-end platform for CV teams

LLM-based

GPT-4o, Claude 4.7, Gemini 2.5 — call the API, send images, ask in natural language
Llama 4 Vision — open source

When to use LLMs vs classic CV?

Situation	Pick
Realtime, edge device	YOLO / Mediapipe (fast, runs offline)
Standard task (face, OCR)	Specialized library (FaceNet, Tesseract)
Complex, fuzzy task	Multimodal LLM
Production needing determinism	Classic CV
Quick prototype	LLM