co-ban Beginner
What is Computer Vision?
The branch of AI that lets machines 'see' and interpret images and video — from face recognition to self-driving cars.
Updated: May 5, 2026 · 2 min read
Computer Vision is the branch of AI that lets machines “see” and interpret images and video — distinguishing objects, counting them, reading text, detecting motion, and generating new images.
What can computer vision do?
Classic (already widely deployed)
- Face recognition (Face ID, CCTV)
- OCR (reading IDs, invoices, license plates)
- Object detection (self-driving cars spotting pedestrians)
- Image classification (Google Photos auto-tagging)
- Pose estimation (games, fitness apps)
- Medical imaging (reading X-rays, MRIs)
New (2023-26)
- Image generation (Midjourney, Stable Diffusion)
- Video generation (Sora, Veo)
- Visual Question Answering (send an image + question to GPT-4o/Claude)
- 3D reconstruction from 2D images
- Visual agents (Computer Use controlling a GUI)
Common model architectures
| Type | Used for | Examples |
|---|---|---|
| CNN (Convolutional) | Classic classification, detection | ResNet, EfficientNet, YOLO |
| Vision Transformer (ViT) | Modern SOTA across most tasks | ViT, Swin |
| Diffusion | Image generation | Stable Diffusion, FLUX |
| CLIP | Text-image bridging | OpenAI CLIP |
| SAM | Image segmentation | Meta SAM 2 |
Real-world applications
Business
- eKYC (customer verification): snap an ID + selfie → for example, Vietnamese banks and e-wallets verify in seconds
- Security camera AI: intrusion detection, customer counting
- Logistics: reading license plates entering/exiting warehouses, counting items
- Healthcare: AI-assisted diagnosis on lung X-rays and eye scans
Personal
- Google Photos / iCloud Photos auto-tagging
- Snapchat / Instagram filters
- Social apps recognizing friends in photos
Computer Vision vs Image Generation
- Computer Vision (classic): UNDERSTANDS existing images
- Image Generation: CREATES new images
- Multimodal LLM: both — understands, generates, and combines with text
By 2026, the line is blurring. GPT-4o can both “see” an image and “draw” one.
Popular tools / frameworks
Production
- OpenCV — classic library, every language
- YOLO v10/v11 — fast, easy-to-deploy object detection
- Mediapipe (Google) — realtime face, pose, hands
- Roboflow — end-to-end platform for CV teams
LLM-based
- GPT-4o, Claude 4.7, Gemini 2.5 — call the API, send images, ask in natural language
- Llama 4 Vision — open source
When to use LLMs vs classic CV?
| Situation | Pick |
|---|---|
| Realtime, edge device | YOLO / Mediapipe (fast, runs offline) |
| Standard task (face, OCR) | Specialized library (FaceNet, Tesseract) |
| Complex, fuzzy task | Multimodal LLM |
| Production needing determinism | Classic CV |
| Quick prototype | LLM |
Related
Tags
#computer-vision#vision