TopDev
co-ban Beginner

What is Computer Vision?

The branch of AI that lets machines 'see' and interpret images and video — from face recognition to self-driving cars.

Updated: May 5, 2026 · 2 min read

Computer Vision is the branch of AI that lets machines “see” and interpret images and video — distinguishing objects, counting them, reading text, detecting motion, and generating new images.

What can computer vision do?

Classic (already widely deployed)

  • Face recognition (Face ID, CCTV)
  • OCR (reading IDs, invoices, license plates)
  • Object detection (self-driving cars spotting pedestrians)
  • Image classification (Google Photos auto-tagging)
  • Pose estimation (games, fitness apps)
  • Medical imaging (reading X-rays, MRIs)

New (2023-26)

  • Image generation (Midjourney, Stable Diffusion)
  • Video generation (Sora, Veo)
  • Visual Question Answering (send an image + question to GPT-4o/Claude)
  • 3D reconstruction from 2D images
  • Visual agents (Computer Use controlling a GUI)

Common model architectures

TypeUsed forExamples
CNN (Convolutional)Classic classification, detectionResNet, EfficientNet, YOLO
Vision Transformer (ViT)Modern SOTA across most tasksViT, Swin
DiffusionImage generationStable Diffusion, FLUX
CLIPText-image bridgingOpenAI CLIP
SAMImage segmentationMeta SAM 2

Real-world applications

Business

  • eKYC (customer verification): snap an ID + selfie → for example, Vietnamese banks and e-wallets verify in seconds
  • Security camera AI: intrusion detection, customer counting
  • Logistics: reading license plates entering/exiting warehouses, counting items
  • Healthcare: AI-assisted diagnosis on lung X-rays and eye scans

Personal

  • Google Photos / iCloud Photos auto-tagging
  • Snapchat / Instagram filters
  • Social apps recognizing friends in photos

Computer Vision vs Image Generation

  • Computer Vision (classic): UNDERSTANDS existing images
  • Image Generation: CREATES new images
  • Multimodal LLM: both — understands, generates, and combines with text

By 2026, the line is blurring. GPT-4o can both “see” an image and “draw” one.

Production

  • OpenCV — classic library, every language
  • YOLO v10/v11 — fast, easy-to-deploy object detection
  • Mediapipe (Google) — realtime face, pose, hands
  • Roboflow — end-to-end platform for CV teams

LLM-based

  • GPT-4o, Claude 4.7, Gemini 2.5 — call the API, send images, ask in natural language
  • Llama 4 Vision — open source

When to use LLMs vs classic CV?

SituationPick
Realtime, edge deviceYOLO / Mediapipe (fast, runs offline)
Standard task (face, OCR)Specialized library (FaceNet, Tesseract)
Complex, fuzzy taskMultimodal LLM
Production needing determinismClassic CV
Quick prototypeLLM
Tags
#computer-vision#vision