
What is Inference (AI Inference)?

The process of running a trained AI model to serve users — the main driver of cost and latency in any AI product.

Updated: May 5, 2026 · 2 min read

Inference is the process of USING an already-trained model — you send an input, the model returns an output. Every time ChatGPT answers your question → that’s one inference call.

Inference vs training

|               | Training                 | Inference                       |
|---------------|--------------------------|---------------------------------|
| When          | Once (or on a cycle)     | Every user request              |
| Cost          | $10M-$1B (large models)  | $0.001-$1 per request           |
| Resources     | Many top-tier GPUs       | Fewer GPUs, but they must scale |
| Optimized for | Throughput               | Latency + cost                  |

Why does inference matter for businesses?

You train once, but inference runs FOREVER, for every user, on every request. In a production product it typically adds up to 80%+ of total AI cost.

Example: a chatbot app with 10k users, each sending 10 messages a day at ~$0.01 per message → $1000/day = $30k/month on inference alone.
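
The same arithmetic as a quick sanity-check script (all numbers are the assumed figures from the example, not real pricing):

```python
# Back-of-the-envelope inference cost estimate.
# All inputs are assumptions from the example above, not actual provider pricing.
users = 10_000
messages_per_user_per_day = 10
cost_per_message = 0.01  # USD, assumed blended cost per request

daily_cost = users * messages_per_user_per_day * cost_per_message
monthly_cost = daily_cost * 30

print(f"Daily:   ${daily_cost:,.0f}")    # Daily:   $1,000
print(f"Monthly: ${monthly_cost:,.0f}")  # Monthly: $30,000
```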

Factors that drive cost and latency

1. Model size

  • Claude Opus (large): high accuracy, expensive, slow
  • Claude Haiku (small): fast, cheap, often good enough → Pick the SMALLEST model that solves the task — the golden rule for cutting costs.
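
One common way to apply that rule is a simple router that sends easy requests to the small model and escalates only the hard ones. The sketch below is hypothetical: the model IDs are placeholders and the complexity label is assumed to come from your own heuristics or classifier.

```python
# Hypothetical model router: default to the cheapest model that can do the job.
SMALL_MODEL = "small-model-id"  # placeholder, e.g. a Haiku-class model
LARGE_MODEL = "large-model-id"  # placeholder, e.g. an Opus-class model

def pick_model(complexity: str) -> str:
    """Route simple tasks (FAQ, extraction, classification) to the small model."""
    if complexity in ("simple", "medium"):
        return SMALL_MODEL
    return LARGE_MODEL  # reserve the expensive model for genuinely hard tasks

print(pick_model("simple"))  # small-model-id
```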

2. Input/output tokens

  • Every token costs money
  • Output is 4-5× more expensive than input on most APIs → Trim unnecessary prompt content; ask for shorter responses
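
A rough sketch of how token counts turn into dollars; the per-million-token prices here are placeholders, not any provider's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_1m_input: float = 3.0,    # placeholder price
                  usd_per_1m_output: float = 15.0,  # placeholder price (output costs more)
                  ) -> float:
    """Approximate cost of one request from its token counts."""
    return (input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output) / 1_000_000

# Trimming a bloated prompt from 4,000 to 1,000 input tokens nearly halves the cost here.
print(estimate_cost(4_000, 500))  # 0.0195
print(estimate_cost(1_000, 500))  # 0.0105
```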

3. Batching

Sending many requests together (batch APIs) is usually cheaper than one-by-one.

  • Anthropic Batch API: 50% off
  • OpenAI Batch API: 50% off
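
A minimal sketch of submitting a batch with the Anthropic Python SDK's Message Batches endpoint; the exact method path and the model ID below may differ by SDK version, so treat this as illustrative and check the current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One batch of many independent requests; results are produced asynchronously
# (typically within 24 hours) at roughly half the regular per-token price.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-haiku-model-id",  # placeholder: use a current small model ID
                "max_tokens": 256,
                "messages": [{"role": "user", "content": f"Summarize customer review #{i}"}],
            },
        }
        for i in range(100)
    ],
)
print(batch.id, batch.processing_status)  # poll later until the batch has ended
```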

4. Caching

Cache fixed prompt prefixes so you don’t pay for them repeatedly.

  • Anthropic prompt caching: up to 90% off the cached portion
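
With Anthropic's prompt caching you mark the large fixed prefix (system instructions, reference material) as cacheable, and repeat requests read it from the cache at a steep discount. A hedged sketch, assuming the `cache_control` content-block syntax; the model ID and the instructions variable are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

LONG_FIXED_INSTRUCTIONS = "..."  # placeholder: a large, unchanging system prompt

response = client.messages.create(
    model="claude-model-id",  # placeholder model ID
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_FIXED_INSTRUCTIONS,
            # Mark the fixed prefix as cacheable; note the prefix must exceed a
            # minimum length before the API will cache it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User question goes here"}],
)
print(response.content[0].text)
```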

5. Streaming

Stream tokens back as they are generated: users see the first token in under a second instead of waiting ~10 seconds for the full response. Total cost is the same, but the UX is much better.
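
With the Anthropic Python SDK, streaming looks roughly like the sketch below; the model ID is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

# Print the response token-by-token as it arrives instead of waiting for the whole message.
with client.messages.stream(
    model="claude-model-id",  # placeholder model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain AI inference in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```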

Self-hosted inference vs API

Using an API (OpenAI, Anthropic, Google)

Pros:

  • No worries about hardware, scaling, or ops
  • Always access to the strongest models

Cons:

  • Vendor lock-in
  • Can be more expensive at large scale
  • Privacy: your data passes through a third party

Self-hosting (open-source models such as Llama, Mistral, or Qwen)

Pros:

  • Full privacy
  • Can be cheaper at large scale
  • Total customization

Cons:

  • Needs an ops team that understands GPUs, vLLM, CUDA
  • Open-source models still trail frontier closed models
  • Hardware investment required

→ Rule of thumb: under 1M requests/month → API. Over 100M requests/month → consider self-hosting. Anywhere in between depends on the situation.

Common tools for self-hosted inference

  • vLLM — a high-throughput open-source inference engine (from UC Berkeley)
  • TGI (Text Generation Inference) — by HuggingFace
  • Ollama — run LLMs locally for dev/personal use
  • LM Studio — UI for those who don’t want a CLI
  • MLX — optimized for Apple Silicon
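
As a taste of the self-hosted path, here is a minimal offline-inference sketch with vLLM; the model name is illustrative and assumes you have access to the weights and enough GPU memory:

```python
from vllm import LLM, SamplingParams

# Load an open-weights model onto local GPU(s); the model name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is AI inference?"], sampling)
print(outputs[0].outputs[0].text)
```
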
Tags: #inference #llm #production