
What is Inference (AI Inference)?

The process of running a trained AI model to serve users — the main driver of cost and latency in any AI product.

Updated: May 5, 2026 · 2 min read

Inference is the process of USING an already-trained model — you send an input, the model returns an output. Every time ChatGPT answers your question → that’s one inference call.

Inference vs training

|               | Training                 | Inference                       |
|---------------|--------------------------|---------------------------------|
| When          | Once (or on a cycle)     | Every user request              |
| Cost          | $10M-$1B (large models)  | $0.001-$1 per request           |
| Resources     | Many top-tier GPUs       | Fewer GPUs, but they must scale |
| Optimized for | Throughput               | Latency + cost                  |

Why does inference matter for businesses?

You train once, but inference runs FOREVER, for every user, on every request. In a production product it typically adds up to 80%+ of total AI cost.

Example: a chatbot app with 10k users, each sending 10 messages a day at ~$0.01 per message → $1000/day = $30k/month on inference alone.
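
The same arithmetic as a quick sanity-check script (all numbers are the assumed figures from the example, not real pricing):

```python
# Back-of-the-envelope inference cost estimate.
# All inputs are assumptions from the example above, not actual provider pricing.
users = 10_000
messages_per_user_per_day = 10
cost_per_message = 0.01  # USD, assumed blended cost per request

daily_cost = users * messages_per_user_per_day * cost_per_message
monthly_cost = daily_cost * 30

print(f"Daily:   ${daily_cost:,.0f}")    # Daily:   $1,000
print(f"Monthly: ${monthly_cost:,.0f}")  # Monthly: $30,000
```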

Factors that drive cost and latency

1. Model size

  • Claude Opus (large): high accuracy, expensive, slow
  • Claude Haiku (small): fast, cheap, often good enough → Pick the SMALLEST model that solves the task — the golden rule for cutting costs.
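
One common way to apply that rule is a simple router that sends easy requests to the small model and escalates only the hard ones. The sketch below is hypothetical: the model IDs are placeholders and the complexity label is assumed to come from your own heuristics or classifier.

```python
# Hypothetical model router: default to the cheapest model that can do the job.
SMALL_MODEL = "small-model-id"  # placeholder, e.g. a Haiku-class model
LARGE_MODEL = "large-model-id"  # placeholder, e.g. an Opus-class model

def pick_model(complexity: str) -> str:
    """Route simple tasks (FAQ, extraction, classification) to the small model."""
    if complexity in ("simple", "medium"):
        return SMALL_MODEL
    return LARGE_MODEL  # reserve the expensive model for genuinely hard tasks

print(pick_model("simple"))  # small-model-id
```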

2. Input/output tokens

  • Every token costs money
  • Output is 4-5× more expensive than input on most APIs → Trim unnecessary prompt content; ask for shorter responses
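
A rough sketch of how token counts turn into dollars; the per-million-token prices here are placeholders, not any provider's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_1m_input: float = 3.0,    # placeholder price
                  usd_per_1m_output: float = 15.0,  # placeholder price (output costs more)
                  ) -> float:
    """Approximate cost of one request from its token counts."""
    return (input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output) / 1_000_000

# Trimming a bloated prompt from 4,000 to 1,000 input tokens nearly halves the cost here.
print(estimate_cost(4_000, 500))  # 0.0195
print(estimate_cost(1_000, 500))  # 0.0105
```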

3. Batching

Sending many requests together (batch APIs) is usually cheaper than one-by-one.

  • Anthropic Batch API: 50% off
  • OpenAI Batch API: 50% off
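
A minimal sketch of submitting a batch with the Anthropic Python SDK's Message Batches endpoint; the exact method path and the model ID below may differ by SDK version, so treat this as illustrative and check the current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One batch of many independent requests; results are produced asynchronously
# (typically within 24 hours) at roughly half the regular per-token price.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-haiku-model-id",  # placeholder: use a current small model ID
                "max_tokens": 256,
                "messages": [{"role": "user", "content": f"Summarize customer review #{i}"}],
            },
        }
        for i in range(100)
    ],
)
print(batch.id, batch.processing_status)  # poll later until the batch has ended
```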

4. Caching

Cache fixed prompt prefixes so you don’t pay for them repeatedly.

  • Anthropic prompt caching: up to 90% off the cached portion
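
With Anthropic's prompt caching you mark the large fixed prefix (system instructions, reference material) as cacheable, and repeat requests read it from the cache at a steep discount. A hedged sketch, assuming the `cache_control` content-block syntax; the model ID and the instructions variable are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

LONG_FIXED_INSTRUCTIONS = "..."  # placeholder: a large, unchanging system prompt

response = client.messages.create(
    model="claude-model-id",  # placeholder model ID
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_FIXED_INSTRUCTIONS,
            # Mark the fixed prefix as cacheable; note the prefix must exceed a
            # minimum length before the API will cache it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User question goes here"}],
)
print(response.content[0].text)
```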

5. Streaming

Stream tokens back as they are generated: users see the first token in under a second instead of waiting ~10 seconds for the full response. Total cost is the same, but the UX is much better.
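
With the Anthropic Python SDK, streaming looks roughly like the sketch below; the model ID is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

# Print the response token-by-token as it arrives instead of waiting for the whole message.
with client.messages.stream(
    model="claude-model-id",  # placeholder model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain AI inference in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```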

Self-hosted inference vs API

Using an API (OpenAI, Anthropic, Google)

Pros:

  • No worries about hardware, scaling, or ops
  • Always access to the strongest models

Cons:

  • Vendor lock-in
  • Can be more expensive at large scale
  • Privacy: your data passes through a third party

Self-hosting (open-source models such as Llama, Mistral, or Qwen)

Pros:

  • Full privacy
  • Can be cheaper at large scale
  • Total customization

Cons:

  • Needs an ops team that understands GPUs, vLLM, CUDA
  • Open-source models still trail frontier closed models
  • Hardware investment required

→ Rule of thumb: under 1M requests/month → API. Over 100M requests/month → consider self-hosting. Anywhere in between depends on the situation.

Common tools for self-hosted inference

  • vLLM — a high-throughput open-source inference engine (from UC Berkeley)
  • TGI (Text Generation Inference) — by HuggingFace
  • Ollama — run LLMs locally for dev/personal use
  • LM Studio — UI for those who don’t want a CLI
  • MLX — optimized for Apple Silicon
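
As a taste of the self-hosted path, here is a minimal offline-inference sketch with vLLM; the model name is illustrative and assumes you have access to the weights and enough GPU memory:

```python
from vllm import LLM, SamplingParams

# Load an open-weights model onto local GPU(s); the model name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is AI inference?"], sampling)
print(outputs[0].outputs[0].text)
```
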
Tags: #inference #llm #production