What is Inference (AI Inference)?
The process of running a trained AI model to serve users — the main driver of cost and latency in any AI product.
Inference is the process of USING an already-trained model — you send an input, the model returns an output. Every time ChatGPT answers your question → that’s one inference call.
Inference vs training
| | Training | Inference |
|---|---|---|
| When | Once (or on a cycle) | Every user request |
| Cost | $10M-$1B (large models) | $0.001-$1 per request |
| Resources | Many top-tier GPUs | Fewer GPUs per request, but must scale with traffic |
| Optimized for | Throughput | Latency + cost |
Why does inference matter for businesses?
You train once, but inference runs FOREVER, for every user, on every request. For a production product it commonly adds up to 80%+ of total AI cost.
Example: a chatbot app with 10k users, each sending 10 messages a day at ~$0.01 per message → $1000/day = $30k/month on inference alone.
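A minimal back-of-the-envelope version of that math in Python; all numbers are the illustrative ones from the example above, not real pricing:

```python
# Rough inference cost estimate using the example figures above (not real pricing).
users = 10_000
messages_per_user_per_day = 10
cost_per_message = 0.01  # USD, assumed average cost of one chat completion

daily_cost = users * messages_per_user_per_day * cost_per_message
monthly_cost = daily_cost * 30

print(f"Daily: ${daily_cost:,.0f}  Monthly: ${monthly_cost:,.0f}")
# Daily: $1,000  Monthly: $30,000
```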
Factors that drive cost and latency
1. Model size
- Claude Opus (large): high accuracy, expensive, slow
- Claude Haiku (small): fast, cheap, often good enough
→ Pick the SMALLEST model that solves the task: the golden rule for cutting costs (see the routing sketch below).
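A minimal routing sketch with the Anthropic Python SDK: send most requests to a small model and escalate only when necessary. The model IDs and the `hard` flag are illustrative assumptions, not a prescription.

```python
# Sketch: route requests to the smallest model that can handle them.
# Model IDs and the `hard` flag are illustrative; check current model names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer(prompt: str, hard: bool = False) -> str:
    model = "claude-opus-4-1" if hard else "claude-3-5-haiku-latest"
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(answer("What's the capital of France?"))          # cheap and fast
print(answer("Draft a merger term sheet.", hard=True))  # worth the larger model
```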
2. Input/output tokens
- Every token costs money
- Output tokens are 4-5× more expensive than input tokens on most APIs → Trim unnecessary prompt content and ask for shorter responses (a per-request cost estimate is sketched below)
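A small helper that turns token counts into dollars makes the input/output asymmetry concrete. The per-million prices below are assumed placeholders, not any provider's actual rates:

```python
# Sketch: estimate per-request cost from token counts (prices are placeholders).
INPUT_PRICE_PER_MTOK = 3.00    # assumed $ per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed $ per 1M output tokens (5x input here)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# A 2,000-token prompt with a 500-token answer:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0135 -- the short output costs more than the long input
```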
3. Batching
Sending many requests together (batch APIs) is usually cheaper than one-by-one.
- Anthropic Batch API: 50% off
- OpenAI Batch API: 50% off
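A sketch of submitting a batch through Anthropic's Message Batches API. Method names follow the current Python SDK, so verify them against the SDK version you use; the custom IDs and documents are made up:

```python
# Sketch: submit many requests as one batch instead of calling the API in a loop.
import anthropic

client = anthropic.Anthropic()

docs = ["first document ...", "second document ..."]  # placeholder inputs

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",  # used to match results back to inputs later
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
            },
        }
        for i, doc in enumerate(docs)
    ]
)
print(batch.id)  # results arrive asynchronously; poll with client.messages.batches.retrieve(batch.id)
```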
4. Caching
Cache fixed prompt prefixes (system prompts, long instructions) so you don't pay full price for them on every request.
- Anthropic prompt caching: up to 90% off the cached portion
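A sketch of Anthropic prompt caching: mark a long, fixed system prompt with `cache_control` so subsequent requests read it from cache at the discounted rate. The system prompt and model ID here are placeholders:

```python
# Sketch: cache a long, fixed system prompt so repeated requests reuse it.
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a support assistant. ..."  # imagine many pages of policy text here

response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.usage)  # cache_read_input_tokens shows how much was served from cache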
5. Streaming
Users see the first token in under 1 second instead of waiting 10 seconds for the full response. Total cost is the same, but UX is much better.
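Streaming with the Anthropic Python SDK looks roughly like this (OpenAI and Google offer equivalent streaming modes); the prompt and model are placeholders:

```python
# Sketch: stream the response so the user sees text as soon as it is generated.
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-3-5-haiku-latest",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain inference in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # first tokens appear almost immediately
```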
Self-hosted inference vs API
Using an API (OpenAI, Anthropic, Google)
Pros:
- No worries about hardware, scaling, or ops
- Always access to the strongest models
Cons:
- Vendor lock-in
- Can be more expensive at large scale
- Privacy: your data passes through a third party
Self-hosting (Llama, Mistral, Qwen open source)
Pros:
- Full privacy
- Can be cheaper at large scale
- Total customization
Cons:
- Needs an ops team that understands GPUs, vLLM, CUDA
- Open-source models still trail frontier closed models
- Hardware investment required
→ Rule of thumb: under 1M requests/month → API. Over 100M requests/month → consider self-hosting. Anywhere in between depends on the situation.
Popular self-hosting tools
- vLLM — a high-throughput inference and serving engine, originally from UC Berkeley
- TGI (Text Generation Inference) — by HuggingFace
- Ollama — run LLMs locally for dev/personal use
- LM Studio — UI for those who don’t want a CLI
- MLX — Apple's ML framework, optimized for Apple Silicon
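As an example of what self-hosting looks like in practice, vLLM exposes an OpenAI-compatible HTTP server, so existing client code can be pointed at it by changing only the base URL. The model name and port below are illustrative; the server is started separately with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct`:

```python
# Sketch: query a self-hosted vLLM server through the OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default local endpoint
    api_key="not-needed",                 # the local server ignores the key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server was started with
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```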