Technical · Beginner
What is a Context Window?
The maximum amount of text an LLM can 'remember' in a single pass. It determines how much information you can fit into a prompt.
Updated: May 5, 2026 · 2 min read
The context window is the maximum number of tokens an LLM can process in a single call — including both your prompt and the model’s response.
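Everything here is counted in tokens, not characters or words. Below is a minimal sketch of checking whether a prompt fits before sending it, using the tiktoken library (recent versions know GPT-4o's encoding); the 128k window and 4k response budget are assumptions matching GPT-4o, not universal constants:

```python
import tiktoken  # pip install tiktoken

CONTEXT_WINDOW = 128_000  # assumed: GPT-4o's published window
RESPONSE_BUDGET = 4_000   # assumed: tokens reserved for the model's answer

def fits_in_context(prompt: str, model: str = "gpt-4o") -> bool:
    """True if the prompt still leaves room for the response."""
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + RESPONSE_BUDGET <= CONTEXT_WINDOW

print(fits_in_context("Summarize this report: ..."))  # True for a short prompt
```

Remember that the response counts against the window too: a prompt that "fits" with zero tokens to spare leaves the model no room to answer.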
Comparing models (2026)
| Model | Context window (tokens) | Approximate length |
|---|---|---|
| GPT-3.5 | 16k | ~30 A4 pages |
| GPT-4o | 128k | ~250 pages |
| Claude 4.7 Sonnet | 200k - 1M | ~400 - 2000 pages |
| Gemini 2.5 Pro | 2M | ~4000 pages (several thick books) |
| Llama 3.3 | 128k | ~250 pages |
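The page estimates follow from a rule of thumb of roughly 500 tokens per A4 page (about 375 English words at ~0.75 words per token). A quick sanity check; the constant is an assumption, not an exact figure:

```python
TOKENS_PER_PAGE = 500  # assumption: ~375 words per A4 page, ~0.75 words/token

windows = {"GPT-3.5": 16_000, "GPT-4o": 128_000, "Gemini 2.5 Pro": 2_000_000}
for model, tokens in windows.items():
    print(f"{model}: ~{tokens // TOKENS_PER_PAGE} pages")
# GPT-3.5: ~32 pages, GPT-4o: ~256 pages, Gemini 2.5 Pro: ~4000 pages
```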
Why does the context window matter?
Upsides of a large context
- Drop a whole document into the prompt without setting up a complex RAG pipeline
- Long conversations: more of the chat history stays in view before earlier turns fall out of the window
- Analyze an entire codebase or full book in one shot
Downsides
- Expensive: pricing is per token, and a whole book in every call adds up fast (see the worked example after this list)
- Slow: the longer the context, the longer the model takes to respond
- Diluted: the model can miss information buried in the middle of a long context (the “lost in the middle” effect)
- Harder to keep reliable: when the source material is huge, RAG, which retrieves only the relevant passages, still tends to give more accurate answers
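To make the cost point concrete, here is a back-of-the-envelope calculation. The $3 per million input tokens is an illustrative price only; check your provider's current rates:

```python
TOKENS_PER_PAGE = 500   # same rule of thumb as in the table above
PRICE_PER_MTOK = 3.00   # assumed input price, USD per million tokens

book_tokens = 400 * TOKENS_PER_PAGE                 # a 400-page book is ~200k tokens
cost_per_call = book_tokens / 1_000_000 * PRICE_PER_MTOK
print(f"${cost_per_call:.2f} per call")                 # $0.60
print(f"${cost_per_call * 1_000:.0f} per 1,000 calls")  # $600
```

Sixty cents per question sounds small until a chatbot answers a thousand of them.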
Practical rules
| Situation | Approach |
|---|---|
| < 50 pages of docs | Drop straight into the prompt |
| 50 - 500 pages | Consider a large context (Claude 1M, Gemini 2M) |
| > 500 pages | Use RAG, don’t brute-force it |
| Long conversations | Use prompt caching to save money (see the sketch below) |
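For the prompt-caching row, here is a minimal sketch using Anthropic's Messages API: a cache_control marker tells the API to cache the long document, so repeated calls with the same prefix read it at a reduced price instead of paying full input cost each time. The model name and file path are placeholders:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
big_document = open("handbook.txt").read()  # placeholder: the context you reuse

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use a current model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer questions about the attached handbook."},
        # Everything up to and including this block gets cached; later calls
        # with the same prefix read it at a fraction of the input price.
        {
            "type": "text",
            "text": big_document,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does chapter 3 cover?"}],
)
print(response.content[0].text)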
Tips for using context windows well
- Put the IMPORTANT question or instruction at the BEGINNING and the END of the prompt, so it does not get buried in the middle (see the sketch after this list)
- Structure the prompt clearly with XML tags (Claude) or markdown headings
- Use prompt caching if you reuse the same context many times (saves up to 90%)
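Putting the first two tips together, a long prompt might be laid out like this: instruction stated up front, document fenced in XML tags, instruction repeated at the end. The tag name and file are illustrative:

```python
document = open("contract.txt").read()  # placeholder: a long document

prompt = f"""Summarize the contract below, focusing on termination clauses.

<contract>
{document}
</contract>

Reminder: summarize the contract above, focusing on termination clauses."""
```

Repeating the instruction after the document costs only a few tokens and is a cheap hedge against the “lost in the middle” effect described above.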
Tags
#context #llm #token