Have you ever noticed that ChatGPT or Claude remembers everything perfectly at the beginning of a conversation, but after an hour starts to confuse details or ask again about what you've already explained? This isn't a bug – it's a fundamental limitation on how much an AI can "hold in its head" at once. It's called the context window, and it largely determines the quality, speed, and cost of every response.
If you're not yet familiar with the basics of LLM operation, start with the foundational article "How ChatGPT, Claude, and Gemini Work: A Complete Guide 2026".
📌 TL;DR — the main points in 30 seconds:
The context window is the maximum amount of text an AI can see at once. For Claude, it's 200K tokens (~500 pages), for GPT-5 it's 400K, and for Gemini, it's up to 1M+. But more isn't always better: doubling the context quadruples the attention computation (quadratic complexity), AI remembers information from the middle of the text worse (lost in the middle), and each additional token costs real money. This is why RAG is still relevant – even in the era of million-token context windows.
🎯 What is a context window and why is it limited
Short answer: A context window is the maximum number of tokens a model can process in a single request. This includes what you wrote, what the model responded, and the entire chat history. When the window fills up, the oldest messages "fall out."
Imagine a desk. Everything the AI can see at once is laid out on this desk. The size of the desk is the context window. New documents are placed on top, old ones fall on the floor.
When you write a message in ChatGPT or Claude, the following is placed on the "desk": the system prompt (instructions for the model) + the entire conversation history + your new request. The model "looks" at all of this simultaneously and generates a response.
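The "desk" can be thought of as a token budget. Below is a minimal sketch, assuming a crude word-based token estimate (a real tokenizer gives different counts) and hypothetical helper names `estimate_tokens` and `fit_to_window`. It always keeps the system prompt and the new request and drops the oldest history first:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~1.3 tokens per whitespace-separated word.
    An assumption for illustration; a real tokenizer gives different counts."""
    return round(len(text.split()) * 1.3)


def fit_to_window(system_prompt: str, history: list[str], request: str,
                  window: int) -> list[str]:
    """Return the messages that fit into `window` tokens.

    The system prompt and the new request are always kept; history is
    added from newest to oldest until the budget runs out, so the oldest
    messages are the ones that "fall off the desk".
    """
    budget = window - estimate_tokens(system_prompt) - estimate_tokens(request)
    kept: list[str] = []
    for message in reversed(history):        # newest first
        cost = estimate_tokens(message)
        if cost > budget:
            break                            # everything older is dropped too
        kept.append(message)
        budget -= cost
    return [system_prompt, *reversed(kept), request]
```

Real chat products may use other strategies (for example, summarizing old messages instead of dropping them); this sketch shows only the simplest one.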
Important detail: the context window is measured not in words, but in tokens. A token is a fragment of text: a word, part of a word, or even a single character. In English, one word ≈ 1.3 tokens. In Ukrainian, it's more complex: due to rich morphology, one word is usually broken down into 2–4 tokens. This means that less Ukrainian text fits into the context window than English text.
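To make the difference tangible, here is a small back-of-the-envelope calculation. It uses only the ratios quoted above (about 1.3 tokens per English word, 2–4 per Ukrainian word); the function name `words_that_fit` is made up for this example, and the results are rough estimates rather than tokenizer measurements:

```python
def words_that_fit(window_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit into a context window."""
    return int(window_tokens / tokens_per_word)


WINDOW = 200_000  # tokens, the Claude figure used in this article

english = words_that_fit(WINDOW, 1.3)         # ~1.3 tokens per English word
ukrainian_best = words_that_fit(WINDOW, 2)    # 2 tokens per Ukrainian word
ukrainian_worst = words_that_fit(WINDOW, 4)   # 4 tokens per Ukrainian word

print(f"English:   ~{english:,} words")
print(f"Ukrainian: ~{ukrainian_worst:,} to {ukrainian_best:,} words")
```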
More about tokens and tokenization – in the article What are tokens: how ChatGPT sees your text.
I ran into this limitation in practice when I was building a RAG chatbot for WebsCraft. As long as the conversation was short, the bot responded quickly and accurately. But when a user asked 10–15 questions in a row, the context filled up with previous messages, and the quality of responses gradually decreased. That's when I realized how much the context window affects everything, from speed to cost.
Why the window can't be infinite
Three fundamental limitations:
- ✔️ Memory (RAM/VRAM): each token in the context requires storing the KV-cache (key-value cache). More tokens = more GPU memory. With a 200K token context, the KV-cache can occupy from several to tens of gigabytes, depending on the model architecture.
- ✔️ Speed: each new token "looks" at all previous ones. More context = slower generation of each subsequent word.
- ✔️ Cost: API providers charge for each token. A longer conversation = a more expensive subsequent request.
But the main reason is mathematical, and it deserves a separate section.
Conclusion: The context window is not a technical detail, but a fundamental trade-off between quality, speed, and cost. Each model resolves this trade-off in its own way.
🎯 Quadratic complexity: why doubling the context = 4 times more expensive
Short answer: At the core of LLMs lies the attention mechanism, where each token "looks" at every other token. This creates a quadratic dependency: doubling the context increases the attention computation not by 2, but by 4 times (the KV-cache memory grows linearly on top of that). This is the main barrier to infinite context.
1,000 tokens = 1 million attention operations.
10,000 tokens = 100 million.
200,000 tokens = 40 billion.
The growth is not linear – it's quadratic.
Analogy: a room full of people
Imagine a room full of people. Each person is a token. For everyone to understand the full context of the conversation, each person must talk to every other person. If there are 10 people in the room, that's 100 conversations (10 × 10). If there are 100 people, that's already 10,000 conversations. If there are 1,000, it's a million. You doubled the number of people – the number of conversations quadrupled. This is quadratic complexity: O(n²).
Now imagine that each "conversation" is a computation on a GPU that takes time and memory. It becomes clear why increasing the context from 4K to 200K tokens is not just "50 times more," but 2,500 times more expensive in terms of computation.
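The arithmetic behind these numbers is easy to reproduce. A minimal sketch that counts query-key pairs as n × n, the same simplification the article uses (real implementations differ by constant factors, and the function name is illustrative):

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key comparisons in full self-attention: n * n."""
    return n_tokens * n_tokens


BASELINE = 4_000  # tokens

for n in (4_000, 8_000, 32_000, 200_000, 1_000_000):
    pairs = attention_pairs(n)
    relative = pairs / attention_pairs(BASELINE)
    print(f"{n:>9,} tokens -> {pairs:>21,} pairs, {relative:>8,.0f}x the 4K baseline")
```

Going from 4K to 8K already shows the pattern: twice the tokens, four times the pairs.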
What happens inside: self-attention
In technical terms, this happens in the self-attention mechanism, which is the heart of the transformer architecture. For each token, the model computes three vectors – Query, Key, and Value. Then, the Query of each token is "compared" with the Key of every other token to determine what to pay attention to. The result is an n × n attention scores matrix, where n is the number of tokens.
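The Key and Value vectors of every token already processed are kept in the so-called KV-cache (key-value cache), a dedicated area of GPU memory, so they don't have to be recomputed for each new token. The n × n score matrix itself is not stored there, but the cache grows with every new token and becomes the "bottleneck" on systems with limited memory.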
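The Query-Key comparison described above can be written out in a few lines. This is a simplified sketch in plain Python: a single attention head, made-up 2-dimensional vectors, and no causal mask (decoder-only LLMs additionally hide future tokens). It is meant only to show where the n × n matrix comes from:

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Scaled dot-product attention for n tokens with d-dimensional vectors.

    q, k, v are lists of n vectors. Returns (weights, outputs), where
    weights is the n x n attention matrix.
    """
    d = len(q[0])
    scores = [[sum(qi * kj for qi, kj in zip(q_vec, k_vec)) / math.sqrt(d)
               for k_vec in k]          # one score per (query, key) pair
              for q_vec in q]
    weights = [softmax(row) for row in scores]
    outputs = [[sum(w * v_vec[c] for w, v_vec in zip(row, v))
                for c in range(len(v[0]))]
               for row in weights]
    return weights, outputs


# Three toy tokens with 2-dimensional vectors (made-up numbers).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, outputs = self_attention(q, k, v)
print(len(weights), "x", len(weights[0]))  # size of the attention matrix
```

With n tokens the nested loop produces n rows of n scores each, which is exactly the quadratic growth discussed in this section.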
More about the attention mechanism – in the article Transformers and the attention mechanism: why AI understands context.
What this means in practice: scaling table
| Context (tokens) | Attention Operations | ~KV-Cache Size | Relative Cost |
|---|---|---|---|
| 4,000 (GPT-3.5, 2022) | 16 million | ~100 MB | 1x |
| 32,000 | 1 billion | ~800 MB | 64x |
| 200,000 (Claude) | 40 billion | ~5 GB | 2,500x |
| 1,000,000 (Gemini) | 1 trillion | ~25+ GB | 62,500x |
Note: KV-cache sizes are approximate and depend on the specific model architecture, the number of attention layers, and computation precision.
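A common way to estimate KV-cache size is to multiply out the number of cached values per token. The sketch below assumes a hypothetical model configuration (24 layers, 2 KV heads of dimension 128, fp16 values) chosen only so that the results land in the same range as the table; real models have different parameters and therefore different sizes:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens.

    The leading factor 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model configuration (not taken from any specific model):
CONFIG = dict(layers=24, kv_heads=2, head_dim=128, bytes_per_value=2)  # fp16

for tokens in (4_000, 32_000, 200_000, 1_000_000):
    size_gb = kv_cache_bytes(tokens, **CONFIG) / 1e9
    print(f"{tokens:>9,} tokens -> {size_gb:6.2f} GB")
```

Note that the cache itself grows linearly with the number of tokens; the quadratic part is the attention computation in the second column.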
How it feels on real hardware
I experienced this in practice when working with Ollama on my Mac M1. I increased the context window from 2K to 8K tokens – and the model noticeably slowed down, and Activity Monitor showed a several-gigabyte jump in memory usage. When I tried to set a 16K context on a 7B model, the system started swapping to disk – and the response, instead of a second, took 30+.
This is the same quadratic complexity, just scaled down to a single laptop instead of a Google data center. More about memory limitations on weak hardware – in the article Ollama on 8 GB RAM: which models actually work.
Three walls created by quadraticity
- ✔️ Memory wall: The KV-cache for 1M tokens can occupy 25+ GB of GPU memory – more than the entire VRAM of most consumer graphics cards. Even on a data center GPU like A100 (80 GB), this is a significant portion of the resource.
- ✔️ Speed wall: Each new token of the response requires "reviewing" all previous context tokens, so with a 200K context, generating each word takes noticeably longer than with 4K. Processing the prompt itself also takes longer, which the user feels as a delay before the response starts (time to first token).
- ✔️ Money wall: more computation = more GPU time = higher cost per request. This is why API providers charge for input tokens, and some (like Anthropic) charge double when the context exceeds a certain threshold.
Conclusion: Quadratic complexity is not a problem that can be solved simply by adding servers. It's a fundamental mathematical property of the transformer architecture that applies equally to a laptop with 8 GB of RAM and a Google data center with thousands of GPUs. This is why companies spend millions on researching alternative architectures – and why RAG remains a more practical solution than endlessly increasing context.
🎯 Lost in the middle: why AI remembers the beginning and end better
Short answer:
Even if a model can technically process 200K or 1M tokens,
information in the middle of the context is remembered worse than at the beginning
or end. Studies show a 20–50% drop in accuracy
for information from the middle of a long context. This is not a bug of a specific
model—it's a fundamental property of the transformer architecture.
Imagine you were given a 500-page book and asked to
find one specific phrase. You remember the introduction and the last
chapter well—but what was on page 247? The same happens with AI.
Psychologists call this the "serial position effect"—
and it turns out LLMs suffer from it just like humans do.
This phenomenon has been named "lost in the middle" after
a foundational study by researchers from Stanford, UC Berkeley, and Samaya AI
(Liu et al., 2023).
The authors tested models on two tasks: finding an answer among several
documents and extracting key-value pairs from a long list.
In both cases, they found a U-shaped curve: accuracy is highest
when relevant information is at the beginning or end
of the context, and drops significantly when it's in the middle.
Moreover, the effect was observed in all tested models—from GPT-3.5
to GPT-4 and Claude.
Specific numbers: how serious is the problem
According to
Chroma Research (2025),
which tested 18 frontier models including GPT-4.1, Claude Opus 4
and Gemini 2.5:
- ✔️ Information at the beginning and end of the context: accuracy 85–95%
- ⚠️ Information in the middle: accuracy drops to 76–82%
- ❌ With 100K+ context tokens: overall accuracy drop 20–50% compared to 10K
- ✔️ Claude models degrade the slowest, but no model is immune
A separate study by Du et al. (2025) showed an even more alarming fact:
even when irrelevant tokens were replaced with empty spaces and the model
was forced to "look" only at relevant information, performance
still dropped by 13.9–85% as the context length increased.
This means the problem is not just "distraction": the sheer volume
of tokens hinders the model's ability to reason effectively.
Why it happens: architectural reasons
Researchers at
MIT (2025)
found a specific mechanism. They created a theoretical framework to
analyze information flow in a transformer and identified two reasons:
- ✔️ Attention masking: the causal mask
in a transformer allows tokens to "look" only at previous ones.
This creates a natural bias—the last tokens have access
to the most context, the first ones receive the most
attention from subsequent ones.
- ✔️ Positional encodings:
methods like RoPE (Rotary Position Embedding) gradually
"fade" with distance—the further two tokens are from each other,
the weaker their connection. Tokens in the middle are
far enough from both the beginning and the end.
The result is a U-shaped attention curve: strong focus on the beginning
(primacy bias), strong focus on the end (recency bias),
and a "blind spot" in the middle.
Why it matters for practice
When you have a long conversation with Claude or ChatGPT, your early
messages gradually "fall" into the middle of the context.
New messages are always at the end, and the system prompt is at the beginning.
But important details you explained in the 15th message
end up in the very zone where the model performs worst.
I noticed this in my own experience: during long sessions working
with Claude, while discussing the architecture of a Spring Boot project,
the model began to "forget" decisions made at the beginning of the conversation.
Only one thing helped—periodically repeating key details
or starting a new conversation with a summary of the previous one.
Practical recommendations
- ✔️ For long conversations: periodically remind the model
of key details or start a new conversation with a brief summary
- ✔️ For RAG systems: if loading multiple documents
into context, place the most important ones first or last—never
in the middle
- ✔️ For prompts: put the main instruction at the beginning
(system prompt) and the specific task at the end (user message).
Leave the middle for auxiliary context that is less critical
- ✔️ For developers: use re-ranking
in your RAG pipeline—reorder documents by relevance
before inserting them into the context
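The "first or last, never in the middle" advice for RAG can be sketched as a small reordering step. This is one possible way to do it, assuming the documents are already sorted by relevance by a retriever or re-ranker; the function name `order_for_context` is hypothetical:

```python
def order_for_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the edges of the context.

    Input is sorted from most to least relevant. Documents are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked d1 (best) to d5 (worst), this yields the order d1, d3, d5, d4, d2: the two strongest documents sit at the very beginning and the very end.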
More on the difference between approaches and when to choose which—
in the article LLM vs RAG in 2026: why it's not the same thing and when to use what.
Conclusion: The advertised context window and actual
effectiveness are different things. A model with a 200K context that
works consistently across the entire range is more valuable in practice
than a model with 1M that "loses" the middle. And the best way to combat
the problem is not to increase the context, but to reduce it through RAG and compression,
giving the model only what it truly needs.
🎯 Comparison: Claude vs GPT vs Gemini — who remembers how much
Short answer:
Context window sizes in 2026: Claude Opus 4.6 — 200K tokens
(1M in beta), GPT-5.4 — up to 1M, Gemini 3 Pro — up to 2M+.
But advertised size and actual effectiveness are different things.
A larger context window is like a bigger backpack.
You can put more things in it, but finding the right one
becomes increasingly difficult.
| Model | Context | Effective Range* | Price (input/1M tokens) | Strong Suit |
|---|---|---|---|---|
| Claude Opus 4.6 | 200K | ~190K (stable) | ~$15 | Least quality degradation |
| Claude Sonnet 4 | 200K (1M beta) | ~180K | ~$3 | Balance of price and quality |
| GPT-5.4 | 1M (API) | ~400K | ~$1.50 | Large volume, affordable price |
| GPT-4.1 | 1M (API) | ~600K | ~$2 | Coding, large codebases |
| Gemini 2.5 Pro | 1M | ~700K | ~$1.25 | Multimodality |
| Gemini 3 Pro | 2M+ | ~1M | ~$12 | Maximum volume |
| Llama 4 Scout | 10M | depends on infrastructure | free (self-hosted) | Open-source, data sovereignty |
* "Effective Range" is the approximate volume at which the model
maintains stable quality without significant degradation. Based on data
from Elvex,
AIMultiple
and Morph.
Actual performance depends on the task and content type.
An important nuance: hidden surcharge for long context
Some providers charge a higher price when the context exceeds
a certain threshold. For example, according to
Morph,
Anthropic charges double for input tokens and 1.5x for output
when Claude's context exceeds 200K in the 1M beta mode.
This is logical—longer context requires more computation.
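The effect of such a surcharge on a single request can be estimated as follows. This is a sketch under stated assumptions: the 2x input and 1.5x output multipliers and the 200K threshold come from the text above, the rule that the surcharge applies to the whole request is a simplification, and the output price in the usage example is a placeholder, not a quoted tariff:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float,
                 threshold: int = 200_000,
                 input_multiplier: float = 2.0,
                 output_multiplier: float = 1.5) -> float:
    """Cost of one request in dollars; prices are per 1M tokens.

    Assumes the surcharge applies to the whole request once the input
    exceeds `threshold` (a simplification; check the provider's pricing
    page for the actual rule).
    """
    if input_tokens > threshold:
        input_price *= input_multiplier
        output_price *= output_multiplier
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Placeholder prices: $3 per 1M input tokens, $15 per 1M output tokens.
below = request_cost(150_000, 1_000, input_price=3.0, output_price=15.0)
above = request_cost(300_000, 1_000, input_price=3.0, output_price=15.0)
print(f"150K input: ${below:.4f}, 300K input: ${above:.4f}")
```

Doubling the input from 150K to 300K tokens roughly quadruples the bill in this example, because both the volume and the per-token price go up.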
Conclusion: Choose a model not by its maximum context size,
but by its effective range and stability on your tasks.
200K stable tokens are often more useful than 1M with quality degradation.
🎯 Four ways to bypass context limitations
Short answer:
Instead of waiting for infinite context, the industry
has developed several approaches: RAG (store information externally and
retrieve on demand), context compression, optimized attention architectures,
and fundamentally new architectures without attention at all.
Each approach has its trade-offs—and in practice, the best result
comes from their combination.
1. RAG (Retrieval-Augmented Generation) — external memory
The idea is simple: instead of cramming everything into the context,
information is stored in a vector database. When a query arrives—
only relevant fragments are retrieved from the database and inserted
into the context. The window remains small, but the model "knows" what's needed.
I implemented exactly this approach on
WebsCraft:
instead of loading all 500 blog articles into the model's context,
I store them in pgvector and retrieve only the 3–5 most relevant
fragments for each query. The context remains ~2000 tokens
instead of millions—and the answer comes in seconds, not minutes.
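The retrieval step itself can be illustrated without any database. The following is a toy in-memory sketch, not the WebsCraft implementation: the 3-dimensional "embeddings" are made up, and in a real system they would come from an embedding model with the search running inside a vector store such as pgvector:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k=3, min_score=0.5):
    """Return up to k (score, text) pairs with similarity >= min_score.

    `chunks` is a list of (embedding, text) pairs, scanned in memory.
    """
    scored = [(cosine(query_vec, vec), text) for vec, text in chunks]
    scored = [item for item in scored if item[0] >= min_score]
    return sorted(scored, reverse=True)[:k]


# Made-up 3-dimensional "embeddings" purely for illustration.
chunks = [
    ([0.9, 0.1, 0.0], "How the context window works"),
    ([0.8, 0.2, 0.1], "Tokens and tokenization"),
    ([0.0, 0.1, 0.9], "Company news"),
]
query = [1.0, 0.0, 0.0]

context = "\n".join(text for _, text in retrieve(query, chunks, k=2))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is a context window?"
```

Only the two closest fragments reach the prompt; the unrelated one is filtered out by the relevance threshold before it can take up context space.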
Advantages: cheap (small context = fewer tokens = less money),
fast (less computation), accurate (the model sees only relevant information,
no "noise" from irrelevant information).
Limitations: quality depends on search quality.
If the system retrieves incorrect fragments—the model will give an incorrect
answer. Requires careful tuning of chunking, embeddings,
and relevance thresholds.
More on the difference between approaches—
in the article LLM vs RAG in 2026: why it's not the same thing and when to use what.
And about the architecture of production-ready RAG systems—
in the full guide to RAG.
2. Context Compression
Not all tokens in the context are equally useful. Words like "and", "in",
"also" carry minimal information but take up space in the context window.
Compression methods find and remove such uninformative
tokens, leaving only the essence.
The most well-known method is LLMLingua from Microsoft.
It uses a small language model (e.g., GPT-2) to assess
the "surprise" (perplexity) of each token. Tokens with low
informativeness are removed. The result is compression up to 20x
with minimal quality loss.
For RAG systems, there is an extended version—
LongLLMLingua.
It additionally considers the user's query during compression and
reorders documents in the context—placing the most relevant
at the beginning and end. This directly helps with the "lost in the middle"
problem we discussed in section 3.
According to the researchers, LongLLMLingua improved accuracy by up to 21.4%
while using roughly 4 times fewer tokens.
Advantages: works with any model without changing the architecture,
significantly reduces API costs.
Limitations: adds a processing step before each query,
there's a risk of removing an important token that appears "unimportant"
to the small compressor model.
3. Optimized Attention (Flash Attention, Sparse Attention, Ring Attention)
This approach doesn't change the transformer architecture—it optimizes
computations within it. Three main methods:
Flash Attention: rearranges the order of attention computations
to minimize data movement between the GPU's main memory (HBM) and its much faster on-chip memory (SRAM).
In practice, this provides a 2–4x speedup and a significant reduction
in memory consumption, with no change in response quality, because the result is mathematically the same.
Flash Attention is already built into most modern training and inference stacks.
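One of the ideas that makes this possible is computing softmax block by block while keeping only a few running statistics. The sketch below shows that arithmetic for a single query with scalar values; it is a plain-Python illustration of the "online softmax" trick, not the actual GPU kernel, and the function names are made up:

```python
import math


def attention_row(scores, values):
    """Reference: softmax(scores) applied to scalar values, all at once."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, block=2):
    """Same result, but reading scores block by block (online softmax).

    Only three running numbers are kept: the max seen so far, the
    running denominator and the running weighted sum. The full row of
    softmax weights is never materialized.
    """
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        scale = math.exp(running_max - new_max)   # rescale old statistics
        denom = denom * scale + sum(math.exp(s - new_max) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - new_max) * v
                                for s, v in zip(s_blk, v_blk))
        running_max = new_max
    return acc / denom
```

Both functions return the same value up to floating-point rounding, which is why this kind of optimization does not affect response quality.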
Sparse Attention—instead of each token
"looking" at every other token (full attention), it allows looking
at only a subset: neighboring tokens + a few "global" anchor
points. This reduces complexity from O(n²) to O(n√n) or even O(n log n).
The trade-off: the model might miss distant but important connections.
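To get a feel for the savings, one can count how many token pairs a sparse pattern actually scores. The sketch below assumes one specific pattern, a sliding window plus a few global anchor tokens at the start of the sequence; other sparse schemes use different patterns and have different complexity:

```python
def full_pairs(n: int) -> int:
    """Pairs scored by full attention."""
    return n * n


def sparse_pairs(n: int, window: int, n_global: int) -> int:
    """Pairs scored by a sliding-window + global-token pattern.

    Token i attends to token j if they are within `window` positions of
    each other, or if either of them is one of the first `n_global`
    "anchor" tokens.
    """
    count = 0
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window
            is_global = i < n_global or j < n_global
            if local or is_global:
                count += 1
    return count


n = 1_000
print(f"full:   {full_pairs(n):,} pairs")
print(f"sparse: {sparse_pairs(n, window=16, n_global=4):,} pairs")
```

With a fixed window and a fixed number of anchors, the sparse count grows roughly in proportion to n rather than n², at the price of skipping every pair outside the pattern.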
Ring Attention: distributes a long sequence
across multiple GPUs, where each GPU processes its fragment and passes
key-value blocks around in a ring. This allows scaling context proportionally
to the number of GPUs. Techniques of this kind are believed to be behind million-token context windows such as Gemini's, although Google has not published the architectural details.
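Advantages: Flash and Ring Attention compute exact attention, so they do not change model quality;
all three work with existing architectures and provide significant acceleration.
Limitations: Flash and Ring Attention do not solve the fundamental problem
of quadratic complexity, they only push the wall further. With a sufficiently
large context, O(n²) will still win. Sparse Attention does lower the complexity, but at the risk of missed connections.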
4. New architectures without attention (Mamba, RWKV, State Space Models)
The most radical approach is to abandon attention altogether
and build a model on a different mathematical basis.
Mamba (State Space Models)—processes sequences
linearly: O(n) instead of O(n²). Each token is processed once,
and the model maintains a "state" that accumulates information
about previous tokens. This is similar to how a person reads a book—
without re-reading every page for each new paragraph,
but keeping a "summary of what's read" in their head.
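The "summary in the head" idea can be illustrated with the simplest possible recurrence. This is a toy scalar example with fixed, made-up coefficients; Mamba itself uses learned, input-dependent parameters and a vector-valued state, so the sketch shows only the one-pass, fixed-memory principle:

```python
def scan(inputs: list[float], decay: float = 0.9, gain: float = 0.1) -> list[float]:
    """Process a sequence in one pass with a fixed-size state.

    state_t = decay * state_{t-1} + gain * x_t
    Work is O(n) and memory for the state is O(1), regardless of length.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + gain * x   # fold the new token into the summary
        outputs.append(state)
    return outputs
```

Each input is touched exactly once, and nothing resembling an n × n matrix or a growing KV-cache is ever built.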
RWKV: a recurrent architecture that aims for transformer-level performance.
It combines the advantages of RNNs (linear complexity) and
transformers (generation quality). The model can even run on weak hardware
due to low memory requirements.
Advantages: theoretically unlimited context,
linear scaling, significantly lower memory consumption.
Limitations: currently lag behind transformers in quality
on complex tasks—reasoning, long document analysis, coding.
This is an active area of research. Some new models (Jamba by AI21)
combine Mamba with transformer layers, trying to get the best of both worlds.
Summary table of approaches
| Approach | Complexity | Quality | Maturity | Best for |
|---|---|---|---|---|
| RAG | Context-independent | High (if retrieval is good) | Production-ready | Large knowledge bases, documents |
| Compression | O(n) for compression | High (up to 20x compression) | Production-ready | Long conversations, cost optimization |
| Flash/Sparse Attention | O(n²) → O(n√n) | Lossless | Built into models | General acceleration |
| Mamba/RWKV | O(n) | Lower on complex tasks | Research / early production | Potentially everything |
Conclusion: No single method is ideal.
RAG is the most practical right now and proven in production.
Compression is an effective supplement that also saves money.
Optimized attention is already built into the models you use.
New architectures are the potential future that could make all of the above
irrelevant. The most effective approach in 2026 is a combination:
RAG for core information + optimized long context
for the current conversation.
🎯 How Much Does It Cost: From One Google Query to Scale
Short answer:
Every token in the context is real GPU computation that someone pays for. One ChatGPT query costs ~$0.001–0.01. Multiply that by
billions of Google AI Overviews queries, and you'll understand why companies
so meticulously optimize context size.
When you ask ChatGPT "what's the weather?" — it costs
a fraction of a cent. When you upload a 100-page document
and ask 20 questions — that's already tens of cents. At Google's scale
— that's millions of dollars every day.
Cost of a Single Query
A typical AI chatbot query involves approximately 500–2000 input tokens
(your query + system prompt + context) and 200–500 output tokens
(response). At Claude Sonnet's price of ~$3 per 1M input tokens:
- ✔️ Simple query (1K tokens): ~$0.003
- ✔️ Query with document context (10K tokens): ~$0.03
- ✔️ Long conversation (100K tokens): ~$0.30
- ⚠️ Maximum context (200K tokens): ~$0.60
Note: the same conversation becomes more expensive with each message,
because the model "re-reads" the entire previous context every time.
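This "re-reading" effect adds up quickly over a conversation. A minimal sketch, assuming every turn adds the same number of tokens to the history, counting input tokens only, and using the ~$3 per 1M input tokens figure from above (the function name is illustrative):

```python
def conversation_input_cost(tokens_per_turn: int, turns: int,
                            price_per_million: float = 3.0) -> float:
    """Total input cost in dollars when every turn re-sends the full history.

    Turn t sends t * tokens_per_turn input tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    total_tokens = tokens_per_turn * turns * (turns + 1) // 2
    return total_tokens * price_per_million / 1_000_000


one_chat = conversation_input_cost(1_000, 20)        # one 20-turn conversation
separate = 20 * conversation_input_cost(1_000, 1)    # 20 independent queries
print(f"20-turn chat: ${one_chat:.2f}, 20 separate queries: ${separate:.2f}")
```

Under these assumptions the 20-turn conversation costs about ten times more than twenty separate 1K-token queries, purely because of the accumulated history.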
Google AI Overviews: Scale of Expenses
Google processes approximately 8.5 billion search queries per day.
AI Overviews (AI-generated answers at the top of the results) are shown
for about 10–15% of queries — that's ~1 billion AI generations daily.
Even with Google's internal cost (own TPU chips, own
Gemini model) — at $0.0001 per query × 1 billion = approximately
$100,000 per day, or ~$36 million per year just for AI
answers in search.
For comparison: I built a RAG bot for searching articles on
WebsCraft —
at 100 queries per day, it costs me ~$2 per month.
It's the same technology as in Google AI Overviews — the difference
is only in the scale, 10 million times greater.
Why Local AI is Radically Cheaper
When you run a model through
Ollama on your computer,
the context window is limited by RAM, but the marginal cost of each query is effectively $0 (apart from electricity).
No API tariffs, no tokens to pay for. You've already "paid"
for your hardware, and you can make an unlimited number of queries.
This is why for regular tasks with confidential data,
local AI via Ollama is the optimal choice in terms of cost.
More details — in the article How Much AI Costs: Tokens, GPUs, and Why Google Spends Millions.
Conclusion: The context window is not just a technical limitation,
but a financial multiplier. The longer the conversation and the larger the context — the
more expensive each subsequent query. Optimizing context size (through
RAG, compression, or smart conversation management) is not just
an improvement in quality, but also direct cost savings.
❓ Frequently Asked Questions (FAQ)
What is a context window in simple terms?
It's the maximum amount of text that an AI can "see" at once.
It includes your query, the entire previous conversation, and system instructions.
It's measured in tokens: fragments of text, each roughly
equivalent to 0.75 English words or 0.25–0.5 Ukrainian words.
Why does ChatGPT forget what I said earlier?
When the conversation exceeds the context window, the oldest messages
"fall out." Even within the window, the model remembers information
from the middle less well (the "lost in the middle" effect). For long conversations,
it helps to periodically remind the model of key details or start a new conversation
with a summary of the previous one.
What is the context window for Claude, ChatGPT, and Gemini?
As of March 2026: Claude Opus 4.6 — 200K tokens
(~500 pages), GPT-5.4 — up to 1M via API, Gemini 2.5 Pro — 1M,
Gemini 3 Pro — 2M+. But advertised size and effective range are
different things. More details — in section 4 of this article.
Why is RAG still relevant if there are million-token contexts?
Three reasons: cost (loading a million tokens for each
query is expensive), quality (lost in the middle reduces accuracy), speed
(longer context = slower response). RAG provides only relevant
fragments — cheap, fast, accurate.
More details — in the article
LLM vs RAG.
Can the context window be increased in Ollama?
Yes, via the num_ctx parameter in a Modelfile or the
OLLAMA_CONTEXT_LENGTH environment variable. But on a system with 8 GB RAM, increasing
the context beyond 4096 tokens can cause swapping to disk
and a sharp slowdown. More details are
in the article Ollama on 8 GB RAM.
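The same parameter can also be set per request through Ollama's HTTP API, by passing `num_ctx` inside the `options` object. A minimal sketch using only the standard library; the model name `llama3` and the default local address are assumptions and should be replaced with whatever is installed and configured on your machine:

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3",
                           num_ctx: int = 4096,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    `num_ctx` goes into the `options` object and sets the context size
    for this request only.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running Ollama server with the model pulled:
# with urllib.request.urlopen(build_generate_request("Hello")) as resp:
#     print(json.loads(resp.read())["response"])
```

Starting with a small `num_ctx` and raising it step by step while watching memory usage is a safer way to find the limit of a given machine than jumping straight to a large value.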
How much does a long conversation with ChatGPT via API cost?
The price increases with each message because the model re-reads the entire
previous context. A 100K token conversation via Claude Sonnet
costs ~$0.30 per query. Via GPT-5.4 — ~$0.15.
To optimize costs, use RAG or context compression.
✅ Conclusions
The context window is a fundamental characteristic of LLMs,
affecting everything: response quality, generation speed, and the cost
of each query. Here's the main takeaway:
- ✔️ Context Window = AI's "Desk": everything the model can see at once. When the desk fills up — old stuff falls on the floor.
- ✔️ Quadratic Complexity: doubling the context quadruples the cost, not doubles it. This is a fundamental limitation of the transformer architecture.
- ✔️ Lost in the Middle: AI remembers the beginning and end of the text better. Information from the middle can get "lost" — accuracy drops by 20–50%.
- ✔️ More ≠ Better: 200K stable tokens (Claude) are often more practical than 1M+ with degradation (Gemini).
- ✔️ RAG Remains Relevant: even with million-token contexts, RAG is cheaper, faster, and more accurate for working with large data sets.
- ✔️ Every Token Costs Money: a longer conversation = a more expensive subsequent query. Context optimization is direct savings.
I personally use a combined approach: RAG for searching blog articles
(context ~2000 tokens), and a long context for detailed
conversations with Claude, where a deep history is needed. This is the most effective
strategy in 2026 — don't wait for infinite context, but wisely
manage what you have.
If you want to understand other aspects of LLM operation —
how AI sees text through tokens,
how it generates responses using the attention mechanism,
and why RAG is still more relevant than long context —
go to the relevant cluster articles.
And if you need a website or web application with integrated
AI functionality — RAG search, chatbot, or analytics —
contact us at WebsCraft,
we'll help you implement it.
📖 Sources