Why AI Forgets: LLM Context Window, Costs & Solutions

Have you ever noticed that ChatGPT or Claude remembers everything perfectly at the beginning of a conversation, but after an hour starts to confuse details or asks again about what you've already explained? This isn't a bug – it's a fundamental limitation that determines how much an AI can "hold in its head" at once. It's called the context window, and it largely determines the quality, speed, and cost of every response.

If you're not yet familiar with the basics of LLM operation, start with the foundational article "How ChatGPT, Claude, and Gemini Work: A Complete Guide 2026".

📌 TL;DR — the main points in 30 seconds:

The context window is the maximum amount of text an AI can see at once. For Claude, it's 200K tokens (~500 pages), for GPT-5 it's 400K, and for Gemini, it's up to 1M+. But more isn't always better: doubling the context quadruples the attention computation (quadratic complexity), AI recalls information from the middle of the text less reliably (lost in the middle), and each additional token costs real money. This is why RAG is still relevant, even in the era of million-token context windows.

🎯 What is a context window and why is it limited

Short answer: A context window is the maximum number of tokens a model can process in a single request. This includes what you wrote, what the model responded, and the entire chat history. When the window fills up, the oldest messages "fall out."

Imagine a desk. Everything the AI can see at once is laid out on this desk. The size of the desk is the context window. New documents are placed on top, old ones fall on the floor.

When you write a message in ChatGPT or Claude, the following is placed on the "desk": the system prompt (instructions for the model) + the entire conversation history + your new request. The model "looks" at all of this simultaneously and generates a response.

Important detail: the context window is measured not in words, but in tokens. A token is a fragment of text: a word, part of a word, or even a single character. In English, one word ≈ 1.3 tokens. In Ukrainian, it's more complex: due to rich morphology, one word is usually broken down into 2–4 tokens. This means that less Ukrainian text fits into the context window than English text.
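The "desk" behaviour can be sketched in a few lines. This is only an illustration, not how any particular chat product is implemented: `count_tokens` and `fit_to_window` are hypothetical helpers, and the token count is a crude estimate based on the ~1.3 tokens per English word ratio rather than a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Rough estimate only: ~1.3 tokens per whitespace-separated English word."""
    return max(1, round(len(text.split()) * 1.3))


def fit_to_window(system_prompt: str, history: list[str],
                  new_message: str, window: int) -> list[str]:
    """Return the messages that fit in `window` tokens, dropping the oldest first."""
    # The system prompt and the new request are always kept.
    budget = window - count_tokens(system_prompt) - count_tokens(new_message)
    kept: list[str] = []
    # Walk the history from newest to oldest so recent turns survive.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older "falls off the desk"
        kept.append(message)
        budget -= cost
    return [system_prompt, *reversed(kept), new_message]
```

With a small window the earliest turns disappear first, which is exactly the "forgetting" a user notices in a long chat.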

More about tokens and tokenization – in the article What are tokens: how ChatGPT sees your text.

I experienced this limitation in practice when I was building a RAG chatbot for WebsCraft. As long as the conversation was short, the bot responded quickly and accurately. But when a user asked 10-15 questions in a row, the context filled up with previous messages, and the quality of responses gradually decreased. That's when I realized how much the context window affects everything – from speed to cost.

Why the window can't be infinite

Three fundamental limitations:

  • ✔️ Memory (RAM/VRAM): each token in the context requires storing its keys and values in the KV-cache (key-value cache). More tokens = more GPU memory. With a 200K token context, the KV-cache can occupy anywhere from several gigabytes to tens of gigabytes, depending on the model architecture.
  • ✔️ Speed: each new token "looks" at all previous ones. More context = slower generation of each subsequent word.
  • ✔️ Cost: API providers charge for each token. A longer conversation = a more expensive subsequent request.

But the main reason is mathematical, and it deserves a separate section.

Conclusion: The context window is not a technical detail, but a fundamental trade-off between quality, speed, and cost. Each model resolves this trade-off in its own way.

🎯 Quadratic complexity: why doubling the context = 4 times more expensive

Short answer: At the core of LLMs lies the attention mechanism, where each token "looks" at every other token. This creates a quadratic dependency: doubling the context increases the attention computation (and the size of the attention score matrix) not by 2, but by 4 times. This is the main barrier to infinite context.

1,000 tokens = 1 million attention operations. 10,000 tokens = 100 million. 200,000 tokens = 40 billion. The growth is not linear – it's quadratic.

Analogy: a room full of people

Imagine a room full of people. Each person is a token. For everyone to understand the full context of the conversation, each person must talk to every other person. If there are 10 people in the room, that's 100 conversations (10 × 10). If there are 100 people, that's already 10,000 conversations. If there are 1,000, it's a million. You doubled the number of people – the number of conversations quadrupled. This is quadratic complexity: O(n²).

Now imagine that each "conversation" is a computation on a GPU that takes time and memory. It becomes clear why increasing the context from 4K to 200K tokens is not just "50 times more," but 2,500 times more expensive in terms of computation.
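The arithmetic behind these figures is easy to reproduce. The sketch below simply counts token pairs (n × n) and compares them with a 4K baseline; it deliberately ignores everything else that affects real cost, such as the number of layers, attention heads, and hardware. The function names are illustrative.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs in full self-attention: n * n."""
    return n_tokens * n_tokens


def relative_cost(n_tokens: int, baseline: int = 4_000) -> float:
    """How many times more pairs than at the baseline context length."""
    return attention_pairs(n_tokens) / attention_pairs(baseline)


for n in (4_000, 32_000, 200_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):>17,} pairs, "
          f"{relative_cost(n):,.0f}x the 4K baseline")
```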

What happens inside: self-attention

In technical terms, this happens in the self-attention mechanism, which is the heart of the transformer architecture. For each token, the model computes three vectors – Query, Key, and Value. Then, the Query of each token is "compared" with the Key of every other token to determine what to pay attention to. The result is an n × n attention scores matrix, where n is the number of tokens.
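To make the n × n matrix concrete, here is a deliberately simplified, pure-Python sketch of scaled dot-product attention. It assumes the Query, Key, and Value vectors are already given, and it leaves out things real models have: learned projection matrices, multiple heads, and the causal mask. It is meant only to show where the quadratic number of scores comes from.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Compare every Query with every Key: the result is an n x n matrix."""
    scale = math.sqrt(len(keys[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in keys]
              for q in queries]
    return [softmax(row) for row in scores]


def attend(queries, keys, values):
    """Each output vector is a weighted mix of all Value vectors."""
    weights = attention_weights(queries, keys)
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]
```

For three tokens `attention_weights` returns a 3 × 3 matrix; for 200,000 tokens the same loop would produce 40 billion scores.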

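The Key and Value vectors of every token already processed are stored in the so-called KV-cache (key-value cache), a special area of GPU memory, so they do not have to be recomputed for each new token. The n × n score matrix itself is not kept in this cache. With each new token the cache grows, and it becomes the "bottleneck" on systems with limited memory.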

More about the attention mechanism – in the article Transformers and the attention mechanism: why AI understands context.

What this means in practice: scaling table

Context (tokens) | Attention Operations | ~KV-Cache Size | Relative Cost
4,000 (GPT-3.5, 2022) | 16 million | ~100 MB | 1x
32,000 | 1 billion | ~800 MB | 64x
200,000 (Claude) | 40 billion | ~5 GB | 2,500x
1,000,000 (Gemini) | 1 trillion | ~25+ GB | 62,500x

Note: KV-cache sizes are approximate and depend on the specific model architecture, the number of attention layers, and computation precision.
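For readers who want to estimate the cache for their own model, a commonly used back-of-the-envelope formula multiplies the number of tokens by the per-token storage for keys and values. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, fp16 values); these numbers are not taken from any model in the table, so the output will not match the table's figures.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence.

    The leading 2 accounts for storing one key and one value vector
    per token, per layer, per KV head. bytes_per_value=2 means fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (4_000, 32_000, 200_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7,} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions every token adds 128 KiB to the cache, and the total grows in direct proportion to the context length.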

How it feels on real hardware

I experienced this in practice when working with Ollama on my Mac M1. I increased the context window from 2K to 8K tokens – and the model noticeably slowed down, and Activity Monitor showed a several-gigabyte jump in memory usage. When I tried to set a 16K context on a 7B model, the system started swapping to disk, and a response that used to take about a second took 30+ seconds.

This is the same quadratic complexity, just scaled down to a single laptop instead of a Google data center. More about memory limitations on weak hardware – in the article Ollama on 8 GB RAM: which models actually work.

Three walls created by quadraticity

  • ✔️ Memory wall: The KV-cache for 1M tokens can occupy 25+ GB of GPU memory – more than the entire VRAM of most consumer graphics cards. Even on a data center GPU like A100 (80 GB), this is a significant portion of the resource.
  • ✔️ Speed wall: Before it can answer, the model must process the entire prompt, and each new token of the response then has to "review" all previous context tokens. With a 200K context, the user feels the first part as a longer delay before the response starts (time to first token) and the second as slower generation of each word than with 4K.
  • ✔️ Money wall: more computation = more GPU time = higher cost per request. This is why API providers charge for input tokens, and some (like Anthropic) charge double when the context exceeds a certain threshold.

Conclusion: Quadratic complexity is not a problem that can be solved simply by adding servers. It's a fundamental mathematical property of the transformer architecture that applies equally to a laptop with 8 GB of RAM and a Google data center with thousands of GPUs. This is why companies spend millions on researching alternative architectures – and why RAG remains a more practical solution than endlessly increasing context.

🎯 Lost in the middle: why AI remembers the beginning and end better

Short answer: Even if a model can technically process 200K or 1M tokens, information in the middle of the context is remembered worse than at the beginning or end. Studies show a 20–50% drop in accuracy for information from the middle of a long context. This is not a bug of a specific model—it's a fundamental property of the transformer architecture.

Imagine you were given a 500-page book and asked to find one specific phrase. You remember the introduction and the last chapter well, but what was on page 247? The same happens with AI. Psychologists call this the "serial position effect", and it turns out LLMs suffer from it just like humans do.

This phenomenon has been named "lost in the middle" after a foundational study by researchers from Stanford, UC Berkeley, and Samaya AI (Liu et al., 2023). The authors tested models on two tasks: finding an answer among several documents and extracting key-value pairs from a long list. In both cases, they found a U-shaped curve: accuracy is highest when relevant information is at the beginning or end of the context, and drops significantly when it's in the middle. Moreover, the effect was observed in all tested models, from GPT-3.5 to GPT-4 and Claude.

Specific numbers: how serious is the problem

According to Chroma Research (2025), which tested 18 frontier models including GPT-4.1, Claude Opus 4 and Gemini 2.5:

  • ✔️ Information at the beginning and end of the context: accuracy 85–95%
  • ⚠️ Information in the middle: accuracy drops to 76–82%
  • ❌ With 100K+ context tokens: overall accuracy drop 20–50% compared to 10K
  • ✔️ Claude models degrade the slowest, but no model is immune

A separate study by Du et al. (2025) reported an even more alarming result: even when irrelevant tokens were replaced with empty spaces and the model was forced to "look" only at relevant information, performance still dropped by 13.9–85% with increasing context length. This means the problem is not just "distraction": the sheer volume of tokens hinders the model's ability to think effectively.

Why it happens: architectural reasons

Researchers at MIT (2025) found a specific mechanism. They created a theoretical framework to analyze information flow in a transformer and identified two reasons:

  • ✔️ Attention masking: the causal mask in a transformer allows tokens to "look" only at previous ones. This creates a natural bias—the last tokens have access to the most context, the first ones receive the most attention from subsequent ones.
  • ✔️ Positional encodings: methods like RoPE (Rotary Position Embedding) gradually "fade" with distance—the further two tokens are from each other, the weaker their connection. Tokens in the middle are far enough from both the beginning and the end.

The result is a U-shaped attention curve: strong focus on the beginning (primacy bias), strong focus on the end (recency bias), and a "blind spot" in the middle.

Why it matters for practice

When you have a long conversation with Claude or ChatGPT, your early messages gradually "fall" into the middle of the context. New messages are always at the end, and the system prompt is at the beginning. But important details you explained in the 15th message end up in the very zone where the model performs worst.

I noticed this in my own experience: during long sessions working with Claude, while discussing the architecture of a Spring Boot project, the model began to "forget" decisions made at the beginning of the conversation. Only one thing helped—periodically repeating key details or starting a new conversation with a summary of the previous one.

Practical recommendations

  • ✔️ For long conversations: periodically remind the model of key details or start a new conversation with a brief summary
  • ✔️ For RAG systems: if loading multiple documents into context, place the most important ones first or last—never in the middle
  • ✔️ For prompts: the main instruction—at the beginning (system prompt), the specific task—at the end (user message). Leave the middle for auxiliary context that is less critical
  • ✔️ For developers: use re-ranking in your RAG pipeline—reorder documents by relevance before inserting them into the context
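The "most important first or last" rule from the recommendations above can be sketched as a small reordering step that runs after re-ranking. The function name is illustrative, and the input is assumed to be already sorted from most to least relevant.

```python
def place_best_at_edges(docs_by_relevance: list[str]) -> list[str]:
    """Reorder so the strongest documents sit at the start and end of the context.

    Input must be sorted from most to least relevant. Documents are dealt
    alternately to the front and the back, so the weakest land in the middle.
    """
    front, back = [], []
    for index, doc in enumerate(docs_by_relevance):
        (front if index % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(place_best_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

The two most relevant documents ("A" and "B") end up at the edges, while the least relevant one ("E") sits in the middle, where a miss hurts least.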

More on the difference between approaches and when to choose which— in the article LLM vs RAG in 2026: why it's not the same thing and when to use what.

Conclusion: The advertised context window and actual effectiveness are different things. A model with 200K context that consistently works across the entire range is more valuable in practice than a model with 1M that "loses" the middle. And the best way to combat the problem is not to increase context, but to reduce it through RAG and compression, giving the model only what it truly needs.

🎯 Comparison: Claude vs GPT vs Gemini — who remembers how much

Short answer: Context window sizes in 2026: Claude Opus 4.6: 200K tokens (1M in beta); GPT-5.4: up to 1M; Gemini 3 Pro: up to 2M+. But advertised size and actual effectiveness are different things.

A larger context window is like a bigger backpack. You can put more things in it, but finding the right one becomes increasingly difficult.

Model | Context | Effective Range* | Price (input/1M tokens) | Strong Suit
Claude Opus 4.6 | 200K | ~190K (stable) | ~$15 | Least quality degradation
Claude Sonnet 4 | 200K (1M beta) | ~180K | ~$3 | Balance of price and quality
GPT-5.4 | 1M (API) | ~400K | ~$1.50 | Large volume, affordable price
GPT-4.1 | 1M (API) | ~600K | ~$2 | Coding, large codebases
Gemini 2.5 Pro | 1M | ~700K | ~$1.25 | Multimodality
Gemini 3 Pro | 2M+ | ~1M | ~$12 | Maximum volume
Llama 4 Scout | 10M | depends on infrastructure | free (self-hosted) | Open-source, data sovereignty

* "Effective Range" is the approximate volume at which the model maintains stable quality without significant degradation. Based on data from Elvex, AIMultiple and Morph. Actual performance depends on the task and content type.

An important nuance: hidden surcharge for long context

Some providers charge a higher price when the context exceeds a certain threshold. For example, according to Morph, Anthropic charges double for input tokens and 1.5x for output when Claude's context exceeds 200K in the 1M beta mode. This is logical—longer context requires more computation.

Conclusion: Choose a model not by its maximum context size, but by its effective range and stability on your tasks. 200K stable tokens are often more useful than 1M with quality degradation.

🎯 Four ways to bypass context limitations

Short answer: Instead of waiting for infinite context, the industry has developed several approaches: RAG (store information externally and retrieve on demand), context compression, optimized attention architectures, and fundamentally new architectures without attention at all. Each approach has its trade-offs—and in practice, the best result comes from their combination.

1. RAG (Retrieval-Augmented Generation) — external memory

The idea is simple: instead of cramming everything into the context, information is stored in a vector database. When a query arrives— only relevant fragments are retrieved from the database and inserted into the context. The window remains small, but the model "knows" what's needed.

I implemented exactly this approach on WebsCraft: instead of loading all 500 blog articles into the model's context, I store them in pgvector and retrieve only the 3–5 most relevant fragments for each query. The context remains ~2000 tokens instead of millions—and the answer comes in seconds, not minutes.

Advantages: cheap (small context = fewer tokens = less money), fast (less computation), accurate (the model sees only relevant information, no "noise" from irrelevant information).

Limitations: quality depends on search quality. If the system retrieves incorrect fragments—the model will give an incorrect answer. Requires careful tuning of chunking, embeddings, and relevance thresholds.
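The retrieval step can be illustrated without a database. The sketch below is an in-memory stand-in, not the pgvector setup described above: it assumes the embeddings are already computed (the tiny 2-dimensional vectors are made up), ranks chunks by cosine similarity, and applies a relevance threshold. `retrieve`, `top_k`, and `min_score` are illustrative names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, top_k=3, min_score=0.2):
    """chunks: list of (text, embedding). Return the best texts above min_score."""
    scored = [(cosine(query_vec, emb), text) for text, emb in chunks]
    scored = [pair for pair in scored if pair[0] >= min_score]  # relevance threshold
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Made-up embeddings, for illustration only.
chunks = [("pricing", [1.0, 0.0]), ("context", [0.0, 1.0]), ("tokens", [0.6, 0.8])]
print(retrieve([0.0, 1.0], chunks, top_k=2))  # ['context', 'tokens']
```

Only the returned fragments go into the prompt, which is why the context stays small regardless of how many articles are stored. If `min_score` is set too high or the embeddings are poor, nothing relevant comes back, which is the tuning risk mentioned above.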

More on the difference between approaches— in the article LLM vs RAG in 2026: why it's not the same thing and when to use what. And about the architecture of production-ready RAG systems— in the full guide to RAG.

2. Context Compression

Not all tokens in the context are equally useful. Words like "and", "in", "also" carry minimal information but take up space in the context window. Compression methods find and remove such uninformative tokens, leaving only the essence.

The most well-known method is LLMLingua from Microsoft. It uses a small language model (e.g., GPT-2) to assess the "surprise" (perplexity) of each token. Tokens with low informativeness are removed. The result is compression up to 20x with minimal quality loss.

For RAG systems, there is an extended version— LongLLMLingua. It additionally considers the user's query during compression and reorders documents in the context—placing the most relevant at the beginning and end. This directly helps with the "lost in the middle" problem we discussed in section 3. According to researchers, accuracy increased by 21.4% when using 4 times fewer tokens.

Advantages: works with any model without changing the architecture, significantly reduces API costs.

Limitations: adds a processing step before each query, there's a risk of removing an important token that appears "unimportant" to the small compressor model.
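As a toy illustration of the general idea (dropping low-information tokens before sending a prompt), the sketch below removes words from a fixed filler list. This is explicitly not LLMLingua: the real method scores tokens with a small language model, whereas a hard-coded word list is only a crude stand-in, and `FILLER_WORDS` is an assumption made for this example.

```python
# Hypothetical filler list; a real compressor scores tokens with a small LM instead.
FILLER_WORDS = {"and", "in", "also", "the", "a", "of"}


def compress(text: str, filler: set[str] = FILLER_WORDS) -> str:
    """Drop words whose lowercase form (minus punctuation) is in the filler set."""
    kept = [word for word in text.split()
            if word.lower().strip(".,;:!?") not in filler]
    return " ".join(kept)


def compression_ratio(original: str, compressed: str) -> float:
    """Word-count ratio, e.g. 2.0 means the text shrank to half its length."""
    return len(original.split()) / max(1, len(compressed.split()))


text = "The cache grows and also slows the model in practice"
short = compress(text)
print(short)                            # cache grows slows model practice
print(compression_ratio(text, short))   # 2.0
```

Even this crude filter shows the risk named in the limitations: a word that looks like filler can carry meaning in a particular sentence.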

3. Optimized Attention (Flash Attention, Sparse Attention, Ring Attention)

This approach doesn't change the transformer architecture—it optimizes computations within it. Three main methods:

Flash Attention: rearranges the order of attention computations to minimize data exchange between the GPU's main memory (HBM) and its much faster on-chip memory (SRAM). In practice, this provides a 2–4x speedup and a significant reduction in memory consumption, without any change in response quality. Flash Attention is already used under the hood by most modern models.

Sparse Attention: instead of each token "looking" at every other token (full attention), it allows looking at only a subset: neighboring tokens + a few "global" anchor points. This reduces complexity from O(n²) to O(n√n) or even O(n log n). The trade-off: the model might miss distant but important connections.

Ring Attention: distributes a long sequence across multiple GPUs, where each GPU processes its fragment and passes its key and value blocks around in a ring. This allows scaling context proportionally to the number of GPUs. Techniques of this kind are one way to reach million-token context windows, although Google has not published the details of how Gemini achieves its own.

Advantages: do not change model quality, work with existing architectures, provide significant acceleration.

Limitations: do not solve the fundamental problem of quadratic complexity—they only push the wall further. With a sufficiently large context, O(n²) will still win.
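To see how much work sparse attention saves, the sketch below counts how many (query, key) pairs remain when each token sees only a local window plus a few global anchors. It is a counting exercise under simplifying assumptions: the anchors are taken to be the first `n_global` tokens, and the pattern is not the exact mask of any specific published model.

```python
def sparse_pairs(n_tokens: int, window: int, n_global: int) -> int:
    """Count (query, key) pairs when each token sees `window` neighbours
    on each side plus the first `n_global` tokens (assumed global anchors)."""
    total = 0
    for i in range(n_tokens):
        visible = set(range(max(0, i - window), min(n_tokens, i + window + 1)))
        visible.update(range(min(n_global, n_tokens)))
        total += len(visible)
    return total


n = 4_000
full = n * n
sparse = sparse_pairs(n, window=64, n_global=4)
print(f"full: {full:,} pairs, sparse: {sparse:,} pairs")
```

The sparse count grows roughly in proportion to n × window rather than n², which is the source of the savings and also of the trade-off: pairs outside the window and the anchors are never compared.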

4. New architectures without attention (Mamba, RWKV, State Space Models)

The most radical approach is to abandon attention altogether and build a model on a different mathematical basis.

Mamba (State Space Models)—processes sequences linearly: O(n) instead of O(n²). Each token is processed once, and the model maintains a "state" that accumulates information about previous tokens. This is similar to how a person reads a book— without re-reading every page for each new paragraph, but keeping a "summary of what's read" in their head.
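The "summary of what's read" idea can be shown with a toy linear recurrence. This is not Mamba itself: real state space models use vector-valued states and learned parameters that depend on the input, while the sketch below uses a single number and fixed, made-up coefficients (`decay`, `gain`). What it does show is the single pass over the sequence with constant-size state.

```python
def linear_scan(inputs: list[float], decay: float = 0.9, gain: float = 0.1) -> list[float]:
    """Process a sequence in one pass, carrying a single running state."""
    state = 0.0
    outputs = []
    for x in inputs:
        # The new state blends the old summary with the current input.
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


print(linear_scan([1.0, 0.0, 0.0, 1.0]))
```

Each token is touched once and memory use does not grow with sequence length, which is where the O(n) cost comes from. The price is that everything the model "remembers" has to fit into that fixed-size state.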

RWKV—a recurrent architecture with transformer performance. It combines the advantages of RNNs (linear complexity) and transformers (generation quality). The model can even run on weak hardware due to low memory requirements.

Advantages: theoretically unlimited context, linear scaling, significantly lower memory consumption.

Limitations: currently lag behind transformers in quality on complex tasks—reasoning, long document analysis, coding. This is an active area of research. Some new models (Jamba by AI21) combine Mamba with transformer layers, trying to get the best of both worlds.

Summary table of approaches

Approach | Complexity | Quality | Maturity | Best for
RAG | Context-independent | High (if retrieval is good) | Production-ready | Large knowledge bases, documents
Compression | O(n) for compression | High (up to 20x compression) | Production-ready | Long conversations, cost optimization
Flash/Sparse Attention | O(n²) → O(n√n) | Lossless | Built into models | General acceleration
Mamba/RWKV | O(n) | Lower on complex tasks | Research / early production | Potentially everything

Conclusion: No single method is ideal. RAG is the most practical right now and proven in production. Compression is an effective supplement that also saves money. Optimized attention is already built into the models you use. New architectures are the potential future that could make all of the above irrelevant. The most effective approach in 2026 is a combination: RAG for core information + optimized long context for the current conversation.

🎯 How Much Does It Cost: From One Google Query to Scale

Short answer: Every token in the context is real GPU computation that someone pays for. One ChatGPT query costs ~$0.001–0.01. Multiply that by billions of Google AI Overviews queries, and you'll understand why companies so meticulously optimize context size.

When you ask ChatGPT "what's the weather?", it costs a fraction of a cent. When you upload a 100-page document and ask 20 questions, that's already tens of cents. At Google's scale, that adds up to something on the order of a hundred thousand dollars every day, or tens of millions per year.

Cost of a Single Query

A typical AI chatbot query involves approximately 500–2000 input tokens (your query + system prompt + context) and 200–500 output tokens (response). At Claude Sonnet's price of ~$3 per 1M input tokens:

  • ✔️ Simple query (1K tokens): ~$0.003
  • ✔️ Query with document context (10K tokens): ~$0.03
  • ✔️ Long conversation (100K tokens): ~$0.30
  • ⚠️ Maximum context (200K tokens): ~$0.60

Note: the same conversation becomes more expensive with each message, because the model "re-reads" the entire previous context every time.
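The "re-reading" effect is easy to quantify. The sketch below assumes the ~$3 per 1M input tokens price quoted above and counts input tokens only (output tokens and any caching discounts are ignored); `conversation_cost` is an illustrative helper.

```python
def conversation_cost(turn_tokens: list[int], price_per_million: float = 3.0) -> list[float]:
    """Input cost in USD of each request when the full history is re-sent every turn.

    turn_tokens: number of new tokens each turn adds to the context.
    """
    costs = []
    context = 0
    for added in turn_tokens:
        context += added  # the whole accumulated history is billed again
        costs.append(context * price_per_million / 1_000_000)
    return costs


costs = conversation_cost([1_000] * 20)
print(f"first request: ${costs[0]:.3f}, last request: ${costs[-1]:.3f}, "
      f"total: ${sum(costs):.2f}")
```

Twenty turns of 1,000 tokens each add only 20,000 tokens to the conversation, but because every request re-sends the history, 210,000 input tokens are billed in total.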

Google AI Overviews: Scale of Expenses

Google processes approximately 8.5 billion search queries per day. AI Overviews (AI-generated answers at the top of the results) are shown for about 10–15% of queries — that's ~1 billion AI generations daily.

Even with Google's internal cost (own TPU chips, own Gemini model) — at $0.0001 per query × 1 billion = approximately $100,000 per day, or ~$36 million per year just for AI answers in search.

For comparison: I built a RAG bot for searching articles on WebsCraft — at 100 queries per day, it costs me ~$2 per month. It's the same technology as in Google AI Overviews — the difference is only in the scale, 10 million times greater.

Why Local AI is Radically Cheaper

When you run a model through Ollama on your computer, the context window is limited by RAM, but the cost of each query is effectively $0 (apart from electricity). No API tariffs, no tokens to pay for. You've already "paid" for your hardware, and you can make an unlimited number of queries.

This is why for regular tasks with confidential data, local AI via Ollama is the optimal choice in terms of cost. More details — in the article How Much AI Costs: Tokens, GPUs, and Why Google Spends Millions.

Conclusion: The context window is not just a technical limitation, but a financial multiplier. The longer the conversation and the larger the context — the more expensive each subsequent query. Optimizing context size (through RAG, compression, or smart conversation management) is not just an improvement in quality, but also direct cost savings.

❓ Frequently Asked Questions (FAQ)

What is a context window in simple terms?

It's the maximum amount of text that an AI can "see" at once. It includes your query, the entire previous conversation, and system instructions. It's measured in tokens: fragments of text, each roughly equivalent to 0.75 English words or 0.25–0.5 Ukrainian words.

Why does ChatGPT forget what I said earlier?

When the conversation exceeds the context window, the oldest messages "fall out." Even within the window, the model recalls information from the middle less reliably (the "lost in the middle" effect). For long conversations, it helps to periodically remind the model of key details or start a new conversation with a summary of the previous one.

What is the context window for Claude, ChatGPT, and Gemini?

As of March 2026: Claude Opus 4.6 — 200K tokens (~500 pages), GPT-5.4 — up to 1M via API, Gemini 2.5 Pro — 1M, Gemini 3 Pro — 2M+. But advertised size and effective range are different things. More details — in section 4 of this article.

Why is RAG still relevant if there are million-token contexts?

Three reasons: cost (loading a million tokens for each query is expensive), quality (lost in the middle reduces accuracy), speed (longer context = slower response). RAG provides only relevant fragments — cheap, fast, accurate. More details — in the article LLM vs RAG.

Can the context window be increased in Ollama?

Yes, via the num_ctx parameter in a Modelfile (or in the options of an API request) or the OLLAMA_CONTEXT_LENGTH environment variable. But on a system with 8 GB RAM, increasing the context beyond 4096 tokens can cause swapping to disk and a sharp slowdown. More details are in the article Ollama on 8 GB RAM.
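As a sketch of the per-request route, the snippet below builds a request for Ollama's /api/generate endpoint with num_ctx set in options. It only constructs the request and does not send it; the model name "llama3" and the default local address are assumptions to adjust for your own setup.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3", num_ctx: int = 4096,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a generate request with a custom context size."""
    payload = {
        "model": model,            # assumed model name, replace with one you have pulled
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for this request
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Summarize this article.", num_ctx=8192)
print(request.full_url, json.loads(request.data)["options"])
```

With a local Ollama server running, passing the request to `urllib.request.urlopen` would send it. Raising num_ctx this way has the same memory consequences as any other method.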

How much does a long conversation with ChatGPT via API cost?

The price increases with each message because the model re-reads the entire previous context. A 100K token conversation via Claude Sonnet costs ~$0.30 per query. Via GPT-5.4 — ~$0.15. To optimize costs, use RAG or context compression.

✅ Conclusions

The context window is a fundamental characteristic of LLMs, affecting everything: response quality, generation speed, and the cost of each query. Here's the main takeaway:

  • ✔️ Context Window = AI's "Desk": everything the model can see at once. When the desk fills up — old stuff falls on the floor.
  • ✔️ Quadratic Complexity: doubling the context quadruples the cost, not doubles it. This is a fundamental limitation of the transformer architecture.
  • ✔️ Lost in the Middle: AI remembers the beginning and end of the text better. Information from the middle can get "lost" — accuracy drops by 20–50%.
  • ✔️ More ≠ Better: 200K stable tokens (Claude) are often more practical than 1M+ with degradation (Gemini).
  • ✔️ RAG Remains Relevant: even with million-token contexts, RAG is cheaper, faster, and more accurate for working with large data sets.
  • ✔️ Every Token Costs Money: a longer conversation = a more expensive subsequent query. Context optimization is direct savings.

I personally use a combined approach: RAG for searching blog articles (context ~2000 tokens), and a long context for detailed conversations with Claude, where a deep history is needed. This is the most effective strategy in 2026 — don't wait for infinite context, but wisely manage what you have.

If you want to understand other aspects of LLM operation — how AI sees text through tokens, how it generates responses using the attention mechanism, and why RAG is still more relevant than long context — go to the relevant cluster articles.

And if you need a website or web application with integrated AI functionality — RAG search, chatbot, or analytics — contact us at WebsCraft, we'll help you implement it.
