In 2026, context windows of 1–2 million tokens became the norm — Claude Sonnet already has 1 million, Gemini 3 also has 1 million, and the latest Gemini 3.1 Ultra has reached 2 million. Llama 4 Scout even claims 10 million. The logical conclusion many teams are drawing is: why bother with a RAG pipeline with chunking, a vector database, and a re-ranker, when you can just throw your entire knowledge base into the prompt?
If you're not yet clear on how RAG differs from a standard LLM, start with the article LLM vs RAG: What's the Difference and When to Use What. It provides the foundational knowledge that is assumed here. This article is for those who already understand what RAG is and want to figure out if it's still necessary in the era of million-token context windows.
This is a trap. Context size is not free magic. This article explains why architecturally dumping data in a "dumb" way ruins project economics, why it's a security issue, not just a quality one, and what the architecture of 2026 actually looks like. Examples are from the real RAG service AskYourDocs.
The Myth of "Omnivorous" Models
When Anthropic announced 1 million tokens of context for Claude Sonnet, I made myself some tea and seriously asked myself: is it even worth continuing to maintain the RAG pipeline in AskYourDocs? I did a rough calculation: a typical corporate client uploads 200–400 documents — that's about 600–800 thousand tokens. Theoretically, it all fits into one window. The temptation was not abstract but very concrete: ditch chunking, the vector database, the re-ranker — and just upload everything in one query.
The public discourse around this issue I saw split into two extremes, neither of which matched what I observed in my own project. The first is vendor marketing, which I read in model announcements and at technical conferences: "long context replaces RAG, just throw everything into the prompt, and the complexity will disappear on its own." The second is outdated 2024 thinking, which I myself wrote in my first articles on WebsCraft: the value of RAG was explained solely as a way to bypass the 4–8K token limit. But this limit has been gone for a year or two, and in my own project, RAG has not disappeared and should not disappear — the reason for its existence has simply changed.
Why the old argument no longer works: when the context was small, RAG was the only technical way to give the model access to external knowledge — without it, the model simply wouldn't see the client's documents. Now the context is large, and the question has shifted from "will the document fit" to "is it worth putting it there." I became convinced of this through my own calculations: when I calculated the real cost of putting the client's entire database into every query (this is further in the section on economics), it became obvious that the issue is no longer technical but purely economic, speed-related, and — as it turned out to be the most important for B2B — a matter of security for accessing data from different clients.
In short: the fact that a model can technically chew through 1MB of a document at once does not mean that architecturally it is the right solution for a production system with thousands of requests per month. Possibility and expediency are different things, and it was precisely this difference that I didn't immediately understand until I crunched the numbers.
Inference Economics: The Token Trap
Let's imagine a corporate knowledge base of 5MB — approximately 1 million tokens. Let's compare two approaches using the real prices of Claude Sonnet 4.6 as of June 2026: $3 per million input tokens.
Approach
Tokens per Request
Cost per Request
10,000 Requests/Month
"Dump Everything into Context"
1,000,000
$3.00
$30,000
RAG (3 chunks × 1000 tokens)
3,000
$0.009
$90
Why the difference is so significant: the cost of LLM inference depends linearly on the number of input tokens. When you dump the entire archive, the model charges for reading the entire array for every — even the simplest — request. RAG only pays for the 3000 tokens that are actually relevant to the question. The difference of 333 times is not an optimization, but a different economic model.
Even with context caching technologies that reduce the cost of re-reading prompts, you remain tied to a single static snapshot of data. As soon as the knowledge base is updated or each user receives their unique set of documents, the cache is invalidated, and the economics go south again. For AskYourDocs, this is not a theoretical case: each client has its own set of documents, and that's why a static cache won't save it here.
We should also add the cost of retrieval itself: embedding via text-embedding-3-small costs $0.02 per million tokens — meaning searching a vector database is practically free compared to generation. Even with infrastructure maintenance costs (e.g., RAM for pgvector indexes in PostgreSQL), these costs are fixed (flat rate) and do not grow exponentially with each new user request — unlike tokens, which you pay for again and again for each request with the "dump everything into context" approach.
Practical conclusion: if your product handles more than a few hundred requests per month to a large knowledge base, dumping everything into context is not financially scalable, even with caching. The point where RAG becomes mandatory arrives much sooner than you might think.
Example from AskYourDocs: in production, we use openai/text-embedding-3-small (1536 dimensions) for embeddings and meta-llama/llama-3.3-70b-instruct via OpenRouter for response generation. Embedding the client's entire base costs pennies once; we essentially pay only for the final generation — and that's why the RAG architecture makes the price predictable regardless of the client's knowledge base size.
Why Long Context Costs More Than It Seems
The direct cost of tokens is only half the story. The other half is the attention mechanism within the transformer, and this is what scares me the most when I think about the "just dump everything into context" scenario for AskYourDocs.
Simply put: to generate a response, the model has to compare every token in the context with every other token — this is attention. If there are 1000 tokens in the context, that's approximately 1000×1000 = 1 million comparison operations. If the context doubles to 2000 tokens, the operations don't just double, but become 2000×2000 = 4 million. This is called quadratic scaling: the attention mechanism scales quadratically with context length — doubling the context quadruples the computational cost, not doubles it.
Why this is not abstract math but a real problem for UX: context growth affects not only price but also latency. TTFT (Time to First Token) — the time until the first token of the response appears — degrades non-linearly as the prompt grows. Specific figures from an academic benchmark on infrastructure with 16 nodes and 128 H100 GPUs: prefill for 128K context tokens takes 3.8 seconds, and for 1M tokens — already 77 seconds. This is not a linear 8-fold increase, but almost a 20-fold increase — and this is on top-tier, specially optimized infrastructure. On a single standard GPU host, the same study shows that even 128K context can take up to 60 seconds to process.
Why this is critical for a product: imagine a chatbot or voice assistant where the user is waiting for a response. In my opinion, the difference between 0.2 seconds and 7 seconds is the difference between "the response appeared instantly" and "the user closed the tab without waiting." The same analysis adds two more consequences of long context: long-context models are still limited by their training date, and increasing context increases the risk of irrelevant or noisy text — meaning larger context is not only more expensive and slower, but also statistically "dirtier."
Remember: if your product is sensitive to response speed — a chatbot, voice assistant, real-time agent — the quadratic increase in attention cost makes long context the worst choice precisely where speed is most critical. RAG, conversely, keeps prefill short (3000 tokens instead of a million) regardless of how large the client's knowledge base is.
RAG vs. Fine-Tuning: Knowledge vs. Behavior
Before moving on, it's worth explaining a term I've used in passing: Fine-Tuning. This is the process where an already ready, pre-trained model (e.g., Llama or GPT) is further trained on a narrow, specially prepared set of examples — and during this training, the model's weights are physically changed. Technically, it looks like another training round, only much shorter and cheaper than training from scratch: the model sees thousands of "prompt → desired response" pairs in the required format or tone, and gradient descent gradually adjusts the model's parameters to this pattern. In lighter variants (LoRA, QLoRA), not the entire model is updated, but only a small additional "superstructure" on top of frozen weights — this is both cheaper and faster.
This is the basic framework to know before moving on: Fine-Tuning changes Behavior — tone, format, style, syntax. After fine-tuning, the model simply speaks differently and structures its response differently, regardless of what it's asked about. RAG changes Knowledge — operational facts that the model uses when responding, without touching any of the model's weights.
Why confusing these two categories is costly: if your data changes more often than weekly, fine-tuning turns into an endless retraining cycle — every price update, every new company policy requires a new training pass, a new dataset of examples, and new validation that the model hasn't "broken" from fine-tuning. By the time this cycle is complete, the model is already outdated. RAG solves this fundamentally differently: updating the knowledge base means simply overwriting a vector in the DB — seconds, without any retraining of model weights and without the risk that changing one fact will accidentally spoil the style of responses on other topics.
But there's a nuance that is rarely written about: the problem is rarely in the choice between RAG and fine-tuning itself. According to Gartner, up to 80% of corporate RAG implementations fail — and the main reason is not architectural, but the quality of the data being indexed into the system. We regularly encounter this at AskYourDocs: clients send documents in unreadable formats — upside-down scans, PDFs without recognized text, photos of paper forms instead of digital files. As the saying goes, garbage in, garbage out: the model cannot give an accurate answer based on text it cannot physically see, and it starts to hallucinate, constructing a "plausible" answer instead of an honest "I don't know." We detailed one such case in the case study Vision OCR for Business: How We Taught AI to Read Upside-Down Scans, where a client sent 10,000 scans, and the unreadable format, not the RAG architecture, was the root cause of hallucinations.
Why this happens: a RAG system is only as good as the documents in its database. If duplicates, outdated contract versions, or unstructured PDFs without a clear hierarchy end up there — the model will confidently answer based on garbage, and no retrieval architecture will fix it. That is, even a perfectly chosen architecture won't save the project if the documents in the knowledge base are unstructured, duplicated, or outdated.
In practice, this means: before choosing between RAG and fine-tuning, check the hygiene of your data. A flawless architecture on dirty data yields flawlessly confident wrong answers — and that's worse than no answer at all.
The "Lost in the Middle" Problem
Large models can technically read long texts, but their focus of attention is unevenly blurred: information at the beginning and end of the context is remembered better than what's in the middle. This is not an accidental glitch but a consistently reproducible pattern that researchers call the U-shaped attention curve: if you plot "accuracy of response" against the position of the relevant fact in the text, it will have the shape of a U — high at the edges, a dip in the middle.
Why this is especially important when working with long documents — 80-page contracts, technical documentation, meeting minutes — I've detailed the mechanics of the context window and why doubling the context costs four times as much in the article LLM Context Window: Why AI Forgets and How Much It Costs. It also compares how Claude, GPT, and Gemini handle attention on long contexts differently, which is useful to know if you're choosing a model specifically for tasks with large documents.
In practice for AskYourDocs, this means a simple thing: if a client uploads an 80-page contract and asks about a specific clause on page 45, RAG will retrieve exactly that fragment — regardless of where it is physically located in the document. If we instead dumped the entire contract into the model's context, the accuracy of the response would depend on a random factor — how "close to the edge" the desired clause ended up.
The main point here is: good RAG works like an attentive secretary — it removes noise, leaves the essence. It's easier for the model to draw conclusions when it has clean facts in front of it, not a 500-page verbose report, and the quality of the response no longer depends on where the necessary information is hidden in the document.
Where Long Context Actually Wins
Honesty requires admitting: long context is not always the worse choice. There are scenarios where it is objectively stronger than RAG, and remaining silent about it would mean repeating the same one-sided marketing mistake, just in the other direction.
Scenario
What's Better
Why
In-depth analysis of a single document (book, contract)
a conversation is a sequential flow, not a set of discrete facts; RAG is poorly suited for references like "remember when you said..." and maintaining the tone of the entire dialogue
updating a vector takes seconds, without re-uploading context
One-time research analysis
Long context
it's not worth building a pipeline for a task you'll perform once
I became convinced of this through my own experience when working on a chatbot service where the model had to remember the entire course of the conversation — previously mentioned facts about the user, communication tone, context of previous remarks. Trying to maintain this through RAG (retrieving individual "facts" from history) worked worse than simply passing the last N messages into the context: the conversation lost coherence, the model "forgot" the nature of the interaction, which consisted of tone rather than individual facts. For conversational systems, the context window is a natural and correct tool, not a compromise.
Recommendation: if your task involves one large, cohesive source (contract, book, technical report) or maintaining a connected dialogue — long context is faster and simpler. If you have a dynamic base of tens of thousands of documents and a stream of queries to it — RAG remains the economically and architecturally correct choice.
This is an argument that almost no one writes about in RAG vs Long Context comparisons — and it's perhaps the most important for enterprise.
Imagine a company's corporate knowledge base: HR documents, financial reports from directors, client contracts. A sales manager should not see the HR department's payroll statements. If you upload the entire knowledge base into the context window of a single model — how do you delineate access at the level of a single query?
Why it's impossible to do this securely through a context window or fine-tuning: both approaches "bake" knowledge either into the prompt or into the model's weights without any concept of who exactly is asking the question. Any attempt to delineate access would mean maintaining a separate model instance or a separate context for each user — which destroys all the economics we calculated in the section on the token trap.
RAG solves this naturally, at the architecture level, not as a patch: access is controlled by metadata in the vector database — Row-Level Security. Each chunk is marked with access tags (department, role, client), and the model physically does not receive chunks to which the user does not have rights. This is not a filter at the output that can be bypassed by prompt injection — it's a restriction at the retrieval stage, before the text even enters the model's context.
Practical conclusion: for B2B SaaS working with documents from multiple clients simultaneously, multitenant security is not an optional feature, but the reason why RAG is the only viable architecture, regardless of context window size.
Example from AskYourDocs: when we designed the architecture, we faced this exact choice — to keep all clients in one shared database with isolation via tenant_id, or to deploy a separate database (and often a separate instance) for each client. This is not a theoretical question from a textbook, but a decision that directly determines the appearance of the entire stack.
Architecture
Pros
Cons
Multi-tenant (one DB, tenant_id + RLS)
Cheaper infrastructure — one PostgreSQL instance serves all clients; one migration instead of N; easier to scale to hundreds of small clients
Security relies on *every* query correctly filtering by tenant_id — one forgotten WHERE clause or error in RLS policy means data leakage between clients; harder to satisfy enterprise clients who contractually require dedicated infrastructure
Single-tenant (separate DB / instance per client)
Physical, not logical isolation — data leakage between clients is architecturally impossible, even if there's a bug in the code; easier to respond to GDPR requests (deleting/exporting one client's data means deleting one DB, not selectively deleting rows); easier to offer self-hosted deployments
More expensive infrastructure — N clients = N databases to maintain; migrations and updates need to be rolled out to each instance separately; harder to scale to a large number of small clients
We chose single-tenant — and the decisive factors were two points that directly stem from our audit. First, GDPR: when each client's data physically resides in a separate database, responding to a "delete all our data" request means deleting one database, not auditing the code to ensure all queries truly filtered by tenant_id in every one of the dozens of places in the codebase. Second, some of our enterprise clients require self-hosted deployments — their documents should not leave their own infrastructure, and a single-tenant architecture is naturally suited for this, without additional isolation layers.
If you don't have such restrictions — for example, you're building a product for hundreds of small clients without strict compliance requirements — multi-tenant with tenant_id and Row-Level Security remains the economically correct choice: access is filtered by metadata directly in the vector search SQL query, before retrieval even starts calculating similarity. It's just important to clearly understand that this infrastructure saving is a conscious compromise on security, not a free bonus.
What Makes a Good RAG Pipeline
All of the above makes sense only if RAG itself is done correctly. Poor RAG — with primitive chunking and pure vectors without keywords — discredits the entire approach and gives arguments to proponents of "just upload to context." Three components distinguish production quality from a prototype.
Chunking strategy: cut by structure, not by characters
The most common mistake is cutting text by a fixed number of characters, ignoring headings, tables, and paragraph breaks. This "cuts through the living tissue": it breaks a thought in the middle of a sentence and destroys the context of a single chunk. The correct approach considers the document's structure. A detailed breakdown of seven strategies with benchmarks is in the article Chunking in RAG 2026: 7 Strategies for Production.
Hybrid Search: vectors without keywords are a sign of bad RAG
Pure vector search misses exact matches — order numbers, error codes, precise terms — because semantic similarity doesn't guarantee lexical matching. A combination of BM25 (keyword) and vector search closes this gap. I discussed this in detail using my own RAG service as an example in the article I Added BM25 to My RAG Service — and Vector Search Stopped Missing Exact Queries.
Reranking: a second check before showing to the LLM
Vector search returns candidates quickly, but not always accurately. A reranker — a separate, lighter model (Cross-Encoder) — re-checks the top results before passing them to the LLM and sorts them by true relevance. Paired with hybrid search, this provides a measurable increase in quality: a full breakdown with numbers of +15–40% — in the article Hybrid Search and Reranking for RAG: +15-40% Quality.
In short: skipping any of these three components turns "RAG" into an expensive and slow degradation of quality compared to the same long context. RAG only wins when done correctly.
The mechanics of the approach involve two specific steps, not an abstract "the model decides itself." Step 1: the query and RAG chunks (as in regular RAG) are passed to the model, but with one difference — the model is given explicit permission to refuse to answer if it considers the information in the chunks insufficient, instead of inventing a plausible answer. Step 2: if the model refuses — only then is the query redirected to the full document context, and the model answers with the entire text in front of it. A simple query will pass Step 1 and receive a fast, cheap answer. A complex query will "fail" Step 1, the model will honestly say "insufficient data" — and only then will the system pay for the more expensive pass through the full context.
Why this works in practice: simple factual questions ("what is the contract signing date?") almost always successfully pass Step 1 — RAG will handle it faster and cheaper, as the required fact is usually in one or two chunks. A complex multi-step question ("how did the contract terms change between three versions?") will likely fail Step 1: the answer is scattered throughout the document, retrieval will only extract part of the picture, and the model — if well-calibrated — will recognize this. The word "calibrated" is key here: the entire approach is based on the assumption that the model honestly assesses its own uncertainty, rather than feigning confidence where it doesn't exist. This is not an unconditional guarantee, but a probabilistic mechanism that works better on more powerful models.
For a production system, this means a specific engineering pattern: the first pass is always cheap (RAG), the second pass is a more expensive fallback option (full context), activated only for the minority of queries where it's truly needed. Most traffic — simple facts — never touches the expensive path.
In practice, this means: the architecture of 2026 is not a choice of "RAG or context" at the start of a project, but a routing layer that decides this for each query individually, and the cost for this decision is just one additional, cheap pass of the model through RAG chunks before it admits "this is not enough."
Hybrid Stack 2026: Conclusion
The ideal architecture today is not an "either/or." It consists of three layers: a small, fast model fine-tuned for a specific response format (behavioral fine-tuning); a high-quality, polished RAG pipeline with hybrid search and reranker (knowledge layer); and a Self-Route layer that decides for each query whether retrieval is sufficient or full context is needed.
None of these three layers replaces the others. Fine-tuning without RAG provides a consistent but outdated tone. RAG without good chunking and hybrid search yields a fast but inaccurate answer. Long context without RAG provides accuracy on a single document — and financial disaster on thousands of queries to a large database.
The main conclusion of the article: the context window size of your model is a marketing figure on a presentation slide, not an indicator of how good your AI system is. A million tokens won't save a project where retrieval returns irrelevant chunks, documents are not structured or are duplicated, and access to third-party data can be obtained through a trivial SQL query error. True quality indicators are retrieval accuracy (whether the system finds a truly relevant fragment, not just one similar by vector), data hygiene (whether documents are structured, up-to-date, without duplicates — the reason 80% of RAG implementations fail), and the security architecture around all of this (whether it's physically impossible for one client to see another's data, regardless of code bugs).
The context window is just one tool in the toolbox, not a goal in itself. A team that chooses a model with the largest context and considers architectural questions closed will most likely repeat the path described at the beginning of this article: first, the temptation to "just throw everything into the prompt," then the token costs that don't add up, and then — a painful return to retrieval, chunking, and RLS, but this time under the pressure of a production incident, rather than as a conscious architectural choice at the start.
Frequently Asked Questions
Is RAG obsolete due to long context windows?
No. Context windows have solved the technical token limit problem but have not solved the cost, speed, and security of data access — and it is these three factors that determine whether RAG can be used in production.
What is cheaper: RAG or long context?
RAG is orders of magnitude cheaper for frequent queries to a large knowledge base — in our calculation, the difference was 333 times. Long context can only be cheaper for a one-time analysis of a single document.
When is it better to use fine-tuning instead of RAG?
Fine-tuning is suitable when you need to change the model's behavior — tone, format, style of response. For operational facts that change more often than once a week, RAG is the only practical option.
What is "lost in the middle"?
The effect where a model uses information located in the middle of a long context worse than information at the beginning or end. RAG mitigates this problem by providing only relevant fragments instead of the entire text mass.
Can RAG and fine-tuning be combined?
Yes, and this is the most common production pattern of 2026: fine-tuning is responsible for stable behavior and format, RAG — for current facts at the time of the query.
How does RAG solve the problem of data access in enterprise?
Through Row-Level Security at the metadata level of the vector database: each chunk is tagged with access tags, and the user physically does not receive information they are not authorized for — the restriction works even before the text enters the model's context.
What is Self-Route in RAG systems?
An approach where the model itself decides for each specific query whether retrieval (RAG) is sufficient or full document context is needed — based on the complexity of the question.
Why do RAG implementations often fail?
According to Gartner estimates, up to 80% of RAG implementations in companies fail — and the main reason is not the architecture, but the quality of the indexed data: duplicates, outdated information, lack of structure.