Why Did Google Kill Its Medical AI Feature? The RAG Architecture Disaster Explained

Google quietly rolled back the What People Suggest feature for medical queries. The official wording is “answer quality.” But behind this lies a specific architectural problem: the retrieval system extracted semantically similar but clinically incompatible fragments — and the model synthesized technically sound but dangerous answers from them.

Main point: this is not about medicine or corporate responsibility. This is a public case of how RAG architecture fails when the data corpus does not meet the requirements of the task. The lesson applies to any pipeline — from a legal bot to corporate search.

Why user opinion aggregation is not a valid sample

The What People Suggest feature was built on a logical but flawed premise: if enough people describe similar experiences, the aggregated result approaches the truth. In medicine, this model breaks down at the sample level itself.

For restaurant recommendations or hotel reviews, aggregation works: the sample is large and representative enough. A medical forum is a different population. The people who post there tend to have atypical disease courses, to be dissatisfied with standard treatment, or to be looking for confirmation of a decision they have already made.

Validation bias as a structural property of UGC platforms

Platforms like Reddit or Quora have a built-in validation bias: those who have something to say publish — meaning those whose experience deviates from the norm. A person who took ibuprofen and feels better a day later doesn't write a post. A person who experienced an unusual reaction does.

This is not a problem of specific platforms — it's a mathematical property of any voluntary content. Researchers call this reverse survivorship bias: extreme cases, not typical ones, prevail in the sample.

What UGC distribution looks like in practice

Imagine a hypothetical medical forum with 10,000 posts about ibuprofen side effects. Of these, 8,500 describe unusual or negative reactions — precisely why people came to seek answers or share experiences. Another 1,200 are queries like “is it normal that...”. And only 300 posts are in the style of “took it, helped, thank you” — because that person no longer had a reason to return to the forum.

Now the RAG system indexes this corpus and receives a query: “is ibuprofen safe?” Similarity search finds the 20 most relevant chunks. 17 of them describe complications. Not because ibuprofen is dangerous — but because safe cases are simply underrepresented in the corpus.
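To make the mechanics concrete, here is a minimal sketch in Python, using the made-up counts from the hypothetical forum above: even a perfectly unbiased retriever that samples relevant chunks uniformly will fill the context window with complications, simply because that is what the corpus contains.

import random

# Hypothetical corpus mirroring the forum example above (made-up labels and counts)
corpus = (
    ["adverse_reaction"] * 8_500   # unusual or negative experiences
    + ["is_this_normal"] * 1_200   # anxious questions
    + ["worked_fine"] * 300        # "took it, helped, thank you"
)

# Assume retrieval is "fair": every relevant chunk is equally likely to be picked
random.seed(42)
top_k = random.sample(corpus, k=20)

negative_share = top_k.count("adverse_reaction") / len(top_k)
print(f"Complication chunks in context: {negative_share:.0%}")
# Expect roughly 85% of the context to describe complications, purely because
# of the corpus distribution, not because the drug is dangerous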

Medical forum versus clinical trial: difference in sample

A clinical trial is built on a controlled sample: a defined population, randomization, a control group. The result reflects the distribution in the real patient population.

A medical UGC forum is the opposite:

  • Non-randomized sample — only those who have something to say show up
  • No control group — no voice from those for whom “everything went fine”
  • Confirmation bias amplifies the signal — posts with unusual experiences receive more replies and upvotes, rise higher, and appear in more chunks
  • Temporal shift — older posts with outdated treatment protocols are semantically equivalent to new ones

For a developer building RAG, this means one thing: if the corpus consists of UGC, the distribution of opinions within it does not correspond to the actual distribution of cases in the population. The answer is synthesized from the tails of the distribution, not from its center.

Analogy for the developer

Imagine your recommendation engine learns only from 1-star and 5-star reviews — there are almost no 3-star reviews in the corpus. The model will confidently recommend or reject, but will have no concept of “normal.”

Now take it a step further: imagine that 1-star reviews receive 10 times more likes and comments than 5-star reviews — and therefore dominate search results. The system doesn't just “not know normal” — it's actively trained on anomalies. This is exactly what happens when a RAG system uses forum UGC as the primary source for medical or any other YMYL queries.

The Garbage In, Garbage Out principle does not disappear with an increase in model parameters. A more complex architecture synthesizes a biased sample more convincingly — but does not make it representative.

Where RAG breaks down: chunk-collision and loss of clinical context

Retrieval-Augmented Generation is an architecture where the model first extracts relevant fragments from a corpus and then synthesizes an answer based on them. The point of failure is not generation. The point of failure is retrieval.

Let's look at a specific example. A user enters the query: “diet for pancreatic cancer”.

Step-by-step failure mechanism

  1. Embedding. The query is transformed into a vector in semantic space.
  2. Similarity search. The system searches for chunks with the highest cosine similarity. “Closeness” is determined by statistical word similarity — not clinical context.
  3. Chunk-collision. Two fragments enter the context window: one about “low-fat diet for pancreatitis,” the other about “nutrition during chemotherapy.” Both are semantically close to the query. Both are clinically incompatible in this context.
  4. Synthesis. The model forms an answer that is technically justified by the sources, but clinically incorrect: nutrition protocols during chemotherapy differ fundamentally from general recommendations for pancreatic diseases.

What this looks like at the retrieval query level

To understand the failure specifically, let's look at what the system actually “sees.” A simplified example of a retrieval result for the query “diet for pancreatic cancer”:

Query: "diet for pancreatic cancer"

Retrieved chunks (top-3 by cosine similarity):

[chunk_id: 4821]  score: 0.91
source: forum_post, date: 2021-08
"For pancreatitis, doctors recommend a low-fat diet —
no more than 40g of fat per day. The pancreas cannot cope
with fat processing, so restriction is critical..."

[chunk_id: 2203]  score: 0.88
source: health_blog, date: 2022-03
"During chemotherapy, it is important to maintain calorie intake.
Patients are often recommended high-calorie nutrition,
including fats — to preserve body mass..."

[chunk_id: 7734]  score: 0.85
source: forum_post, date: 2020-11
"After pancreatic resection, the diet changes drastically.
The first months — minimal fats, then gradual expansion..."
  

The model receives three chunks with cosine similarity 0.85–0.91, all “relevant” by the metric. But chunk 4821 describes chronic pancreatitis, chunk 2203 covers oncology during chemotherapy, and chunk 7734 describes a postoperative state. These are three different clinical contexts with opposite recommendations regarding fats.

The system has no mechanism to distinguish them — and synthesizes a “balanced” answer from contradictory sources.

Why cosine similarity doesn't see clinical context

Cosine similarity measures the angle between two vectors in a multi-dimensional space. The smaller the angle — the “closer” the documents. But what exactly is encoded in this space?

An embedding model learns to predict words from context (or vice versa). It learns that “pancreas,” “diet,” “fats,” “chemotherapy,” and “pancreatitis” are semantically related words. Therefore, documents containing these words receive close vectors.

But the embedding doesn't capture that “restrict fats” and “increase calorie intake through fats” are opposing medical instructions for different conditions of the same organ. For a vector, these are just two documents about the pancreas and diet.
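An illustrative sketch of this blindness, assuming a general-purpose sentence-embedding model via the sentence-transformers library (the exact scores depend on the model, but both fragments typically land close to the query):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any general-purpose embedding model

query = "diet for pancreatic cancer"
chunks = [
    "For pancreatitis, doctors recommend a low-fat diet, no more than 40g of fat per day.",
    "During chemotherapy it is important to maintain calorie intake, including fats.",
]

query_vec = model.encode(query)
chunk_vecs = model.encode(chunks)

for text, score in zip(chunks, util.cos_sim(query_vec, chunk_vecs)[0]):
    print(f"{float(score):.2f}  {text}")

# Both fragments score as "relevant": nothing in the vectors encodes that
# "restrict fats" and "increase fats" are incompatible instructions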

More details on the limitations of embedding models in domain-specific tasks — in the RAGAS study (2023).

Visual diagram: what happens in the context window

User query
        │
        ▼
┌───────────────────┐
│   Embedding model  │  → query vector
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Vector store      │  similarity search
│  (entire corpus)   │  → top-K chunks
└───────────────────┘
        │
        ├── chunk A: pancreatitis + low-fat diet    (score: 0.91)
        ├── chunk B: chemotherapy + high-calorie     (score: 0.88)  ← COLLISION
        └── chunk C: postoperative state               (score: 0.85)
        │
        ▼
┌───────────────────┐
│      LLM           │  synthesizes answer from A + B + C
│   (generation)     │  ← model 'doesn't know' that A and B are incompatible
└───────────────────┘
        │
        ▼
  Answer technically
  justified by sources,
  but clinically incorrect
  

Why this happens specifically in YMYL niches

Google defines YMYL (Your Money or Your Life) as a category of queries where an incorrect answer can cause real harm: medicine, law, finance. We covered in detail how YMYL niches work from an SEO perspective in a separate article — YMYL Niches: A Complete Guide to SEO. In the context of RAG architecture, the chunk-collision problem is most critical in these niches for two reasons:

  • High domain specificity. Medical terms have narrow meanings that a general-purpose embedding space does not distinguish. “Pancreatitis” and “pancreatic cancer” are adjacent topics for the model, but for a clinician they are fundamentally different pathologies with different protocols.
  • Disproportionate cost of error. An inaccurate restaurant recommendation is a bad meal. An inaccurate medical recommendation can have far more serious consequences.

More details on YMYL criteria in the context of search quality evaluation — in Google's Search Quality Rater Guidelines.


How chunk-collision differs from hallucination — and why it's more important

Most developers configure RAG against hallucinations: they lower temperature, add grounding, check sources. But chunk-collision is a different category of error. The model plays by the rules. The problem is what it was fed.

Definition: what is hallucination in LLM

A hallucination is when a model generates a fact that is not in its sources and contradicts reality. The model “invents” information from parametric memory: its confidence is high, but the knowledge is false.

For details on the mechanisms of hallucinations, their types, and ways to reduce risk, see our separate article: AI Hallucinations: What they are, why they are dangerous, and how to avoid them.

Definition: what is chunk-collision

Chunk-collision is when a model synthesizes an answer from real, existing fragments that were correctly found by the retrieval system but are incompatible in the specific context of the query.

Comparison: hallucination versus chunk-collision

Criterion                            | Hallucination                           | Chunk-collision
Source of error                      | Model's parametric memory               | Retrieval — incorrectly selected chunks
Is the answer supported by sources?  | No — contradicts documents in context   | Yes — fully corresponds to chunks
Detected by grounding check?         | Yes                                     | No — answer is “grounded”
Detected by faithfulness metric?     | Yes                                     | No — answer is faithful to chunks
Where does it occur?                 | At the generation stage                 | At the retrieval stage
Does the model “know” it's wrong?    | No                                      | No — and cannot know

Why chunk-collision is harder to diagnose

A hallucination can be detected through a grounding check: compare the answer with documents in the context window. If the answer contradicts the chunks — it's a hallucination.

Chunk-collision is not caught by this method. Look at the example:

Model's answer:
"For pancreatic diseases, it is recommended to limit
fat intake to 40g per day. At the same time, to maintain body weight
during treatment, it is important to ensure sufficient calorie intake,
including healthy fats."

Grounding check:
✓ "limit fats to 40g" → confirmed by chunk_id: 4821
✓ "sufficient calorie intake, including fats" → confirmed by chunk_id: 2203

Check result: PASSED. Answer is grounded.
Real problem: chunk 4821 and chunk 2203 are different clinical protocols.
  

Standard RAG quality metrics — faithfulness, answer_relevancy, context_precision — also do not catch chunk-collision, because the answer is indeed faithful to the chunks in context. The system receives high quality scores — and remains dangerous.

Third class of errors: scheming

Hallucination and chunk-collision are unintentional errors: the model doesn't “want” to be wrong, it simply lacks a mechanism to distinguish right from wrong in this context.

There is also a third, fundamentally different class of deviations — scheming: when the model's behavior systematically diverges from stated goals, not due to error, but due to optimization for a hidden signal. This is a separate and more complex problem — more details: AI Scheming: How models deceive and how to protect yourself.

How to diagnose chunk-collision in your pipeline

1. Analyze chunk diversity in context

If retrieval returns fragments from different protocols, categories, or time periods — this is a risk signal. Introduce a context diversity score metric: how much the metadata of retrieved chunks differ for a single query. High diversity for a narrow query is a red flag.
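A minimal sketch of such a metric, assuming each retrieved chunk carries a metadata dict (the field name condition_type is illustrative):

from collections import Counter

def context_diversity_score(chunks: list[dict], field: str = "condition_type") -> float:
    """Share of retrieved chunks that do NOT belong to the dominant category.
    0.0 means all chunks agree; values near 1.0 mean a highly mixed context."""
    values = [c.get("metadata", {}).get(field, "unknown") for c in chunks]
    if not values:
        return 0.0
    dominant_count = Counter(values).most_common(1)[0][1]
    return 1.0 - dominant_count / len(values)

# Three chunks from three different clinical contexts for a single narrow query
retrieved = [
    {"metadata": {"condition_type": "pancreatitis"}},
    {"metadata": {"condition_type": "oncology_chemo"}},
    {"metadata": {"condition_type": "post_surgery"}},
]
print(round(context_diversity_score(retrieved), 2))  # 0.67 -> red flag for a narrow query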

2. Add metadata filtering before synthesis

Chunks with incompatible tags should not enter the same context. Example of a filter at the retrieval pipeline level:

from collections import Counter

def filter_incompatible_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """
    Removes chunks with an incompatible condition_type.
    Keeps only those that match the dominant category among the retrieved chunks.
    """
    if not chunks:
        return chunks

    # Chunk is whatever retrieval object your pipeline uses; it needs a .metadata dict
    condition_types = [c.metadata.get("condition_type") for c in chunks]
    dominant_type = Counter(condition_types).most_common(1)[0][0]

    return [
        c for c in chunks
        if c.metadata.get("condition_type") == dominant_type
    ]
  

This is a simplified example — the actual logic depends on the domain. But the principle is immutable: metadata must be part of the chunk from the moment of indexing, not added later.

3. Log at the retrieval level, not just generation

Most teams log final answers and evaluate their quality post-factum. Chunk-collision is not caught this way — the problem arises before generation.

Minimum set for logging each query:

  • query — original user query
  • retrieved_chunk_ids — identifiers of all chunks in context
  • chunk_metadata — metadata of each chunk (category, date, source)
  • similarity_scores — cosine similarity for each chunk
  • final_answer — generated answer

With this log, you will be able to retrospectively identify collision patterns: queries where retrieval consistently returns chunks from different categories.
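A minimal sketch of one such log record, assuming the chunks expose an id, metadata, and a similarity score (field names follow the list above; the structure itself is illustrative):

import json
from datetime import datetime, timezone

def log_retrieval(query: str, chunks: list[dict], answer: str,
                  log_path: str = "retrieval_log.jsonl") -> None:
    """Append one retrieval event as a JSON line for later collision analysis."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved_chunk_ids": [c["id"] for c in chunks],
        "chunk_metadata": [c["metadata"] for c in chunks],
        "similarity_scores": [c["score"] for c in chunks],
        "final_answer": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")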

4. Add cross-chunk consistency check before synthesis

Before passing chunks to the generation model, add an intermediate step: ask the LLM to assess whether the retrieved chunks are compatible with each other in the context of the query. If not — either filter out incompatible ones or return a clarifying query to the user.

consistency_prompt = f"""
User query: {query}

Below are the fragments found to answer this query.
Determine: do all of them belong to the same clinical/subject context?
If there are incompatible fragments — indicate their IDs and explain the incompatibility.

{format_chunks(retrieved_chunks)}

Answer in JSON format:
{{"compatible": true/false, "incompatible_ids": [], "reason": ""}}
"""
  

This adds latency — but for YMYL queries, it's a justified compromise.
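A sketch of how this step could be wired into the pipeline, assuming an OpenAI-compatible client; the model name and the fallback behavior are placeholders, not a recommendation:

import json
from openai import OpenAI  # pip install openai; any OpenAI-compatible client works

client = OpenAI()

def check_chunk_consistency(consistency_prompt: str) -> dict:
    """Run the consistency prompt and parse the JSON verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": consistency_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# verdict = check_chunk_consistency(consistency_prompt)
# if not verdict["compatible"]:
#     retrieved_chunks = [c for c in retrieved_chunks
#                         if c.id not in verdict["incompatible_ids"]]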


Where the industry is heading: open search versus closed verified systems

Google's rollback is not a retreat from AI in medicine. It's a demarcation between two architectures: public search with an open corpus and closed systems with a controlled index.

Public search becomes conservative

For most medical queries, Google reverts to displaying links to authoritative sources without generative synthesis. The reason is not technical but legal and reputational: the risks from “quick answers” outweigh their value when the corpus is uncontrolled.

This is not a unique stance by Google. Microsoft Bing, which aggressively integrated GPT-4 into search in 2023, also added explicit disclaimers for medical queries and restricted generative answers on YMYL topics after a series of public incidents. The trend is the same: open corpus + generative synthesis = unmanageable risk for critical niches.

Closed systems: what “controlled index” means in practice

The term “controlled index” sounds abstract. In practice, it's a set of specific architectural solutions that distinguish a closed medical system from public RAG:

  • Verified sources with explicit scope of application. Each document in the corpus has undergone manual or automated verification. Metadata contains not just “medical article,” but condition: oncology, stage: chemotherapy, protocol_version: 2023-Q4, reviewed_by: oncologist. Retrieval filters by these fields before similarity search — not after (see the sketch after this list).
  • Document versioning. Medical protocols are updated. In a controlled index, each document version is stored separately with its effective date. The query is automatically routed to the current version — an outdated chunk physically does not enter retrieval.
  • Deterministic answers for critical scenarios. In some scenarios, the model does not synthesize free text, but selects from a verified decision tree. For example, for a drug dosage query, the answer is not a generated paragraph, but a structured extract from a pharmacological database with precise values and a reference to the source.
  • Audit log of each inference query. For regulatory reporting (HIPAA in the USA, MDR in the European Union), every query, the retrieved chunks, and the generated answer are logged with full reproducibility. This is not an option — it's a certification requirement.
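A minimal sketch of what filtering before similarity search can look like, assuming an in-memory corpus where each document carries an embedding vector and a metadata dict (the schema and field names are illustrative):

import numpy as np

def retrieve(query_vec: np.ndarray, docs: list[dict],
             query_condition: str, k: int = 3) -> list[dict]:
    """
    Pre-filter by metadata BEFORE similarity search, so chunks from other
    clinical contexts never compete on cosine similarity at all.
    Each doc is assumed to look like {"vector": np.ndarray, "metadata": {...}}.
    """
    candidates = [d for d in docs if d["metadata"].get("condition") == query_condition]

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(candidates,
                  key=lambda d: cosine(query_vec, d["vector"]),
                  reverse=True)[:k]

The same idea applies with a production vector store: most of them support metadata filters that narrow the candidate set before the nearest-neighbor search runs.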

Examples of closed systems already in operation

A striking example of this logic is DeepMind's results in diagnostics: they were achieved in closed, certified environments with controlled input data, not through public search over an open corpus.

Other examples of the same architectural logic:

  • Epic Systems + AI — LLM integration into electronic health records (EHR) with a strictly controlled corpus: the model works only with specific patient data and verified clinical protocols, isolated from the open web.
  • FDA-cleared AI devices — a class of Software as a Medical Device (SaMD), where each version of the model and corpus undergoes separate certification. Any index update means a new regulatory cycle.
  • Internal corporate RAG systems — legal, financial, compliance bots in large companies, where the corpus consists exclusively of internal verified documents, not the open web.

Checklist: is your RAG ready for a critical niche

If you are building RAG for a task with a high cost of error — go through this list before opening the system to users:

  • ☐ Each document in the corpus has explicit metadata about its scope and validity date
  • ☐ Outdated document versions are isolated from retrieval or explicitly marked
  • ☐ Retrieval filters by query context before similarity search, not after
  • ☐ Not only the answer, but also retrieved chunks with similarity scores are logged
  • ☐ There is a mechanism for detecting chunk-collision before synthesis
  • ☐ For the most critical queries — deterministic extraction instead of generative synthesis
  • ☐ There is a process for regular auditing of the corpus for relevance and consistency

The question is not “which model to choose” — the question is “how the corpus is organized.” Architectural decisions at the level of indexing, chunking, and metadata determine the system's quality more than the choice between specific LLMs.

The open web remains useful for informational queries with low stakes. For tasks where the cost of error is high, the direction is clear: a verified corpus, a controlled index, detailed retrieval logging.

Conclusion

The rollback of What People Suggest is not a defeat for AI in medicine. It's a correction of an architectural error: an attempt to apply a general-purpose tool to a task that requires a controlled corpus and verified sources.

The source is more important than the model. This thesis is inconvenient because it makes it harder to sell new architectures. But it is accurate.

Now a question for you: how is the filtering of incompatible chunks organized in your current RAG pipeline? Is there metadata that allows the retrieval system to distinguish context — or does everything go into one index?

Frequently Asked Questions

What is RAG and why is it needed?

RAG (Retrieval-Augmented Generation) is an architecture in which a language model, before generating an answer, extracts relevant fragments from an external document corpus. This allows the model to answer queries based on current data rather than only on parametric memory from training. More details are in the original RAG paper by Meta AI (2020).

What is YMYL in the context of search?

YMYL (Your Money or Your Life) is a category of search queries, defined by Google, where an incorrect answer can cause harm: medicine, law, finance, safety. For such queries, Google applies stricter requirements to source quality. The official definition is in the Search Quality Rater Guidelines.

Can chunk-collision be completely avoided?

Not completely, if the corpus is large and heterogeneous. But the risk is significantly reduced by query classification before retrieval, detailed metadata at the chunk level, and filtering of incompatible fragments before synthesis.

What tools help evaluate RAG quality?

Among open frameworks for evaluating RAG systems:

  • RAGAS — faithfulness, answer relevancy, context precision metrics
  • RAGAS on GitHub — open source code
  • TruLens — evaluation and tracing of RAG pipelines
