Google quietly rolled back the What People Suggest feature for medical queries. The official wording is “answer quality.” But behind it lies a specific architectural problem: the retrieval system extracted semantically similar but clinically incompatible fragments, and the model synthesized technically sound but dangerous answers from them.
Main point: this is not a story about medicine or corporate responsibility. It is a public case study of how RAG architecture fails when the data corpus does not meet the requirements of the task. The lesson applies to any pipeline, from a legal bot to corporate search.
Why user opinion aggregation is not a valid sample
The What People Suggest feature was built on a logical but flawed premise: if enough people describe similar experiences, the aggregated result approaches the truth. In medicine, this model breaks down at the sample level itself.
For restaurant recommendations or hotel reviews, aggregation works: the sample is large and representative enough. A medical forum draws a different population: people with atypical disease courses, people dissatisfied with standard treatment, and people seeking confirmation of a decision they have already made.
Validation bias as a structural property of UGC platforms
Platforms like Reddit or Quora have a built-in validation bias: the people who publish are the ones who have something to say, which means the ones whose experience deviates from the norm. A person who took ibuprofen and feels better a day later doesn't write a post. A person who experienced an unusual reaction does.
This is not a problem with specific platforms; it is a statistical property of any voluntarily contributed content. Researchers call this reverse survivorship bias: extreme cases, not typical ones, dominate the sample.
What UGC distribution looks like in practice
Imagine a hypothetical medical forum with 10,000 posts about ibuprofen side effects. Of these, 8,500 describe unusual or negative reactions, which is precisely why their authors came to seek answers or share experiences. Another 1,200 are queries like “is it normal that...”. And only 300 posts are in the style of “took it, helped, thank you,” because a person whose treatment worked has no reason to return to the forum.
Now the RAG system indexes this corpus and receives a query: “is ibuprofen safe?” Similarity search finds the 20 most relevant chunks. 17 of them describe complications. Not because ibuprofen is dangerous, but because safe cases are simply underrepresented in the corpus.
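A back-of-the-envelope sketch makes the mechanics visible. The numbers mirror the hypothetical example above, and treating top-k retrieval as a uniform draw from the corpus is a deliberate simplification (real similarity search is not random; the assumption here is only that relevance is independent of sentiment):

```python
# Hypothetical simulation of the corpus skew described above:
# 85% negative posts, 12% questions, 3% "it worked" posts.
import random

random.seed(0)

corpus = (
    ["negative_experience"] * 8_500
    + ["question"] * 1_200
    + ["positive_experience"] * 300
)

# Simplifying assumption: relevance is independent of sentiment, so top-20
# retrieval behaves like a random draw from the corpus distribution.
top_k = random.sample(corpus, k=20)

print(top_k.count("negative_experience"), "of 20 chunks describe complications")
# With this distribution, roughly 17 of 20 retrieved chunks are negative.
```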
Medical forum versus clinical trial: a difference in sampling
A clinical trial is built on a controlled sample: a defined population, randomization, a control group. The result reflects the distribution in the real patient population.
A medical UGC forum is the opposite:
- Non-randomized sample: only those who have something to say show up
- No control group: there is no voice from those for whom “everything went fine”
- Confirmation bias amplifies the signal: posts describing unusual experiences receive more replies and upvotes, rank higher, and end up in more chunks
- Temporal shift: older posts with outdated treatment protocols are, as far as the retriever is concerned, semantically equivalent to new ones
For a developer building RAG, this means one thing: if the corpus consists of UGC, the distribution of opinions within it does not correspond to the actual distribution of cases in the population. The model synthesizes from the tails of the distribution, not from its center.
Analogy for the developer
Imagine your recommendation engine learns only from 1-star and 5-star reviews, with almost no 3-star reviews in the corpus. The model will confidently recommend or reject, but it will have no concept of “normal.”
Now take it a step further: imagine that 1-star reviews receive 10 times more likes and comments than 5-star reviews, and therefore dominate search results. The system doesn't just lack a concept of “normal”; it is actively trained on anomalies, as the sketch below shows. This is exactly what happens when a RAG system uses forum UGC as the primary source for medical or any other YMYL queries.
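A minimal sketch of that amplification. The reviews and the popularity heuristic are both made up for illustration, not taken from any real system:

```python
# Hypothetical reviews: extreme experiences attract far more engagement.
reviews = [
    {"stars": 1, "text": "terrible reaction", "likes": 240, "comments": 85},
    {"stars": 5, "text": "worked fine", "likes": 12, "comments": 1},
    {"stars": 1, "text": "never again", "likes": 310, "comments": 97},
    {"stars": 5, "text": "no issues at all", "likes": 18, "comments": 2},
]

def engagement_score(review):
    # A typical popularity heuristic: interactions drive visibility.
    return review["likes"] + 2 * review["comments"]

# Ranking by engagement puts the 1-star anomalies on top, regardless of how
# representative they are of the average experience.
for review in sorted(reviews, key=engagement_score, reverse=True):
    print(review["stars"], "stars:", review["text"])
```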
The Garbage In, Garbage Out principle does not disappear as model parameters increase. A more complex architecture synthesizes a biased sample more convincingly; it does not make the sample representative.
Where RAG breaks down: chunk-collision and loss of clinical context
Retrieval-Augmented Generation is an architecture in which a retriever first extracts relevant fragments from a corpus and an LLM then synthesizes an answer based on them. The point of failure is not generation. The point of failure is retrieval.
Let's look at a specific example. A user enters the query: “diet for pancreatic cancer”.
Step-by-step failure mechanism
- Embedding. The query is transformed into a vector in semantic space.
- Similarity search. The system looks for chunks with the highest cosine similarity to the query vector. “Closeness” is determined by statistical word similarity, not clinical context.
- Chunk-collision. Two fragments enter the context window: one about a “low-fat diet for pancreatitis,” the other about “nutrition during chemotherapy.” Both are semantically close to the query, and they are clinically incompatible with each other in this context.
- Synthesis. The model forms an answer that is technically justified by the sources but clinically incorrect: nutrition protocols during chemotherapy differ fundamentally from general recommendations for pancreatic diseases. (A code sketch of these four steps follows the list.)
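A minimal sketch of the pipeline, assuming the sentence-transformers library and a small general-purpose model; the corpus fragments are condensed versions of the chunks shown in the next section:

```python
# Sketch of the four failure steps. Assumes: pip install sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "For pancreatitis, doctors recommend a low-fat diet: at most 40g of fat per day.",
    "During chemotherapy it is important to maintain calorie intake, including fats.",
    "After pancreatic resection the diet changes drastically: minimal fats at first.",
]

# Step 1: embedding. The query becomes a vector in semantic space.
query_vec = model.encode("diet for pancreatic cancer", convert_to_tensor=True)
corpus_vecs = model.encode(corpus, convert_to_tensor=True)

# Step 2: similarity search. Cosine similarity only; no clinical context.
hits = util.semantic_search(query_vec, corpus_vecs, top_k=3)[0]

# Step 3: chunk-collision. Incompatible fragments land in one context window.
context = "\n".join(corpus[hit["corpus_id"]] for hit in hits)

# Step 4: synthesis. The context is handed to an LLM as-is; nothing in the
# prompt tells the model that the sources describe different conditions.
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: diet for pancreatic cancer"
print(prompt)
```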
What this looks like at the retrieval query level
To make the failure concrete, let's look at what the system actually “sees.” A simplified retrieval result for the query “diet for pancreatic cancer”:
Query: "diet for pancreatic cancer"
Retrieved chunks (top-3 by cosine similarity):
[chunk_id: 4821] score: 0.91
source: forum_post, date: 2021-08
"For pancreatitis, doctors recommend a low-fat diet —
no more than 40g of fat per day. The pancreas cannot cope
with fat processing, so restriction is critical..."
[chunk_id: 2203] score: 0.88
source: health_blog, date: 2022-03
"During chemotherapy, it is important to maintain calorie intake.
Patients are often recommended high-calorie nutrition,
including fats — to preserve body mass..."
[chunk_id: 7734] score: 0.85
source: forum_post, date: 2020-11
"After pancreatic resection, the diet changes drastically.
The first months — minimal fats, then gradual expansion..."
The model receives three chunks with cosine similarity 0.85–0.91, all “relevant” by the metric. But chunk 4821 describes chronic pancreatitis, chunk 2203 covers nutrition during chemotherapy, and chunk 7734 describes a postoperative state. These are three different clinical contexts with opposite recommendations regarding fats.
The system has no mechanism to distinguish them, so it synthesizes a “balanced” answer from contradictory sources.
Why cosine similarity doesn't see clinical context
Cosine similarity measures the angle between two vectors in a multi-dimensional space. The smaller the angle, the “closer” the documents. But what exactly is encoded in this space?
An embedding model learns to predict words from context (or vice versa). It learns that “pancreas,” “diet,” “fats,” “chemotherapy,” and “pancreatitis” are semantically related words. Documents containing these words therefore receive close vectors.
But the embedding doesn't understand that “restrict fats” and “increase calorie intake through fats” are opposing medical instructions for different conditions of the same organ. For a vector, these are just two documents about the pancreas and diet.
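The formula itself shows why. Here it is in code, applied to toy five-dimensional “embeddings” (the numbers are hypothetical, chosen to mimic the situation above: four dimensions of shared lexical signal, one dimension that flips direction):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||): measures the angle, nothing else.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Shared signal: pancreas, diet, fats, nutrition. Only the last dimension
# encodes the direction of the instruction (restrict vs. increase).
restrict_fats = np.array([0.9, 0.8, 0.7, 0.6, -0.3])
increase_fats = np.array([0.9, 0.8, 0.7, 0.6, 0.3])

print(round(cosine_similarity(restrict_fats, increase_fats), 3))  # 0.925
```

One flipped dimension barely moves the angle: the opposing instructions still score as near-duplicates.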
For more on the limitations of embedding models in domain-specific tasks, see the RAGAS study (2023).
Visual diagram: what happens in the context window
```
User query
     │
     ▼
┌───────────────────┐
│  Embedding model  │ → query vector
└───────────────────┘
     │
     ▼
┌───────────────────┐
│   Vector store    │ similarity search
│  (entire corpus)  │ → top-K chunks
└───────────────────┘
     │
     ├── chunk A: pancreatitis + low-fat diet (score: 0.91)
     ├── chunk B: chemotherapy + high-calorie (score: 0.88) ← COLLISION
     └── chunk C: postoperative state (score: 0.85)
     │
     ▼
┌───────────────────┐
│        LLM        │ synthesizes answer from A + B + C
│    (generation)   │ ← model "doesn't know" that A and B are incompatible
└───────────────────┘
     │
     ▼
Answer technically justified
by the sources, but
clinically incorrect
```
Why this happens specifically in YMYL niches
Google defines YMYL (Your Money or Your Life) as a category of queries where an incorrect answer can cause real harm: medicine, law, finance. We covered how YMYL niches work from an SEO perspective in a separate article, YMYL Niches: A Complete Guide to SEO. In the context of RAG architecture, the chunk-collision problem is most critical in these niches for two reasons:
- High domain specificity. Medical terms have narrow meanings that a general-purpose embedding space does not distinguish. “Pancreatitis” and “pancreatic cancer” are adjacent topics for the model, but for a clinician they are fundamentally different pathologies with different protocols (see the quick check after this list).
- Disproportionate cost of error. An inaccurate restaurant recommendation means a bad meal. An inaccurate medical recommendation can have far more serious consequences.
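A quick way to probe the first point, assuming the same sentence-transformers setup as in the pipeline sketch above (the exact score is model-dependent and illustrative):

```python
# How close do "adjacent" medical terms sit in a general-purpose embedding space?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("pancreatitis", convert_to_tensor=True)
b = model.encode("pancreatic cancer", convert_to_tensor=True)

# A clinician sees two fundamentally different pathologies; a generic vector
# space typically sees two highly similar strings about the same organ.
print(float(util.cos_sim(a, b)))  # expect a high similarity score
```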
For more on YMYL criteria in the context of search quality evaluation, see Google's Search Quality Rater Guidelines.