I added BM25 to my RAG service — and vector search stopped missing exact queries

Pure vector search loses exact terms, prices, and document numbers. I fixed this in one day — without changing the LLM, without GPUs, without new dependencies.

My RAG service was working. Vector search found relevant chunks, the LLM generated answers in Ukrainian. But when a client asked "lawyer consultation 500 UAH" — vector search returned chunks about legal services in general, ignoring the exact price. And the query "Order No. 142" found everything about orders, except for document No. 142 itself.

The problem was not with the LLM or the embedding model. Pure vector search looks for *meaning*, but sometimes the exact *text* is what matters. I added BM25 alongside vector search and merged the results via RRF, and retrieval quality improved noticeably. This article covers how I did it in production on Spring Boot + pgvector, what mistakes I made, and what to consider before implementing it.

⚡ In Short

  • Problem: vector search "blurs" exact terms, prices, codes, document numbers
  • Solution: hybrid search — BM25 (keywords) + vector (semantics) + RRF (merging)
  • Stack: Java 21, Spring Boot, PostgreSQL + pgvector, tsvector for BM25
  • Configuration: switching vector/hybrid via properties, without recompilation

🎯 Why Hybrid Search: Where Vector Search Fails

I am building a commercial RAG service — business clients upload company documents (PDF, DOCX, CSV, FAQ), and their users ask questions in natural language and receive answers from the LLM based on the uploaded content. Stack: Java 21, Spring Boot + Spring AI, PostgreSQL with pgvector (IVFFlat index), Ollama locally (nomic-embed-text for embeddings, mistral-nemo for chat).

Before hybrid search, my search worked like this: the user's query is converted into a vector (768 dimensions via nomic-embed-text), pgvector finds the closest chunks by cosine similarity. This captures *meaning* well — a query like "how to protect company data" found chunks about "information security" and "personal data protection," even if the words didn't match.

But I noticed three types of queries where vector search consistently missed:

  • Exact prices and numbers: "500 UAH" — the embedding model converts this into a vector that describes the general "meaning" of the price, but the difference between 500 and 550 in vector space is minimal
  • Codes and document numbers: "Order No. 142" — the vector for "order" is similar to the vector for any other order, the number is lost
  • Specific terms: "amortization" — vector search returned semantically similar results ("wear and tear of fixed assets"), but not always the chunk with the exact term

This is a known problem with vector search, which I described in detail in the article on Hybrid Search and Reranking. The solution is to add keyword search (BM25), which looks for exact word matches, and combine the results with vector search.

📌 What is BM25 and Why the 1994 Algorithm Still Works

BM25 (Best Matching 25) is a text search ranking algorithm that was formalized by Robertson and Walker in 1994. It hasn't died in 30 years — and here's why.

BM25 evaluates document relevance based on three factors:

  • TF (Term Frequency) — how often a word appears in a specific chunk. The more frequent, the more relevant
  • IDF (Inverse Document Frequency) — how rare a word is in the entire collection. The word "document" appears everywhere — it's less valuable. The word "amortization" is rare — it's more important
  • Document length — normalization to ensure short and long chunks are on equal footing
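
The three factors combine into a single per-term score. Below is a minimal sketch of the commonly used Okapi BM25 form. The parameters k1 = 1.2 and b = 0.75 are typical textbook defaults, not values taken from my service, and the class and method names are hypothetical.

```java
final class Bm25Sketch {
    // Assumed defaults: K1 controls term-frequency saturation, B controls length normalization.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** IDF: terms that occur in few chunks (small docFreq) get a larger weight. */
    static double idf(long totalDocs, long docFreq) {
        return Math.log(1.0 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score contribution of one query term in one chunk. */
    static double termScore(int termFreq, int docLen, double avgDocLen,
                            long totalDocs, long docFreq) {
        // Chunks longer than average are penalized, shorter ones are boosted.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        // Term frequency saturates: each extra occurrence adds less than the previous one.
        double tf = (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
        return idf(totalDocs, docFreq) * tf;
    }
}
```

The score of a chunk for a whole query is the sum of termScore over the query terms. A rare term such as "amortization" gets a high IDF, while a term that appears in almost every chunk contributes close to nothing.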

For my use case, BM25 is critical because business documents contain exact terms, prices, numbers — things that vector search "blurs." BM25 finds a chunk with "500 UAH" in milliseconds, without neural networks, without GPUs.

BM25 limitations: it doesn't understand synonyms. "Car" ≠ "automobile." If a user writes "cancel subscription," but the document says "opt out of tariff," BM25 will find nothing. This is precisely why a *combination* is needed — vector search captures meaning, BM25 captures exact words.

Table: When What Works

Query Type | Vector Search | BM25 | Hybrid
"lawyer consultation 500 UAH" | ⚠️ finds legal, ignores price | ✅ exact match | ✅✅
"how to protect company data" | ✅ semantics | ❌ no exact matches | ✅
"Order No. 142 on dismissal" | ⚠️ finds orders in general | ✅ "Order No. 142" | ✅✅
"refund" | ✅ semantics | ✅ exact match | ✅✅ both signals

A detailed comparison of BM25 vs. Dense Vector Search with benchmarks is in my article on Hybrid Search, Section 1.

📌 Database Preparation: Migration, tsvector, GIN Index

Before writing Java code, I prepared PostgreSQL. My vector_store table already had vectors (embeddings) for cosine similarity search. For BM25, an additional structure is needed — tsvector. This is a built-in PostgreSQL type where text is broken down into tokens (lexemes) with positions. Without it, full-text search using the @@ operator doesn't work.

Analogy: a vector (embedding) captures the "understanding of meaning" of a text, while a tsvector is like the alphabetical index at the back of a book. Different types of search need different data structures.

Migration — Two Commands

-- 1. Add a column for full-text search
ALTER TABLE vector_store ADD COLUMN content_tsv tsvector;

-- 2. Create a GIN index for fast keyword search
CREATE INDEX idx_vector_store_content_tsv ON vector_store USING GIN (content_tsv);

The first command adds the content_tsv column. After migration, it will be NULL for all existing chunks — this is normal, we'll fill it later.

The second command creates a GIN (Generalized Inverted Index) — an index type optimized for full-text search. Without it, BM25 queries with the @@ operator would scan all rows. With GIN — the search is fast. It's like an IVFFlat index for vectors, only GIN is for text.

Populating tsvector for Existing Chunks

UPDATE vector_store
SET content_tsv = to_tsvector('simple', content)
WHERE content_tsv IS NULL;

⚠️ Pitfall: Choosing the Text Search Configuration

When PostgreSQL converts text to tsvector, it needs to know the language — to remove stop words ("and," "or," "on") and reduce words to their base form (stemming: "employees" → "employee").

PostgreSQL has no built-in dictionary for the Ukrainian language. I had two options:

  • simple — simply splits text into words, converts to lowercase. No stemming, no stop words. Reliable — there won't be situations where PostgreSQL incorrectly stems a Ukrainian word
  • russian — the closest built-in option. Stemming partially works for Ukrainian (languages are similar), but may incorrectly stem some words

I chose simple — less "smart," but reliable. BM25 with the simple config still finds exact keyword matches, and for "understanding meaning," I have vector search.

📌 Implementation of HybridSearchService: two searches + RRF

My HybridSearchService does three things in the search() method:

  1. Vector search — the same vectorStore.similaritySearch() via pgvector, cosine similarity
  2. BM25 search — SQL query with the @@ operator on the content_tsv column
  3. RRF merge — combining both result lists using the formula 1/(k + rank)

BM25 search: SQL query

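For BM25, I use plainto_tsquery(): it splits the user's query into words and requires all of them to match (AND). Results are ranked with ts_rank(), a built-in PostgreSQL function. Strictly speaking, ts_rank() is not BM25: it scores by how often the matching lexemes occur in the chunk and does not apply BM25's collection-wide IDF. For this setup that is acceptable, because RRF only consumes the rank order of the keyword results, not the raw scores.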

SELECT id, content, metadata FROM vector_store
WHERE content_tsv @@ plainto_tsquery(CAST(:tsconfig AS regconfig), :question)
ORDER BY ts_rank(content_tsv, plainto_tsquery(CAST(:tsconfig AS regconfig), :question)) DESC
LIMIT :topK
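
To make the AND semantics of this query concrete, here is a rough plain-Java approximation of what `content_tsv @@ plainto_tsquery('simple', question)` decides for one chunk. It is only a sketch: the real PostgreSQL parser handles more token types (e-mails, URLs, hyphenated words), and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class SimpleAndMatchSketch {
    /** Rough stand-in for the 'simple' config: lowercase, split on anything
     *  that is not a letter or digit, no stemming, no stop-word removal. */
    static Set<String> tokens(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** AND semantics: the chunk matches only if it contains every query token. */
    static boolean matches(String chunk, String query) {
        Set<String> queryTokens = tokens(query);
        return !queryTokens.isEmpty() && tokens(chunk).containsAll(queryTokens);
    }
}
```

Under these assumptions, "Order No. 142" matches a chunk containing "Order No. 142 on dismissal", but "what does order no. 142 say" does not, because "what", "does", and "say" are missing from the chunk. This is one reason long conversational queries often produce zero keyword hits with the simple config.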

⚠️ Pitfall: CAST to regconfig

My first attempt without CAST resulted in a BadSqlGrammarException. PostgreSQL cannot automatically cast the string parameter ? (which comes as String via JdbcClient) to the regconfig type. An explicit cast is needed: CAST(:tsconfig AS regconfig). The same error occurred when populating content_tsv during indexing — it had to be fixed in two places.

RRF (Reciprocal Rank Fusion): how merging works

RRF was proposed by Cormack, Clarke, and Buettcher in 2009 (SIGIR '09) and has since become a standard for hybrid search. The formula is:

score(d) = Σ 1 / (k + rank)

where rank is the position of the document in each individual ranking, and k is a smoothing constant (I use the standard value of 60).

What k does: it controls how much "first place" differs from "fifth place".

  • k=60 (standard): 1st place = 1/61 = 0.0164, 5th = 1/65 = 0.0154. The difference is small — all results are "almost equal"
  • k=1 (small): 1st = 1/2 = 0.5, 5th = 1/6 = 0.167. The difference is threefold — top results dominate
  • k=200 (large): there is almost no difference — only whether the chunk was included in the results matters
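
The arithmetic in this list can be checked with a few lines of Java. This is only an illustration of the formula; the class and method names are hypothetical.

```java
final class RrfConstantSketch {
    /** Contribution of a single ranking to a document's RRF score (rank is 1-based). */
    static double contribution(int k, int rank) {
        return 1.0 / (k + rank);
    }

    /** How many times more a rank-1 hit contributes than a rank-5 hit for a given k. */
    static double firstToFifthRatio(int k) {
        return contribution(k, 1) / contribution(k, 5);
    }
}
```

The ratio simplifies to (k + 5) / (k + 1): exactly 3 for k=1, about 1.07 for k=60, and about 1.02 for k=200. The larger k is, the less the exact position matters and the more it matters that a chunk appears in a ranking at all.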

Why 60? This value comes from the original research paper. It is also the default rank constant for RRF in Elasticsearch, and other engines such as Qdrant and Weaviate ship RRF-based fusion as well.

Example with real data

Query: "Order No. 142 on dismissal"

Vector search returns (by cosine similarity — based on the meaning of "dismissal"):

  1. Chunk about the dismissal procedure (rank 1)
  2. Chunk about the employment contract (rank 2)
  5. Chunk "Order No. 142" (rank 5)

BM25 returns (by exact match of the words "Order No. 142"):

  1. Chunk "Order No. 142" (rank 1)
  2. Chunk "Order No. 155" (rank 2)

RRF score for the "Order No. 142" chunk:

vector rank=5: 1/(60+5) = 0.0154
BM25 rank=1:   1/(60+1) = 0.0164
total:                     0.0318 ← highest among all

The "Order No. 142" chunk wins because it is the only one that appears in both rankings. With vector search alone, it would have stayed in 5th position and might not have made it into the LLM context.
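
The worked example can be reproduced with a small, self-contained RRF merge. This is a sketch rather than the code of my HybridSearchService: chunks are represented by plain string ids, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RrfMergeSketch {
    /** Sums 1/(k + rank) per id over every ranking; ranks are 1-based. */
    static Map<String, Double> scores(int k, List<List<String>> rankings) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores;
    }

    /** Ids ordered by descending RRF score, cut to topK. Ties keep first-seen order. */
    static List<String> merge(int k, int topK, List<List<String>> rankings) {
        Map<String, Double> scores = scores(k, rankings);
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return List.copyOf(ids.subList(0, Math.min(topK, ids.size())));
    }
}
```

Calling merge(60, 5, ...) with a vector ranking in which "order-142" sits at position 5 and a BM25 ranking in which it sits at position 1 puts "order-142" first, with a score of 1/65 + 1/61 ≈ 0.0318. If the BM25 list is empty, the merged order is identical to the vector order.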

Populating tsvector during indexing of new documents

New documents go through PgVectorIndexingService. After vectorStore.add(documents) (which stores embeddings), I added an UPDATE to populate content_tsv:

private void updateTsVector(Long docId) {
    jdbcClient.sql(
        "UPDATE vector_store SET content_tsv = to_tsvector(CAST(:tsconfig AS regconfig), content) " +
        "WHERE metadata->>'doc_id' = :docId AND content_tsv IS NULL"
    )
    .param("tsconfig", tsConfig)  // @Value("${app.search.tsconfig:simple}")
    .param("docId", String.valueOf(docId))
    .update();
}

Why not a trigger: I considered a PostgreSQL trigger, but a trigger lives in the database and knows nothing about Spring's @Value. If the client changes the language (e.g., from simple to german for a German client), the trigger would have to be recreated via a migration. With Java code, everything is managed from application.properties.

📌 Configuration: vector vs hybrid via properties

I did not remove the old PgVectorSearchService. Instead, I implemented switching via @ConditionalOnProperty:

# application.properties
# Search mode: "hybrid", or "vector" for pure vector search
app.search.mode=hybrid
# Text search configuration for tsvector: simple, russian, german, english...
app.search.tsconfig=simple
# RRF smoothing constant
app.search.rrf-k=60

// Active when app.search.mode=vector or when the property is not set
@ConditionalOnProperty(name = "app.search.mode", havingValue = "vector", matchIfMissing = true)
public class PgVectorSearchService implements SearchService { ... }

// Active only when app.search.mode=hybrid
@ConditionalOnProperty(name = "app.search.mode", havingValue = "hybrid")
public class HybridSearchService implements SearchService { ... }

Why two modes

My service serves different business clients, and hybrid search is not optimal for all of them:

  • Hybrid search puts more load on the database — two queries instead of one (vector + BM25), plus an additional GIN index consumes RAM. For some clients with a small document base and simple queries, this is excessive
  • Fallback — if the BM25 part breaks or tsvector is not populated for some chunks, you can instantly switch back to pure vector search via a single property
  • A/B testing — you can compare the quality of responses between modes for the same queries

By default, matchIfMissing = true on PgVectorSearchService — if the property is not set, it works as before. Nothing breaks.

Configuration per client

For a Ukrainian client:

app.search.mode=hybrid
app.search.tsconfig=simple

For a German client:

app.search.mode=hybrid
app.search.tsconfig=german

For a client with a small database where hybrid is excessive:

app.search.mode=vector

The language for tsvector and tsquery must match — otherwise, the search will not work correctly. PostgreSQL natively supports: simple, english, german, french, spanish, russian, italian, dutch, turkish, and others.

⚠️ Pitfalls I encountered

1. BadSqlGrammarException with plainto_tsquery

Problem: the first run of BM25 search produced bad SQL grammar. PostgreSQL could not cast the string parameter ? to the regconfig type.

Solution: explicit cast CAST(:tsconfig AS regconfig) in two places — in the WHERE and ORDER BY parts of the SQL.

2. The same error during indexing of new documents

Problem: I fixed HybridSearchService, but when loading a new document — the same BadSqlGrammarException in PgVectorIndexingService.updateTsVector().

Solution: add CAST there too. Lesson learned — if to_tsvector() is used with a parameter via JdbcClient, CAST is needed *always*.

3. @RequiredArgsConstructor does not work with @Value

Problem: Lombok's @RequiredArgsConstructor generates a constructor only for final fields. A field with @Value is not final, so it doesn't get into the constructor.

Solution: replaced @RequiredArgsConstructor with an explicit constructor in classes where there are both final dependencies and @Value configuration.

4. content_tsv = NULL for existing documents

Problem: after the migration, the new content_tsv column was NULL for all existing chunks. BM25 returned 0 results.

Solution: a one-time UPDATE: UPDATE vector_store SET content_tsv = to_tsvector('simple', content) WHERE content_tsv IS NULL;

5. Similarity threshold for Ukrainian text

The default cosine similarity threshold for vector search is too high for Ukrainian text from nomic-embed-text. I lowered it to 0.1 — otherwise, vector search returned few results. More details on choosing an embedding model and threshold — in the article on embedding models.

📊 Results: vector=5, bm25=2, merged=5

Logs from my service after implementing hybrid search:

Stream query: 'How long does it take to get a refund?', sessionId=3
HybridSearchService: Hybrid search results: 5 chunks (vector=5, bm25=1, merged=5)

Stream query: 'How long does it take to develop a landing page?', sessionId=null
HybridSearchService: Hybrid search results: 5 chunks (vector=5, bm25=2, merged=5)

Stream query: 'write about Local deployment How it works?', sessionId=3
HybridSearchService: Hybrid search results: 5 chunks (vector=5, bm25=0, merged=5)

What we see:

  • "refund" — BM25 found 1 chunk with an exact word match, vector found 5 by meaning. The chunk that appeared in both rankings received the highest RRF score and ended up first
  • "landing page development" — BM25 found 2 chunks. Hybrid provided 5 results — chunks from both rankings, sorted by RRF score
  • "Local deployment" — BM25 found 0. This is normal: the query "write about Local deployment How it works" is semantic, without exact terms from the documents. Vector search handled it on its own

Main conclusion: hybrid search does not degrade results when BM25 finds nothing — it simply returns vector results. But when BM25 *does find something* — the quality of the final ranking improves, because RRF boosts chunks that appear in both searches.

About document chunking — how I split PDF, DOCX, and CSV into chunks for indexing, including semantic FAQ chunking — read in the article on Chunking Strategies. And about choosing Ollama models that work locally on 8 GB RAM — in the article on Ollama on 8 GB.

❓ Frequently Asked Questions (FAQ)

Is hybrid search necessary if the document base is small (< 100 documents)?

Not necessarily. With a small base, vector search is usually sufficient — there is less "noise" and relevant chunks reach the top. Hybrid is justified when documents contain exact terms, codes, or prices that vector search "blurs." That's why I implemented switching via app.search.mode — for small clients, I keep it as vector.

Why tsvector instead of Elasticsearch for BM25?

I already have PostgreSQL — it stores both documents and vectors (pgvector). Adding Elasticsearch as a separate service means DevOps overhead, monitoring, data synchronization. The built-in tsvector with a GIN index solves the BM25 search problem without additional infrastructure. For a scale of 10K+ documents with high QPS, it's worth considering Elasticsearch or Qdrant with native hybrid.

How does hybrid search affect latency?

Minimally. BM25 and vector search are executed sequentially (not yet in parallel), but BM25 via a GIN index takes milliseconds. RRF merging takes microseconds (rank calculation). The main time is spent embedding the query via nomic-embed-text and vector search via pgvector. Hybrid adds ~5-15ms to the total search time.
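
If the sequential execution ever becomes a bottleneck, the two searches are independent and can run concurrently. The following is a sketch of that idea with CompletableFuture, not code from my service: the two searches are passed in as plain functions returning chunk ids, and all names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class ParallelRetrievalSketch {
    /** Holds both ranked lists so that a later step can fuse them with RRF. */
    record Rankings(List<String> vector, List<String> keyword) {}

    static Rankings retrieve(String question,
                             Function<String, List<String>> vectorSearch,
                             Function<String, List<String>> keywordSearch) {
        // Both searches start immediately on the common ForkJoinPool.
        CompletableFuture<List<String>> vectorFuture =
                CompletableFuture.supplyAsync(() -> vectorSearch.apply(question));
        CompletableFuture<List<String>> keywordFuture =
                CompletableFuture.supplyAsync(() -> keywordSearch.apply(question));
        // join() blocks until each search finishes and rethrows failures as CompletionException.
        return new Rankings(vectorFuture.join(), keywordFuture.join());
    }
}
```

With this shape, the wait is roughly that of the slower search instead of the sum of both. In a real service, a dedicated executor passed to supplyAsync would probably be a better choice than the common pool, since both tasks block on database I/O.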

Can I use the russian config instead of simple for Ukrainian?

You can — stemming works partially (languages are similar). But there's a risk of incorrect stemming for some Ukrainian words. simple is more reliable: it simply tokenizes without stemming. For "understanding" words, I have vector search — BM25 is only needed for exact matches.

What to do if BM25 always returns 0 results?

Check three things: (1) is the content_tsv column populated — execute SELECT count(*) FROM vector_store WHERE content_tsv IS NOT NULL; (2) does the tsconfig in to_tsvector() and plainto_tsquery() match; (3) are there exact matches of the query words in the chunk text. If queries are predominantly semantic — BM25 returns 0, and this is normal behavior.

✅ Conclusions

  • 🔹 Vector search alone is not enough: it "blurs" exact terms, prices, document numbers. BM25 fills these gaps
  • 🔹 Hybrid search (BM25 + vector + RRF): two independent searches (currently run one after the other), merged using the formula 1/(k + rank). A chunk that appears in both rankings wins
  • 🔹 pgvector + tsvector: no Elasticsearch needed — PostgreSQL with a GIN index is sufficient for BM25 alongside vector search
  • 🔹 Client-specific configuration: app.search.mode=hybrid/vector, app.search.tsconfig=simple/german/english — all via properties, without recompilation
  • 🔹 Pitfalls: CAST(:tsconfig AS regconfig) is mandatory for JdbcClient, content_tsv needs to be populated for existing chunks, similarity threshold for Ukrainian text is 0.1
  • 🔹 Hybrid doesn't degrade: when BM25 finds nothing — results = vector search. When it finds something — quality improves

My main takeaway: hybrid search is the simplest and most effective step to improve RAG system quality after basic vector search. If your documents contain terms, prices, codes — the effect is immediately noticeable. And if not — hybrid simply works as vector search, breaking nothing.
