RAG with Ollama: Teach AI to Answer Based on Your Documents
You have documents — PDFs, articles, notes, a knowledge base. You want to ask questions
and get answers specifically from these documents, not from the model's general knowledge.
And all of this — locally, without sending data to the cloud.
This is exactly what RAG is for. In this article — an explanation of the concept,
a visual pipeline diagram, and a step-by-step Python example that actually works.
Plus production patterns you won't find in the documentation.
🎯 What is RAG — and why it's not fine-tuning
Short answer:
RAG (Retrieval-Augmented Generation) is a way to give the model access
to your documents without changing the model itself. Before each query,
the system finds relevant snippets from your base and adds them
to the context. The model answers based on these snippets —
not just what it learned during training.
LLMs are trained on trillions of tokens from the internet — but not on your
internal documents, not on your codebase, and not on articles
you wrote last week. RAG bridges this gap.
Analogy: A lawyer and a case
Imagine an experienced lawyer. They know laws, precedents, general practice —
everything they've learned over the years. But when given a new case — they don't answer
from memory. They read the case materials, highlight key facts,
and only then formulate a position based on specific documents.
An LLM without RAG is like that same lawyer asked to answer
without access to the case materials. They'll say something, but it will be
a general opinion, not an answer to your specific case.
RAG gives them those materials.
Why not fine-tuning
Fine-tuning is retraining the model on your data. It sounds logical,
but in practice, it's expensive, slow, and inflexible. If your documents
change — you need to retrain again. If you made a mistake in the data —
the model "remembers" the wrong thing.
RAG doesn't touch the model at all. Updated documents — re-indexed in minutes.
Found an error — fixed it in the source. The model itself remains the same.
That's why for most tasks where you need to "answer based on documents,"
RAG is the right choice, not fine-tuning.
RAG vs Fine-tuning: When to choose what
| Criterion | RAG | Fine-tuning |
| --- | --- | --- |
| Implementation complexity | Low — a few days | High — weeks |
| Computational resources | Minimal — CPU is enough | Requires GPU, significant costs |
| Knowledge update | Re-indexing — minutes | Retraining — days |
| Transparency | ✔️ Visible which documents were used | ❌ Difficult to explain the source of the answer |
| Data error | Fixed source — done | Retraining from scratch |
| Suitable for | Knowledge base, documents, FAQ, search | Changing style, tone, response format |
| Not suitable for | Changing the model's behavior itself | Frequently updated documents |
When RAG won't solve the problem
RAG is not a silver bullet. There are several scenarios where it won't help:
- ⚠️ Short queries (1-2 words) — semantic search
on short queries is less accurate than keyword search.
A fallback is needed.
- ⚠️ Questions outside the scope of documents — if
the answer is not in the base, the model will either say "I don't know" or
start hallucinating. The prompt must explicitly forbid the latter.
- ⚠️ Need to change the model's style or behavior —
RAG won't help here; this is a task for fine-tuning or
detailed system prompting.
Section conclusion: RAG is the right choice when
you need to answer based on specific documents, and these documents
change or are updated. Fine-tuning is needed when you need to
change the model's behavior itself, not expand its knowledge.
🎯 How the pipeline works: from document to answer
Short answer:
The RAG pipeline is divided into two independent phases. Indexing —
performed once or when documents are updated:
document → chunks → embeddings → vector store.
Retrieval and generation — with each user query:
question → embedding → similarity search → prompt with context → answer.
If you understand where everything happens — debugging becomes much easier.
One of the first things I realized in practice: the model never
"reads" the entire document. It only sees a few
snippets that the system deems most relevant. The quality of the answer
directly depends on the quality of this selection.
Phase 1 — Indexing
Indexing happens once — or on a schedule when content is updated.
The goal: to convert human text into a form that a vector store can search.
Step 1. Document Reading
Input formats: PDF, HTML, Markdown, plain text, web pages.
The main thing at this stage is to get clean text without markup.
HTML tags, PDF metadata, control characters — all this is noise that degrades
embedding quality. Jsoup for Java, BeautifulSoup for Python —
standard cleaning tools.
Step 2. Chunking — Splitting into Fragments
The document is split into smaller parts — chunks. Why not store the whole document?
Two reasons. First, embedding models have a token limit:
nomic-embed-text — 8192 tokens, all-minilm — 256. An article of 5000 words
simply won't fit. Second, search is more accurate on smaller fragments:
if you store an entire article as one vector — the vector averages all the article's topics
and becomes less specific.
Standard: 512 tokens with a 50-token overlap. Overlap is the deliberate duplication
of tokens between adjacent chunks, so that a thought doesn't break at the fragment boundary.
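The splitting logic is simple enough to sketch in a few lines. A real pipeline counts model tokens (via the embedding model's tokenizer); here words stand in for tokens to keep the example dependency-free, and the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap` words
    with the previous one so a thought isn't cut at the boundary."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

For a 1000-word document with the defaults, this yields three chunks, and the last 50 words of each chunk reappear as the first 50 words of the next.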
Step 3. Embeddings
Each chunk is converted into a numerical vector — a list of 768 numbers
(for nomic-embed-text). This vector encodes the *meaning* of the text,
not specific words. That's why "turnkey web development"
and "how to create a website" will have similar vectors even without shared words.
This is the advantage of semantic search over LIKE-search.
Step 4. Vector Store
Vectors and original chunk text are stored in a vector database:
pgvector (extension for PostgreSQL), Chroma, FAISS, Milvus, or another.
If the project already has PostgreSQL — pgvector is a natural choice:
one database instance, nothing new to install.
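If you go the pgvector route, the store can be a single table. A minimal sketch — table and column names are illustrative, and 768 matches the nomic-embed-text vector size mentioned above:

```sql
-- Requires the pgvector extension (https://github.com/pgvector/pgvector)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
    id          uuid PRIMARY KEY,      -- stable across re-indexing runs
    document_id bigint NOT NULL,       -- source document, needed for deduplication
    content     text   NOT NULL,       -- original chunk text
    embedding   vector(768) NOT NULL   -- one vector per chunk
);

-- Approximate nearest-neighbour index for cosine distance.
CREATE INDEX ON rag_chunks USING hnsw (embedding vector_cosine_ops);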
Phase 2 — Retrieval and Generation
This phase runs with every query and takes seconds.
The user enters a question — and after a few steps, gets an answer.
Step 5. Query Embedding
The user's question is also converted into a vector —
using the same embedding model used during indexing.
This is crucial: if you indexed using nomic-embed-text,
the query must also go through nomic-embed-text.
Different models produce incompatible vectors — and the search will be irrelevant.
Step 6. Similarity Search
The vector store finds the N closest vectors to the query vector.
"Closest" means cosine similarity or dot product between vectors.
Result: top-5 (or however many you set) most relevant chunks.
A similarity threshold filters out weak matches — usually you start
with 0.5 and adjust based on logs.
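"Closest" is worth seeing in code at least once. A vector store computes this at scale with an index, but the metric itself is just a normalized dot product:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that cosine similarity ignores vector length: a vector and its doubled copy score 1.0, which is exactly why it works for comparing embeddings of texts of different sizes.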
Step 7. Prompt Formulation
The found chunks are inserted into the system prompt as context.
The quality of this prompt directly affects the quality of the answer.
The minimum that should be in the system prompt:

    Answer ONLY based on the provided context.
    If the answer is not in the context, honestly state that.
    Do not invent information that is not in the documents.

    Context:
    {found chunks}
Without an explicit prohibition against invention — the model will hallucinate.
This is not a bug of a specific model, but a general behavior of LLMs.
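Assembling that prompt from the retrieved chunks is a pure string operation. A sketch — the anti-hallucination wording follows the article, while the function name and the separator between chunks are illustrative choices:

```python
def build_system_prompt(chunks: list[str]) -> str:
    """Join retrieved chunks into the context block of the system prompt."""
    context = "\n---\n".join(chunks)  # visible separator between sources
    return (
        "Answer ONLY based on the provided context.\n"
        "If the answer is not in the context, honestly state that.\n"
        "Do not invent information that is not in the documents.\n\n"
        f"Context:\n{context}"
    )
```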
Step 8. Answer Generation
The LLM (Ollama with llama3.3, mistral, or another model) generates an answer
based on the context from the previous step.
An important detail: show the user the sources — which specific articles
or documents the system used. This is both transparency and a way
to verify where the model got the information.
Section conclusion: Two independent processes —
indexing and retrieval+generation. Most RAG problems arise
either at the chunking step (incorrect size), or at the
similarity search step (incorrect threshold), or in the system prompt
(no prohibition against hallucinations). We will break down each in detail below.
🎯 Tools: LlamaIndex vs LangChain vs manual approach
Short answer:
In 2026, the choice between LlamaIndex and LangChain is not about "better,"
but about what you are building. LlamaIndex is specialized for working with documents
and provides better search quality with less code. LangChain is for complex agent systems
where RAG is one of the components. A manual approach via Spring AI or
direct REST API is for when Python doesn't suit you or you need maximum
control over each step.
A detailed 2026 analysis notes that the framing "LangChain = orchestration,
LlamaIndex = data," which most articles repeat, is already outdated.
Both frameworks are converging and cover similar functionality.
The choice now depends on where you are starting
and what the foundation of your application is.
LlamaIndex — if documents and search are at the center
LlamaIndex is built around the idea that the main problem is
reliably connecting LLMs to your data. Everything else — agents, pipelines,
orchestration — goes on top of this foundation.
Independent benchmarks show:
LlamaIndex is 40% faster than LangChain on document retrieval tasks
and in 2025 achieved +35% in search accuracy.
The reason: built-in abstractions for chunking, re-ranking, and context assembly
are optimized for retrieval — not for general pipelines.
Pros:
- ✔️ Easiest start: 10 lines for a working RAG
- ✔️ Built-in re-ranking, hybrid search (semantic + keyword)
- ✔️ Wide list of data connectors via LlamaHub: PDF, HTML,
Notion, Google Docs, S3, databases
- ✔️ Flexible chunking: SentenceSplitter, TokenTextSplitter,
SemanticSplitter — change with one line
- ✔️ Native integration with Ollama via separate packages
Cons:
- ⚠️ For complex agent systems with many external tools —
less convenient than LangChain
- ⚠️ Documentation is less comprehensive than LangChain for atypical scenarios
When to choose: you have documents, PDFs, a knowledge base —
you need to ask questions and get answers. Blog, FAQ, internal
company documentation, technical reference.
LangChain — if RAG is part of a more complex system
LangChain was built around the idea that building with LLMs is a
workflow problem: there's input data, there's output, and in between — models, tools,
memory, and external sources that need to be composed.
In 2026
this means LangGraph — a directed graph where nodes are functions,
and edges are transitions between states. RAG in LangChain is one of the agent's tools,
not the central abstraction.
Pros:
- ✔️ Largest ecosystem and community — 220% growth in GitHub stars
in 2024
- ✔️ LangGraph for complex stateful agents with branching and loops
- ✔️ LangSmith — observability, tracing, evaluation out-of-the-box
- ✔️ Multimodal support is stronger than in LlamaIndex
- ✔️ Better for conversational memory over long conversations
Cons:
- ⚠️ Steeper learning curve — more concepts for simple RAG
- ⚠️ For pure document retrieval — more code than with LlamaIndex
- ⚠️ Frequent breaking changes: original LangChain replaced by LangGraph
for production scenarios
When to choose: an AI agent that calls external APIs,
writes code, executes SQL queries — and also searches documents.
RAG is not the main task here, but one of the agent's tools.
Manual approach — maximum control without a framework
Ollama REST API directly + any vector DB + custom logic.
This is the approach behind Spring AI for Java, where the framework
provides ready-made abstractions but you control every detail of the pipeline.
Pros:
- ✔️ Full control over every step — chunking, scoring,
fallback logic
- ✔️ No dependency on Python — any language via REST API
- ✔️ No breaking changes from an external framework
- ✔️ Can implement custom logic that frameworks don't support
Cons:
- ⚠️ More code — chunking, batch indexing, fallback are written by you
- ⚠️ No built-in re-ranking and hybrid search
- ⚠️ More implementation time than with LlamaIndex
When to choose: Java / Spring Boot project (Spring AI),
non-standard pipeline requirements, or when the framework adds more
complexity than it solves problems.
Choice Matrix
| Task | Recommendation |
| --- | --- |
| Answering based on PDFs and documents — simple start | LlamaIndex |
| Semantic search on a blog or knowledge base | LlamaIndex |
| Internal company documentation, FAQ | LlamaIndex |
| AI agent that calls APIs and tools | LangChain / LangGraph |
| Complex multi-step workflow with memory | LangChain / LangGraph |
| Java / Spring Boot project | Spring AI (manual approach) |
| Custom fallback and deduplication logic | Manual approach |
| Fast prototype in any language | Direct Ollama REST API |
Can they be combined
Yes — and it's a common practice in production.
Contabo describes a typical 2026 stack:
LlamaIndex as the knowledge layer (indexing and retrieval),
LangChain as the orchestration layer (agents and workflow),
and n8n or a custom service as the workflow engine.
But for most projects, one framework is better than two —
fewer dependencies, simpler maintenance.
Section conclusion: If the task is "answer based on documents,"
start with LlamaIndex. If you are building a complex AI agent where RAG is one
of the tools — LangChain. If you are on Java or need full control —
Spring AI or a direct REST API to Ollama.
🎯 Choosing a Model for Embeddings: nomic-embed-text and Alternatives
Short answer:
nomic-embed-text is the default choice for Ollama: lightweight (274 MB),
fast, with 768 dimensions. However, it has a significant limitation: it performs poorly with short queries and non-Latin languages. For Ukrainian text
and short queries, a fallback or a different model is needed.
An embedding model is the foundation of RAG. A poor embedding model will render the entire pipeline useless, regardless of the LLM's quality.
I learned this not from documentation, but when a query "LLM"
returned an irrelevant article with a score of 0.63,
while the correct one didn't even make it into the top 5.
What are Embeddings – An Analogy with Photography
Imagine you store photos in different formats.
A small 100x100 pixel photo takes up little space, but the details are blurry,
and similar faces are hard to distinguish. A 4000x3000 photo takes up much more,
but every detail is sharp and unique.
It's the same logic with embeddings. A vector is a "photograph" of text in a numerical space.
The vector dimension (384, 768, 1024) is like the resolution.
A larger vector encodes more details and nuances of meaning – and the search is more accurate.
However, a larger vector takes up more space in the database and is slower to generate.
all-minilm with its 384-dimensional vector is a 100x100 JPEG: lightweight and fast,
but it loses details. nomic-embed-text with 768 dimensions is a 1920x1080 JPEG:
a good balance. mxbai-embed-large with 1024 dimensions is a RAW format:
maximum detail, but costs twice as much in RAM.
And one more important detail regarding the photo analogy: if you stored all photos
in one format, you cannot compare them with photos in another format.
It's the same with embeddings: the model used for indexing must be
the same model used for embedding the query.
If you change the model, you need to re-index the entire database.
Comparison of Embedding Models for Ollama
| Model | Size | Vector | Context | When it's suitable | Command |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text | 274 MB | 768 | 8192 tokens | Default choice, long documents | ollama pull nomic-embed-text |
| mxbai-embed-large | 669 MB | 1024 | 512 tokens | High search accuracy, short chunks | ollama pull mxbai-embed-large |
| all-minilm | 46 MB | 384 | 256 tokens | Weak hardware, speed more important than accuracy | ollama pull all-minilm |
Real Limitations of nomic-embed-text
From real-world experience indexing 500+ articles in four languages,
nomic-embed-text has specific weaknesses not mentioned in the official documentation:
- ⚠️ Short queries (1-2 words) – a query like "LLM" returned
an irrelevant article with a score of 0.63, while the correct one didn't even make
it into the top 5. The model's vector space cannot effectively distinguish
short, single-word queries.
- ⚠️ Non-Latin languages – Ukrainian and Cyrillic in general
yield lower quality compared to English text.
A query like "LLM vs RAG in 2026" is matched correctly (score 0.69),
but a short Ukrainian query is already a problem.
- ⚠️ Mixed content – if documents are in multiple languages
simultaneously, search quality degrades: embeddings of English
and Ukrainian text reside in different "zones" of the vector space.
Practical Conclusion: When nomic-embed-text is Sufficient
nomic-embed-text is a good fit if:
- ✔️ Documents are predominantly in English
- ✔️ Queries are 3+ words (semantic search)
- ✔️ Large context is needed – 8192 tokens are sufficient for long articles
It requires a fallback or replacement if:
- ⚠️ Documents are predominantly in non-Latin languages
- ⚠️ Short search queries (1-2 words) – add keyword fallback
- ⚠️ Maximum accuracy is required – consider mxbai-embed-large
Section Conclusion: Choosing an embedding model is a trade-off
between accuracy, size, and speed – like choosing a photo format and resolution.
nomic-embed-text is a correct starting point. But know its limitations
and always test with your audience's real queries
before assuming the search "works."
🎯 Key Decisions in RAG Implementation – and What to Avoid
Short answer:
There is no universal working code for RAG – each project has its own
architecture, its own language, its own storage model, and its own search requirements.
However, there are several decisions that need to be made in any RAG project –
and common pitfalls in each of them.
Tutorials show how to launch RAG in 10 minutes.
Production shows that most of your time will be spent
solving problems the tutorial never mentions.
Decision 1 – How to Read and Clean Documents
Regardless of language or framework, at the input is "dirty" text.
PDFs contain headers, page numbers, formatting artifacts.
HTML contains tags, attributes, JavaScript, ad blocks.
All this noise gets into the embeddings and reduces search quality.
What's important to decide in advance:
- ✔️ What is the primary format and how to clean it
- ✔️ Whether to preserve structure (headings, sections)
or if plain text is sufficient
- ✔️ What to do with empty or very short documents
Typical pitfall: ignoring cleaning at the start – and getting
embeddings where 30% of the content is "Copyright 2024" and "Page 1 of 15."
Decision 2 – Chunk Size and Splitting Strategy
There is no correct chunk size for all cases.
512 tokens is standard for articles. 256 for technical documentation
where each paragraph is independent. 1024 if it's important to preserve context
between paragraphs.
What's important to decide:
- ✔️ Split by tokens or by sentences – different results
at chunk boundaries
- ✔️ What overlap is needed – 0 if chunks are independent,
50-100 tokens if ideas transition between paragraphs
- ✔️ Whether to store metadata in each chunk –
title, url, category, locale – for later filtering
Typical pitfall: setting the chunk size once and never checking again.
Examine the actual chunks after splitting – and you'll see where ideas are cut off.
Decision 3 – Chunk Identifiers
Seems trivial – but becomes critical during re-indexing.
If random UUIDs are used for each indexing,
the database accumulates duplicates. After a month, one document
can be represented by 10 versions of the same chunks.
The correct solution is a deterministic ID based on the document_id and the chunk's sequential number.
During re-indexing, the same chunk gets the same ID and is overwritten,
not duplicated. How to implement this – in Java using
UUID.nameUUIDFromBytes(), in Python using hashlib.md5().
The specific implementation depends on the stack – the principle is the same.
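In Python, the principle maps naturally onto uuid5, which derives a UUID from a name the same way Java's UUID.nameUUIDFromBytes() does (Java uses MD5, i.e. uuid3; uuid5 uses SHA-1; either works, just pick one and never change it). The namespace value and function name below are illustrative:

```python
import uuid

# Arbitrary but FIXED namespace: changing it changes every chunk ID.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rag-chunks")

def chunk_id(document_id: str, chunk_index: int) -> uuid.UUID:
    """Deterministic ID: same document + position always yields the same UUID,
    so re-indexing overwrites the row instead of duplicating it."""
    return uuid.uuid5(CHUNK_NAMESPACE, f"{document_id}:{chunk_index}")
```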
Decision 4 – Index Update Strategy
How to determine what needs to be re-indexed?
Two approaches:
- ✔️ Timestamp-based: store indexed_at
and re-index where updated_at > indexed_at.
Simple and reliable — but there's a catch:
if @PreUpdate or a trigger bumps updated_at
during saveAll(), indexed_at becomes outdated
immediately after saving. Solution: set indexed_at
slightly in the future — now() + 10 seconds.
- ✔️ Hash-based: store a content hash and re-index
only if the hash has changed. More accurate, but more complex to implement.
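The hash-based variant is a few lines in any language. A sketch where a plain dict stands in for the database column that would store the hash; names are illustrative:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(doc_id: str, text: str, stored_hashes: dict[str, str]) -> bool:
    """True if the document is new or its content changed since last indexing."""
    return stored_hashes.get(doc_id) != content_hash(text)
```

Unlike the timestamp approach, this is immune to ORM triggers touching updated_at: a document re-indexes only when its bytes actually changed.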
Decision 5 – Batch Indexing
Trying to index 500 documents in one request ends
in either an Ollama timeout or OOM. Batching by 10-20 documents –
and if it fails, you continue from where you stopped,
not from the beginning. This is why indexed_at is more important
than it seems – it allows the scheduler to pick up where it left off.
Decision 6 – LAZY Loading and Transactions
If you use an ORM (JPA, SQLAlchemy, ActiveRecord) –
be careful with lazy-loaded associations during indexing.
Accessing post.getTranslations() outside a transaction
results in a LazyInitializationException in JPA or
a DetachedInstanceError in SQLAlchemy.
Solution: JOIN FETCH in the query or explicitly load
everything needed within the transaction.
What to Avoid – Briefly
- ❌ Don't start with a "working example from a tutorial" directly in production –
it lacks fallback, batching, and idempotency.
- ❌ Don't ignore text cleaning before indexing.
- ❌ Don't use random UUIDs for chunks if you plan to re-index.
- ❌ Don't set a threshold once and for all – log scores
and adjust based on real queries.
- ❌ Don't assume vector search will completely replace keyword search –
short queries require a fallback.
- ❌ Don't forget that changing the embedding model requires a full re-index.
Section Conclusion: RAG is not "copy and run."
It involves six architectural decisions, each with its own pitfalls.
Make them consciously before implementation – and you'll save
from several hours to several days of debugging.
🎯 Production Patterns: What Tutorials Don't Tell You
Short answer:
The difference between "hello world" RAG and production RAG lies in four patterns:
fallback to keyword search, idempotent indexing,
result deduplication, and batch processing.
None of these are mentioned in basic tutorials.
You discover each of them through a real problem.
I discovered all four patterns not through documentation –
but through specific production incidents.
Fallback – when the query "Spring" returned irrelevant results.
Idempotency – when after a month of the scheduler running,
one article was in the database 47 times.
Deduplication – when the top 5 results contained 4 chunks from the same article.
Batch – when Ollama froze on 500 documents.
Pattern 1 – Fallback: vector → keyword
Vector search is built on semantic similarity.
It's good at finding "turnkey web development" for the query "how to build a website."
But for queries like "Spring" or "RAG" – the similarity gets diluted across the entire
vector space, and accuracy drops.
A rule that works in practice: queries of 1-2 words
should be directed to keyword search, 3+ words – to vector search.
If vector search returns an empty result – fallback to keyword.
If vector search fails with an error – fallback to keyword.
The user notices no difference but always gets an answer.
Implementation depends on the stack – but the principle is unchanged:
two providers, a facade that decides which one to use,
and automatic switching between them.
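That facade fits in one function. A sketch with the two providers injected as callables — the word-count rule and names are taken from the pattern described above, not from any specific library:

```python
from typing import Callable

def search(query: str,
           vector_search: Callable[[str], list],
           keyword_search: Callable[[str], list]) -> list:
    """Route between vector and keyword search with automatic fallback."""
    # Rule from practice: 1-2 word queries go straight to keyword search.
    if len(query.split()) <= 2:
        return keyword_search(query)
    try:
        results = vector_search(query)
    except Exception:
        return keyword_search(query)  # provider failed: fall back
    return results if results else keyword_search(query)  # empty: fall back
```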
Pattern 2 – Idempotent Indexing
The scheduler runs indexing every night.
If random UUIDs are generated for chunks with each run –
after a month, the same article is represented by 30 versions of the same chunks.
Search returns duplicates, quality degrades, the database grows without reason.
Solution: a deterministic ID based on the document identifier
and the chunk's sequential number. During re-indexing, the same chunk
gets the same ID and is overwritten – not added as a new record.
Specific implementation depends on the language – in Java it's
UUID.nameUUIDFromBytes(),
in Python – hashlib.md5() via uuid.UUID(bytes=...).
The principle is the same.
Pattern 3 – Result Deduplication
Vector search returns top-N chunks – not top-N documents.
If one article well matches the query – it can yield
4 chunks out of 5 in the top 5. The user will receive four links
to the same page.
Solution: after receiving results – deduplicate by document_id.
For each document, keep only the chunk with the highest score.
Result: the top 5 always contains 5 different documents,
not 5 fragments of one.
Important: metadata for each chunk must contain document_id –
without it, deduplication is impossible. Plan your metadata structure
before starting indexing.
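The deduplication step itself is a dictionary pass. In this sketch a hit is a (document_id, score, text) tuple; in a real pipeline document_id comes from the chunk metadata:

```python
def deduplicate(hits: list[tuple[str, float, str]]) -> list[tuple[str, float, str]]:
    """Keep only the highest-scoring chunk per document, best first."""
    best: dict[str, tuple[str, float, str]] = {}
    for hit in hits:
        doc_id, score, _ = hit
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = hit
    # Re-sort by score so the result still reads like a top-N list.
    return sorted(best.values(), key=lambda h: h[1], reverse=True)
```

If you need a top-5 of distinct documents, fetch more than 5 chunks from the store (say, 20) and deduplicate down, otherwise the list may come up short.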
Pattern 4 – Batch Indexing with Recovery
Ollama processes embeddings sequentially via CPU or GPU.
A request for 500 documents simultaneously – is either a timeout,
an OOM, or simply a freeze without an error.
Solution: batch by 10-20 documents. But more importantly –
immediately after each batch, save the indexed_at
timestamp for the processed documents. If the process fails on batch 7 out of 50 –
on the next run, the scheduler will pick up from batch 8,
not start from the beginning. This is the difference between "run and hope"
and reliable indexing.
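The recovery logic can be sketched stack-agnostically: the embedding call and the indexed_at update are injected as callables (stand-ins, not real APIs), and progress is persisted per batch, never at the end:

```python
from typing import Callable

def index_in_batches(documents: list,
                     index_batch: Callable[[list], None],
                     mark_indexed: Callable[[list], None],
                     batch_size: int = 10) -> int:
    """Index in small batches, persisting progress after each one.
    If index_batch raises, everything already marked stays marked,
    so the next run resumes instead of restarting."""
    done = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        index_batch(batch)   # may raise: nothing beyond this batch is marked
        mark_indexed(batch)  # persist indexed_at immediately, per batch
        done += len(batch)
    return done
```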
Section Conclusion: Four patterns – fallback,
idempotency, deduplication, and batching with recovery –
are the difference between a prototype that works for a demo
and a system that runs in production for months.
The implementation varies by stack – the principles are the same for all.
🎯 Typical Problems: Chunking, Threshold, Hallucinations
Three classes of problems in RAG — and all three are only revealed with real queries,
not synthetic tests. Incorrect chunking: the model receives fragments
without context. Incorrect threshold: either too much noise or nothing is found.
Hallucinations: the model invents things even with context — due to a weak prompt
or irrelevant chunks that still made it to the top.
The most important thing I realized: most RAG problems
are invisible without logging scores. If you don't see
what similarity score each result returns —
you can't diagnose why the search is giving bad answers.
Add score logging immediately — and you'll save hours of debugging.
Problem 1 — Chunking: How to Tell When Something is Wrong
Symptom: the model gives answers that seem partially correct —
there's some connection to the question, but the details are wrong or truncated.
Cause: chunks are split in unfortunate places, and the thought is cut off at the fragment boundary.
Diagnosis: output a few real chunks after splitting and read them.
If a chunk starts or ends in the middle of a sentence — the chunk size or splitting strategy is wrong.
- ✔️ Short documents, blog posts — 512 tokens, overlap 50
- ✔️ Technical documentation with long sections — 1024 tokens, overlap 100
- ✔️ Code — 256-512 tokens, split by functions, not lines
- ✔️ Overlap — don't skimp: 50-100 tokens of overlap guarantee
that the thought won't be lost at chunk boundaries
Problem 2 — Similarity Threshold: How to Find the Right Value
Symptom A: the model answers confidently but incorrectly —
the threshold is too low, irrelevant chunks are included in the context.
Symptom B: the model constantly says "not found" even for obvious questions —
the threshold is too high, nothing passes the filter.
Diagnosis: log the score of each result — and look at the distribution.
If correct results have a score of 0.55-0.65 — a threshold of 0.7 cuts them all off.
If irrelevant results have a score of 0.55 — a threshold of 0.5 lets noise through.
- ✔️ Starting point: 0.5 — and adjust based on real query logs
- ✔️ Short queries (1-2 words) — lower to 0.4 or better use fallback
- ✔️ Long semantic queries — 0.65-0.7 gives a cleaner result
- ✔️ If all results are below the threshold — return "not found"
instead of empty context
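The logging-plus-filtering step described above can be sketched as one function. Hits are (score, text) pairs here; the logger name is an illustrative choice:

```python
import logging

logger = logging.getLogger("rag.search")

def filter_by_threshold(hits: list[tuple[float, str]],
                        threshold: float = 0.5):
    """Log every score, drop weak hits, and return None when nothing
    passes, so the caller can answer 'not found' honestly."""
    for score, text in hits:
        logger.info("score=%.2f  %s", score, text[:60])
    kept = [(s, t) for s, t in hits if s >= threshold]
    return kept or None  # None signals: do not call the LLM with noise
```

The log lines are the point: after a day of real queries, the score distribution tells you whether 0.5 is cutting off correct answers or letting noise through.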
Problem 3 — Hallucinations Even With Context
Symptom: the model gives a confident answer that doesn't match any
of the provided chunks — or incorrectly combines facts from different sources.
The cause is most often not in the model — but in the prompt or the fact
that irrelevant context still made it to the top.
Three levels of protection:
- ✔️ Strict system prompt — explicit prohibition of invention:
"Answer ONLY based on the provided context.
If there is no answer — say so honestly.
Do not combine information from different sources if they contradict each other."
- ✔️ Score-based filter — if the maximum score
among results is below 0.5, do not pass the context to the LLM at all.
Return "not found" — it's better to have an honest answer than a confident hallucination.
- ✔️ Show sources — a list of articles or documents
from which the answer was taken. The user will check themselves
and report if something is wrong.
Section Conclusion: All three problems are diagnosed
in the same way — by logging scores and reading the actual chunks
that make it into the context. Add this from day one —
and most questions like "why is RAG giving bad answers"
will have an obvious answer.
❓ Frequently Asked Questions (FAQ)
Is a GPU required for RAG with Ollama?
No. GPUs accelerate response generation and embeddings, but are not mandatory.
On CPU, embeddings through nomic-embed-text are generated slower,
but are quite feasible for production with moderate load.
On Apple Silicon (M1/M2/M3) — it's fast even without a dedicated GPU.
For more details on which models actually work on limited hardware —
Ollama on 8 GB RAM: Which Models Work in 2026.
How many documents can be indexed?
There are no limits — it depends on the size of the vector store and RAM.
pgvector handles millions of vectors without issues.
In practice, for a blog with 500 articles — this is ~2500 chunks of 512 tokens,
occupying a few MB in PostgreSQL. Generating embeddings via Ollama
for this volume takes 15-30 minutes on CPU — acceptable for a nightly scheduler.
RAG or regular full-text search?
Not "or", but "and". RAG is better for semantic queries of 3+ words where content is important —
"how to optimize PostgreSQL queries" will find an article about indexes
even without an exact word match. Full-text search is better for short queries
(1-2 words), names, and exact matches. The optimal strategy is
vector search with automatic fallback to FTS.
How to update the index when documents change?
Store indexed_at for each document and run
re-indexing where updated_at > indexed_at.
A deterministic ID via hash allows overwriting chunks without duplicates.
An important nuance: if the ORM updates updated_at automatically
on saveAll() — set indexed_at
to a bit in the future so the scheduler doesn't re-index everything every time.
Which Ollama model to choose for RAG?
For embeddings — nomic-embed-text as the standard, mxbai-embed-large
if higher accuracy is needed. For response generation — llama3.1:8b
on 8 GB RAM or qwen2.5:14b if you have 16 GB.
A full comparison of models with benchmarks and task-specific recommendations —
Which Ollama Model to Choose in 2026: A Comparison of Llama, Qwen, DeepSeek, and Mistral.
Are there Java solutions for RAG with Ollama?
Yes — Spring AI with native support for Ollama and pgvector.
A real case of implementing RAG for a blog on Spring AI with specific
errors and solutions not found in the documentation —
in the article How I Built RAG for webscraft.org:
Spring AI + pgvector + Real Experience.
Where to start if I've never worked with Ollama?
First, understand the platform itself — what Ollama is,
how it works, and what problems it solves.
What is Ollama and Why Developers Are Massively Moving to Local AI in 2026 —
a jargon-free overview that's easy to start with.
After that, RAG becomes a logical next step.
✅ Conclusions
RAG is one of those technologies that looks complex until the first implementation
and obvious afterward. The pipeline is conceptually simple: document → chunks →
embeddings → search → answer. Complexity arises in the details —
and you'll find most of these details not in the documentation,
but through real production incidents.
What I learned from implementing it for the blog:
- ✔️ Start with nomic-embed-text — but immediately log scores
on real queries. For non-Latin languages and short queries,
a fallback is needed — you'll learn this not from the documentation.
- ✔️ Deterministic UUID for chunks — not random.
Without this, re-indexing multiplies duplicates, and you won't immediately understand why.
- ✔️ Fallback to keyword search — vector search
doesn't replace regular search, it complements it.
- ✔️ Batch by 10-20 documents while saving indexed_at —
so that if it fails, you can continue from where you left off, not from the beginning.
- ✔️ Strict system prompt — without an explicit prohibition
against invention, the model will hallucinate. This is not a bug of a specific model.
With Ollama, the entire stack is free and works locally —
not a single document goes to the cloud. For a blog, internal documentation,
or a corporate knowledge base, this is the correct architecture.
If you work with Java and Spring Boot — the next article is exactly about that.
A real case of implementing RAG for webscraft.org on Spring AI + pgvector:
what mistakes I made, what pitfalls I found, and what I would do differently —
How I Built RAG for webscraft.org: Spring AI + pgvector + Real Experience.
If you are not yet familiar with Ollama —
start with the Ollama 2026 overview.
📎 Sources
- LlamaIndex: Introduction to RAG — official documentation
- Real Python: LlamaIndex in Python — A RAG Guide — practical tutorial
- Medium: LlamaIndex for Beginners 2025 — from zero to production
- DEV Community: RAG with LlamaIndex, ChromaDB, and Ollama
- Prem AI: LangChain vs LlamaIndex 2026 — Production RAG Comparison
- Contabo: LlamaIndex vs LangChain — Which One to Choose in 2026
- Ollama: nomic-embed-text — embedding model specifications
- Ollama: mxbai-embed-large — alternative embedding model