RAG with Ollama: Teach AI to Answer Based on Your Documents
You have documents — PDFs, articles, notes, a knowledge base. You want to ask questions
and get answers specifically from these documents, not from the model's general knowledge.
And all of this — locally, without sending data to the cloud.
This is exactly what RAG is for. In this article — an explanation of the concept,
a visual pipeline diagram, and a step-by-step Python example that actually works.
Plus production patterns you won't find in the documentation.
🎯 What is RAG — and why it's not fine-tuning
Short answer:
RAG (Retrieval-Augmented Generation) is a way to give the model access
to your documents without changing the model itself. Before each query,
the system finds relevant snippets from your base and adds them
to the context. The model answers based on these snippets —
not just what it learned during training.
LLMs are trained on trillions of tokens from the internet — but not on your
internal documents, not on your codebase, and not on articles
you wrote last week. RAG bridges this gap.
Analogy: A lawyer and a case
Imagine an experienced lawyer. They know laws, precedents, general practice —
everything they've learned over the years. But when given a new case — they don't answer
from memory. They read the case materials, highlight key facts,
and only then formulate a position based on specific documents.
An LLM without RAG is like that same lawyer asked to answer
without access to the case materials. They'll say something, but it will be
a general opinion, not an answer to your specific case.
RAG gives them those materials.
Why not fine-tuning
Fine-tuning is retraining the model on your data. It sounds logical,
but in practice, it's expensive, slow, and inflexible. If your documents
change — you need to retrain again. If you made a mistake in the data —
the model "remembers" the wrong thing.
RAG doesn't touch the model at all. Updated documents — re-indexed in minutes.
Found an error — fixed it in the source. The model itself remains the same.
That's why for most tasks where you need to "answer based on documents,"
RAG is the right choice, not fine-tuning.
RAG vs Fine-tuning: When to choose what
| Criterion | RAG | Fine-tuning |
| --- | --- | --- |
| Implementation complexity | Low — a few days | High — weeks |
| Computational resources | Minimal — CPU is enough | Requires GPU, significant costs |
| Knowledge update | Re-indexing — minutes | Retraining — days |
| Transparency | ✔️ Visible which documents were used | ❌ Difficult to explain the source of the answer |
| Data error | Fixed source — done | Retraining from scratch |
| Suitable for | Knowledge base, documents, FAQ, search | Changing style, tone, response format |
| Not suitable for | Changing the model's behavior itself | Frequently updated documents |
When RAG won't solve the problem
RAG is not a silver bullet. There are several scenarios where it won't help:
- ⚠️ Short queries (1-2 words) — semantic search
on short queries is less accurate than keyword search.
A fallback is needed.
- ⚠️ Questions outside the scope of documents — if
the answer is not in the base, the model will either say "I don't know" or
start hallucinating. The prompt must explicitly forbid the latter.
- ⚠️ Need to change the model's style or behavior —
RAG won't help here; this is a task for fine-tuning or
detailed system prompting.
Section conclusion: RAG is the right choice when
you need to answer based on specific documents, and these documents
change or are updated. Fine-tuning is needed when you need to
change the model's behavior itself, not expand its knowledge.
🎯 How the pipeline works: from document to answer
Short answer:
The RAG pipeline is divided into two independent phases. Indexing —
performed once or when documents are updated:
document → chunks → embeddings → vector store.
Retrieval and generation — with each user query:
question → embedding → similarity search → prompt with context → answer.
If you understand where everything happens — debugging becomes much easier.
One of the first things I realized in practice: the model never
"reads" the entire document. It only sees a few
snippets that the system deems most relevant. The quality of the answer
directly depends on the quality of this selection.
Phase 1 — Indexing
Indexing happens once — or on a schedule when content is updated.
The goal: to convert human text into a form that a vector store can search.
Step 1. Document Reading
Input formats: PDF, HTML, Markdown, plain text, web pages.
The main thing at this stage is to get clean text without markup.
HTML tags, PDF metadata, control characters — all this is noise that degrades
embedding quality. Jsoup for Java, BeautifulSoup for Python —
standard cleaning tools.
Step 2. Chunking — Splitting into Fragments
The document is split into smaller parts — chunks. Why not store the whole document?
Two reasons. First, embedding models have a token limit:
nomic-embed-text — 8192 tokens, all-minilm — 256. An article of 5000 words
simply won't fit. Second, search is more accurate on smaller fragments:
if you store an entire article as one vector — the vector averages all the article's topics
and becomes less specific.
Standard: 512 tokens with a 50-token overlap. Overlap is the deliberate duplication
of tokens between adjacent chunks, so that a thought doesn't break at the fragment boundary.
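The splitting logic is simple enough to sketch in a few lines. A real pipeline counts model tokens (via the embedding model's tokenizer); here words stand in for tokens to keep the example dependency-free, and the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap` words
    with the previous one so a thought isn't cut at the boundary."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

For a 1000-word document with the defaults, this yields three chunks, and the last 50 words of each chunk reappear as the first 50 words of the next.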
Step 3. Embeddings
Each chunk is converted into a numerical vector — a list of 768 numbers
(for nomic-embed-text). This vector encodes the *meaning* of the text,
not specific words. That's why "turnkey web development"
and "how to create a website" will have similar vectors even without shared words.
This is the advantage of semantic search over LIKE-search.
Step 4. Vector Store
Vectors and original chunk text are stored in a vector database:
pgvector (extension for PostgreSQL), Chroma, FAISS, Milvus, or another.
If the project already has PostgreSQL — pgvector is a natural choice:
one database instance, nothing new to install.
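If you go the pgvector route, the store can be a single table. A minimal sketch — table and column names are illustrative, and 768 matches the nomic-embed-text vector size mentioned above:

```sql
-- Requires the pgvector extension (https://github.com/pgvector/pgvector)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
    id          uuid PRIMARY KEY,      -- stable across re-indexing runs
    document_id bigint NOT NULL,       -- source document, needed for deduplication
    content     text   NOT NULL,       -- original chunk text
    embedding   vector(768) NOT NULL   -- one vector per chunk
);

-- Approximate nearest-neighbour index for cosine distance.
CREATE INDEX ON rag_chunks USING hnsw (embedding vector_cosine_ops);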
Phase 2 — Retrieval and Generation
This phase runs with every query and takes seconds.
The user enters a question — and after a few steps, gets an answer.
Step 5. Query Embedding
The user's question is also converted into a vector —
using the same embedding model used during indexing.
This is crucial: if you indexed using nomic-embed-text,
the query must also go through nomic-embed-text.
Different models produce incompatible vectors — and the search will be irrelevant.
Step 6. Similarity Search
The vector store finds the N closest vectors to the query vector.
"Closest" means cosine similarity or dot product between vectors.
Result: top-5 (or however many you set) most relevant chunks.
A similarity threshold filters out weak matches — usually you start
with 0.5 and adjust based on logs.
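"Closest" is worth seeing in code at least once. A vector store computes this at scale with an index, but the metric itself is just a normalized dot product:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that cosine similarity ignores vector length: a vector and its doubled copy score 1.0, which is exactly why it works for comparing embeddings of texts of different sizes.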
Step 7. Prompt Formulation
The found chunks are inserted into the system prompt as context.
The quality of this prompt directly affects the quality of the answer.
The minimum that should be in the system prompt:

    Answer ONLY based on the provided context.
    If the answer is not in the context, honestly state that.
    Do not invent information that is not in the documents.

    Context:
    {found chunks}
Without an explicit prohibition against invention — the model will hallucinate.
This is not a bug of a specific model, but a general behavior of LLMs.
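Assembling that prompt from the retrieved chunks is a pure string operation. A sketch — the anti-hallucination wording follows the article, while the function name and the separator between chunks are illustrative choices:

```python
def build_system_prompt(chunks: list[str]) -> str:
    """Join retrieved chunks into the context block of the system prompt."""
    context = "\n---\n".join(chunks)  # visible separator between sources
    return (
        "Answer ONLY based on the provided context.\n"
        "If the answer is not in the context, honestly state that.\n"
        "Do not invent information that is not in the documents.\n\n"
        f"Context:\n{context}"
    )
```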
Step 8. Answer Generation
The LLM (Ollama with llama3.3, mistral, or another model) generates an answer
based on the context from the previous step.
An important detail: show the user the sources — which specific articles
or documents the system used. This is both transparency and a way
to verify where the model got the information.
Section conclusion: Two independent processes —
indexing and retrieval+generation. Most RAG problems arise
either at the chunking step (incorrect size), or at the
similarity search step (incorrect threshold), or in the system prompt
(no prohibition against hallucinations). We will break down each in detail below.
🎯 Tools: LlamaIndex vs LangChain vs manual approach
Short answer:
In 2026, the choice between LlamaIndex and LangChain is not about "better,"
but about what you are building. LlamaIndex is specialized for working with documents
and provides better search quality with less code. LangChain is for complex agent systems
where RAG is one of the components. A manual approach via Spring AI or
direct REST API is for when Python doesn't suit you or you need maximum
control over each step.
A detailed 2026 analysis notes that the framing "LangChain = orchestration,
LlamaIndex = data," which most articles repeat, is already outdated.
Both frameworks are converging and cover similar functionality.
The choice now depends on where you are starting
and what the foundation of your application is.
LlamaIndex — if documents and search are at the center
LlamaIndex is built around the idea that the main problem is
reliably connecting LLMs to your data. Everything else — agents, pipelines,
orchestration — goes on top of this foundation.
Independent benchmarks show:
LlamaIndex is 40% faster than LangChain on document retrieval tasks
and in 2025 achieved +35% in search accuracy.
The reason: built-in abstractions for chunking, re-ranking, and context assembly
are optimized for retrieval — not for general pipelines.
Pros:
- ✔️ Easiest start: 10 lines for a working RAG
- ✔️ Built-in re-ranking, hybrid search (semantic + keyword)
- ✔️ Wide list of data connectors via LlamaHub: PDF, HTML,
Notion, Google Docs, S3, databases
- ✔️ Flexible chunking: SentenceSplitter, TokenTextSplitter,
SemanticSplitter — change with one line
- ✔️ Native integration with Ollama via separate packages
Cons:
- ⚠️ For complex agent systems with many external tools —
less convenient than LangChain
- ⚠️ Documentation is less comprehensive than LangChain for atypical scenarios
When to choose: you have documents, PDFs, a knowledge base —
you need to ask questions and get answers. Blog, FAQ, internal
company documentation, technical reference.
LangChain — if RAG is part of a more complex system
LangChain was built around the idea that building with LLMs is a
workflow problem: there's input data, there's output, and in between — models, tools,
memory, and external sources that need to be composed.
In 2026
this means LangGraph — a directed graph where nodes are functions,
and edges are transitions between states. RAG in LangChain is one of the agent's tools,
not the central abstraction.
Pros:
- ✔️ Largest ecosystem and community — 220% growth in GitHub stars
in 2024
- ✔️ LangGraph for complex stateful agents with branching and loops
- ✔️ LangSmith — observability, tracing, evaluation out-of-the-box
- ✔️ Multimodal support is stronger than in LlamaIndex
- ✔️ Better for conversational memory over long conversations
Cons:
- ⚠️ Steeper learning curve — more concepts for simple RAG
- ⚠️ For pure document retrieval — more code than with LlamaIndex
- ⚠️ Frequent breaking changes: original LangChain replaced by LangGraph
for production scenarios
When to choose: an AI agent that calls external APIs,
writes code, executes SQL queries — and also searches documents.
RAG is not the main task here, but one of the agent's tools.
Manual approach — maximum control without a framework
Ollama REST API directly + any vector DB + custom logic.
This is the approach behind Spring AI for Java, where the framework
provides ready-made abstractions but you control every detail of the pipeline.
Pros:
- ✔️ Full control over every step — chunking, scoring,
fallback logic
- ✔️ No dependency on Python — any language via REST API
- ✔️ No breaking changes from an external framework
- ✔️ Can implement custom logic that frameworks don't support
Cons:
- ⚠️ More code — chunking, batch indexing, fallback are written by you
- ⚠️ No built-in re-ranking and hybrid search
- ⚠️ More implementation time than with LlamaIndex
When to choose: Java / Spring Boot project (Spring AI),
non-standard pipeline requirements, or when the framework adds more
complexity than it solves problems.
Choice Matrix
| Task | Recommendation |
| --- | --- |
| Answering based on PDFs and documents — simple start | LlamaIndex |
| Semantic search on a blog or knowledge base | LlamaIndex |
| Internal company documentation, FAQ | LlamaIndex |
| AI agent that calls APIs and tools | LangChain / LangGraph |
| Complex multi-step workflow with memory | LangChain / LangGraph |
| Java / Spring Boot project | Spring AI (manual approach) |
| Custom fallback and deduplication logic | Manual approach |
| Fast prototype in any language | Direct Ollama REST API |
Can they be combined
Yes — and it's a common practice in production.
Contabo describes a typical 2026 stack:
LlamaIndex as the knowledge layer (indexing and retrieval),
LangChain as the orchestration layer (agents and workflow),
and n8n or a custom service as the workflow engine.
But for most projects, one framework is better than two —
fewer dependencies, simpler maintenance.
Section conclusion: If the task is "answer based on documents,"
start with LlamaIndex. If you are building a complex AI agent where RAG is one
of the tools — LangChain. If you are on Java or need full control —
Spring AI or a direct REST API to Ollama.
🎯 Choosing a Model for Embeddings: nomic-embed-text and Alternatives
Short answer:
nomic-embed-text is the default choice for Ollama: lightweight (274 MB),
fast, with 768 dimensions. However, it has a significant limitation: it performs poorly with short queries and non-Latin languages. For Ukrainian text
and short queries, a fallback or a different model is needed.
An embedding model is the foundation of RAG. A poor embedding model will render the entire pipeline useless, regardless of the LLM's quality.
I learned this not from documentation, but when a query "LLM"
returned an irrelevant article with a score of 0.63,
while the correct one didn't even make it into the top 5.
What are Embeddings – An Analogy with Photography
Imagine you store photos in different formats.
A small 100x100 pixel photo takes up little space, but the details are blurry,
and similar faces are hard to distinguish. A 4000x3000 photo takes up much more,
but every detail is sharp and unique.
It's the same logic with embeddings. A vector is a "photograph" of text in a numerical space.
The vector dimension (384, 768, 1024) is like the resolution.
A larger vector encodes more details and nuances of meaning – and the search is more accurate.
However, a larger vector takes up more space in the database and is slower to generate.
all-minilm with its 384-dimensional vector is a 100x100 JPEG: lightweight and fast,
but it loses details. nomic-embed-text with 768 dimensions is a 1920x1080 JPEG:
a good balance. mxbai-embed-large with 1024 dimensions is a RAW format:
maximum detail, but costs twice as much in RAM.
And one more important detail regarding the photo analogy: if you stored all photos
in one format, you cannot compare them with photos in another format.
It's the same with embeddings: the model used for indexing must be
the same model used for embedding the query.
If you change the model, you need to re-index the entire database.
Comparison of Embedding Models for Ollama
| Model | Size | Vector | Context | When it's suitable | Command |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text | 274 MB | 768 | 8192 tokens | Default choice, long documents | ollama pull nomic-embed-text |
| mxbai-embed-large | 669 MB | 1024 | 512 tokens | High search accuracy, short chunks | ollama pull mxbai-embed-large |
| all-minilm | 46 MB | 384 | 256 tokens | Weak hardware, speed more important than accuracy | ollama pull all-minilm |
Real Limitations of nomic-embed-text
From real-world experience indexing 500+ articles in four languages,
nomic-embed-text has specific weaknesses not mentioned in the official documentation:
- ⚠️ Short queries (1-2 words) – a query like "LLM" returned
an irrelevant article with a score of 0.63, while the correct one didn't even make
it into the top 5. The model's vector space cannot effectively distinguish
short, single-word queries.
- ⚠️ Non-Latin languages – Ukrainian and Cyrillic in general
yield lower quality compared to English text.
A query like "LLM vs RAG in 2026" is matched correctly (score 0.69),
but a short Ukrainian query is already a problem.
- ⚠️ Mixed content – if documents are in multiple languages
simultaneously, search quality degrades: embeddings of English
and Ukrainian text reside in different "zones" of the vector space.
Practical Conclusion: When nomic-embed-text is Sufficient
nomic-embed-text is a good fit if:
- ✔️ Documents are predominantly in English
- ✔️ Queries are 3+ words (semantic search)
- ✔️ Large context is needed – 8192 tokens are sufficient for long articles
It requires a fallback or replacement if:
- ⚠️ Documents are predominantly in non-Latin languages
- ⚠️ Short search queries (1-2 words) – add keyword fallback
- ⚠️ Maximum accuracy is required – consider mxbai-embed-large
Section Conclusion: Choosing an embedding model is a trade-off
between accuracy, size, and speed – like choosing a photo format and resolution.
nomic-embed-text is a correct starting point. But know its limitations
and always test with your audience's real queries
before assuming the search "works."
🎯 Key Decisions in RAG Implementation – and What to Avoid
Short answer:
There is no universal working code for RAG – each project has its own
architecture, its own language, its own storage model, and its own search requirements.
However, there are several decisions that need to be made in any RAG project –
and common pitfalls in each of them.
Tutorials show how to launch RAG in 10 minutes.
Production shows that most of your time will be spent
solving problems the tutorial never mentions.
Decision 1 – How to Read and Clean Documents
Regardless of language or framework, at the input is "dirty" text.
PDFs contain headers, page numbers, formatting artifacts.
HTML contains tags, attributes, JavaScript, ad blocks.
All this noise gets into the embeddings and reduces search quality.
What's important to decide in advance:
- ✔️ What is the primary format and how to clean it
- ✔️ Whether to preserve structure (headings, sections)
or if plain text is sufficient
- ✔️ What to do with empty or very short documents
Typical pitfall: ignoring cleaning at the start – and getting
embeddings where 30% of the content is "Copyright 2024" and "Page 1 of 15."
Decision 2 – Chunk Size and Splitting Strategy
There is no correct chunk size for all cases.
512 tokens is standard for articles. 256 for technical documentation
where each paragraph is independent. 1024 if it's important to preserve context
between paragraphs.
What's important to decide:
- ✔️ Split by tokens or by sentences – different results
at chunk boundaries
- ✔️ What overlap is needed – 0 if chunks are independent,
50-100 tokens if ideas transition between paragraphs
- ✔️ Whether to store metadata in each chunk –
title, url, category, locale – for later filtering
Typical pitfall: setting the chunk size once and never checking again.
Examine the actual chunks after splitting – and you'll see where ideas are cut off.
Decision 3 – Chunk Identifiers
Seems trivial – but becomes critical during re-indexing.
If random UUIDs are used for each indexing,
the database accumulates duplicates. After a month, one document
can be represented by 10 versions of the same chunks.
The correct solution is a deterministic ID based on the document_id and the chunk's sequential number.
During re-indexing, the same chunk gets the same ID and is overwritten,
not duplicated. How to implement this – in Java using
UUID.nameUUIDFromBytes(), in Python using hashlib.md5().
The specific implementation depends on the stack – the principle is the same.
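In Python, the principle maps naturally onto uuid5, which derives a UUID from a name the same way Java's UUID.nameUUIDFromBytes() does (Java uses MD5, i.e. uuid3; uuid5 uses SHA-1; either works, just pick one and never change it). The namespace value and function name below are illustrative:

```python
import uuid

# Arbitrary but FIXED namespace: changing it changes every chunk ID.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rag-chunks")

def chunk_id(document_id: str, chunk_index: int) -> uuid.UUID:
    """Deterministic ID: same document + position always yields the same UUID,
    so re-indexing overwrites the row instead of duplicating it."""
    return uuid.uuid5(CHUNK_NAMESPACE, f"{document_id}:{chunk_index}")
```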
Decision 4 – Index Update Strategy
How to determine what needs to be re-indexed?
Two approaches:
- ✔️ Timestamp-based: store indexed_at
and re-index where updated_at > indexed_at.
Simple and reliable — but there's a catch:
if @PreUpdate or a trigger bumps updated_at
during saveAll(), indexed_at becomes outdated
immediately after saving. Solution: set indexed_at
slightly in the future — now() + 10 seconds.
- ✔️ Hash-based: store a content hash and re-index
only if the hash has changed. More accurate, but more complex to implement.
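The hash-based variant is a few lines in any language. A sketch where a plain dict stands in for the database column that would store the hash; names are illustrative:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(doc_id: str, text: str, stored_hashes: dict[str, str]) -> bool:
    """True if the document is new or its content changed since last indexing."""
    return stored_hashes.get(doc_id) != content_hash(text)
```

Unlike the timestamp approach, this is immune to ORM triggers touching updated_at: a document re-indexes only when its bytes actually changed.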
Decision 5 – Batch Indexing
Trying to index 500 documents in one request ends
in either an Ollama timeout or OOM. Batching by 10-20 documents –
and if it fails, you continue from where you stopped,
not from the beginning. This is why indexed_at is more important
than it seems – it allows the scheduler to pick up where it left off.
Decision 6 – LAZY Loading and Transactions
If you use an ORM (JPA, SQLAlchemy, ActiveRecord) –
be careful with lazy-loaded associations during indexing.
Accessing post.getTranslations() outside a transaction
results in a LazyInitializationException in JPA or
a DetachedInstanceError in SQLAlchemy.
Solution: JOIN FETCH in the query or explicitly load
everything needed within the transaction.
What to Avoid – Briefly
- ❌ Don't start with a "working example from a tutorial" directly in production –
it lacks fallback, batching, and idempotency.
- ❌ Don't ignore text cleaning before indexing.
- ❌ Don't use random UUIDs for chunks if you plan to re-index.
- ❌ Don't set a threshold once and for all – log scores
and adjust based on real queries.
- ❌ Don't assume vector search will completely replace keyword search –
short queries require a fallback.
- ❌ Don't forget that changing the embedding model requires a full re-index.
Section Conclusion: RAG is not "copy and run."
It involves six architectural decisions, each with its own pitfalls.
Make them consciously before implementation – and you'll save
from several hours to several days of debugging.
🎯 Production Patterns: What Tutorials Don't Tell You
Short answer:
The difference between "hello world" RAG and production RAG lies in four patterns:
fallback to keyword search, idempotent indexing,
result deduplication, and batch processing.
None of these are mentioned in basic tutorials.
You discover each of them through a real problem.
I discovered all four patterns not through documentation –
but through specific production incidents.
Fallback – when the query "Spring" returned irrelevant results.
Idempotency – when after a month of the scheduler running,
one article was in the database 47 times.
Deduplication – when the top 5 results contained 4 chunks from the same article.
Batch – when Ollama froze on 500 documents.
Pattern 1 – Fallback: vector → keyword
Vector search is built on semantic similarity.
It's good at finding "turnkey web development" for the query "how to build a website."
But for queries like "Spring" or "RAG" – the similarity gets diluted across the entire
vector space, and accuracy drops.
A rule that works in practice: queries of 1-2 words
should be directed to keyword search, 3+ words – to vector search.
If vector search returns an empty result – fallback to keyword.
If vector search fails with an error – fallback to keyword.
The user notices no difference but always gets an answer.
Implementation depends on the stack – but the principle is unchanged:
two providers, a facade that decides which one to use,
and automatic switching between them.
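That facade fits in one function. A sketch with the two providers injected as callables — the word-count rule and names are taken from the pattern described above, not from any specific library:

```python
from typing import Callable

def search(query: str,
           vector_search: Callable[[str], list],
           keyword_search: Callable[[str], list]) -> list:
    """Route between vector and keyword search with automatic fallback."""
    # Rule from practice: 1-2 word queries go straight to keyword search.
    if len(query.split()) <= 2:
        return keyword_search(query)
    try:
        results = vector_search(query)
    except Exception:
        return keyword_search(query)  # provider failed: fall back
    return results if results else keyword_search(query)  # empty: fall back
```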
Pattern 2 – Idempotent Indexing
The scheduler runs indexing every night.
If random UUIDs are generated for chunks with each run –
after a month, the same article is represented by 30 versions of the same chunks.
Search returns duplicates, quality degrades, the database grows without reason.
Solution: a deterministic ID based on the document identifier
and the chunk's sequential number. During re-indexing, the same chunk
gets the same ID and is overwritten – not added as a new record.
Specific implementation depends on the language – in Java it's
UUID.nameUUIDFromBytes(),
in Python – hashlib.md5() via uuid.UUID(bytes=...).
The principle is the same.
Pattern 3 – Result Deduplication
Vector search returns top-N chunks – not top-N documents.
If one article well matches the query – it can yield
4 chunks out of 5 in the top 5. The user will receive four links
to the same page.
Solution: after receiving results – deduplicate by document_id.
For each document, keep only the chunk with the highest score.
Result: the top 5 always contains 5 different documents,
not 5 fragments of one.
Important: metadata for each chunk must contain document_id –
without it, deduplication is impossible. Plan your metadata structure
before starting indexing.
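The deduplication step itself is a dictionary pass. In this sketch a hit is a (document_id, score, text) tuple; in a real pipeline document_id comes from the chunk metadata:

```python
def deduplicate(hits: list[tuple[str, float, str]]) -> list[tuple[str, float, str]]:
    """Keep only the highest-scoring chunk per document, best first."""
    best: dict[str, tuple[str, float, str]] = {}
    for hit in hits:
        doc_id, score, _ = hit
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = hit
    # Re-sort by score so the result still reads like a top-N list.
    return sorted(best.values(), key=lambda h: h[1], reverse=True)
```

If you need a top-5 of distinct documents, fetch more than 5 chunks from the store (say, 20) and deduplicate down, otherwise the list may come up short.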
Pattern 4 – Batch Indexing with Recovery
Ollama processes embeddings sequentially via CPU or GPU.
A request for 500 documents simultaneously – is either a timeout,
an OOM, or simply a freeze without an error.
Solution: batch by 10-20 documents. But more importantly –
immediately after each batch, save the indexed_at
timestamp for the processed documents. If the process fails on batch 7 out of 50 –
on the next run, the scheduler will pick up from batch 8,
not start from the beginning. This is the difference between "run and hope"
and reliable indexing.
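The recovery logic can be sketched stack-agnostically: the embedding call and the indexed_at update are injected as callables (stand-ins, not real APIs), and progress is persisted per batch, never at the end:

```python
from typing import Callable

def index_in_batches(documents: list,
                     index_batch: Callable[[list], None],
                     mark_indexed: Callable[[list], None],
                     batch_size: int = 10) -> int:
    """Index in small batches, persisting progress after each one.
    If index_batch raises, everything already marked stays marked,
    so the next run resumes instead of restarting."""
    done = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        index_batch(batch)   # may raise: nothing beyond this batch is marked
        mark_indexed(batch)  # persist indexed_at immediately, per batch
        done += len(batch)
    return done
```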
Section Conclusion: Four patterns – fallback,
idempotency, deduplication, and batching with recovery –
are the difference between a prototype that works for a demo
and a system that runs in production for months.
The implementation varies by stack – the principles are the same for all.
🎯 Typical Problems: Chunking, Threshold, Hallucinations
Three classes of problems in RAG — and all three are only revealed with real queries,
not synthetic tests. Incorrect chunking: the model receives fragments
without context. Incorrect threshold: either too much noise or nothing is found.
Hallucinations: the model invents things even with context — due to a weak prompt
or irrelevant chunks that still made it to the top.
The most important thing I realized: most RAG problems
are invisible without logging scores. If you don't see
what similarity score each result returns —
you can't diagnose why the search is giving bad answers.
Add score logging immediately — and you'll save hours of debugging.
Problem 1 — Chunking: How to Tell When Something is Wrong
Symptom: the model gives answers that seem partially correct —
there's some connection to the question, but the details are wrong or truncated.
Cause: chunks are split in unfortunate places, and the thought is cut off at the fragment boundary.
Diagnosis: output a few real chunks after splitting and read them.
If a chunk starts or ends in the middle of a sentence — the chunk size or splitting strategy is wrong.
- ✔️ Short documents, blog posts — 512 tokens, overlap 50
- ✔️ Technical documentation with long sections — 1024 tokens, overlap 100
- ✔️ Code — 256-512 tokens, split by functions, not lines
- ✔️ Overlap — don't skimp: 50-100 tokens of overlap guarantee
that the thought won't be lost at chunk boundaries
Problem 2 — Similarity Threshold: How to Find the Right Value
Symptom A: the model answers confidently but incorrectly —
the threshold is too low, irrelevant chunks are included in the context.
Symptom B: the model constantly says "not found" even for obvious questions —
the threshold is too high, nothing passes the filter.
Diagnosis: log the score of each result — and look at the distribution.
If correct results have a score of 0.55-0.65 — a threshold of 0.7 cuts them all off.
If irrelevant results have a score of 0.55 — a threshold of 0.5 lets noise through.
- ✔️ Starting point: 0.5 — and adjust based on real query logs
- ✔️ Short queries (1-2 words) — lower to 0.4 or better use fallback
- ✔️ Long semantic queries — 0.65-0.7 gives a cleaner result
- ✔️ If all results are below the threshold — return "not found"
instead of empty context
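The logging-plus-filtering step described above can be sketched as one function. Hits are (score, text) pairs here; the logger name is an illustrative choice:

```python
import logging

logger = logging.getLogger("rag.search")

def filter_by_threshold(hits: list[tuple[float, str]],
                        threshold: float = 0.5):
    """Log every score, drop weak hits, and return None when nothing
    passes, so the caller can answer 'not found' honestly."""
    for score, text in hits:
        logger.info("score=%.2f  %s", score, text[:60])
    kept = [(s, t) for s, t in hits if s >= threshold]
    return kept or None  # None signals: do not call the LLM with noise
```

The log lines are the point: after a day of real queries, the score distribution tells you whether 0.5 is cutting off correct answers or letting noise through.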
Problem 3 — Hallucinations Even With Context
Symptom: the model gives a confident answer that doesn't match any
of the provided chunks — or incorrectly combines facts from different sources.
The cause is most often not in the model — but in the prompt or the fact
that irrelevant context still made it to the top.
Three levels of protection:
- ✔️ Strict system prompt — explicit prohibition of invention:
"Answer ONLY based on the provided context.
If there is no answer — say so honestly.
Do not combine information from different sources if they contradict each other."
- ✔️ Score-based filter — if the maximum score
among results is below 0.5, do not pass the context to the LLM at all.
Return "not found" — it's better to have an honest answer than a confident hallucination.
- ✔️ Show sources — a list of articles or documents
from which the answer was taken. The user will check themselves
and report if something is wrong.
Section Conclusion: All three problems are diagnosed
in the same way — by logging scores and reading the actual chunks
that make it into the context. Add this from day one —
and most questions like "why is RAG giving bad answers"
will have an obvious answer.
❓ Frequently Asked Questions (FAQ)
Is a GPU required for RAG with Ollama?
No. GPUs accelerate response generation and embeddings, but are not mandatory.
On CPU, embeddings through nomic-embed-text are generated slower,
but are quite feasible for production with moderate load.
On Apple Silicon (M1/M2/M3) — it's fast even without a dedicated GPU.
For more details on which models actually work on limited hardware —
Ollama on 8 GB RAM: Which Models Work in 2026.
How many documents can be indexed?
There are no limits — it depends on the size of the vector store and RAM.
pgvector handles millions of vectors without issues.
In practice, for a blog with 500 articles — this is ~2500 chunks of 512 tokens,
occupying a few MB in PostgreSQL. Generating embeddings via Ollama
for this volume takes 15-30 minutes on CPU — acceptable for a nightly scheduler.
RAG or regular full-text search?
Not "or", but "and". RAG is better for semantic queries of 3+ words where content is important —
"how to optimize PostgreSQL queries" will find an article about indexes
even without an exact word match. Full-text search is better for short queries
(1-2 words), names, and exact matches. The optimal strategy is
vector search with automatic fallback to FTS.
How to update the index when documents change?
Store indexed_at for each document and run
re-indexing where updated_at > indexed_at.
A deterministic ID via hash allows overwriting chunks without duplicates.
An important nuance: if the ORM updates updated_at automatically
on saveAll() — set indexed_at
to a bit in the future so the scheduler doesn't re-index everything every time.
Which Ollama model to choose for RAG?
For embeddings — nomic-embed-text as the standard, mxbai-embed-large
if higher accuracy is needed. For response generation — llama3.1:8b
on 8 GB RAM or qwen2.5:14b if you have 16 GB.
A full comparison of models with benchmarks and task-specific recommendations —
Which Ollama Model to Choose in 2026: A Comparison of Llama, Qwen, DeepSeek, and Mistral.
Are there Java solutions for RAG with Ollama?
Yes — Spring AI with native support for Ollama and pgvector.
A real case of implementing RAG for a blog on Spring AI with specific
errors and solutions not found in the documentation —
in the article How I Built RAG for webscraft.org:
Spring AI + pgvector + Real Experience.
Where to start if I've never worked with Ollama?
First, understand the platform itself — what Ollama is,
how it works, and what problems it solves.
What is Ollama and Why Developers Are Massively Moving to Local AI in 2026 —
a jargon-free overview that's easy to start with.
After that, RAG becomes a logical next step.
✅ Conclusions
RAG is one of those technologies that looks complex until the first implementation
and obvious afterward. The pipeline is conceptually simple: document → chunks →
embeddings → search → answer. Complexity arises in the details —
and you'll find most of these details not in the documentation,
but through real production incidents.
What I learned from implementing it for the blog:
- ✔️ Start with nomic-embed-text — but immediately log scores
on real queries. For non-Latin languages and short queries,
a fallback is needed — you'll learn this not from the documentation.
- ✔️ Deterministic UUID for chunks — not random.
Without this, re-indexing multiplies duplicates, and you won't immediately understand why.
- ✔️ Fallback to keyword search — vector search
doesn't replace regular search, it complements it.
- ✔️ Batch by 10-20 documents while saving indexed_at —
so that if it fails, you can continue from where you left off, not from the beginning.
- ✔️ Strict system prompt — without an explicit prohibition
against invention, the model will hallucinate. This is not a bug of a specific model.
With Ollama, the entire stack is free and works locally —
not a single document goes to the cloud. For a blog, internal documentation,
or a corporate knowledge base, this is the correct architecture.
If you work with Java and Spring Boot — the next article is exactly about that.
A real case of implementing RAG for webscraft.org on Spring AI + pgvector:
what mistakes I made, what pitfalls I found, and what I would do differently —
How I Built RAG for webscraft.org: Spring AI + pgvector + Real Experience.
If you are not yet familiar with Ollama —
start with the Ollama 2026 overview.
📎 Sources
- LlamaIndex: Introduction to RAG — official documentation
- Real Python: LlamaIndex in Python — A RAG Guide — practical tutorial
- Medium: LlamaIndex for Beginners 2025 — from zero to production
- DEV Community: RAG with LlamaIndex, ChromaDB, and Ollama
- Prem AI: LangChain vs LlamaIndex 2026 — Production RAG Comparison
- Contabo: LlamaIndex vs LangChain — Which One to Choose in 2026
- Ollama: nomic-embed-text — embedding model specifications
- Ollama: mxbai-embed-large — alternative embedding model