You've built a RAG pipeline, connected an LLM, and set up a vector store —
but the search returns irrelevant results. The problem is almost never with the LLM,
but with the embedding model. It's this model that determines how accurately the system understands
text content and finds the right snippets.
Spoiler alert: for 90% of projects, OpenAI's text-embedding-3-small
at $0.02/1M tokens is the optimal starting point. However, there are situations where it falls short
compared to Cohere, Voyage AI, or free open-source models.
⚡ In a Nutshell
- ✅ Embeddings are the Foundation of RAG: a poor model will render the entire pipeline useless, regardless of LLM quality.
- ✅ OpenAI text-embedding-3-small: $0.02/1M tokens offers the best price-quality balance for getting started.
- ✅ For Multilingual Content: Cohere embed-v4 or BGE-M3 are significantly more accurate than OpenAI.
- 🎯 You'll Get: a comparison table of 10+ models, selection criteria, and a real-world WebCraft case study with 4 languages.
- 👇 Below: detailed explanations, tables, and production experience.
🎯 Section 1. What are Embeddings and Why are They Needed for RAG
How Embeddings Work in a RAG Pipeline
An embedding model converts text into a numerical vector that encodes meaning,
not specific words. This allows a query like "how to build a website" to find
a document about "turnkey web development" — even without any shared words.
In a RAG pipeline, embeddings are used twice: when indexing documents
and when searching based on a user's query.
If the LLM is the brain of your RAG system, then embeddings are its eyes.
The brain can be brilliant, but if the eyes see blurrily,
it receives distorted information and gives a wrong answer.
As detailed in the article
RAG with Ollama: From Pipeline to Production,
a RAG pipeline consists of two phases: indexing (document → chunks → embeddings → vector store)
and retrieval (query → embedding → similarity search → context → LLM response).
The embedding model is involved in both phases — and it's precisely its quality that determines
which snippets end up in the LLM's context.
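Under the hood, "finding the right snippets" is vector similarity. Here is a minimal sketch of cosine similarity, the metric most vector stores use to rank chunks; the toy 3-dimensional vectors are purely illustrative (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes; 1.0 means identical direction in meaning-space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models output 384-3072 dimensions).
query_vec = [0.9, 0.1, 0.3]
doc_about_websites = [0.8, 0.2, 0.25]   # semantically close to the query
doc_about_cooking = [0.1, 0.9, 0.7]     # semantically distant

print(cosine_similarity(query_vec, doc_about_websites))  # close to 1.0
print(cosine_similarity(query_vec, doc_about_cooking))   # noticeably lower
```

This is why "how to build a website" can match "turnkey web development": the two texts land near each other in vector space even with no shared words.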
Analogy: Embeddings as Text Photography
Imagine you're storing photos in different formats.
A 100x100 pixel photo takes up little space but has blurry details.
A 4000x3000 photo is sharp but heavy. With embeddings, it's the same logic:
a vector is a "photograph" of text in numerical space.
The vector's dimensionality (384, 768, 1024, 1536, 3072) is like its resolution.
A larger vector encodes more nuances of meaning but takes up more space
and is slower to generate.
Why Choosing an Embedding Model is Critical
There's a fundamental rule: the model used for indexing must be the same model
used for embedding the query. Different models produce incompatible vectors.
This means: if you decide to change your embedding model —
you'll need to re-index your entire database from scratch. Therefore, the choice at the start
must be deliberate.
- ✔️ A poor embedding model → irrelevant chunks in context → LLM hallucinations
- ✔️ A correct embedding model → accurate retrieval → high-quality answers
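Because indexing and querying must use the same model, it is worth recording the model name alongside the index and failing fast on a mismatch. A hypothetical guard, sketched here (the metadata structure and model names are illustrative, not a specific library's API):

```python
# Hypothetical guard: record which embedding model built the index,
# and refuse to run a query embedded with a different one.
INDEX_METADATA = {"embedding_model": "text-embedding-3-small"}

def check_model_compatibility(query_model: str) -> None:
    indexed_with = INDEX_METADATA["embedding_model"]
    if query_model != indexed_with:
        raise ValueError(
            f"Index was built with '{indexed_with}', but the query uses "
            f"'{query_model}'. Re-index the corpus before switching models."
        )

check_model_compatibility("text-embedding-3-small")  # OK, same model
# check_model_compatibility("embed-v4")  # would raise ValueError
```

A few lines of defensive code like this is far cheaper than debugging silently broken similarity scores after a model swap.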
For more on how embeddings fit into the overall RAG architecture
and how RAG differs from "pure" LLM — see the article
LLM vs RAG in 2026: Why They Aren't the Same and When to Use Which.
Conclusion: embeddings are the foundation of RAG quality.
The choice of model impacts search accuracy more than the choice of LLM for generation.
📌 Section 2. Criteria for Choosing an Embedding Model
What I Look For When Choosing
When I was choosing embeddings for my blog in multiple languages, I narrowed down the options
to six criteria: price per token, vector dimensionality,
context window, multilingual support, retrieval quality on benchmarks (MTEB), and inference speed. For most projects,
price and multilingual support are the main factors.
There's no single "best" embedding model — there's only the best for your specific task,
content language, and budget.
Price per 1M Tokens
The range is from $0 (open-source, self-hosted) to $0.18 (Voyage AI).
For a blog with 500 articles, the difference between $0.02 and $0.13 is cents.
For a corpus of millions of documents, it's hundreds of dollars monthly.
Vector Dimensionality
384 (MiniLM) → 768 (nomic) → 1024 (mxbai) → 1536 (OpenAI small) → 3072 (OpenAI large).
Higher dimensionality = more precise search, but more space in the vector store.
For pgvector with a few thousand chunks, the size difference is negligible.
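The storage cost of dimensionality is easy to estimate: vectors are typically stored as float32 (4 bytes per dimension). A rough back-of-the-envelope calculation, ignoring per-row and index overhead:

```python
def storage_bytes(num_chunks: int, dimensions: int, bytes_per_float: int = 4) -> int:
    # Rough estimate: float32 values only, ignoring row headers
    # and index overhead (which add on top of this).
    return num_chunks * dimensions * bytes_per_float

for dims in (384, 768, 1536, 3072):
    mb = storage_bytes(10_000, dims) / 1_000_000
    print(f"{dims} dims, 10k chunks: ~{mb:.1f} MB")
```

For 10,000 chunks the spread is roughly 15 MB (384 dims) to 123 MB (3072 dims): noticeable, but hardly a deciding factor at blog scale.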
Context Window
How many tokens the model can process per request.
256 (MiniLM) — only for short snippets.
8192 (nomic-embed-text) — sufficient for long articles.
32K (Jina v5) — for entire documents without splitting.
Multilingual Support
Critical for projects with content in multiple languages.
OpenAI works well with English, less so with Cyrillic.
Cohere embed-v4 supports 100+ languages with consistent quality.
BGE-M3 is the strongest open-source option for multilingual search.
Retrieval Quality (MTEB Benchmark)
MTEB (Massive Text Embedding Benchmark) is the standard for comparing embedding models.
However, it's important to understand: benchmarks don't always reflect real-world quality
on your specific data. Test with your own content.
Inference Speed
For indexing (batch, night scheduler) — not critical.
For real-time search — important. API providers typically
respond within 50–200ms. Self-hosted on CPU is slower.
Conclusion: determine your content language, budget, and scale —
and these three factors will narrow down your choice to 2–3 models.
📌 Section 3. Paid API Providers
Five Major Providers
- OpenAI — the standard and easiest start.
- Cohere — the leader for multilingual content.
- Voyage AI — the most accurate for code and technical documentation.
- Google Gemini — multimodal embeddings (text + image + video).
- Jina AI — the longest context window (32K) at the lowest price.
OpenAI: text-embedding-3-small and text-embedding-3-large
text-embedding-3-small — 1536 dimensions, $0.02/1M tokens.
A good balance of price and quality for English content.
text-embedding-3-large — 3072 dimensions, $0.13/1M tokens.
Supports Matryoshka Representation Learning: you can reduce dimensionality
to 256 with minimal quality loss — saving on storage.
Suitable for: general RAG, English content, quick start.
Not suitable for: multilingual content with Cyrillic, code, and technical docs
(Voyage is more accurate).
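Matryoshka truncation is mechanically simple: keep the first N dimensions and re-normalize so cosine similarity still behaves. A sketch of the idea (the constant-valued stand-in vector is illustrative; a real vector comes from the embeddings API):

```python
import math

def truncate_mrl(vector, target_dims):
    # Matryoshka-style truncation: keep the first N dimensions,
    # then re-normalize to unit length so cosine similarity still works.
    head = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.02] * 3072          # stand-in for a text-embedding-3-large vector
short = truncate_mrl(full, 256)
print(len(short))             # 256
print(sum(x * x for x in short))  # ~1.0 (unit length)
```

The quality holds up only because the model was trained so that early dimensions carry the most information; slicing vectors from a model without MRL training loses much more accuracy.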
Cohere: embed-v4
The leader in multilingual benchmarks. 100+ languages with English-level quality.
$0.10/1M tokens. Supports binary quantization — reducing storage size
by 90% with minimal quality loss.
Separate input types for search and classification — optimization without changing the model.
Suitable for: multilingual products, global applications, RAG with content
in 3+ languages.
Not suitable for: if all content is in English — OpenAI is cheaper
and has a broader SDK ecosystem.
Voyage AI: voyage-3-large
Highest accuracy for code, API documentation, and technical content.
$0.18/1M tokens — the most expensive on the list. A separate model, voyage-code-3,
is specifically for code search. Recommended by Anthropic as a preferred embedding provider.
Suitable for: developer tools, code search, technical documentation.
Not suitable for: general content (overpaying without a quality gain).
Google Gemini Embedding 2
Multimodal embeddings: text, images, video, and even audio
without transcription. $0.15/1M tokens. Leader of the MTEB leaderboard (score 68.32).
Native support for Matryoshka Representation Learning.
Suitable for: multimodal content (text + images),
Google Cloud ecosystem.
Not suitable for: if only text embeddings are needed — more expensive than OpenAI.
Jina AI: jina-embeddings-v5
32K context window — the longest among commercial models.
89 languages. Binary quantization. Task-specific LoRA adapters
for retrieval, matching, and classification. Price is lower than most competitors.
Suitable for: long documents without splitting into chunks,
multilingual content at a moderate price.
Not suitable for: if the ecosystem of tutorials and integrations is critical
(smaller community than OpenAI).
Conclusion: from my experience, for a quick start, I recommend OpenAI — the simplest and most stable option.
Cohere is worth choosing if you work with multilingual content.
Voyage is well-suited for technical tasks and code.
Gemini makes sense when you need to work with images and video.
Jina is an optimal choice if a long context window and low cost are critical.
📌 Section 4. OpenRouter: Embeddings via a Unified API
One API Key for Dozens of Models
I use this option myself: OpenRouter is an aggregator that provides access to embedding models
from various providers through a single OpenAI-compatible API.
Prices are without markup: the same as directly from the provider.
Convenient for comparing models and quick switching.
Available Embedding Models on OpenRouter
openai/text-embedding-3-small — $0.02/1M tokens, 1536 dimensions.
openai/text-embedding-3-large — $0.13/1M tokens, 3072 dimensions.
qwen/qwen3-embedding-8b — $0.01/1M tokens, multilingual, optimized
for retrieval and classification.
qwen/qwen3-embedding-0.6b — ~$0.004/1M tokens, basic quality,
minimal price.
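Because the API is OpenAI-compatible, switching models is a one-string change in the request. A sketch of what such a request looks like; the endpoint URL and model slugs follow OpenRouter's conventions as I understand them, so verify both against the current OpenRouter documentation before relying on this:

```python
import json

def build_embedding_request(model: str, texts: list[str]) -> dict:
    # Sketch of an OpenAI-compatible /embeddings request to OpenRouter.
    # The URL and header shape are assumptions; check OpenRouter's docs.
    return {
        "url": "https://openrouter.ai/api/v1/embeddings",
        "headers": {
            "Authorization": "Bearer $OPENROUTER_API_KEY",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "input": texts}),
    }

req = build_embedding_request(
    "openai/text-embedding-3-small",
    ["how to build a website", "turnkey web development"],
)
print(req["url"])
```

To compare models on the same data, you only swap the model string (e.g. to `qwen/qwen3-embedding-8b`) and re-embed; the rest of the pipeline stays untouched.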
When OpenRouter is the Right Choice
- ✔️ Need to quickly compare several embedding models on the same data
- ✔️ Project already uses OpenRouter for LLMs (single API key and billing)
- ✔️ Need a fallback: if one provider goes down — automatic switching
When It's Better to Go Directly to the Provider
- ⚠️ Need specific provider features (Cohere binary quantization, Jina LoRA adapters)
- ⚠️ Enterprise SLA and availability guarantees
For local development with Ollama, embedding models run
for free without any API. More details —
Ollama in 2026: What It Is and Why Developers Are Massively Switching to Local AI.
Conclusion: for me, OpenRouter is a convenient aggregator for comparing models and getting started quickly.
For production with specific requirements, I recommend using the provider's direct API.
📌 Section 5. Open-source Models (Self-Hosted)
Free, with Full Control
Open-source embedding models run locally without API keys or payment.
BGE-M3 is the strongest for multilingual hybrid search.
Nomic Embed V2 is for long documents. Qwen3-Embedding is the most flexible
in terms of dimensionality. E5 and MiniLM are for rapid prototyping.
BGE-M3 (BAAI)
Hybrid search: dense + sparse vectors in one model.
Multilingual. Ideal for RAG where both semantic
and keyword search are needed simultaneously. As described in the article
RAG with Ollama,
fallback from vector search to keyword search is a critical production pattern,
and BGE-M3 solves this at the model level.
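When a model returns both a dense (semantic) and a sparse (keyword) relevance score per chunk, a common way to combine them is a weighted fusion. A minimal sketch, assuming both scores are normalized to [0, 1]; the weight `alpha` is an illustrative tuning knob, not a value BGE-M3 prescribes:

```python
def hybrid_score(dense: float, sparse: float, alpha: float = 0.7) -> float:
    # Weighted fusion of dense (semantic) and sparse (keyword) scores,
    # both assumed normalized to [0, 1]. alpha balances the two signals.
    return alpha * dense + (1 - alpha) * sparse

# A keyword-heavy query: the sparse score rescues a chunk that the
# dense score alone would rank too low.
print(hybrid_score(dense=0.35, sparse=0.95))
```

In practice you tune `alpha` on logged queries: keyword-heavy corpora (product names, error codes) want a lower `alpha`, conversational content a higher one.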
Nomic Embed Text V2
Mixture-of-Experts architecture. Works well with long documents.
Multilingual. Popular in local RAG stacks with Ollama
as an alternative to the first version of nomic-embed-text.
Qwen3-Embedding (0.6B and 8B)
The latest series from Qwen. Flexible vector sizes
(from 256 to 2048 dimensions — you choose).
Multilingual, instruction-aware. The 8B version is available
on OpenRouter for $0.01/1M tokens — cheaper than OpenAI small.
E5 and MiniLM
Lightweight models for rapid prototyping. MiniLM is 46 MB, 384 dimensions.
Runs even on weak hardware. E5 is slightly more accurate,
but both are inferior to BGE-M3 and Qwen3 on retrieval tasks.
Pros and Cons of Self-Hosting
- ✔️ Free — zero API costs
- ✔️ Full control over data — nothing goes to the cloud
- ✔️ No rate limits or dependency on external services
- ⚠️ Requires a GPU for fast generation (slower on CPU)
- ⚠️ Infrastructure is on you: updates, scaling, monitoring
Conclusion: open-source is the right choice
for confidential data, minimal budget, or complete control.
BGE-M3 and Qwen3-Embedding are leaders in 2026.
📌 Section 6. Comparison Table of All Models
All Models in One Place
The table includes 10+ embedding models with price, dimensionality,
context window, multilingualism, and recommended use case.
| Provider | Model | Price / 1M Tokens | Dimensionality | Context | Multilingualism | Best Use Case |
|----------|-------|-------------------|----------------|---------|-----------------|---------------|
| OpenAI | text-embedding-3-small | $0.02 | 1536 | 8191 | Medium | General RAG, English content |
| OpenAI | text-embedding-3-large | $0.13 | 3072 | 8191 | Medium | Maximum accuracy, MRL |
| Cohere | embed-v4 | $0.10 | 1024 | 512 | 100+ languages | Multilingual content, enterprise |
| Voyage AI | voyage-3-large | $0.18 | 1024 | 32000 | Medium | Code, API docs, technical documentation |
| Google | Gemini Embedding 2 | $0.15 | 3072 | 8192 | 100+ languages | Multimodal (text+image+video) |
| Jina AI | jina-embeddings-v5 | ~$0.05 | 1024 | 32000 | 89 languages | Long documents, multilingual search |
| OpenRouter | qwen3-embedding-8b | $0.01 | 256–2048 | 8192 | High | Budget RAG, multilingual |
| Ollama | nomic-embed-text | Free | 768 | 8192 | Low (Cyrillic) | Local RAG, English |
| Ollama | mxbai-embed-large | Free | 1024 | 512 | Medium | Local RAG, high accuracy |
| Open-source | BGE-M3 | Free | 1024 | 8192 | 100+ languages | Hybrid search, multilingual RAG |
| Open-source | Nomic Embed V2 | Free | 768 | 8192 | High | Long documents, self-hosted |
| Open-source | all-MiniLM-L6 | Free | 384 | 256 | English | Prototyping, weak hardware |
Conclusion: the table is a starting point.
The final choice is only after testing on your real data.
📌 Section 7. Who is it for — Selection Matrix
Specific Recommendations for Specific Situations
I recommend choosing models like this:
- Startup with a minimal budget → OpenAI small or Qwen3 via OpenRouter.
- Multilingual product → Cohere embed-v4 or BGE-M3.
- Code and technical docs → Voyage AI.
- Confidential data → self-hosted BGE-M3.
- Multimodal content → Gemini Embedding 2.
| Situation | Recommendation | Why |
|-----------|----------------|-----|
| Startup, minimal budget | OpenAI small or Qwen3 ($0.01) | Cents per month, simple API |
| English blog/knowledge base | OpenAI text-embedding-3-small | Industry standard, widest ecosystem |
| Multilingual product (3+ languages) | Cohere embed-v4 or BGE-M3 | Consistent quality across all languages |
| Code, API documentation, technical docs | Voyage AI voyage-3-large | Highest accuracy for technical content |
| Text + images + video | Google Gemini Embedding 2 | Unified multimodal model with audio |
| Confidential data (medicine, finance) | BGE-M3 or Qwen3 self-hosted | Data does not leave the server |
| Long documents without chunking | Jina v5 (32K context) | Longest context window |
| Local development with Ollama | nomic-embed-text | Free, 8K context, lightweight |
| Fast prototype on weak hardware | all-MiniLM (46 MB) | Minimal resources |
Conclusion: identify your situation in the table —
and you'll get a specific recommendation to start with.
💼 Section 8. My Experience: Embeddings for a Multilingual WebCraft Blog
A Case Study with 4 Languages on Spring Boot
The webscraft.org blog runs on Spring Boot + pgvector + OpenRouter.
It features over 2500 articles in Ukrainian, English, Spanish, and German.
Embeddings are generated using text-embedding-3-small. A RAG chatbot answers
based on the blog's content. Here's what I learned in practice.
🚀 Need RAG, embedding, or AI chatbot integration for your project?
We've already implemented this in production — Spring Boot, pgvector, multilingual search.
Contact us — we'll consult
and help with implementation.
Most of the embedding issues I encountered weren't described in any documentation;
they surfaced through real user queries where the search returned irrelevant results.
Stack
Java 21, Spring Boot, Spring AI, PostgreSQL with pgvector on Railway.
Embeddings: openai/text-embedding-3-small via OpenRouter.
LLM for generation: meta-llama/llama-3.3-70b-instruct (free version).
Indexing: nightly scheduler, batches of 10–20 documents.
Why I Chose text-embedding-3-small
$0.02 per million tokens — for 2500 articles, this is literally cents per month.
1536 dimensions — sufficient "resolution" for blog content.
OpenAI-compatible API via OpenRouter — minimal code changes when switching models.
Problems I Encountered
Cyrillic: Search quality for Ukrainian queries was noticeably lower
than for English ones. A query like "як зробити сайт" (how to make a website) returned less relevant
results than the equivalent "how to build a website".
This is critical for a multilingual blog.
Short Queries: Queries with 1–2 words (e.g., "RAG" or "Spring")
yielded vague results with low scores. Solution: fallback to keyword search
for short queries, as described in the
article on RAG with Ollama.
Duplicates During Re-indexing: Without deterministic chunk IDs,
the database accumulated duplicates. Solution: UUID based on
document_id + chunk number.
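The deterministic-ID fix can be sketched with a name-based UUID (UUID v5): the same input always produces the same ID, so re-indexing becomes an upsert instead of an insert. The namespace string and ID scheme below are illustrative, not the exact ones from my project:

```python
import uuid

# Deterministic chunk IDs: the same document + chunk number always maps
# to the same UUID, so re-indexing overwrites rows instead of duplicating.
# The namespace UUID is arbitrary but must stay fixed for the project.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "chunks.example.local")

def chunk_id(document_id: str, chunk_number: int) -> uuid.UUID:
    return uuid.uuid5(CHUNK_NAMESPACE, f"{document_id}:{chunk_number}")

a = chunk_id("article-42", 0)
b = chunk_id("article-42", 0)  # re-indexing the same chunk
print(a == b)  # True: safe to upsert by ID
```

With random UUIDs (`uuid4`) every re-index run inserts fresh rows; with name-based IDs the vector store's upsert path keeps the table clean.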
What I Would Do Differently
For multilingual content with Cyrillic, I would consider Cohere embed-v4
or Qwen3-Embedding-8B ($0.01/1M tokens, multilingual).
The price difference is negligible, but the quality for non-Latin languages
could be significantly higher. This is the next step — testing with real queries.
Real Numbers
- ✔️ ~500 articles × 4 languages = ~2000 documents
- ✔️ ~10,000 chunks of 512 tokens each
- ✔️ Indexing: ~15–20 minutes via API
- ✔️ Embedding Cost: less than $1 per month
- ✔️ Search: 100–300ms per query
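The "less than $1 per month" figure follows directly from the numbers above. A quick sanity check of the arithmetic:

```python
def monthly_embedding_cost(chunks: int, tokens_per_chunk: int,
                           price_per_million: float,
                           reindexes_per_month: int = 1) -> float:
    # Cost in dollars: total tokens embedded, priced per 1M tokens.
    total_tokens = chunks * tokens_per_chunk * reindexes_per_month
    return total_tokens / 1_000_000 * price_per_million

# Case-study numbers: 10,000 chunks of 512 tokens each,
# text-embedding-3-small at $0.02 per 1M tokens.
cost = monthly_embedding_cost(10_000, 512, 0.02)
print(f"${cost:.2f} per full re-index")
```

A full re-index of the whole corpus costs roughly ten cents, so even re-indexing nightly stays comfortably under a few dollars a month.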
A full breakdown of RAG architecture for production —
RAG in 2026: From PoC to Production.
Conclusion: OpenAI small works for a multilingual blog,
but not perfectly with Cyrillic. For similar projects, it's worth testing
Cohere or Qwen3 — the price difference is minimal, and the quality can be significantly higher.
❓ Frequently Asked Questions (FAQ)
Can I change the embedding model without re-indexing?
No. Different models produce incompatible vectors. If you change the model,
you need to re-index the entire database from scratch. This is why choosing
an embedding model at the start is one of the most crucial architectural decisions
in a RAG system.
Are free models worse than paid ones?
Not always. BGE-M3 outperforms OpenAI small on multilingual retrieval tasks.
Qwen3-Embedding-8B is on par with OpenAI small in quality, costing half as much.
However, benchmarks are not a guarantee. Test with your own data.
How much does embedding cost for a small project?
For a blog with 500 articles on OpenAI small — less than $1 per month.
On Qwen3 via OpenRouter — less than $0.50. Self-hosted — $0,
but requires a GPU server for fast generation.
Which model should I choose for Ukrainian content?
OpenAI works, but the quality is lower than for English.
Cohere embed-v4 and BGE-M3 are significantly better with Cyrillic.
If the budget is limited — Qwen3-Embedding-8B on OpenRouter ($0.01/1M tokens)
with multilingual support.
What is Matryoshka Representation Learning (MRL)?
A technique that allows reducing the dimensionality of a vector after generation
without retraining the model. For example, OpenAI large generates 3072 dimensions,
but you can use only the first 256 — with minimal quality loss.
This saves space in the vector store and speeds up search.
✅ Conclusions
- 🔹 The embedding model is the foundation of RAG. A poor model will render the entire pipeline useless,
regardless of LLM quality.
- 🔹 OpenAI
text-embedding-3-small ($0.02/1M) is
an optimal starting point for English content.
- 🔹 For multilingual content with Cyrillic — Cohere embed-v4 or BGE-M3
are significantly more accurate.
- 🔹 Changing the embedding model = full re-indexing.
Choose wisely at the start.
- 🔹 Benchmarks (MTEB) are a useful guide, but the final decision
is only after testing with your real data and queries.
Main takeaway: I recommend not spending weeks choosing the "perfect" embedding model.
Start with OpenAI small or Qwen3, log the scores on real queries,
and only replace the model if you see specific search quality issues.
If you're just starting with RAG —
RAG with Ollama: From Pipeline to Production
will provide a complete picture of the architecture.
If you need to understand the difference between LLM and RAG —
LLM vs RAG in 2026.
And a full guide from PoC to production —
RAG in 2026: From PoC to Production.