You've built a RAG pipeline, connected an LLM, and set up a vector store —
but the search returns irrelevant results. The problem is almost never with the LLM,
but with the embedding model. It's this model that determines how accurately the system understands
text content and finds the right snippets.
Spoiler alert: for 90% of projects, OpenAI's text-embedding-3-small
at $0.02/1M tokens is the optimal starting point. However, there are situations where it falls short
compared to Cohere, Voyage AI, or free open-source models.
⚡ In a Nutshell
- ✅ Embeddings are the Foundation of RAG: a poor model will render the entire pipeline useless, regardless of LLM quality.
- ✅ OpenAI text-embedding-3-small: $0.02/1M tokens offers the best price-quality balance for getting started.
- ✅ For Multilingual Content: Cohere embed-v4 or BGE-M3 are significantly more accurate than OpenAI.
- 🎯 You'll Get: a comparison table of 10+ models, selection criteria, and a real-world WebCraft case study with 4 languages.
- 👇 Below: detailed explanations, tables, and production experience.
🎯 Section 1. What are Embeddings and Why are They Needed for RAG
How Embeddings Work in a RAG Pipeline
An embedding model converts text into a numerical vector that encodes meaning,
not specific words. This allows a query like "how to build a website" to find
a document about "turnkey web development" — even without any shared words.
In a RAG pipeline, embeddings are used twice: when indexing documents
and when searching based on a user's query.
If the LLM is the brain of your RAG system, then embeddings are its eyes.
The brain can be brilliant, but if the eyes see blurrily,
it receives distorted information and gives a wrong answer.
As detailed in the article
RAG with Ollama: From Pipeline to Production,
a RAG pipeline consists of two phases: indexing (document → chunks → embeddings → vector store)
and retrieval (query → embedding → similarity search → context → LLM response).
The embedding model is involved in both phases — and it's precisely its quality that determines
which snippets end up in the LLM's context.
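Under the hood, "finding the right snippets" is vector similarity. Here is a minimal sketch of cosine similarity, the metric most vector stores use to rank chunks; the toy 3-dimensional vectors are purely illustrative (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes; 1.0 means identical direction in meaning-space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models output 384-3072 dimensions).
query_vec = [0.9, 0.1, 0.3]
doc_about_websites = [0.8, 0.2, 0.25]   # semantically close to the query
doc_about_cooking = [0.1, 0.9, 0.7]     # semantically distant

print(cosine_similarity(query_vec, doc_about_websites))  # close to 1.0
print(cosine_similarity(query_vec, doc_about_cooking))   # noticeably lower
```

This is why "how to build a website" can match "turnkey web development": the two texts land near each other in vector space even with no shared words.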
Analogy: Embeddings as Text Photography
Imagine you're storing photos in different formats.
A 100x100 pixel photo takes up little space but has blurry details.
A 4000x3000 photo is sharp but heavy. With embeddings, it's the same logic:
a vector is a "photograph" of text in numerical space.
The vector's dimensionality (384, 768, 1024, 1536, 3072) is like its resolution.
A larger vector encodes more nuances of meaning but takes up more space
and is slower to generate.
Why Choosing an Embedding Model is Critical
There's a fundamental rule: the model used for indexing must be the same model
used for embedding the query. Different models produce incompatible vectors.
This means: if you decide to change your embedding model —
you'll need to re-index your entire database from scratch. Therefore, the choice at the start
must be deliberate.
- ✔️ A poor embedding model → irrelevant chunks in context → LLM hallucinations
- ✔️ A correct embedding model → accurate retrieval → high-quality answers
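Because indexing and querying must use the same model, it is worth recording the model name alongside the index and failing fast on a mismatch. A hypothetical guard, sketched here (the metadata structure and model names are illustrative, not a specific library's API):

```python
# Hypothetical guard: record which embedding model built the index,
# and refuse to run a query embedded with a different one.
INDEX_METADATA = {"embedding_model": "text-embedding-3-small"}

def check_model_compatibility(query_model: str) -> None:
    indexed_with = INDEX_METADATA["embedding_model"]
    if query_model != indexed_with:
        raise ValueError(
            f"Index was built with '{indexed_with}', but the query uses "
            f"'{query_model}'. Re-index the corpus before switching models."
        )

check_model_compatibility("text-embedding-3-small")  # OK, same model
# check_model_compatibility("embed-v4")  # would raise ValueError
```

A few lines of defensive code like this is far cheaper than debugging silently broken similarity scores after a model swap.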
For more on how embeddings fit into the overall RAG architecture
and how RAG differs from "pure" LLM — see the article
LLM vs RAG in 2026: Why They Aren't the Same and When to Use Which.
Conclusion: embeddings are the foundation of RAG quality.
The choice of model impacts search accuracy more than the choice of LLM for generation.
📌 Section 2. Criteria for Choosing an Embedding Model
What I Look For When Choosing
When I was choosing embeddings for my blog in multiple languages, I narrowed down the options
to six criteria: price per token, vector dimensionality,
context window, multilingual support, retrieval quality on benchmarks (MTEB), and inference speed. For most projects,
price and multilingual support are the main factors.
There's no single "best" embedding model — there's only the best for your specific task,
content language, and budget.
Price per 1M Tokens
The range is from $0 (open-source, self-hosted) to $0.18 (Voyage AI).
For a blog with 500 articles, the difference between $0.02 and $0.13 is cents.
For a corpus of millions of documents, it's hundreds of dollars monthly.
Vector Dimensionality
384 (MiniLM) → 768 (nomic) → 1024 (mxbai) → 1536 (OpenAI small) → 3072 (OpenAI large).
Higher dimensionality = more precise search, but more space in the vector store.
For pgvector with a few thousand chunks, the size difference is negligible.
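The storage cost of dimensionality is easy to estimate: vectors are typically stored as float32 (4 bytes per dimension). A rough back-of-the-envelope calculation, ignoring per-row and index overhead:

```python
def storage_bytes(num_chunks: int, dimensions: int, bytes_per_float: int = 4) -> int:
    # Rough estimate: float32 values only, ignoring row headers
    # and index overhead (which add on top of this).
    return num_chunks * dimensions * bytes_per_float

for dims in (384, 768, 1536, 3072):
    mb = storage_bytes(10_000, dims) / 1_000_000
    print(f"{dims} dims, 10k chunks: ~{mb:.1f} MB")
```

For 10,000 chunks the spread is roughly 15 MB (384 dims) to 123 MB (3072 dims): noticeable, but hardly a deciding factor at blog scale.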
Context Window
How many tokens the model can process per request.
256 (MiniLM) — only for short snippets.
8192 (nomic-embed-text) — sufficient for long articles.
32K (Jina v5) — for entire documents without splitting.
Multilingual Support
Critical for projects with content in multiple languages.
OpenAI works well with English, less so with Cyrillic.
Cohere embed-v4 supports 100+ languages with consistent quality.
BGE-M3 is the strongest open-source option for multilingual search.
Retrieval Quality (MTEB Benchmark)
MTEB (Massive Text Embedding Benchmark) is the standard for comparing embedding models.
However, it's important to understand: benchmarks don't always reflect real-world quality
on your specific data. Test with your own content.
Inference Speed
For indexing (batch, night scheduler) — not critical.
For real-time search — important. API providers typically
respond within 50–200ms. Self-hosted on CPU is slower.
Conclusion: determine your content language, budget, and scale —
and these three factors will narrow down your choice to 2–3 models.
📌 Section 3. Paid API Providers
Five Major Providers
- OpenAI — the standard and easiest start.
- Cohere — the leader for multilingual content.
- Voyage AI — the most accurate for code and technical documentation.
- Google Gemini — multimodal embeddings (text + image + video).
- Jina AI — the longest context window (32K) at the lowest price.
OpenAI: text-embedding-3-small and text-embedding-3-large
text-embedding-3-small — 1536 dimensions, $0.02/1M tokens.
A good balance of price and quality for English content.
text-embedding-3-large — 3072 dimensions, $0.13/1M tokens.
Supports Matryoshka Representation Learning: you can reduce dimensionality
to 256 with minimal quality loss — saving on storage.
Suitable for: general RAG, English content, quick start.
Not suitable for: multilingual content with Cyrillic, code, and technical docs
(Voyage is more accurate).
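Matryoshka truncation is mechanically simple: keep the first N dimensions and re-normalize so cosine similarity still behaves. A sketch of the idea (the constant-valued stand-in vector is illustrative; a real vector comes from the embeddings API):

```python
import math

def truncate_mrl(vector, target_dims):
    # Matryoshka-style truncation: keep the first N dimensions,
    # then re-normalize to unit length so cosine similarity still works.
    head = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.02] * 3072          # stand-in for a text-embedding-3-large vector
short = truncate_mrl(full, 256)
print(len(short))             # 256
print(sum(x * x for x in short))  # ~1.0 (unit length)
```

The quality holds up only because the model was trained so that early dimensions carry the most information; slicing vectors from a model without MRL training loses much more accuracy.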
Cohere: embed-v4
The leader in multilingual benchmarks. 100+ languages with English-level quality.
$0.10/1M tokens. Supports binary quantization — reducing storage size
by 90% with minimal quality loss.
Separate input types for search and classification — optimization without changing the model.
Suitable for: multilingual products, global applications, RAG with content
in 3+ languages.
Not suitable for: if all content is in English — OpenAI is cheaper
and has a broader SDK ecosystem.
Voyage AI: voyage-3-large
Highest accuracy for code, API documentation, and technical content.
$0.18/1M tokens — the most expensive on the list. A separate model, voyage-code-3,
is specifically for code search. Recommended by Anthropic as a preferred embedding provider.
Suitable for: developer tools, code search, technical documentation.
Not suitable for: general content (overpaying without a quality gain).
Google Gemini Embedding 2
Multimodal embeddings: text, images, video, and even audio
without transcription. $0.15/1M tokens. Leader of the MTEB leaderboard (score 68.32).
Native support for Matryoshka Representation Learning.
Suitable for: multimodal content (text + images),
Google Cloud ecosystem.
Not suitable for: if only text embeddings are needed — more expensive than OpenAI.
Jina AI: jina-embeddings-v5
32K context window — the longest among commercial models.
89 languages. Binary quantization. Task-specific LoRA adapters
for retrieval, matching, and classification. Price is lower than most competitors.
Suitable for: long documents without splitting into chunks,
multilingual content at a moderate price.
Not suitable for: if the ecosystem of tutorials and integrations is critical
(smaller community than OpenAI).
Conclusion: from my experience, for a quick start, I recommend OpenAI — the simplest and most stable option.
Cohere is worth choosing if you work with multilingual content.
Voyage is well-suited for technical tasks and code.
Gemini makes sense when you need to work with images and video.
Jina is an optimal choice if a long context window and low cost are critical.
📌 Section 4. OpenRouter: Embeddings via a Unified API
One API Key for Dozens of Models
I use this option myself: OpenRouter is an aggregator that provides access to embedding models
from various providers through a single OpenAI-compatible API.
Prices are without markup: the same as directly from the provider.
Convenient for comparing models and quick switching.
Available Embedding Models on OpenRouter
openai/text-embedding-3-small — $0.02/1M tokens, 1536 dimensions.
openai/text-embedding-3-large — $0.13/1M tokens, 3072 dimensions.
qwen/qwen3-embedding-8b — $0.01/1M tokens, multilingual, optimized
for retrieval and classification.
qwen/qwen3-embedding-0.6b — ~$0.004/1M tokens, basic quality,
minimal price.
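Because the API is OpenAI-compatible, switching models is a one-string change in the request. A sketch of what such a request looks like; the endpoint URL and model slugs follow OpenRouter's conventions as I understand them, so verify both against the current OpenRouter documentation before relying on this:

```python
import json

def build_embedding_request(model: str, texts: list[str]) -> dict:
    # Sketch of an OpenAI-compatible /embeddings request to OpenRouter.
    # The URL and header shape are assumptions; check OpenRouter's docs.
    return {
        "url": "https://openrouter.ai/api/v1/embeddings",
        "headers": {
            "Authorization": "Bearer $OPENROUTER_API_KEY",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "input": texts}),
    }

req = build_embedding_request(
    "openai/text-embedding-3-small",
    ["how to build a website", "turnkey web development"],
)
print(req["url"])
```

To compare models on the same data, you only swap the model string (e.g. to `qwen/qwen3-embedding-8b`) and re-embed; the rest of the pipeline stays untouched.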
When OpenRouter is the Right Choice
- ✔️ Need to quickly compare several embedding models on the same data
- ✔️ Project already uses OpenRouter for LLMs (single API key and billing)
- ✔️ Need a fallback: if one provider goes down — automatic switching
When It's Better to Go Directly to the Provider
- ⚠️ Need specific provider features (Cohere binary quantization, Jina LoRA adapters)
- ⚠️ Enterprise SLA and availability guarantees
For local development with Ollama, embedding models run
for free without any API. More details —
Ollama in 2026: What It Is and Why Developers Are Massively Switching to Local AI.
Conclusion: for me, OpenRouter is a convenient aggregator for comparing models and getting started quickly.
For production with specific requirements, I recommend using the provider's direct API.
📌 Section 5. Open-source Models (Self-Hosted)
Free, with Full Control
Open-source embedding models run locally without API keys or payment.
BGE-M3 is the strongest for multilingual hybrid search.
Nomic Embed V2 is for long documents. Qwen3-Embedding is the most flexible
in terms of dimensionality. E5 and MiniLM are for rapid prototyping.
BGE-M3 (BAAI)
Hybrid search: dense + sparse vectors in one model.
Multilingual. Ideal for RAG where both semantic
and keyword search are needed simultaneously. As described in the article
RAG with Ollama,
fallback from vector search to keyword search is a critical production pattern,
and BGE-M3 solves this at the model level.
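When a model returns both a dense (semantic) and a sparse (keyword) relevance score per chunk, a common way to combine them is a weighted fusion. A minimal sketch, assuming both scores are normalized to [0, 1]; the weight `alpha` is an illustrative tuning knob, not a value BGE-M3 prescribes:

```python
def hybrid_score(dense: float, sparse: float, alpha: float = 0.7) -> float:
    # Weighted fusion of dense (semantic) and sparse (keyword) scores,
    # both assumed normalized to [0, 1]. alpha balances the two signals.
    return alpha * dense + (1 - alpha) * sparse

# A keyword-heavy query: the sparse score rescues a chunk that the
# dense score alone would rank too low.
print(hybrid_score(dense=0.35, sparse=0.95))
```

In practice you tune `alpha` on logged queries: keyword-heavy corpora (product names, error codes) want a lower `alpha`, conversational content a higher one.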
Nomic Embed Text V2
Mixture-of-Experts architecture. Works well with long documents.
Multilingual. Popular in local RAG stacks with Ollama
as an alternative to the first version of nomic-embed-text.
Qwen3-Embedding (0.6B and 8B)
The latest series from Qwen. Flexible vector sizes
(from 256 to 2048 dimensions — you choose).
Multilingual, instruction-aware. The 8B version is available
on OpenRouter for $0.01/1M tokens — cheaper than OpenAI small.
E5 and MiniLM
Lightweight models for rapid prototyping. MiniLM is 46 MB, 384 dimensions.
Runs even on weak hardware. E5 is slightly more accurate,
but both are inferior to BGE-M3 and Qwen3 on retrieval tasks.
Pros and Cons of Self-Hosting
- ✔️ Free — zero API costs
- ✔️ Full control over data — nothing goes to the cloud
- ✔️ No rate limits or dependency on external services
- ⚠️ Requires a GPU for fast generation (slower on CPU)
- ⚠️ Infrastructure is on you: updates, scaling, monitoring
Conclusion: open-source is the right choice
for confidential data, minimal budget, or complete control.
BGE-M3 and Qwen3-Embedding are leaders in 2026.
📌 Section 6. Comparison Table of All Models
All Models in One Place
The table includes 10+ embedding models with price, dimensionality,
context window, multilingualism, and recommended use case.
| Provider | Model | Price / 1M Tokens | Dimensionality | Context | Multilingualism | Best Use Case |
|----------|-------|-------------------|----------------|---------|-----------------|---------------|
| OpenAI | text-embedding-3-small | $0.02 | 1536 | 8191 | Medium | General RAG, English content |
| OpenAI | text-embedding-3-large | $0.13 | 3072 | 8191 | Medium | Maximum accuracy, MRL |
| Cohere | embed-v4 | $0.10 | 1024 | 512 | 100+ languages | Multilingual content, enterprise |
| Voyage AI | voyage-3-large | $0.18 | 1024 | 32000 | Medium | Code, API docs, technical documentation |
| Google | Gemini Embedding 2 | $0.15 | 3072 | 8192 | 100+ languages | Multimodal (text+image+video) |
| Jina AI | jina-embeddings-v5 | ~$0.05 | 1024 | 32000 | 89 languages | Long documents, multilingual search |
| OpenRouter | qwen3-embedding-8b | $0.01 | 256–2048 | 8192 | High | Budget RAG, multilingual |
| Ollama | nomic-embed-text | Free | 768 | 8192 | Low (Cyrillic) | Local RAG, English |
| Ollama | mxbai-embed-large | Free | 1024 | 512 | Medium | Local RAG, high accuracy |
| Open-source | BGE-M3 | Free | 1024 | 8192 | 100+ languages | Hybrid search, multilingual RAG |
| Open-source | Nomic Embed V2 | Free | 768 | 8192 | High | Long documents, self-hosted |
| Open-source | all-MiniLM-L6 | Free | 384 | 256 | English | Prototyping, weak hardware |
Conclusion: the table is a starting point.
The final choice is only after testing on your real data.
📌 Section 7. Who is it for — Selection Matrix
Specific Recommendations for Specific Situations
I recommend choosing models like this:
- Startup with a minimal budget → OpenAI small or Qwen3 via OpenRouter.
- Multilingual product → Cohere embed-v4 or BGE-M3.
- Code and technical docs → Voyage AI.
- Confidential data → self-hosted BGE-M3.
- Multimodal content → Gemini Embedding 2.
| Situation | Recommendation | Why |
|-----------|----------------|-----|
| Startup, minimal budget | OpenAI small or Qwen3 ($0.01) | Cents per month, simple API |
| English blog/knowledge base | OpenAI text-embedding-3-small | Industry standard, widest ecosystem |
| Multilingual product (3+ languages) | Cohere embed-v4 or BGE-M3 | Consistent quality across all languages |
| Code, API documentation, technical docs | Voyage AI voyage-3-large | Highest accuracy for technical content |
| Text + images + video | Google Gemini Embedding 2 | Unified multimodal model with audio |
| Confidential data (medicine, finance) | BGE-M3 or Qwen3 self-hosted | Data does not leave the server |
| Long documents without chunking | Jina v5 (32K context) | Longest context window |
| Local development with Ollama | nomic-embed-text | Free, 8K context, lightweight |
| Fast prototype on weak hardware | all-MiniLM (46 MB) | Minimal resources |
Conclusion: identify your situation in the table —
and you'll get a specific recommendation to start with.
💼 Section 8. My Experience: Embeddings for a Multilingual WebCraft Blog
A Case Study with 4 Languages on Spring Boot
The webscraft.org blog runs on Spring Boot + pgvector + OpenRouter.
It features over 2500 articles in Ukrainian, English, Spanish, and German.
Embeddings are generated using text-embedding-3-small. A RAG chatbot answers
based on the blog's content. Here's what I learned in practice.
🚀 Need RAG, embedding, or AI chatbot integration for your project?
We've already implemented this in production — Spring Boot, pgvector, multilingual search.
Contact us — we'll consult
and help with implementation.
Most of the embedding issues I encountered weren't described in any documentation;
they surfaced through real user queries where the search returned irrelevant results.
Stack
Java 21, Spring Boot, Spring AI, PostgreSQL with pgvector on Railway.
Embeddings: openai/text-embedding-3-small via OpenRouter.
LLM for generation: meta-llama/llama-3.3-70b-instruct (free version).
Indexing: nightly scheduler, batches of 10–20 documents.
Why I Chose text-embedding-3-small
$0.02 per million tokens — for 2500 articles, this is literally cents per month.
1536 dimensions — sufficient "resolution" for blog content.
OpenAI-compatible API via OpenRouter — minimal code changes when switching models.
Problems I Encountered
Cyrillic: Search quality for Ukrainian queries was noticeably lower
than for English ones. A query like "як зробити сайт" (how to make a website) returned less relevant
results than the equivalent "how to build a website".
This is critical for a multilingual blog.
Short Queries: Queries with 1–2 words (e.g., "RAG" or "Spring")
yielded vague results with low scores. Solution: fallback to keyword search
for short queries, as described in the
article on RAG with Ollama.
Duplicates During Re-indexing: Without deterministic chunk IDs,
the database accumulated duplicates. Solution: UUID based on
document_id + chunk number.
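The deterministic-ID fix can be sketched with a name-based UUID (UUID v5): the same input always produces the same ID, so re-indexing becomes an upsert instead of an insert. The namespace string and ID scheme below are illustrative, not the exact ones from my project:

```python
import uuid

# Deterministic chunk IDs: the same document + chunk number always maps
# to the same UUID, so re-indexing overwrites rows instead of duplicating.
# The namespace UUID is arbitrary but must stay fixed for the project.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "chunks.example.local")

def chunk_id(document_id: str, chunk_number: int) -> uuid.UUID:
    return uuid.uuid5(CHUNK_NAMESPACE, f"{document_id}:{chunk_number}")

a = chunk_id("article-42", 0)
b = chunk_id("article-42", 0)  # re-indexing the same chunk
print(a == b)  # True: safe to upsert by ID
```

With random UUIDs (`uuid4`) every re-index run inserts fresh rows; with name-based IDs the vector store's upsert path keeps the table clean.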
What I Would Do Differently
For multilingual content with Cyrillic, I would consider Cohere embed-v4
or Qwen3-Embedding-8B ($0.01/1M tokens, multilingual).
The price difference is negligible, but the quality for non-Latin languages
could be significantly higher. This is the next step — testing with real queries.
Real Numbers
- ✔️ ~500 articles × 4 languages = ~2000 documents
- ✔️ ~10,000 chunks of 512 tokens each
- ✔️ Indexing: ~15–20 minutes via API
- ✔️ Embedding Cost: less than $1 per month
- ✔️ Search: 100–300ms per query
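The "less than $1 per month" figure follows directly from the numbers above. A quick sanity check of the arithmetic:

```python
def monthly_embedding_cost(chunks: int, tokens_per_chunk: int,
                           price_per_million: float,
                           reindexes_per_month: int = 1) -> float:
    # Cost in dollars: total tokens embedded, priced per 1M tokens.
    total_tokens = chunks * tokens_per_chunk * reindexes_per_month
    return total_tokens / 1_000_000 * price_per_million

# Case-study numbers: 10,000 chunks of 512 tokens each,
# text-embedding-3-small at $0.02 per 1M tokens.
cost = monthly_embedding_cost(10_000, 512, 0.02)
print(f"${cost:.2f} per full re-index")
```

A full re-index of the whole corpus costs roughly ten cents, so even re-indexing nightly stays comfortably under a few dollars a month.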
A full breakdown of RAG architecture for production —
RAG in 2026: From PoC to Production.
Conclusion: OpenAI small works for a multilingual blog,
but not perfectly with Cyrillic. For similar projects, it's worth testing
Cohere or Qwen3 — the price difference is minimal, and the quality can be significantly higher.
❓ Frequently Asked Questions (FAQ)
Can I change the embedding model without re-indexing?
No. Different models produce incompatible vectors. If you change the model,
you need to re-index the entire database from scratch. This is why choosing
an embedding model at the start is one of the most crucial architectural decisions
in a RAG system.
Are free models worse than paid ones?
Not always. BGE-M3 outperforms OpenAI small on multilingual retrieval tasks.
Qwen3-Embedding-8B is on par with OpenAI small in quality, costing half as much.
However, benchmarks are not a guarantee. Test with your own data.
How much does embedding cost for a small project?
For a blog with 500 articles on OpenAI small — less than $1 per month.
On Qwen3 via OpenRouter — less than $0.50. Self-hosted — $0,
but requires a GPU server for fast generation.
Which model should I choose for Ukrainian content?
OpenAI works, but the quality is lower than for English.
Cohere embed-v4 and BGE-M3 are significantly better with Cyrillic.
If the budget is limited — Qwen3-Embedding-8B on OpenRouter ($0.01/1M tokens)
with multilingual support.
What is Matryoshka Representation Learning (MRL)?
A technique that allows reducing the dimensionality of a vector after generation
without retraining the model. For example, OpenAI large generates 3072 dimensions,
but you can use only the first 256 — with minimal quality loss.
This saves space in the vector store and speeds up search.
✅ Conclusions
- 🔹 The embedding model is the foundation of RAG. A poor model will render the entire pipeline useless,
regardless of LLM quality.
- 🔹 OpenAI
text-embedding-3-small ($0.02/1M) is
an optimal starting point for English content.
- 🔹 For multilingual content with Cyrillic — Cohere embed-v4 or BGE-M3
are significantly more accurate.
- 🔹 Changing the embedding model = full re-indexing.
Choose wisely at the start.
- 🔹 Benchmarks (MTEB) are a useful guide, but the final decision
is only after testing with your real data and queries.
Main takeaway: I recommend not spending weeks choosing the "perfect" embedding model.
Start with OpenAI small or Qwen3, log the scores on real queries,
and only replace the model if you see specific search quality issues.
If you're just starting with RAG —
RAG with Ollama: From Pipeline to Production
will provide a complete picture of the architecture.
If you need to understand the difference between LLM and RAG —
LLM vs RAG in 2026.
And a full guide from PoC to production —
RAG in 2026: From PoC to Production.