What is an embedding in artificial intelligence?

An embedding is the conversion of text into a numerical vector (a list of 384 to 3072 numbers) that encodes the meaning of a word or sentence. Words with similar meanings get close vectors, and words with different meanings get distant ones. Embeddings are what let AI find documents by meaning rather than only by exact word match.

Which similarity metric is better: Cosine Similarity or Dot Product?

Cosine Similarity is the standard for most models: it measures the angle between vectors regardless of their length. Some models (for example, Google's Gemini Embedding 2 or some Cohere models) are optimized for Dot Product. If you normalize vectors to unit length, Cosine Similarity becomes mathematically equivalent to Dot Product, and search gets faster. Always check the model's documentation before configuring the metric in your Vector DB.

What is cross-lingual search and how does it work?

Cross-lingual search is the ability to find documents in one language using a query in another language, without translation. Multilingual embedding models (BGE-M3, Cohere embed-v4) were trained on parallel translation corpora, so "договір" (Ukrainian) and "contract" (English) end up in adjacent points of the same vector space.

How does an embedding differ from a token?

A token is a unit of text (a word or part of one) that AI identifies when splitting text. An embedding is a numerical vector that encodes the meaning of that token or of an entire sentence. Tokens are the syntax level (what the text looks like), embeddings are the semantics level (what it means).

What is vector dimensionality and which one should I choose?

Dimensionality is the number of values in a vector: 384, 768, 1024, 1536, or 3072. Higher dimensionality encodes more nuances of meaning, but takes more memory and slows down search. For most RAG projects the optimal choice is 768–1536 dimensions. Models with Matryoshka (MRL) support let you truncate a vector to the dimensionality you need without re-generating it.

When are embeddings not suitable, and what should I use instead?

Embeddings work poorly with exact numbers, dates, product codes, IDs, and short queries of 1–2 words: the model blurs exact values into general meaning. Such tasks need keyword search (BM25) or metadata filtering. The best solution for production is hybrid search: combining vector search with BM25, which gives a +15–40% quality improvement.

How is an embedding model trained?

An embedding model is trained with contrastive learning on billions of text pairs. Sentences with similar meanings (positive pairs) get close vectors, and sentences with different meanings (negative pairs) get distant ones. No one specifies by hand which words are similar: the model infers this from the statistics of real texts.

Can I use one embedding for search and classification?

Technically yes, but the quality will be lower. Some models (Cohere embed-v4, Jina v5) support task-specific modes: you specify the task type (retrieval, classification, matching) and the model optimizes the vector for it. If you have several tasks, choose models that support such modes.

Which embedding model should I choose for Ukrainian content in 2026?

To start, use text-embedding-3-small from OpenAI ($0.02/1M tokens) or nomic-embed-text locally via Ollama (free). For multilingual content with Cyrillic, OpenAI's quality is noticeably lower: BGE-M3 (open-source, self-hosted) or Cohere embed-v4 ($0.10/1M tokens) is a better fit, as both support 100+ languages with equal quality for Latin and Cyrillic scripts.

Do I need to re-index the database when changing the embedding model?

Yes, absolutely. Different models generate vectors in incompatible spaces, so they cannot be compared with each other. If you change the embedding model, you need to regenerate all vectors from scratch. This is why the initial model choice is one of the key architectural decisions in a RAG system.

AI_TOOLS · 27 March 2026 · 21 min read

What Are Embeddings: How AI Understands Meaning, Not Just Words

Updated: 27 March 2026


Dmitro Petrov

A Tech Lead who builds AI/ML systems for production — and writes about how they actually work.


Have you ever wondered why ChatGPT finds a connection between "car" and "automobile" — even though they are different words? Or why a RAG system finds the right document even if the query doesn't contain a single word from the text? Spoiler: one technology stands behind it — embedding. It's a way to convert any text into a set of numbers so that texts with similar meanings have similar numbers.

⚡ TL;DR

✅ Embedding is not a word, but a number: each word or sentence is converted into a vector of hundreds of numbers that encodes its meaning
✅ Similar meaning = similar numbers: "cat" and "kitten" have close vectors, "cat" and "rocket" have distant ones
✅ No RAG, semantic search, or recommendations without embeddings: it's the foundation of most modern AI systems
🎯 You will get: a clear understanding of what embedding is, how it's trained, where it's used, and when it's not suitable
👇 Below is an explanation with analogies, diagrams, and practical takeaways without unnecessary theory

📚 Article Content

📌 Section 1. From tokens to numbers: why AI cannot work with words directly
📌 Section 2. Embedding = coordinates in the space of meaning
📌 Section 3. Where do these numbers come from? How the model learned to encode meaning
📌 Section 4. Dimensionality: why 768 ≠ 1536 and what it means in practice
📌 Section 5. Where embeddings are used: 5 real-world applications
📌 Section 6. Limitations of embeddings: when they won't help
💼 Section 7. Models of 2026: where to start
❓ Frequently Asked Questions (FAQ)
✅ Conclusions

🎯 From tokens to numbers: why AI cannot work with words directly

Why numbers are needed

A neural network is a mathematical function. It cannot process words as they are, because it doesn't know what "cat" or "contract" means. It only knows how to multiply, add, and transform numbers. Therefore, each word (or token) must first be converted into a numerical vector — and this is exactly what an embedding model does.

A token is a unit of text that the AI sees. An embedding is what that token *means*, written in numbers.

If you've read the article about tokens, you know: AI divides text into fragments (tokens) — these can be whole words, parts of words, or even individual characters. But a token is not yet meaning. It's just a unit of segmentation.

Embedding is the next step. Each token gets its unique numerical "fingerprint" — a vector. This vector encodes not the spelling of the word, but its meaning in the context of language. That's why "car" and "automobile" are different tokens, but their vectors will be very close.

Analogy: dictionary vs. city map

Imagine two ways to describe a word. The first is a dictionary: "cat: a domestic animal of the feline family." This is useful for humans, but useless for mathematics. The second is a map: in the city of concepts, "cat" is located next to "kitten," "paw," and "purring," and far from "rocket" or "accounting." An embedding is precisely such a map, where the position of each word is determined by its semantic neighborhood. See the OpenAI Embeddings documentation for how text is converted into a numerical vector in a high-dimensional space.

✔️ Token = unit of text (what is written)
✔️ Embedding = numerical vector (what it means)
✔️ Without converting text to numbers — no neural network can work with it

Conclusion: Embedding is the essential bridge between text that humans understand and numbers that AI works with.

Why different models give different results: comparison table

Parameter	OpenAI text-embedding-3-small	BGE-M3 (BAAI)	E5-large (Microsoft)
Training signal (what it was trained on)	Synthetic pairs generated by GPT-4 + NLI datasets + web corpus. Closed dataset, details not disclosed	Open dataset: 1.2M pairs from 570+ sources — MS MARCO, NLI, parallel translations of 100+ languages, code	Web corpus + Microsoft NLI datasets. Instruction-tuning: an instruction like "query: " or "passage: " is added before each query
Context window (max tokens)	8,191 tokens — enough for long articles and documents	8,192 tokens — similarly, plus support for long queries in hybrid search	512 tokens — limitation for long documents, requires chunking
Architecture	Transformer encoder, details closed. Supports Matryoshka (MRL): can be truncated to 256 dimensions	XLM-RoBERTa as a base model. Uniqueness: simultaneously generates dense + sparse + ColBERT vectors	Transformer encoder initialized from BERT-large. Prefix-aware: the result depends on the "query: " / "passage: " prefix
Vector dimensionality	1,536 (or less via MRL)	1,024	1,024
Similarity metric	Cosine Similarity (or Dot Product after normalization)	Cosine Similarity for dense, inner product for sparse	Cosine Similarity — but a prefix must be added, otherwise quality drops
Multilingualism	Supported, but quality on Cyrillic is lower than on Latin	100+ languages, same quality — leader of MTEB multilingual benchmarks	Primarily English. Multilingual version — separate model multilingual-e5
Hybrid search	Dense vectors only — hybrid requires a separate BM25	Native hybrid: dense + sparse in one model without additional tools	Dense only — hybrid via separate BM25 or Elasticsearch
Price	$0.02 / 1M tokens via API	Free — self-hosted, requires GPU for speed	Free — self-hosted via HuggingFace
When to choose	Quick start, English or mixed content, minimal infrastructure	Multilingual content, Cyrillic, hybrid search, confidential data	English retrieval, have GPU, need high quality on English benchmarks

Why the same text yields different results in different models

There are three main reasons for discrepancies in results between models:

1. Different training signal. OpenAI trained the model on synthetic pairs from GPT-4 — this provides good general quality, but the dataset details are closed. BGE-M3 was trained on an open 1.2M pair dataset with an explicit multilingual signal — hence better on Cyrillic. E5 uses instruction-tuning: the model expects the prefix "query: " before the query and "passage: " before the document. If the prefix is not added — quality drops by 5–15% even on English content.

2. Different context window. E5-large processes only 512 tokens — a long document will be truncated, and the tail part will simply disappear from the vector. OpenAI and BGE-M3 process up to 8K tokens, allowing entire articles to be embedded without information loss. If your documents are longer than 400 words and you use E5 — chunking is definitely required.

3. Different space geometry. Each model builds its own vector space. The vector for the word "contract" from OpenAI and the vector for the same word from BGE-M3 are incompatible numbers in incompatible spaces. That's why you cannot mix vectors from different models in the same Vector DB and cannot compare them. More details on this are in the MTEB Leaderboard, where models are compared on standardized benchmarks.
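To make the context-window point concrete, here is a minimal chunking sketch. It assumes a word count is a good enough proxy for tokens (a real pipeline would count tokens with the model's own tokenizer); the function name and the `max_words` and `overlap` defaults are illustrative, not taken from any library.

```python
def chunk_words(text, max_words=300, overlap=50):
    """Split text into overlapping word windows that fit a small context window.

    Word counts only approximate token counts; the defaults are assumptions
    chosen to stay safely under a 512-token limit.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(700))  # a fake 700-word document
chunks = chunk_words(document)
print(len(chunks))  # 3
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbors, so its meaning is not cut in half.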

📌 Embedding = coordinates in the space of meaning

What is a vector

An embedding is a list of numbers, for example: [0.21, -0.84, 0.03, 0.67, ...] — from 384 to 3072 numbers depending on the model. Each set of numbers is a point in a multidimensional space. Words with similar meanings end up close in this space, words with different meanings are far apart.

If regular GPS coordinates are two numbers (latitude and longitude), then an embedding is a GPS in a space with thousands of dimensions, where each "coordinate" encodes some aspect of meaning.

Let's imagine a very simplified space — only two dimensions. The X-axis represents how much the word is related to the living world. The Y-axis represents how much it is related to movement. Then:

"lawyer" → high X (human), low Y (not about movement) → coordinate [0.9, 0.1]
"courier" → high X, high Y → [0.9, 0.8]
"car" → low X, high Y → [0.1, 0.9]
"rocket" → low X, very high Y → [0.05, 0.95]

In reality, there are not two, but hundreds or thousands of dimensions — each encodes some subtler aspect: emotional tone, syntactic role, thematic relevance, etc. We cannot interpret them individually, but the mathematics of distances between points in this space works flawlessly.
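The toy coordinates above can be checked with a few lines of code. This sketch uses plain straight-line distance for simplicity and simply asks which word is the nearest neighbor of another; the helper names are illustrative.

```python
import math

# Toy 2-D coordinates from the example above: [living world, movement].
points = {
    "lawyer": [0.9, 0.1],
    "courier": [0.9, 0.8],
    "car": [0.1, 0.9],
    "rocket": [0.05, 0.95],
}


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(word):
    """Return the other word whose point lies closest to `word`."""
    others = (w for w in points if w != word)
    return min(others, key=lambda w: euclidean(points[word], points[w]))


print(nearest("car"))     # rocket
print(nearest("lawyer"))  # courier
```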

How similarity between vectors is measured: Cosine Similarity

The distance between two vectors is measured not by a straight line (Euclidean distance), but by the angle between them — the so-called cosine similarity. The smaller the angle between two vectors, the more semantically similar the words or sentences are. "Contract" and "agreement" point in almost the same direction in the space of meaning — the angle between them is small, similarity is high. "Contract" and "rocket" are almost perpendicular, similarity is close to zero.
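As a minimal sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The three vectors below are invented for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = perpendicular)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-D vectors, invented for illustration only.
contract = [0.8, 0.6, 0.0]
agreement = [0.6, 0.8, 0.0]
rocket = [0.0, 0.0, 1.0]

print(round(cosine_similarity(contract, agreement), 2))  # 0.96
print(round(cosine_similarity(contract, rocket), 2))     # 0.0
```

Because the result depends only on direction, multiplying a vector by any positive constant leaves the similarity unchanged.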

Cosine similarity is the de facto standard for most models and vector DBs. But there is an important nuance that I often see behind production errors: some models are optimized for a different metric, Dot Product. For example, Google's Gemini Embedding 2 and some Cohere models in certain modes expect Dot Product, not Cosine. If you set the wrong metric in your Vector DB, search results will be worse even with the correct model. Always check the model's documentation before setting the distance in ChromaDB, Qdrant, or pgvector.

Lifehack: normalizing vectors speeds up search

There is a practical trick worth knowing. If you normalize vectors to unit length (that is, scale each vector to a norm of 1), Cosine Similarity becomes mathematically equivalent to Dot Product. And Dot Product is cheaper to calculate: it's just the sum of element-wise products, without the additional division by norms. Some vector DBs take care of this for you: Qdrant normalizes vectors on insert when the collection uses Cosine distance, and pgvector computes cosine distance directly with vector_cosine_ops. But if you use your own search or FAISS, normalize vectors before saving and search by inner product: you get the same ranking as cosine with less work per comparison.
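The equivalence is easy to verify numerically. A minimal sketch with two arbitrary example vectors (the helper names are illustrative):

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 7.0]
a_unit, b_unit = normalize(a), normalize(b)

# After normalization the plain dot product equals cosine similarity.
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))  # True
```

The division by norms happens once per vector at write time instead of once per comparison at query time.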

More details on how cosine similarity and Dot Product are used in search in practice are in the article Vector Search for Beginners.

✔️ Embedding — a list of numbers from 384 to 3072 values
✔️ Similarity is measured by the angle between vectors (Cosine Similarity) — but check the model's documentation: some are optimized for Dot Product
✔️ Normalized vectors: Cosine = Dot Product → faster search without quality loss

Conclusion: Embedding transforms abstract "meaning" into concrete geometry — and this is precisely what allows AI to compare text meanings mathematically. But choosing the right similarity metric is no less important than the model itself.

📌 Where do these numbers come from? How the model learned to encode meaning

How an embedding model is trained

An embedding model is trained on billions of text pairs using contrastive learning: sentences with similar meanings get close vectors, sentences with different meanings get distant ones. No one manually tells the model that "termination of contract" is similar to "cessation of agreement" — it deduces this itself from the statistics of billions of real texts.

The model doesn't read a dictionary and doesn't know grammar. It reads billions of sentences — and learns which words and phrases co-occur, and which never appear together.

Behind this lies the idea of distributional semantics, formulated by linguist John Firth back in 1957: "You shall know a word by the company it keeps." If two words consistently appear in similar contexts, they most likely have similar meanings. This very principle became the basis for Word2Vec (Google, 2013), the first large-scale model that converted words into vectors through context statistics rather than manual labeling. Original paper: Mikolov et al., 2013, arXiv:1301.3781.

From Word2Vec to modern contrastive learning

Word2Vec training was simple: for each word — predict neighboring words within a window of 5–10 tokens. If "lawyer" and "attorney" frequently appear in similar sentences — their vectors get closer. This already worked. But in Word2Vec, each word had *one* vector regardless of context: the word "key" received the same vector in the sentence "key to the lock" and "key to success" — even though the meanings are different.

Later models, starting with BERT (Google, 2018), solved this problem: the vector now depends on the entire sentence, not on a static dictionary. The word "key" gets a different vector depending on what is next to it. This became possible thanks to the transformer architecture and the self-attention mechanism. Paper: BERT: Pre-training of Deep Bidirectional Transformers, arXiv:1810.04805.

What contrastive learning looks like in practice

The current standard for training embedding models is contrastive learning, specifically the SimCSE approach and its derivatives. The model receives three sentences simultaneously:

Anchor: "How to cancel a service subscription?"
Positive example: "Instructions for unsubscribing from a plan" — should be close to the anchor
Negative example: "Recipe for traditional borscht" — should be far from the anchor

The loss function (contrastive loss or InfoNCE loss) penalizes the model if the positive vector is far from the anchor or if the negative vector is too close. Through billions of such triplets, the model learns to build a space where semantic proximity is reflected by geometric proximity. Paper: SimCSE: Simple Contrastive Learning of Sentence Embeddings, arXiv:2104.08821.
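To show what "penalizes" means numerically, here is a minimal sketch of the InfoNCE loss for a single anchor. It assumes unit-length vectors (so dot product equals cosine) and a temperature of 0.05; both the vectors and the temperature are illustrative choices, and real training computes this over large batches with automatic differentiation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive.

    Assumes all vectors are unit-length, so dot product equals cosine similarity.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over positive and negative scores.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Hypothetical unit vectors, invented for illustration.
anchor = [1.0, 0.0]
close_vector = [0.8, 0.6]
far_vector = [0.0, 1.0]

good = info_nce(anchor, close_vector, [far_vector])  # positive is near the anchor
bad = info_nce(anchor, far_vector, [close_vector])   # positive is far, negative near
print(good < bad)  # True
```

The loss is near zero when the positive is much closer to the anchor than every negative, and grows as a negative overtakes it.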

Where do the training pairs come from? Mostly from three sources: naturally occurring pairs (NLI datasets, question-answer threads from Stack Overflow and Reddit), parallel translations (for multilingual models), and synthetic pairs generated by LLMs. For example, to train text-embedding-3-small, OpenAI used synthetic data generated by GPT-4, as described in OpenAI's blog on new embedding models.

What the model actually encodes in the vector

Researchers have found that different directions in the vector space correlate with different aspects of meaning, but not as straightforwardly as one might wish. It is tempting to imagine one dimension responsible for "legal context," another for "emotional tone," and a third for "syntactic role in the sentence." In practice the dimensions are not isolated: meaning is encoded in a distributed way across all numbers simultaneously. As a rule, no single number in the vector has a human-interpretable meaning; only the entire combination does.

A well-known example from Word2Vec, which has carried over to modern models: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). The arithmetic of meanings works in the vector space — this is one of the most striking confirmations that the model has truly captured the structure of language, not just memorized words.
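The arithmetic can be reproduced on a toy scale. The 2-D vectors below are hand-made so that one axis loosely stands for "royalty" and the other for "gender"; real models learn such structure across hundreds of dimensions, and these values are invented purely for illustration.

```python
import math

# Hand-made 2-D toy vectors: [royalty, gender]. Invented for illustration.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "prince": [0.7, 0.9],
    "girl": [0.0, -0.9],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```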

Why multilingual models understand "договір" and "contract" as the same

Multilingual embedding models (Cohere embed-v4, BGE-M3, multilingual-e5) were trained simultaneously on texts in dozens of languages and on parallel translation corpora. Because of this, "договір" (Ukrainian) and "contract" (English) end up in adjacent points in the same space — because they appeared in similar contexts and were linked through translations.

This enables cross-lingual search: a query in Ukrainian finds documents in English without any translation. In practice, if your knowledge base is filled mostly with English documents and users query in Ukrainian, a high-quality multilingual model will bridge this gap. In our WebsCraft case, this was a real problem: some content exists only in one language, and queries come in another. BGE-M3 handles this significantly better than OpenAI small, where quality on Cyrillic is noticeably lower than on Latin, which is also confirmed by the MTEB Multilingual Leaderboard results.

✔️ Distributional semantics: meaning = co-occurrence context
✔️ Contrastive learning: positive pairs move closer, negative ones move apart
✔️ Modern models (BERT and beyond): vector depends on context, not static
✔️ Multilingual models: translations in the same space → cross-lingual search without a translator

Conclusion: The numbers in a vector are not arbitrary and not hardcoded. They are the result of training on billions of examples and reflect the real statistical structure of language — that's why they capture meaning so well.

📌 Dimensionality: why 768 ≠ 1536 and what it means in practice

What is dimensionality

Dimensionality is the number of numbers in a vector. A 384-dimensional vector is small and fast, but less accurate. A 3072-dimensional vector encodes more nuances of meaning, but takes up more memory and is slower to search. For most RAG projects, 768–1536 dimensions offer an optimal balance.

The dimensionality of a vector is like the resolution of a photograph: 100×100 pixels are enough for an icon, 4000×3000 for a poster. What you need depends on the task.

More dimensions = more "channels" for encoding meaning. A small vector (384 dimensions, like in all-MiniLM) can distinguish "cat" and "rocket," but might confuse subtle nuances: for example, it might not differentiate "early termination of contract" from "expiration of contract term." A large vector (3072 dimensions, like in text-embedding-3-large) will capture this difference too, but will cost more and take up more space in the database.

Trade-off in practice: speed, quality, cost

Dimensionality	Model example	Retrieval quality	Memory footprint	Suitable for
384	all-MiniLM-L6-v2	Basic	~1.5 KB/vector	Prototypes, low-end hardware
768	nomic-embed-text	Good	~3 KB/vector	Local RAG, Ollama
1024	BGE-M3, Cohere embed-v4	High	~4 KB/vector	Production, multilingual
1536	text-embedding-3-small	High	~6 KB/vector	General RAG via API
3072	text-embedding-3-large	Maximum	~12 KB/vector	Maximum accuracy, MRL
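The memory column follows from simple arithmetic, assuming each value is stored as a 4-byte float32 and ignoring index overhead and metadata. A small sketch (the function name is illustrative) that also scales the estimate to a whole collection:

```python
BYTES_PER_FLOAT32 = 4


def index_size_mb(dimensions, num_vectors):
    """Raw size of float32 vectors in MiB, ignoring index overhead and metadata."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors / (1024 * 1024)


for dims in (384, 768, 1024, 1536, 3072):
    per_vector_kb = dims * BYTES_PER_FLOAT32 / 1024
    print(f"{dims:>4} dims: {per_vector_kb:.1f} KB/vector, "
          f"{index_size_mb(dims, 1_000_000):,.0f} MB per 1M vectors")
```

Quantized storage (for example int8 or binary vectors) would shrink these numbers, but that is outside what the table assumes.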

Matryoshka embeddings: high dimensionality without overpaying

In 2024–2025, the Matryoshka Representation Learning (MRL) technique gained popularity. The idea is simple: the model is trained such that the first N dimensions of the vector already carry most of the useful information. This means you can generate a 3072-dimensional vector but store and search only the first 256 — with minimal loss of quality. OpenAI text-embedding-3-large and Gemini Embedding 2 support this approach: generate once, and choose the dimensionality for the task.
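Truncation itself is a two-step operation: keep the first N values, then re-normalize so cosine and dot product stay comparable. The sketch below uses a made-up 8-value vector as a stand-in for a real embedding; it is only meaningful for models that were actually trained with MRL.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and re-normalize to unit length.

    Only valid for embeddings from a model trained with Matryoshka (MRL).
    """
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Hypothetical 8-D embedding standing in for a 3072-D one.
full = [0.5, -0.4, 0.3, 0.2, 0.1, -0.05, 0.02, 0.01]
short = truncate_embedding(full, 4)
print(len(short))  # 4
```

Some APIs can also return the shorter vector directly, so check the provider's documentation before truncating by hand.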

✔️ Higher dimensionality = more nuances, but more expensive and slower
✔️ For most projects, 768–1536 dimensions is optimal
✔️ Matryoshka (MRL): you can truncate vectors without re-generation

Section conclusion: Dimensionality is a trade-off between quality and resources; start with 768–1536 and increase only if there's a specific search quality problem.

📌 Where embeddings are used: 5 real-world applications

Where embeddings are needed

Embeddings are the foundation of semantic search, RAG systems, recommendations, zero-shot classification, and duplicate detection. Any task that requires comparing or grouping texts by meaning needs embeddings.

If keyword search is like searching by address, then embedding search is like searching for what's inside the house.

1. Semantic search

Traditional search looks for exact word matches. Semantic search finds documents by meaning: a query "how to cancel subscription" finds an article "instructions for tariff cancellation" — even if there are no common words. This is how internal searches in Notion, Confluence, and corporate knowledge bases work.

2. RAG (Retrieval-Augmented Generation)

In RAG systems, embeddings are used twice: during indexing (document → vector → store in vector DB) and during search (query → vector → find similar documents → LLM response). Without quality embeddings, RAG returns irrelevant fragments, and even the best LLM will hallucinate. More details in the article LLM vs RAG in 2026 and the complete RAG guide.
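The search half of this loop can be sketched in a few lines. Here the "vector DB" is just a Python list, and the vectors are invented by hand; in a real system both the chunk vectors and the query vector would come from the same embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (score, chunk_text) pairs most similar to the query.

    `index` is a list of (chunk_text, vector) pairs built at indexing time.
    """
    scored = [(cosine(query_vector, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical pre-computed vectors standing in for real embeddings.
index = [
    ("How to cancel a subscription", [0.9, 0.1, 0.0]),
    ("Borscht recipe", [0.0, 0.2, 0.9]),
    ("Refund policy", [0.7, 0.6, 0.1]),
]
query_vector = [0.8, 0.3, 0.0]

hits = top_k(query_vector, index)
print([text for _, text in hits])
# ['How to cancel a subscription', 'Refund policy']
```

The retrieved chunk texts are what gets pasted into the LLM prompt as context.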

3. Recommendation systems

"You might like..." is often embeddings in action. The article you're reading is converted into a vector. The system finds other articles with similar vectors and suggests them. The same applies to streaming services: a movie you watched → vector → search for similar ones by mood and theme, not just genre.

4. Zero-shot classification

Traditional classification requires thousands of labeled examples for each class. With embeddings, you can classify texts without any examples: calculate the vector of the text and compare it with the vectors of class names. "Is this a complaint or praise?" — calculate the cosine similarity with the vectors of the words "complaint" and "praise" and see which one is closer.
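A minimal sketch of that comparison, with hand-made vectors in place of real embeddings of the text and of the class names:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(text_vector, label_vectors):
    """Return the label whose vector is most similar to the text vector."""
    return max(label_vectors, key=lambda label: cosine(text_vector, label_vectors[label]))


# Hypothetical vectors; a real system would embed the label names and the text.
label_vectors = {"complaint": [0.9, 0.1], "praise": [0.1, 0.9]}
review_vector = [0.8, 0.3]

print(classify(review_vector, label_vectors))  # complaint
```

Short descriptive phrases as labels often give an embedding model more to work with than single words.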

5. Duplicate detection and cross-lingual search

Embeddings find nearly identical texts even if they are paraphrased or written in different languages. This is useful for knowledge base deduplication, plagiarism detection, or (especially important for Ukrainian content) searching when the query is in one language but the documents exist in another. A multilingual embedding model places "договір оренди" (Ukrainian) and "rental agreement" (English) in adjacent points in space.

✔️ Semantic search: by meaning, not by words
✔️ RAG: the foundation of retrieval quality
✔️ Recommendations, classification, deduplication, cross-lingual search

Conclusion: Embeddings are not a standalone technology, but a horizontal tool: they appear everywhere AI needs to understand or compare text meaning.

📌 Limitations of embeddings: when they won't help

Where embeddings break down

Embeddings work poorly with exact numbers, dates, product codes, names, and short queries of 1-2 words. The model "blurs" exact values into general meaning — and two different product codes might get almost identical vectors. For such tasks, keyword search (BM25) or metadata filtering is needed.

An embedding answers "what is this text about?", but not "what is the exact value in row 7?"

Problem 1: exact numbers, dates, IDs

The queries "contract №4521" and "contract №4522" are almost the same to an embedding model: both are "about a contract with a number." A one-digit difference that matters a great deal to a human barely changes the vector. The same applies to dates: for "what happened on February 15, 2023?", the embedding doesn't "know" it's a specific date and not just "February."

Problem 2: proper nouns and terms

Rare proper nouns, abbreviations, product names, or technical terms are often poorly represented in the embedding space. If a name appeared rarely in the training data, its vector will be "blurry" and unreliable.

Problem 3: short queries

A query like "RAG" or "Spring" provides too little context for a quality vector. The model doesn't know what you're interested in: RAG as a technology, RAG as a movie title, or Spring as a season. The vector will be averaged and fuzzy. The solution is to fall back to keyword search for short queries, as described in the article Vector Search for Beginners.
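One simple way to implement such a fallback is a heuristic router that runs before the search. The rules below (fewer than three words, or any digit in the query) are assumptions chosen for illustration; tune them on your own query logs.

```python
import re


def choose_search_mode(query, min_words=3):
    """Route a query to 'keyword' or 'vector' search using simple heuristics.

    The thresholds are illustrative assumptions, not values from a library.
    """
    words = query.split()
    has_digits = bool(re.search(r"\d", query))  # IDs, dates, product codes
    if len(words) < min_words or has_digits:
        return "keyword"
    return "vector"


print(choose_search_mode("RAG"))                                   # keyword
print(choose_search_mode("contract №4521"))                        # keyword
print(choose_search_mode("how to cancel a service subscription"))  # vector
```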

What to do: hybrid search

For production systems, the right answer is to combine semantic search with traditional keyword search (BM25). Hybrid search provides a +15–40% quality improvement compared to pure vector search and solves all three problems above. More details in the complete RAG guide.
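The article does not say how the two result lists should be merged. One common choice is Reciprocal Rank Fusion (RRF), sketched below; the document IDs are made up, and k=60 is the constant conventionally used with RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 results from each retriever, best first.
bm25_hits = ["doc_4521", "doc_17", "doc_8"]
vector_hits = ["doc_17", "doc_8", "doc_4521"]

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc_17', 'doc_4521', 'doc_8']
```

RRF works on ranks only, so there is no need to put BM25 scores and cosine scores on a common scale.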

✔️ Exact numbers, IDs, dates → keyword search or metadata filters
✔️ Short queries → fallback to BM25
✔️ Production → hybrid search (BM25 + vector)

Conclusion: Embeddings are a powerful tool for semantic tasks, but they don't replace exact search; know their limits and combine approaches.

Common mistakes with embeddings and how to avoid them

Mistake	Symptom	Cause	Solution
Bad embeddings → bad RAG	LLM gives irrelevant or hallucinated answers, even though the model is good	The problem isn't with the LLM — retrieval is returning incorrect chunks. The embedding model is not suitable for your content type	Log the cosine similarity score of real queries. If the top 3 results have a score < 0.6, change the model, not the LLM
Different domains → need a different model	A general model searches well for news but poorly for medical or legal documents	The model was trained on a general web corpus. Specialized terminology is poorly represented or absent from the training data	For medicine — BiomedBERT or PubMedBERT. For code — voyage-code-3. For legal content — fine-tune a general model on your corpus
Multilingual issues	A Ukrainian query doesn't find English documents. Or vice versa — it finds them, but they are irrelevant	The model is not truly multilingual: OpenAI small was trained primarily on an English corpus, with lower quality for Cyrillic	Replace with BGE-M3 or Cohere embed-v4 — both were trained on 100+ languages with equal quality for Latin and Cyrillic scripts
Incorrect metric in Vector DB	Search returns strange results even with the correct model	The Vector DB is set to Cosine, but the model is optimized for Dot Product (or vice versa)	Check the model's documentation. Normalize vectors — then Cosine = Dot Product, and the error disappears
Mixed vectors from different models	Search returns complete nonsense after migrating or updating the model	Two different models' vectors are stored in the same Vector DB — their spaces are incompatible	When changing models, perform a full re-indexing of the database. Store the model name along with each vector as metadata
Missing prefix in E5	E5-large gives poor results despite good benchmarks	E5 expects an explicit prefix: `"query: "` for queries and `"passage: "` for documents. Without it, quality drops by 5–15%	Add the prefix programmatically before each embedding: `f"query: {user_query}"` and `f"passage: {document_chunk}"`
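The prefix fix from the last row is small enough to wrap in two helpers so it cannot be forgotten. This sketch only builds the strings; the helper names are illustrative, and the actual embedding call depends on the library you use.

```python
def e5_query(text):
    """Format a user query the way E5 models expect it."""
    return f"query: {text.strip()}"


def e5_passage(text):
    """Format a document chunk the way E5 models expect it."""
    return f"passage: {text.strip()}"


print(e5_query("  how to cancel a subscription?  "))
# query: how to cancel a subscription?
print(e5_passage("Instructions for unsubscribing from a plan"))
# passage: Instructions for unsubscribing from a plan
```

Use `e5_passage` at indexing time and `e5_query` at search time.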

💼 2026 Models: Where to Start

Which Model to Choose

To start, use OpenAI text-embedding-3-small ($0.02/1M tokens) or nomic-embed-text locally via Ollama (free). For multilingual content with Cyrillic, use BGE-M3 or Cohere embed-v4, as they are significantly more accurate on non-Latin languages.

Don't spend weeks choosing the "perfect" model. Start, log real results, and only replace if you see specific problems.

This section is a navigation guide. A full comparison of 10+ models with prices, benchmarks, and a selection matrix can be found in the anchor article Embedding Models for RAG in 2026. Here are three scenarios for a quick start.

Scenario 1: Local Development Without Costs

nomic-embed-text via Ollama — free, 768 dimensions, 8K context window. Runs with a single command: ollama pull nomic-embed-text. Suitable for prototypes and learning, your data never leaves your computer.

Scenario 2: API Start with Minimal Costs

text-embedding-3-small from OpenAI — $0.02 per million tokens, 1536 dimensions. For a blog with 500 articles, this is less than $1 per month. The broadest integration ecosystem, the easiest start via API. Downside: quality on Ukrainian is lower than on English.

Scenario 3: Multilingual Content with Cyrillic

BGE-M3 (open-source, self-hosted) or Cohere embed-v4 ($0.10/1M tokens) — both support 100+ languages with equal quality for Latin and Cyrillic. BGE-M3 also supports hybrid search "out of the box": dense + sparse vectors in one model.

Scenario	Model	Price	Dimensionality
Local / Free	nomic-embed-text (Ollama)	$0	768
API Start	text-embedding-3-small	$0.02/1M	1536
Multilingual / Cyrillic	BGE-M3 or Cohere embed-v4	$0 / $0.10/1M	1024

✔️ Start simple — nomic or OpenAI small
✔️ Multilingual content → BGE-M3 or Cohere
✔️ Details and benchmarks → anchor article

My recommendation: model choice is a less critical decision than the quality of your data and your chunking strategy; start fast and optimize on real queries.

❓ Frequently Asked Questions (FAQ)

How does an embedding differ from a token?

A token is a unit of text (a word or part of it) that AI identifies when breaking down text. An embedding is a numerical vector that encodes the meaning of that token or an entire sentence. Tokens are at the syntax level, embeddings are at the semantics level. More details on tokens can be found in the article about tokens.

Do I need to re-index the database when changing the embedding model?

Yes, absolutely. Different models generate vectors in incompatible spaces: a vector from OpenAI and a vector from BGE-M3 cannot be compared to each other. If you change the model, you need to regenerate all vectors from scratch. This is precisely why the initial model choice is an important architectural decision.
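A cheap safeguard is to store the model name with every vector and refuse to search when it does not match the model used for the query. A minimal sketch, assuming each stored record is a dict with a "model" key (the record layout and function name are illustrative):

```python
def check_model_consistency(query_model, stored_records):
    """Raise if any stored vector was produced by a different embedding model."""
    other_models = {record["model"] for record in stored_records} - {query_model}
    if other_models:
        raise ValueError(
            f"index contains vectors from other models: {sorted(other_models)}; "
            "re-index before searching"
        )


# Hypothetical metadata stored next to each vector.
records = [{"id": 1, "model": "bge-m3"}, {"id": 2, "model": "bge-m3"}]
check_model_consistency("bge-m3", records)  # passes silently
```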

Can I use one embedding for search and classification?

Technically, yes, but the quality will be lower. Some models (e.g., Cohere embed-v4 or Jina v5) support "task-specific" modes: you specify the task type (retrieval, classification, matching), and the model optimizes the vector for it. If you have multiple tasks, consider such models.

How can I check the quality of embeddings for my content?

The simplest way: take 10–20 real queries, perform a search, and look at the top 3 results with their cosine similarity scores. If relevant results have scores below 0.6 or irrelevant ones above 0.7, the model is not suitable for your content. Benchmarks (MTEB) are a useful guide, but testing on real data is always more important. For a systematic evaluation approach, see the guide to metrics and evaluation pipelines for RAG.
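This check is easy to automate once the top results have been labeled by hand. The sketch below flags queries that violate the 0.6 and 0.7 rules of thumb from the paragraph above; the data layout and the sample scores are invented for illustration.

```python
def flag_queries(results, min_relevant=0.6, max_irrelevant=0.7):
    """Return queries whose top hits suggest the model fits the content poorly.

    `results` maps each query to a list of (score, is_relevant) pairs,
    where is_relevant is a manual judgment of the retrieved chunk.
    """
    flagged = []
    for query, hits in results.items():
        low_relevant = any(rel and score < min_relevant for score, rel in hits)
        high_irrelevant = any(not rel and score > max_irrelevant for score, rel in hits)
        if low_relevant or high_irrelevant:
            flagged.append(query)
    return flagged


# Hypothetical manually labeled results for two test queries.
results = {
    "cancel subscription": [(0.82, True), (0.55, False)],
    "contract 4521": [(0.52, True), (0.75, False)],
}
print(flag_queries(results))  # ['contract 4521']
```

Typical score ranges differ between models, so treat the thresholds as a starting point rather than fixed limits.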

Can I create embeddings for images?

Yes. Multimodal models (e.g., Google Gemini Embedding 2) generate vectors for both text and images in the same space. This allows you to search for images by text description or vice versa, and find images that are similar in meaning, not just pixels.

✅ Conclusions

🔹 Embedding is the conversion of text into a numerical vector, where the proximity of numbers reflects the proximity of meaning.
🔹 The model learns through contrastive learning on billions of text pairs — without manual labeling.
🔹 Dimensionality (384–3072) is a trade-off between quality and resources; 768–1536 is sufficient for most tasks.
🔹 Embeddings are the foundation of RAG, semantic search, recommendations, and classification.
🔹 They are not precise with numbers, IDs, and short queries — hybrid search is needed for this.
🔹 For starting: nomic-embed-text locally or text-embedding-3-small via API; for Cyrillic — BGE-M3 or Cohere.

Main takeaway: Embedding is the language AI uses to describe meaning. By understanding how it works, you can build systems that find the right information — not just the right words.

If you want to understand how embeddings are used in search at a mathematical level, read Vector Search for Beginners. If you're ready to choose a specific model, read Embedding Models for RAG in 2026.
