1536 vs 3072 embeddings: comparison for document search and RAG

Briefly
  • Doubling dimensionality (1536 → 3072) doubles RAM and storage and makes ANN search roughly 1.5–2× slower, but improves the MTEB Retrieval score by only ~4.4 percentage points (~8.5% relative) for OpenAI models.
  • On financial documents, BM25 outperforms text-embedding-3-large in Precision@5. Lexical search remains important.
  • Hybrid retrieval + reranker achieves Recall@5 = 0.816 compared to 0.587 for pure dense retrieval — a +39% increase without changing embedding dimensionality.
  • By 2026, text-embedding-3-large is no longer the leader: Qwen3-Embedding-8B (75.22 MTEB) and Gemini Embedding 2 (67.71 MTEB Retrieval) surpass it.
  • For most RAG scenarios: start with hybrid retrieval + reranker, measure results, and only then consider changing the model.
Parameter | 1536 (text-embedding-3-small) | 3072 (text-embedding-3-large)
Vector size | ~6 KB | ~12 KB
RAM per 1M chunks | ~6.1 GB | ~12.2 GB
ANN search speed | Faster | ~1.5–2× slower
Embedding API cost | $0.02 / 1M tokens | $0.13 / 1M tokens (6.5×)
MTEB Retrieval (nDCG@10) | 0.5108 | 0.5544 (+8.5%)
pgvector HNSW index | ✅ Supported | ⚠️ Not supported in GCP
Suitable for most RAG | ✅ Yes | ⚠️ Not always justified

When a team builds a RAG pipeline and retrieval returns irrelevant results, the typical first reaction is: "we need a larger embedding model." The logic is understandable: more dimensions → more information → better search. But the data says otherwise.

This article is a technical breakdown for developers designing RAG under specific constraints of RAM, cost, and quality. We look at the numbers: MTEB benchmarks, storage costs, latency, and compare alternatives — particularly reranking, which in most scenarios provides a greater increase in retrieval quality at a lower cost than switching to a 3072-dimensional model.

If you are still at the stage of choosing a document parsing approach before vectorization, first read the previous articles in the series: "How OCR Affects RAG System Quality" and "Vision RAG vs OCR: Comparing Approaches". Poor input data nullifies any advantage from a better embedding model.

What is an embedding vector and what does its dimensionality mean

An embedding model converts text into a numerical vector: a fixed-length list of numbers. This vector is the "coordinate" of the text in an n-dimensional semantic space. Texts with similar meanings have close vectors (high cosine similarity); texts with different meanings are far apart.

Vector dimensionality is the count of numbers in this list. text-embedding-3-small returns 1536 numbers per input text. text-embedding-3-large returns 3072.
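
To make "close vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The 4-dimensional vectors are made-up values standing in for real 1536- or 3072-dimensional embeddings, and the function name is illustrative rather than part of any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (made-up numbers, not model output).
contract_end = [0.9, 0.1, 0.3, 0.0]
contract_termination = [0.8, 0.2, 0.4, 0.1]
weather_report = [0.0, 0.9, 0.0, 0.5]

similar = cosine_similarity(contract_end, contract_termination)
unrelated = cosine_similarity(contract_end, weather_report)
print(f"similar pair: {similar:.3f}, unrelated pair: {unrelated:.3f}")
```

The same computation applies at any dimensionality; only the length of the lists, and therefore the number of multiplications, changes.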

What dimensionality measures

Dimensionality | What it theoretically provides | What it means in practice
Higher (3072+) | A larger "alphabet" for encoding semantic differences | Potentially better separation between similar but distinct concepts
Lower (768–1536) | A less detailed space | Sufficient for most practical queries; twice as cheap to store and search
Very low (256–512) | Compressed representation | Fast and cheap; quality degrades on complex semantic tasks

Important: dimensionality is a property of the model's architecture and training, not simply "more is better." Research on production RAG systems shows that reducing from 1536 to 384 dimensions in some scenarios does not result in a measurable drop in retrieval accuracy — while halving latency and reducing storage by 75%.

The cost of choosing dimensionality: specific numbers

100,000 chunks

  • 1536 dimensions → ~600 MB RAM
  • 3072 dimensions → ~1.2 GB RAM (+2×)

1,000,000 chunks

  • 1536 dimensions → ~6.1 GB RAM
  • 3072 dimensions → ~12.2 GB RAM (+2×)

10,000,000 chunks

  • 1536 dimensions → ~61 GB RAM
  • 3072 dimensions → ~122 GB RAM — a different server tier, a different budget

The growth is linear: double the dimensionality, double all storage and search costs. The formula for raw vector memory is simple: N chunks × dimensionality × 4 bytes (float32). Index structures such as the HNSW graph add overhead on top of this.
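
The formula can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, gigabytes are taken as decimal (10^9 bytes), and only raw float32 vectors are counted, so the printed values may differ from the rounded figures above in the last digit.

```python
def raw_vector_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: N chunks x dimensionality x 4 bytes (float32)."""
    return num_chunks * dimensions * bytes_per_value


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


for chunks in (100_000, 1_000_000, 10_000_000):
    small = to_gb(raw_vector_bytes(chunks, 1536))
    large = to_gb(raw_vector_bytes(chunks, 3072))
    print(f"{chunks:>10,} chunks: 1536d = {small:.1f} GB, 3072d = {large:.1f} GB")
```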

Matryoshka Representation Learning (MRL): use large, store less

text-embedding-3-large and text-embedding-3-small support MRL, an approach where the first dimensions of the vector carry the most general semantic signal and each subsequent one adds detail. This allows the vector to be truncated after generation without retraining the model.

In practice, you request a shorter vector from the API with the dimensions parameter (for example 256, 512, 1024, or 1536 instead of the full 3072) and store only that. Quality does not drop proportionally; the relationship is non-linear:

Stored dimensionality | MTEB Retrieval (nDCG@10) | Relative to full 3072 | RAM per 1M chunks
3072 (full) | 0.5544 | 100% | ~12.2 GB
1536 | 0.5453 | 98.4% | ~6.1 GB (−50%)
1024 | 0.5390 | 97.2% | ~4.1 GB (−67%)
512 | 0.5233 | 94.4% | ~2.0 GB (−83%)
256 | 0.4969 | 89.6% | ~1.0 GB (−92%)

Reducing from 3072 to 1536 dimensions cuts RAM by 50% at a quality loss of only 1.6%. Going down to 1024 cuts RAM by 67% at a loss of 2.8%. This is the most practical compromise for most scenarios.

Important: MRL only works with models explicitly trained with this technique (OpenAI's text-embedding-3-*, some Cohere and Nomic models). Truncating a vector from an arbitrary model degrades quality unpredictably.
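
If you already store full-length vectors from an MRL-trained model and want to shorten them locally instead of re-calling the API, the usual approach is to keep the leading dimensions and re-normalize to unit length so that cosine and dot-product scores stay comparable. The sketch below assumes that approach; the function name and the 8-dimensional toy vector are hypothetical.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values of an MRL-trained embedding and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 8-dimensional stand-in for a 3072-dimensional embedding (made-up values).
full = [0.5, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.3]
short = truncate_and_normalize(full, 4)
print(len(short), math.sqrt(sum(x * x for x in short)))
```

Whichever route you take, index and query vectors must use the same stored dimensionality.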

How dimensionality affects semantic search quality

The theoretical advantage of higher dimensionality is not always realized uniformly. The impact depends on three factors: the nature of the documents, the type of queries, and the presence of other pipeline components.

Where dimensionality truly matters

Scenario | Why higher dimensionality helps | Practical effect
Subtle semantic differences | "Termination of contract" vs "cessation of contract": similar words, different legal meaning | Fewer false positives in retrieval on legal documents
Multilingual content | Larger space for encoding cross-lingual semantic relationships | Better cross-lingual retrieval
Technical documentation with narrow terminology | Technical terms are often OOV or rare; a larger space provides a better position | Slight improvement on specialized corpora

Where dimensionality hardly matters

Scenario | Why dimensionality doesn't solve it
FAQs and short answers | Queries and documents are semantically straightforward; 768 dimensions are more than sufficient
Texts corrupted by OCR artifacts | Noise in the text shifts the vector regardless of dimensionality; the problem lies in the input data
Precise numerical queries | "Amount of 47,500 UAH": dense retrieval is inferior to BM25; dimensionality won't help here
Very small collections (< 10,000 chunks) | With a small collection, the difference between models is negligible; any will find the correct chunk

text-embedding-3-small (1536) vs text-embedding-3-large (3072): real differences

MTEB benchmark: the numbers

MTEB (Massive Text Embedding Benchmark) is a standard benchmark for comparing embedding models. For RAG, the most relevant metric is MTEB Retrieval (nDCG@10).

Model | Dimensionality | MTEB Retrieval nDCG@10 | Cost (per 1M tokens)
text-embedding-3-small | 1536 | 0.5108 | $0.02
text-embedding-3-large | 3072 | 0.5544 | $0.13
Difference | 2× (size) | +4.36 pp (~8.5%) | 6.5× more expensive

Source: ICLERB Benchmark, MTEB Leaderboard November 2024.

An 8.5% nDCG@10 increase is a real difference, but context matters: this is a benchmark on a general set of tasks, and on your specific corpus the difference may be larger or smaller. It also comes at 6.5× the embedding cost and 2× the storage cost.

Market context: OpenAI is no longer the leader

text-embedding-3-large has not been updated since January 2024. According to the MTEB Leaderboard April 2026, the landscape has significantly changed:

Model | Dimensionality | MTEB (Eng v2) | MTEB Retrieval | License
text-embedding-3-large | 3072 | 66.43 | 55.44 | Proprietary
Gemini Embedding 2 | 3072 | - | 67.71 | Proprietary
Qwen3-Embedding-8B | 4096 | 75.22 | - | Apache 2.0 (self-hosted)
Qwen3-Embedding-4B | 2560 | 74.60 | - | Apache 2.0 (self-hosted)
Qwen3-Embedding-0.6B | 1024 | 70.70 | - | Apache 2.0 (self-hosted)

Qwen3-Embedding-0.6B with 1024 dimensions (70.70 MTEB) surpasses text-embedding-3-large with 3072 dimensions (66.43 MTEB) — with lower dimensionality and full self-hosted availability. This echoes the main message of the article: dimensionality ≠ quality.

Practical difference on real tasks

Task | small (1536) | large (3072) | Difference in practice
Direct FAQ queries ("what is the price of X?") | Good | Good | Negligible
Semantic queries on legal texts | Satisfactory | Better | Noticeable, but inferior to BM25 on exact terms
Technical documentation, narrow terminology | Satisfactory | Noticeably better | Real difference in deep technical retrieval
Precise numerical queries | Poor | Poor | Both are inferior to BM25; dimensionality doesn't solve it
Multilingual content (uk/en mix) | Satisfactory | Better | Noticeable for cross-lingual queries

Impact on RAM, Disk Space, and Search Speed

The cost of storing vectors is linear with dimensionality. Each float32 takes 4 bytes. 1536 dimensions × 4 = 6,144 bytes (~6 KB) per vector. 3072 dimensions is exactly double.

Specific Numbers for Typical Collections

Collection Size | 1536 Dimensions (RAM) | 3072 Dimensions (RAM) | Difference
100,000 chunks | ~0.6 GB | ~1.2 GB | +0.6 GB
1,000,000 chunks | ~6.1 GB | ~12.2 GB | +6.1 GB
10,000,000 chunks | ~61 GB | ~122 GB | +61 GB (a different server tier)

Source: Qdrant documentation; arXiv 2505.00105 — Optimization of embeddings storage for RAG.

Managed Vector DB Costs at Scale

According to the Vector DB Costs 2026 analysis, each embedding call to text-embedding-3-large generates a 3072-dimensional vector, so 10M documents yield about 120 GB of raw vector data before indexing. Managed Pinecone costs for 10M vectors at 3072 dimensions are ~$70–120+/month, versus $35–65 for 1536 dimensions.

For Qdrant: 1M vectors of 1536 dimensions require at least a 4 GB RAM tier (~$36/month on Qdrant Cloud). At 3072 dimensions, the same volume requires an 8 GB RAM tier (~$72/month).

Search Speed (ANN Latency)

Metric | 1536 Dimensions | 3072 Dimensions | Detail
Cosine similarity computation | Baseline | +100% operations | Dot product of 3072 float32s is twice as expensive
HNSW index build time | Baseline | ~1.5–2× longer | Larger graph; more memory during build
Query latency (p95, 1M vectors) | ~5–15 ms | ~10–30 ms | Approximate estimates; depends on hardware and index params
Embedding generation (API call) | Baseline | ~1.2–1.5× longer | Larger API response; larger payload

Practical Advice: pgvector and 3072 Dimensions

If you are using pgvector, note an important limitation: pgvector cannot build HNSW indexes on its standard vector type above 2,000 dimensions, and GCP PostgreSQL is no exception. text-embedding-3-large at 3072 dimensions on pgvector therefore means brute-force search without an index, a catastrophic latency degradation on collections of 100,000+ chunks. Either reduce to 1536 via the API's dimensions parameter, or switch to Qdrant/Weaviate.

Comparison on Typical RAG Scenarios: Legal Documents, Technical Documentation, FAQs

Scenario 1: Legal Documents

Legal texts are a complex case for dense retrieval. They contain both precise numerical values (sums, dates, articles) and subtle semantic differences (responsibility vs. obligation).

Query Type | Optimal Approach | Why
"What is the penalty amount for delay?" | BM25 or hybrid | Precise terms and numbers; BM25 outperforms dense retrieval on financial documents for Precision@5
"What are the consequences of breaching the contract terms?" | Dense (1536 or 3072) + reranker | Semantic query; dense better understands synonyms and paraphrasing
"Customer's liability in a construction contract" | Hybrid (BM25 + dense) + reranker | Combination of precise terms and semantics; hybrid provides the best Recall

Key takeaway for legal documents: the difference between 1536 and 3072 in the dense component is minimal compared to the effect of adding BM25 in a hybrid pipeline. Research on financial documents from T2-RAGBench (2026) shows that hybrid + reranker yields Recall@5 = 0.816 versus 0.587 for pure dense retrieval with text-embedding-3-large — a +39% improvement without changing the embedding model.

Scenario 2: Technical Documentation

Technical documentation (API references, specifications, RFCs) contains highly specialized terminology. Here, a 3072-dimensional model provides a more noticeable effect — narrow terms are encoded more precisely in a larger space.

However, there's an important nuance here too: if the documentation contains exact names (functions, parameters, versions), BM25 or hybrid remains the better option for precise queries. The dense component helps with "how to do X" queries.

Scenario 3: FAQs and Helpdesk

FAQs are the simplest scenario for embedding models. Queries and answers are semantically straightforward. For FAQ and helpdesk systems, even a 384-dimensional model provides acceptable quality — the difference between 1536 and 3072 on this data is minimal.

For FAQs, the optimal strategy is to start with text-embedding-3-small or even BGE-small (384 dimensions), measure Recall@5 on a test set of questions, and only if there's a measurable deficit, move to a larger model.
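
Measuring Recall@5 on a test set needs very little code. The sketch below assumes you have, for each test question, a ranked list of retrieved chunk IDs and a hand-labeled set of relevant chunk IDs; the function names and the sample data are illustrative only.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5
) -> float:
    """Average Recall@k over all labeled questions; unanswered questions score 0."""
    scores = [recall_at_k(results.get(question, []), rel, k) for question, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labeled test set and retrieval output.
gold = {"q1": {"c1"}, "q2": {"c7", "c9"}}
results = {
    "q1": ["c3", "c1", "c4"],
    "q2": ["c9", "c2", "c5", "c6", "c8", "c7"],  # "c7" is ranked 6th, outside the top 5
}
print(mean_recall_at_k(results, gold, k=5))
```

Running the same test set against each candidate model gives a direct comparison on your own data instead of a general benchmark.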

Overall Table by Scenario

Scenario | Recommended Model | Reranker Needed? | BM25 Important?
FAQ / helpdesk (< 50K chunks) | text-embedding-3-small or BGE-small (384) | Optional | No
General corporate documents | text-embedding-3-small (1536) | Yes, recommended | Yes, hybrid
Legal documents | text-embedding-3-small + hybrid + reranker | Mandatory | Yes, mandatory
Technical documentation (narrow terminology) | text-embedding-3-large (3072) or Qwen3-Embedding-8B | Yes | Yes for precise queries
Multilingual archive (uk + en) | text-embedding-3-large or Qwen3-Embedding (multilingual) | Yes | Yes
Large archive (> 5M chunks), limited budget | Qwen3-Embedding-0.6B (1024 dimensions, self-hosted) | Yes | Yes

Why Model Quality Matters More Than Dimensionality

Comparing text-embedding-3-small (1536) and text-embedding-3-large (3072) is a comparison of two products from the same vendor. But the real picture is broader: there are models with lower dimensionality and higher quality.

As already mentioned, Qwen3-Embedding-0.6B (1024 dimensions, MTEB 70.70) outperforms text-embedding-3-large (3072 dimensions, MTEB 66.43) on the overall MTEB benchmark despite its lower dimensionality. It is licensed under Apache 2.0 and fully self-hostable.

Factors Determining Embedding Quality Beyond Size

Factor | Impact | What to Do
Quality of model's training data | Critical: determines where vectors "live" in space | Check if the model was trained on domain-specific data (legal, medical, etc.)
Quality of input text | Critical: OCR artifacts shift the vector regardless of dimensionality | Post-process text before embedding; this has higher priority than model selection
Chunk length and chunking strategy | Significant: too large a chunk dilutes the signal; too small loses context | Test 256, 512, 1024 tokens with overlap on your data
Presence of hybrid retrieval (BM25 + dense) | Significant: hybrid + reranker provides >30% Recall increase compared to pure dense | Implement hybrid before switching to a larger model
Reranker | Significant: cross-encoders are much more accurate than bi-encoders at the same scale | Add a reranker before changing the embedding model
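
For the chunking row, a sliding window with overlap is one simple way to produce the variants to test. This is a sketch under two assumptions: tokens are approximated by whitespace-separated words (a real pipeline would use the embedding model's tokenizer), and the overlap of one eighth of the chunk size is an arbitrary illustrative choice.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace split as a rough stand-in for real tokenization (assumption).
text = " ".join(f"word{i}" for i in range(2000))
tokens = text.split()
for size in (256, 512, 1024):
    pieces = chunk_tokens(tokens, chunk_size=size, overlap=size // 8)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```

Each variant is then embedded and scored on the same test questions, so the chunk size is chosen by measurement rather than by default.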

The "FAISS Hybrid Paradox": When a Larger Model is Worse

The "Rethinking Hybrid Retrieval" study (2025) revealed a counterintuitive result: MiniLM-v6 (a compact model) consistently outperforms BGE-Large when integrated with LLM-based reranking in tri-modal hybrid retrieval. The difference: up to 23.1% better nDCG@10 and 36.5% better nDCG@1 in the financial domain — with 93% fewer parameters and 63% smaller embeddings.

The reason is the "FAISS Hybrid Paradox": the embedding space of large models may not align with the relevance criteria of an LLM reranker. A larger model creates a more "stretched" space that the LLM reranker re-ranks less effectively than a compact one.

Conclusion: if your pipeline includes a reranker, it's not guaranteed that a larger embedding model will improve results. Test on your own data.

Reranking as an Alternative to Increasing Dimensionality

A reranker (or cross-encoder) is the second stage of retrieval. In the first stage, an embedding model finds top-k candidates based on cosine similarity (fast, but not precise). In the second stage, the reranker re-ranks these candidates, evaluating each (query, document) pair together (slow, but much more precise).

Why a Reranker is More Precise Than an Embedding

A bi-encoder (embedding model) encodes the query and document independently. This allows for fast ANN search, but means the model does not "see" the query and document simultaneously during encoding.

A cross-encoder (reranker) encodes the (query + document) pair together. Self-attention "sees" both texts simultaneously and can evaluate subtle semantic interactions. This is computationally much more expensive — but also much more precise.
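
The two stages can be sketched as follows. This is a toy illustration, not a production pipeline: vectors are assumed to be unit-length so that the dot product equals cosine similarity, the corpus and its 2-dimensional vectors are made up, and word_overlap is only a stand-in for a real cross-encoder, which would score each (query, document) pair with a model.

```python
from typing import Callable

# (chunk_id, text, unit-length embedding)
Chunk = tuple[str, str, list[float]]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def word_overlap(query: str, text: str) -> float:
    """Toy pair scorer standing in for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_then_rerank(
    query_text: str,
    query_vec: list[float],
    corpus: list[Chunk],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Stage 1: top-k by vector similarity. Stage 2: re-order those candidates by pair score."""
    candidates = sorted(corpus, key=lambda c: dot(query_vec, c[2]), reverse=True)[:first_stage_k]
    reranked = sorted(candidates, key=lambda c: rerank_score(query_text, c[1]), reverse=True)
    return [chunk_id for chunk_id, _, _ in reranked[:final_k]]


corpus: list[Chunk] = [
    ("a", "penalty for late delivery", [1.0, 0.0]),
    ("b", "contract termination terms", [0.6, 0.8]),
    ("c", "penalty amount for delay", [0.0, 1.0]),
]
query_text, query_vec = "penalty amount for delay", [0.8, 0.6]

print(retrieve_then_rerank(query_text, query_vec, corpus, word_overlap, first_stage_k=2, final_k=2))
print(retrieve_then_rerank(query_text, query_vec, corpus, word_overlap, first_stage_k=3, final_k=3))
```

With first_stage_k=2 the best textual match ("c") never reaches the second stage, so the reranker cannot return it; widening the first stage to 3 lets it rise to the top.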

Real Numbers: What a Reranker Provides

Configuration | Recall@5 | MRR@3
Dense retrieval (text-embedding-3-large) | 0.587 | -
BM25 | 0.644 | -
Hybrid RRF (BM25 + dense) | 0.695 | 0.433
Hybrid RRF + Cohere Rerank v4 | 0.816 | 0.605

Source: T2-RAGBench (2026), financial documents.

Hybrid + reranker yields Recall@5 = 0.816 versus 0.587 for pure dense — a +39% improvement without changing the embedding model. For comparison, switching from text-embedding-3-small to large provides ~8.5% improvement on MTEB Retrieval.
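
The "Hybrid RRF" rows refer to Reciprocal Rank Fusion, which merges the BM25 and dense rankings using only rank positions, so the two score scales never need to be calibrated against each other. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs; k = 60 is the constant commonly used for RRF, and the sample rankings are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever.
bm25_ranking = ["c3", "c1", "c2"]
dense_ranking = ["c1", "c4", "c3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Chunks that both retrievers rank highly ("c1", "c3") end up ahead of chunks found by only one of them, and the fused list is what gets passed to the reranker.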

Popular Reranker Solutions

Reranker | Type | Feature | Suitable For
Cohere Rerank v4 | Cloud API | Production-ready, high quality on document tasks; Hit Rate 0.932 when combined with JinaAI in the LlamaIndex benchmark | Cloud-based systems with API access
bge-reranker-large | Open-source, self-hosted | Strong cross-encoder; integrates with LangChain, LlamaIndex | Self-hosted, GDPR
Qwen3-Reranker-4B | Open-source, self-hosted | +2.6% recall compared to 0.6B version; Apache 2.0 | Self-hosted with GPU
Voyage AI Rerank | Cloud API | Competitive with Cohere; several options at different price points | Cloud-based systems

When a Reranker Isn't Enough

A reranker only re-ranks what made it into the top-k from the first stage. If a relevant chunk didn't even make it into the top-50 candidates, the reranker won't help. In this case, a better embedding model or hybrid retrieval with a larger k is needed.

Also important: a reranker adds latency (~100–500 ms for 50 candidates). For real-time systems with end-to-end requirements of <500 ms, this can be a limitation.

Selection Matrix: Collection Size × Accuracy × Budget

Collection Size | Accuracy Requirements | Budget | Recommendation
< 50K chunks | Basic | Any | text-embedding-3-small (1536) or BGE-small (384). With a small collection, the difference between models is minimal.
< 50K chunks | High | Any | text-embedding-3-small + Cohere Rerank or bge-reranker-large. Reranker will provide more benefit than switching to 3072.
50K – 1M chunks | Basic | Limited | text-embedding-3-small (1536) + hybrid (BM25 + dense). Cheaper and more effective than large.
50K – 1M chunks | High, technical domain | Medium | text-embedding-3-large (3072) or Qwen3-Embedding-4B (self-hosted) + reranker + hybrid.
> 1M chunks | Any | Limited | Self-hosted: Qwen3-Embedding-0.6B (1024) or BGE-M3 (1024). Storage cost for 3072 dimensions at this scale is critical.
> 1M chunks | Maximum, GDPR | Sufficient (GPU) | Qwen3-Embedding-8B (4096, Apache 2.0, self-hosted) + bge-reranker-large + hybrid. Best MTEB with full self-hosting.
Any size | High, multilingual content (uk+en) | Any | text-embedding-3-large or Qwen3-Embedding (multilingual-strong). For multilingualism, dimensionality is less important than multilingual training.

Decision-Making Algorithm

Question | If yes | If no
Is there a measurable Recall deficit (below target)? | Continue | Don't change anything; the current model is sufficient
Is hybrid retrieval (BM25 + dense) implemented? | Next question | Add BM25 first; measure the effect
Is a reranker implemented? | Next question | Add a reranker; it usually provides more benefit than switching to 3072
Does the deficit remain after hybrid + reranker? | Next question | The problem is not with the embedding model; check text quality and chunking
Collection > 1M chunks? | Consider self-hosted Qwen3-Embedding (better MTEB, lower cost) | Switch to text-embedding-3-large or Qwen3-Embedding; test on real data
GDPR / self-hosting mandatory? | Qwen3-Embedding-0.6B / 4B / 8B (Apache 2.0) | text-embedding-3-large or Gemini Embedding 2
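
The same checklist can be written as a small function, which makes the order of checks explicit. This is one possible reading of the table, not the only one: the last two rows both end in a recommendation, so the sketch assumes a self-hosting requirement is a hard constraint and checks it before collection size. The function and parameter names are illustrative.

```python
def next_step(
    recall_deficit: bool,
    has_hybrid: bool,
    has_reranker: bool,
    deficit_remains: bool,
    self_hosting_required: bool,
    over_1m_chunks: bool,
) -> str:
    """Walk the checklist top to bottom and return the first applicable action."""
    if not recall_deficit:
        return "Keep the current model"
    if not has_hybrid:
        return "Add BM25 (hybrid retrieval) and measure the effect"
    if not has_reranker:
        return "Add a reranker and measure the effect"
    if not deficit_remains:
        return "Not an embedding-model problem; check text quality and chunking"
    # Assumption: self-hosting is treated as a hard constraint and checked first.
    if self_hosting_required:
        return "Qwen3-Embedding-0.6B / 4B / 8B (Apache 2.0)"
    if over_1m_chunks:
        return "Consider self-hosted Qwen3-Embedding"
    return "Try text-embedding-3-large or Qwen3-Embedding; test on real data"


print(next_step(True, True, False, False, False, False))
```

Note that the embedding model is only reconsidered after every cheaper step has been ruled out.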

FAQ: Frequently Asked Questions About Embedding Dimensionality

Is it worth switching from 1536 to 3072?

In my experience, in most cases, no, at least not as a first step. Switching from small to large provides a ~8.5% MTEB Retrieval gain but costs 6.5× more via API and twice as much for storage. Before changing the model, I always check first whether hybrid retrieval (BM25 + dense) and a reranker are in place. Together, these two components provide +39% Recall@5 without changing the dimensionality at all. If the quality deficit remains after this, then yes, switching to 3072 or a better model is justified.

How much does RAM consumption increase?

Exactly double — the dependency is linear. 1M chunks at 1536 dimensions occupy ~6.1 GB RAM, at 3072 — ~12.2 GB. For 10M chunks, this is already the difference between 61 GB and 122 GB — effectively a different server tier and a different infrastructure budget. For small collections (up to 100K chunks), the difference is not critical: 600 MB vs 1.2 GB fits into any modern instance. But at scale, it becomes the first limitation.

Does dimensionality affect pgvector speed?

It does, and there's an important practical pitfall. First, higher dimensions mean more operations when calculating cosine similarity, so query latency increases. Second, and this is critical: pgvector cannot build HNSW indexes on its standard vector type for vectors with more than 2,000 dimensions, and GCP PostgreSQL is no exception. If you use text-embedding-3-large (3072) with pgvector, you get brute-force search without an index, which on collections of 100K+ chunks leads to catastrophic latency degradation. Solution: either reduce to 1536 via the dimensions parameter in the API, or switch to Qdrant or Weaviate.

Are 3072 dimensions needed for small RAG?

No. For collections up to 50K chunks, the difference between 1536 and 3072 is practically unnoticeable: any model will find the correct chunk in a small index. I recommend starting with text-embedding-3-small or even BGE-small (384 dimensions), measuring Recall@5 on a test set of questions, and only then making a decision. Most teams I've worked with found that the problem wasn't dimensionality, but the quality of the input text or the lack of a reranker.

What is more important: model or dimensionality?

The model is always more important. The best example: Qwen3-Embedding-0.6B with 1024 dimensions scores 70.70 MTEB, while text-embedding-3-large with 3072 dimensions scores 66.43. Smaller dimensionality, better result. In my experience, the priority order for RAG optimization looks like this: first, input text quality, then chunking strategy, then hybrid retrieval and reranker — and only after that choosing or changing the embedding model. Dimensionality is the last variable worth tweaking.

Conclusions

The choice between 1536 and 3072 dimensions is not the main decision when designing a RAG pipeline. It ranks much lower in priority than input data quality, chunking strategy, and the presence of hybrid retrieval with a reranker.

Key Takeaways

Thesis | Detail
Higher dimensionality ≠ better retrieval | Qwen3-Embedding-0.6B (1024 dimensions, MTEB 70.70) outperforms text-embedding-3-large (3072 dimensions, MTEB 66.43). Architecture and training quality matter more than the number of dimensions.
Hybrid + reranker > model change | Hybrid RRF + Cohere Rerank: Recall@5 = 0.816 vs 0.587 for pure dense (+39%). This is a greater effect than a small → large switch (~8.5% MTEB).
3072 dimensions cost 2× more in storage and latency | 1M chunks: 6 GB RAM (1536) vs 12 GB (3072). Managed vector DB: ~2× higher tier. At scale > 1M chunks, this is a decisive factor.
text-embedding-3-large is no longer the leader in 2026 | The model has not been updated since January 2024. Gemini Embedding 2 (67.71 MTEB Retrieval), Qwen3-Embedding (Apache 2.0, self-hosted) are better alternatives.
Priority order for RAG optimization | 1. Input text quality → 2. Chunking → 3. Hybrid retrieval → 4. Reranker → 5. Embedding model selection/change.

How these principles are implemented in the product

At AskYourDocs, we apply a hybrid approach: hybrid retrieval (BM25 + dense), post-processing of OCR text before indexing, and routing complex documents to Vision OCR. The choice of embedding model and dimensionality is tailored to the specific archive type — not as a fixed constant, but as a parameter measured on the client's real data.


Article series on OCR, RAG, and document search:
  1. OCR in Modern AI Systems: From Scanned Documents to RAG — general overview for a non-technical audience
  2. How OCR Affects RAG System Quality: A Technical Breakdown — where the pipeline breaks and how to fix it
  3. Vision RAG vs OCR: Comparing Approaches to Document Processing — GPT-4o, Qwen2.5-VL, olmOCR, Docling