Embedding Tier — The Vectors Under Your RAG

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: The embedding model decides what "similar" means for retrieval — and it's nearly always the same answer (text-embedding-3-small) until you have evidence to climb.

In plain English

An embedding model turns a piece of text into a list of numbers, placed so that texts with similar meaning end up near each other — like assigning every sentence a spot on a giant map. When your app searches for relevant documents, it's really asking which spots on that map are closest to the question. This page decides which mapmaker to use, and the honest answer is that the popular default is fine for almost everyone. The one rule you genuinely cannot break: documents and questions must be mapped by the exact same model, or your search quietly returns nonsense.

What's in this tier (as of 2026)

Tier 1 — adopt by default

| Model | Provider | Dims | Cost per 1M tokens | Notes |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | $0.02 | The boring-tech answer; great quality, cheap, fast |
| text-embedding-3-large | OpenAI | 3072 | $0.13 | When 3-small's quality isn't enough; ~10% better on MTEB, ~6x cost |

Tier 2 — reach for when there's evidence

| Model | Provider | Notes |
| --- | --- | --- |
| Voyage v3 / voyage-3-large | Voyage AI | Often beats OpenAI on technical-doc / code corpora; competitive pricing |
| Cohere Embed v3 | Cohere | Strong multilingual; good multimodal options |
| Gemini text-embedding-004 | Google | Free tier; tightly integrated with Google Cloud stack |
| bge-m3 / bge-large | BAAI (open-source) | Self-hostable; good for data-residency requirements |
| nomic-embed | Nomic | Open-source, long-context (8k tokens), commercial-friendly license |

Tier 3 — niche

  • Custom fine-tuned embeddings — only when domain-specific recall matters AND you have ground-truth pairs to train on. Rare for app-shaped AI.
  • Multilingual / domain-specialized models — useful for non-English-heavy corpora or specific domains (medical, legal).

Why embeddings matter (and how much)

The embedding model decides how "close" two texts are in vector space. That determines what gets retrieved in your RAG, which determines what your LLM sees, which determines the answer quality. Embedding quality dominates retrieval quality on tough corpora.

That said, on most corpora the differences between top-tier embeddings are small (roughly 3–8% in retrieval recall). The bigger levers are chunking and hybrid search (see Retrieval quality (Part III)).
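To make "close in vector space" concrete, here is a minimal sketch of cosine similarity, a common closeness measure for embeddings. The 3-dimensional vectors and document names are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.1, 0.9],
}
# Pretend this is the embedding of "how do I get my money back?"
query = [0.8, 0.2, 0.1]

# Retrieval is just "sort documents by similarity to the query".
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, docs[name]),
    reverse=True,
)
print(ranked)
```

A different embedding model would place the same texts at different coordinates, which is why the choice of model changes what gets retrieved.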

The pragmatic rule

Start with text-embedding-3-small. Build the rest of the pipeline. Measure recall on your eval set. If recall is below ~75%, your problem is almost certainly chunking or pure-vector-vs-hybrid, NOT the embedding model. If those are fixed and you're still below 80%, try text-embedding-3-large or Voyage v3.
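"Measure recall on your eval set" can be sketched as below. This assumes one common definition of recall@k (per query, the share of known-relevant chunks that appear in the top k, averaged over queries); the query ids, chunk ids, and retriever output are hypothetical.

```python
def recall_at_k(results, ground_truth, k=5):
    """Mean over queries of (relevant chunks found in top-k) / (relevant chunks total)."""
    scores = []
    for query_id, relevant_ids in ground_truth.items():
        top_k = results.get(query_id, [])[:k]
        found = len(set(top_k) & set(relevant_ids))
        scores.append(found / len(relevant_ids))
    return sum(scores) / len(scores)

# Hypothetical eval set: query id -> chunk ids a human marked as relevant.
ground_truth = {"q1": ["c1", "c7"], "q2": ["c3"], "q3": ["c9"]}
# Hypothetical retriever output: query id -> ranked chunk ids.
results = {"q1": ["c1", "c4", "c7"], "q2": ["c5", "c3"], "q3": ["c2", "c6"]}

recall = recall_at_k(results, ground_truth, k=3)
print(f"recall@3 = {recall:.2f}")

# The ~75% threshold is the rule of thumb from this page, not a universal constant.
if recall < 0.75:
    print("Look at chunking and hybrid search before swapping the embedding model.")
```

Run the same function before and after any change (chunking, hybrid search, a new embedding model) so the comparison uses one fixed eval set.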

Dimensions and the cost trade

Embeddings have a fixed dimension (e.g., 1536 for 3-small). More dimensions = more storage, more compute for similarity, marginally better recall.

| Dim | Storage per chunk | Notes |
| --- | --- | --- |
| 384 (bge-small) | ~1.5 KB | Cheap; OK quality |
| 768 (older models) | ~3 KB | Legacy default |
| 1536 (3-small) | ~6 KB | The right balance |
| 3072 (3-large) | ~12 KB | Better but expensive at scale |

For 1M chunks, that's 6 GB at 1536 vs 12 GB at 3072. Indexing cost (HNSW build time) and query latency both rise sub-linearly with dimension. For ≤10M chunks, dimension is rarely the binding constraint.
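The storage figures above follow from simple arithmetic. The sketch below assumes each component is stored as a 4-byte float32, which matches the per-chunk sizes in the table; it counts raw vectors only, so index structures such as HNSW graph links and any stored metadata come on top.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored unquantized as float32

def raw_vector_bytes(num_chunks, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed for the vectors alone, excluding index overhead."""
    return num_chunks * dims * bytes_per_value

for dims in (384, 768, 1536, 3072):
    per_chunk_kib = raw_vector_bytes(1, dims) / 1024
    total_gb = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {per_chunk_kib:.1f} KiB/chunk, {total_gb:.1f} GB per 1M chunks")
```

Passing a smaller `bytes_per_value` (for example 2 for float16, or 1 for int8 quantization) shows how quantization changes the picture, if your vector store supports it.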

Matryoshka embeddings (3-small supports this) let you truncate the vector to a smaller dim if storage is tight, with graceful quality degradation. Useful at very large scale.
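Truncation itself is simple: keep the leading components and rescale to unit length. The sketch below uses a toy 4-dimensional vector in place of a real embedding, and the function name is made up for illustration. With OpenAI's text-embedding-3 models the same result can be requested directly through the `dimensions` parameter of the embeddings endpoint, so local truncation is mainly useful for vectors you have already stored.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]

# Toy unit-length vector standing in for a 1536-dim embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_renormalize(full, 2)
print(len(short), short)
```

This only preserves quality for models trained to support it; truncating an arbitrary embedding this way usually hurts much more. Documents and queries must be truncated to the same size.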

The same-model-everywhere rule

You MUST embed your documents and your queries with the same model and same version. Mixing models fails in several ways:

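  • text-embedding-3-small for docs + text-embedding-3-large for queries → broken. At default sizes (1536 vs 3072 dims) a vector store will typically reject the query with a dimension error; if the sizes have been made to match (for example by shortening 3-large to 1536), the query runs and returns silent garbage. The vectors don't live in the same space.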
  • v1 of an open-source model for the index, v2 for queries → also broken.
  • One provider's model for docs and a different provider's for queries → completely broken.

When you upgrade your embedding model, you must re-embed the entire corpus before any new queries hit. The migration is often slow (hours-to-days for large corpora) and expensive.
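One way to enforce this rule is to record which model built the index and refuse any query embedded differently. The sketch below is a hypothetical guard, not a feature of any particular vector database: `EmbeddingSpec`, `EmbeddingMismatchError`, and `check_query_spec` are names invented for illustration, and in practice the spec would be saved in the index's metadata at build time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """What produced the vectors: model string plus output size."""
    model: str
    dims: int

class EmbeddingMismatchError(RuntimeError):
    """Raised when a query was embedded differently from the index."""

def check_query_spec(index_spec, query_spec):
    """Fail loudly instead of letting mismatched vectors return garbage."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index built with {index_spec}, query uses {query_spec}"
        )

# Saved once when the index is built (hypothetical values).
INDEX_SPEC = EmbeddingSpec(model="text-embedding-3-small", dims=1536)

# Same model and size: passes without output.
check_query_spec(INDEX_SPEC, EmbeddingSpec("text-embedding-3-small", 1536))

# Same size, different model: the silent-garbage case, now caught.
try:
    check_query_spec(INDEX_SPEC, EmbeddingSpec("text-embedding-3-large", 1536))
except EmbeddingMismatchError as err:
    print(f"refused: {err}")
```

During a migration, the same check keeps queries pointed at the old index until the re-embedded one is complete and its spec has been switched over.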

When to consider non-OpenAI embeddings

Voyage v3

  • Often the winner for technical-doc corpora (code, API docs, engineering writeups).
  • Competitive pricing.
  • The right A/B test target if you're plateaued at 80% recall.

Cohere v3

  • Strong on non-English corpora.
  • Built-in multimodal (text + image) variants.

Gemini text-embedding-004

  • Free tier is generous.
  • Integrates cleanly with Google Cloud (Vertex AI, BigQuery vector search).

Open-source (bge-m3, nomic-embed)

  • Self-hostable: data-residency requirements, air-gapped deployments.
  • Often quality-competitive with OpenAI for the right corpus.
  • The latency / ops cost of self-hosting often dwarfs the API savings unless you're at very high volume.

Embedding for code (special case)

If your RAG is over codebases, dedicated code-embedding models can win:

  • voyage-code-3 (Voyage AI): the strongest at code retrieval as of this snapshot.
  • OpenAI text-embedding-3-large: surprisingly strong on code.
  • CodeBERT, CodeT5 family: older and specialized, but limited.

Run an eval on your code corpus before defaulting to generic text embeddings.

Embedding for multimodal (special case)

  • CLIP family — text + image retrieval (find images matching a text query).
  • Cohere embed-multilingual + image — text and image in the same vector space.
  • Gemini multimodal embedding — same space for text + image + audio.

For most apps "search images by text description" is the actual use case; CLIP-style models nail it.

Cost intuition

Embedding cost is small compared to LLM cost — but at scale it's still meaningful.

Example: 1M documents, average 1000 tokens each, embedded once with 3-small:

  • 1M × 1000 = 1B tokens
  • 1B / 1M × $0.02 = $20

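One-time cost of $20 for a million-doc corpus. Re-embedding for an upgrade: another $20. Embedding 100M queries/year costs $2,000 if each query is also 1,000 tokens; real queries are usually much shorter, so the actual figure is lower.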
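The arithmetic generalizes to any corpus size and price. The sketch below uses the 3-small price from the table on this page; the 30-token query length is an assumed figure for illustration, not a measured one.

```python
def embedding_cost_usd(num_texts, avg_tokens, price_per_million_tokens):
    """Total embedding cost in USD for num_texts texts of avg_tokens each."""
    total_tokens = num_texts * avg_tokens
    return total_tokens / 1_000_000 * price_per_million_tokens

PRICE_3_SMALL = 0.02  # USD per 1M tokens, snapshot price from this page

corpus = embedding_cost_usd(1_000_000, 1000, PRICE_3_SMALL)
queries_long = embedding_cost_usd(100_000_000, 1000, PRICE_3_SMALL)
# Assumption: a typical search query is closer to 30 tokens than 1000.
queries_short = embedding_cost_usd(100_000_000, 30, PRICE_3_SMALL)

print(f"corpus, one-time:           ${corpus:,.0f}")
print(f"100M queries x 1000 tokens: ${queries_long:,.0f}")
print(f"100M queries x 30 tokens:   ${queries_short:,.0f}")
```

Swap in the current price and your own token counts before relying on the output, since prices on this page are a dated snapshot.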

The cost only matters at very large scale, and even then embeddings are typically <5% of total AI cost.

When fine-tuning embeddings is worth it

Almost never for app-shaped AI. The conditions where it IS worth it:

  • Domain-specific language that off-the-shelf embeddings don't capture well (medical, legal, internal jargon).
  • Ground-truth pairs are available — sets of (query, correct-document) pairs to train on.
  • Retrieval is the bottleneck — you've fixed chunking, hybrid search, reranking, and you're STILL below acceptable recall.
  • High enough query volume to justify the engineering cost.

For 95% of apps, off-the-shelf embeddings + careful retrieval engineering wins.

Common mistakes

Where people commonly trip up
  • Embedding model mismatch between index and query. Silent garbage; one of the hardest bugs to spot. Always pin the model version explicitly in code.
  • Defaulting to the biggest embedding model. 3-large is ~6x the cost of 3-small for ~10% recall lift. Only worth it on hard corpora where chunking and hybrid are exhausted.
  • Ignoring chunking and hybrid; tweaking embedding instead. 80% of retrieval-quality wins come from chunking and hybrid search. Embedding swaps are the long-tail optimization.
  • Self-hosting embeddings prematurely. "We'll save money self-hosting bge-m3 on our own GPUs!" — then you spend two engineers' salaries on ops. At <10M queries/month, hosted embeddings are nearly free; self-hosting is theater.
  • Not pinning version. OpenAI may update embedding models. If your code says model="text-embedding-3-small" and they release a v2 with the same name, your retrieval silently breaks. Pin model strings and verify on provider docs.

→ Next: Vector DB pick — where these embeddings actually live.