Embedding models

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: A cheaper, smaller model that turns a piece of text (or image, or audio) into a fixed-length vector — the unit currency of RAG, semantic search, classification, and clustering.

In plain English

A regular LLM reads tokens and writes tokens. An embedding model reads tokens and writes one list of numbers — typically 768 to 3,072 of them. That list is a coordinate in "meaning space": two texts that mean similar things end up close together. You store those vectors in a vector database, then at query time embed the user's question and grab the nearest neighbors. That's RAG in one sentence.
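
As a rough illustration of "close together in meaning space", here is a minimal sketch of nearest-neighbor lookup by cosine similarity. The three-dimensional vectors are made-up toy values rather than real model output, and `nearest` is a hypothetical helper; a vector database does the same ranking with an index instead of a full scan.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k corpus keys whose vectors are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda key: cosine_similarity(query, corpus[key]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
corpus = {
    "rotate a database password": [0.9, 0.1, 0.0],
    "change Postgres credentials": [0.8, 0.2, 0.1],
    "bake sourdough bread": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
top = nearest(query, corpus, k=2)
print(top)
```

The two password-related texts rank ahead of the bread recipe because their vectors point in nearly the same direction as the query.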

The major options (May 2026)

Model | Provider | Dimensions | Max input | Strengths | $ / Mtok
text-embedding-3-small | OpenAI | 1536 (configurable) | 8k | Default; cheap; ubiquitous | $0.02
text-embedding-3-large | OpenAI | 3072 (configurable) | 8k | Higher quality | $0.13
voyage-3 | Voyage AI | 1024 | 32k | Best general retrieval (MTEB top) | $0.06
voyage-code-3 | Voyage AI | 1024 | 32k | Code search king | $0.06
embed-v4 | Cohere | 1024 (configurable) | 128k | Multilingual; long doc | $0.10
text-embedding-005 | Google | 768 | 2k | Vertex AI native | $0.025
bge-large-en-v1.5 | BAAI (open) | 1024 | 512 | Self-host default | free (your GPU)
nomic-embed-text-v2 | Nomic (open) | 768 | 8k | Long-context open option | free / $0.01 hosted
mxbai-embed-large | mixedbread (open) | 1024 | 512 | Strong open challenger | free / $0.01 hosted

Default pick for most teams

OpenAI text-embedding-3-small at 1536 dimensions. It's cheap ($0.02/Mtok), good enough for 90% of retrieval, and every vector DB, RAG framework, and tutorial assumes this shape.

Upgrade to Voyage voyage-3 when you measure retrieval quality and it's the bottleneck — Voyage consistently tops the MTEB leaderboard and is worth the 3× price for quality-sensitive RAG.
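
"Measure retrieval quality" can be as simple as recall@k over a small set of hand-labeled queries. The sketch below assumes you have, for each test query, the ranked document IDs your retriever returned and the IDs a human marked as relevant; the function names and the sample data are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per test query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Hypothetical labeled queries: ranked IDs from retrieval, and human-marked relevant IDs.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant docs found in the top 3
    (["d9", "d2", "d4"], {"d5"}),        # the relevant doc was missed
]
score = mean_recall_at_k(runs, k=3)
print(score)
```

Run the same labeled set against each candidate model; if the score barely moves, the embedding model is probably not your bottleneck.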

When to deviate

  • Code search (a function, an API call, a stack trace): voyage-code-3. It's not subtle — it just wins on code.
  • Multilingual corpus (especially non-Latin scripts): Cohere embed-v4 or voyage-3.
  • Long documents you'd rather embed whole than chunk: Cohere embed-v4 (128k) or nomic-embed-text-v2 (8k).
  • Self-host required (data residency, on-prem): bge-large-en-v1.5 via Hugging Face TEI or Infinity.
  • Latency-sensitive on-device: a small open model like bge-small quantized.
  • You're already in Vertex AI: Google's text-embedding-005 saves the integration step.

Minimum integration

from openai import OpenAI
client = OpenAI()

def embed(text: str) -> list[float]:
    r = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return r.data[0].embedding  # 1536 floats

vec = embed("How do I rotate a Postgres password?")
# Store `vec` next to the source text in pgvector / Pinecone / Qdrant.

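Batching saves round trips, not tokens: pass a list of strings in a single call (OpenAI accepts up to 2,048 inputs per request) and you pay the same per-token price with far fewer HTTP requests. For an actual discount, OpenAI's separate asynchronous Batch API is billed at a reduced rate; check current pricing.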

r = client.embeddings.create(
    model="text-embedding-3-small",
    input=["doc 1 text", "doc 2 text", "doc 3 text"],
)
vectors = [d.embedding for d in r.data]
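
For a corpus larger than one request allows, a small batching loop is usually enough. This is a sketch under assumptions: `MAX_INPUTS_PER_CALL` is an assumed per-request limit you should confirm with your provider, and `embed_batch` is a hypothetical callable that would wrap the `client.embeddings.create` call above. A fake stand-in is used here so the example runs offline.

```python
from typing import Callable, Iterator

MAX_INPUTS_PER_CALL = 2048  # assumed per-request input limit; check your provider's docs

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = MAX_INPUTS_PER_CALL,
) -> list[list[float]]:
    """Embed every text, one call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in for a real API call so the sketch runs offline (hypothetical).
def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

all_vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(len(all_vectors))
```

Passing the embedding call in as a parameter keeps the batching logic testable without network access.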

Critical operational rules

  • Pick once, embed everything once. Re-embedding ten million chunks on a model swap is a one-week project. Decide deliberately.
  • Same model for queries and corpus. A query embedded with model A cannot match a corpus embedded with model B. The vectors live in incompatible spaces.
  • Match dimensionality to your index config. pgvector with HNSW indexed at 1024 dims will not serve queries from 3072-dim vectors. Lock dimensions early.
  • Store the model name alongside every vector. Future you, mid-migration, will need to know which rows are which.
  • Normalize if your DB expects it. Cosine similarity ignores vector length, but dot-product (inner-product) search gives the same ranking as cosine only on unit vectors. Some DBs L2-normalize for you; some don't.
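
Several of the rules above can be enforced at write time. The following is a minimal sketch, not a prescribed schema: `EmbeddingRecord` and `make_record` are hypothetical names, and the idea is simply to reject vectors of the wrong length, L2-normalize, and tag each row with the model that produced it.

```python
import math
from dataclasses import dataclass

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

@dataclass(frozen=True)
class EmbeddingRecord:
    """One stored row: the source text, its vector, and the model that produced it."""
    text: str
    vector: list[float]
    model: str

def make_record(text: str, vec: list[float], model: str, index_dims: int) -> EmbeddingRecord:
    """Reject vectors whose length does not match the index, then normalize and tag."""
    if len(vec) != index_dims:
        raise ValueError(f"expected {index_dims} dims, got {len(vec)}")
    return EmbeddingRecord(text=text, vector=l2_normalize(vec), model=model)

# Toy 2-dimensional vector for illustration; real vectors have hundreds of dims.
record = make_record("hello", [3.0, 4.0], model="text-embedding-3-small", index_dims=2)
print(record.model, record.vector)
```

Failing loudly on a dimension mismatch at insert time is cheaper than discovering it when queries start erroring.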

Pricing & cost notes

Embeddings are the cheap part of the AI stack. A typical RAG corpus — say, 1 million chunks at 500 tokens each — costs:

  • text-embedding-3-small: ~$10 to embed the whole thing once.
  • voyage-3: ~$30.
  • Self-hosted bge-large on a single A10: no per-token fee; you pay for GPU-hours instead.

The expensive part is re-embedding on a model change. Plan for it.
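
The arithmetic behind those figures is tokens divided by one million, times the per-Mtok price. A small sketch, using the snapshot prices from the table on this page (which will drift) and a hypothetical helper name:

```python
def embedding_cost_usd(num_chunks: int, tokens_per_chunk: int, price_per_mtok: float) -> float:
    """One-time cost to embed a corpus, given a price in dollars per million tokens."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_mtok

# 1 million chunks at 500 tokens each is 500 Mtok.
small = embedding_cost_usd(1_000_000, 500, 0.02)
voyage = embedding_cost_usd(1_000_000, 500, 0.06)
print(f"text-embedding-3-small: ${small:.2f}")
print(f"voyage-3: ${voyage:.2f}")
```

Multiply by the number of times you expect to re-embed over the project's life to get a more honest budget.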

Pitfalls

  • Switching embedding models mid-project without re-embedding the corpus. Retrieval quality silently degrades; you spend a week debugging your prompt before noticing the embeddings don't match.
  • Embedding 50k-token "documents" whole. Many models cap at 512 to 8,192 tokens; depending on the stack, the overflow is either rejected with an error or silently truncated. Chunk first.
  • Truncating embeddings (Matryoshka) without re-indexing. 3072-dim vectors truncated to 256 dims are not the same as native 256-dim vectors. Read the model docs.
  • Mixing models across query and corpus during an A/B test. The "B" arm looks terrible because nothing matches, not because the model is worse.
  • Skipping a re-ranker on small-k retrieval. Embeddings alone bring back coarse matches. A cross-encoder rerank (Cohere Rerank 3, Voyage Rerank) on top-50 → top-5 is the single biggest RAG quality win for the cost.
  • Putting embeddings in JSONB instead of a vector column. Works for 10 rows, dies at 10k. Use pgvector or a real vector DB.
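
For the "chunk first" advice, a sliding window with a little overlap is a common starting point. This sketch counts whitespace-separated words as a stand-in for tokens, which is an assumption: real limits are measured with the model's tokenizer, so leave headroom. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of at most `max_words` words, overlapping by `overlap`."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Ten toy "words" split into windows of four with one word of overlap.
doc = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(doc, max_words=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.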