Embeddings & semantic search
In one line: Embeddings are useful by themselves — for search, recommendations, deduplication, and cheap classification — even when no LLM ever sees the query.
RAG made embeddings famous, but the most under-used pattern is "embeddings without the chat model." Once every item in your DB has a vector, a huge set of features (semantic search, "more like this," dedupe, k-NN classification, anomaly detection) collapses into "find the closest neighbours." That is cheap at query time, with no LLM call and no hallucination surface.
What you can do with just embeddings
| Feature | Pattern |
|---|---|
| Semantic search | Embed query → top-K nearest items |
| "More like this" / recs | Use the source item's vector as the query |
| Dedup / near-duplicate detection | Self-join, keep pairs with distance < threshold |
| Classification (k-NN) | Top-K nearest labelled items, majority vote |
| Clustering / topic discovery | k-means or HDBSCAN over the vectors |
| Anomaly detection | Distance from nearest centroid above threshold |
| Autocomplete / "did you mean" | Embed prefix, return nearest items as suggestions |
| Smart routing | Embed user query, route to closest pre-defined intent |
None of these require an LLM at query time. The cost is the embedding (sub-cent per call) plus the vector search.
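To make "find the closest neighbours" concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and the item names are made up for illustration; real embeddings have hundreds or thousands of dimensions and would come from an embedding model, and a production system would use a vector index rather than scanning every item.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: near 0 for same direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: list[float], items: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Brute-force top-k: score every item, sort by distance, keep the closest k."""
    scored = [(item_id, cosine_distance(query, vec)) for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
ITEMS = {
    "refund-request": [0.9, 0.1, 0.0],
    "money-back": [0.8, 0.2, 0.1],
    "login-failure": [0.0, 0.1, 0.9],
}

# "More like this": use an existing item's vector as the query.
similar = nearest(ITEMS["refund-request"], ITEMS, k=2)
```

Here `similar` lists "refund-request" itself first, then "money-back"; a real "more like this" feature would usually drop the source item from its own results.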
Worked example — semantic search + k-NN classification
The same vector index powers two features on our support assistant: search across past tickets, and a routing classifier that picks the support queue without an LLM call.
Schema (Postgres + pgvector):
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE tickets (
    id         bigint PRIMARY KEY,
    tenant_id  text NOT NULL,
    text       text NOT NULL,
    category   text NOT NULL,             -- known label, hand-curated
    created_at timestamptz DEFAULT now(),
    embedding  vector(1536),              -- matches text-embedding-3-small's output size
    ts         tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED
);

-- HNSW for fast ANN search at >100k rows
CREATE INDEX tickets_embedding_hnsw
    ON tickets USING hnsw (embedding vector_cosine_ops);

-- Full-text GIN index alongside, for hybrid (Postgres ranks with ts_rank, not true BM25)
CREATE INDEX tickets_ts_gin ON tickets USING gin(ts);
Ingestion (Python):
def ingest_ticket(t: dict) -> None:
    """Embed one ticket and upsert it; re-ingesting an id refreshes text and vector."""
    vec = embed_cached(t["text"], model="text-embedding-3-small")
    db.execute(
        "INSERT INTO tickets(id, tenant_id, text, category, embedding) "
        "VALUES (%s, %s, %s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET text=EXCLUDED.text, embedding=EXCLUDED.embedding",
        (t["id"], t["tenant_id"], t["text"], t["category"], vec),
    )
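The ingestion code calls `embed_cached` without defining it. One plausible shape, sketched here as an assumption rather than the page's actual helper, is a wrapper that keys an in-memory cache on a hash of the model name plus the text, so unchanged text is never re-embedded and edited text always gets a fresh vector. The factory `make_embed_cached` and the injected `embed_fn` are hypothetical names; in practice `embed_fn` would call your embedding provider and the cache would likely live in a database or key-value store.

```python
import hashlib
from typing import Callable


def make_embed_cached(embed_fn: Callable[[str, str], list[float]]):
    """Wrap a (text, model) -> vector function with an in-memory cache."""
    cache: dict[str, list[float]] = {}

    def embed_cached(text: str, model: str = "text-embedding-3-small") -> list[float]:
        # Key on model + text so vectors from different models never collide,
        # and so any edit to the text produces a new key (no stale vectors).
        key = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text, model)
        return cache[key]

    return embed_cached
```

Passing the real provider call in as `embed_fn` keeps the caching logic testable without network access.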
Semantic search:
def search_tickets(query: str, tenant_id: str, k: int = 20) -> list[dict]:
    """Return the k tickets nearest to the query, scoped to one tenant."""
    qvec = embed_cached(query)
    rows = db.fetch(
        """
        SELECT id, text, category, embedding <=> %s AS distance
        FROM tickets
        WHERE tenant_id = %s
        ORDER BY embedding <=> %s  -- <=> is pgvector's cosine-distance operator
        LIMIT %s
        """,
        (qvec, tenant_id, qvec, k),
    )
    return rows
k-NN classification (no LLM, no training):
from collections import Counter


def classify_by_knn(text: str, tenant_id: str, k: int = 7) -> tuple[str, float]:
    """Predict category by majority vote of the k nearest labelled tickets."""
    qvec = embed_cached(text)
    neighbours = db.fetch(
        "SELECT category, embedding <=> %s AS dist "
        "FROM tickets WHERE tenant_id = %s "
        "ORDER BY embedding <=> %s LIMIT %s",
        (qvec, tenant_id, qvec, k),
    )
    if not neighbours:
        return "other", 0.0
    counts = Counter(r["category"] for r in neighbours)
    label, votes = counts.most_common(1)[0]
    # Divide by the rows actually returned: a small tenant may have fewer than k.
    confidence = votes / len(neighbours)
    return label, confidence
This is your first-line ticket router: sub-cent per call, no LLM, and accuracy that is only as good as your labelled examples. Use the LLM triage from the structured output page only when the k-NN confidence is low. That hybrid gets the best of both.
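The "k-NN first, LLM only when unsure" routing can be sketched as a small wrapper. The function name `route_ticket`, the injected callables, and the 0.7 cut-off are all assumptions for illustration; the right threshold depends on your label set and should be tuned against held-out tickets.

```python
from typing import Callable


def route_ticket(
    text: str,
    knn_classify: Callable[[str], tuple[str, float]],
    llm_classify: Callable[[str], str],
    min_confidence: float = 0.7,  # hypothetical cut-off; tune on real data
) -> tuple[str, str]:
    """Return (label, source), where source records which path made the decision."""
    label, confidence = knn_classify(text)
    if confidence >= min_confidence:
        return label, "knn"
    # Neighbours disagree (or there are none): escalate to the slower, costlier path.
    return llm_classify(text), "llm"
```

Returning the source alongside the label makes it easy to log how often the LLM fallback fires, which is a useful signal that a category needs more labelled examples.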
Near-duplicate detection:
def find_duplicates(tenant_id: str, threshold: float = 0.04) -> list[dict]:
    """Return rows (a_id, b_id, d) for ticket pairs closer than the threshold.

    The self-join compares every pair within the tenant, so cost grows quadratically.
    """
    return db.fetch(
        """
        SELECT a.id AS a_id, b.id AS b_id, a.embedding <=> b.embedding AS d
        FROM tickets a JOIN tickets b ON a.id < b.id
        WHERE a.tenant_id = %s AND b.tenant_id = %s
          AND a.embedding <=> b.embedding < %s
        """,
        (tenant_id, tenant_id, threshold),
    )
A nightly job using this collapses "I emailed you about this yesterday" duplicates before a human reads them.
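The query returns pairs, but collapsing duplicates usually means grouping them. One way to do that, sketched here as an assumption about how such a nightly job might work, is a union-find pass over the `(a_id, b_id)` pairs. The function name `group_duplicates` is hypothetical, and the input is assumed to be already extracted from the rows as tuples.

```python
def group_duplicates(pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Merge duplicate pairs into groups using union-find."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so each group has a stable representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups: dict[int, list[int]] = {}
    for ticket_id in parent:
        groups.setdefault(find(ticket_id), []).append(ticket_id)
    return sorted(sorted(group) for group in groups.values())
```

Note that grouping is transitive: if A matches B and B matches C, all three land in one group even when A and C are further apart than the threshold. Whether that is acceptable depends on how tight the threshold is.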
When embeddings beat keyword (and when they don't)
Embeddings shine on:
- Synonyms and paraphrases — "refund" matching "money back."
- Cross-language matches with multilingual models (multilingual-e5, Cohere embed-multilingual-v3.0).
- Concept queries like "billing problems" matching tickets that never use that phrase.
Embeddings lose to keyword search on:
- Exact-string lookups — product SKUs, error codes, person names, version strings.
- Boolean filters, such as status=open AND priority=high. Use SQL.
- Rare or domain-specific terms the embedding model wasn't trained on.
Production search is almost always hybrid: BM25 + vector, fused (reciprocal-rank fusion or learned). Don't argue about which; ship both.
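Reciprocal-rank fusion is small enough to show in full. This is a minimal sketch: it assumes each retriever has already returned a ranked list of ticket ids (best first), and it uses k=60, a commonly used default for the smoothing constant. The function name and the example ids are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several ranked id lists; each appearance scores 1 / (k + rank)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = [101, 102, 103]  # e.g. from the full-text index
vector_hits = [103, 101, 104]   # e.g. from the vector index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF only looks at ranks, it needs no score normalization between the keyword and vector retrievers, which is most of its appeal.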
Watch out for
- Mixing embedding models in one index. Vectors from text-embedding-3-small and voyage-3 live in different spaces, so distances are meaningless across them. Tag every row with the model name + version, and never query across models.
- No ANN index. A million-row pgvector table without HNSW does a full sequential scan on every query. Add hnsw or ivfflat up front; tune ef_search / lists to your latency target.
- Embedding raw HTML or boilerplate. Nav menus, cookie banners, footers will dominate the embedding. Strip chrome; normalize first.
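- Cosine vs. L2 mismatch. Cosine distance ignores vector length, but dot product and L2 rank results the same way as cosine only when vectors are unit-length. Most modern providers return normalized vectors; some don't. Pick the metric the provider documents, and make sure the index operator class matches the operator you query with.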
- Stale embeddings on edited content. Whenever the text changes, the embedding must too. Recompute on update; don't lazily refresh.
- K too small. Top-3 misses correct answers that are at rank 5. Retrieve top 20–50 candidates; then rerank or filter down. Recall first, precision second.
- Querying without a tenant filter. Index-wide search returns other customers' data. Always filter by tenant (and by ACL — see safety).
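On the metric question, a quick check is worth running once per provider. The sketch below (helper names are illustrative) tests whether a vector is unit-length and demonstrates the identity that makes the metrics interchangeable in that case: for unit vectors, squared L2 distance equals 2 - 2 × cosine similarity, so both produce the same ranking.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def is_unit_length(v: list[float], tol: float = 1e-6) -> bool:
    """True when the vector's L2 norm is within tol of 1."""
    return abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

# For unit vectors the two quantities agree up to floating-point error.
gap = abs(squared_l2(a, b) - (2.0 - 2.0 * cosine_similarity(a, b)))
```

If `is_unit_length` fails on a sample of your provider's vectors, either normalize at ingestion or stick to cosine distance, which is unaffected by length.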
2026 stack
| Layer | Default pick |
|---|---|
| Embedding model | OpenAI text-embedding-3-small (default), text-embedding-3-large (quality), Voyage voyage-3 (best for code/RAG). |
| Multilingual | Cohere embed-multilingual-v3.0, BGE-M3 (open). |
| Vector storage | pgvector under ~10M vectors. Pinecone / Qdrant / Turbopuffer / Weaviate above. |
| ANN algorithm | HNSW for most workloads. IVFFlat for very large + read-heavy. |
| Reranker | Cohere Rerank 3, Voyage Rerank, BGE Rerank. |
| Hybrid fusion | Reciprocal-rank fusion (5 lines of code) or your DB's BM25 + vector operators. |
A team's first instinct on "improve search" is "add an LLM." That's usually backwards. The first move is to pre-compute an embedding per item, then add a vector index. That alone elevates "search" from "exact-token grep" to "semantic," at a few hundred microseconds per query and no LLM bill.
Once that works, decide whether you want an LLM in the loop to rewrite the query, rerank, or summarize results. Most apps don't need it.