Reranking
In one line: Retrieve 50 candidates cheaply with vector/hybrid search, then re-score them with an expensive cross-encoder that reads each candidate alongside the query. Return the top 5–10 to the LLM. The biggest single-knob quality win in RAG after chunking.
Vector search is fast but coarse: it scores by "are these two embeddings nearby" without ever comparing the actual texts. A reranker is a small model that takes the query and one candidate at a time, reads them both, and gives a real relevance score. It is far slower per pair, but you only run it on the top 50 candidates, not the whole corpus. Net: a sizable quality gain (in the eval further down, Recall@5 rises from 71% to 83%) for roughly 50–100ms of extra latency.
The shape: bi-encoder vs cross-encoder
- Bi-encoder (used at retrieval): query and doc are embedded separately. Comparison is one cheap operation. Scales to billions of vectors. Loses fine-grained interaction.
- Cross-encoder (used at rerank): query and doc are concatenated and passed through a transformer together. The model sees how each query word relates to each doc word. Much more accurate, but you can't pre-compute anything — every pair needs its own forward pass.
You couldn't use a cross-encoder over a whole corpus (it'd take days). You can absolutely use it over 50 pre-filtered candidates (30–100ms total).
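To make that cost gap concrete, here is a back-of-envelope sketch. The 1.5 ms per (query, doc) forward pass and the 100M-document corpus are assumed figures for illustration, not measurements; real numbers depend on the model, hardware, and batching.

```python
def cross_encoder_seconds(num_docs: int, ms_per_pair: float = 1.5) -> float:
    """Estimated time to score every (query, doc) pair one after another.

    ms_per_pair is an assumed, illustrative cost of a single forward pass.
    """
    return num_docs * ms_per_pair / 1000.0


rerank_top_50 = cross_encoder_seconds(50)           # 50 pre-filtered candidates
full_corpus = cross_encoder_seconds(100_000_000)    # hypothetical 100M-doc corpus

print(f"top-50 rerank: {rerank_top_50 * 1000:.0f} ms")
print(f"full corpus:   {full_corpus / 86_400:.1f} days per query")
```

Under these assumptions the shortlist costs tens of milliseconds, while a single query against the full corpus would take more than a day.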
The pattern
Two-stage retrieval. Stage 1 is fast and approximate; stage 2 is slow and precise. Production RAG in 2026 looks like this almost universally.
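The two-stage shape can be sketched independently of any vendor. In the snippet below, `two_stage_search`, `word_overlap`, and `phrase_match` are hypothetical names, and the two scorers are toy stand-ins: in a real system stage 1 would be a vector or hybrid index lookup (not a full sort of the corpus) and stage 2 would be a cross-encoder.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    k_retrieve: int = 50,
    k_final: int = 5,
) -> list[str]:
    # Stage 1: cheap scoring over everything, keep a generous shortlist.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: expensive scoring over the shortlist only.
    return sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)[:k_final]


def word_overlap(query: str, doc: str) -> float:
    """Toy stage-1 scorer: number of shared words, ignoring order."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy stage-2 scorer: 1.0 if the query appears verbatim in the doc."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "password rules: your reset link expires in one hour",
    "reset your password from the settings page",
    "how to reset your password if you forgot it",
    "billing and invoices",
]
top = two_stage_search("reset your password", corpus, word_overlap, phrase_match,
                       k_retrieve=3, k_final=2)
print(top)
```

The first document shares all three query words, so the cheap scorer ranks it as highly as the others; the more careful second stage pushes it out of the final two.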
Worked example: Cohere Rerank
    import cohere

    co = cohere.Client()  # reads the API key from the CO_API_KEY environment variable

    def hybrid_then_rerank(query: str, top_k_retrieve: int = 50, top_k_final: int = 5):
        # Stage 1: cheap retrieval (hybrid_search is your own retriever, defined elsewhere)
        candidates = hybrid_search(query, k=top_k_retrieve)  # returns [{id, text}, ...]

        # Stage 2: rerank
        docs = [c["text"] for c in candidates]
        result = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=docs,
            top_n=top_k_final,
        )

        # Re-order: r.index points back into the list that was sent
        return [candidates[r.index] for r in result.results]
top_n=5 means Cohere returns the 5 best out of the 50 you sent. Each result has a relevance_score in [0,1]. Median latency: ~80ms for 50 docs.
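Because each result carries a relevance_score, one optional refinement is to drop weak candidates rather than always passing exactly top_n to the LLM. The sketch below is only illustrative: `keep_confident` is a hypothetical helper, the 0.3 cutoff is an arbitrary assumption to tune on your own eval, and the `SimpleNamespace` objects stand in for the items of `result.results` (which expose `index` and `relevance_score`).

```python
from types import SimpleNamespace


def keep_confident(candidates: list[dict], results: list, min_score: float = 0.3) -> list[dict]:
    """Map rerank results back to candidates, dropping those below min_score."""
    kept = []
    for r in results:
        if r.relevance_score >= min_score:
            kept.append({**candidates[r.index], "relevance_score": r.relevance_score})
    return kept


# Stand-in data shaped like a rerank response (already sorted best-first).
candidates = [{"id": "a", "text": "..."}, {"id": "b", "text": "..."}, {"id": "c", "text": "..."}]
results = [
    SimpleNamespace(index=2, relevance_score=0.91),
    SimpleNamespace(index=0, relevance_score=0.42),
    SimpleNamespace(index=1, relevance_score=0.05),
]
confident = keep_confident(candidates, results)
print([c["id"] for c in confident])
```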
The major rerankers (May 2026)
| Reranker | Hosted? | Latency (50 docs) | Strength |
|---|---|---|---|
| Cohere Rerank 3.5 | Yes | ~80ms | Industry default, multilingual |
| Voyage rerank-2.5 | Yes | ~100ms | Strong on code + technical |
| Jina Reranker v2 | Yes/self | ~120ms | Good open option |
| BGE-reranker-v2-m3 | Self | ~150ms | Best open multilingual |
| Mixedbread mxbai-rerank | Self | ~100ms | Compact, strong English |
| In-context LLM reranking | Yes (any) | ~500ms+ | Use a workhorse LLM with structured score output |
For starting out: Cohere Rerank is the boring, correct default. Switch later if you have a specific need (self-hosted, code-heavy, etc.).
When reranking pays off (and when it doesn't)
Worth it:
- Top-K from your retriever has the right answer somewhere but it's at rank 12 instead of 1.
- Corpus is large enough that the bi-encoder casts a wide net (50+ candidates).
- Latency budget allows ~100ms extra.
Skip it:
- Tiny corpus (under ~1K docs). Just retrieve more directly.
- Retrieval is already perfect (rare; verify with an eval).
- You're billing $0.001/query and can't add the rerank cost.
A useful eval comparison
On a 100-query RAG eval (your mileage will vary):
| Stage | Recall@5 | MRR |
|---|---|---|
| Pure vector | 62% | 0.48 |
| + Hybrid | 71% | 0.55 |
| + Hybrid + reranker (top-50→5) | 83% | 0.71 |
The reranker is doing the most work of any single step. It's also the cheapest to add — no schema changes, no infra changes, one API call.
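To reproduce a table like this on your own data, the two metrics are short to compute. This is a minimal sketch that assumes Recall@k means "fraction of queries with at least one relevant document in the top k" (a hit rate; some evals define it per relevant document instead) and that you have labeled relevant IDs per query. The function names and sample IDs are illustrative.

```python
def recall_at_k(ranked_ids: list[list[str]], relevant_ids: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel & set(ranked[:k]))
    return hits / len(ranked_ids)


def mean_reciprocal_rank(ranked_ids: list[list[str]], relevant_ids: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc; 0 for queries with none."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


# Three toy queries: relevant doc at rank 1, at rank 3, and not retrieved at all.
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d6"}, {"dX"}]

print(recall_at_k(ranked, relevant, k=2))
print(mean_reciprocal_rank(ranked, relevant))
```

Run both functions on the same query set before and after adding the reranker so the comparison is like for like.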
LLM-as-reranker
You can use a workhorse LLM as a reranker by asking it to score each candidate. Slower and pricier, but composable with the rest of your stack and no extra vendor.
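    import json
    from openai import OpenAI
    from pydantic import BaseModel

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    class CandidateScore(BaseModel):
        candidate_id: str
        relevance: float  # 0..1

    class CandidateScores(BaseModel):
        # Structured outputs need an object at the root, so wrap the list
        scores: list[CandidateScore]

    # Send 20 candidates, ask the model to score each
    payload = [{"id": c["id"], "text": c["text"][:500]} for c in candidates[:20]]
    prompt = f"""Score how relevant each candidate is to this query, from 0 to 1.
    Query: {query}
    Candidates:
    {json.dumps(payload, indent=2)}
    """
    result = client.beta.chat.completions.parse(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=CandidateScores,
        # temperature omitted: some reasoning-style models accept only the default
    )
    scores = result.choices[0].message.parsed.scores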
Quality is competitive with dedicated rerankers for many tasks; latency and cost are worse. Useful when you can't take on another vendor.
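Once the model has returned its scores, they still need to be mapped back onto the candidates. A defensive sketch follows, assuming the scores have been collected into a plain dict of candidate ID to relevance; `reorder_by_scores` is a hypothetical helper. LLM output can omit a candidate, invent an ID, or return a value outside 0..1, so the helper tolerates all three.

```python
def reorder_by_scores(candidates: list[dict], scores: dict[str, float], top_k: int = 5) -> list[dict]:
    """Sort candidates by model-assigned relevance, best first.

    Candidates the model did not score default to 0.0; IDs the model made up
    are never looked up, so they are ignored. Scores are clamped to [0, 1].
    """
    def score_of(candidate: dict) -> float:
        raw = scores.get(candidate["id"], 0.0)
        return min(1.0, max(0.0, raw))

    return sorted(candidates, key=score_of, reverse=True)[:top_k]


candidates = [{"id": "a", "text": "..."}, {"id": "b", "text": "..."}, {"id": "c", "text": "..."}]
llm_scores = {"b": 0.9, "a": 0.2, "zzz": 1.0}  # "c" missing, "zzz" not a real candidate
reranked = reorder_by_scores(candidates, llm_scores, top_k=2)
print([c["id"] for c in reranked])
```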
Pointwise vs listwise rerankers
- Pointwise (most rerankers, including Cohere): scores each (query, doc) pair independently. Predictable, parallelizable. (Strictly speaking, "pairwise" refers to a third, less common family that compares two candidates against each other.)
- Listwise: the model sees all candidates at once and ranks them as a list. Captures inter-candidate context (e.g., "candidate B is just a rephrase of A; skip the duplicate"). Slower, used in research-grade systems.
For 95% of apps, pointwise is the right default.
What beginners get wrong
- Skipping the rerank. "Vector + a workhorse LLM should be enough." Rerankers consistently add 10–20% recall, typically for a fraction of a cent per query with a hosted reranker. Take the deal.
- Sending too few candidates. Rerank's value is promoting a right answer that the retriever found but ranked too low; it cannot recover documents the retriever never returned. If you only send top-10, you have nowhere for the reranker to dig. Send 30–100.
- Sending too many. Past ~100, latency hurts and the reranker quality plateaus. Find your knee experimentally.
- Using a multilingual reranker on English only (wasted cost) or vice versa. Pick the model trained on your languages.
- Truncating doc text mid-sentence before reranking. The reranker reads the truncated version — make truncation respect sentences.
- Mixing reranker model with embedding model assumptions. They're independent; a Cohere reranker works on top of OpenAI embeddings.
- Caching rerank results too aggressively. Document text changes → cached scores lie. Invalidate on content updates.
- Not measuring per-query. Average gains hide the cases where rerank tanked a previously-good query. Look at the diff.
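For the last point, a per-query diff is easy to build once you record, for each eval query, the rank of the first relevant document before and after reranking. The sketch below assumes 1-based ranks with None meaning "not in the returned list"; the function name and query IDs are illustrative.

```python
from typing import Optional


def rerank_regressions(
    before: dict[str, Optional[int]],
    after: dict[str, Optional[int]],
) -> list[tuple[str, Optional[int], Optional[int]]]:
    """Return (query_id, old_rank, new_rank) for queries that got worse."""
    def as_key(rank: Optional[int]) -> float:
        # Treat "not found" as infinitely bad so it compares below any real rank.
        return float("inf") if rank is None else float(rank)

    worse = []
    for query_id, old_rank in before.items():
        new_rank = after.get(query_id)
        if as_key(new_rank) > as_key(old_rank):
            worse.append((query_id, old_rank, new_rank))
    return worse


before = {"q1": 12, "q2": 1, "q3": 3}
after = {"q1": 1, "q2": 4, "q3": None}
regressed = rerank_regressions(before, after)
print(regressed)
```

Here q1 improves from rank 12 to 1, which would dominate an average, while q2 and q3 both get worse and deserve a manual look.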
A production-grade RAG pipeline
Five stages, well under 300ms total. The shape of every serious RAG system I've audited in 2026.
A working "cheap retrieval → expensive rerank" pipeline takes one afternoon to add and reliably bumps quality from "demo-good" to "ship-able." If your RAG feels janky, this is the first place to look.