Glossary

A single A–Z reference for every term used in this guide. Plain-English definitions, no circular jargon. Cross-references appear in italics.

A

Agent — A setup where an LLM works in a loop: it picks a tool, you run it, you feed the result back, it picks the next tool, until it decides it's done. Contrast with chain.

Agent loop — The control flow of an agent: think → call tool → observe result → repeat. Usually capped by a max-iteration count and a budget.
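
A minimal sketch of that loop with an iteration cap. The `pick_action` callable stands in for the model and `tools` is a plain dict of functions; both names, and the action dict shape, are hypothetical and not any framework's API.

```python
from typing import Callable


def run_agent(pick_action: Callable[[list], dict], tools: dict, task: str,
              max_iters: int = 5) -> str:
    """Think -> call tool -> observe -> repeat, capped at max_iters."""
    history = [("task", task)]
    for _ in range(max_iters):
        action = pick_action(history)                      # think: model picks the next step
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])   # call the chosen tool
        history.append((action["tool"], result))           # observe the result
    return "stopped: iteration cap reached"


# Stand-in "model" (hypothetical): call the adder once, then finish.
def fake_model(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "answer": f"sum is {history[-1][1]}"}


print(run_agent(fake_model, {"add": lambda a, b: a + b}, "add 2 and 3"))
```

A real harness would also track a token or cost budget alongside the iteration count.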

Agent harness — The orchestration layer around an agent: loop control, context assembly, tool allowlists, memory, budgets, and tracing. Distinct from the base model.

Agentic RAG — Retrieval where the model decides when and how to search across multiple steps, rather than a fixed retrieve-then-generate pipeline. See RAG.

Alignment — The broad research goal of making models behave the way humans actually want, rather than what a misread of the objective might encourage.

ANN (Approximate Nearest Neighbor) — A class of algorithms (HNSW, IVF) for finding the most similar vectors quickly, trading a bit of accuracy for huge speedups versus exact search.

ASR (Automatic Speech Recognition) — Converting spoken audio to text. Same idea as STT. Whisper is the most common open model.

Attention — The mechanism inside a transformer that lets each token "look at" other tokens in the context to decide what's relevant. In decoder-only LLMs, a token attends only to itself and the tokens before it.

Autoregressive — Generating one token at a time, where each new token is conditioned on every token before it. The way every mainstream LLM produces output.

B

Barge-in — In a voice agent, the user interrupting while the agent is still speaking. Handling it well (stop TTS, cancel generation, listen) is a core realtime-voice engineering problem.

Base model — A pre-trained model that has only learned to predict the next token, without later instruction-tuning. Powerful but raw — not what you'd ship to end users.

Batch API — Provider endpoints (OpenAI, Anthropic) that run jobs asynchronously within 24 hours for ~50% off the normal rate. Good for offline evals and bulk generation.

Batching — Sending multiple inputs through the model in one forward pass for throughput. Critical for self-hosted inference economics.

Bias — Systematic skew in model outputs against certain groups, topics, or viewpoints. Audited via subgroup evals.

BLEU — An old machine-translation metric that counts n-gram overlap between output and reference. Mostly superseded by LLM-as-judge.

BM25 — A classic keyword-ranking algorithm for sparse retrieval. Often combined with dense retrieval in hybrid search.

BPE (Byte Pair Encoding) — A tokenizer algorithm that builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols (characters or bytes). Used by GPT and most modern models.
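
One merge step can be sketched as follows. This is a toy illustration on a three-word corpus, not the code of any production tokenizer; the function names are made up for the example.

```python
from collections import Counter


def most_common_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


corpus = [list("hug"), list("pug"), list("hugs")]
pair = most_common_pair(corpus)    # ("u", "g") appears in all three words
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```

Training repeats this step until the vocabulary reaches a target size; each merge adds one new token.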

Braintrust — A commercial eval and observability platform for LLM apps.

C

Cache hit — A request whose prompt prefix matched a cached entry, billed at a fraction of normal input price.

Cache miss — A request that did not match any cached prefix; full input price applies.

Calibration — How well a model's stated confidence matches its real accuracy. A well-calibrated model that says "70% sure" is right 70% of the time.

Cerebras — A hardware company whose wafer-scale chips serve open models at extremely high tokens-per-second.

Chain — A fixed pipeline of LLM and non-LLM steps (vs. an agent, which decides the order itself).

Chain-of-thought (CoT) — Prompting the model to produce intermediate reasoning steps before its final answer. Improves accuracy on multi-step problems.

Chroma — An open-source vector database, popular for local prototyping.

Chunk — One slice of a larger document, sized to fit a retrieval index and the context window.

Chunking — The process of splitting documents into chunks before embedding them.
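
A minimal sketch of fixed-size chunking with overlap. It assumes character-based sizes for simplicity; real pipelines often split on tokens, sentences, or headings instead, and the `size` and `overlap` defaults here are arbitrary.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


print(chunk_text("abcdefghij", size=4, overlap=2))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.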

Claude — Anthropic's family of LLMs. Tiers in 2026: Opus (most capable), Sonnet (balanced), Haiku (fast and cheap).

CLIP — OpenAI's contrastive image-text model: it embeds images and captions into the same vector space, so "find images matching this text" becomes a similarity search. The foundation of most multimodal RAG.

Command-R — Cohere's LLM family, marketed around RAG and tool use.

Constitutional AI — Anthropic's technique for aligning a model using a written set of principles ("constitution") that the model uses to critique and revise its own outputs.

Content moderation — Filtering inputs or outputs against a policy. Provider APIs (OpenAI moderation, Anthropic safety) make this a single call.

Context length — A synonym for context window.

Context window — The maximum number of tokens an LLM can read and write in a single call. 2026 frontier models are typically 200K–2M tokens.

Continuous batching — A serving technique (used by vLLM, TGI) that adds and removes requests from a batch every step, dramatically improving GPU utilization.

Computer use — A capability where a model takes screenshots and emits mouse/keyboard actions to operate a real computer. Pioneered by Anthropic in 2024.

Contrastive embedding — An embedding trained by pulling matching pairs (e.g. an image and its caption) together in vector space and pushing non-matching pairs apart. The training recipe behind CLIP-style multimodal models.

Cosine similarity — A score from -1 to 1 measuring the angle between two vectors. The default similarity metric for normalized embeddings.
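
As a sketch in plain Python (vector libraries do the same thing faster), the score is the dot product divided by the product of the two vector lengths:

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # perpendicular vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```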

Cross-encoder — A reranker model that scores a (query, document) pair jointly. Slower than embedding-based retrieval, but more accurate.

D

Data poisoning — An attack where adversarial documents are inserted into a training set or RAG corpus to make the model behave badly later.

DeepSeek — A Chinese lab whose 2024–2026 open models pushed reasoning-model quality at very low cost.

Decoder — The half of a transformer that generates output tokens. Modern LLMs are typically decoder-only.

Dense retrieval — Retrieval using embedding vectors. Contrast with sparse retrieval (BM25).

Diffusion — A generative architecture, dominant for text-to-image and text-to-video, that learns to denoise random noise into samples.

Distillation — Training a smaller "student" model to mimic a larger "teacher." Used to ship cheap, fast versions of frontier models.

Docling — IBM's open-source document parser for PDFs and complex layouts; competes with Unstructured and LlamaParse.

Dot product — A similarity metric between vectors. Equivalent to cosine similarity when vectors are unit-length.

DPO (Direct Preference Optimization) — A fine-tuning method that learns from pairs of preferred/rejected responses without needing a separate reward model. Simpler than RLHF.

DSPy — A Stanford framework that treats prompts as programs and optimizes them automatically against an eval set.

E

ElevenLabs — A leading TTS and voice-cloning provider.

Embedding — A fixed-length vector of floats that represents the meaning of a piece of text, image, or other input. Semantically similar inputs have similar vectors.

Encoder — The half of a transformer that consumes input. Embedding models (e.g., for RAG) are typically encoder-only.

Episodic memory — Memory of specific past interactions or events, as opposed to general facts. Used in long-running agents.

EU AI Act — The European Union's risk-based regulation of AI systems. It entered into force in August 2024, with obligations phasing in over the following years.

Eval — A test for an LLM system. Either a deterministic check (regex, schema) or a model-graded judgment (LLM-as-judge).

Eval case — A single test record: input, optional expected output, and the scorer to apply.

Eval suite — A collection of eval cases run together, usually with a scorecard summary.

Exact match — A scorer that returns 1 if output equals the reference string, 0 otherwise. Brittle but cheap.
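
A small sketch showing why the scorer is brittle. The optional `normalize` flag is an addition for illustration, not part of the definition above.

```python
def exact_match(output: str, reference: str, normalize: bool = False) -> int:
    """Return 1 if the strings are equal, else 0."""
    if normalize:
        # Illustrative leniency: ignore case and surrounding whitespace.
        output, reference = output.strip().lower(), reference.strip().lower()
    return int(output == reference)


print(exact_match("Paris", "Paris"))
print(exact_match("paris ", "Paris"))                  # trailing space and case break strict matching
print(exact_match("paris ", "Paris", normalize=True))
```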

F

F1 — The harmonic mean of precision and recall. Common for classification-style evals.
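
A worked sketch computing precision, recall, and F1 from a set of predicted positives and a set of true positives. The item IDs are made up for the example.

```python
def precision_recall_f1(predicted: set, actual: set):
    """Return (precision, recall, f1) for predicted vs. actual positive items."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The model flagged items 1-4; items 1, 2 and 5 were actually positive.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 5})
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are reasonably high.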

Faithfulness — Whether a RAG answer is actually supported by the retrieved sources. Often graded by an LLM-as-judge.

Fallback — Routing to a backup model or provider when the primary fails or rate-limits. Usually configured in an AI gateway.
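
A minimal sketch of the idea in application code. The `primary` and `backup` functions are hypothetical stand-ins for provider clients; a gateway would typically do this for you through configuration.

```python
def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for provider clients.
def primary(prompt):
    raise TimeoutError("rate limited")


def backup(prompt):
    return f"echo: {prompt}"


used, reply = call_with_fallback([("primary", primary), ("backup", backup)], "hi")
print(used, reply)
```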

Feature flag — A toggle that turns a model, prompt, or feature on/off without redeploying. Critical for safe rollouts of AI changes.

Few-shot — Prompting style that includes a handful of input/output examples before the real query.

Fine-tuning — Updating the weights of a pre-trained model on your own data. Cheaper than training from scratch, more expensive and less reversible than prompting.

Fireworks — A serverless inference provider for open models.

FP8 — 8-bit floating-point quantization. Common for serving frontier models at lower memory cost than FP16.

Foundation model — A large model pre-trained on broad data and adaptable to many downstream tasks. Includes LLMs, vision models, multimodal models.

Function calling — Letting the model emit a structured call (name, arguments) that your code executes. Synonym for tool use.

G

Gateway — See AI gateway. A proxy in front of model providers handling routing, retries, logging, and cost controls.

AI gateway — A proxy that sits between your app and multiple LLM providers, adding logging, fallback, key rotation, and cost limits. Examples: Portkey, LiteLLM, OpenRouter.

Gemini — Google's LLM family. Tiers: Ultra, Pro, Flash.

Gemma — Google's open-weights model family, sibling to Gemini.

GGUF — A file format for quantized open-model weights, used heavily by llama.cpp and local-inference tooling.

Golden dataset — A curated, hand-verified set of examples used as the ground truth for evals.

GPT — OpenAI's LLM family ("Generative Pre-trained Transformer").

Groq — A hardware company whose LPUs serve open models at very high tokens-per-second.

Ground truth — The known-correct answer used to score a model's output during evals.

Groundedness — Synonym for faithfulness: does the output stay anchored to retrieved evidence?

Guardrail — A pre- or post-processing check that blocks unsafe prompts or outputs. May be regex, classifier, or a separate LLM call.

H

Haiku — The smallest, fastest tier in Anthropic's Claude family.

Hallucination — When the model produces something confident but wrong. The dominant correctness failure mode; mitigated via RAG, citations, validation, and evals.

Haystack — An open-source framework for building RAG and search applications.

Helicone — An observability platform for LLM apps.

HNSW (Hierarchical Navigable Small World) — A graph-based ANN algorithm; the default index type in most modern vector databases.

Hugging Face — The dominant hub for open-source models, datasets, and training/inference libraries.

Hybrid search — Combining dense retrieval (embeddings) with sparse retrieval (BM25), then merging the rankings.
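
One common way to merge the two rankings is reciprocal rank fusion (RRF), sketched below. This is an illustrative choice, not the only merging method; the document IDs are made up, and `k=60` is a conventional default rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge ranked lists of doc IDs; each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d1", "d2", "d3"]    # order from embedding search
sparse = ["d2", "d4", "d1"]   # order from BM25
print(reciprocal_rank_fusion([dense, sparse]))
```

RRF uses only ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.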

I

Image embedding — A vector representation of an image. Used for similarity search across images, or to feed images into a text model.

Image-to-text — Generating a textual description of an image (captioning, OCR, VQA).

Inference — Running a trained model to produce outputs. The runtime side of ML, as opposed to training.

Inngest — A durable workflow platform commonly used to orchestrate multi-step LLM jobs.

Inspect AI — The UK AI Security Institute's (formerly AI Safety Institute) open-source framework for running model evals and capability tests.

Instruct model — A base model that has been further trained to follow instructions and chat. The default flavor you call via API.

INT4 — 4-bit integer quantization. Aggressive; common for running open models on consumer GPUs.

INT8 — 8-bit integer quantization. Modest accuracy loss, ~4x memory savings vs FP32.

IVF (Inverted File Index) — An ANN technique that clusters vectors and searches only the nearest clusters. Used by FAISS.

J

Jailbreak — A prompt designed to bypass a model's safety training, often through roleplay, encoding tricks, or persona pressure.

JSON mode — A provider feature that constrains the model to emit syntactically valid JSON.

JSON Schema — A standard for describing JSON shapes. Used to specify tool parameters and structured output schemas.

K

Kill switch — An emergency feature flag that disables an AI feature instantly. Pair with observability so you know when to flip it.

KV cache — The cached key/value tensors from previous tokens during generation. Reusing it makes long-context inference vastly cheaper. Underlies prefix caching.

L

LangChain — A long-standing Python/TS framework for building LLM apps; broad but heavy.

LanceDB — An open-source, embedded vector database built on Apache Arrow.

LangGraph — A graph-based agent framework from the LangChain team. Models agents as state machines.

Langfuse — Open-source LLM observability and tracing platform.

LangSmith — LangChain's commercial observability and eval platform.

LiteLLM — A drop-in proxy and SDK that exposes 100+ providers behind one OpenAI-compatible API.

LiveKit — An open-source realtime audio/video infrastructure platform (WebRTC), widely used as the transport layer for voice agents.

LlamaIndex — A Python/TS framework focused on RAG, indexing, and data connectors.

LlamaParse — LlamaIndex's hosted document parser.

Llama — Meta's open-weights LLM family. The de-facto baseline for open models since 2023.

LLM (Large Language Model) — A neural network trained on huge amounts of text that takes text in and produces text out. Examples: Claude, GPT, Gemini, Llama.

LLM-as-judge — Using a strong model to grade another model's output against a rubric. The workhorse scorer for subjective evals.

Logit — A raw, unnormalized score the model outputs for each token in its vocabulary. Softmaxed to get probabilities.

Logprob — The log of the probability the model assigned to a chosen token. Useful for confidence scoring and ranking.

Long-term memory — Persistent storage of facts about a user or domain, surfaced into context on future calls. Usually a vector or key-value store.

LoRA (Low-Rank Adaptation) — A fine-tuning method that trains small adapter matrices instead of all weights. Cheap, fast, and swappable at inference time.

M

MCP (Model Context Protocol) — An open protocol introduced by Anthropic in late 2024 for exposing tools, resources, and prompts to AI clients in a standard way. Broadly adopted by 2026.

MCP server — A process that implements the MCP protocol and exposes a set of tools/resources for clients (Claude Code, Cursor, Claude.ai, etc.) to consume.

Memory — In agent systems, any mechanism for carrying information across turns or sessions. See short-term memory, long-term memory, episodic memory.

Message — A single entry in a chat history, with a role and content.

Milvus — An open-source distributed vector database.

Mistral — A French lab and its open-weights LLM family.

Mixtral — Mistral's MoE model family.

Modal — A serverless platform popular for hosting custom Python/GPU inference workloads.

Model card — A short document describing a model's intended use, training data, limitations, and known risks.

Model router — A component (often inside an AI gateway) that picks among models per request based on cost, latency, or task class.

MoE (Mixture-of-Experts) — A model architecture where each token activates only a few "expert" sub-networks. Lets total parameter count grow without proportional compute cost.

Multi-agent — A system with multiple specialized agents collaborating (e.g., planner + researcher + writer).

Multimodal model — A model that consumes or produces multiple modalities — text, image, audio, video.

N

NIST AI RMF — The US National Institute of Standards and Technology's AI Risk Management Framework. A voluntary playbook for governing AI risk.

O

Observability — The discipline of seeing into a running LLM system: logs, traces, spans, metrics, prompts, costs. Tools include Langfuse, Helicone, LangSmith, and Braintrust.

OCR (Optical Character Recognition) — Extracting text from images of documents. Modern multimodal models often replace dedicated OCR.

OpenCLIP — The open-source reimplementation of CLIP, trained on public data. The default when you need a self-hosted image-text embedding model.

OpenRouter — A model marketplace exposing dozens of providers behind one OpenAI-compatible endpoint.

Opus — The most capable tier in Anthropic's Claude family.

Output filter — A guardrail applied to model output before it reaches the user (PII redaction, profanity, policy).

P

Paged attention — The memory-management trick at the heart of vLLM: stores the KV cache in fixed-size pages, like an OS, to avoid fragmentation.

Parameters — The learned weights of a model. Counted in billions for modern LLMs.

pgvector — A Postgres extension that adds vector columns and similarity search. The default choice when you already run Postgres.

Phi — Microsoft's family of small, capable open models ("small language models").

Pinecone — A managed vector database, one of the first and still widely used.

Pipecat — An open-source Python framework for building realtime voice (and multimodal) agent pipelines — wiring STT, the LLM, and TTS into one streaming loop.

Plan-and-execute — An agent pattern where one step produces a full plan and subsequent steps execute it. Contrast with ReAct.

Planner-worker — A multi-agent pattern where a planner agent decomposes work and dispatches it to worker agents.

Portkey — A commercial AI gateway with routing, caching, and observability.

Precision — Of items the model flagged as positive, what fraction actually were? Pair with recall.

Preference tuning — The umbrella term for training a model on human (or AI) preference comparisons rather than gold answers — RLHF and DPO are the two main recipes.

Pre-training — The initial, massive training run on raw text that produces a base model.

Prefix caching — Reusing the KV cache for repeated prompt prefixes across requests. Saves cost and latency. Anthropic calls this prompt caching.

Promptfoo — An open-source CLI for running prompt and model evals.

Prompt — The text you send into the model. Usually a system prompt (instructions) plus a sequence of user and assistant messages.

Prompt caching — A provider-side optimization where repeated prompt prefixes are cached and billed at a fraction of normal input cost.

Prompt injection — An attack where untrusted input (a webpage, an email, a document) carries instructions the model follows as if they were yours. Mitigations: isolation, guardrails, careful tool scoping.

Prompt leak — When a model reveals its hidden system prompt to a user, often via prompt injection.

Prompt registry — A versioned store of prompts, separate from code. Lets you A/B-test and roll back prompts without redeploying.

Prompt version — A specific revision of a prompt stored in a prompt registry.

Pydantic AI — A Python framework that uses Pydantic models to type-check LLM tool calls and structured output.

Q

Qdrant — An open-source vector database written in Rust.

QLoRA — LoRA applied on top of a 4-bit-quantized base model. Lets you fine-tune large models on a single GPU.

Quantization — Storing model weights at lower precision (FP8, INT8, INT4) to cut memory and speed up inference, at some accuracy cost.
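
A toy sketch of symmetric INT8 quantization over a short list of weights. It assumes one scale for the whole tensor; production schemes usually work per channel or per block and are more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, restored)
```

The gap between `weights` and `restored` is the accuracy cost; it is bounded by half the scale per weight.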

Query decomposition — Breaking a complex question into sub-questions before retrieval. Helps RAG over multi-hop queries.

Query expansion — Rewriting or augmenting a user query (synonyms, paraphrases) before retrieval to improve recall.

Qwen — Alibaba's open-weights LLM family.

R

RAG (Retrieval-Augmented Generation) — Handing the model relevant documents at query time so it can answer from real data instead of guessing.

Ragas — An open-source framework for evaluating RAG pipelines (faithfulness, answer relevance, context precision).

Rate limit — The cap a provider puts on requests, tokens, or concurrency per minute. The most common cause of "it broke in production."

ReAct — An agent pattern that interleaves Reason and Act steps: model thinks, calls a tool, observes, repeats.

Realtime API — Provider endpoints (OpenAI, Google) for low-latency speech-in/speech-out interactions, bypassing the separate STT → text → TTS loop.

Reasoning model — A model trained to spend extra inference compute on chain-of-thought before answering. Examples: o-series, Claude reasoning modes, DeepSeek-R1.

Recall — Of all the items that actually are positive, what fraction did the model find? Pair with precision.

Red team — A group (human or automated) that adversarially probes a system for failures, jailbreaks, and harmful behaviors.

Refusal — When a model declines to answer because of safety training. Useful when correct, frustrating when over-triggered.

Regression test — An eval case kept around specifically to catch a previously fixed bug if it comes back.

Replicate — A platform that hosts and serves open-source models behind a simple API.

Reranker — A second-stage model that re-scores retrieved candidates more accurately than the first-stage retriever. Usually a cross-encoder.

Retell — A voice-agent platform built on realtime APIs.

Retrieval — Looking up relevant data given a query. The "R" in RAG.

RFT (Reinforcement Fine-Tuning) — OpenAI's term for fine-tuning where the model is rewarded for correct outputs on user-provided graders.

RLHF (Reinforcement Learning from Human Feedback) — Training step where humans rank model outputs, a reward model is trained on those ranks, and the LLM is fine-tuned to maximize the reward.

Role — The label on a message: system, user, assistant, or tool.

ROUGE — A summarization metric based on overlap with reference summaries. Mostly superseded by LLM-as-judge.

Rubric — The written criteria an LLM-as-judge (or human grader) uses to score outputs.

S

Sampling — How the next token is picked from the model's probability distribution. Controlled by temperature, top_p, top_k.

Scorer — The function that turns a model output (and optional reference) into a score during evals.

SentencePiece — A tokenizer implementation popular for non-English languages and Llama-family models.

SFT (Supervised Fine-Tuning) — The simplest form of fine-tuning: train on (input, desired-output) pairs.

Short-term memory — The conversation history kept inside the context window. The simplest form of memory.

SigLIP — Google's improvement on CLIP that swaps the softmax contrastive loss for a sigmoid loss; a common backbone for the vision side of modern VLMs.

SLM (Small Language Model) — A compact LLM (typically under 10B parameters) suitable for edge or on-device use. Examples: Phi, Gemma.

Sonnet — The middle, balanced tier in Anthropic's Claude family.

Span — One unit of work inside a trace — e.g., a single LLM call, a single retrieval step.

Sparse retrieval — Retrieval based on keyword matches (BM25, TF-IDF). Contrast with dense retrieval.

Speculative decoding — A speedup where a small "draft" model proposes tokens that the big model verifies in parallel.

SR 11-7 — US Federal Reserve guidance on model risk management. Often applied to AI models in finance.

SSE (Server-Sent Events) — A long-lived HTTP connection for server-to-client streaming. The default transport for streaming LLM responses.

STT (Speech-to-Text) — Converting audio to text. Synonym for ASR.

Streaming — Sending tokens to the client as soon as they're generated, instead of waiting for the full response.

Structured output — Forcing the model to emit data matching a JSON Schema. Underlies tool calls and typed responses.

System card — A document accompanying a major model release, covering capabilities, evals, and safety mitigations.

System prompt — The first, hidden instruction message that sets the model's persona, rules, and tools.

T

Temperature — A sampling parameter that flattens (high) or sharpens (low) the probability distribution. 0 ≈ deterministic (always the most likely token), 1 = the model's unmodified distribution, >1 = wilder.
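
A sketch of the usual mechanism: logits are divided by the temperature before the softmax. The three logits are made-up numbers, and providers may implement temperature 0 as a special case (plain argmax), which this sketch does not cover.

```python
import math


def softmax_with_temperature(logits, temperature: float = 1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher ones spread it across the alternatives.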

Test-time compute — Extra inference work at answer time (reasoning tokens, verifier passes, multiple samples) to improve hard answers — bounded by harness budgets.

Together — A serverless inference provider for open models.

Temporal — A durable workflow engine often used to orchestrate long-running, retryable LLM jobs.

Text-to-image — Generating an image from a text prompt. Dominated by diffusion models.

Text-to-video — Generating a short video clip from a text prompt. 2026 frontier models can produce minutes of coherent footage.

TGI (Text Generation Inference) — Hugging Face's open-source inference server, comparable to vLLM.

tiktoken — OpenAI's BPE tokenizer library. The reference for "how many tokens is this prompt?" with GPT models.

Token — The unit an LLM reads and writes — roughly 4 characters of English, or ¾ of a word. Bills are quoted per million tokens, in and out separately.

Token bill — Your usage-based invoice from a model provider. Dominated by input tokens for RAG, output tokens for generation.
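
The arithmetic behind one line of the bill can be sketched like this. The prices are placeholders chosen for the example, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# A RAG-style call: large prompt, short answer. Prices are hypothetical.
cost = request_cost(input_tokens=8_000, output_tokens=500,
                    input_price_per_m=3.00, output_price_per_m=15.00)
print(f"${cost:.4f} per request")
```

With these example numbers the input side accounts for most of the cost, even though output tokens are priced higher per token.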

Tokenizer — The component that splits text into tokens (BPE, SentencePiece, tiktoken).

Tokens-per-second (TPS) — Throughput metric for generation. Frontier-quality models: 20–100 TPS; Groq/Cerebras on open models: 500–2000 TPS.

Tool — A function the model can call. Defined by a name, description, and JSON Schema for arguments.

Tool use — Synonym for function calling.

top_k — A sampling parameter that restricts the next-token choice to the K most probable tokens.

top_p — A sampling parameter that restricts the next-token choice to the smallest set whose cumulative probability exceeds P (a.k.a. nucleus sampling).
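
A sketch of the filtering step on a made-up four-token distribution. It keeps tokens until the running total reaches P, then renormalizes; sampling would then draw from the returned dict.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the most probable tokens until their cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}  # renormalize


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

Unlike top_k, the number of surviving tokens varies: a confident distribution keeps few tokens, a flat one keeps many.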

Trace — A timeline of all spans that made up one logical request. The basic unit of LLM observability.

Trajectory eval — Evaluating an agent's tool sequence and intermediate steps (process), not only the final answer (outcome).

Training — The process of updating a model's parameters by minimizing a loss function on data.

Transformer — The neural network architecture behind every modern LLM. Introduced in 2017 ("Attention Is All You Need").

TTFT (Time-to-First-Token) — Latency from request start to the first token of output. The metric users actually feel in streaming UIs.
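
A minimal sketch of measuring it around any token iterator. `fake_stream` is a hypothetical stand-in for a provider's streaming response; the 50 ms delay is arbitrary.

```python
import time


def measure_ttft(stream):
    """Return (first_token, seconds waited for it) for any token iterator."""
    start = time.perf_counter()
    first = next(iter(stream), None)   # blocks until the first token arrives
    return first, time.perf_counter() - start


# Hypothetical stand-in for a streaming response.
def fake_stream():
    time.sleep(0.05)   # simulated queueing and prompt processing
    yield "Hello"
    yield " world"


token, ttft = measure_ttft(fake_stream())
print(token, f"{ttft * 1000:.0f} ms")
```

In production the clock should start when the request is sent, so that network and queueing time are included.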

TTS (Text-to-Speech) — Generating audio speech from text. ElevenLabs is the 2026 default.

Turbopuffer — A serverless vector database optimized for cheap storage and fast cold queries.

Turn-taking — In voice agents, deciding when the user has finished speaking and the agent should respond. Built on VAD plus semantic cues; getting it wrong makes the agent interrupt or lag.

U

Unstructured — A widely-used open-source library and hosted API for parsing PDFs, HTML, and office docs into chunks ready for RAG.

V

VAD (Voice Activity Detection) — Detecting when a speaker starts/stops talking. Essential for realtime voice agents.

Vapi — A voice-agent platform built on realtime LLM APIs.

Vector — A list of numbers. In AI context, almost always an embedding.

Vector database — A database optimized for "find the K most similar vectors" queries. Used to power RAG and semantic search.

Vector DB — Short for vector database.

Vector index — The data structure (HNSW, IVF) inside a vector database that enables fast ANN search.

Vellum — A commercial platform for prompt management, evals, and deployment.

Vercel AI SDK — The dominant TypeScript abstraction for LLM integrations, with streaming, tools, and React hooks.

Vespa — An open-source search engine combining vector, lexical, and structured queries at very large scale.

Video generation — Synthesizing video frames from a prompt or seed image. Usually diffusion-based.

Vision — Any model capability that involves understanding images.

Vision-language model (VLM) — A multimodal model that takes both text and images as input. Most 2026 frontier models are VLMs.

vLLM — A high-throughput open-source LLM serving engine. Uses paged attention and continuous batching.

W

Weaviate — An open-source vector database with built-in hybrid search.

Weights — Synonym for parameters. The numbers a model has learned.

Whisper — OpenAI's open-source ASR model. The default baseline for speech-to-text.

Z

Zero-shot — Asking the model to do a task with no examples in the prompt — just instructions.
