Tokens
The unit an LLM reads and writes. Why "tokens" instead of words, and what that means for billing, prompts, and context windows.
Tokenizers
BPE, SentencePiece, tiktoken: the algorithms and libraries that split your text into tokens, why the same string produces different tokens across models, and how to count tokens before you send.
Embeddings
A vector of floats that captures the meaning of a piece of text. The basis for semantic search, RAG, deduplication, classification.
Neural networks (the machine under every model)
A neural network from zero: neurons, weights, layers, the forward pass, and how training nudges weights to fit examples. The deep-learning foundation that the transformer, and every LLM built on it, rests on.
The transformer (just enough)
The neural network architecture behind every modern LLM. Just enough for later decisions to make sense, with no calculus.
Training vs. inference
Why training is rare and inference is your daily reality — and why this distinction shapes every cost, latency, and tooling decision.
Quantization (smaller weights, cheaper inference)
Storing model weights at lower precision (FP8, INT8, INT4) to fit bigger models on smaller GPUs and serve them faster — the memory math, the quality trade-off, and when a quantized big model beats a small one.
Reasoning models
Reasoning-effort dials, Claude extended thinking, DeepSeek V4 thinking mode, Gemini Deep Think. Models that "think" before responding — when they're worth it, when they're not, and how to prompt them differently.
Model families
Frontier vs workhorse vs small. Closed vs open. Reasoning models vs base chat models. The durable map of the model landscape.
Where LLMs fail (and why)
The systematic blind spots of a next-token predictor (counting, exact math, state-tracking, fresh facts, and confidently wrong answers) and the first-principles reason each one happens, so you know when to distrust the model.