Math primer (appendix)

In one line: You can ship LLM apps for years without any of this — but if you ever want to specialize into fine-tuning, inference optimization, or research engineering, this is the intuition you'll need on the on-ramp.

In plain English

This appendix is explicitly optional. Every other page of this guide says "no calculus required" — and means it. But if you're curious about what's actually happening inside the box, or you're planning to specialize later, this page gives you the four mental images that matter: embeddings as geometry, softmax as a vote, attention as a similarity-weighted average, and gradient descent as rolling downhill. No proofs, no derivations — just intuition.

1. Embeddings — geometry, not magic

A token embedding is a vector — say, 1536 numbers. Each number is a coordinate in a 1536-dimensional space. Tokens with related meanings end up near each other in that space; unrelated tokens end up far apart.

"Near" and "far" are defined by:

Cosine similarity: the dot product of the two vectors divided by the product of their lengths. Range: -1 (opposite directions) to 1 (same direction). The default similarity metric for embeddings.
L2 distance: straight-line distance. Useful when magnitude matters.

Why a dot product? If both vectors have unit length, the dot product equals the cosine of the angle between them. Vectors pointing the same way → high dot product → similar meaning.
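
A minimal sketch of both metrics in plain Python. The 2-D vectors are made-up numbers chosen so the arithmetic is easy to check by hand; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def l2_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D "embeddings" (illustrative values, not from a real model).
a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice as long
c = [-4.0, 3.0]   # perpendicular to a

print(cosine_similarity(a, b))  # 1.0: same direction, length is ignored
print(cosine_similarity(a, c))  # 0.0: perpendicular, no similarity
print(l2_distance(a, b))        # 5.0: here the difference in length shows up
```

Note how a and b are identical under cosine similarity but 5 units apart under L2 distance. That is the "magnitude matters" difference between the two metrics.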

"king" - "man" + "woman" ≈ "queen"

That cliché actually works (more or less) because of this geometric structure. Concepts are directions in the space — the "royalty direction," the "gender direction," etc. — and you can add and subtract directions.
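
Here is a toy version of that arithmetic. The vectors are hand-built for illustration (axis 0 stands in for "royalty," axis 1 for "gender"). A real model learns its directions from data, and the analogy only holds approximately there.

```python
import math

# Hand-built 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are hypothetical, not taken from any real embedding model.
vocab = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed coordinate by coordinate.
target = [k - m + w for k, m, w in
          zip(vocab["king"], vocab["man"], vocab["woman"])]

# Exclude the query words themselves, as analogy demos commonly do.
candidates = {w: v for w, v in vocab.items()
              if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(target, nearest)  # [1.0, -1.0] queen
```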

This is the whole intuition behind vector search (see Vector search) and embeddings (see Embeddings).

2. Softmax — turning numbers into a vote

Anywhere the model has to choose something — which token to emit next, which input position to attend to — softmax is the mechanism.

Input: a list of arbitrary numbers (called "logits"). Output: a list of probabilities that sum to 1.

logits = [2.0, 1.0, 0.1]
softmax(logits) ≈ [0.66, 0.24, 0.10]   # sums to 1

Mechanically: exponentiate each logit, divide by the sum of the exponentials. The exponential is what makes softmax "winner-takes-most" — small differences in logits become big differences in probabilities.
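
The recipe above fits in a few lines of plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick and an addition of this sketch; it does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    m = max(logits)  # stability trick: shifting all logits leaves the output unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```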

The temperature knob (see Sampling) divides each logit by the temperature T before softmax. Low temperature (e.g. 0.1) stretches the gaps between logits and sharpens the distribution, so the top option dominates. High temperature (e.g. 1.5) shrinks the gaps and flattens it, so the distribution is closer to uniform.
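
A sketch of temperature applied to the same logits as above, assuming the usual convention of dividing each logit by the temperature. The function name is illustrative, and this version requires a temperature greater than zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply an ordinary softmax."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0 here
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.1)  # top option takes nearly everything
flat = softmax_with_temperature(logits, 1.5)   # probabilities move toward each other

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

The ranking of the options never changes. Temperature only changes how much probability the leader takes from the rest.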

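That's the entire math behind the temperature knob. top_p (nucleus sampling) is a separate step applied after softmax: keep only the most likely options whose probabilities add up to at least p, renormalize, and sample from those.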
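
For the top_p knob, a minimal sketch of the filtering step as it is usually described for nucleus sampling. The function name is hypothetical, and tie-breaking and boundary handling vary between real implementations.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top options whose total reaches p, then renormalize."""
    # Indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Dropped options get probability 0; kept ones are rescaled to sum to 1.
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

filtered = top_p_filter([0.66, 0.24, 0.10], p=0.8)
print([round(x, 3) for x in filtered])  # [0.733, 0.267, 0.0]
```

With p=0.8 the first two options cover enough probability, so the third is dropped before sampling.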

3. Attention — a similarity-weighted average

This is the trick that defines the transformer. For each token, attention answers: "which other tokens should I look at, and how much?"

Each token has three derived vectors:

Query (Q): "what am I looking for?"
Key (K): "what am I about?"
Value (V): "what info do I carry?"

The mechanism, for one token:

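Compute the dot product of my Q with every token's K (my own included) → a similarity score per token. Real implementations also divide each score by the square root of the key dimension.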
Apply softmax → those scores become weights summing to 1.
Take the weighted average of every token's V using those weights.
The result is my new representation — a blend of information from the tokens I attended to most.
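
The steps above, sketched for a single query token in plain Python. The Q, K, and V values are made up for illustration. The division by the square root of the key dimension comes from the standard scaled dot-product formulation. Real models run this for all tokens at once with matrix multiplications, across multiple heads, and (in decoders) with a causal mask, none of which is shown here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention for one token: score every key, softmax, average the values."""
    d_k = len(query)
    # Step 1: dot product of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Step 2: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Step 3: weighted average of the value vectors, one coordinate at a time.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Three tokens with hypothetical 2-D keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # points along the first axis

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The query lines up with the first and third keys, so those two tokens get almost all the weight and the output leans toward their values.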

That's the core of it. Stack 60 layers of this, each paired with a small feed-forward network, train on the internet, and you get a transformer.

Why it's O(n²) in context length: every token computes a Q·K dot product with every other token. Doubling context → quadrupling pairs → roughly quadrupling cost. This is the source of the long-context expense and why flash attention / paged attention / prefix caching exist.
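
The arithmetic behind that claim, as a tiny sketch. It counts Q·K dot products only and ignores constants, layers, and heads.

```python
def attention_pairs(n_tokens):
    """Number of Q·K dot products when every token scores every token."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```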

See The transformer for the higher-level view.

4. Loss, gradient, and the "rolling downhill" picture

Training a model means adjusting its parameters (the billions of numbers inside) so that its predictions match the training data better.

Loss is a single number that measures "how wrong" the model is on a given example. For next-token prediction, the typical loss is cross-entropy — high when the model assigned low probability to the actual next token, low when it assigned high probability.
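
For a single prediction, cross-entropy reduces to the negative log of the probability the model gave to the correct token. A minimal sketch, reusing the softmax output from section 2 as the model's hypothetical prediction:

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(probs[target_index])

probs = [0.66, 0.24, 0.10]  # model's predicted distribution over three tokens

print(round(cross_entropy(probs, 0), 2))  # 0.42: the model favored the right token
print(round(cross_entropy(probs, 2), 2))  # 2.3: the model gave the right token only 10%
```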

The training landscape is the loss as a function of the parameters. Imagine it as a hilly surface — each point is a possible setting of the model's weights, and the height is the loss for that setting. You want to find a low spot.

The gradient is a vector pointing in the direction of steepest increase of the loss. Flip its sign and you have the direction of steepest decrease.

Gradient descent is the algorithm:

Compute the gradient at the current point (via backpropagation).
Take a small step in the opposite direction (multiplied by a "learning rate").
Repeat.
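
The three steps above, sketched on a made-up one-parameter loss whose lowest point is at w = 3. Here the gradient is written out by hand. In real training it comes from backpropagation across billions of parameters.

```python
def loss(w):
    """Toy loss surface: a bowl with its lowest point at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the toy loss at w (derived by hand for this example)."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # size of each downhill step

for step in range(50):
    w -= learning_rate * gradient(w)  # step opposite to the gradient

print(round(w, 3), loss(w))  # w ends up very close to 3, loss close to 0
```

Try a learning rate above 1.0 here and the steps overshoot and diverge, which is one reason the step size is described as "small."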

You're rolling downhill on the loss surface. Modern optimizers (Adam, AdamW) are gradient descent with momentum and per-parameter learning rates, but the picture is the same.

That's all "training" is, at a hand-wavy level. Everything else (RLHF, DPO, fine-tuning, SFT) is a variation on which loss you minimize and which data you use.

What this gives you that the rest of the guide doesn't

Why long contexts are expensive. O(n²) attention.
Why embedding similarity is just geometry. Dot products, cosine, vector arithmetic.
Why temperature works the way it does. Logit scaling pre-softmax.
What "training" actually does. Adjust parameters to minimize loss via gradient descent.
The on-ramp to fine-tuning literature. SFT (cross-entropy on demonstrations), DPO (a different loss using preferences), RFT (reward-driven loss).

You don't need any of this to ship AI products. You do need it if you specialize into:

Fine-tuning. You're literally choosing a loss and optimizing it.
Inference optimization. Understanding flash attention, paged attention, and quantization formats requires the math.
Research engineering. Reading papers requires fluency at this level.

When (and when not) to study deeper

Don't:

Read PyTorch tutorials before you've shipped your first LLM app.
Try to "really understand" backprop before knowing whether you'll ever train a model.
Take a MOOC on linear algebra "just in case." The opportunity cost is shipping something.

Do, when:

You've shipped 3+ AI features and want to specialize.
A fine-tuning project is on your roadmap.
You're targeting an inference / serving role.
You want to read papers and apply their ideas.

For that, the canonical path is: the 3Blue1Brown deep-learning series for intuition, Deep Learning (Goodfellow, Bengio, and Courville) for the formal version, and Andrej Karpathy's nanoGPT and micrograd for hands-on building. That stack, done well, is roughly a year of evenings.

What this primer does NOT cover

Backprop derivation. Useful if you'll modify training code; not for inference.
Optimizer specifics (Adam betas, weight decay, learning-rate schedules). Tuning territory.
Positional encoding variants (rotary, ALiBi, learned). Architecture detail.
RLHF / DPO loss functions. Specialization territory; understandable once you know "loss" and "gradient."
Quantization formats (FP16, BF16, FP8, MXFP4, INT4). Inference-engineering territory.
Distributed training (FSDP, DeepSpeed, tensor parallelism). Research-engineering territory.

The point of this appendix isn't to make you an ML engineer — it's to give you enough vocabulary to decide whether you want to become one, without months of detour.

