Part 1: Foundations of LLM Systems
Just enough about how LLMs work to make every decision later make sense.
In one line: An LLM is a function that takes a sequence of tokens in and produces a probability distribution over the next token out. Everything else — chat, RAG, agents, multimodal — is layered on top of that one primitive.
Start with The map of AI to see where LLMs sit inside artificial intelligence, machine learning, and deep learning — then come back here for how they work. This chapter is the LLM/agent slice of that map.
You don't need a PhD to build LLM apps. You need a working mental model of: what a token is, what a context window is, how the model decides what to say next, and what new patterns (retrieval, tool use, agents) exist on top of that core. This chapter gives you exactly that — no calculus, no PyTorch.
Why "foundations" matters even if you just want to ship
You can write your first LLM app in ten lines of code without understanding any of this. You can't ship a good one. Almost every production decision — what model to pick, how to control cost, why outputs are slow, why outputs are wrong, when RAG helps and when it doesn't, when to reach for an agent and when not to — is downstream of a foundational concept. Engineers who skip this chapter spend the next year confused about the same five things on a loop.
The mental model
An LLM is a stochastic, next-token function:
- Input: a sequence of tokens (your prompt).
- Output: a probability distribution over what the next token should be.
- The sampler picks one token, appends it, and the model runs again — until a stop condition.
That's it. The rest is engineering:
- Streaming is delivering those tokens to the client as they're produced.
- Structured output is constraining the sampler so the result parses as JSON or matches a schema.
- Tool calling is the model emitting a structured "I want to call function X with these arguments" instead of plain text.
- RAG is stuffing relevant documents into the prompt before generation.
- Agents are running the model in a loop and feeding tool results back in.
Every "magical" AI product is one or more of these patterns assembled carefully.
Read that loop until it feels boring. Every concept in this chapter is some way of feeding that loop better data, harvesting its output more usefully, or chaining several passes through it.
How this chapter is organized
Each page focuses on a single concept. Read in order the first time. (This list is generated from the sidebar, so the order and numbering never drift from the source of truth.)
- The map of AI: where LLMs fit — A one-page map of the whole field — AI, machine learning, deep learning, generative AI, LLMs, and agents — so you always know where the thing you're learning sits, and where classical ML still wins.
- Before you start: the programming you need — The five programming ideas every page of this guide assumes — variables, functions, HTTP, JSON, and the terminal — each in plain English, plus where to go if you've never written code.
The model
- Tokens — The unit an LLM reads and writes. Why "tokens" instead of words, and what that means for billing, prompts, and context windows.
- Tokenizers — BPE, SentencePiece, tiktoken — the algorithms that split your text into tokens, why the same string varies across models, and how to count tokens before you send.
- Embeddings — A vector of floats that captures the meaning of a piece of text. The basis for semantic search, RAG, deduplication, classification.
- Neural networks (the machine under every model) — A neural network from zero — neurons, weights, layers, the forward pass, and how training nudges weights to fit examples. The deep-learning foundation that every LLM, including the transformer, is built on.
- The transformer (just enough) — The neural network architecture behind every modern LLM. Just enough to make decisions later make sense — no calculus.
- Training vs. inference — Why training is rare and inference is your daily reality — and why this distinction shapes every cost, latency, and tooling decision.
- Quantization (smaller weights, cheaper inference) — Storing model weights at lower precision (FP8, INT8, INT4) to fit bigger models on smaller GPUs and serve them faster — the memory math, the quality trade-off, and when a quantized big model beats a small one.
- Reasoning models — Reasoning-effort dials, Claude extended thinking, DeepSeek V4 thinking mode, Gemini Deep Think. Models that "think" before responding — when they're worth it, when they're not, and how to prompt them differently.
- Model families — Frontier vs workhorse vs small. Closed vs open. Reasoning models vs base chat models. The durable map of the model landscape.
- Where LLMs fail (and why) — The systematic blind spots of a next-token predictor — counting, exact math, state-tracking, fresh facts, and confident wrong answers — and the first-principles reason each one happens, so you know when to distrust the model.
Using the API
- Messages — system, user, assistant — The shape of every modern LLM API call — system prompt for instructions, then alternating user and assistant turns.
- Prompting — the craft — Chain-of-thought, ReAct, self-consistency, prompt chaining, few-shot vs zero-shot, role assignment. The repeatable techniques behind reliable prompts.
- Context windows — The hard cap on how many tokens the model can see and emit in one call. Why bigger isn't always better.
- Prompt caching — Reusing the model's KV cache across calls when the prompt prefix is identical. 5-10x cost savings, dramatically faster TTFT.
- Sampling — temperature, top_p, top_k — How the next token is picked from the model's probability distribution. The knobs that make outputs more deterministic or more creative.
- Streaming — Sending tokens to the client as they're generated, instead of waiting for the full response. Required UX for any chat-style feature.
- Structured output — Forcing the model to return JSON, or even better, JSON that conforms to a schema. The bridge between LLM text and traditional code.
- Tool use / function calling — Letting the model emit a structured call (function name + args) that your code then executes. The foundation of every agent.
- Function calling, deep — Parallel tools, forced tool choice, streaming partial JSON, structured output via tools. The patterns that turn basic tool use into production agents.
- MCP — the Model Context Protocol — The 2024-released open protocol for connecting LLM clients to tools, data, and prompts. The standard that ate function-calling glue in 2025-2026.
- Multimodal inputs — Vision (image URLs and base64), audio (Whisper-class STT, Realtime), and document inputs. What changes, what costs, and where it shines.
Retrieval & memory
- Vector search — Finding the K most semantically similar pieces of text by comparing embedding vectors. The "find nearest neighbors in 1,536-dimensional space" primitive.
- Hybrid search — BM25 (keyword) plus vector (semantic) search, blended. Each catches what the other misses. The 2026 production default.
- Chunking strategies — Fixed-token vs semantic vs layout-aware vs hierarchical. Overlap, units, and why chunking dominates RAG quality more than any other knob.
- Reranking — Cross-encoder rerankers (Cohere Rerank, BGE, voyage-rerank). The 'cheap retrieval -> expensive rerank' pattern that wins production RAG.
- RAG basics — Retrieval-Augmented Generation — handing the model relevant documents at query time so it can answer from real data instead of guessing.
- Memory — Giving an LLM continuity across conversations — short-term, long-term, episodic, and the patterns that actually work in production.
Agents
- The agent loop — Tool call → observation → next tool call → done. The single mechanism behind every "AI agent" you've heard of.
- Planning and reflection — Explicit plan-act-reflect loops. When the extra step helps, when it's just expensive theater, and how to wire it without ceremony.
- Multi-agent systems — When (and when not) to add a second agent. The hype is loud; the wins are narrow.
- Context engineering — Curating what's in the model's context window at each step — the core reliability discipline for long-running agents.
- Computer use & browser agents — Claude Computer Use, ChatGPT Agent, Gemini browser control, browser-based agents. The vision-loop primitive — model takes a screenshot, emits clicks/keys, repeats.
- Foundations checkpoint — A self-test before you move on. If you can answer these without scrolling back, you have the foundations.
- Math primer (appendix) — Optional 1-page intuition for the math you can ignore as an AI engineer — but want to understand if you specialize into ML, fine-tuning, or research engineering.
Foundations is the vocabulary. Everything after it builds on these primitives: you'll learn the project lifecycle and tech stack, then the disciplines that separate a demo from a product — evaluation and responsible & safe AI — then specializations like fine-tuning and multimodal & voice, the workflows at every team scale, and finally decisions, career, and real case studies. Read in order and you go from "what's a token?" to job-ready.
When you finish this chapter, move on to Chapter 2: Roadmap.