Part 1: Foundations of LLM Systems

Just enough about how LLMs work to make every decision later make sense.

In one line: An LLM is a function that takes a sequence of tokens in and produces a probability distribution over the next token out. Everything else — chat, RAG, agents, multimodal — is layered on top of that one primitive.

In plain English

You don't need a PhD to build LLM apps. You need a working mental model of: what a token is, what a context window is, how the model decides what to say next, and what new patterns (retrieval, tool use, agents) exist on top of that core. This chapter gives you exactly that — no calculus, no PyTorch.

Why "foundations" matters even if you just want to ship

You can write your first LLM app in ten lines of code without understanding any of this. You can't ship a good one. Almost every production decision — what model to pick, how to control cost, why outputs are slow, why outputs are wrong, when RAG helps and when it doesn't, when to reach for an agent and when not to — is downstream of a foundational concept. Engineers who skip this chapter spend the next year confused about the same five things on a loop.

The mental model

An LLM, together with its sampler, behaves as a stochastic next-token function:

Input: a sequence of tokens (your prompt).
Output: a probability distribution over what the next token should be.
The sampler picks one token and appends it to the sequence, then the model runs again, repeating until a stop condition is met (for example, an end-of-sequence token or a token limit).
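The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a real model: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and the "model" simply prefers the next vocabulary entry after the last token. The shape of the loop (score, turn scores into a distribution, sample, append, check for stop) is the part that carries over.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]


def toy_logits(tokens):
    """Stand-in for the model: one raw score per vocabulary entry.

    Assumption for illustration: the score depends only on the last token,
    and the entry right after it in VOCAB gets the highest score.
    """
    last = tokens[-1] if tokens else -1
    preferred = (last + 1) % len(VOCAB)
    return [3.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10, temperature=1.0, seed=0):
    """Run the loop: distribution -> sample -> append -> repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    eos = VOCAB.index("<eos>")
    for _ in range(max_new_tokens):  # stop condition 1: length limit
        probs = softmax(toy_logits(tokens), temperature)
        next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == eos:  # stop condition 2: end-of-sequence token
            break
    return tokens
```

Calling `generate([VOCAB.index("the")])` returns a list of token IDs that starts with the prompt. Lowering `temperature` toward zero makes the sampler pick the highest-scoring token almost every time, and raising it spreads the choices out.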

That's it. The rest is engineering:

Streaming is delivering those tokens to the client as they're produced.
Structured output is constraining the sampler so the result parses as JSON or matches a schema.
Tool calling is the model emitting a structured "I want to call function X with these arguments" instead of plain text.
RAG is retrieving relevant documents and inserting them into the prompt before generation.
Agents are running the model in a loop and feeding tool results back in.

Every "magical" AI product is one or more of these patterns assembled carefully.

Read that loop until it feels boring. Every concept in this chapter is some way of feeding that loop better data, harvesting its output more usefully, or chaining several passes through it.
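As a rough sketch of how some of the patterns above wrap a single model call, the example below shows toy versions of RAG (feeding better data), structured output (harvesting the result as JSON), and an agent loop (chaining several passes). Everything here is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real LLM call, `retrieve` uses naive word overlap rather than vector search, and `TOOLS` holds a single made-up tool.

```python
import json

DOCS = [
    "The context window is the maximum number of tokens in one call.",
    "A token is the unit of LLM input and output.",
]


def retrieve(question, docs, k=1):
    """Toy retrieval: rank documents by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question, docs):
    """RAG: place retrieved text in the prompt ahead of the question."""
    context = "\n".join(retrieve(question, docs))
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_model(transcript):
    """Hypothetical stand-in for an LLM call that always emits JSON.

    It requests the `add` tool once, then answers from the tool result.
    """
    if "TOOL_RESULT" not in transcript:
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    result = transcript.rsplit("TOOL_RESULT: ", 1)[1].strip()
    return json.dumps({"final": f"The sum is {result}."})


# Hypothetical tool registry: name -> plain Python callable.
TOOLS = {"add": lambda a, b: a + b}


def run_agent(task, max_steps=5):
    """Agent loop: call the model, run any requested tool, feed it back."""
    transcript = task
    for _ in range(max_steps):
        message = json.loads(fake_model(transcript))  # structured output
        if "final" in message:
            return message["final"]
        result = TOOLS[message["tool"]](**message["args"])  # tool call
        transcript += f"\nTOOL_RESULT: {result}"
    raise RuntimeError("agent did not finish within max_steps")
```

Here `build_rag_prompt("What is a token?", DOCS)` puts the token definition ahead of the question, and `run_agent("What is 2 + 3?")` takes two passes through the model: one that requests the tool and one that answers from its result. A production system would swap `fake_model` for a real API call and add validation around the parsed JSON.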

How this chapter is organized

Each page focuses on a single concept. Read in order the first time.

The model

Tokens — The unit of LLM input and output.
Tokenizers — BPE, SentencePiece, and why the same string is 100 tokens for one model and 130 for another.
Embeddings — Vectors that capture meaning.
The transformer — Just enough architecture to be useful.
Training vs. inference — Why one is rare and the other is your daily reality.
Reasoning models — o1/o3, extended thinking, R1; when "thinking" beats more context.
Model families — Frontier vs workhorse vs small; closed vs open; reasoning vs base.

Using the API

Messages: system, user, assistant — How you actually call an LLM.
Prompting — the craft — Chain-of-thought, few-shot, ReAct, self-consistency, prompt chaining — the named techniques.
Context windows — The hard limit on what fits in one call.
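Prompt caching — Reusing the KV cache for a repeated prompt prefix across calls; cached input tokens are usually billed at a steep discount, though the exact savings depend on the provider.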
Sampling: temperature, top_p, top_k — How the next token is picked.
Streaming — Delivering tokens as they're generated.
Structured output — Forcing JSON or schema-conformant responses.
Tool use / function calling — Letting the model invoke your code.
Function calling, deep — Parallel tools, forced choice, streaming partial JSON.
MCP — Model Context Protocol — The open protocol for connecting LLM clients to tool servers, resources, and prompts.
Multimodal inputs — Vision, audio, document inputs.

Retrieval & memory

Vector search — Finding semantically similar text.
Hybrid search — BM25 + vector; what each catches that the other misses.
Chunking strategies — The biggest single lever on RAG quality.
Reranking — The cheap-retrieval → expensive-rerank pattern.
RAG basics — Retrieval-Augmented Generation, end to end.
Memory — Giving an assistant continuity across conversations.

Agents

The agent loop — Tool → observation → next tool → done.
Planning and reflection — Explicit plan-act-reflect; when reflection helps.
Multi-agent systems — When (and when not) to add a second agent.
Computer use & browser agents — Vision-loop agents that operate any UI.

Checkpoint

Foundations checkpoint — A self-test before moving on.

Appendix

Math primer — Optional 1-page intuition for embeddings, softmax, attention, and gradient descent.

Where this leads

Foundations is the vocabulary. Everything after it builds on these primitives: you'll learn the project lifecycle and tech stack, then the disciplines that separate a demo from a product — evaluation and responsible & safe AI — then specializations like fine-tuning and multimodal & voice, the workflows at every team scale, and finally decisions, career, and real case studies. Read in order and you go from "what's a token?" to job-ready.

When you finish this chapter, move on to Chapter 2: Roadmap.
