Sampling
In one line: The model outputs a probability over every possible next token; the sampler picks one.
temperature, top_p, and top_k shape how adventurous that pick is.
Imagine the model handing you a stack of cards, each showing a candidate next word and how likely it thinks that word is. The sampler is the rule for picking one card. Always pick the highest → boring, deterministic. Reshuffle proportionally to weights → varied, sometimes surprising. Throw out the long tail first → safer creativity. That's it.
The three knobs
- temperature (typically 0–2): Sharpens or flattens the distribution. 0 = always pick the top token (most deterministic). 1 = use the raw distribution. >1 = flatter, more random. For factual/structured tasks: 0–0.3. For creative writing: 0.7–1.2.
- top_p (nucleus sampling, 0–1): Only consider the smallest set of tokens whose probabilities sum to at least p. 0.9 is a common default. Lower = safer/more conservative.
- top_k: Only consider the top K tokens. Less common in modern APIs (top_p usually replaces it).
You typically set one of temperature, top_p, or top_k — combining them is mostly overkill.
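To make the three knobs concrete, here is a minimal pure-Python sketch of what each one does to a distribution. The function names and the toy logits are made up for illustration, and real samplers work on tensors over a vocabulary of many thousands of tokens, so treat this as a picture of the idea rather than any provider's implementation. Temperature 0 is not handled here because it is a plain argmax rather than a division by zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities. Requires temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = set(), 0.0
    for i in order:
        keep.add(i)
        running += probs[i]
        if running >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # peakier than raw
raw = softmax_with_temperature(logits, 1.0)    # the raw distribution
flat = softmax_with_temperature(logits, 2.0)   # flatter than raw

print([round(x, 3) for x in sharp])
print([round(x, 3) for x in raw])
print([round(x, 3) for x in flat])
print([round(x, 3) for x in top_k_filter(raw, 2)])
print([round(x, 3) for x in top_p_filter(raw, 0.9)])
```

Notice that both filters renormalize after cutting the tail, so the surviving tokens still form a valid distribution to sample from.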
When to use what
- Deterministic-ish output (classification, JSON extraction, code): temperature=0. Note: even at temp 0, perfect determinism is not guaranteed across providers and across time (batching, hardware non-determinism).
- Default chat / Q&A: temperature=0.7 is a reasonable starting point. (The OpenAI API's own default is 1; check your provider's docs for theirs.)
- Brainstorming / creative writing: temperature=1.0–1.2.
- High-stakes evals: drop to 0 to compare runs, then verify the live config matches.
- Function-calling / tool use: temperature=0 or near. You want the model to pick the right tool with the right args, not improvise.
Worked example: same prompt, three temperatures
Prompt: "Suggest a name for a coffee shop near a library."
| Temperature | Sample output |
|---|---|
| 0 | "The Reading Room" |
| 0.7 | "Margin Notes Coffee" |
| 1.2 | "Foxglove & Foxed Pages Espresso" |
At 0, you'd get "The Reading Room" on essentially every run. At 0.7, you'd get a small spread of plausible names. At 1.2, you get oddballs: useful for brainstorming, terrible for "extract this user's email."
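You can reproduce the shape of that table with a toy simulation. The sketch below assumes three candidate names with made-up scores (the logits are invented, not taken from any model) and counts how often the top name wins over 1,000 draws at each temperature.

```python
import math
import random

names = ["The Reading Room", "Margin Notes Coffee", "Foxglove & Foxed Pages Espresso"]
logits = [2.0, 1.0, 0.0]  # made-up scores, highest first

def sample_name(temperature, rng):
    """Pick one name: argmax at temperature 0, weighted draw otherwise."""
    if temperature == 0:
        return names[logits.index(max(logits))]
    weights = [math.exp(x / temperature) for x in logits]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the simulation itself is repeatable
shares = {}
for t in (0, 0.7, 1.2):
    picks = [sample_name(t, rng) for _ in range(1000)]
    shares[t] = picks.count(names[0]) / len(picks)
    print(f"temperature={t}: top name chosen {shares[t]:.0%} of the time")
```

The share of the top name drops as temperature rises, which is the "small spread" versus "oddballs" effect in miniature.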
Other relevant params
- seed: Some providers expose a seed for reproducibility. Best-effort, not guaranteed (batching, version drift, hardware non-determinism still apply).
- stop: A list of strings that, if generated, end the response. Useful for structured outputs and chain-of-thought termination.
- presence_penalty / frequency_penalty: Discourage repeats. Rarely needed in modern models.
- max_tokens / max_output_tokens: Hard cap on output length. Always set this. Always.
- logprobs: Some providers return the top-K logprobs per token. Useful for confidence estimates and reranking.
Logprobs: the secret weapon
When you need to know "how sure was the model?", request logprobs:
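```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Sentiment of: 'It's fine I guess.'"}],
    logprobs=True,    # return the logprob of each generated token
    top_logprobs=5,   # plus the 5 most likely alternatives at each position
)
# Show the first three output tokens with their top alternatives.
for tok in response.choices[0].logprobs.content[:3]:
    print(tok.token, [(t.token, round(t.logprob, 2)) for t in tok.top_logprobs])
```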
For each output token, you'll see the top 5 alternatives and their log-probabilities. Convert a log-probability to a probability with math.exp(logprob). Use cases:
- Confidence-based routing. If top token < 0.7 prob → escalate to a stronger model.
- Calibrated classification. Take the logprob of each class label, normalize, use as probability.
- Better reranking. Score candidate completions by total logprob.
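The three use cases above fit in a few lines once you have the logprobs in hand. This sketch runs offline on hypothetical numbers: the label logprobs, the 0.7 threshold default, and the helper names (normalize, route, total_logprob) are all illustrative assumptions, not part of any SDK.

```python
import math

# Hypothetical logprobs for the first output token of a sentiment classifier.
label_logprobs = {"positive": -2.3, "neutral": -0.25, "negative": -2.0}

def normalize(logprobs):
    """Calibrated classification: exponentiate, then renormalize to sum to 1."""
    probs = {label: math.exp(lp) for label, lp in logprobs.items()}
    total = sum(probs.values())
    return {label: p / total for label, p in probs.items()}

def route(logprobs, threshold=0.7):
    """Confidence-based routing: escalate when the top token's probability is low."""
    label = max(logprobs, key=logprobs.get)
    prob = math.exp(logprobs[label])
    return label, prob, prob < threshold

def total_logprob(token_logprobs):
    """Reranking score: the sum of per-token logprobs for one candidate."""
    return sum(token_logprobs)

class_probs = normalize(label_logprobs)
label, prob, escalate = route(label_logprobs)
print(label, round(prob, 3), "escalate" if escalate else "keep")

# Hypothetical per-token logprobs for two candidate completions.
candidates = {"A": [-0.1, -0.3], "B": [-0.2, -0.9]}
best = max(candidates, key=lambda name: total_logprob(candidates[name]))
print("best candidate:", best)
```

One caveat on reranking: summing logprobs favors shorter candidates, so dividing by token count is a common variation when lengths differ a lot.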
The "deterministic at temp 0" myth
temperature=0 makes the sampling step deterministic, but the computation that produces the probabilities isn't. You can get different outputs from the same prompt because:
- Batching: your request is batched with different sibling requests each time, which can change floating-point rounding enough to flip the top token when two candidates are nearly tied.
- Hardware non-determinism: different GPU scheduling orders give different rounding.
- Provider version drift: gpt-5-mini today is not the exact same checkpoint as gpt-5-mini two months ago.
For evals you care about: pin the version (e.g., gpt-5-mini-2026-04-15), use temp 0, and log the response so you can compare exact strings.
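The floating-point part is easy to see on your own machine. This sketch is only an illustration of the mechanism, not a claim about any provider's kernels: it sums the same three numbers in two different orders, standing in for the way different batch shapes or GPU schedules reorder a reduction, and shows the greedy pick flipping as a result.

```python
# Same three numbers, two summation orders.
score_a = (0.1 + 0.2) + 0.3
score_b = 0.1 + (0.2 + 0.3)

# Floating-point addition is not associative, so the two totals differ slightly.
print(score_a == score_b)  # prints False
print(repr(score_a), repr(score_b))

# In exact arithmetic these two candidates would tie and index 0 would win.
# With rounding, the tie is broken by summation order instead.
logits = [score_b, score_a]
greedy_pick = logits.index(max(logits))
print("greedy pick:", greedy_pick)
```

Once one token flips, every later token is conditioned on a different prefix, so a tiny rounding difference can grow into a visibly different response.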
What beginners get wrong
- Setting temperature high "for creativity" on a structured task. Higher temperature = more invented JSON keys, more malformed schemas, more hallucinated facts. Use temp 0 for extraction.
- Setting both temperature and top_p aggressively. Either alone is enough; both together gives unpredictable behavior.
- Believing temp 0 means perfect reproducibility. It doesn't. Test, log, expect drift.
- Forgetting max_tokens. A runaway generation is a budget incident. Set a hard cap.
- Using stop strings that appear naturally in the output. stop=["\n"] will cut a code block off after its first line.
- Tuning sampling before tuning the prompt. As a rough rule of thumb, sampling moves the needle 5%; prompt structure moves it 50%. Fix the prompt first.
- Setting presence_penalty to mask a bad prompt. Penalties paper over repetition; a better instruction fixes it at the root.
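The stop-string mistake is easy to simulate locally. The helper below, apply_stop, is a hypothetical stand-in for what a provider does server-side; it assumes generation ends at the earliest stop string and that the stop string itself is left out of the returned text.

```python
def apply_stop(text, stop):
    """Cut text at the earliest occurrence of any stop string (stop string excluded)."""
    cut = len(text)
    for s in stop:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Pretend this is what the model would have generated with no stop strings.
generated = "def add(a, b):\n    return a + b\n\nEND\nextra chatter"

too_eager = apply_stop(generated, ["\n"])     # newline appears naturally in code
deliberate = apply_stop(generated, ["\nEND"])  # a marker the code itself won't contain

print(repr(too_eager))    # only the first line survives
print(repr(deliberate))   # the whole function, without the trailing chatter
```

A sentinel that the prompt asks the model to emit, like the END marker here, is usually a safer stop string than punctuation or whitespace.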
A reasonable defaults table
| Task | temperature | top_p | max_tokens |
|---|---|---|---|
| JSON extraction | 0 | - | tight cap |
| Classification | 0 | - | tight cap |
| Chat (general) | 0.7 | 1 | 1000–4000 |
| Code generation | 0.2 | - | as needed |
| Brainstorming / naming | 1.0 | - | small |
| Creative writing | 0.9–1.1 | - | as needed |
| Tool calling | 0 | - | small |
Lower temperature isn't "smarter" — it's just less varied. The model knows the same things at temp 0 and temp 1; the difference is whether you see its second-best guess as easily as its first.
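If you'd rather keep those defaults in code than in your head, one option is a small preset map. Everything here is an illustrative assumption: the task keys, the sampling_kwargs helper, and especially the numeric max_tokens values, which stand in for the table's "tight cap", "small", and "as needed" and should be tuned to your workload.

```python
# Hypothetical presets mirroring the defaults table; the caps are placeholders.
PRESETS = {
    "json_extraction": {"temperature": 0.0, "max_tokens": 256},
    "classification": {"temperature": 0.0, "max_tokens": 16},
    "chat": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 2000},
    "code_generation": {"temperature": 0.2, "max_tokens": 4000},
    "brainstorming": {"temperature": 1.0, "max_tokens": 200},
    "creative_writing": {"temperature": 1.0, "max_tokens": 4000},
    "tool_calling": {"temperature": 0.0, "max_tokens": 256},
}

def sampling_kwargs(task, **overrides):
    """Return a fresh copy of the preset for `task`, with explicit overrides applied."""
    if task not in PRESETS:
        raise KeyError(f"no sampling preset for task {task!r}")
    return {**PRESETS[task], **overrides}

print(sampling_kwargs("chat"))
print(sampling_kwargs("json_extraction", max_tokens=512))
```

Every preset carries a max_tokens value, so a caller can't forget the cap by accident.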
Reproducibility recipe
When you need the same prompt to produce the same output as often as possible:
- Pin the model version explicitly (gpt-5-mini-2026-04-15, not gpt-5-mini).
- Set temperature=0.
- Pass seed if the provider supports it.
- Log the exact request and response.
- Re-test on a small fixture set after any model upgrade. Silent drift is real.
This still isn't guaranteed reproducibility (see the myth section above), but it's as close as the providers allow today.
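The "log the exact request and response" step can be as simple as storing a hash next to each record so drift is cheap to detect later. This is a sketch under assumptions: fingerprint, log_record, and drifted are hypothetical helpers, the seed value is arbitrary, and the response strings are placeholders rather than real model output.

```python
import hashlib
import json

def fingerprint(payload):
    """Stable SHA-256 of a JSON-serializable payload (keys sorted, so order is irrelevant)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def log_record(request, response_text):
    """Bundle the exact request and response with hashes for quick comparison."""
    return {
        "request": request,
        "request_sha256": fingerprint(request),
        "response": response_text,
        "response_sha256": fingerprint(response_text),
    }

def drifted(old, new):
    """True when the same request produced a different exact response string."""
    return (old["request_sha256"] == new["request_sha256"]
            and old["response_sha256"] != new["response_sha256"])

request = {
    "model": "gpt-5-mini-2026-04-15",  # pinned version, not the floating alias
    "temperature": 0,
    "seed": 1234,                      # arbitrary; only if the provider supports it
    "messages": [{"role": "user", "content": "Sentiment of: 'It's fine I guess.'"}],
}

baseline = log_record(request, "neutral")        # placeholder response text
rerun = log_record(request, "slightly negative")  # placeholder response text
print("drift detected:", drifted(baseline, rerun))
```

Running the fixture set after a model upgrade then reduces to comparing stored records against fresh ones.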
Practice: greedy decoding (temperature 0)
temperature=0 means "always pick the highest-probability token" — i.e., argmax over the distribution. That's the whole of greedy decoding. Write it and you've implemented the deterministic end of the sampling dial. (Ties resolve to the lowest index, which is how most implementations break them.)
Write argmax(probs) — given an array of next-token probabilities, return the index of the largest. On a tie, return the lowest index. This is exactly what temperature=0 does.
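Try it yourself first. If you want something to compare against afterwards, here is one possible solution in plain Python; raising on an empty list is my own choice, since the exercise doesn't say what to do in that case.

```python
def argmax(probs):
    """Return the index of the largest value; the lowest index wins on a tie."""
    if not probs:
        raise ValueError("probs must be non-empty")
    best = 0
    for i in range(1, len(probs)):
        if probs[i] > probs[best]:  # strict '>' keeps the earlier index on ties
            best = i
    return best

print(argmax([0.1, 0.7, 0.2]))  # clear winner at index 1
print(argmax([0.4, 0.4, 0.2]))  # tie between 0 and 1 resolves to index 0
```

The strict comparison is the whole tie-breaking rule: switching it to >= would return the highest tied index instead.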