Multimodal inputs
This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.
In one line: Modern frontier and workhorse models accept images, audio, and PDFs alongside text. Inputs cost differently per modality, and a 1024×1024 image is typically 1K–3K tokens. Use them when the content is visual or auditory — not because they sound cool.
A multimodal model is the same transformer underneath; it just has extra adapters that turn images, audio, or PDFs into tokens the model can attend to. From your code's perspective, you're still sending messages — some message parts are images or audio instead of text. The model sees them as more tokens.
What's actually supported (May 2026)
| Model | Vision | PDF (native) | Audio in | Audio out | Video |
|---|---|---|---|---|---|
| GPT-5 / 5.1 | yes | yes | yes (Realtime) | yes (Realtime) | partial |
| Claude Sonnet 4.6 / Opus 4.7 | yes | yes | no (use Whisper) | no | no |
| Gemini 2.5 Pro / Ultra | yes | yes | yes | yes (Live API) | yes |
| Llama 3.3 / 4 vision | yes | via OCR | no | no | no |
Each provider's "yes" is slightly different — check before you depend on a specific behavior (max image dimensions, max PDF pages, audio sample rates).
Vision: image URLs and base64
# OpenAI / GPT-5
response = client.chat.completions.create(
model="gpt-5",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image? Be specific."},
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
],
}],
)
For local files, base64-encode:
import base64
with open("invoice.png", "rb") as f:
b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="gpt-5",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Extract line items as JSON."},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
],
}],
)
Anthropic uses {"type": "image", "source": {"type": "base64", ...}}. Gemini uses parts: [{"inline_data": {...}}]. SDKs paper over this.
Image token cost
A common gotcha — images are not free. Typical accounting:
- OpenAI GPT-5: ~85 tokens base + ~170 per 512×512 tile. A 1024×1024 image ≈ ~765 tokens; a 2048×2048 ≈ ~3000+.
- Anthropic Claude: ~1.15 × (width × height / 750) tokens. A 1024×1024 ≈ ~1400 tokens.
- Gemini: 258 tokens per image for "small" mode; up to several thousand for "high resolution."
A bill spike from "we added image uploads" is almost always images costing more than expected. Resize before sending — 512×512 is plenty for most "what does this look like" questions; reserve high-res for OCR/detail tasks.
What vision is genuinely great at
- Document OCR + understanding — invoices, receipts, screenshots of forms, handwritten notes. Often beats traditional OCR + LLM pipelines.
- Chart and table extraction.
- UI understanding — "click the submit button" in agentic browsers.
- Visual QA — accessibility descriptions, content moderation, product cataloging.
- Code-from-screenshot — "build this UI" from a Figma export.
What it's mediocre at (May 2026):
- Precise spatial reasoning ("is the red box left or right of the blue?") — improving but unreliable.
- Counting many small objects ("how many ants in this photo?") — usually wrong past ~10.
- Reading tiny text — resize and tile yourself if it matters.
Audio inputs
Two flavors:
- Speech-to-text first, then text LLM — classic pipeline. Use Whisper (
whisper-1, Whisper Large v3), Deepgram, AssemblyAI. Cheap (~$0.006/min) and reliable. Then send the transcript to any LLM. - Native audio-in models — GPT-5 Realtime, Gemini Live API. Audio goes straight to the model; it can hear tone, pauses, accents, and respond with audio. Much higher cost (~$0.06+/min in 2026) but enables real conversational latency (<500ms round trip).
# Classic STT
audio = open("call.mp3", "rb")
transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)
print(transcript.text)
# Then a regular text LLM call
summary = client.chat.completions.create(
model="gpt-5-mini",
messages=[{"role": "user", "content": f"Summarize:\n{transcript.text}"}],
)
For the Realtime API, you stream audio chunks in via WebSocket and receive audio chunks (or text deltas) back. The model can interrupt itself if the user starts speaking. Best for voice assistants and real-time agents — way overkill for batch transcription.
Document inputs (PDFs)
Anthropic and Gemini accept PDFs natively (no OCR preprocessing). The provider extracts text and images from each page and feeds them to the model.
# Anthropic
with open("contract.pdf", "rb") as f:
pdf_b64 = base64.standard_b64encode(f.read()).decode()
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=2048,
messages=[{
"role": "user",
"content": [
{"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
{"type": "text", "text": "Summarize clauses about liability."},
],
}],
)
PDFs cost roughly text tokens + image tokens per page. A 30-page contract can easily be 50K+ tokens. For repeated questions on the same PDF, combine with prompt caching — first call pays; subsequent calls are 5–10× cheaper.
Worked example: invoice extraction from a scanned image
class LineItem(BaseModel):
description: str
qty: int
amount: float
class Invoice(BaseModel):
vendor: str
total: float
line_items: list[LineItem]
with open("scanned-invoice.jpg", "rb") as f:
b64 = base64.b64encode(f.read()).decode()
result = client.beta.chat.completions.parse(
model="gpt-5",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Extract the invoice as JSON."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
],
}],
response_format=Invoice,
)
invoice = result.choices[0].message.parsed
Vision + structured output = end-to-end document automation in 20 lines. This is the single most common multimodal pattern in B2B in 2026.
What beginners get wrong
- Sending 4000×4000 images "for quality." You're paying for the tiles. Downscale to what the task needs — 1024×1024 for general tasks, 2048 max for OCR.
- Forgetting that PDFs are image tokens, not just text. A scanned PDF costs more than a digital one. A 100-page PDF can blow your context window.
- Using native audio Realtime for batch transcription. Use Whisper. Realtime is for live conversation.
- Assuming the model can OCR tiny text reliably. Resize and crop the relevant region; or pre-process with a real OCR (Tesseract, Textract) and send text.
- Embedding raw images for similarity search. Use a real vision-embedding model (CLIP, SigLIP), not the chat API.
- Sending PII images without thought. Provider terms of service vary; some retain inputs for training (default off for enterprise; check). Strip metadata; consider on-prem for sensitive workloads.
- Pasting screenshots when text would do. A screenshot of a stack trace is 2K tokens; the text is 200. Paste the text.
Cost comparison rule of thumb
For the same "extract structured data from a document" task:
| Approach | Cost per doc | Quality |
|---|---|---|
| Vision LLM directly | $$ | high |
| OCR (Tesseract) + text LLM | $ | medium |
| Cloud OCR (Textract) + text LLM | $$ | high |
| Vision LLM + structured output | $$ | highest |
For high-volume pipelines, benchmark all three on your documents. For prototyping, vision LLM with structured output is the fastest path from PDF to typed data.
80% of B2B data lives in PDFs, scans, screenshots, and Word docs nobody wants to manually re-key. Multimodal LLMs turned that pile into a queryable dataset. The unsexy invoice-extraction use case is doing more revenue in 2026 than every consumer AI startup combined.
This page covers the inputs. For building real vision and voice systems — production OCR pipelines, video understanding, voice-agent latency budgets, and how to evaluate multimodal output — see Chapter 8: Multimodal & Voice AI, especially vision and voice.
→ Next: Vector search