Batch inference

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: Submit a JSONL file of requests; come back hours later for the responses; pay roughly half. The single biggest cost lever in your stack for anything that doesn't need to be interactive.

In plain English

A "batch" API is the LLM provider's version of "we'll run it when we have spare capacity." You package up thousands of requests into one file, upload it, and the provider returns the results within a window (typically 24 hours, often much sooner). The price tag: 50% off real-time pricing. If you're embedding a corpus, summarizing a backlog, classifying old emails, generating eval data, or doing nightly enrichments (anything where "a few hours from now" is fine), batch APIs cut the bill for that work roughly in half with a small code change.

The major options (2026)

Provider	Batch endpoint	Discount	SLA	Max requests per batch	Cancel
OpenAI Batch	`/v1/batches`	50%	within 24h, usually faster	50k per file, 200MB	yes
Anthropic Message Batches	`/v1/messages/batches`	50%	within 24h	100k per batch	yes
Google Gemini Batch	Vertex `BatchPredictionJob`	50%	24h	scales by project	yes
AWS Bedrock Batch	per-model	50%	up to 24h	depends on model	yes
Together / Fireworks / Groq batch	OpenAI-compatible batch endpoints	25–50%	varies	model-dependent	yes

Provider-side batching is the cheapest path. If you need your own batching layer (queue many requests, send concurrently, rate-limit, retry, persist), use a scheduler:

Scheduler	Strengths	Best for
Modal	GPU + queue + Python	Self-hosted batch on open models
Inngest batch / debounce	Event-driven; TS-first	Aggregating bursty events into batches
Trigger.dev	TS durable jobs	TS batch pipelines
Celery / RQ + Redis	Boring, well-known	Python teams already using it
SQS / Cloud Tasks / Pub/Sub	Cloud-native	Existing cloud-locked stacks
Hatchet / Temporal	Durable batch workflows	Mid-size to large Python teams

Default pick for most teams

The provider's native batch API, called from a small script that runs nightly. It's 20 lines of code, free of scheduling infrastructure, and gives you the full 50% discount.

When batch jobs become a regular thing — fan-out across tenants, retries, dependencies — wrap it in Inngest (TS) or Modal (Py) for proper job management.

When to deviate

Truly latency-sensitive but volume-heavy (a user is waiting, but you have 1000 sub-tasks): you can't use the 24h batch API — use parallelism + rate-limit aware concurrency in your own scheduler.
Self-hosted open models: there's no batch API, but vLLM continuous batching gives you the same economic benefit at runtime. Run the queue through Modal or Ray.
Mixed-priority queues (some jobs need 1 minute, some can wait): a scheduler (Inngest, Temporal) that routes by priority into different lanes.
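You need to know cost up-front for a fixed corpus: batch APIs generally don't quote a price before you commit, so count tokens first (with a local tokenizer or the provider's token-counting endpoint) and multiply by the batch rate before submitting.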
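
For the first case above (a user is waiting, but there are many sub-tasks), the usual shape is bounded concurrency against the real-time API with backoff on rate limits. The following is a minimal sketch, not a definitive implementation: `call_model` is a hypothetical stand-in for your provider's async client, `RateLimitError` is a hypothetical exception you would map HTTP 429 responses onto, and the defaults for `max_in_flight`, `max_retries`, and `base_delay` are assumptions to tune. This path pays full real-time price.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical error type; map your provider's HTTP 429 onto this."""


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real-time API call."""
    await asyncio.sleep(0.01)
    return f"result for {prompt}"


async def run_with_limit(prompts, call=call_model, max_in_flight=8,
                         max_retries=3, base_delay=0.1):
    """Run all prompts with at most `max_in_flight` calls active at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        for attempt in range(max_retries + 1):
            async with sem:
                try:
                    return await call(prompt)
                except RateLimitError:
                    if attempt == max_retries:
                        raise
            # Back off outside the semaphore so the slot is free for other tasks.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

    # gather() returns results in the same order as `prompts`.
    return await asyncio.gather(*(one(p) for p in prompts))
```

Calling `asyncio.run(run_with_limit(prompts, max_in_flight=8))` returns one result per prompt, in input order. Set `max_in_flight` from your account's rate limit rather than from the size of the job.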

What batch is great for

Embedding a large corpus. Embedding 1M chunks of about 500 tokens each (500M tokens) at $0.02/Mtok is $10; at batch pricing, $5.
Backfill / migration. Re-summarizing or re-classifying old data after a prompt change.
Eval runs. A 500-case eval against three models = 1,500 calls; batch costs half.
Synthetic data generation. Generate 10k training examples overnight at half price.
Newsletter / digest jobs. Daily summaries of yesterday's data.
Nightly enrichment. Score every signup of the day; tag every support ticket from the prior shift.
Periodic re-grading. Score every conversation from the prior week against your latest rubric.

What batch is NOT for

Real-time chat.
Anything where a user is actively waiting.
Sub-minute SLAs.
Workflows where you need the result to continue an interactive flow.

Minimum integration

OpenAI Batch — three calls:

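import json
import time

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
# docs: an iterable of strings to summarize (assumed to be defined by the caller)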

# 1. Build a JSONL file of requests
with open("input.jsonl", "w") as f:
    for i, doc in enumerate(docs):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.1-mini",
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
            }
        }) + "\n")

# 2. Upload + submit
batch_file = client.files.create(file=open("input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

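# 3. Poll until the batch reaches a terminal state, then download (typically 5min–4h, not 24h)
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch.status != "completed":
    raise RuntimeError(f"Batch {batch.id} ended with status {batch.status}")

results = client.files.content(batch.output_file_id)  # JSONL, one line per custom_id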
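
Once the output file is downloaded, each line has to be matched back to its request and checked for failure. A minimal sketch, assuming the output line shape OpenAI documents for batch results (`custom_id`, a `response` object with `status_code` and `body`, and an `error` field); `split_batch_output` is a hypothetical helper name, not part of the SDK:

```python
import json


def split_batch_output(jsonl_text: str):
    """Split batch output into ({custom_id: text}, {custom_id: error_info})."""
    ok, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        response = row.get("response") or {}
        if row.get("error") is None and response.get("status_code") == 200:
            # Chat-completions body: take the first choice's message text.
            ok[row["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
        else:
            # Keep whatever error detail is present so the line can be retried.
            failed[row["custom_id"]] = row.get("error") or response.get("body")
    return ok, failed
```

With the snippet above, this would be called as `ok, failed = split_batch_output(results.text)`, assuming the downloaded content exposes its body as `.text`. The `failed` keys are the `custom_id`s to resubmit in a follow-up batch.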

Anthropic Message Batches follow the same idea, but requests are passed inline instead of uploaded as a file:

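import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# texts: an iterable of strings to classify (assumed to be defined by the caller)
batch = client.messages.batches.create(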
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            }
        }
        for i, text in enumerate(texts)
    ]
)
# Poll batch.id; results come back as JSONL with the same custom_ids.
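
The polling step for the Anthropic batch can be sketched as follows. This assumes the Python SDK's `messages.batches.retrieve` and `messages.batches.results` calls, a `processing_status` of `"ended"` when the batch is finished, and per-entry result types such as `"succeeded"` and `"errored"`; `wait_and_collect` is a hypothetical helper name and the 60-second default is an assumption.

```python
import time


def wait_and_collect(client, batch_id, poll_seconds=60):
    """Poll a message batch until it ends, then return (ok, failed) by custom_id."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    ok, failed = {}, {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            # First content block of the returned message holds the text.
            ok[entry.custom_id] = entry.result.message.content[0].text
        else:
            # Any other result type (errored, canceled, expired): record it for retry.
            failed[entry.custom_id] = entry.result.type
    return ok, failed
```

Called as `ok, failed = wait_and_collect(client, batch.id)`, it returns the classification text per `custom_id` plus the IDs that need another attempt.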

Pricing & cost notes (May 2026)

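A worked example. Classifying 50,000 support tickets through Haiku, assuming roughly 500 input and 20 output tokens per ticket at about $1/Mtok input and $5/Mtok output (25M input tokens, 1M output tokens):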

Real-time: ~$25 input + $5 output ≈ $30.
Batch (50% off): ≈ $15.

Same for embeddings. A 50M-token corpus at OpenAI batch:

Real-time embeddings: $1.00.
Batch: $0.50.

The savings are linear and meaningful at scale. For a startup spending $10k/mo on LLM API calls, moving 40% of traffic to batch saves $2k/mo with one engineer-day of work.
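
The arithmetic in this section is simple enough to keep as a helper. A small sketch, with hypothetical function names and the 50% discount as a default you should check against current pricing:

```python
def token_cost(tokens, price_per_mtok, discount=0.0):
    """Cost in dollars for `tokens` at a per-million-token price, after discount."""
    return tokens / 1_000_000 * price_per_mtok * (1 - discount)


def monthly_batch_savings(monthly_spend, batch_share, discount=0.5):
    """Dollars saved per month by moving `batch_share` of spend to batch pricing."""
    return monthly_spend * batch_share * discount


# The embedding example: a 50M-token corpus at $0.02/Mtok.
realtime = token_cost(50_000_000, 0.02)              # about $1.00
batch = token_cost(50_000_000, 0.02, discount=0.5)   # about $0.50

# The startup example: $10k/mo with 40% of traffic moved to batch.
saved = monthly_batch_savings(10_000, 0.40)          # about $2,000
```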

Patterns worth knowing

Batch by time-of-day. Submit at 11pm; you almost always have results before standup.
Mix batch + realtime. Realtime for the user-facing call, batch for the offline reprocessing. Same model, different price.
Per-batch dedupe. If two rows produce the same prompt, only include it once and fan the result out.
Treat the JSONL as input to a workflow step. Wrap submission + polling + download in a durable workflow (Inngest, Temporal) so a deploy doesn't lose the batch ID.
Save the custom_id-to-row mapping. When the batch returns, you need to know which output goes with which input.
Self-hosted equivalent: vLLM with continuous batching gives the same effective discount at runtime on open models.
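
The dedupe and mapping patterns above fit together naturally: derive the `custom_id` from the prompt itself, send each distinct prompt once, and keep a row-to-ID map for the fan-out. A minimal sketch, assuming the OpenAI-style request line shown earlier; `build_deduped_requests` and `fan_out` are hypothetical helper names, and the model name is supplied by the caller.

```python
import hashlib


def build_deduped_requests(rows, make_prompt, model):
    """rows: dict of row_key -> row. Returns (request_lines, row_to_custom_id)."""
    requests = {}
    row_to_id = {}
    for key, row in rows.items():
        prompt = make_prompt(row)
        # Identical prompts hash to the same custom_id, so they are sent once.
        custom_id = "p-" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
        row_to_id[key] = custom_id
        if custom_id not in requests:
            requests[custom_id] = {
                "custom_id": custom_id,
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": model,
                         "messages": [{"role": "user", "content": prompt}]},
            }
    return list(requests.values()), row_to_id


def fan_out(results_by_id, row_to_id):
    """Copy each result back to every row that shared the prompt (None if missing)."""
    return {key: results_by_id.get(cid) for key, cid in row_to_id.items()}
```

Persist `row_to_id` next to the batch ID (a database table or a JSON file is enough) so the mapping survives a restart between submission and download.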

Pitfalls

Using batch for user-facing requests. "Within 24 hours" includes the worst case. Users will not wait.
Submitting one huge batch instead of many medium ones. Provider limits cap a batch at a couple hundred MB anyway (200MB per file for OpenAI), and a batch near that cap is one big retry surface and one slow poll loop. Chunk into 10–50MB batches.
No custom_id. You lose the mapping from input to output. Always set it.
Forgetting to download before the retention window (typically 29 days). After that, results are gone.
No retry strategy for failed lines. Batches return per-line success/failure. Handle the failures.
Doing batch by hand-rolling concurrency to the realtime API. You'll hit rate limits and pay full price. Use the actual batch endpoint.
Not tagging spend. Without a metadata field, you can't tell batch vs realtime in your billing dashboard.
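
Chunking is easy to get wrong by hand, so here is a small sketch of splitting request lines into medium-sized batches. The 50MB and 50,000-request defaults are assumptions drawn from the guidance above and should be set below your provider's actual limits; `chunk_jsonl_lines` is a hypothetical helper name.

```python
def chunk_jsonl_lines(lines, max_bytes=50 * 1024 * 1024, max_requests=50_000):
    """Yield lists of JSONL lines, each list within the byte and request limits."""
    chunk, size = [], 0
    for line in lines:
        if not line.endswith("\n"):
            line += "\n"
        n = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would break either limit.
        if chunk and (size + n > max_bytes or len(chunk) >= max_requests):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += n
    if chunk:
        yield chunk
```

Each yielded chunk can be written to its own file and submitted as a separate batch, so a failure only forces a retry of that chunk. A single line larger than `max_bytes` still goes out in a chunk of its own rather than being dropped.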
