Open-weight models
This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.
In one line: Models whose weights you can download. You can run them on your own GPUs, on a managed inference provider, or even on a laptop. The freedom is real; so is the operational cost.
"Open-weight" doesn't mean "open-source" the way Linux is — most of these licenses come with restrictions, and you almost never get the training data. What you do get is the model file: a multi-gigabyte bag of numbers you can load into your own inference server (vLLM, Ollama) or hand to a hosting provider (Together, Fireworks, Groq, Replicate). The pitch is privacy, control, and cost at scale. The catch is that you now own a piece of ML infrastructure.
The major families (June 2026)
| Family | Maintainer | Top sizes | Best for | License notes |
|---|---|---|---|---|
| DeepSeek V4 | DeepSeek | 284B & 1.6T MoE (small active) | Reasoning, agentic coding, cost-efficiency, 1M context | MIT |
| Qwen 3 / 3.5 | Alibaba | 7B–235B MoE, dual thinking modes | Multilingual (esp. CJK), reasoning | Apache 2.0 (open tier) |
| Llama 4 | Meta | Scout (10M ctx), Maverick (1M ctx) | General workhorse, broad community | Llama Community License (commercial OK with caps) |
| Mistral Large 3 / Ministral 3 | Mistral AI | 3B–14B dense, 675B MoE | European workloads, MoE efficiency | Apache 2.0 |
| GLM / MiniMax / Kimi | Zhipu / MiniMax / Moonshot | Large MoE | Frontier-adjacent open weights | Mostly permissive |
| Gemma 3 | Google | 2B, 9B, 27B | On-device, edge | Gemma license (commercial OK) |
- Meta pivoted away from open weights at the frontier. In April 2026 Meta's new Superintelligence Labs shipped Muse Spark — a closed, API-only model that succeeds Llama 4. The Llama 4 herd (Scout, Maverick) remains downloadable, but don't expect a future open Llama flagship. If you picked Llama for "the open default," re-evaluate.
- The open-weight frontier is now mostly Chinese. As of mid-2026 the strongest open-weight models on public leaderboards come from DeepSeek (V4), Alibaba (Qwen), Zhipu (GLM), MiniMax, and Moonshot (Kimi) — with Google's Gemma and NVIDIA's Nemotron the main Western entries. The best open models trail the closed frontier by a small margin and win decisively on cost.
Default pick for most teams
Don't. If you're under ~$5k/month in API spend and you don't have a hard compliance requirement, stay on closed providers. The operational cost of running an open model in production — GPU procurement, autoscaling, quantization tuning, eval drift — almost always swamps the API savings.
If you've decided you need open weights, a mid-size Llama 4 or Qwen 3 model via Together AI or Fireworks is the no-think default. You get a serverless endpoint, predictable pricing, and zero infra. For reasoning- and coding-heavy tasks, DeepSeek V4 is the open price/quality leader as of June 2026.
When to deviate (i.e. when open weights actually make sense)
- Data residency / privacy prevents sending content to a US-hosted closed provider. Self-host inside your VPC.
- Cost at scale. Past ~100M tokens/day, self-hosted Llama on rented GPUs can beat hosted APIs by 3–5×.
- Customization. You want to fine-tune a model you fully control, including for retention/IP reasons.
- Latency floor. Groq serves 70B-class Llama models at 400+ tokens/sec, faster than any closed provider manages today.
- Air-gapped deployments. Defense, healthcare, certain regulated finance. No cloud allowed; weights on disk is the only option.
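The default and its exceptions can be written down as a rough checklist. This is a minimal sketch, not a policy: the `Workload` fields and the `recommend` function are hypothetical names, and the thresholds are this page's ballpark figures (~$5k/month, ~100M tokens/day), not hard limits.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a team's situation; field names are illustrative."""
    monthly_api_spend_usd: float
    tokens_per_day: float
    data_must_stay_in_house: bool = False  # residency / privacy constraint
    air_gapped: bool = False               # no cloud access at all
    needs_owned_fine_tune: bool = False    # fine-tune a model you fully control
    needs_lowest_latency: bool = False     # token throughput is the bottleneck


def recommend(w: Workload) -> str:
    """Apply the page's rules of thumb in priority order."""
    # Hard compliance requirements override any cost argument.
    if w.air_gapped or w.data_must_stay_in_house:
        return "self-host open weights"
    # Below the spend threshold, operational cost swamps API savings.
    if w.monthly_api_spend_usd < 5_000:
        return "stay on closed APIs"
    # Above it, scale, customization, or latency can justify a closer look.
    if w.tokens_per_day >= 100e6 or w.needs_owned_fine_tune or w.needs_lowest_latency:
        return "evaluate open weights"
    return "stay on closed APIs"
```

For example, `recommend(Workload(monthly_api_spend_usd=2_000, tokens_per_day=1e6))` falls through to the closed-API default, while setting `air_gapped=True` on the same workload short-circuits to self-hosting.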
How most teams adopt open weights
A three-step ramp that almost everyone follows:
1. Start on a closed API. Ship the feature. Learn the prompt. Build the eval suite.
2. Move one expensive sub-task to an open model via a managed provider. Classification, summarization, embedding: the high-volume, lower-stakes pieces. You get cost savings without owning GPUs yet.
3. Self-host on vLLM only when (a) the bill from step 2 is large enough to justify a platform team, or (b) compliance forces you off shared infrastructure.
Most teams never get past step 2. That's fine.
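In practice, step 2 amounts to a small routing table: a few named sub-tasks go to the open model, and everything else stays on the closed API. The sketch below assumes task names, backend labels, and model IDs that are placeholders, not real catalog entries.

```python
# Hypothetical backend labels and model IDs; substitute your providers' real ones.
CLOSED = {"backend": "closed-api", "model": "closed-frontier-model"}
OPEN = {"backend": "managed-open", "model": "open-mid-size-model"}

# High-volume, lower-stakes sub-tasks that have passed the eval suite on the open model.
OPEN_TASKS = {"classification", "summarization"}


def route(task: str) -> dict:
    """Pick a backend for a sub-task; anything not explicitly moved stays closed."""
    return OPEN if task in OPEN_TASKS else CLOSED
```

Defaulting unknown tasks to the closed API keeps the migration opt-in: a sub-task moves only after it is added to `OPEN_TASKS`.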
Minimum integration
Open models served by a managed provider expose the same request and response shape as OpenAI's API; that's the whole point of the OpenAI-compatible API convention.
# Together AI: drop-in OpenAI client
import os
from openai import OpenAI
# Only the API key and base URL differ from a stock OpenAI setup.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
# Model IDs are provider-specific and change often; check the provider's catalog.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4",
    messages=[{"role": "user", "content": "Explain MoE in one sentence."}],
)
print(response.choices[0].message.content)
For local dev, Ollama is one command:
ollama run llama3.3:70b # downloads + serves on localhost:11434, OpenAI-compatible API
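Because every path here speaks the same API, switching between a managed provider and a local Ollama server can be reduced to a lookup. This is a sketch under assumptions: the Together URL is the one used above, while the Fireworks, Groq, and Ollama base URLs and all environment variable names should be verified against each provider's current documentation. `client_kwargs` and `make_client` are hypothetical helper names.

```python
import os

# (base_url, env var holding the API key). Verify the URLs against provider docs.
PROVIDERS = {
    "together": ("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
    "fireworks": ("https://api.fireworks.ai/inference/v1", "FIREWORKS_API_KEY"),
    "groq": ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "ollama": ("http://localhost:11434/v1", None),  # local server, no real key
}


def client_kwargs(provider: str) -> dict:
    """Return the keyword arguments to pass to openai.OpenAI for a provider."""
    base_url, key_var = PROVIDERS[provider]
    # The OpenAI client wants a non-empty key even when the server ignores it.
    api_key = os.environ[key_var] if key_var else "ollama"
    return {"base_url": base_url, "api_key": api_key}


def make_client(provider: str):
    """Build an OpenAI client pointed at the chosen provider."""
    from openai import OpenAI  # lazy import: the config helper has no dependency
    return OpenAI(**client_kwargs(provider))
```

With this in place, `make_client("ollama")` during local development and `make_client("together")` in production differ only in configuration. The model ID still has to change per provider.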
Pricing & cost notes (June 2026 ballpark)
| Path | Cost shape | Example for a ~70B-class open model |
|---|---|---|
| Managed (Together, Fireworks) | $0.40–$1.20 / Mtok blended | ~$0.88 / Mtok blended |
| Groq (fast) | $0.40–$1.00 / Mtok | ~$0.80 / Mtok, 400+ tok/s |
| Self-hosted (your GPUs) | $/GPU-hour ÷ throughput | A100 80GB ~$1.50/hr; ~$0.15/Mtok at saturation |
| Self-hosted (idle) | Same $/GPU-hour, no requests | $$$$ — idle GPUs are pure burn |
The trap: self-hosting is cheap only at saturation. A GPU sitting at 10% utilization costs more per token than a hosted API. Autoscaling on Modal or RunPod helps; better still is to combine open and closed providers, so the self-hosted model takes just the steady traffic that keeps its GPUs busy and the rest goes to an API.
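The saturation point can be checked with a few lines of arithmetic. The sketch below uses only the table's ballpark figures ($1.50/hr for an A100 80GB, ~$0.15/Mtok at saturation, ~$0.88/Mtok hosted). It assumes cost scales inversely with utilization and ignores engineering time, storage, and networking.

```python
GPU_USD_PER_HOUR = 1.50        # A100 80GB rental, from the table
SATURATED_USD_PER_MTOK = 0.15  # self-hosted cost at 100% utilization, from the table
HOSTED_USD_PER_MTOK = 0.88     # managed-provider example, from the table

# Implied throughput at saturation: $1.50/hr divided by $0.15/Mtok = 10 Mtok per hour.
SATURATED_MTOK_PER_HOUR = GPU_USD_PER_HOUR / SATURATED_USD_PER_MTOK


def self_hosted_usd_per_mtok(utilization: float) -> float:
    """Effective cost per million tokens when the GPU is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return GPU_USD_PER_HOUR / (SATURATED_MTOK_PER_HOUR * utilization)


# Utilization at which self-hosting costs the same per token as the hosted API.
BREAK_EVEN_UTILIZATION = SATURATED_USD_PER_MTOK / HOSTED_USD_PER_MTOK

for u in (1.0, 0.5, 0.1):
    print(f"{u:>4.0%} utilization: ${self_hosted_usd_per_mtok(u):.2f} / Mtok")
print(f"break-even at {BREAK_EVEN_UTILIZATION:.0%} utilization")
```

With these inputs the break-even sits at roughly 17% utilization, and at 10% utilization the effective cost is about $1.50 / Mtok, well above the hosted price.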
Pitfalls
- Picking open weights for "control" without measuring the API bill first. The closed provider's bill is almost always less than one platform engineer's salary.
- Skipping the eval suite when you swap. Llama 70B is not Sonnet. Behavior shifts on edge cases, formatting, tool-call compliance. Re-run your evals before promoting.
- Quantizing too aggressively. INT4 saves memory but tanks reasoning quality on hard tasks. Measure, don't assume.
- Self-hosting on a single GPU node. No autoscaling, no failover, no observability. The first incident teaches a lesson you didn't need to learn.
- Trusting MMLU scores. Open-model benchmark sheets are even more gamed than closed ones. The "this 7B beats GPT-4" claim almost never survives contact with your actual workload.
- Ignoring license clauses. Llama's Community License has a 700M-MAU cap. Cohere's Command R+ is non-commercial. Read the license before you build a product on it.
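On the quantization pitfall, the temptation is easy to see from the memory arithmetic alone. The sketch below estimates the footprint of the weights only. It ignores the KV cache, activations, and the small overhead of quantization scales, so real requirements are higher. `weight_memory_gb` is a hypothetical helper name.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("fp16", "int8", "int4"):
    print(f"70B @ {precision}: ~{weight_memory_gb(70, precision):.0f} GB")
```

A 70B model needs about 140 GB at FP16 but about 35 GB at INT4, which is the difference between multiple GPUs and a single 80 GB card. That saving is why the quality cost has to be measured on your own evals, not assumed away.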
→ Next: Inference servers