Inference servers

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: If you self-host an open model, an inference server is what actually loads the weights, batches requests, and serves the tokens. vLLM is the production default in 2026.

In plain English

A model file (the "weights") is just numbers on disk — it can't answer requests by itself. An inference server is the program that loads those weights into GPU memory, accepts HTTP requests, runs the math, and streams tokens back. Picking the right one is mostly about throughput (how many concurrent requests can you handle per GPU) and operational maturity (does it crash at 3am).

The major options (2026)

Server	Best for	Languages	Concurrency model	Notes
vLLM	Production self-hosting	Python (server), any client	PagedAttention + continuous batching	The default. Broadest model support.
SGLang	Structured output, tool flows	Python	RadixAttention	Often beats vLLM on prefix-heavy workloads
TGI (HF)	Hugging Face ecosystem	Rust + Python	Continuous batching	Solid; less momentum than vLLM in 2026
TensorRT-LLM	NVIDIA-only, lowest latency	C++ / Python	Custom kernels per model	Fastest, hardest to operate
Ollama	Laptop / dev	Go	Simple queue	Not for prod. Perfect for local.
llama.cpp	CPU / Mac / edge	C++	Single-threaded per request	GGUF quantization, runs on a phone
MLX (Apple)	Apple Silicon dev	Python / Swift	Unified memory	Mac-only; fast on M-series
Provider-managed	Skip all of this	—	—	Together, Fireworks, Groq, Replicate, Modal

Default pick for most teams

Don't self-host. Use a managed inference provider — Together, Fireworks, or Groq — and you get vLLM-class performance without operating it. You pay per token, not per GPU-hour.

If you've decided to self-host, vLLM on Modal or RunPod is the path of least resistance. You write a 20-line Modal function, point it at a Llama or Mistral checkpoint, and get a scaling endpoint. For local development and laptop demos, Ollama.

When to deviate

Structured-output-heavy workload (lots of JSON schemas, tool calling, constrained generation): SGLang has better primitives for this than vLLM.
Sub-50ms first-token latency on a single model: TensorRT-LLM with a pre-compiled engine — but be ready to maintain it.
Edge / on-device inference: llama.cpp with a Q4_K_M quantized model, or MLX on Apple Silicon.
Hugging Face-native ops (Inference Endpoints, Spaces): TGI integrates more cleanly than vLLM there.
You need the absolute cheapest hosted endpoint and latency is fine: Groq for LPU speed, DeepInfra for cost.

Minimum integration

Local dev with Ollama:

# One line to a working OpenAI-compatible endpoint:
ollama run llama3.3:70b
# Now POST to http://localhost:11434/v1/chat/completions

Production self-host with vLLM on Modal:

import modal

app = modal.App("llm")
image = modal.Image.debian_slim().pip_install("vllm==0.6.5")

@app.function(image=image, gpu="A100-80GB:2", scaledown_window=300)
@modal.web_server(8000)
def serve():
    import subprocess
    subprocess.Popen([
        "vllm", "serve", "meta-llama/Llama-4-70B-Instruct",
        "--tensor-parallel-size", "2",
        "--max-model-len", "32768",
    ])

That's a production-grade autoscaling endpoint in about 15 lines. Modal cold-starts new replicas when traffic spikes and tears them down after 5 minutes idle.

What you're optimizing

Throughput — total tokens/sec across concurrent requests. Continuous batching is the single biggest win; PagedAttention (vLLM) and RadixAttention (SGLang) extend it further.
Latency — time to first token (TTFT) for the UX, time per output token (TPOT) for the streaming feel.
Cost per million tokens at your real traffic shape, not at saturation.
Memory efficiency — quantization (FP8, INT8, INT4) trades quality for headroom. Measure before promoting.

Pricing & cost notes

A practical rule from production deployments:

Hosted (Together / Fireworks): $0.60–$1.50 / Mtok blended for Llama-class 70B.
Self-hosted vLLM at saturation: roughly half that, if your GPUs stay busy.
Self-hosted vLLM at 20% utilization: more expensive than hosted, because you pay the full GPU-hour either way.

The breakeven against managed providers is roughly 200M tokens/day sustained. Below that, hosted is cheaper and saner. Above that, the spreadsheet starts to favor self-hosting — assuming you have someone to operate it.

Pitfalls

Treating Ollama as a production server. Ollama serializes requests and is missing the batching that makes vLLM economical. Demos and local dev only.
Running vLLM without --max-model-len. The default is the model's max (e.g. 128k), which reserves enormous KV-cache memory. Set it to what you actually use; you'll get 5–10× more concurrency.
Single-GPU self-host with no failover. One CUDA OOM and the endpoint dies. At minimum: two replicas behind a load balancer, autoscaling on Modal/RunPod, or a managed provider.
Skipping warmup. Cold-start on a 70B model is 30–90 seconds. A request hitting a cold replica times out. Either pre-warm replicas or set generous client timeouts.
Hand-rolling quantization without re-evaluating. FP16 → FP8 → INT4 each costs a few quality points on hard reasoning. Run your eval suite after every quantization change.
Forgetting that "OpenAI-compatible" is partial. Tool calling, structured output, vision, and image gen vary by server and model. Test the specific features you use.

🤔 Quick checkQuick check

→ Next: Embedding models

The major options (2026)​

Default pick for most teams​

When to deviate​

Minimum integration​

What you're optimizing​

Pricing & cost notes​

Pitfalls​