Skip to main content

Observability tools

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: What Datadog and Sentry are for traditional services, these are for LLM calls — per-request prompt + response + tokens + cost + latency, plus traces across multi-step agent flows.

In plain English

LLM observability is just logging — but with three twists. First, the payloads are huge (a 32k-token prompt is not a one-line log). Second, the cost matters per-request, so you want $$ on every span. Third, multi-step agent runs need a trace tree, not a flat log. Without observability, you can't answer "why did this user get a bad answer?" because you don't have the prompt that produced it. With it, you have a one-click replay button into the exact failure.

The major options (2026)

ToolTypeOSS?Strengths$ shape
LangfuseSelf / hostedyes (MIT)OSS default; combines obs + evals + prompt mgmtFree OSS / Cloud free + paid tiers
HeliconeProxy + asyncpartial OSSEasiest setup (one-line proxy); cachingFree up to 100k req / paid
LangSmithHostednoLangChain-native; rich trace UIFree + $39/seat
Arize PhoenixOSS + Arize cloudyesBroad ML + LLM; OpenTelemetry-nativeFree OSS / cloud paid
BraintrustHostedpartialEval-first; observability includedHobby free / usage
Honeycomb / Datadog / GrafanaExisting APM + OTelvariesPipes LLM spans into existing platformExisting APM bill
OpenLLMetry (Traceloop)OSS SDKyesOpenTelemetry semantics for LLMsFree SDK + cloud option
W&B WeaveHostedpartialStrong on experiment trackingFree + paid

Default pick for most teams

Langfuse, self-hosted or cloud. Open-source, generous free tier, captures everything (prompts, tools, responses, costs, latency), and the trace UI is good enough for production debugging.

If you want the absolute fastest setup ("change one line, get observability"), Helicone in proxy mode is hard to beat — point your OpenAI/Anthropic SDK at Helicone's base URL and you're done.

If you're already in a serious APM platform (Datadog, Honeycomb, Grafana), use OpenLLMetry or the platform's native LLM integration and skip the dedicated tool.

When to deviate

  • You want obs + evals + prompt management in one tool: Langfuse or Braintrust — they bundle.
  • You want OTel-native everything so LLM spans land in the same trace as your HTTP spans: OpenLLMetry + your existing APM, or Arize Phoenix.
  • You already use LangChain: LangSmith is the integrated pick.
  • You need caching at the proxy (identical prompt → cached response): Helicone does this and observability in one.
  • You need PII redaction at the observability layer: Helicone and Langfuse both have it; enterprise tools (Patronus, Lakera) extend it.
  • Highly regulated, on-prem: self-hosted Langfuse or Arize Phoenix.

Minimum integration

Langfuse — wrap the SDK:

from langfuse.openai import openai # drop-in: same API, traced

response = openai.chat.completions.create(
model="gpt-5.1",
messages=[{"role": "user", "content": "Hello"}],
)
# Trace appears in Langfuse with prompt, response, tokens, cost, latency.

Helicone — change base URL only:

from openai import OpenAI

client = OpenAI(
api_key=os.environ["OPENAI_API_KEY"],
base_url="https://oai.helicone.ai/v1",
default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
# Every call now logged to Helicone, nothing else changes.

OpenLLMetry — OpenTelemetry semantics:

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my-app")
# Now any anthropic / openai / langchain call emits OTel spans to your APM.

What good observability captures

  • Full prompt (system + messages + tool definitions) per call.
  • Full response (content + tool calls + finish reason).
  • Tokens in / out / cached, $ cost, model, temperature, top_p.
  • Latency: TTFT (time to first token) and total.
  • Trace ID grouping every call in a conversation / agent run / pipeline.
  • User feedback (thumbs, regenerates, copies) joined back to the trace.
  • Per-user / per-tenant / per-feature dimensions for cost slicing.
  • Errors, retries, fallback hops.

What to actually do with it

  • Cost dashboards. By feature, by tenant, by model. Catch surprises early.
  • Quality regressions. When LLM-as-judge scores drop, drill straight to the failing traces.
  • Replay. Re-run a logged call with a new prompt — fastest debugging loop in AI.
  • Eval seeding. Promote interesting prod traces to eval cases in one click.
  • User-feedback joins. Filter for traces where the user clicked thumbs-down. That's your bug list.
  • Alerts. P99 latency > 8s, daily spend > threshold, error rate spike.

Pricing & cost notes

ToolFree tierPaid starts at
Langfuse Cloud50k events/mo$29/mo Team
Langfuse Self-hostunlimitedyour hosting cost
Helicone100k req/mo$25/mo
LangSmith5k traces$39/seat
Arize Phoenix OSSunlimitedfree
BraintrustHobbyusage
Datadog LLM Obsexisting APMincluded in APM tier

Observability is one of the cheapest investments in the AI stack — measured in tens or low hundreds per month for most teams — and it pays back the first time you debug a real production issue.

Pitfalls

  • No observability at all. "I'll add it when we ship" → six months in, you're debugging by re-running the user's question by hand.
  • Logging the prompt at app-level INFO. PII and secrets land in your log aggregator. Use the dedicated tool with proper redaction.
  • Sampling on the floor. "We sample 1% of requests" → the one bad trace the user is asking about is exactly the one you dropped.
  • Trace IDs that don't propagate across services. An agent that spans three microservices needs one root trace ID. Pass it through.
  • No cost on the span. Token counts without dollars are not actionable. Calculate cost at log time using current provider pricing.
  • No user-feedback signal. Without thumbs / regenerates joined to traces, you're guessing at "is this a bug?"
  • Treating observability and evals as the same tool. Some platforms bundle them; that's fine, but conceptually they answer different questions ("what happened in prod?" vs "did this change make things better?"). Don't confuse the two.
🤔 Quick checkQuick check

→ Next: AI gateways