Observability for AI Agents in Production

Applying observability to agentic systems

AI agents fail differently from traditional services. A request can return HTTP 200, finish within the latency budget, and still be wrong, unsafe, too expensive, or impossible to debug. Traditional APM covers infrastructure health. Agent observability also has to show decisions, tool calls, retrieval, loops, and cost.

This series shows how to apply observability to production agentic systems. It starts with traces, spans, sessions, cost, safety, and metadata-first telemetry. Then it applies those ideas in a LangGraph + OpenTelemetry + Langfuse stack that shows what the agent did, why it did it, how much it cost, which prompt and dataset versions were involved, and where it needs attention.

What the series covers

  • A trace-first view of decisions, tool chains, retrieval, cost, and safety events.
  • A metadata-over-content strategy that protects user data while preserving enough signal to debug production behavior.
  • A prioritized signal list for LLM calls, tools, retrieval, sessions, loops, guardrails, budgets, and handoffs.
  • A practical implementation path using LangGraph, OpenTelemetry semantic conventions, and Langfuse.
  • An operating workflow that covers Langfuse sessions, users, prompt management, scores, evaluators, annotation queues, datasets, and experiments, plus dashboards, alerts, and runbook patterns for running agents day to day.

The applied chapters use Python and LangGraph. The same instrumentation approach can be adapted to other agent frameworks and OTLP-compatible backends. The aim is simple: make agent behavior auditable without turning your observability system into a second copy of user data.
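As a rough illustration of the metadata-over-content idea, the sketch below builds span attributes for one LLM call without storing the prompt or completion text. It uses only the Python standard library. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions. The `app.*` keys, the function name, and the model name are hypothetical and chosen only for this example; they are not taken from the series.

```python
import hashlib


def metadata_only_attributes(model: str, prompt: str, completion: str,
                             input_tokens: int, output_tokens: int) -> dict:
    """Describe an LLM call with metadata only; no raw text is kept."""

    def fingerprint(text: str) -> str:
        # A short SHA-256 prefix lets identical inputs be correlated
        # across traces without the text being readable from telemetry.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

    return {
        # Attribute names from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attributes (assumed names for illustration).
        "app.prompt.sha256_prefix": fingerprint(prompt),
        "app.prompt.chars": len(prompt),
        "app.completion.chars": len(completion),
    }


attrs = metadata_only_attributes(
    model="example-model",  # placeholder model name
    prompt="My email is a@b.com",
    completion="Noted.",
    input_tokens=7,
    output_tokens=2,
)
```

The resulting dictionary can be attached to a span in whichever tracing backend is in use. A plain hash is not anonymization for short or guessable inputs, so treat the fingerprint as a correlation aid, not a privacy guarantee.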


Module 00 - Foundations

The production framing, the telemetry structure, and the signals worth tracking first.

Module 01 - Safety, Privacy & Evaluation

What not to capture, how PII leaks into telemetry, and how evaluation becomes an observability signal.

Module 02 - Security & Cost Controls

Guardrail telemetry, security signals, cost attribution, budget limits, and loop controls.

Module 03 - Hands-on Implementation

LangGraph, OpenTelemetry, Langfuse, LLM spans, tool spans, retrieval spans, and runtime hardening.

Module 04 - Langfuse Workflow

Sessions, users, prompt management, Playground, scores, evaluators, annotation, datasets, and experiments.

Module 05 - Operations

Telemetry tests, pipeline resilience, dashboards, alerts, release gates, and runbooks for operating AI agents day to day.

Module 06 - Advanced Workflows

Subgraphs, subagents, handoff traces, streaming graph updates, and production-safe live workflow visibility.

Appendix

  • Glossary and References
    Definitions, standards, official documentation, and research for the Observability for AI Agents series.

License & Attribution
This series content is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, including commercial use, as long as you give appropriate credit.

Please cite as: "Observability for AI Agents in Production" by William Oliveira - woliveiras.com