Observability for AI Agents in Production

Applying observability to agentic systems

AI agents fail differently from traditional services. A request can return HTTP 200, finish within the latency budget, and still be wrong, unsafe, too expensive, or impossible to debug. Traditional APM covers infrastructure health. Agent observability also has to show decisions, tool calls, retrieval, loops, and cost.

This series shows how to apply observability to production agentic systems. It starts with traces, spans, sessions, cost, safety, and metadata-first telemetry. Then it applies those ideas in a LangGraph + OpenTelemetry + Langfuse stack that shows what the agent did, why it did it, how much it cost, which prompt and dataset versions were involved, and where it needs attention.

What the series covers

A trace-first view of decisions, tool chains, retrieval, cost, and safety events.
A metadata-over-content strategy that protects user data while preserving enough signal to debug production behavior.
A prioritized signal list for LLM calls, tools, retrieval, sessions, loops, guardrails, budgets, and handoffs.
A practical implementation path using LangGraph, OpenTelemetry semantic conventions, and Langfuse.
A Langfuse workflow covering sessions, users, prompt management, scores, evaluators, annotation queues, datasets, and experiments, plus dashboards, alerts, and runbook patterns for operating agents day to day.

The applied chapters use Python and LangGraph. The same instrumentation approach can be adapted to other agent frameworks and OTLP-compatible backends. The aim is simple: make agent behavior auditable without turning your observability system into a second copy of user data.
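As a rough illustration of the metadata-over-content idea, the sketch below builds span attributes for one model call without storing the prompt or the response text. It uses only the Python standard library. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions, which are still evolving. The `app.*` keys, the function name, and the sample values are hypothetical and are not taken from the series.

```python
import hashlib


def metadata_only_attributes(
    model: str,
    prompt: str,
    completion: str,
    input_tokens: int,
    output_tokens: int,
) -> dict:
    """Describe a model call with metadata instead of its raw text."""
    return {
        # Attribute names from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attributes: a digest for correlating identical
        # prompts, and sizes for spotting unusually large inputs or outputs.
        "app.prompt.sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "app.prompt.char_count": len(prompt),
        "app.completion.char_count": len(completion),
    }


# Sample values are made up for illustration.
attrs = metadata_only_attributes(
    model="example-model",
    prompt="Where is order 1234 for jane@example.com?",
    completion="Order 1234 ships tomorrow.",
    input_tokens=12,
    output_tokens=6,
)
```

The resulting dictionary could be attached to a span as attributes. A plain hash of short or predictable text can still be guessed by brute force, so a keyed hash or no digest at all may be the safer choice for sensitive prompts.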

Module 00 - Foundations

The production framing, the telemetry structure, and the signals worth tracking first.

Ch 0: Before You Start
Audience, prerequisites, reference stack, and terminology for the observability series.
Ch 1: Why Agent Observability Is Different
Why service health is insufficient for agent systems and which production questions telemetry must answer.
Ch 2: The Telemetry Data Model
A precise model for traces, spans, conversations, tasks, events, metrics, logs, and evaluation results.
Ch 3: Semantic Conventions and Schema Governance
How to adopt evolving OpenTelemetry GenAI conventions, name custom attributes, and migrate telemetry schemas safely.
Ch 4: Context Propagation Across Agent Workflows
Preserve causal context through async tasks, queues, durable execution, conversation turns, and multi-agent handoffs.
Ch 5: Signals for Agent Systems
A prioritized signal catalog for model calls, tools, retrieval, workflows, tasks, conversations, and fleets.

Module 01 - Safety, Privacy & Evaluation

What not to capture, how PII leaks into telemetry, and how evaluation becomes an observability signal.

Ch 6: Sampling, Cardinality, and Telemetry Cost
Control trace volume, metric cardinality, and storage cost without discarding the evidence needed for incidents.
Ch 7: Content Capture as a Data-Governance Decision
Decide when prompts, outputs, tool payloads, retrieval content, and memory may enter telemetry.
Ch 8: Privacy and PII Controls
Threat-model personal data in agent telemetry and implement layered collection, redaction, access, and retention controls.
Ch 9: Evaluation as an Observability Signal
Connect deterministic checks, human review, LLM judges, production traces, datasets, experiments, and release gates.

Module 02 - Security & Cost Controls

Guardrail telemetry, security signals, cost attribution, budget limits, and loop controls.

Ch 10: Security and Guardrail Telemetry
Observe prompt injection defenses, tool authorization, approvals, data-flow controls, policy decisions, and agent-specific threats.
Ch 11: Cost, Performance, and SLOs
Attribute provider usage, measure critical-path latency, define task budgets, and build outcome-aware SLOs.

Module 03 - Hands-on Implementation

LangGraph, OpenTelemetry, Langfuse, LLM spans, tool spans, retrieval spans, and runtime hardening.

Ch 12: Reference Architecture and Local Setup
Build the local Python, LangGraph, OpenAI, OpenTelemetry Collector, and Langfuse foundation used by the implementation chapters.
Ch 13: Building the OpenTelemetry Pipeline
Configure the Python SDK and OpenTelemetry Collector to export resource-labeled traces to Langfuse.
Ch 14: Instrumenting OpenAI Model Calls
Wrap OpenAI Responses API calls with OpenTelemetry spans that expose latency, token usage, retries, streaming behavior, and provider errors without copying prompt or response content.
Ch 15: Instrumenting Tools, Retrieval, and Memory
Trace tool execution, downstream dependencies, retrieval quality, and memory operations with explicit safety boundaries.
Ch 16: Instrumenting LangGraph and Multi-Agent Workflows
Trace graph nodes, branches, parallel work, checkpoints, durable resumes, and subagent handoffs.
Ch 17: Runtime Hardening and Feedback
Enforce budgets, detect loops, gate content capture, bind approvals, and connect user feedback to traces.

Module 04 - Langfuse Workflow

Sessions, users, prompt management, Playground, scores, evaluators, annotation, datasets, and experiments.

Ch 18: Langfuse Sessions, Users, and Trace Context
Use Langfuse sessions, users, metadata, tags, and versions to turn individual traces into conversation and product views.
Ch 19: Langfuse Prompt Management and Playground
Version prompts in Langfuse, deploy them with labels, test them in Playground, and correlate prompt versions with traces, cost, and scores.
Ch 20: Scores, Feedback, and Quality Signals in Langfuse
Create trace, observation, and session scores in Langfuse from user feedback, deterministic checks, and bounded quality signals.
Ch 21: Evaluators and Human Annotation Workflows
Use Langfuse evaluators, LLM-as-a-judge, and annotation queues to review traces and turn human judgment into calibrated scores.
Ch 22: Datasets, Experiments, and Release Evaluation
Create Langfuse datasets from approved cases, run experiments across prompt and model versions, and use scores as release evidence.

Module 05 - Operations

Telemetry tests, pipeline resilience, dashboards, alerts, release gates, and runbooks for operating AI agents day to day.

Ch 23: Testing and Operating the Telemetry Pipeline
Create regression tests for span structure, privacy invariants, Langfuse context, score contracts, Collector configuration, and telemetry completeness.
Ch 24: Dashboards, Alerts, Release Gates, and Runbooks
Operate the order-status agent with Langfuse views, score analytics, dataset gates, operational alerts, and incident runbooks tied to the demo built in the series.

Module 06 - Advanced Workflows

Subgraphs, subagents, handoff traces, streaming graph updates, and production-safe live workflow visibility.

Ch 25: Subgraphs, Subagents, and Handoff Traces
Instrument LangGraph subgraphs, synchronous subagents, and independently scheduled handoffs without losing trace causality or expanding agent authority silently.
Ch 26: Streaming Graph Updates Without Leaking State
Add a safe LangGraph streaming entrypoint to the demo, expose bounded progress updates, and record stream lifecycle metadata without exporting full graph state.

Appendix

Glossary and References
Definitions, standards, official documentation, and research for the Observability for AI Agents series.

License & Attribution
This series content is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, including commercial use, as long as you give appropriate credit.

Please cite as: "Observability for AI Agents in Production" by William Oliveira - woliveiras.com