Before You Start
This series is for software engineers who build or operate AI agents. It assumes that a model already participates in a workflow that can call tools, retrieve data, maintain state, or hand work to another component. The goal is to make that workflow explainable in production without copying every prompt and response into a telemetry backend.
The conceptual chapters apply to most agent frameworks and model providers. The implementation chapters use one concrete stack so every command and code sample can be followed end to end.
What you need to know
The series assumes working knowledge of:
- Python applications and virtual environments.
- HTTP APIs and JSON payloads.
- Structured logs, metrics, and distributed traces.
- LLM inference, tool calling, retrieval-augmented generation, and conversation state.
OpenTelemetry terminology is introduced as it is used. Prior experience with an OpenTelemetry SDK or Collector is helpful but not required.
Reference stack
| Layer | Choice in this series | Why it is here |
|---|---|---|
| Runtime | Python 3.11 or newer | Current async context propagation and typing support. |
| Agent orchestration | LangGraph | Explicit nodes, state, branches, and durable execution. |
| Model provider | OpenAI Responses API | Typed response items, tool calls, streaming events, and usage data. |
| Telemetry | OpenTelemetry | Vendor-neutral APIs, context propagation, OTLP, and semantic conventions. |
| Collector | OpenTelemetry Collector | Central processing, filtering, batching, and export. |
| Backend | Langfuse | Trace inspection and evaluation features for LLM applications. |
| Local services | Docker Compose | Reproducible development environment. |
OpenTelemetry is the instrumentation contract in this series. The application emits spans, attributes, events, context, and OTLP data through OpenTelemetry APIs and semantic conventions. Langfuse is the backend we use to inspect that evidence, connect it to prompts and evaluations, and operate the agent locally.
This separation is intentional. If application code writes only to one observability product, the telemetry model tends to take the shape of that product’s data model and becomes hard to move. By using OpenTelemetry first, the core evidence stays portable: workflow spans, model calls, tool execution, retrieval, guardrails, costs, and outcomes remain understandable even if the backend changes or another backend is added later. In short: OpenTelemetry defines how the agent describes what happened; Langfuse helps us investigate and operate what happened.
OpenAI is the provider used by the runnable example, not a requirement of the observability model. Provider-specific fields stay at the integration boundary. Application spans and custom attributes remain provider-neutral where a stable convention exists.
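One way to picture the integration boundary is a small translation step that turns a provider payload into a provider-neutral record before anything becomes a span attribute. The sketch below is illustrative only: `ModelCallRecord`, `from_openai_usage`, and `to_span_attributes` are hypothetical names, the usage payload is assumed to be a plain dict with `input_tokens` and `output_tokens` keys, and the `gen_ai.*` attribute names should be checked against the semantic-convention version you pin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCallRecord:
    """Provider-neutral view of one model call (hypothetical type)."""

    model: str
    input_tokens: int
    output_tokens: int


def from_openai_usage(model: str, usage: dict) -> ModelCallRecord:
    """Translate a provider payload at the integration boundary.

    Assumption: `usage` is a dict carrying `input_tokens` and
    `output_tokens`; any other provider-specific keys are ignored here.
    """
    return ModelCallRecord(
        model=model,
        input_tokens=int(usage.get("input_tokens", 0)),
        output_tokens=int(usage.get("output_tokens", 0)),
    )


def to_span_attributes(record: ModelCallRecord) -> dict:
    """Build span attributes from the neutral record only.

    Attribute names follow the GenAI semantic-convention style; verify
    them against the convention version your instrumentation pins.
    """
    return {
        "gen_ai.request.model": record.model,
        "gen_ai.usage.input_tokens": record.input_tokens,
        "gen_ai.usage.output_tokens": record.output_tokens,
    }
```

Because the rest of the application only sees `ModelCallRecord`, swapping the provider would mean writing another `from_*` function rather than touching the span-building code.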
The local Langfuse and Docker Compose setup exists so the examples can be run without creating an external account or paying for an observability backend. Treat it as a learning environment. Running an observability backend in production is a separate platform decision that needs availability, backup, upgrade, security, retention, and access-control planning.
Required tools
Confirm the local toolchain before starting the implementation chapters:
python --version
docker --version
docker compose version
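If you prefer a scripted check, the same confirmation can be sketched in Python using only the standard library. This is a minimal sketch, not part of the reference implementation: `python_ok` and `missing_tools` are hypothetical helper names, and finding `docker` on the PATH does not prove that the Compose plugin is installed, so `docker compose version` is still worth running by hand.

```python
import shutil
import sys

# Minimum interpreter version from the reference stack table.
MIN_PYTHON = (3, 11)


def python_ok(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def missing_tools(tools=("docker",), which=shutil.which) -> list:
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Report only; this sketch never exits with an error code.
    print("python >= 3.11:", python_ok())
    print("missing tools:", missing_tools() or "none")
```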
You also need an OpenAI API key with access to the model configured for the example. Keep the key in a local environment file or secret manager. Never place it in source control, span attributes, logs, exception messages, or screenshots.
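A minimal sketch of that rule, under a few stated assumptions: the key is read from the `OPENAI_API_KEY` environment variable, the error raised when it is missing never contains the key, and a scrubber masks anything that looks like a key before text reaches a log line or exception message. The helper names `load_api_key` and `scrub` are hypothetical, and the `sk-` pattern is an assumption about key shape that you should adjust for your provider.

```python
import os
import re

# Assumption: keys look like "sk-" followed by at least 8 key characters.
_SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


def load_api_key(env=os.environ) -> str:
    """Read the key from the environment; fail without echoing any value."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        # The message names the variable, never its contents.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key


def scrub(text: str) -> str:
    """Mask key-shaped substrings before text is logged or attached to a span."""
    return _SECRET_PATTERN.sub("[REDACTED]", text)
```

Pattern-based scrubbing is a safety net rather than a guarantee; the primary control is still never passing the key to logging or telemetry code in the first place.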
Three labels used throughout the series
Technical guidance becomes misleading when standards, recommendations, and local decisions are presented as equivalent. Each chapter uses these meanings:
| Label | Meaning |
|---|---|
| Standard | A requirement or recommendation defined by a referenced specification. |
| Recommended baseline | A defensible starting point that must be adapted to the system’s risk and workload. |
| Example decision | A choice made for the reference implementation, not a universal rule. |
OpenTelemetry GenAI semantic conventions are still evolving. Attribute names, requirement levels, and stability are checked against the version linked in each chapter. Pin instrumentation versions and treat a semantic-convention upgrade as a schema migration.
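To make the "schema migration" framing concrete, here is a small sketch with two parts: a check that installed packages match their pins, and an explicit rename table applied to attributes when a convention version changes. Everything named here is illustrative: `check_pins`, `migrate_attributes`, and the `example.*` attribute names are hypothetical and do not reflect any actual convention change.

```python
from importlib import metadata

# Hypothetical rename table for one convention upgrade.
# Real entries would come from the semantic-convention changelog.
RENAMES = {"example.tokens.prompt": "example.tokens.input"}


def check_pins(pins: dict, get_version=metadata.version) -> list:
    """Return a list of problems where installed versions differ from pins."""
    problems = []
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if installed != expected:
            problems.append(f"{package}: expected {expected}, found {installed}")
    return problems


def migrate_attributes(attrs: dict, renames: dict) -> dict:
    """Apply the rename table; keys without an entry pass through unchanged."""
    return {renames.get(key, key): value for key, value in attrs.items()}
```

Keeping the rename table in code means dashboards, alerts, and tests that depend on attribute names can be updated in the same change that bumps the pinned version.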
What the series produces
By the end, the reference agent has:
- One end-to-end trace for each task execution.
- Spans for model calls, tool execution, retrieval, workflow nodes, and guardrails.
- Correlation across conversation turns and asynchronous boundaries.
- Metrics for latency, token usage, cost, errors, quality, and safety controls.
- Content capture disabled by default, with explicit redaction and access boundaries when enabled.
- Evaluations linked to production traces and release datasets.
- Dashboards, SLOs, alerts, release gates, and incident runbooks.
- Tests that fail when instrumentation loses context or emits unsafe attributes.
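Two of the outcomes above, content capture that is off by default and tests that fail on unsafe attributes, can be sketched together. This is a simplified illustration, not the series' implementation: the `AGENT_CAPTURE_CONTENT` variable, the `CONTENT_KEYS` set, and the function names are all hypothetical.

```python
import os

# Hypothetical switch and key list, chosen for illustration only.
CAPTURE_ENV = "AGENT_CAPTURE_CONTENT"
CONTENT_KEYS = {"prompt", "completion", "tool.arguments"}


def capture_enabled(env=os.environ) -> bool:
    """Content capture is off unless the switch is explicitly 'true'."""
    return env.get(CAPTURE_ENV, "false").strip().lower() == "true"


def filter_attributes(attrs: dict, redact, env=os.environ) -> dict:
    """Drop content attributes by default; redact them when capture is on."""
    out = {}
    for key, value in attrs.items():
        if key not in CONTENT_KEYS:
            out[key] = value  # non-content attributes always pass through
        elif capture_enabled(env):
            out[key] = redact(value)  # content is never emitted unredacted
    return out


def unsafe_keys(attrs: dict) -> list:
    """Return content keys present in attrs; a test can assert this is empty."""
    return sorted(key for key in attrs if key in CONTENT_KEYS)
```

A test for the default configuration could then assert that `unsafe_keys(filter_attributes(attrs, redact, env={}))` is an empty list, so any instrumentation change that starts emitting content fails the build.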
Next up: Chapter 1, Why Agent Observability Is Different, explains why a healthy endpoint can still represent a failed agent task.