Datasets, Experiments, and Release Evaluation

Datasets are where reviewed failures stop being anecdotes. In Chapters 20 and 21, we created scores, human annotations, and an evaluator. This chapter turns one approved order-status case into a repeatable regression test.

The practical workflow uses a local runner to:

create a Langfuse dataset;
add one curated dataset item;
run the order-status agent against that item;
score the result;
inspect the experiment run in Langfuse.

Let’s get started.

What we will change

Work in the demo project:

cd agent-observability-demo

This chapter touches one file:

File	What to do
`src/agent_observability/dataset_scenarios.py`	Create the dataset `seed`, experiment `run`, and release `gate` runner.

Separate trace samples from datasets

A trace is evidence from a real execution. A dataset item is a curated test case with an owner, purpose, expected output, and review history.

A dataset item should be a minimal reproduction of a failure or regression case. It should not contain full production content.

Do not bulk-copy traces into datasets. Select cases deliberately, because each dataset item is a long-term artifact that will be used for regression testing and release evaluation.

Examples of common dataset uses:

Source	Dataset use
Negative feedback	Regression case for answer quality.
Escalated session	Conversation-resolution case.
Prompt-injection attempt	Safety case.
Retrieval miss	Grounding or recall case.
Tool authorization denial	Policy boundary case.

Every copied case needs the lineage and content approval from Chapters 7 and 8. In this demo, the dataset item is synthetic and metadata-only so we can focus on the experiment workflow without copying raw production content. We’ll create a dataset item with a minimal input and expected output, and then run the agent against it.
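
One way to keep items minimal is to check a candidate item before it is written. The following is a sketch, not part of the demo file: `validate_dataset_item`, the allow-list, and the 200-character cap are assumptions made for illustration, and the field names mirror the demo item created later in this chapter.

```python
# Hypothetical allow-list: the input fields used by this chapter's demo item.
ALLOWED_INPUT_FIELDS = {"query", "region", "order_reference", "prompt_label"}
# Hypothetical minimum lineage every item should carry.
REQUIRED_METADATA_FIELDS = {"source", "rubric"}
# Hypothetical cap that keeps the query a minimal reproduction.
MAX_QUERY_CHARS = 200


def validate_dataset_item(
    input_data: dict, expected_output: dict, metadata: dict
) -> list[str]:
    """Return a list of problems; an empty list means the item looks minimal."""
    problems = []
    extra = sorted(set(input_data) - ALLOWED_INPUT_FIELDS)
    if extra:
        problems.append(f"unexpected input fields: {', '.join(extra)}")
    if len(str(input_data.get("query", ""))) > MAX_QUERY_CHARS:
        problems.append("query is too long to be a minimal reproduction")
    if "outcome" not in expected_output:
        problems.append("expected_output is missing 'outcome'")
    missing = sorted(REQUIRED_METADATA_FIELDS - set(metadata))
    if missing:
        problems.append(f"metadata is missing: {', '.join(missing)}")
    return problems
```

An empty list means the item has the expected shape. Anything else is a reason to trim the case before it becomes a long-term artifact.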

Create the dataset runner

Create src/agent_observability/dataset_scenarios.py in the demo project.

This file has three commands:

Command	What it does
`seed`	Creates the `order-status-regression` dataset and one dataset item.
`run`	Runs a Langfuse experiment against that dataset item.
`gate`	Runs the same local case, prints a JSON gate summary, and exits non-zero on failure.

Use this complete file:

import argparse
import json
from types import SimpleNamespace
from typing import Any
from uuid import uuid4

from langfuse import Evaluation, Langfuse

from .config import settings
from .graph import run_agent
from .telemetry import configure_tracing


DATASET_NAME = "order-status-regression"
DATASET_ITEM_ID = "order-status-delayed-eu-v1"
GATE_REQUIRED_SCORES = (
    "expected_outcome",
    "required_terms_present",
    "forbidden_terms_absent",
)
GATE_INPUT = {
    "query": "Where is my delayed EU order?",
    "region": "eu",
    "order_reference": "ORDER-924",
    "prompt_label": "production",
}
GATE_EXPECTED_OUTPUT = {
    "outcome": "answer",
    "must_mention": ["order"],
    "must_not_mention": ["refund issued"],
}

_langfuse: Langfuse | None = None


def get_langfuse() -> Langfuse:
    global _langfuse
    if _langfuse is None:
        _langfuse = Langfuse(
            public_key=settings.langfuse_public_key,
            secret_key=settings.langfuse_secret_key,
            base_url=settings.langfuse_base_url,
            environment=settings.deployment_environment,
        )
    return _langfuse


def ensure_dataset() -> None:
    langfuse = get_langfuse()
    datasets = langfuse.api.datasets.list(limit=100).data
    if not any(dataset.name == DATASET_NAME for dataset in datasets):
        langfuse.create_dataset(
            name=DATASET_NAME,
            description="Regression cases for the order-status support agent.",
            metadata={
                "workflow": "order-status",
                "rubric": "order-status-regression-v1",
                "environment": settings.deployment_environment,
            },
        )


def ensure_dataset_item() -> None:
    langfuse = get_langfuse()
    items = langfuse.api.dataset_items.list(
        dataset_name=DATASET_NAME,
        limit=100,
    ).data
    if any(item.id == DATASET_ITEM_ID for item in items):
        return

    langfuse.create_dataset_item(
        dataset_name=DATASET_NAME,
        id=DATASET_ITEM_ID,
        input=GATE_INPUT,
        expected_output=GATE_EXPECTED_OUTPUT,
        metadata={
            "source": "chapter-22-demo",
            "source_trace_policy": "metadata-only",
            "rubric": "order-status-regression-v1",
            "risk": "medium",
        },
    )


def seed_dataset() -> None:
    ensure_dataset()
    ensure_dataset_item()
    print(f"dataset ready: {DATASET_NAME}")
    print(f"dataset item ready: {DATASET_ITEM_ID}")


def run_order_status_case(*, item: Any, **kwargs: Any) -> dict[str, str]:
    input_data = item.input
    conversation_id = f"eval_{uuid4().hex}"

    result = run_agent(
        {
            "query": input_data["query"],
            "conversation_id": conversation_id,
            "order_reference": input_data["order_reference"],
            "region": input_data["region"],
            "prompt_label": input_data.get("prompt_label", "production"),
        },
    )

    return {
        "answer": result.get("answer", ""),
        "outcome": result.get("outcome", "unknown"),
        "prompt_label": result.get("prompt_label", "unknown"),
    }


def expected_outcome(
    *,
    output: dict[str, str],
    expected_output: dict[str, Any],
    **kwargs: Any,
) -> Evaluation:
    passed = output["outcome"] == expected_output["outcome"]
    return Evaluation(
        name="expected_outcome",
        value=1.0 if passed else 0.0,
        data_type="NUMERIC",
        comment="rubric=order-status-regression-v1",
    )


def required_terms_present(
    *,
    output: dict[str, str],
    expected_output: dict[str, Any],
    **kwargs: Any,
) -> Evaluation:
    answer = output["answer"].lower()
    missing_terms = [
        term
        for term in expected_output.get("must_mention", [])
        if term.lower() not in answer
    ]
    return Evaluation(
        name="required_terms_present",
        value=1.0 if not missing_terms else 0.0,
        data_type="NUMERIC",
        comment=(
            "rubric=order-status-regression-v1"
            if not missing_terms
            else f"missing_terms={','.join(missing_terms)}"
        ),
    )


def forbidden_terms_absent(
    *,
    output: dict[str, str],
    expected_output: dict[str, Any],
    **kwargs: Any,
) -> Evaluation:
    answer = output["answer"].lower()
    forbidden_terms = [
        term
        for term in expected_output.get("must_not_mention", [])
        if term.lower() in answer
    ]
    return Evaluation(
        name="forbidden_terms_absent",
        value=1.0 if not forbidden_terms else 0.0,
        data_type="NUMERIC",
        comment=(
            "rubric=order-status-regression-v1"
            if not forbidden_terms
            else f"forbidden_terms={','.join(forbidden_terms)}"
        ),
    )


def run_experiment() -> None:
    provider = configure_tracing()
    try:
        seed_dataset()
        dataset = get_langfuse().get_dataset(DATASET_NAME)
        result = dataset.run_experiment(
            name="order-status-production-baseline",
            description="Local chapter 22 regression run for the order-status agent.",
            task=run_order_status_case,
            evaluators=[
                expected_outcome,
                required_terms_present,
                forbidden_terms_absent,
            ],
            max_concurrency=1,
            metadata={
                "prompt_label": "production",
                "workflow": "order-status",
                "environment": settings.deployment_environment,
                "agent_version": settings.agent_version,
            },
        )
        print(result.format())
        print(f"dataset run url: {result.dataset_run_url}")
    finally:
        provider.force_flush(timeout_millis=5000)
        provider.shutdown()


def evaluate_release_gate() -> None:
    provider = configure_tracing()
    try:
        output = run_order_status_case(item=SimpleNamespace(input=GATE_INPUT))
        evaluations = [
            expected_outcome(
                output=output,
                expected_output=GATE_EXPECTED_OUTPUT,
            ),
            required_terms_present(
                output=output,
                expected_output=GATE_EXPECTED_OUTPUT,
            ),
            forbidden_terms_absent(
                output=output,
                expected_output=GATE_EXPECTED_OUTPUT,
            ),
        ]
        scores = {evaluation.name: float(evaluation.value) for evaluation in evaluations}
        passed = all(scores.get(name) == 1.0 for name in GATE_REQUIRED_SCORES)
        print(
            json.dumps(
                {
                    "release_gate": "order-status-regression-v1",
                    "dataset": DATASET_NAME,
                    "dataset_item_id": DATASET_ITEM_ID,
                    "passed": passed,
                    "required_scores": list(GATE_REQUIRED_SCORES),
                    "scores": scores,
                    "output": {
                        "outcome": output["outcome"],
                        "prompt_label": output["prompt_label"],
                    },
                },
                indent=2,
                sort_keys=True,
            )
        )
        if not passed:
            raise SystemExit(1)
    finally:
        provider.force_flush(timeout_millis=5000)
        provider.shutdown()


def main() -> None:
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("seed")
    subparsers.add_parser("run")
    subparsers.add_parser("gate")

    args = parser.parse_args()
    if args.command == "seed":
        seed_dataset()
    elif args.command == "run":
        run_experiment()
    elif args.command == "gate":
        evaluate_release_gate()


if __name__ == "__main__":
    main()

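The seed command is idempotent: it checks for the dataset name and the stable item ID before creating anything, so re-running it does not create duplicates. Both existence checks read only the first 100 results, which is enough for this demo.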
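
ensure_dataset() and ensure_dataset_item() each read a single page of up to 100 results. In a project with more datasets or items than that, the existence check would need to walk every page. A minimal sketch of that loop, assuming a hypothetical `fetch_page(page)` callable that wraps whichever list call you use and returns the records for a 1-based page number (an empty list once the pages run out):

```python
from typing import Any, Callable


def exists_in_pages(
    fetch_page: Callable[[int], list],
    matches: Callable[[Any], bool],
    max_pages: int = 50,
) -> bool:
    """Return True as soon as any page contains a matching record.

    fetch_page is a hypothetical wrapper around a paginated list call.
    max_pages is a safety limit so a misbehaving API cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:
            return False  # ran out of pages without a match
        if any(matches(record) for record in records):
            return True
    return False  # hit the safety limit without a match
```

Keeping the page fetcher as a parameter means the loop can be tested with in-memory pages and does not depend on a specific client signature.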

Seed the dataset

Run this from the demo project root:

PYTHONPATH=src python -m agent_observability.dataset_scenarios seed

Expected output:

dataset ready: order-status-regression
dataset item ready: order-status-delayed-eu-v1

Then open Langfuse and go to Evaluation -> Datasets. You should see:

Field	Value
Dataset	`order-status-regression`
Item ID	`order-status-delayed-eu-v1`
Input query	`Where is my delayed EU order?`
Expected outcome	`answer`

Run an experiment against the agent

Now run the experiment:

PYTHONPATH=src python -m agent_observability.dataset_scenarios run

This calls the configured model through run_agent(), so exact wording, latency, usage, and provider metadata can vary between runs. Treat the deterministic evaluator scores and dataset run record as the verification target, not a byte-for-byte answer.

After seeding the dataset if it is missing, the command does three things:

loads the order-status-regression dataset;
calls run_agent() once per dataset item;
records evaluator scores for the experiment run.

Expected terminal output includes a summary from Langfuse and a dataset run URL:

dataset ready: order-status-regression
dataset item ready: order-status-delayed-eu-v1
...
dataset run url: http://localhost:3000/project/...

Go to Evaluation -> Datasets -> order-status-regression -> Runs. Inspect the run named order-status-production-baseline.

The experiment runner records the output and scores so a failed case can be inspected, not just counted.

Understand the evaluators

The first run uses three deterministic evaluators:

Evaluator	What it checks
`expected_outcome`	The agent returned the expected route, such as `answer`.
`required_terms_present`	The answer includes simple required terms.
`forbidden_terms_absent`	The answer avoids terms that should not appear.

These are intentionally simple. They are not a replacement for human annotation or LLM-as-a-Judge. They are a release guardrail that catches obvious regressions before traffic sees a candidate prompt or workflow.

Use deterministic checks for exact requirements. Use human annotation and calibrated judges for nuanced quality.
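
The two term evaluators use plain substring matching, so a required term such as `order` is also satisfied by `reorder` or `border`. If that is too permissive for a case, a whole-word check is one option. This is a sketch using the standard library `re` module; `contains_term` is a hypothetical helper, not part of the chapter's file.

```python
import re


def contains_term(answer: str, term: str) -> bool:
    """Case-insensitive match that requires the term to stand alone.

    The lookarounds reject matches that are glued to another word
    character, so "order" does not match inside "reorder".
    """
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


print(contains_term("Your order is delayed.", "order"))    # True
print(contains_term("Please reorder the item.", "order"))  # False
```

The trade-off is that inflected forms such as `orders` no longer match, so stricter matching may require listing more terms in `must_mention`.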

Compare prompt, model, and workflow changes

We now have baseline scores for the production prompt label, so we can run a second experiment that changes the prompt label, the model, or the workflow code.

Change one major variable per experiment when possible:

Candidate	Keep stable
Prompt label	Model, retrieval data, workflow code, evaluator version.
Model alias	Prompt label, retrieval data, workflow code, evaluator version.
Retrieval data	Prompt label, model, workflow code, evaluator version.
Workflow code	Prompt label, model, retrieval data, evaluator version.

If several variables change together, the experiment can still protect the release, but it cannot explain which change caused a regression.
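
A quick way to check how many variables changed is to diff the metadata recorded on the two runs. This is a sketch under the assumption that every variable you care about is written to run metadata; `changed_variables` is a hypothetical helper, and the `model` key is an illustrative addition to the metadata keys used in this chapter.

```python
_MISSING = object()  # distinguishes "key absent" from "value is None"


def changed_variables(
    baseline: dict, candidate: dict, ignore: tuple[str, ...] = ()
) -> list[str]:
    """Return the sorted metadata keys whose values differ between two runs."""
    keys = (set(baseline) | set(candidate)) - set(ignore)
    return sorted(
        key
        for key in keys
        if baseline.get(key, _MISSING) != candidate.get(key, _MISSING)
    )


baseline = {"prompt_label": "production", "model": "model-a", "workflow": "order-status"}
candidate = {"prompt_label": "staging", "model": "model-a", "workflow": "order-status"}
print(changed_variables(baseline, candidate))  # ['prompt_label']
```

If the list has more than one entry, the run can still gate the release, but a regression cannot be attributed to a single change.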

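In this demo, the easiest comparison is prompt label. The run metadata only labels the run: run_order_status_case() reads prompt_label from the dataset item input, so a staging run must also pass the staging label to run_agent(). Add a second dataset run later with metadata like: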

metadata={
    "prompt_label": "staging",
    "workflow": "order-status",
    "environment": settings.deployment_environment,
    "agent_version": settings.agent_version,
}

Do not call it a prompt experiment unless only the prompt label changed.
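
Because the task in this chapter takes the label from item.input, a candidate run needs a task that passes its own label to the agent. A minimal sketch of a task factory follows. `make_task` is a hypothetical helper, it assumes the agent selects its prompt from the `prompt_label` key as run_order_status_case() implies, and `fake_agent` is a stand-in so the example runs without the demo project.

```python
from types import SimpleNamespace
from typing import Any, Callable
from uuid import uuid4


def make_task(agent: Callable[[dict], dict], prompt_label: str) -> Callable[..., dict]:
    """Build an experiment task that forces one prompt label for every item."""

    def task(*, item: Any, **kwargs: Any) -> dict:
        input_data = item.input
        result = agent(
            {
                "query": input_data["query"],
                "conversation_id": f"eval_{uuid4().hex}",
                "order_reference": input_data["order_reference"],
                "region": input_data["region"],
                # Overrides whatever label is stored on the dataset item.
                "prompt_label": prompt_label,
            }
        )
        return {
            "answer": result.get("answer", ""),
            "outcome": result.get("outcome", "unknown"),
            "prompt_label": result.get("prompt_label", "unknown"),
        }

    return task


# Stand-in agent for illustration only; the demo would pass run_agent instead.
def fake_agent(payload: dict) -> dict:
    return {
        "answer": "Your order is delayed.",
        "outcome": "answer",
        "prompt_label": payload["prompt_label"],
    }


item = SimpleNamespace(
    input={
        "query": "Where is my delayed EU order?",
        "region": "eu",
        "order_reference": "ORDER-924",
        "prompt_label": "production",
    }
)
output = make_task(fake_agent, "staging")(item=item)
print(output["prompt_label"])  # staging
```

In the demo file, the equivalent call would pass `task=make_task(run_agent, "staging")` to the experiment and keep the model, retrieval data, workflow code, and evaluators unchanged.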

Use versioned datasets for reproducibility

Datasets change as the product changes. New failures become regression cases, weak expectations get corrected, and old cases sometimes stop representing the behavior you care about. That is fine, but it means an experiment result only makes sense together with the dataset state used to produce it.

For a release decision, record the exact dataset run URL or dataset version next to the candidate you evaluated. Otherwise, you can accidentally compare a staging prompt from today against a production baseline that was scored before five new failure cases were added.

In the local demo, keep the URL printed by dataset_scenarios run and the timestamp shown in Langfuse. In production, store that run URL or dataset version in the release ticket, pull request, or deployment record.
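
If a stored dataset version is not available, a content fingerprint is one local way to detect that the dataset changed between the baseline and the candidate run. This is a sketch, not a Langfuse feature: it assumes each item has already been reduced to a plain dict with `id`, `input`, and `expected_output` keys, and `dataset_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def dataset_fingerprint(items: list[dict]) -> str:
    """Return a SHA-256 hex digest that changes when any item changes.

    Items are sorted by id and serialized with sorted keys, so the
    fingerprint does not depend on listing order or dict key order.
    """
    canonical = json.dumps(
        sorted(items, key=lambda item: item["id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to the run URL in the release record makes it cheap to confirm that two runs were scored against the same cases.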

Gate releases with explicit thresholds

The dataset run is useful only if it changes a release decision. After the candidate run finishes, compare it with the production baseline and write down the exact rule you are applying. The local demo includes a small executable gate for the current case; in production, the same rule should run against the full candidate dataset in CI, a deployment checklist, or a release ticket.

A practical gate for the experiment from this chapter could look like this:

release_gate = "order-status-regression-v1"
dataset = "order-status-regression"
baseline_run = "order-status-production-baseline"
candidate_run = "order-status-staging-candidate"
dataset_run_url = "<paste the Langfuse dataset run URL here>"

pass_if:
  expected_outcome == 1.00
  required_terms_present == 1.00
  forbidden_terms_absent == 1.00
  total_cost_per_case <= baseline * 1.10

fail_if:
  any required evaluator is missing
  any safety or policy score is below threshold
  candidate run used a different dataset version than baseline

Use the Dataset Run page in Langfuse as the source for the average evaluator scores. Use the run metadata and cost columns to confirm that the baseline and candidate were executed against the same dataset and comparable model configuration.
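
The rule above can also be written as a small function, so the same decision is applied every time. This is a sketch under stated assumptions: the run summaries are plain dicts that you fill in from the Dataset Run page, and the `dataset_version` and `cost_per_case` fields are hypothetical names. It covers the three required scores, the dataset version check, and the cost ceiling; it omits the safety and policy line because this chapter defines no such score.

```python
REQUIRED_SCORES = (
    "expected_outcome",
    "required_terms_present",
    "forbidden_terms_absent",
)
MAX_COST_RATIO = 1.10  # candidate may cost at most 10% more than baseline


def evaluate_gate(baseline: dict, candidate: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons); any reason means the gate fails closed."""
    reasons = []
    for name in REQUIRED_SCORES:
        score = candidate["scores"].get(name)
        if score is None:
            reasons.append(f"missing required score: {name}")
        elif score < 1.0:
            reasons.append(f"{name} below 1.00: {score:.2f}")
    if candidate["dataset_version"] != baseline["dataset_version"]:
        reasons.append("candidate and baseline used different dataset versions")
    if candidate["cost_per_case"] > baseline["cost_per_case"] * MAX_COST_RATIO:
        reasons.append("cost per case exceeds baseline by more than 10%")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean keeps the release record explicit about why a candidate was blocked.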

For a CI-compatible local check, run:

PYTHONPATH=src python -m agent_observability.dataset_scenarios gate

Expected output is JSON that can be archived by CI or pasted into a release note:

{
  "dataset": "order-status-regression",
  "dataset_item_id": "order-status-delayed-eu-v1",
  "output": {
    "outcome": "answer",
    "prompt_label": "production"
  },
  "passed": true,
  "release_gate": "order-status-regression-v1",
  "required_scores": [
    "expected_outcome",
    "required_terms_present",
    "forbidden_terms_absent"
  ],
  "scores": {
    "expected_outcome": 1.0,
    "forbidden_terms_absent": 1.0,
    "required_terms_present": 1.0
  }
}

If any required score is below 1.0, the command exits with status 1. Keep the Langfuse experiment run as richer release evidence; use the gate command when you need an automated pass/fail signal.

Result	Release decision
All required scores pass	Promote the candidate
`expected_outcome` fails	Block the release
`required_terms_present` drops	Inspect item-level outputs before deciding
`forbidden_terms_absent` fails	Block the release and add the reviewed failure as a regression case
Cost rises above the threshold	Require an explicit product or engineering approval

The threshold is a product and risk decision. I would keep the local demo strict because the dataset has only one item. With a larger dataset, use thresholds that match the user impact of the workflow. A documentation assistant and a refund workflow should not share the same gate.
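
With more than one item, a gate usually compares the mean score per evaluator against a minimum chosen for the workflow. A minimal sketch, assuming item-level scores have been collected into a list of dicts; `gate_with_thresholds` and the example thresholds are hypothetical and only illustrate the shape of the check.

```python
from statistics import mean


def gate_with_thresholds(
    item_scores: list[dict[str, float]], thresholds: dict[str, float]
) -> dict[str, bool]:
    """Return pass/fail per evaluator, failing closed when no scores exist."""
    results = {}
    for name, minimum in thresholds.items():
        values = [scores[name] for scores in item_scores if name in scores]
        results[name] = bool(values) and mean(values) >= minimum
    return results


item_scores = [
    {"expected_outcome": 1.0, "forbidden_terms_absent": 1.0},
    {"expected_outcome": 1.0, "forbidden_terms_absent": 1.0},
    {"expected_outcome": 1.0, "forbidden_terms_absent": 1.0},
    {"expected_outcome": 0.0, "forbidden_terms_absent": 1.0},
]
# Hypothetical thresholds: lenient on routing, strict on forbidden terms.
lenient = {"expected_outcome": 0.7, "forbidden_terms_absent": 1.0}
print(gate_with_thresholds(item_scores, lenient))
```

A stricter workflow would raise the minimum for `expected_outcome`, and the same four item results would then block the release.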

Feed reviewed failures back into datasets

Reviewed failures should become dataset items. Otherwise the same bug can be fixed once, forgotten, and reintroduced by the next prompt, model, retrieval, or workflow change.

In Langfuse, start from the failed trace or session. Confirm the failure with a human annotation, a trusted evaluator score, or both. Then add it to order-status-regression from the trace view, or create the dataset item manually in Evaluation -> Datasets -> order-status-regression.

For this demo, do not store the raw trace as the dataset input unless it already matches the shape used by src/agent_observability/dataset_scenarios.py. The experiment runner reads these fields directly:

{
  "query": "Where is my delayed EU order?",
  "region": "eu",
  "order_reference": "ORDER-924",
  "prompt_label": "production"
}

The expected output should describe the behavior you want to protect, not the exact sentence the model happened to produce:

{
  "outcome": "answer",
  "must_mention": ["order"],
  "must_not_mention": ["refund issued"]
}

Add lineage metadata so the future reader knows why the case exists:

{
  "source": "reviewed-failure",
  "source_trace_id": "<trace id>",
  "source_session_id": "<session id>",
  "failure_type": "missing_order_status_context",
  "rubric": "order-status-regression-v1"
}
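
The three pieces above can be assembled in one place so every promoted failure has the same shape. This is a sketch: `build_regression_item` is a hypothetical helper, and the required-field list is taken from the fields the experiment runner reads in this chapter.

```python
# Fields the chapter's experiment runner reads without a default.
REQUIRED_INPUT_FIELDS = ("query", "region", "order_reference")


def build_regression_item(
    *,
    item_id: str,
    trace_id: str,
    session_id: str,
    failure_type: str,
    input_data: dict,
    expected_output: dict,
    rubric: str = "order-status-regression-v1",
) -> dict:
    """Assemble a dataset item dict from a reviewed failure."""
    missing = [name for name in REQUIRED_INPUT_FIELDS if name not in input_data]
    if missing:
        raise ValueError(f"input is missing fields: {', '.join(missing)}")
    return {
        "id": item_id,
        # Default the label the same way the runner does.
        "input": {
            **input_data,
            "prompt_label": input_data.get("prompt_label", "production"),
        },
        "expected_output": expected_output,
        "metadata": {
            "source": "reviewed-failure",
            "source_trace_id": trace_id,
            "source_session_id": session_id,
            "failure_type": failure_type,
            "rubric": rubric,
        },
    }
```

The returned keys match the `id`, `input`, `expected_output`, and `metadata` arguments that ensure_dataset_item() passes to create_dataset_item(), so the dict can be unpacked into that call together with `dataset_name`.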

After adding the item, run the experiment again:

PYTHONPATH=src python -m agent_observability.dataset_scenarios run

If the current production version passes the new case, the item is probably too weak or the failure was not reproduced. If production fails and the candidate passes, the dataset is doing its job.

This loop keeps the dataset tied to real failures instead of synthetic examples that slowly drift away from production.

For this chapter, one synthetic dataset item is enough. The important move is that the demo now has a repeatable path from “reviewed behavior” to “release check”, and a clear format for turning future reviewed failures into regression cases.

What should exist before we go to Chapter 23

At this point the demo should have:

src/agent_observability/dataset_scenarios.py;
one order-status-regression dataset in Langfuse;
one dataset item with the ID order-status-delayed-eu-v1;
one experiment run named order-status-production-baseline;
item-level evaluator scores attached to the experiment run;
a gate command that emits JSON and fails closed on required-score regressions;
a rule for promoting reviewed failures into regression cases.

Chapter 23 tests the telemetry, Langfuse attributes, scores, prompt versions, and dataset workflows so the operating loop does not break during refactors.

Next up: Ch 23 - Testing and Operating the Telemetry Pipeline verifies span structure, privacy invariants, Langfuse attributes, prompt versions, scores, and experiments.