From adoption to evidence
You’ve configured instructions, created custom agents, set up MCP connections, added hooks, enforced security, and the coding agent is producing pull requests. The question every engineering leader will ask next is: is it working?
“Working” doesn’t mean “the agent runs.” It means your team ships faster, the code is better, and developers are happier. Proving that requires measurement, and the right measurements depend on what you’re optimizing for.
This chapter covers the metrics that matter, how to collect them, how to analyze agent-specific performance, and how spec-driven development turns the PDRC mental model from a way of thinking into a measurable process. By the end, you’ll have a measurement system that justifies your team’s investment in AI-assisted development and surfaces what to improve next.
What to measure (and what not to)
The trap: measuring output volume
The instinct when adopting a productivity tool is to measure output. More lines of code. More pull requests. More commits. This instinct is wrong for AI-assisted development.
Research from GitHub’s survey of 500 enterprise developers found that only one-third of developers think code volume is a good performance metric — and they’re right. An agent can generate hundreds of lines that a human would have written in twenty. More code doesn’t mean better code, and in many cases it means worse code.
The metrics that actually tell you whether AI-assisted development is working fall into four categories:
| Category | What it answers | Example metrics |
|---|---|---|
| Velocity | Are we shipping faster? | Cycle time, PR lead time, throughput (PRs merged per sprint) |
| Quality | Is the code better? | Build success rate, test coverage, bug escape rate, PR merge rate |
| Developer experience | Are developers happier and more focused? | Satisfaction surveys, flow state frequency, cognitive load reduction |
| Agent effectiveness | Is the agent doing its job well? | Task completion rate, review rounds before merge, intervention frequency |
Let’s define each one precisely.
Velocity metrics
Cycle time
What it is: The elapsed time from when work starts (first commit, or issue assignment) to when the change is deployed to production.
Why it matters for AI adoption: Cycle time captures the entire pipeline, not just coding speed. If the agent writes code in minutes but the PR sits in review for three days, your cycle time hasn’t improved.
How to measure it: Most engineering analytics tools (LinearB, Jellyfish, Pluralsight Flow, GitHub’s own Copilot Metrics API) calculate this from your Git and CI data.
What to watch for: A drop in coding time paired with stable or increasing review time suggests a bottleneck shift — the agent is producing more work than your review process can absorb. The fix is usually more automation (Ch 10’s AI-assisted code review) or better scoping of tasks.
PR lead time
What it is: The time from when a pull request is opened to when it’s merged.
Why it matters: This is the metric most directly affected by AI-assisted development. The agent opens PRs faster, but do they merge faster?
How to measure it: GitHub’s REST API provides PR creation and merge timestamps. Calculate the difference.
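The calculation itself is a timestamp subtraction. A minimal sketch in Python, assuming the ISO 8601 timestamps (`created_at`, `merged_at`) that the GitHub Pull Requests API returns:

```python
from datetime import datetime

def pr_lead_time_hours(created_at: str, merged_at: str) -> float:
    """Hours between PR creation and merge, from ISO 8601 timestamps
    as returned by GitHub's REST API (e.g. "2026-02-10T09:00:00Z")."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    created = datetime.strptime(created_at.replace("Z", "+0000"), fmt)
    merged = datetime.strptime(merged_at.replace("Z", "+0000"), fmt)
    return (merged - created).total_seconds() / 3600

print(pr_lead_time_hours("2026-02-10T09:00:00Z", "2026-02-10T14:30:00Z"))  # 5.5
```

Average this over agent-labeled and human PRs separately to get the two cohorts' lead times.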
Healthy pattern: Agent-created PRs with well-written specs should have shorter lead times than human-created PRs, because:
- The code is consistent (it follows custom instructions).
- Tests are included (if your instructions require them).
- The PR description is detailed (the agent generates comprehensive session logs).
Unhealthy pattern: Agent-created PRs that have longer lead times than human PRs usually indicate poor scoping. The agent was given a task too broad or too ambiguous, producing a PR that requires extensive rework.
Throughput
What it is: The number of pull requests merged per unit of time (per sprint, per week, per developer).
Why it matters: Throughput is where the volume conversation becomes useful: not “how many lines” but “how many complete, reviewed, tested changes reached production.”
Reference point: In a randomized controlled trial with Accenture, developers using Copilot saw an 8.69% increase in pull requests and a 15% increase in PR merge rate, meaning not only were more PRs created, but more of them passed review.
Quality metrics
Build success rate
What it is: The percentage of CI builds that pass on the first attempt.
Why it matters: Agent-generated code that breaks the build wastes reviewer time and CI resources. A rising build success rate means the agent is producing code that works.
Reference point: In the Accenture study, developers using Copilot saw an 84% increase in successful builds, suggesting that AI-assisted code was not sacrificing quality for speed.
How to use it: Track build success rate separately for agent-created PRs vs. human-created PRs. If the agent’s rate is significantly lower, the problem is usually in custom instructions (missing build commands, wrong test runners) or setup steps (dependencies not available).
Test coverage delta
What it is: The change in test coverage percentage before and after adopting AI-assisted development.
Why it matters: If your custom instructions say “write tests for all new code,” coverage should increase, or at minimum, stay stable while throughput rises. A coverage drop means the agent is shipping untested code.
How to measure it: Your CI pipeline already reports coverage. Compare monthly averages before and after adoption.
Bug escape rate
What it is: The number of bugs found in production per unit of delivered work (per PR, per story point, per sprint).
Why it matters: This is the ultimate quality signal. Are bugs increasing because the agent produces subtly wrong code? Or decreasing because the agent generates more comprehensive tests?
How to measure it: Tag production incidents with the PR that introduced them. Calculate the ratio of incidents to merged PRs, broken down by agent-created vs. human-created.
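The cohort breakdown can be a small script. A sketch, assuming you have already extracted (PR number, cohort) pairs for merged PRs and a list of incident-tagged PR numbers — the shapes here are illustrative, not a fixed schema:

```python
def bug_escape_rate(prs, incidents):
    """prs: list of (pr_number, cohort) for merged PRs, where cohort is
    e.g. "agent" or "human". incidents: PR numbers tagged as having
    introduced a production incident. Returns incidents per merged PR
    for each cohort."""
    cohort_of = dict(prs)
    merged, escaped = {}, {}
    for _, cohort in prs:
        merged[cohort] = merged.get(cohort, 0) + 1
    for pr in incidents:
        c = cohort_of.get(pr)
        if c is not None:
            escaped[c] = escaped.get(c, 0) + 1
    return {c: escaped.get(c, 0) / merged[c] for c in merged}

prs = [(101, "agent"), (102, "agent"), (103, "human"), (104, "human")]
print(bug_escape_rate(prs, incidents=[102]))  # {'agent': 0.5, 'human': 0.0}
```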
PR merge rate
What it is: The percentage of opened PRs that actually get merged (as opposed to closed without merging).
Why it matters: A low merge rate for agent-created PRs means the agent is producing work that doesn’t meet standards. A high merge rate means the delegation is working.
Developer experience metrics
Why developer experience matters
Productivity metrics tell you about the system. Developer experience metrics tell you about the people in the system. A team that ships 30% more PRs but is burned out and frustrated is not succeeding.
Satisfaction surveys
The simplest measurement: ask your developers directly. Run a brief survey (3-5 questions, monthly or quarterly):
- On a scale of 1-5, how satisfied are you with your current development workflow?
- How often do you feel you can focus on interesting problems (vs. repetitive tasks)? (Never / Rarely / Sometimes / Often / Always)
- How has AI-assisted development affected your workday? (Much worse / Somewhat worse / No change / Somewhat better / Much better)
- What’s the biggest friction point in your current workflow? (Free text)
- What task do you wish the agent handled better? (Free text)
Reference point: In the Accenture study, 95% of developers reported enjoying coding more with Copilot, and 90% felt more fulfilled in their jobs. If your numbers are significantly lower, investigate what’s different about your setup.
Flow state and cognitive load
Flow state, the experience of deep, uninterrupted focus, is one of the strongest predictors of developer satisfaction and output quality.
GitHub’s research found that developers using Copilot reported maintaining flow state more consistently and experienced reduced cognitive load on repetitive tasks. Specifically, 70% of developers reported significantly less mental effort on repetitive tasks, and 54% spent less time searching for information or examples.
You can measure this indirectly:
- Context switches per day: Track how often developers switch between tasks, tools, or conversations. Fewer switches suggest better flow.
- Uninterrupted blocks: Measure the average length of uninterrupted coding sessions (time between commits without meetings or other activity).
- Wait time: Time spent waiting on builds, tests, or reviews. A reduction here directly enables longer flow states.
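Uninterrupted blocks can be approximated from commit timestamps alone. A sketch that treats any gap longer than a threshold as a session break — the 30-minute cutoff is an assumption; tune it to your team's rhythm:

```python
from datetime import datetime

def longest_session_minutes(commit_times, max_gap_minutes=30):
    """Approximate the longest uninterrupted coding session: the longest
    run of commits where no gap between consecutive commits exceeds
    max_gap_minutes. commit_times must be sorted datetimes."""
    if not commit_times:
        return 0.0
    start = prev = commit_times[0]
    best = 0.0
    for t in commit_times[1:]:
        if (t - prev).total_seconds() / 60 > max_gap_minutes:
            # Gap too long: close the current session, start a new one.
            best = max(best, (prev - start).total_seconds() / 60)
            start = t
        prev = t
    return max(best, (prev - start).total_seconds() / 60)

commits = [datetime(2026, 2, 10, 9, 0), datetime(2026, 2, 10, 9, 20),
           datetime(2026, 2, 10, 9, 45), datetime(2026, 2, 10, 11, 0)]
print(longest_session_minutes(commits))  # 45.0
```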
The upskilling effect
One of the less obvious benefits of AI-assisted development: it helps developers learn. In GitHub’s survey, 57% of developers said AI coding tools help them improve their coding language skills, the top reported benefit, even above productivity gains.
Track how many developers are working in languages or frameworks they weren’t previously comfortable with. If the agent handles the boilerplate, developers can focus on understanding the architecture and business logic.
Agent effectiveness metrics
These metrics evaluate how well your specific agent configuration is performing. Not just “is AI helping” but “are our agents configured well.”
Task completion rate
What it is: The percentage of issues assigned to the coding agent that result in a merged PR without manual intervention.
How to measure it: Track issues labeled with the agent assignment. Count how many result in:
- A merged PR (success)
- A PR that required human commits before merging (partial success)
- A PR that was closed without merging (failure)
- No PR created (failure)
Target: Start by tracking your baseline. Most teams see 40-60% full completion on well-scoped tasks, rising as custom instructions improve.
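The four buckets above translate directly into code. A sketch, assuming you can determine from the GitHub API whether a linked PR exists, whether it merged, and whether a human pushed commits to its branch:

```python
def classify_outcome(pr_merged, pr_exists, human_commits):
    """Map an agent-assigned issue to one of the four outcome buckets."""
    if not pr_exists:
        return "failure: no PR"
    if not pr_merged:
        return "failure: closed unmerged"
    if human_commits:
        return "partial success"
    return "success"

def completion_rate(outcomes):
    """Full-completion rate: merged PRs with no human commits."""
    return sum(o == "success" for o in outcomes) / len(outcomes)

outcomes = [classify_outcome(True, True, False),    # success
            classify_outcome(True, True, True),     # partial success
            classify_outcome(False, True, False),   # failure: closed unmerged
            classify_outcome(False, False, False)]  # failure: no PR
print(completion_rate(outcomes))  # 0.25
```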
Review rounds before merge
What it is: The number of review cycles (review → request changes → update → re-review) before an agent-created PR is approved.
Why it matters: More rounds mean the agent isn’t producing merge-ready code on the first attempt. This metric directly reflects the quality of your custom instructions, agent profiles, and task scoping.
Target: Aim for agent PRs needing the same number of review rounds as human PRs — or fewer.
Intervention frequency
What it is: How often a human needs to step in during agent execution — pushing commits to the branch, commenting corrections, or taking over the task entirely.
Why it matters: High intervention frequency means you’re not saving time; you’re spending time differently (supervising instead of coding). The goal is autonomous completion with human review only at the PR stage.
Agent-specific breakdowns
If you have multiple custom agents (Ch 8), compare their performance.
Example breakdown:
| Agent | Tasks assigned | Completion rate | Avg review rounds | Avg PR lead time |
|---|---|---|---|---|
| implementer | 45 | 62% | 1.8 | 4.2h |
| test-writer | 30 | 78% | 1.2 | 2.1h |
| docs-updater | 20 | 85% | 1.0 | 1.5h |
| security-reviewer | 15 | 70% | 1.5 | 3.0h |
This breakdown tells you which agents need better instructions, which tasks are well-suited for agents, and where to invest in improvement.
Where to find the data
GitHub’s built-in tools
Copilot activity report. Available for organization owners under Settings > Copilot > Access. Shows per-user activity data including last activity timestamp, surface used (IDE, GitHub.com, CLI), and feature used (inline suggestions, chat, coding agent, code review, PR summaries). Data refreshes every 30 minutes with a 90-day retention window.
Copilot Metrics API. REST API endpoints that provide programmatic access to the same data. Useful for building dashboards or feeding into your existing analytics pipeline.
GitHub API for PR data. The Pull Requests API gives you creation time, merge time, review events, commit count, and label data. Everything you need to calculate lead time, review rounds, and throughput.
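A minimal sketch using only the standard library. The endpoint, `state` parameter, and `merged_at` field are real GitHub REST API details, but this fetches only the first page — production use should paginate via the `Link` header:

```python
import json
import urllib.request

def merged_only(prs):
    """Keep only PRs that actually merged (merged_at is set);
    closed-without-merge PRs have merged_at == null."""
    return [pr for pr in prs if pr.get("merged_at")]

def fetch_merged_prs(owner, repo, token=None):
    """First page of closed PRs for a repo, filtered to merged ones."""
    url = (f"https://api.github.com/repos/{owner}/{repo}/pulls"
           "?state=closed&per_page=100")
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return merged_only(json.load(resp))
```

Each returned PR object carries `created_at`, `merged_at`, and `labels`, which is everything the lead-time and cohort calculations need.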
Third-party engineering analytics
Tools like LinearB, Jellyfish, Pluralsight Flow, and Haystack connect to your GitHub data and calculate DORA metrics (deployment frequency, lead time, change failure rate, time to restore service) automatically. Many now include AI-specific views.
Your own tracking
For agent-specific metrics, you’ll likely need to build lightweight tracking:
- Label agent PRs. The coding agent already labels its PRs. Use these labels to filter your analytics.
- Use audit hooks. In Ch 14, you built hooks that log every tool call. Parse those logs to calculate intervention frequency, command patterns, and error rates.
- Track issue-to-PR mappings. When the agent creates a PR from an issue, the PR body references the issue. Use this link to calculate task completion rates.
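Extracting that link is a pattern match on the PR body. A sketch that recognizes GitHub's closing keywords (`fixes #123`, `closes #123`, `resolves #123`):

```python
import re

# GitHub links a PR to the issues it closes via closing keywords in the body.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE)

def linked_issues(pr_body):
    """Extract issue numbers referenced with closing keywords in a PR body."""
    return [int(n) for n in CLOSING_REF.findall(pr_body or "")]

print(linked_issues("Fixes #42. Also closes #7."))  # [42, 7]
```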
Spec-driven development: the Plan that powers everything
Throughout this series, we’ve emphasized that the Plan phase of PDRC is the most important. Spec-driven development formalizes that idea: before delegating any task to an agent, you write a specification that defines what success looks like.
Why specs matter more with AI
When a human developer picks up a task, they bring context: they know the codebase, they know the team’s conventions, they can ask questions when something is unclear. An AI agent has none of this unless you provide it. The spec is how you provide it.
A well-written spec transforms a vague issue into a precise instruction set:
Bad issue (no spec):
Add user preferences to the settings page.
Good issue (with spec):
User Preferences Feature
Goal
Allow users to configure notification preferences (email, push, in-app) from the settings page.
Acceptance criteria
- New “Notifications” section in the settings page, below “Profile”
- Three toggle switches: Email notifications, Push notifications, In-app notifications
- Toggles persist to the user’s profile via the existing `PATCH /api/users/:id` endpoint
- Default state: all three enabled for new users
- Unit tests for the new API endpoint behavior
- E2E test for the toggle interaction
Technical notes
- Use the existing `ToggleSwitch` component from `src/components/ui/`
- Preferences are stored in the `user_preferences` JSON column on the `users` table
- See `src/pages/settings/ProfileSection.tsx` for the pattern to follow
Out of scope
- Notification delivery logic (separate issue)
- Granular notification types (future enhancement)
The spec as a measurement instrument
Here’s the key insight: a spec with checkboxes is both an instruction for the agent and a measurement tool. After the agent creates its PR, you can check:
- How many acceptance criteria did the PR satisfy?
- Which criteria were missed?
- Were any out-of-scope changes included?
This gives you a precise completion quality score: 5/6 criteria met = 83% spec compliance. Track this over time, and you’ll see exactly how your instructions and agent profiles improve (or don’t).
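Because the criteria are markdown checkboxes, scoring can be automated. A sketch that counts ticked boxes (`- [x]`) versus unticked ones (`- [ ]`) in the issue or PR body:

```python
import re

def spec_compliance(issue_body):
    """Percentage of markdown acceptance-criteria checkboxes ticked.
    Counts '- [x]' as met and '- [ ]' as unmet; returns 0.0 if none found."""
    boxes = re.findall(r"- \[([ xX])\]", issue_body)
    if not boxes:
        return 0.0
    met = sum(1 for b in boxes if b.lower() == "x")
    return 100 * met / len(boxes)

spec = """
- [x] Toggles render in settings
- [x] Preferences persist via API
- [ ] E2E test for toggle interaction
"""
print(round(spec_compliance(spec), 1))  # 66.7
```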
Spec templates
Create a spec template in your repository that developers fill out before assigning to the agent:
---
name: Agent Task
about: A task to be assigned to the Copilot coding agent
labels: ["copilot-agent"]
---
## Goal
[One-paragraph description of what this task accomplishes and why]
## Acceptance criteria
- [ ] [Specific, testable criterion]
- [ ] [Specific, testable criterion]
- [ ] [Tests: describe what tests should exist]
## Technical notes
- [Files to modify]
- [Patterns to follow]
- [Dependencies or constraints]
## Out of scope
- [What this task explicitly does NOT include]
Connecting specs to the PDRC cycle
| PDRC phase | How specs help |
|---|---|
| Plan | The spec IS the plan. Writing it forces you to think through the problem before the agent touches any code. |
| Delegate | The spec becomes the issue body. The agent reads it as its primary instruction. |
| Review | The acceptance criteria become your review checklist. Did the PR do what the spec said? |
| Correct | Unmet criteria become specific, actionable feedback. “Criterion 3 is not satisfied — missing unit test for error case.” |
This is the full loop we’ve been building toward since Ch 1. The spec makes every phase concrete and measurable.
Putting it all together: a measurement dashboard
Here’s a practical measurement dashboard example you can build with data available from GitHub’s API and your CI pipeline:
Weekly snapshot
| Metric | This week | Last week | Trend |
|---|---|---|---|
| PRs merged (total) | 28 | 24 | Up 17% |
| PRs merged (agent) | 12 | 8 | Up 50% |
| Avg PR lead time (agent) | 3.8h | 5.1h | Improved |
| Avg PR lead time (human) | 6.2h | 6.5h | Stable |
| Build success rate (agent PRs) | 91% | 85% | Improved |
| Test coverage | 78.2% | 77.8% | Stable |
| Agent task completion rate | 67% | 58% | Improved |
| Avg review rounds (agent PRs) | 1.6 | 2.1 | Improved |
| Spec compliance (avg) | 88% | 82% | Improved |
Monthly developer experience
| Question | Score (1-5) | Previous month | Trend |
|---|---|---|---|
| Workflow satisfaction | 4.1 | 3.8 | Improved |
| Focus on interesting work | 3.9 | 3.5 | Improved |
| AI impact on workday | 4.2 | 4.0 | Stable |
What to do with the data
- Improving velocity + stable quality: Your setup is working. Invest in expanding agent usage to more task types.
- Improving velocity + declining quality: The agent is producing fast but sloppy code. Improve custom instructions, add stricter hooks, require more comprehensive specs.
- Stable velocity + improving quality: The agent is being used cautiously. Consider expanding the types of tasks assigned to it.
- Declining everything: Step back. Review your instructions, agent profiles, and task scoping. The agent is probably being given tasks it’s not suited for.
Continuous improvement: the feedback loop
Measurement only matters if it drives change. Here’s the improvement cycle:
Monthly review
- Pull the dashboard. Gather the weekly metrics for the past month.
- Identify the biggest gap. Which metric is furthest from where you want it? That’s your focus.
- Trace the root cause. Use the agent-specific breakdowns and spec compliance scores to find the source:
- Low completion rate → examine the specs (probably too vague)
- High review rounds → examine custom instructions (probably missing standards)
- Low build success → examine setup steps (probably missing dependencies)
- Low developer satisfaction → ask the free-text survey questions (identify specific friction)
- Make one change. Update the instructions, agent profile, hook, or spec template that addresses the root cause. Don’t change everything at once — you won’t know what helped.
- Measure the next month. Did the change improve the target metric without regressing others?
Quarterly review
Every quarter, zoom out:
- Are you assigning more task types to agents than you were three months ago?
- Is the overall team velocity trend positive?
- Are developers more satisfied?
- Has the agent configuration stabilized, or are you still making frequent changes?
If you’re still making frequent changes after six months, the setup is probably fighting your workflow rather than supporting it. Revisit your PDRC fundamentals (Ch 1-4).
Hands-on: build your measurement baseline
In this exercise, you’ll define a measurement framework for your team and collect baseline data.
Step 1: identify your current data sources
List the tools your team uses and what data they provide:
| Tool | Data available | How to access |
|---|---|---|
| GitHub | PR creation/merge times, review events, labels, CI status | REST API, GraphQL API |
| CI/CD (Actions, CircleCI, etc.) | Build pass/fail, test coverage, build time | CI API or dashboard |
| Project management (Linear, Jira, etc.) | Issue creation time, cycle time, sprint data | Tool API |
| Copilot | Activity data, usage by feature, seats | Copilot Metrics API, activity report |
Step 2: choose your metrics
Select at least one metric from each category. Here’s a starter set:
- Velocity: PR lead time (agent vs. human)
- Quality: Build success rate (agent PRs vs. human PRs)
- Developer experience: Monthly satisfaction score (1-5 scale)
- Agent effectiveness: Task completion rate
Step 3: collect baseline data
Before making any changes to your agent configuration, record two weeks of data. This is your baseline — everything you measure later will be compared against it.
For each metric, record:
- The current value
- The date range
- How you measured it (API call, dashboard export, manual count)
Example baseline:
# AI-Assisted Development Metrics Baseline
**Date range:** 2026-02-10 to 2026-02-23
**Team size:** 6 developers
**Agent configuration:** 3 custom agents (implementer, test-writer, docs-updater)
## Velocity
- Avg PR lead time (all PRs): 8.2 hours
- Avg PR lead time (agent PRs): 5.1 hours
- PRs merged per week: 12
## Quality
- Build success rate (all PRs): 88%
- Build success rate (agent PRs): 85%
- Test coverage: 74.3%
## Developer experience
- Satisfaction score: 3.6/5 (survey of 6 developers)
- Top friction: "Agent PRs often miss edge cases in tests"
## Agent effectiveness
- Task completion rate: 55% (11 of 20 assigned tasks)
- Avg review rounds (agent PRs): 2.3
- Tasks assigned this period: 20
Step 4: create a spec template
Create the issue template from the spec-driven development section earlier. Add it to your repository:
- Create `.github/ISSUE_TEMPLATE/agent-task.md` with the template content.
- Create a sample issue using the template.
- Assign it to the coding agent (or simulate the assignment).
- After the PR is created, score each acceptance criterion as met or not met.
Step 5: plan your first improvement
Based on your baseline, identify one metric you want to improve and one change you’ll make:
| Current metric | Target | Change to make |
|---|---|---|
| Build success 85% | Build success over 90% | Add build and test commands to `copilot-setup-steps.yml` |
| Completion rate 55% | Completion rate over 65% | Add spec template with acceptance criteria to all agent issues |
| Review rounds 2.3 | Review rounds under 2.0 | Update custom instructions with code style examples |
Commit your baseline document and spec template. In two weeks, pull the same metrics and compare.
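The comparison itself can be a small script. A sketch using illustrative metric names, with numbers taken from the baseline document and the weekly snapshot above:

```python
def compare_to_baseline(baseline, current):
    """Delta between current metrics and the recorded baseline, keyed by
    metric name; a positive delta means the number went up."""
    return {k: round(current[k] - baseline[k], 2)
            for k in baseline if k in current}

baseline = {"build_success_agent": 85.0, "completion_rate": 55.0,
            "review_rounds": 2.3}
two_weeks = {"build_success_agent": 91.0, "completion_rate": 67.0,
             "review_rounds": 1.6}
print(compare_to_baseline(baseline, two_weeks))
# {'build_success_agent': 6.0, 'completion_rate': 12.0, 'review_rounds': -0.7}
```

Remember that a negative delta is good for lower-is-better metrics like review rounds.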
What you practiced
- Identifying data sources already available in your development pipeline
- Choosing metrics that balance velocity, quality, experience, and agent effectiveness
- Collecting a baseline before making changes (the only way to measure improvement)
- Creating spec templates that make the Plan phase concrete and measurable
- Connecting measurement to a specific improvement action
Conclusion
Measurement transforms AI-assisted development from a novelty (“look, the agent wrote code!”) into a validated engineering practice (“the agent reduced our PR lead time by 35% while maintaining build success rates”). The key principles:
- Measure outcomes, not output. PRs merged, build success, and developer satisfaction matter. Lines of code and commits don’t.
- Compare agent work to human work. The baseline isn’t “zero” — it’s what your team was doing before. Track agent PRs and human PRs side by side.
- Use specs as both instructions and scorecards. A spec with acceptance criteria tells the agent what to do and tells you how well it did it.
- Make one change at a time. When a metric is below target, trace the root cause and fix one thing. Measure the impact before changing something else.
- Developer experience is not optional. A team that ships faster but hates the process will find ways to route around the tools. Survey regularly and take the qualitative feedback seriously.
- The PDRC cycle is a measurement cycle. Plan (write the spec) → Delegate (assign with clear criteria) → Review (score against the spec) → Correct (feed results back into instructions). Each iteration makes the next one better.
In Ch 17, we’ll put everything from the entire series together in an end-to-end final project. You’ll receive a realistic scenario, apply every technique from Modules 1 through 4 — planning with specs, delegating to custom agents, reviewing with AI-assisted tools, correcting with hooks and feedback — and produce a complete, deployable result. It’s the full PDRC cycle, measured, from start to finish.