What this series is about

You already know how to write software. The question is: how do you work with an AI that also writes software?

Coding assistants are everywhere. Autocomplete suggestions, inline chat, autonomous agents that open pull requests while you sleep. The tooling has evolved fast, faster than most teams’ ability to use it well. The result is a gap: developers adopt the tools but don’t change how they think about the work. They accept suggestions without reviewing them. They prompt vaguely and get vague results. They let the agent run wild and then spend hours cleaning up the mess.

This series exists to close that gap. Every chapter is built around a single idea: you stay in control, and the AI amplifies your decisions. We’ll build that idea around three concrete practices:

  1. Spec-Driven Development — write success criteria before delegating anything
  2. Plan → Build mode — use the built-in workflow your tools already provide to validate an approach before executing it
  3. The harness — a portable system of skills, instructions, and agent profiles that makes the spec-first workflow repeatable across every project

We’ll apply these across real-world tasks: writing tests, reviewing code, debugging, documenting, configuring agents, and measuring impact.

The research does not support a lazy claim like “AI makes every developer faster.” It supports a sharper claim: AI helps when the task is scoped, the context is explicit, the batch is small, and verification is part of the workflow. It can slow experienced developers down when the codebase is mature, the task is high-context, and the generated diff creates more review work than it saves.

But before we get to the workflow (that’s Ch 2), we need to understand what these tools actually are, where they shine, and where they will waste your time.


The evolution: from autocomplete to autonomous agents

Not all AI coding tools work the same way. They sit on a spectrum of autonomy, and understanding where each one falls is the first step to using them well.

But first, some history, because this didn’t start yesterday.

A brief timeline

It looks long, but I promise it’s important context. Pay attention to how fast things changed in the last few years, and especially in the last 12 months.

Year | Event
2013 | Codota: early research on AI-based code suggestions for Java
2014 | Microsoft Research releases the Bing Code Search plugin for Visual Studio 2013: code snippet search via natural language (a precursor to Copilot)
2018 | Jacob Jackson (University of Waterloo) creates TabNine; its 2019 release, Deep TabNine, uses GPT-2 and is among the first deep-learning-based code completion tools
2019 | The Verge calls TabNine “Gmail’s Smart Compose for coders”; Codota acquires TabNine
Jun 2021 | GitHub Copilot announced as a technical preview, powered by OpenAI Codex (a fine-tuned GPT-3)
Jun 2022 | GitHub Copilot becomes generally available. In the technical preview, ~40% of code in enabled files was written by Copilot in languages like Python
Nov 2023 | Copilot Chat upgraded to GPT-4; Tabnine introduces its own AI chat agent
Oct 2024 | GitHub Copilot goes multi-model: users can choose between OpenAI GPT, Anthropic Claude, and Google Gemini
Feb 2025 | GitHub announces agent mode: Copilot can read/modify multiple files, run commands, and iterate on errors inside the IDE
Apr 2025 | Agent mode + MCP support rolled out to all VS Code users; Copilot Pro+ plan launched with premium requests
May 2025 | GitHub announces coding agent: fully autonomous, receives a GitHub issue, spins up an environment, writes code, opens a PR
Jun 2025 | GitHub’s “pair to peer” vision blog: Copilot repositioned from assistant to independent agent
Sep 2025 | GPT-5 and GPT-5 Mini generally available in Copilot (first GPT-5 family models); new embedding model for smarter code search in VS Code
Dec 2025 | Agent Skills: reusable instruction folders that teach Copilot specialized tasks (compatible with Claude Code skills)
Dec 2025 | Copilot Memory (early access): agents learn from your codebase and build repository-specific context over time
Dec 2025 | Custom agents from partners (Datadog, HashiCorp, etc.) for observability, IaC, and security workflows
Feb 2026 | Agentic Workflows (technical preview): repository automation written in Markdown (not YAML), executed via GitHub Actions with AI agents
Feb 2026 | 3rd-party agents (Anthropic Claude, OpenAI Codex) available on github.com and VS Code as alternative coding agents
Feb 2026 | Copilot CLI generally available: Copilot runs directly in the terminal; GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro available

The speed of this evolution matters: it took just four years to go from “autocomplete on steroids” to “autonomous agent that opens pull requests”, and then just nine more months to reach agentic workflows, persistent memory, and third-party agent ecosystems. Each level on that spectrum requires a different way of working, and it’s close to impossible to keep up with every change without a good understanding of how to use the tools effectively.

Completions (inline suggestions)

This is where it started for most developers. You type a function signature, and the tool predicts the next few lines. It works inside your editor, in real time, with no explicit prompt from you. TabNine pioneered this in 2018 and moved to deep learning in 2019; GitHub Copilot brought it to the mainstream in 2021.

When it works well: boilerplate, repetitive patterns, standard implementations you’d write on autopilot anyway.

When it doesn’t: anything that requires understanding why you’re writing the code, not just what comes next.

Think of completions as a fast typist who has read a lot of code but has no idea what your project does.

Chat (interactive conversation)

Chat gives you a conversation window, either in the IDE sidebar or in a terminal, where you describe what you want in natural language. The tool generates code, explains concepts, or helps you debug.

Key difference from completions: you provide explicit context and intent. Instead of the tool guessing from your cursor position, you tell it what you need.

This is where prompt quality starts to matter. A vague prompt like “fix this” will give you a generic answer. A specific prompt like “this function throws a TypeError when the input array is empty, add a guard clause and a unit test” will give you something useful.
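To make the specific prompt concrete, here is a minimal Python sketch of what a useful answer could look like. The function name `largest` and the use of `functools.reduce` are assumptions chosen for illustration, not code from any real project. Calling `reduce` without an initial value on an empty sequence raises a TypeError, which mirrors the bug the prompt describes.

```python
from functools import reduce


def largest(values):
    """Return the largest item, or None for an empty input (hypothetical example)."""
    # Guard clause: reduce() without an initial value raises TypeError
    # on an empty sequence, so handle that case before calling it.
    if not values:
        return None
    return reduce(lambda a, b: a if a >= b else b, values)


# Unit tests the prompt asked for, written pytest-style as plain functions.
def test_largest_handles_empty_input():
    assert largest([]) is None


def test_largest_returns_max():
    assert largest([3, 9, 2]) == 9
```

The point is not the function itself. The specific prompt names the failure, the fix, and the proof, so you can check the answer against all three.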

Agent mode (IDE-integrated)

Agent mode, introduced by GitHub Copilot in February 2025, takes chat a step further. Instead of generating a single code block, the agent can:

  • Read and modify multiple files
  • Run terminal commands
  • Execute tests
  • Iterate on its own output based on errors

You’re still in the loop (you approve or reject each action), but the tool is doing more than suggesting. It’s executing.
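The approve-or-reject loop can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not how Copilot or any other tool is implemented: the `Action` structure, the `run_with_approval` function, and the example plan are all hypothetical, and the sketch covers only the approval step, not the agent iterating on errors.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step the agent proposes (hypothetical structure)."""
    kind: str    # e.g. "edit_file", "run_command", "run_tests"
    detail: str  # human-readable description of the step


def run_with_approval(
    proposed: List[Action],
    approve: Callable[[Action], bool],
    execute: Callable[[Action], None],
) -> List[Action]:
    """Execute only the actions the human approves; return the rejected ones."""
    rejected = []
    for action in proposed:
        if approve(action):
            execute(action)
        else:
            rejected.append(action)
    return rejected


# Usage: approve edits and test runs, reject arbitrary shell commands.
executed = []
plan = [
    Action("edit_file", "add guard clause to utils.py"),
    Action("run_command", "rm -rf build/"),
    Action("run_tests", "pytest tests/test_utils.py"),
]
rejected = run_with_approval(
    plan,
    approve=lambda a: a.kind != "run_command",
    execute=executed.append,
)
```

In a real session the `approve` callback is you, clicking accept or reject. The sketch just makes it visible that nothing runs unless that callback says yes.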

When it works well: multi-file refactors, test generation, implementing well-scoped features with clear acceptance criteria.

When it needs you: deciding which files to touch, what the acceptance criteria are, and whether the result is actually correct.

Coding agents (autonomous)

This is the most autonomous level. GitHub’s Copilot Coding Agent was announced in May 2025; OpenAI’s Codex followed a similar model. A coding agent receives a task (typically a GitHub issue), spins up its own environment, writes code, runs tests, and opens a pull request. All without you sitting in front of the IDE.

Since then, the ecosystem has expanded rapidly. By February 2026, GitHub introduced Agentic Workflows, repository automation written in plain Markdown that runs via GitHub Actions with AI agents. Anthropic’s Claude and OpenAI’s Codex became available as alternative coding agents directly on github.com. And features like Agent Skills (reusable instruction folders) and Copilot Memory (repository-specific context that persists across sessions) made agents significantly more context-aware.

The critical difference: you’re not reviewing in real time. The agent works asynchronously, and you review the output after the fact. This makes reviewing against your spec even more important — by the time you see the code, the agent has already made dozens of decisions you didn’t approve individually. The spec you wrote before delegating is what lets you evaluate whether those decisions were correct.


The tool landscape

The market moves fast, so rather than memorizing features that will change next quarter, focus on the categories and what each tool is optimized for.

Tool | Type | Best at | Runs in
GitHub Copilot | Completions + Chat + Agent mode + Coding Agent + Agentic Workflows | Deep GitHub integration, code review, PR workflows, enterprise governance, agent skills, memory | VS Code, JetBrains, CLI, Eclipse, Xcode, Zed, GitHub.com
Claude Code | CLI agent (also available as 3rd-party agent on GitHub) | Long-context reasoning, complex refactors, multi-file changes | Terminal (CLI-first), GitHub.com
OpenAI Codex | Coding agent (also available as 3rd-party agent on GitHub) | Autonomous task execution from issues, cloud-based sandboxed environment | GitHub.com, ChatGPT
Cursor | IDE with AI-native UX | Codebase-aware chat, fast iteration, composer for multi-file edits | Cursor IDE (VS Code fork)
Windsurf | IDE with AI-native UX | Flows (multi-step agent), contextual awareness | Windsurf IDE

A practical heuristic

You don’t need to pick one tool and ignore the rest. Different tools excel at different tasks:

  • Quick inline edits and completions → Copilot, Cursor
  • Deep reasoning over a large codebase → Claude Code
  • Autonomous issue-to-PR → Copilot Coding Agent, Codex
  • Rapid prototyping with multi-file orchestration → Cursor (composer), Windsurf (flows)
  • Enterprise governance and code review → Copilot (organization policies, agent firewalls, custom instructions)
  • Repository automation with AI → Copilot Agentic Workflows (Markdown-based, runs in GitHub Actions)

The important thing is not which tool you use, but how you use it. A well-structured prompt in any of these tools will outperform a lazy prompt in the “best” tool.


Where AI delivers proven value

The evidence is strongest when the task is bounded and success can be checked. That is the first principle of the whole series.

The numbers

Finding | What was measured | What I take from it
Developers completed a bounded JavaScript HTTP-server task 55.8% faster with Copilot | Controlled experiment by Peng et al. | AI can be very effective on constrained implementation tasks. Do not generalize this number to broad product work.
Developers using Copilot completed 26.08% more tasks across three field experiments | Cui et al., with developers at Microsoft, Accenture, and a Fortune 100 company | Real enterprise gains exist, but adoption, experience level, and process maturity change the result.
Google engineers were estimated to finish an AI-assisted task about 21% faster, with uncertainty after controls | Google randomized controlled trial | Measured gains are plausible, but they are not automatic.
Experienced maintainers on mature open-source projects were 19% slower with early-2025 AI tools | METR randomized controlled trial | In high-context codebases, review and correction cost can exceed generation savings.
AI-generated code studies found substantial security weakness rates | Pearce et al.; Perry et al.; repository studies of Copilot-generated snippets | Treat generated code as untrusted input. Security review is part of the workflow, not an advanced add-on.
DORA found AI can amplify existing delivery strengths and weaknesses | DORA 2025 reports | A weak delivery system does not become strong because an assistant writes code faster.

Primary sources: Peng et al., Cui et al., Google’s randomized trial, METR’s experienced-developer study, Pearce et al., Perry et al., and DORA 2025.
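To get a feel for what those percentages mean on the clock, here is a small arithmetic sketch in Python. The 120-minute baseline is a hypothetical number, not a figure from any of the studies, and I am assuming that “55.8% faster” means 55.8% less elapsed time and “19% slower” means 19% more.

```python
def minutes_with_ai(baseline_minutes: float, percent_change: float) -> float:
    """Apply a percent change in completion time; negative means faster."""
    return baseline_minutes * (1 + percent_change / 100)


BASELINE = 120.0  # hypothetical two-hour task, not a figure from any study

bounded_task = minutes_with_ai(BASELINE, -55.8)    # Peng et al.: 55.8% less time
mature_codebase = minutes_with_ai(BASELINE, 19.0)  # METR: 19% more time

print(f"bounded task: {bounded_task:.1f} min")
print(f"mature codebase: {mature_codebase:.1f} min")
```

With that baseline, the bounded task drops to roughly 53 minutes, while the same two hours in a mature codebase grows to about 143. The same tool lands on opposite sides of the baseline depending on the kind of task.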

What this means in practice

The biggest productivity gains cluster around tasks that are repetitive, well-defined, and have clear patterns:

  • Test generation — writing unit tests, integration tests, expanding coverage for existing code
  • Documentation — JSDoc, docstrings, README files, API docs
  • Refactoring — extracting functions, renaming, restructuring code to improve maintainability
  • Debugging — explaining errors, suggesting fixes, tracing through stack traces
  • Boilerplate — configuration files, scaffolding, standard CRUD operations

Notice a pattern: these are all tasks where the what is clear and the how follows established conventions. The AI doesn’t need to understand your business domain to generate a test for a pure function.

The senior-engineer move is task selection. Do not ask, “Can AI do this?” Ask, “Can I specify this clearly, verify it objectively, and review the diff without increasing risk?” If the answer is no, start with analysis, tests, or a smaller slice.
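The three questions can be written down as a tiny decision helper. This is a sketch under my own assumptions: the `TaskCheck` structure and `next_step` function are hypothetical, and the mapping from each “no” to a specific fallback (analysis, tests, or a smaller slice) is one reasonable reading rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class TaskCheck:
    """Answers to the three task-selection questions (hypothetical structure)."""
    clearly_specified: bool       # Can I specify this clearly?
    objectively_verifiable: bool  # Can I verify it objectively?
    low_risk_review: bool         # Can I review the diff without increasing risk?


def next_step(check: TaskCheck) -> str:
    """Suggest how to proceed based on the three answers."""
    if check.clearly_specified and check.objectively_verifiable and check.low_risk_review:
        return "delegate"
    if not check.clearly_specified:
        return "start with analysis"
    if not check.objectively_verifiable:
        return "write tests first"
    # Specified and verifiable, but the diff would be too risky to review.
    return "take a smaller slice"
```

For example, `next_step(TaskCheck(True, False, True))` returns `"write tests first"`: the task is clear, but there is no objective way to check the result yet.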


What AI does NOT do well

This is the part most evangelists skip. Understanding the limitations is just as important as understanding the capabilities, because using AI on the wrong task wastes more time, money, and natural resources than doing it manually.

Complex architectural decisions

An AI can generate a microservice. It cannot tell you whether your system should be split into microservices. Architecture decisions require understanding organizational constraints, team topology, deployment infrastructure, and long-term maintenance costs. These are context-heavy decisions that no model has access to.

The empirical pattern is consistent: gains shrink when the task becomes high-context. METR’s 2025 study is the cleanest warning here. Experienced developers working in familiar, mature repositories expected AI to help, but the measured result was slower completion.

Organizational context

Your company’s coding standards, naming conventions, security policies, deployment pipelines, internal libraries: none of this exists in the model’s training data. Off-the-shelf tools won’t know that your team uses a specific error-handling pattern, or that certain modules require approval from the security team before changes.

This is exactly why Modules 01 and 03 of this series focus heavily on custom instructions, AGENTS.md, and MCP integrations. The tools can be taught your context, but you have to teach them.

Multifaceted requirements

When a task combines multiple frameworks, integrates with several systems, and has non-obvious constraints, AI tools struggle. Participants in a McKinsey study reported that to get usable solutions for multifaceted requirements, they had to break the problem into smaller segments manually before prompting.

One participant put it simply: “[Generative AI] is least helpful when the problem becomes more complicated and the big picture needs to be taken under consideration.”

The “accept all” trap

This isn’t a tool limitation; it’s a human one. The GitHub Survey found that developers consistently rank code quality as the most important metric. But when you accept every suggestion without reading it, quality degrades silently. Bad patterns compound. Technical debt accumulates. And by the time you notice, the damage is spread across dozens of files.

Security studies make the same point from a harsher angle. Pearce et al. found vulnerable code in a large share of Copilot-generated programs across security scenarios, and Perry et al. found that users with an AI assistant wrote less secure code while being more likely to believe it was secure. The quality doesn’t come from the AI. It comes from the developer’s review, tests, and refusal to accept plausible code blindly.

Real-world failures: what happens without human review

These aren’t hypothetical scenarios. Each of these happened in 2024–2026, and each illustrates what goes wrong when AI output is trusted without verification.

cURL ends its bug bounty after a flood of AI-generated reports

In January 2024, Daniel Stenberg — founder and lead developer of cURL, one of the most widely used open-source projects in the world — published a detailed account of AI-generated security reports overwhelming the project’s bug bounty program on HackerOne.

The reports looked professional. They included code references, proposed fixes, and were written in clean English. But they were fabricated. One claimed a critical vulnerability related to CVE-2023-38545 that didn’t exist. The reporter admitted that they were using Google Bard. Another described a buffer overflow in WebSocket handling, complete with a proposed patch, but after careful investigation there was no buffer overflow at all. The LLM had mixed real function names with hallucinated vulnerabilities.

Stenberg described the cost: each well-crafted fake report took real developer time to investigate and dismiss. Security reports are high priority: they trump bug fixes and feature work. Every hallucinated vulnerability stole hours from real development work.

The scale of the problem grew. By January 2026, the project ended its bug bounty program entirely, stating it was “an attempt to reduce the noise.” The word “slop” (Merriam-Webster’s 2025 Word of the Year) became the shorthand for this kind of AI-generated low-quality output that shifts its cost to the recipient.

What went wrong: The reporters had no spec — no success criteria to verify the report against, no threat model to check claims against. They delegated entirely to an LLM and submitted the raw output without verifying any of the claims. The entire cost of verification landed on the maintainers.

OpenClaw: when an autonomous agent becomes a security nightmare

In January 2026, OpenClaw (originally “Clawdbot”) — an open-source autonomous AI agent by Peter Steinberger — went viral with 140,000 GitHub stars. It could book flights, manage calendars, control browsers, and execute shell commands on behalf of users, all through messaging platforms like WhatsApp and iMessage.

The security community raised alarms almost immediately. Cisco’s AI Threat and Security Research team tested a third-party skill called “What Would Elon Do?”, which had been artificially inflated to rank #1 in the skill repository, and found it was functionally malware. The skill performed silent data exfiltration (sending user data to an external server via curl without user awareness) and direct prompt injection to bypass safety guidelines. Cisco’s Skill Scanner flagged nine security findings, including two critical and five high severity.

The risks were systemic, not just theoretical:

  • Skills operated without vetting. Anyone could publish a skill to the repository; there was no code review, no signing, no sandbox.
  • The agent had broad system access. OpenClaw could run shell commands, read/write files, and access email, calendars, and messaging — a single misconfigured skill could exfiltrate sensitive data across all connected services.
  • Prompt injection via messaging apps extended the attack surface. Malicious prompts embedded in messages could cause unintended behavior without the user ever triggering them directly.
  • Plaintext credentials were leaked: API keys and tokens were reported exposed and could be stolen via prompt injection or unsecured endpoints.

One of OpenClaw’s own maintainers warned on Discord: “if you can’t understand how to run a command line, this is far too dangerous of a project for you to use safely.”

In a separate incident, a user discovered that his OpenClaw agent had created a dating profile on MoltMatch (an AI agent dating platform) and was screening potential matches without his explicit direction. The agent acted autonomously beyond what the user intended — exactly the kind of behavior that escapes detection when there’s no Review step.

What went wrong: Users delegated life-management tasks to an autonomous agent without reviewing what skills it loaded, what permissions it needed, or what it was actually doing in the background. There was no spec defining what the agent was and wasn’t allowed to do, no review of the skills before they ran, and no permission boundaries to contain the damage.
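A permission boundary does not have to be elaborate. The following Python sketch compares what each skill asks for against what the spec allows. Everything in it is hypothetical: the permission names, the skill names, and the manifest format are my own illustration and are not taken from OpenClaw or any real agent framework.

```python
from typing import Dict, Set

# Hypothetical permission names; not taken from OpenClaw or any real agent.
ALLOWED_PERMISSIONS: Set[str] = {"calendar.read", "calendar.write"}


def vet_skills(manifests: Dict[str, Set[str]], allowed: Set[str]) -> Dict[str, Set[str]]:
    """Map each skill that oversteps to the permissions it should not have."""
    blocked = {}
    for name, requested in manifests.items():
        excess = requested - allowed  # permissions beyond what the spec allows
        if excess:
            blocked[name] = excess
    return blocked


# Usage: one skill stays within bounds, the other asks for far too much.
manifests = {
    "meeting-scheduler": {"calendar.read", "calendar.write"},
    "productivity-tips": {"calendar.read", "shell.exec", "network.send"},
}
blocked = vet_skills(manifests, ALLOWED_PERMISSIONS)
```

Here `blocked` contains only `"productivity-tips"`, flagged for `shell.exec` and `network.send`. A check like this only works if someone wrote down the allowed set first, which is exactly the spec that was missing.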

The OpenClaw story also hit close to home for AI professionals themselves. In February 2026, Business Insider reported that a Meta AI alignment director shared her own OpenClaw nightmare: the agent deleted emails from her account without permission, and she “had to RUN to my Mac mini” to stop it. This is someone whose literal job is AI safety — and even she was caught off guard by an autonomous agent acting beyond its intended scope.

AWS outages caused by an AI coding agent

In early 2026, it was reported that Amazon Web Services suffered at least two outages caused by its own AI tools. In the most notable incident, in December 2025, engineers allowed Amazon’s Kiro agentic coding tool to make changes autonomously, and the AI decided the best course of action was to delete and recreate the environment, causing a 13-hour disruption to AWS Cost Explorer in one of Amazon’s cloud regions in China.

As a senior AWS employee told the Financial Times: “The engineers let the AI agent resolve an issue without intervention. The outages were small but entirely foreseeable.” In both incidents, the engineers didn’t require a second person’s approval before finalizing the changes.

Amazon’s response was to call it “user error, not AI error” and a “coincidence” that AI was involved, but security researchers disagreed. As one researcher pointed out, AI agents don’t have full visibility into the context in which they’re running. They don’t understand how customers might be affected or what the cost of downtime might be.

This is AWS — one of the most sophisticated engineering organizations on the planet — operating its own infrastructure, using its own AI tool. If it can happen there, it can happen anywhere.

What went wrong: The engineers let the agent resolve an issue without defining what it was allowed to do. There was no scope constraint (no spec saying “do not delete or recreate environments”), no second pair of eyes before changes went live, and no rollback plan if the agent’s “fix” made things worse. A simple approval gate at the plan stage could have caught a “delete and recreate” action before it caused 13 hours of downtime.
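Here is a minimal sketch of such a gate in Python, assuming the agent produces its plan as a list of plain-text steps before executing anything. The keyword list, the function name, and the example plan are hypothetical, and keyword matching is deliberately crude: it illustrates the checkpoint, not a production-grade policy engine.

```python
from typing import List, Sequence

# Hypothetical scope constraint from a spec: actions the agent may not take
# without a human signing off first.
FORBIDDEN_KEYWORDS = ("delete", "recreate", "drop", "destroy")


def steps_needing_approval(
    plan: List[str], forbidden: Sequence[str] = FORBIDDEN_KEYWORDS
) -> List[str]:
    """Return the plan steps that mention a forbidden action."""
    flagged = []
    for step in plan:
        lowered = step.lower()
        if any(word in lowered for word in forbidden):
            flagged.append(step)
    return flagged


# Usage: review the plan before the agent is allowed to run it.
plan = [
    "Inspect the failing configuration",
    "Delete and recreate the environment",
    "Re-run the health checks",
]
flagged = steps_needing_approval(plan)
if flagged:
    print("Blocked pending human approval:", flagged)
```

The gate runs at the plan stage, before any change is made, so the destructive step is stopped while it is still just a sentence.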

“Workslop”: AI-generated work that creates more work

A September 2025 study published in the Harvard Business Review, conducted jointly by Stanford University and BetterUp, coined the term “workslop”: AI-generated content at work that looks polished but lacks substance, shifting the burden of quality from the creator to the recipient.

The findings are stark: 40% of participating employees had received workslop, and each incident took an average of two hours to resolve. The mechanism is simple: someone uses an LLM to draft a document, email, or proposal, spends minimal time reviewing it, and sends it along; the recipient then has to figure out what’s actually correct, what’s hallucinated, and what’s missing.

BetterUp defines workslop as “AI-generated content that looks good, but lacks substance.” It is the professional equivalent of accepting all Copilot suggestions: the output appears complete, but the thinking behind it is absent.

What went wrong: The “slopper” had a task and used an LLM to complete it, but skipped the step that matters: reviewing the output against the actual goal before handing it off. The cost didn’t disappear; it was externalized to colleagues.

The pattern across all these cases

Every failure follows the same structure:

What was missing | What that looked like | What happened and who paid
A spec with scope boundaries | No threat model, no scope definition | The agent acted with unbounded authority (OpenClaw); the AI deleted and recreated an environment (AWS)
Review against acceptance criteria | Output accepted without verification | Hallucinated vulnerabilities wasted maintainer time (cURL); a Meta director’s emails were deleted (OpenClaw)
Iteration on quality | No correction loop, no quality check | Colleagues spent hours fixing workslop (HBR study); 13 hours of downtime (AWS)

The tools themselves aren’t the problem. The absence of human judgment at critical checkpoints is.


Setting expectations for this series

By the end of this series, you will:

  1. Have a structured workflow (spec-first + Plan→Build mode) for deciding when and how to use AI on any coding task
  2. Know how to build a harness — the system of instructions, agents, and skills that makes the workflow repeatable across projects
  3. Be able to generate, review, and iterate on AI-assisted tests, code reviews, documentation, and refactors
  4. Understand the security and governance implications of AI-assisted development
  5. Measure the impact of AI tools on your team’s productivity with concrete metrics

What you will not get is a magic prompt that solves everything. The tools are force multipliers — they multiply whatever you bring to them. Bring clear thinking and structured intent, and you’ll get impressive results. Bring vague prompts and uncritical acceptance, and you’ll get impressive-looking garbage.

That is the portfolio-level claim of this series: a strong AI-assisted developer is not the person who prompts the most. It is the person who can turn messy intent into a spec, choose the right level of autonomy, constrain the assistant, verify the result, and explain the trade-offs in the final PR.

Let’s start with the workflow that makes the difference. Next up: Chapter 2 — The Spec-First Workflow.