
Stripe's AI Agents Write All the Code. Here's How.

March 2, 2026 · 13 min read · 2,565 words
Agentic Engineering · Stripe · Coding Agents · Architecture · MCP · CI/CD
IndyDevDan video thumbnail: I Studied Stripe's AI Agents... Vibe Coding Is Already Dead
Image: Screenshot from YouTube.

Key insights

  • Stripe ships over 1,300 pull requests per week with zero human-written code, powered by six architectural layers working together
  • Blueprints combine deterministic code with agent flexibility, letting engineers control which steps need precision and which benefit from AI creativity
  • A centralized tool shed manages nearly 500 MCP tools, solving the discoverability problem that breaks most agent setups
Source: YouTube
Published March 2, 2026
Host: IndyDevDan (Dan)

This article is a summary of I Studied Stripe's AI Agents... Vibe Coding Is Already Dead. Watch the video →



In Brief

IndyDevDan breaks down Stripe's recently published blog post on Minions, their internal coding agents that reportedly ship over 1,300 pull requests per week with no human-written code. Dan walks through six architectural components that make this possible, from warm development sandboxes that spin up in seconds to a centralized tool shed managing nearly 500 MCP (Model Context Protocol) tools. The analysis highlights what separates production-grade agentic engineering from casual AI-assisted coding.

  • 1,300+ AI-written PRs per week
  • ~500 MCP tools in the tool shed
  • 3M+ tests run on every push

Agentic engineering vs. vibe coding

Dan opens with a sharp distinction between two approaches to working with AI coding tools. Agentic engineering means knowing your system so well that you can predict what will happen without watching every step. Vibe coding, by contrast, means prompting an AI and hoping for the best (2:10).

The difference matters because Stripe processes $1.9 trillion in annual payment volume (0:28). Their codebase contains hundreds of millions of lines of code, much of it in an uncommon Ruby stack with homegrown libraries that Large Language Models (LLMs) have never seen in their training data (1:15). Getting AI agents to work reliably in this environment requires deliberate architecture, not luck.


The six components of Stripe's agentic layer

Dan identifies six architectural layers that together form what Stripe calls their agentic platform (3:07). Each layer solves a specific problem that breaks most agent setups.

1. API layer: three ways in

Engineers interact with Minions through three interfaces: a Slack integration for quick tasks, a command-line interface (CLI) for terminal workflows, and a web interface for more involved work (3:07). Multiple entry points mean agents fit into existing workflows rather than forcing engineers to adopt a new tool.

2. Warm dev box pool

Every Minion run gets its own isolated development environment, a sandboxed copy of Stripe's codebase running on AWS EC2 instances. These dev boxes spin up in roughly 10 seconds (11:15) because Stripe maintains a pool of pre-warmed instances ready to go.

Engineers typically run half a dozen dev boxes simultaneously (16:26), each handling a different task. This is what Dan calls "out-loop" operation: fully unattended agents working in parallel sandboxes (20:38). The alternative, "in-loop" operation, is the familiar pattern of sitting at a desk prompting back and forth (20:03). Stripe's architecture supports both, but the real throughput comes from running agents out-loop.
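The pool mechanics can be sketched in a few lines. This is a hypothetical illustration (Stripe has not published implementation details, and the `WarmPool` name and its factory callback are invented here): hand out a pre-initialized environment immediately, and provision its replacement in the background so the pool stays warm.

```python
import queue
import threading

class WarmPool:
    """Keep pre-initialized dev boxes ready so acquisition is near-instant."""

    def __init__(self, provision, size=6):
        self._provision = provision      # slow factory, e.g. boot an EC2 instance
        self._ready = queue.Queue()
        for _ in range(size):
            self._ready.put(provision())  # pay the cold-start cost up front

    def acquire(self):
        """Hand out a warm box immediately; replenish in the background."""
        box = self._ready.get()
        threading.Thread(
            target=lambda: self._ready.put(self._provision()),
            daemon=True,
        ).start()
        return box

# Usage: each agent run grabs its own sandbox without waiting for a cold boot.
pool = WarmPool(provision=lambda: {"status": "ready"}, size=3)
box = pool.acquire()
print(box["status"])  # -> ready
```

The design choice is the same one Stripe's 10-second spin-up implies: the expensive work happens before anyone asks for a box, so the agent's critical path only pays for a queue pop.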

3. Agent harness (forked from Goose)

The agent harness is the runtime that executes each Minion. Stripe forked it from Goose, Block's open-source coding agent (12:22). The harness handles the low-level mechanics: reading files, running commands, managing context, and communicating with the LLM.

By building on an open-source foundation, Stripe avoided reinventing the basic agent loop. They focused their engineering effort on the layers above the harness, specifically blueprints and the tool shed, where their unique requirements live.

4. Blueprint engine

A blueprint is a workflow defined in code that directs a Minion run (23:39). Blueprints combine deterministic steps (steps that always execute the same way) with agent-driven steps where the LLM decides what to do. This is the key architectural choice: not everything needs to be agentic.

For example, a blueprint might deterministically check out the right branch and load specific context files, then hand control to the agent for the creative work of writing code, then deterministically run the test suite. The engineer controls which steps need precision and which benefit from flexibility.

Explained simply: Think of a blueprint like a recipe with two types of instructions. Some steps are exact, like "Preheat the oven to 200°C," and always happen the same way. Other steps require judgment, like "Season to taste," where the chef (the AI) decides what to do. Stripe's blueprints work the same way: some steps are fixed code that runs identically every time, while others let the AI make decisions. Unlike a recipe, though, the blueprint also controls which tools the AI can access and how many attempts it gets.

1. Design the blueprint: An engineer defines a workflow that mixes deterministic code steps with agent-driven steps. The blueprint specifies which context files to load, which tools to enable, and how many CI cycles the agent gets.

2. Spin up a dev box: The platform pulls a pre-warmed EC2 instance from the pool. The full Stripe codebase is available within seconds.

3. Execute the blueprint: The agent harness runs through the blueprint. Deterministic steps execute as written. Agent steps let the LLM read code, write changes, and use tools.

4. Run CI validation: The agent pushes code, triggering Stripe's test suite. Over 3 million tests run to validate the changes (14:29). The agent gets at most 2 rounds of CI feedback to fix any failures (14:50).

5. Open a pull request: If CI passes, the Minion opens a pull request on GitHub. A human engineer reviews and approves before merge.
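The lifecycle above can be sketched as data: an ordered list of steps, each flagged as deterministic code or an agent-driven decision. Everything here is a hypothetical illustration (the `Step` and `execute` names, and the toy step bodies, are invented; Stripe's blueprint engine is not public):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]
    deterministic: bool  # True: fixed code; False: the agent/LLM decides

def execute(blueprint, state):
    """Run each step in order, threading a shared state dict through."""
    for step in blueprint:
        state = step.run(state)
    return state

# Hypothetical blueprint: fixed setup and validation bracket the creative step.
blueprint = [
    Step("checkout branch", lambda s: {**s, "branch": "minion/task-1"}, True),
    Step("load scoped rules", lambda s: {**s, "rules": ["payments.md"]}, True),
    Step("write code", lambda s: {**s, "diff": "agent-generated change"}, False),
    Step("run tests", lambda s: {**s, "ci_passed": True}, True),
]

result = execute(blueprint, {})
print(result["ci_passed"])  # -> True
```

The point of the structure is visible in the list itself: only one step is marked non-deterministic, so the engineer knows exactly where the run can surprise them.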

5. Scoped rule files

Stripe uses conditional context loading to give each Minion run only the rules it needs (3:38). Rather than dumping the entire company's coding standards into every prompt, rule files are scoped to specific parts of the codebase. A Minion working on payments infrastructure loads different rules than one working on the dashboard.

This keeps prompts focused and reduces noise, which is critical when the codebase has hundreds of millions of lines and the LLM has never seen Stripe's homegrown libraries during training.
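Conditional context loading reduces to a small lookup. A minimal sketch, with invented path prefixes and rule-file names (Stripe's actual scoping scheme is not described in detail):

```python
# Hypothetical mapping from codebase areas to rule files; only the rules
# matching the task's target path are loaded into the prompt.
SCOPED_RULES = {
    "payments/": ["payments-conventions.md", "idempotency-rules.md"],
    "dashboard/": ["frontend-style.md"],
}

def rules_for(path):
    """Return only the rule files scoped to the area being edited."""
    matched = []
    for prefix, rules in SCOPED_RULES.items():
        if path.startswith(prefix):
            matched.extend(rules)
    return matched

print(rules_for("payments/ledger.rb"))  # -> ['payments-conventions.md', 'idempotency-rules.md']
print(rules_for("dashboard/app.tsx"))   # -> ['frontend-style.md']
```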

6. Tool shed

The tool shed is Stripe's centralized internal MCP (Model Context Protocol, a standard for connecting AI agents to external tools and data sources) server that manages nearly 500 tools (29:14). Dan describes it as a "meta-tool": instead of loading hundreds of tool definitions into every prompt, the agent queries the tool shed to discover and load only what it needs for the current task (29:21).

Explained simply: Think of the tool shed like a library catalog. A student doesn't carry every book in the library to their desk. They look up which books are relevant, grab those, and leave the rest on the shelves. The tool shed works the same way for AI agents: it indexes all available tools and serves only the ones the agent needs right now. Unlike a library catalog, though, the tool shed can also enforce access controls, deciding which agents are allowed to use which tools.
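The meta-tool pattern can be sketched as a searchable catalog with an access filter. All catalog entries and the `discover` helper are hypothetical (Stripe has not published the tool shed's interface); the sketch only shows the shape of the idea:

```python
# Hypothetical tool catalog: the agent searches it instead of loading
# all ~500 definitions into every prompt.
CATALOG = [
    {"name": "run_migration", "tags": {"database", "schema"}},
    {"name": "query_logs", "tags": {"observability", "debugging"}},
    {"name": "open_pr", "tags": {"git", "review"}},
]

def discover(task_tags, allowed=None):
    """Return only the tools relevant to the task (and permitted for this agent)."""
    return [
        t["name"] for t in CATALOG
        if t["tags"] & task_tags and (allowed is None or t["name"] in allowed)
    ]

print(discover({"database"}))            # -> ['run_migration']
print(discover({"git"}, allowed=set()))  # -> [] (access control denies everything)
```

In a real MCP setup the catalog itself is exposed as a tool, so the agent's first call is "what tools exist for this task" rather than a prompt pre-stuffed with every definition.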


What practitioners can learn

Dan rates Stripe's agentic layer 8 out of 10 (32:32). Two specific critiques point to areas where the architecture could evolve.

The CI feedback limit

Minions get at most 2 rounds of CI feedback (33:02). If the agent can't fix failing tests in two attempts, the run is abandoned. Dan argues this is too limiting. More CI rounds would let agents solve harder problems, especially in a codebase with 3 million tests where failures can cascade in unexpected ways.

The counterargument is cost: each CI run against Stripe's full test suite consumes significant compute resources. Stripe likely chose 2 rounds as a balance between agent capability and infrastructure cost.
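The trade-off lives in a single parameter of the control loop. A toy sketch (the function and its callbacks are invented for illustration, not Stripe's code): the round budget caps how many times the expensive CI suite runs per task.

```python
def run_with_ci_budget(attempt_fix, run_ci, max_rounds=2):
    """Let the agent react to CI failures, but cap the number of CI runs.

    attempt_fix(feedback) -> candidate change; run_ci(change) -> (passed, feedback).
    Returns the passing change, or None if the run is abandoned.
    """
    feedback = None
    for _ in range(max_rounds):
        change = attempt_fix(feedback)
        passed, feedback = run_ci(change)
        if passed:
            return change
    return None

# Toy example: the "agent" succeeds only after reading one round of feedback.
def agent(feedback):
    return "fixed" if feedback else "first try"

result = run_with_ci_budget(agent, lambda c: (c == "fixed", "test X failed"))
print(result)  # -> fixed
```

Raising `max_rounds` directly trades infrastructure cost for the chance to solve harder tasks, which is exactly the tension Dan's critique points at.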

The human review bottleneck

Stripe's current system still requires a human engineer to review and approve every pull request before merge (34:02). Dan pushes toward what he calls Zero Touch Engineering (ZTE): a future where sufficiently validated changes go from prompt to production with no human review (34:29).

For a company processing $1.9 trillion in payments, removing human review is a significant trust threshold. But Dan's point is directional: as CI validation and automated testing improve, the human review step becomes less about catching bugs and more about organizational comfort.


Applying these patterns to your own work

You don't need Stripe's scale to benefit from their architecture. Several components translate directly to smaller teams and individual developers.

1. Start with scoped context files: Create rule files (like CLAUDE.md or .cursorrules) that give your AI coding tools project-specific knowledge. Focus on conventions, patterns, and constraints that the LLM wouldn't know from public training data.

2. Design simple blueprints: Structure your agent workflows as a mix of deterministic and agentic steps. For example: deterministically load context and check out a branch, let the agent write code, then deterministically run tests. Most coding agent tools support this through configuration files or scripts.

3. Add CI as a feedback loop: Connect your agent's output to automated tests. Even a small test suite gives the agent something to work against. The agent writes code, tests fail, the agent reads the failure, and tries again.

4. Run agents in parallel: If your tasks are independent, run multiple agent sessions simultaneously. Each works in its own branch or directory. This is the "out-loop" pattern that drives Stripe's throughput.
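The out-loop dispatch pattern is a few lines with standard concurrency primitives. A minimal sketch, where `run_agent` is a stand-in for launching a real unattended session in its own branch or sandbox:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(task):
    """Stand-in for an unattended agent run in its own branch/sandbox."""
    return {"task": task, "branch": f"agent/{task}", "status": "pr-opened"}

# Dispatch independent tasks in parallel and collect results afterwards,
# instead of babysitting one interactive session at a time.
tasks = ["fix-flaky-test", "bump-deps", "add-logging"]
with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
    results = list(executor.map(run_agent, tasks))

print([r["branch"] for r in results])
# -> ['agent/fix-flaky-test', 'agent/bump-deps', 'agent/add-logging']
```

The review step then happens in a batch: the engineer looks at three finished pull requests rather than supervising three live sessions.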


Common pitfalls: What goes wrong with coding agents

Based on the patterns Dan describes, these are the most frequent mistakes when setting up AI coding agents.

  • Dumping everything into one prompt? Stripe uses scoped rule files and a tool shed specifically to avoid overloading the agent with irrelevant context. Start with the minimum context the agent needs for the current task
  • Skipping CI entirely? Without automated tests, there is no feedback loop. The agent has no way to know whether its code works. Even basic linting and type checking give the agent something to correct against
  • Running agents only in-loop? Sitting at the desk watching every agent step limits throughput to one task at a time. Design your workflow so agents can run unattended in sandboxed environments
  • Using a generic prompt for every task? Stripe's blueprint engine exists because different tasks need different workflows. A migration task needs different tools and rules than a bug fix
  • Expecting agents to handle unfamiliar code without help? Stripe's codebase uses homegrown Ruby libraries that LLMs have never seen. They solved this with scoped rule files that teach the agent about internal conventions. If your project uses unusual patterns, document them for the agent
Remember: Stripe didn't build this overnight. They started with an open-source foundation (Goose), added layers incrementally, and iterated based on what worked. The architecture is a result, not a starting point.


Practical implications

For individual developers

Start with context files. A well-written CLAUDE.md or project-specific rules file is the single most impactful improvement for AI coding. It costs nothing, takes 30 minutes to write, and immediately improves every agent interaction by giving the LLM knowledge about your project's conventions.

For teams adopting coding agents

Invest in your CI pipeline before scaling agent usage. Stripe's 3 million tests are what make 1,300 weekly agent-written pull requests viable. Without strong automated validation, more agent output just means more bugs to review manually. Start with the test coverage you have and expand it as agent usage grows.

For engineering leaders evaluating agent platforms

Look for the blueprint pattern. Platforms that let you mix deterministic and agentic steps give you control over quality-critical paths while still benefiting from AI flexibility. Avoid platforms that are purely agentic with no way to enforce specific steps.

Test yourself

  1. Architecture trade-off: Stripe limits Minions to 2 rounds of CI feedback. If you were designing this system, how would you decide the right number of rounds, and what factors would you weigh?
  2. Transfer: Stripe's tool shed manages nearly 500 MCP tools through a central catalog. How would you apply this pattern to a non-coding domain, for example a customer support system with dozens of integrations?
  3. Trade-off: Blueprints mix deterministic and agentic steps. For a database migration task, which steps would you make deterministic and which would you leave to the agent?
  4. Behavior: Dan argues that Zero Touch Engineering (prompt to production, no human review) is the future. How might removing human review change the way engineers think about test coverage and code quality?
  5. Architecture: Stripe pre-warms development environments so they spin up in 10 seconds. What would break in their architecture if this took 5 minutes instead?

Glossary

Agent harness: The runtime that executes a coding agent. It handles the basic loop: read files, run commands, talk to the LLM, and apply changes. Stripe's harness is forked from Goose.

Agentic engineering: Dan's term for designing AI coding systems with enough structure that you can predict outcomes without watching every step. The opposite of hoping a prompt will work.

Blueprint: A workflow defined in code that mixes deterministic steps with agent-driven steps. Think of it like a recipe where some steps say "add exactly 200g flour" and others say "season to taste."

CI (Continuous Integration): An automated system that runs tests every time code is pushed. If tests fail, the code doesn't merge. Stripe runs over 3 million tests on every push.

CLI (Command-Line Interface): A text-based way to interact with software by typing commands in a terminal. One of three ways Stripe engineers can trigger a Minion run.

Dev box: A sandboxed development environment where an agent can safely read, write, and test code without affecting the main codebase. Stripe keeps a pool of these pre-warmed and ready.

Deterministic step: A step in a blueprint that always executes the same way, regardless of what the AI decides. Checking out a git branch or running a test suite are deterministic steps.

EC2: Amazon's cloud computing service (Elastic Compute Cloud). Stripe uses EC2 instances as sandboxed environments for each Minion run.

Goose: An open-source coding agent built by Block (the company behind Square and Cash App). Stripe forked Goose as the foundation for their agent harness.

In-loop: A working mode where the developer sits at the desk and interacts with the agent step by step. Gives full control but limits throughput to one task at a time.

LLM (Large Language Model): An AI model trained on massive amounts of text that can understand and generate code and natural language. The "brain" behind coding agents like Minions.

MCP (Model Context Protocol): A standard for connecting AI agents to external tools and data sources. Stripe's tool shed is an MCP server that manages nearly 500 tools.

Minions: Stripe's internal name for their AI coding agents. Each Minion handles a single task end-to-end: read code, write changes, run tests, open a pull request.

Out-loop: A working mode where agents run fully unattended in parallel sandboxes. The engineer dispatches tasks and reviews results later. This is where the throughput gains come from.

PR (Pull Request): A request to merge code changes into the main codebase. Other engineers review the changes before approving. Stripe's Minions open PRs automatically after CI passes.

Scoped rule files: Context files that contain coding standards and conventions for specific parts of a codebase. Only loaded when the agent works in that area, keeping prompts focused.

Tool shed: Stripe's centralized MCP server that indexes and serves nearly 500 tools. Agents query it to discover which tools are relevant to their current task instead of loading all tools at once.

Vibe coding: Dan's term for the practice of prompting an AI and hoping for good results without understanding the underlying system. The opposite of agentic engineering.

Warm pool: A collection of pre-initialized development environments kept ready for immediate use. Like having rental cars already running in the parking lot instead of making customers wait for the engine to start.

ZTE (Zero Touch Engineering): A future vision where sufficiently validated code goes from prompt to production with no human review step. Not yet implemented at Stripe.

Sources and resources