Building Agent V2: Self-Healing Autonomous Code Generation for Production-Ready Applications

We gave it a thinking framework, an execution loop, and a quality standard. Here's what happened.

Rocket is a vibe solutioning platform:

Capability	What it does
Solve	Produces structured strategic intelligence
Intelligence	Monitors competitors continuously
Build	Generates production-ready applications
Projects	Holds all of it together so every task starts from the full picture, not a compressed summary

1.5M people across 180 countries have used what we've built.

This post is about what happens inside Build. Specifically, the agent architecture that turns a user's prompt into working code that doesn't look, feel, or break like AI-generated output.

We call it Agent V2. It:

Thinks before it codes
Runs its own code, reads its own errors, fixes them, and runs again
Enforces a quality standard strict enough that the output doesn't carry the fingerprints of a machine

Three architectural decisions made the difference.

What Agent V2 actually is

Agent V2 is an end-to-end autonomous code generation system. "End-to-end" means the agent owns every step from the moment a user submits a prompt to the moment a working preview appears. Not just the code generation step. All of it.

The Self-Healing Loop

The critical part is what we call the self-healing loop, a tight cycle of three steps:

Run the generated code
Read the logs to check the actual build output, runtime errors, and console output
Fix by classifying the error and making a surgical correction (not a full regeneration)

Then it runs again. And again, until the build passes.

The agent doesn't guess whether its code works. It knows.
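A minimal sketch of the run, read logs, fix cycle described above, in Python. Everything named here is hypothetical: `RunResult`, `classify`, the error categories, the `max_cycles` cap, and the toy `run`/`fix` callbacks stand in for Rocket's real sandbox and model calls, which this post does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunResult:
    ok: bool
    logs: str = ""


def classify(logs: str) -> str:
    """Map raw log text to a coarse error category (hypothetical categories)."""
    text = logs.lower()
    if "cannot find module" in text or "is not defined" in text:
        return "missing_import"
    if "permission denied" in text or "policy" in text:
        return "access_policy"
    return "unknown"


def self_heal(
    code: str,
    run: Callable[[str], RunResult],
    fix: Callable[[str, str, str], str],
    max_cycles: int = 5,  # assumed safety cap, not a figure from the post
) -> Tuple[str, int]:
    """Run, read logs, fix, and repeat until the build passes.

    Returns the passing code and the number of fix cycles it took.
    """
    for cycle in range(max_cycles + 1):
        result = run(code)
        if result.ok:
            return code, cycle
        if cycle == max_cycles:
            break
        # Surgical correction: patch the existing code, do not regenerate it.
        code = fix(code, classify(result.logs), result.logs)
    raise RuntimeError(f"build still failing after {max_cycles} fix cycles")


# Toy stand-ins for the real build sandbox and the model-driven fixer.
def toy_run(code: str) -> RunResult:
    if "useState(" in code and "import { useState }" not in code:
        return RunResult(False, "ReferenceError: useState is not defined")
    return RunResult(True)


def toy_fix(code: str, category: str, logs: str) -> str:
    if category == "missing_import":
        return 'import { useState } from "react";\n' + code
    return code


fixed, cycles = self_heal("const [n, setN] = useState(0);", toy_run, toy_fix)
```

In this toy run the missing import is caught on the first build, patched, and the second build passes, so `cycles` is 1 and the caller only ever sees `fixed`.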

How this plays out in practice:

A component with a missing import gets caught, fixed, and re-run before the user ever sees it
A database integration with an incorrect access policy gets tested against the actual backend, fails, gets diagnosed from the error log, and gets corrected
The user sees the working version

The agent runs its own code. That sentence sounds simple. In practice, it changes everything about the quality of what gets returned. Error categories that previously required the user to describe the problem, wait for a regeneration, and check again now get resolved inside the agent's own loop.

The user describes what they want. The agent figures out how to make it work.

But the loop alone wasn't enough.

The mental model: why the loop needs a brain

The self-healing loop gives Agent V2 the ability to recover from errors. The mental model gives it the ability to avoid most of them in the first place. These are two different capabilities, and the agent needs both.

Without the mental model, the self-healing loop still works. The agent generates, runs, reads the error, and fixes. But it fixes reactively. It has no plan to reason against, so it treats every failure as an isolated problem.

We watched early versions of Agent V2 spend 3-4 fix cycles on problems that a few seconds of structured thinking would have prevented entirely.

So we built a mental model for how the agent should reason before it codes. Not a vague instruction to "plan carefully." A specific 6-phase framework:

Phase 1: Analysis. What does the user actually want? Not the literal words in the prompt, but the underlying requirement. "Add payments" means different things for a SaaS subscription app versus a one-time purchase e-commerce store. The agent resolves ambiguity before generating a single line of code.
Phase 2: Context Gathering. What exists in the codebase right now? What files are relevant? What patterns are already established? The agent reads the existing project, identifies what matters for this specific task, and builds a working picture of the current state.
Phase 3: Solution Design. What's the right architectural approach? Which files need to change? What new files need to be created? What are the dependencies between changes? This phase produces a structured TODO list that becomes the agent's working memory for the entire task.
Phase 4: Implementation. Execute the TODOs in order. Generate code for each step. Because the agent planned its approach, each generation step has precise context about what it's building and why, not just the user's original prompt.
Phase 5: Integration. Merge generated code into the existing codebase. Verify that new code works with existing patterns, naming conventions, and architectural decisions. This prevents the "fixed one thing, broke another" failure mode.
Phase 6: Validation. Run the build. Read the logs. Verify the complete implementation works end-to-end, not just the individual pieces.
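One way to picture the six phases is as a small state machine whose TODO list acts as working memory. The sketch below is an illustration under assumptions, not Rocket's implementation: the `Phase`, `Todo`, and `TaskState` names, the guard conditions, and the example TODO text are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    ANALYSIS = 1
    CONTEXT_GATHERING = 2
    SOLUTION_DESIGN = 3
    IMPLEMENTATION = 4
    INTEGRATION = 5
    VALIDATION = 6


@dataclass
class Todo:
    description: str
    done: bool = False


@dataclass
class TaskState:
    prompt: str
    phase: Phase = Phase.ANALYSIS
    todos: List[Todo] = field(default_factory=list)

    def advance(self) -> Phase:
        """Move to the next phase, refusing to implement without a plan."""
        if self.phase is Phase.VALIDATION:
            raise ValueError("already in the final phase")
        if self.phase is Phase.SOLUTION_DESIGN and not self.todos:
            raise ValueError("Solution Design must produce a TODO list first")
        if self.phase is Phase.IMPLEMENTATION and not all(t.done for t in self.todos):
            raise ValueError("finish every TODO before Integration")
        self.phase = Phase(self.phase.value + 1)
        return self.phase


state = TaskState(prompt="Add payments")
state.advance()  # Analysis -> Context Gathering
state.advance()  # Context Gathering -> Solution Design
state.todos = [Todo("Create checkout route"), Todo("Handle payment webhook")]
state.advance()  # Solution Design -> Implementation
for todo in state.todos:
    todo.done = True  # a real agent would generate code for each step here
state.advance()  # Implementation -> Integration
state.advance()  # Integration -> Validation
```

The guards encode the point of the framework: the agent cannot reach Implementation without a plan, and it cannot reach Integration with TODOs still open.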

Testing the mental model

We tested this rigorously: agents with the mental model versus agents without it, same prompts, same LLM, same tools. The results were clear:

The mental model didn't just improve accuracy. It prevented catastrophic failures.
Cases where the agent skipped planning entirely and jumped straight to generation produced the worst outcomes consistently.

How the mental model changed the self-healing loop

Without mental model	With mental model
Fix cycles were reactive and often circular	Fix cycles became targeted
Agent tried the same failed approach with minor variations	Agent knew what it was trying to achieve at each step
Like a developer who keeps recompiling without reading the error message	Could reason about why something failed relative to the plan

The insight generalizes beyond code generation: the intelligence isn't in the model weights. It's in the framework around the model that shapes how it applies its capabilities.

Anti-slop as architecture

The execution loop catches errors.
The mental model prevents them.
But neither one controls what "good output" actually looks like.

AI-generated code has a quality ceiling. Not because the models lack capability, but because nobody defines "quality" with enough precision for the model to hit it consistently. The result is output that technically works but carries unmistakable signatures of machine generation:

Layouts that default to the same patterns
Copy that reaches for the same adjectives
Design choices that reflect training data averages rather than intentional decisions

Most teams treat this as a prompt problem. Add "make it high quality" to the system prompt. Hope the model figures out what you mean. We treat it as an architecture problem.

Anti-slop at Rocket is a system of specific, measurable rules embedded at every stage of the agent's pipeline. Not a filter applied after generation. Rules that the agent must satisfy, verified the same way a build error gets verified.

What the quality rules cover:

Language
Layout
Visual design
Code structure
Content density
Interaction patterns

We maintain over a hundred of these rules. Each one exists because we identified a specific failure pattern in AI-generated output and wrote a constraint that prevents it.

When the agent's output violates a rule, the same self-healing loop that catches build errors catches quality violations. The agent corrects and regenerates until the output passes.

Example: AI-generated landing pages

AI-generated landing pages have a default look:

Centered text over a gradient background
The same handful of fonts
Copy that reaches for "revolutionize" and "seamlessly"

These patterns emerge because models default to training data averages when nobody specifies otherwise. Our quality rules prevent this at the architectural level. The generation agent is not allowed to ship output that matches known slop patterns. When it produces one, the self-healing loop catches the violation the same way it catches a build error.
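To make "rules verified like build errors" concrete, here is a small sketch of two checks based on the landing-page patterns above. The rule IDs, the `Rule` type, and the string-matching heuristics are assumptions made for illustration. The actual ruleset and how it inspects output are not described in this post.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    message: str
    violated: Callable[[str], bool]


# Hypothetical word list, taken from the two examples named in the text.
BANNED_WORDS = re.compile(r"\b(revolutionize|seamlessly)\b", re.IGNORECASE)

RULES: List[Rule] = [
    Rule(
        "copy/banned-words",
        "copy uses a stock marketing word",
        lambda html: bool(BANNED_WORDS.search(html)),
    ),
    Rule(
        "layout/centered-gradient-hero",
        "centered text over a gradient background",
        lambda html: "text-align:center" in html.replace(" ", "")
        and "gradient(" in html,
    ),
]


def check(html: str) -> List[str]:
    """Return one message per violated rule; an empty list means it passes."""
    return [f"{r.rule_id}: {r.message}" for r in RULES if r.violated(html)]


slop = (
    '<section style="text-align: center; '
    'background: linear-gradient(#00f, #90f)">'
    "<h1>Revolutionize your workflow</h1></section>"
)
clean = '<section class="hero"><h1>Invoices sent in two clicks</h1></section>'

slop_violations = check(slop)
clean_violations = check(clean)
```

Because `check` returns messages in the same shape as error logs, a non-empty result can be handed to the fix step exactly as a failed build would be.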

That's what we mean by "anti-slop as architecture." Quality isn't a property you add at the end. It's a set of constraints that shape every decision the system makes, from how information flows between agents to what the generation agent is allowed to output.

The competitive implication is straightforward:

Every vibe coding tool has access to the same frontier models
The difference in output quality between tools isn't about which model you use
It's about what happens between the user's prompt and the final output
That gap is filled by architecture, and architecture is what we spend our engineering time building

What we measured

We benchmark Agent V2 across five dimensions: cost, time, bugs, quality, and capability.

Cost

Metric	Result
Cost per task	38% lower on average
Steps to solve	28% fewer on average
Why	The mental model means fewer LLM calls; the self-healing loop adds work per step but eliminates the wasted cycles V1 needed

Time

Metric	Result
Wall-clock time	~30% slower on average
Per-step overhead (at equal step counts)	Only 8.5% slower
Why	The self-healing loop adds real seconds (run, read logs, fix, run again)

The tradeoff is straightforward: the user waits longer once and gets a working result, instead of waiting less each time across multiple broken rounds that need manual correction.
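A back-of-the-envelope version of that tradeoff, under simplifying assumptions that are ours rather than measured: every V1 round takes the same time, V2 finishes in a single round at roughly 1.3x that time, and the user's own time spent describing each failure is ignored (which favors V1).

```python
V2_SLOWDOWN = 1.30  # ~30% slower wall-clock time, from the table above


def v2_is_faster(v1_round_seconds: float, v1_rounds: int) -> bool:
    """Compare one V2 run against v1_rounds V1 runs of equal length."""
    v1_total = v1_round_seconds * v1_rounds
    v2_total = v1_round_seconds * V2_SLOWDOWN
    return v2_total < v1_total


one_shot = v2_is_faster(60.0, 1)    # V1 got it right the first time
two_rounds = v2_is_faster(60.0, 2)  # V1 needed one correction round
```

Under these assumptions V2 loses only when V1 would have succeeded on the first attempt. As soon as V1 needs a second round, the single slower V2 run comes out ahead.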

Bugs

Error rates dropped across every framework we compared between V1 and V2:

Error category	Reduction
Integration errors	48% (hardest to fix autonomously; involve external services, timing, configuration)
Code structure errors	78%
Dependency errors	81%
Type system errors	88%

These weren't edge cases. These were the highest-volume failure categories in V1. The agent now catches them in its own loop, fixes them, and returns working code.

Capability

The loop and the mental model together expanded what the agent can reliably produce:

Multi-screen applications with shared state
Database integrations with auth and real-time features
Payment flows
API integrations with error handling

These were unreliable before Agent V2. Now they work because the agent can iterate through complexity instead of trying to solve it in a single pass.

What this architecture makes possible

Agent V2 is live and serving Rocket's builders today.

The execution loop, the mental model, and the quality standard work on every prompt. But they work differently depending on how much context the agent has to start with:

Scenario	Context depth	What the agent has
Task inside a Project	Full picture	Prior research, competitive context, every decision made so far, the complete codebase. It builds with understanding.
Task outside a Project	Prompt only	Still has the self-healing loop, the mental model, and the quality standard. Still refuses to ship slop. But works from the prompt alone, without the accumulated context that Projects provide.

The architecture is the same. The depth of context is not.

The quality ruleset grows with every failure pattern we identify:

Every pattern becomes a rule
Every rule gets embedded in the architecture
The standard ratchets forward and never slides back

That's what we built. And we're shipping improvements to it every other day.
