We gave it a thinking framework, an execution loop, and a quality standard. Here's what happened.
Rocket is a vibe solutioning platform:
| Capability | What it does |
|---|---|
| Solve | Produces structured strategic intelligence |
| Intelligence | Monitors competitors continuously |
| Build | Generates production-ready applications |
| Projects | Holds all of it together so every task starts from the full picture, not a compressed summary |
1.5M people across 180 countries have used what we've built.
This post is about what happens inside Build. Specifically, the agent architecture that turns a user's prompt into working code that doesn't look, feel, or break like AI-generated output.
We call it Agent V2. It:
- Thinks before it codes
- Runs its own code, reads its own errors, fixes them, and runs again
- Enforces a quality standard strict enough that the output doesn't carry the fingerprints of a machine
Three architectural decisions made the difference.
What Agent V2 actually is
Agent V2 is an end-to-end autonomous code generation system. "End-to-end" means the agent owns every step from the moment a user submits a prompt to the moment a working preview appears. Not just the code generation step. All of it.

The Self-Healing Loop
The critical part is what we call the self-healing loop, a tight cycle of three steps:
- Run the generated code
- Read Logs to check the actual build output, runtime errors, and console logs
- Fix by classifying the error and making a surgical correction (not a full regeneration)
Then it runs again. And again, until the build passes.
The agent doesn't guess whether its code works. It knows.
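The Run, Read Logs, Fix cycle described above can be sketched roughly as follows. This is a minimal illustration, not Rocket's implementation: `RunResult`, `self_heal`, `fake_run`, `fake_fix`, and the `max_runs` budget are all hypothetical names and assumptions introduced here so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    ok: bool   # True when the build passed
    log: str   # build output, runtime errors, and console logs


def self_heal(
    code: str,
    run: Callable[[str], RunResult],
    fix: Callable[[str, str], str],
    max_runs: int = 5,  # assumption: a budget so the loop always terminates
) -> Optional[str]:
    """Return code that passed a run, or None if the budget is exhausted."""
    for _ in range(max_runs):
        result = run(code)            # Run
        if result.ok:
            return code               # verified by an actual run, not a guess
        code = fix(code, result.log)  # Read logs, then Fix
    return None


# Toy stand-ins (hypothetical) so the sketch is runnable on its own.
def fake_run(code: str) -> RunResult:
    if "import Button" not in code:
        return RunResult(False, "ReferenceError: Button is not defined")
    return RunResult(True, "build passed")


def fake_fix(code: str, log: str) -> str:
    if "Button is not defined" in log:
        # Surgical correction: prepend the missing import, keep the rest.
        return "import Button from './Button'\n" + code
    return code


healed = self_heal("export default () => <Button />", fake_run, fake_fix)
```

In this toy run the first attempt fails on the missing import, the fix patches only that line, and the second run passes, so the caller only ever receives code that has been executed successfully.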
How this plays out in practice:
- A component with a missing import gets caught, fixed, and re-run before the user ever sees it
- A database integration with an incorrect access policy gets tested against the actual backend, fails, gets diagnosed from the error log, and gets corrected
- The user sees the working version
The agent runs its own code. That sentence sounds simple. In practice, it changes everything about the quality of what gets returned. Error categories that previously required the user to describe the problem, wait for a regeneration, and check again now get resolved inside the agent's own loop.
The user describes what they want. The agent figures out how to make it work.
But the loop alone wasn't enough.
The mental model: why the loop needs a brain
The self-healing loop gives Agent V2 the ability to recover from errors. The mental model gives it the ability to avoid most of them in the first place. These are two different capabilities, and the agent needs both.
Without the mental model, the self-healing loop still works. The agent generates, runs, reads the error, and fixes. But it fixes reactively. It has no plan to reason against, so it treats every failure as an isolated problem.
We watched early versions of Agent V2 spend 3-4 fix cycles on problems that a few seconds of structured thinking would have prevented entirely.
So we built a mental model for how the agent should reason before it codes. Not a vague instruction to "plan carefully." A specific 6-phase framework:
- Phase 1: Analysis. What does the user actually want? Not the literal words in the prompt, but the underlying requirement. "Add payments" means different things for a SaaS subscription app versus a one-time purchase e-commerce store. The agent resolves ambiguity before generating a single line of code.
- Phase 2: Context Gathering. What exists in the codebase right now? What files are relevant? What patterns are already established? The agent reads the existing project, identifies what matters for this specific task, and builds a working picture of the current state.
- Phase 3: Solution Design. What's the right architectural approach? Which files need to change? What new files need to be created? What are the dependencies between changes? This phase produces a structured TODO list that becomes the agent's working memory for the entire task.
- Phase 4: Implementation. Execute the TODOs in order. Generate code for each step. Because the agent planned its approach, each generation step has precise context about what it's building and why, not just the user's original prompt.
- Phase 5: Integration. Merge generated code into the existing codebase. Verify that new code works with existing patterns, naming conventions, and architectural decisions. This prevents the "fixed one thing, broke another" failure mode.
- Phase 6: Validation. Run the build. Read the logs. Verify the complete implementation works end-to-end, not just the individual pieces.
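One way to picture the six phases is as an ordered pipeline over a shared task state, with the TODO list from Solution Design acting as working memory. The sketch below is illustrative only: `Phase`, `TaskState`, `run_phases`, and the toy handlers are hypothetical names, and the rule that an empty TODO list aborts the task is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    ANALYSIS = 1
    CONTEXT_GATHERING = 2
    SOLUTION_DESIGN = 3
    IMPLEMENTATION = 4
    INTEGRATION = 5
    VALIDATION = 6


@dataclass
class TaskState:
    prompt: str
    requirement: str = ""                                     # set in Phase 1
    relevant_files: List[str] = field(default_factory=list)   # set in Phase 2
    todos: List[str] = field(default_factory=list)            # set in Phase 3
    done: List[str] = field(default_factory=list)             # filled in Phase 4
    history: List[Phase] = field(default_factory=list)


Handler = Callable[[TaskState], None]


def run_phases(state: TaskState, handlers: Dict[Phase, Handler]) -> TaskState:
    for phase in Phase:  # Enum members iterate in definition order
        handlers[phase](state)
        state.history.append(phase)
        if phase is Phase.SOLUTION_DESIGN and not state.todos:
            # Assumption: no plan means no code generation.
            raise ValueError("Solution Design must produce a TODO list")
    return state


# Toy handlers (hypothetical) standing in for the real reasoning steps.
def analysis(s: TaskState) -> None:
    s.requirement = "one-time checkout for an e-commerce store"

def gather(s: TaskState) -> None:
    s.relevant_files = ["cart.tsx", "api/orders.ts"]

def design(s: TaskState) -> None:
    s.todos = ["add checkout route", "wire payment button"]

def implement(s: TaskState) -> None:
    for todo in s.todos:  # execute the TODOs in order
        s.done.append(todo)

def integrate(s: TaskState) -> None:
    pass  # merging into the existing codebase is out of scope here

def validate(s: TaskState) -> None:
    if s.done != s.todos:
        raise RuntimeError("implementation does not match the plan")


handlers = {
    Phase.ANALYSIS: analysis,
    Phase.CONTEXT_GATHERING: gather,
    Phase.SOLUTION_DESIGN: design,
    Phase.IMPLEMENTATION: implement,
    Phase.INTEGRATION: integrate,
    Phase.VALIDATION: validate,
}
state = run_phases(TaskState(prompt="Add payments"), handlers)
```

The point of the structure is that Implementation never sees only the raw prompt: by the time it runs, the state already carries the resolved requirement, the relevant files, and the plan.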
Testing the mental model
We tested this rigorously: agents with the mental model versus agents without it, same prompts, same LLM, same tools. The results were clear:
- The mental model didn't just improve accuracy. It prevented catastrophic failures.
- Cases where the agent skipped planning entirely and jumped straight to generation produced the worst outcomes consistently.
How the mental model changed the self-healing loop
| Without mental model | With mental model |
|---|---|
| Fix cycles were reactive and often circular | Fix cycles became targeted |
| Agent tried the same failed approach with minor variations | Agent knew what it was trying to achieve at each step |
| Like a developer who keeps recompiling without reading the error message | Could reason about why something failed relative to the plan |
The insight generalizes beyond code generation: the intelligence isn't in the model weights. It's in the framework around the model that shapes how it applies its capabilities.
Anti-slop as architecture
- The execution loop catches errors.
- The mental model prevents them.
- But neither one controls what "good output" actually looks like.
AI-generated code has a quality ceiling. Not because the models lack capability, but because nobody defines "quality" with enough precision for the model to hit it consistently. The result is output that technically works but carries unmistakable signatures of machine generation:
- Layouts that default to the same patterns
- Copy that reaches for the same adjectives
- Design choices that reflect training data averages rather than intentional decisions
Most teams treat this as a prompt problem. Add "make it high quality" to the system prompt. Hope the model figures out what you mean. We treat it as an architecture problem.
Anti-slop at Rocket is a system of specific, measurable rules embedded at every stage of the agent's pipeline. Not a filter applied after generation. Rules that the agent must satisfy, verified the same way a build error gets verified.
What the quality rules cover:
- Language
- Layout
- Visual design
- Code structure
- Content density
- Interaction patterns
We maintain over a hundred of these rules. Each one exists because we identified a specific failure pattern in AI-generated output and wrote a constraint that prevents it.
When the agent's output violates a rule, the same self-healing loop that catches build errors catches quality violations. The agent corrects and regenerates until the output passes.
Example: AI-generated landing pages
AI-generated landing pages have a default look:
- Centered text over a gradient background
- The same handful of fonts
- Copy that reaches for "revolutionize" and "seamlessly"
These patterns emerge because models default to training data averages when nobody specifies otherwise. Our quality rules prevent this at the architectural level. The generation agent is not allowed to return output that matches known slop patterns. When a draft does match one, the self-healing loop catches the violation the same way it catches a build error.
That's what we mean by "anti-slop as architecture." Quality isn't a property you add at the end. It's a set of constraints that shape every decision the system makes, from how information flows between agents to what the generation agent is allowed to output.
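To make "rules verified like build errors" concrete, here is a small sketch of machine-checkable quality rules. Everything in it is an assumption for illustration: the `Rule` shape, the rule ids, and the class-name heuristic for "centered text over a gradient" are hypothetical, and only the two banned words come from the example above.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    category: str                    # e.g. "Language" or "Layout"
    violates: Callable[[str], bool]  # True when the output breaks the rule


def banned_words(words: List[str]) -> Callable[[str], bool]:
    # Whole-word, case-insensitive match against the banned list.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    return lambda text: bool(pattern.search(text))


def centered_on_gradient(html: str) -> bool:
    # Hypothetical heuristic: utility classes for centered text plus a
    # gradient background appearing in the same markup.
    return "text-center" in html and "bg-gradient" in html


RULES = [
    Rule("lang-001", "Language", banned_words(["revolutionize", "seamlessly"])),
    Rule("layout-001", "Layout", centered_on_gradient),
]


def violations(output: str, rules: List[Rule] = RULES) -> List[str]:
    """Ids of every rule the output breaks; an empty list means it passes."""
    return [rule.rule_id for rule in rules if rule.violates(output)]


sloppy = (
    '<section class="text-center bg-gradient-to-r">'
    "<h1>Revolutionize your workflow</h1></section>"
)
clean = (
    '<section class="grid grid-cols-2">'
    "<h1>Invoices sent in two clicks</h1></section>"
)
sloppy_report = violations(sloppy)
clean_report = violations(clean)
```

Because `violations` returns a plain list of failures, it can be handed to the same fix step as a build log: a non-empty list is treated like a failed build, an empty one like a pass.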
The competitive implication is straightforward:
- Every vibe coding tool has access to the same frontier models
- The difference in output quality between tools isn't about which model you use
- It's about what happens between the user's prompt and the final output
- That gap is filled by architecture, and architecture is what we spend our engineering time building
What we measured
We benchmark Agent V2 across five dimensions: cost, time, bugs, quality, and capability.

Cost
| Metric | Result |
|---|---|
| Cost reduction per task | 38% on average |
| Steps to solve | 28% fewer on average |
| Why | Mental model means fewer LLM calls; self-healing loop adds work per step but eliminates wasted V1 cycles |
Time
| Metric | Result |
|---|---|
| Wall clock time | ~30% slower on average |
| Per-step overhead (equal steps) | Only 8.5% slower |
| Why | Self-healing loop adds real seconds (run, read logs, fix, run again) |
The tradeoff is straightforward: the user waits longer once and gets a working result, instead of waiting a shorter time on each of several broken rounds that need manual correction.
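A back-of-the-envelope version of that tradeoff, using the ~30% figure from the table. The 60-second baseline and the round counts are hypothetical, and the model ignores the time a user spends describing each failure, which would only favor the single longer run further.

```python
V2_SLOWDOWN = 1.30  # ~30% slower wall clock per task, from the table above


def total_wait_v1(round_seconds: float, rounds: int) -> float:
    # Assumption: every regeneration round costs the same baseline time.
    return round_seconds * rounds


def total_wait_v2(round_seconds: float) -> float:
    # One longer run that returns a working result.
    return round_seconds * V2_SLOWDOWN


def v2_is_faster_overall(round_seconds: float, v1_rounds: int) -> bool:
    return total_wait_v2(round_seconds) < total_wait_v1(round_seconds, v1_rounds)


baseline = 60.0  # hypothetical seconds per V1 round
one_round = v2_is_faster_overall(baseline, 1)   # V1 worked first time
two_rounds = v2_is_faster_overall(baseline, 2)  # V1 needed one retry
```

Under these assumptions the longer single run loses only when the faster path would have worked on the first try; as soon as one retry is needed, the total wait is shorter with the slower run.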

Bugs
Error rates dropped across every framework we compared between V1 and V2:
| Error category | Reduction |
|---|---|
| Integration errors | 48% (hardest to fix autonomously; involve external services, timing, configuration) |
| Code structure errors | 78% |
| Dependency errors | 81% |
| Type system errors | 88% |
These weren't edge cases. These were the highest-volume failure categories in V1. The agent now catches them in its own loop, fixes them, and returns working code.
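Since the fix step depends on classifying errors into categories like the four above, a simple log classifier might look like the sketch below. The category names mirror the table; the regular expressions are hypothetical examples keyed to common TypeScript, Node.js, and PostgreSQL messages, not the patterns Agent V2 actually uses.

```python
import re
from enum import Enum


class ErrorCategory(Enum):
    DEPENDENCY = "dependency"
    TYPE_SYSTEM = "type system"
    CODE_STRUCTURE = "code structure"
    INTEGRATION = "integration"
    UNKNOWN = "unknown"


# Order matters: a missing module can surface as a compiler diagnostic, so
# dependency patterns are checked before the general type-system pattern.
_PATTERNS = [
    (ErrorCategory.DEPENDENCY,
     re.compile(r"Cannot find module|Module not found")),
    (ErrorCategory.TYPE_SYSTEM,
     re.compile(r"error TS\d+|is not assignable to type")),
    (ErrorCategory.CODE_STRUCTURE,
     re.compile(r"SyntaxError|Unexpected token")),
    (ErrorCategory.INTEGRATION,
     re.compile(r"ECONNREFUSED|row-level security policy")),
]


def classify(log: str) -> ErrorCategory:
    """Return the first category whose pattern appears in the log."""
    for category, pattern in _PATTERNS:
        if pattern.search(log):
            return category
    return ErrorCategory.UNKNOWN


samples = {
    "Error: Cannot find module 'stripe'": ErrorCategory.DEPENDENCY,
    "src/App.tsx(4,7): error TS2322: Type 'string' is not assignable to type 'number'.":
        ErrorCategory.TYPE_SYSTEM,
    "SyntaxError: Unexpected token '}'": ErrorCategory.CODE_STRUCTURE,
    'new row violates row-level security policy for table "orders"':
        ErrorCategory.INTEGRATION,
}
results = {log: classify(log) for log in samples}
```

A category, rather than the raw log text, is what lets the fix step choose a narrow correction such as adding an import or adjusting an access policy instead of regenerating the file.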
Capability
The loop and the mental model together expanded what the agent can reliably produce:
- Multi-screen applications with shared state
- Database integrations with auth and real-time features
- Payment flows
- API integrations with error handling
These were unreliable before Agent V2. Now they work because the agent can iterate through complexity instead of trying to solve it in a single pass.
What this architecture makes possible
Agent V2 is live and serving Rocket's builders today.
The execution loop, the mental model, and the quality standard work on every prompt. But they work differently depending on how much context the agent has to start with:
| Scenario | Context depth | What the agent has |
|---|---|---|
| Task inside a Project | Full picture | Prior research, competitive context, every decision made so far, the complete codebase. It builds with understanding. |
| Task outside a Project | Prompt only | Still has the self-healing loop, the mental model, and the quality standard. Still refuses to ship slop. But works from the prompt alone, without the accumulated context that Projects provide. |
The architecture is the same. The depth of context is not.
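The two scenarios in the table can be modeled as one input type with an optional project attached. This is a hypothetical sketch: `ProjectContext`, `AgentInput`, and `planning_inputs` are illustrative names, and the fields simply mirror the items listed in the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectContext:
    research: List[str] = field(default_factory=list)
    competitive_context: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)
    codebase_files: List[str] = field(default_factory=list)


@dataclass
class AgentInput:
    prompt: str
    project: Optional[ProjectContext] = None  # None: task outside a Project

    @property
    def context_depth(self) -> str:
        return "full picture" if self.project is not None else "prompt only"


def planning_inputs(task: AgentInput) -> List[str]:
    """Everything the planning phases may read. The pipeline is the same
    in both scenarios; only the amount of material differs."""
    items = [task.prompt]
    if task.project is not None:
        items += task.project.research
        items += task.project.competitive_context
        items += task.project.decisions
        items += task.project.codebase_files
    return items


standalone = AgentInput(prompt="Add payments")
in_project = AgentInput(
    prompt="Add payments",
    project=ProjectContext(
        decisions=["one-time purchases only"],
        codebase_files=["cart.tsx"],
    ),
)
```

Both inputs go through the same loop, mental model, and quality rules; the in-project task simply starts with more to reason against.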
The quality ruleset grows with every failure pattern we identify:
- Every pattern becomes a rule
- Every rule gets embedded in the architecture
- The standard ratchets forward and never slides back
That's what we built. And we're shipping improvements to it every other day.