Rocket Blogs
Engineering

The work is only as good as the thinking before it.
You already know what you're trying to figure out. Type it. Rocket handles everything after that.
We gave our build agent a thinking framework, an execution loop, and a quality standard. Here's what happened.
Rocket is a vibe solutioning platform:
| Capability | What it does |
|---|---|
| Solve | Produces structured strategic intelligence |
| Intelligence | Monitors competitors continuously |
| Build | Generates production-ready applications |
| Projects | Holds all of it together so every task starts from the full picture, not a compressed summary |
1.5M people across 180 countries have used what we've built.
This post is about what happens inside Build. Specifically, the agent architecture that turns a user's prompt into working code that doesn't look, feel, or break like AI-generated output.
We call it Agent V2.
Three architectural decisions made the difference.
Agent V2 is an end-to-end autonomous code generation system. "End-to-end" means the agent owns every step from the moment a user submits a prompt to the moment a working preview appears. Not just the code generation step. All of it.
The critical part is what we call the self-healing loop, a tight cycle between three steps: generate the code, run the build, and read the logs to fix what failed.
Then it runs again. And again, until the build passes.
The agent doesn't guess whether its code works. It knows.
How this plays out in practice:
The agent runs its own code. That sentence sounds simple. In practice, it changes everything about the quality of what gets returned. Error categories that previously required the user to describe the problem, wait for a regeneration, and check again now get resolved inside the agent's own loop.
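The loop described above can be sketched in a few lines. This is a minimal illustration, not Rocket's implementation; `generate_fix` and `run_build` stand in for the agent's LLM call and its build runner.

```python
def self_healing_loop(generate_fix, run_build, max_cycles=5):
    """Run the build, feed errors back to the generator, repeat until green.

    generate_fix(errors): applies a patch addressing the given errors.
    run_build(): returns a list of build errors, empty on success.
    """
    for cycle in range(max_cycles):
        errors = run_build()      # the agent runs its own code
        if not errors:
            return cycle          # build passed; report cycles it took
        generate_fix(errors)      # read the logs, fix, run again
    raise RuntimeError(f"build still failing after {max_cycles} cycles")
```

The point of the shape is that success is verified, not assumed: the loop only exits early when the build itself reports no errors.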
The user describes what they want. The agent figures out how to make it work.
But the loop alone wasn't enough.
The self-healing loop gives Agent V2 the ability to recover from errors. The mental model gives it the ability to avoid most of them in the first place. These are two different capabilities, and the agent needs both.
Without the mental model, the self-healing loop still works. The agent generates, runs, reads the error, and fixes. But it fixes reactively. It has no plan to reason against, so it treats every failure as an isolated problem.
We watched early versions of Agent V2 spend 3-4 fix cycles on problems that a few seconds of structured thinking would have prevented entirely.
So we built a mental model for how the agent should reason before it codes. Not a vague instruction to "plan carefully." A specific 6-phase framework:
Phase 1: Analysis. What does the user actually want? Not the literal words in the prompt, but the underlying requirement. "Add payments" means different things for a SaaS subscription app versus a one-time purchase e-commerce store. The agent resolves ambiguity before generating a single line of code.
Phase 2: Context Gathering. What exists in the codebase right now? What files are relevant? What patterns are already established? The agent reads the existing project, identifies what matters for this specific task, and builds a working picture of the current state.
Phase 3: Solution Design. What's the right architectural approach? Which files need to change? What new files need to be created? What are the dependencies between changes? This phase produces a structured TODO list that becomes the agent's working memory for the entire task.
Phase 4: Implementation. Execute the TODOs in order. Generate code for each step. Because the agent planned its approach, each generation step has precise context about what it's building and why, not just the user's original prompt.
Phase 5: Integration. Merge generated code into the existing codebase. Verify that new code works with existing patterns, naming conventions, and architectural decisions. This prevents the "fixed one thing, broke another" failure mode.
Phase 6: Validation. Run the build. Read the logs. Verify the complete implementation works end-to-end, not just the individual pieces.
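The six phases can be sketched as a pipeline in which the TODO list produced by Solution Design carries state through the rest of the task. A toy sketch under stated assumptions: the phase callables here are placeholders, not Rocket's actual interfaces.

```python
def plan_then_execute(prompt, analyze, gather, design,
                      implement, integrate, validate):
    """Run the six phases in order; the TODO list from the design
    phase acts as working memory for everything that follows."""
    requirement = analyze(prompt)             # Phase 1: resolve the real intent
    context = gather(requirement)             # Phase 2: read the existing project
    todos = design(requirement, context)      # Phase 3: structured TODO list
    changes = [implement(t, context)          # Phase 4: execute TODOs in order,
               for t in todos]                #          each with precise context
    merged = integrate(changes, context)      # Phase 5: merge with existing patterns
    validate(merged)                          # Phase 6: run the build end-to-end
    return merged
```

The design choice worth noting is that implementation never sees the raw prompt alone; every step receives the requirement and context resolved in the earlier phases.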
We tested this rigorously: agents with the mental model versus agents without it, same prompts, same LLM, same tools. The results were clear:
| Without mental model | With mental model |
|---|---|
| Fix cycles were reactive and often circular | Fix cycles became targeted |
| Agent tried the same failed approach with minor variations | Agent knew what it was trying to achieve at each step |
| Like a developer who keeps recompiling without reading the error message | Could reason about why something failed relative to the plan |
The insight generalizes beyond code generation: the intelligence isn't in the model weights. It's in the framework around the model that shapes how it applies its capabilities.
AI-generated code has a quality ceiling. Not because the models lack capability, but because nobody defines "quality" with enough precision for the model to hit it consistently. The result is output that technically works but carries unmistakable signatures of machine generation.
Most teams treat this as a prompt problem. Add "make it high quality" to the system prompt. Hope the model figures out what you mean. We treat it as an architecture problem.
Anti-slop at Rocket is a system of specific, measurable rules embedded at every stage of the agent's pipeline. Not a filter applied after generation. Rules that the agent must satisfy, verified the same way a build error gets verified.
We maintain over a hundred of these rules. Each one exists because we identified a specific failure pattern in AI-generated output and wrote a constraint that prevents it.
When the agent's output violates a rule, the same self-healing loop that catches build errors catches quality violations. The agent corrects and regenerates until the output passes.
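Treating quality rules "the same way a build error gets verified" means they need to return violations in the same shape the loop already consumes. A minimal sketch, assuming two made-up rules; Rocket's actual ruleset and checker are internal.

```python
import re

# Illustrative rules only; each predicate returns True on a violation.
QUALITY_RULES = [
    ("no-placeholder-text", lambda src: "lorem ipsum" in src.lower()),
    ("no-todo-comments",    lambda src: bool(re.search(r"#\s*TODO", src))),
]

def quality_violations(source):
    """Return violations in the same shape as build errors, so the
    self-healing loop can consume them without special-casing."""
    return [name for name, violates in QUALITY_RULES if violates(source)]
```

Because the output is just a list of named failures, the same fix-and-rerun cycle that clears build errors can clear quality violations.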
AI-generated landing pages have a recognizable default look.
These patterns emerge because models default to training data averages when nobody specifies otherwise. Our quality rules prevent this at the architectural level. The generation agent cannot produce output that matches known slop patterns. When it does, the self-healing loop catches the violation the same way it catches a build error.
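A blocklist over generated markup is one minimal way to make "cannot produce output that matches known slop patterns" concrete. The patterns below are illustrative guesses, not Rocket's list.

```python
# Hypothetical signatures of the default AI landing-page look;
# Rocket's actual pattern list is internal.
SLOP_PATTERNS = [
    "bg-gradient-to-r from-purple",   # the ubiquitous purple gradient hero
    "Unlock the power of",            # stock hero copy
    "Elevate your",                   # stock hero copy
]

def first_slop_match(markup):
    """Return the first known slop pattern found in the markup, else None."""
    for pattern in SLOP_PATTERNS:
        if pattern in markup:
            return pattern
    return None
```

A non-`None` result would be fed back into the loop as a violation, exactly like a failing build.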
That's what we mean by "anti-slop as architecture." Quality isn't a property you add at the end. It's a set of constraints that shape every decision the system makes, from how information flows between agents to what the generation agent is allowed to output.
We benchmark Agent V2 across five dimensions: cost, time, bugs, quality, and capability.

On cost:

| Metric | Result |
|---|---|
| Cost reduction per task | 38% on average |
| Steps to solve | 28% fewer on average |
| Why | The mental model means fewer LLM calls; the self-healing loop adds work per step but eliminates wasted V1 cycles |

On time:

| Metric | Result |
|---|---|
| Wall-clock time | ~30% slower on average |
| Per-step overhead (at equal step counts) | Only 8.5% slower |
| Why | The self-healing loop adds real seconds (run, read logs, fix, run again) |
The tradeoff is straightforward: the user waits longer once and gets a working result, instead of waiting less each time across several broken rounds that each need manual correction.
Error rates dropped across every error category we measured between V1 and V2:
| Error category | Reduction |
|---|---|
| Integration errors | 48% (hardest to fix autonomously; involve external services, timing, configuration) |
| Code structure errors | 78% |
| Dependency errors | 81% |
| Type system errors | 88% |
These weren't edge cases. These were the highest-volume failure categories in V1. The agent now catches them in its own loop, fixes them, and returns working code.
The loop and the mental model together expanded what the agent can reliably produce. Applications that were unreliable before Agent V2 now work, because the agent can iterate through complexity instead of trying to solve it in a single pass.
Agent V2 is live and serving Rocket's builders today.
The execution loop, the mental model, and the quality standard work on every prompt. But they work differently depending on how much context the agent has to start with:
| Scenario | Context depth | What the agent has |
|---|---|---|
| Task inside a Project | Full picture | Prior research, competitive context, every decision made so far, the complete codebase. It builds with understanding. |
| Task outside a Project | Prompt only | Still has the self-healing loop, the mental model, and the quality standard. Still refuses to ship slop. But works from the prompt alone, without the accumulated context that Projects provide. |
The architecture is the same. The depth of context is not.
The quality ruleset grows with every failure pattern we identify.
That's what we built. And we're shipping improvements to it every other day.