
An AI build playbook

The Software Factory.

At Ambiguous AI, we rebuilt fifteen SaaS products to feature parity with their category leaders in under thirty days. At first, we doubted it was possible. It worked because the method pulls together three crafts that rarely sit in one place: building product, writing code, and running operations at scale. This playbook lays out the principles and patterns behind it.

15 SaaS apps at parity
4M lines of code
30 days from first line of code

The Software Factory

Goal in, software out.

Why now

Agents are the biggest lever you have. People, operations, process, product: the agent layer sits on top and multiplies everything below it.
The build is the cheap part now. Like hardware in the 1990s, value moved to the ends: what to build, and who it is for. The assembly in the middle is commoditized.
Every feature has a price tag. When a feature costs a known amount to build, the roadmap becomes a portfolio decision instead of a guess.

The principles

set it up → run it → trust it → improve it

The harness

01 / Agent Architecture

Give agents context.

An agent does its best work with the context you would give a strong new hire: the mission, the craft, and how the product works.

Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates. Total chaos. Nothing works. That is what AI feels like today. The missing piece is extracting all the domain knowledge from people's heads and providing that as structured context to the models.
Tom Blomfield @t_blom
What every agent inherits
[Figure: three tiers. FOUNDATION: Company (identity · beliefs · values). DOMAIN: Function (engineering · design · ops). MODULE: Product (goal · approach · examples).]

How it works

Three tiers, each inheriting the one below. The company file holds mission, vision, and values, the things that tell anyone, human or agent, whether a piece of work was good. Each function file carries a craft, written once and reused. Each product file defines and specifies the customer surfaces.

A product file pulls in its function, and that function pulls in the company, so an agent reading any single file already carries everything beneath it. Change a value in the company file and every agent downstream sees it on the next run. Nothing is copied, so nothing falls out of sync.
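One way to picture the inheritance is a chain of files, each naming the file it inherits from. The sketch below is only an illustration under assumptions: the file names, the `inherits` field, and the in-memory `files` store are hypothetical stand-ins, not the actual layout used at Ambiguous.

```javascript
// Hypothetical in-memory stand-in for the three context files.
// File names and fields are illustrative assumptions.
const files = {
  "company.md": { inherits: null, body: "Mission, vision, values." },
  "engineering.md": { inherits: "company.md", body: "How we write and review code." },
  "billing.md": { inherits: "engineering.md", body: "Goal, approach, examples for billing." },
};

// Walk the `inherits` chain and return the bodies, foundation first.
// Nothing is copied: each call reads the current state of every file.
function resolveContext(name, store = files) {
  const chain = [];
  const seen = new Set();
  for (let current = name; current !== null; current = store[current].inherits) {
    if (!(current in store)) throw new Error(`Unknown context file: ${current}`);
    if (seen.has(current)) throw new Error(`Inheritance cycle at: ${current}`);
    seen.add(current);
    chain.unshift(store[current].body);
  }
  return chain;
}

console.log(resolveContext("billing.md").length); // 3
```

Because the chain is resolved at read time, editing the company entry changes what every product-level agent sees on its next run, which is the property the text describes.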

Why it works

Shared context is what lets a teammate tell a good day from a bad one. Every company that scales writes its operating context down so everyone works from the same understanding. Agents are no different, and you onboard a hundred a day, each ready the moment it reads the file.

Encode that context once and every agent works from the same version of the company. One source of truth keeps a hundred agents pulling in the same direction.

02 / Structured Thinking

Structure what you write.

Consultants live by this: clear structure keeps the work specific and the scope steady.

I've been on a kick about clear thinking and communication recently. It's critical for developing safe, useful models, and applications built on top of them.
Sean Grove @sgrove, OpenAI
Three heuristics
[Figure: three heuristics. PYRAMID: Answer, Reasons, Evidence. Lead with the answer, then support it. MECE: UI, API, Data, Tests. One feature, four parts. No overlap, no gaps. SEVEN CIRCUMSTANCES: What, Who, Where, When, Why, How, How much. Answer all seven. No blanks to guess.]

How it works

Three habits do the work. Lead with the answer, then support it: the pyramid principle. Split a problem into parts that do not overlap and leave no gaps: MECE. And run every task through the seven circumstances as a checklist (what, who, where, when, why, how, and how much), so it is adequately specified before any agent touches it.

The seven are not a style guide; they are a gate. Run a spec through them and any missing part shows itself. When a task answers all seven, it is ready to hand off.
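The gate can be made mechanical. The sketch below assumes a spec is a plain object with one string field per circumstance; the field names and the sample `draft` are hypothetical, chosen only to show a spec failing the gate.

```javascript
// The seven circumstances named in the text, used as required spec fields.
const SEVEN = ["what", "who", "where", "when", "why", "how", "howMuch"];

// Return the circumstances a spec leaves blank; an empty list means it passes.
function missingCircumstances(spec) {
  return SEVEN.filter((key) => {
    const value = spec[key];
    return typeof value !== "string" || value.trim() === "";
  });
}

// A task is ready to hand off only when all seven are answered.
function readyToHandOff(spec) {
  return missingCircumstances(spec).length === 0;
}

// Hypothetical draft spec for illustration.
const draft = {
  what: "Rate-limit the login endpoint",
  who: "All unauthenticated clients",
  where: "API gateway",
  when: "On every login request",
  why: "Stop credential stuffing",
  how: "", // left blank on purpose; howMuch is missing entirely
};

console.log(missingCircumstances(draft)); // [ 'how', 'howMuch' ]
```

Returning the list of blanks rather than a bare pass or fail lets the author see exactly which part of the spec still needs writing.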

Why it works

Clarity is what carries teams, human or otherwise. A person fills a vague brief from a hallway conversation; a model fills it with the most probable token, so the more you specify, the more it gets right. Consultants built the pyramid principle and MECE precisely because a recommendation, like a prompt, gets one shot to land clearly.

An atomic, MECE, answer-first spec reads the same to everyone, so the agent builds exactly what you meant.

03 / Three Levels

Specify the goal and the approach.

It is a specification problem, not an AI one. Ask an agent for quicksort and it is right every time, because the spec spells everything out.

Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: it just cannot be done.
David Marr, Vision, 1982
Why, how, and what
[Figure: Marr's three levels applied to sorting. L1, WHY (computational, time-invariant): Sort a list, smallest to largest. L2, HOW (algorithmic, time-invariant): Quicksort: pivot, partition, recurse. O(n log n). L3, WHAT (implementational, volatile): // pointer to an example: function sort(a) { return qsort(a, 0, a.length - 1) }]

How it works

David Marr split any computational system into three levels, and the split decides who writes what and where it lives. The goal (what success looks like for the user) and the approach (the method and the hard constraints) live in the agent architecture, written once as durable, time-invariant specs. The implementation, the code, lives in the codebase and is volatile: it changes often. The architecture does not pin it; it keeps a pointer to a current example.

Write the approach down and the model has one path to follow, the same one every run. Pin the goal and the approach; let the implementation stay volatile, referenced as an example. Your half is time-invariant; the code is not.
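To make the three levels concrete, here is the sorting example from the figure carried down to running code. The figure only shows the `sort` entry point calling `qsort`; the `partition` helper and the choice of the last element as pivot (Lomuto partitioning) are assumptions added for illustration. With this pivot choice the O(n log n) cost is the average case; already-sorted input degrades to O(n²).

```javascript
// L1, the goal: sort a list of numbers, smallest to largest.
// L2, the approach: quicksort (pick a pivot, partition, recurse).
// L3, one possible implementation; this is the volatile part.

// Lomuto partition: move everything <= pivot to the left of it
// and return the pivot's final index.
function partition(a, lo, hi) {
  const pivot = a[hi];
  let i = lo;
  for (let j = lo; j < hi; j++) {
    if (a[j] <= pivot) {
      [a[i], a[j]] = [a[j], a[i]];
      i++;
    }
  }
  [a[i], a[hi]] = [a[hi], a[i]];
  return i;
}

// Sort the slice a[lo..hi] in place.
function qsort(a, lo, hi) {
  if (lo >= hi) return a;
  const p = partition(a, lo, hi);
  qsort(a, lo, p - 1);
  qsort(a, p + 1, hi);
  return a;
}

// Same entry point as the figure; sorts in place and returns the array.
function sort(a) {
  return qsort(a, 0, a.length - 1);
}

console.log(sort([3, 1, 2])); // [ 1, 2, 3 ]
```

The comments at the top are the part that would live in the spec; everything below them could be swapped for a different partition scheme without touching the goal or the approach.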

Why it works

Models are already near-perfect at anything specified to the algorithm level. Hand one a competitive-programming problem, fully stated, and it returns a correct solution. Raw capability is not the constraint here.

Your feature is the same kind of problem, just rarely specified that completely. Pin the goal and the approach to the level a contest problem states them, and the agent builds it just as cleanly. The spec is the lever, not the model.

You write the why and the how. The agent writes the code.

Read 'Managing AI is Managing Entropy', in Entrepreneur's Edge

04 / Design the System

Design the system for autonomy.

Build it like a value chain: modular parts with clear boundaries and clean inputs and outputs.

The behavior of a system cannot be known just by knowing the elements of which the system is made.
Donella Meadows, Thinking in Systems
Coupled vs composable
[Figure: two layouts. TIGHTLY COUPLED: Change one, break the rest. COMPOSABLE: module 1 (in → out), module 2 (in → out), module 3 (in → out). Each holds its contract. Test, measure, trust alone.]

How it works

Decompose the system into independent, composable parts, each with one job and an explicit contract: typed inputs, typed outputs, no shared state. The contract is the same whether a human or an agent does the work: you hand either one the inputs, the expected outputs, and the single thing it owns.

Composable parts form a directed acyclic graph, a flow of steps with no loops, and a graph you can instrument. Every node carries its own health metric: does it pass its tests, does it hold its contract. You can see exactly which part needs work and fix it in place, rather than debug the whole system at once.
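A small sketch of such a graph follows. The three modules, the `deps`/`run`/`check` field names, and the health labels are all hypothetical; the point is only that each node receives its inputs as arguments, returns an output, and reports whether it held its own contract.

```javascript
// Hypothetical three-module pipeline. Each node declares its upstream
// dependencies, its one job (run), and a contract check on its own output.
const nodes = {
  parse: {
    deps: [],
    run: () => "3,1,2".split(",").map(Number),
    check: (out) => out.every(Number.isFinite),
  },
  sort: {
    deps: ["parse"],
    run: (xs) => [...xs].sort((a, b) => a - b),
    check: (out) => out.every((v, i) => i === 0 || out[i - 1] <= v),
  },
  render: {
    deps: ["sort"],
    run: (xs) => xs.join(" < "),
    check: (out) => typeof out === "string",
  },
};

// Run the graph in dependency order and record a health status per node.
function runGraph(graph) {
  const outputs = {};
  const health = {};
  const visiting = new Set();

  function visit(name) {
    if (name in health) return;
    if (visiting.has(name)) throw new Error(`Cycle detected at ${name}`);
    visiting.add(name);
    const node = graph[name];
    node.deps.forEach((dep) => visit(dep));
    if (node.deps.some((dep) => health[dep] !== "ok")) {
      // An upstream contract is broken, so this node is not run at all.
      health[name] = "blocked";
    } else {
      outputs[name] = node.run(...node.deps.map((dep) => outputs[dep]));
      health[name] = node.check(outputs[name]) ? "ok" : "contract-broken";
    }
    visiting.delete(name);
  }

  Object.keys(graph).forEach((name) => visit(name));
  return { outputs, health };
}

console.log(runGraph(nodes).health);
```

When one node breaks its contract, the health map points at that node and marks everything downstream as blocked, so the repair can start in the right place.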

Why it works

Independent parts are easier to measure, test, and trust. A part with a clean contract can be handed to an agent without it needing to understand the whole system to change one piece, and you can verify that piece in isolation before it touches anything else.

Tight contracts keep each part self-contained. When a part owns exactly one thing, you can let an agent build it, test it alone, and trust the result. The behavior you want falls out of the structure you drew.

05 / SPEAR

Keep humans at the gates.

Once agents can code on their own, the highest-value thing a person does is judge the work. Move people from the inner loop to the outer loop, and keep quality high.

Detect and fix any problem in a production process at the lowest-value stage possible.
Andrew Grove, High Output Management
The SPEAR loop
[Figure: Scope (human) → gate → Plan (agent) → Execute (agent) → Assess (agent) → gate → Resolve (human). Sample readout: iteration 0, assess 3/10; accepted.]

How it works

Five phases. You scope the work and, later, you resolve it. In between, the agent runs an unattended loop: plan, execute, assess against a rubric, then go again. Two human gates bracket the loop; everything inside runs without you.

Each assess pass is stricter than the last, so the output climbs toward a passing score. The loop stops when the rubric reads ten out of ten.
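The unattended middle of the loop can be sketched as below. Everything here is an assumption made for illustration: the `plan`/`execute`/`assess` interface, the toy agent that completes one rubric item per pass, and especially the `maxIterations` cap, which the text does not mention but which keeps a sketch like this from looping forever. A real run would call a model where the toy agent sits.

```javascript
// Plan, execute, assess, repeat. Scope comes in from a human;
// the return value is what goes back to a human to resolve.
function spear(scope, agent, { passingScore = 10, maxIterations = 20 } = {}) {
  let feedback = [];
  const history = [];
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const plan = agent.plan(scope, feedback);
    const work = agent.execute(plan);
    const { score, notes } = agent.assess(work, scope.rubric);
    history.push(score);
    if (score >= passingScore) {
      return { status: "ready-for-resolve", work, history };
    }
    feedback = notes; // the next plan starts from what the assessment found missing
  }
  return { status: "escalate-to-human", work: null, history };
}

// Toy agent (hypothetical): addresses one unmet rubric item per iteration.
function makeToyAgent() {
  const done = [];
  return {
    plan: (scope, feedback) => (feedback.length > 0 ? feedback[0] : scope.rubric[0]),
    execute: (item) => {
      done.push(item);
      return [...done];
    },
    assess: (work, rubric) => {
      const missing = rubric.filter((item) => !work.includes(item));
      const score = Math.round((10 * (rubric.length - missing.length)) / rubric.length);
      return { score, notes: missing };
    },
  };
}

const scope = {
  goal: "Ship the sort utility",
  rubric: ["handles empty input", "handles duplicates", "has tests"],
};
const result = spear(scope, makeToyAgent());
console.log(result.status, result.history); // ready-for-resolve [ 3, 7, 10 ]
```

The first pass scores 3 out of 10 and the score climbs on each pass until the rubric is fully met, at which point the work is handed back for the human resolve step.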

Why it works

Once an agent can write the code, the work that remains is judging it. SPEAR moves that judgment to the stage that made the work: the assess rubric raises the bar inside the loop, where a fix is cheap, so what reaches you is already good.

The two gates put human judgment where it matters, deciding what to build and accepting what shipped. The work in the middle is mechanical, so it can run a hundred times unattended.

Scope. Plan. Execute. Assess. Resolve.

Read the full essay on SPEAR

06 / Process & Checklist

Demand proof of work.

A checklist is how everyone, including AI, gets every step right. It is why you board a flight without a second thought: the pre-flight list runs the same, every time.

Under conditions of complexity, not only are checklists a help, they are required for success.
Atul Gawande, The Checklist Manifesto
Proof of work
[Figure: plan.md, MILESTONE 1 · AUTH. 1.1 Auth endpoint: TODO. 1.2 Rate limiting: TODO. 1.3 Token refresh: TODO. 1.4 Session expiry: TODO. 1.5 Audit log: TODO. PROOF OF WORK: evidence the runner reads.]

How it works

Give the agent two artifacts: the recipe, a durable process for how the work is done, and the checklist, the atomic steps that each get checked off. The recipe rarely changes; the checklist flips state on every run.

A step counts as done only when all of its evidence agrees.
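As a sketch of that rule, the checklist below mirrors the first steps of the plan.md figure. The `evidence` object and its field names (`testsPassed`, `endpointResponds`) are assumptions standing in for whatever machine-readable proof a runner actually collects.

```javascript
// Hypothetical checklist; `evidence` holds machine-readable proof of work.
const checklist = [
  { id: "1.1", title: "Auth endpoint", evidence: { testsPassed: true, endpointResponds: true } },
  { id: "1.2", title: "Rate limiting", evidence: { testsPassed: true, endpointResponds: false } },
  { id: "1.3", title: "Token refresh", evidence: {} },
];

// A step is DONE only when it has evidence and every piece of it agrees.
// No evidence at all counts as TODO, not as a silent pass.
function stepState(step) {
  const results = Object.values(step.evidence);
  return results.length > 0 && results.every((ok) => ok === true) ? "DONE" : "TODO";
}

// The milestone is done only when every step is.
function milestoneDone(steps) {
  return steps.every((step) => stepState(step) === "DONE");
}

console.log(checklist.map((step) => `${step.id} ${stepState(step)}`));
// [ '1.1 DONE', '1.2 TODO', '1.3 TODO' ]
```

Step 1.2 stays TODO even though its tests pass, because one piece of evidence disagrees; that is the difference between a checked box and proof of work.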

Why it works

A checklist makes the optional-feeling step non-negotiable, so it gets done under pressure. Aviation answered this with the pre-flight checklist. Restaurants answered it with the recipe. A company I ran before answered it with a checklist for every task.

The checklist is how quality scales. Tie done to evidence a machine can read, and verification runs itself.

07 / Test at the Ends

Test only at the ends.

Anything you can measure cheaply, an agent will optimize. So measure the outcome you want, and the agent optimizes for that.

When a measure becomes a target, it ceases to be a good measure.
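Goodhart's law, as phrased by Marilyn Strathern, 1997
Proxy versus end
[Figure: VALUE plotted against OPTIMIZATION EFFORT. Two curves: the proxy metric (leads, tests passed) and the real outcome (sales, the job done). Annotation on the space between them: this gap ships.]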

How it works

Drive the real interface the way a user would, assert the real output, and treat the implementation as opaque: what counts is what comes out the end. Define success as the end outcome, write it so a machine can check it, and anchor the assess rubric to that, and only that. Intermediate signals (tests green, types clean, the build compiling) are diagnostics that tell you where you are. The finish line is the outcome itself.

Test the whole surface, not a sample. This is the payoff for keeping parts small and composable: a small surface has a small span, small enough to cover completely. Cover it end to end and every case is accounted for.
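A minimal sketch of testing at the boundary over a complete surface: the `canAccess` module and its role and action names are hypothetical, picked because three roles and three actions give a surface of nine cases, small enough to enumerate in full. The test only calls the public function and compares outcomes; it never inspects how the answer is computed.

```javascript
// Hypothetical module under test. The test below never looks inside it.
function canAccess(role, action) {
  const rank = { viewer: 0, editor: 1, admin: 2 };
  const required = { read: 0, write: 1, delete: 2 };
  return rank[role] >= required[action];
}

// The expected outcome for every point on the surface: 3 roles x 3 actions.
const expected = {
  viewer: { read: true, write: false, delete: false },
  editor: { read: true, write: true, delete: false },
  admin: { read: true, write: true, delete: true },
};

// Drive the public interface across the whole surface and collect mismatches.
function blackBoxFailures(fn, table) {
  const failures = [];
  for (const [role, actions] of Object.entries(table)) {
    for (const [action, want] of Object.entries(actions)) {
      const got = fn(role, action);
      if (got !== want) failures.push({ role, action, want, got });
    }
  }
  return failures;
}

console.log(blackBoxFailures(canAccess, expected)); // []
```

Because the table is written in terms of outcomes, the implementation can be rewritten freely and the same nine assertions still decide whether it is correct.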

Why it works

Aim an agent at the real outcome and it works toward the real outcome. Aim it at a proxy and it gives you exactly the proxy: a growth team told to lift leads lifts leads, even when revenue holds still, because leads only stood in for revenue. Point the measure at what you want, and what you measure and what you want become the same thing.

So point the rubric at the end and leave the intermediates as instruments. Measure what the user feels, cover the whole span, and the only way to move the score is to do the real work.

Measure the end, not the proxy.

In one line: black-box testing. Assert behavior at the boundary, never the implementation.

Full essay coming soon, in Entrepreneur's Edge

08 / The Flywheel

Recursive self-improvement.

The flywheel turns anything that slips past the gates into a permanent guardrail, automatically, so the system gets a little stronger every time.

The process resembles relentlessly pushing a giant, heavy flywheel, turn upon turn, building momentum until a point of breakthrough.
Jim Collins, Good to Great
Defects in, guardrails out
[Figure: a loop. Customers (use the product) → Errors & omissions (bugs + gaps surface) → Diagnose + propose (root cause, scored fix) → SPEAR (the fix ships) → fewer defects, more customers.]

How it works

Each turn starts with a production signal: a tracked error, a monitor, a customer report. It is triaged automatically into an error (something built wrong) or an omission (something missing), then diagnosed, fixed, and captured as a permanent check, which is the part that makes it a flywheel.

Diagnose before you patch. Sort the symptom into one category with evidence, the way a clinician works from a manual, so the fix lands on the cause. A patch aimed at the symptom adds code; a fix aimed at the cause clears a whole class of problems at once, and the check you leave behind keeps it that way.
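The capture step can be sketched as follows. The triage rule here (a signal is an error if the spec covered the feature, an omission if it did not) is a simplifying assumption, as are the shapes of `signal`, `spec`, and `build`; the diagnosis and the fix themselves are left out, since those are the agent's work.

```javascript
// Permanent checks accumulated by the flywheel; each one guards future builds.
const guardrails = [];

// Hypothetical triage rule: specified but wrong is an error,
// never specified is an omission.
function triage(signal, spec) {
  return spec.covers.includes(signal.feature) ? "error" : "omission";
}

// One turn of the flywheel: classify the signal, then capture a regression check.
function turn(signal, spec) {
  const kind = triage(signal, spec);
  guardrails.push({
    name: `${kind}: ${signal.summary}`,
    check: (build) => signal.reproduce(build) === signal.expected,
  });
  return kind;
}

// Run every captured guardrail against a candidate build.
function failingGuardrails(build) {
  return guardrails.filter((g) => !g.check(build)).map((g) => g.name);
}

// Illustration: a customer reports that an empty password is accepted.
const signal = {
  feature: "login",
  summary: "empty password accepted",
  reproduce: (build) => build.login("user", ""),
  expected: false,
};
const kind = turn(signal, { covers: ["login"] });

const buggyBuild = { login: () => true };
const fixedBuild = { login: (user, password) => password.length > 0 };
console.log(kind, failingGuardrails(buggyBuild), failingGuardrails(fixedBuild));
```

The guardrail outlives the fix: any later build that reintroduces the behaviour fails the same check before it ships.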

Why it works

Capture each fix as a guardrail and the work compounds: every issue you resolve makes the next one less likely, and the error rate keeps falling. That is a flywheel: it spins faster the longer it runs.

SPEAR's assess loop catches errors and omissions before you ship. The flywheel catches anything that reaches production and feeds it back through the same diagnosis. Two nets, one at the gate and one in the field, and whatever the field surfaces becomes a test that guards the next build.

E&O Flywheel

In one line: an errors-and-omissions flywheel for production. Every error and every omission, once caught, becomes a permanent check.

Full essay coming soon, in Entrepreneur's Edge

09 / The Harness

Build the harness.

Put the pieces together and you have a harness: the system an agent runs inside. This is the methodology in practice.

A bad system will beat a good person every time.
W. Edwards Deming, The Deming Institute
Goal in, tested PR out
[Figure: INPUT: a goal → THE HARNESS: AGENTS.md (what it knows), SPEAR (how work flows), Runtime (where work happens) → OUTPUT: tested PR.]

How it works

What the agent knows is the architecture. How work flows is SPEAR. Where work happens is the runtime: the process, the checklist, and the rubric that carry state from one iteration to the next. Wire the three together and you can hand off a goal and collect a pull request.

It runs two ways. Proactively, you scope a goal and start a run. Reactively, a failing check or a monitor fires and the same loop diagnoses the cause and repairs it. Same harness, different trigger.
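The "same harness, different trigger" idea can be sketched as a single entry point that normalizes both triggers into a goal. The trigger shapes, the `loadContext` and `runLoop` parameters, and the stubbed pull-request result are all hypothetical; in a real harness `runLoop` would be the SPEAR loop and `loadContext` would read the agent architecture.

```javascript
// Hypothetical harness: the three parts are passed in, nothing is global.
function makeHarness({ loadContext, runLoop }) {
  // Runtime state carried across runs (here, just a log of what was handled).
  const runtime = { log: [] };

  // Both triggers are normalized into a goal and sent through the same loop.
  function toGoal(trigger) {
    if (trigger.type === "goal") return trigger.goal;
    if (trigger.type === "alert") return `Diagnose and repair: ${trigger.check}`;
    throw new Error(`Unknown trigger type: ${trigger.type}`);
  }

  return {
    runtime,
    handle(trigger) {
      const goal = toGoal(trigger);
      const result = runLoop({ goal, context: loadContext() });
      runtime.log.push({ trigger: trigger.type, goal });
      return result;
    },
  };
}

// Stubs standing in for the architecture and the loop.
const harness = makeHarness({
  loadContext: () => "company + function + product",
  runLoop: ({ goal }) => ({ pullRequest: `PR for: ${goal}` }),
});

console.log(harness.handle({ type: "goal", goal: "Add audit log" }));
console.log(harness.handle({ type: "alert", check: "login latency monitor" }));
```

Only `toGoal` knows which kind of trigger arrived; everything after it is shared, which is what keeps the proactive and reactive paths from drifting apart.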

Why it works

Each piece is load-bearing, and together they compound. The architecture gives the agent the company; SPEAR gives it the gates; the runtime gives it a place to do the work. Apart they are a model with a prompt; wired together they are a system that ships.

This is the methodology behind Ambiguous: more than four million lines of code in thirty days, across fifteen SaaS applications, built this way.

See it in practice

We built Ambiguous with it.

Ambiguous is an AI-native workspace where agents and humans are coworkers.

Reach out

Happy to go deeper.

Ambiguous Workspace is my full-time focus, but I speak and advise on the software factory often. The work has been shared and featured at leading AI communities.

rwaliany@gmail.com