An AI build playbook

The Software Factory.

At Ambiguous AI, we rebuilt fifteen SaaS products to feature parity with their category leaders in under thirty days. At first, we were skeptical that it would be possible. It worked because the method pulls together three crafts that rarely sit in one place: building product, writing code, and running operations at scale. This playbook is the principles and patterns behind it.

15 SaaS apps at parity

4M+ lines of code

30 days from first line of code


The Software Factory

Goal in, software out.

Why now

Agents are the biggest lever you have. People, operations, process, product: the agent layer sits on top and multiplies everything below it.

The build is the cheap part now. Like hardware in the 1990s, value moved to the ends: what to build, and who it is for. The assembly in the middle is commoditized.

Every feature has a price tag. When a feature costs a known amount to build, the roadmap becomes a portfolio decision instead of a guess.

The principles

set it up → run it → trust it → improve it

01 Give agents context. (Agent Architecture)

02 Structure what you write. (Structured Thinking)

03 Specify the goal and the approach. (Three Levels)

04 Make the implementation deterministic. (The Spec)

05 Design the system for autonomy. (Design the System)

06 Keep humans at the gates. (SPEAR)

07 Demand proof of work. (Process & Checklist)

08 Test only at the ends. (Test at the Ends)

09 Recursive self-improvement. (The Flywheel)

The harness

10 Build the harness. (The Harness)

01 / Agent Architecture

Give agents context.

An agent does its best work with the context you would give a strong new hire: the mission, the craft, how the product works, and the specific surface it is building.

“Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates. Total chaos. Nothing works. That is what AI feels like today. The missing piece is extracting all the domain knowledge from people's heads and providing that as structured context to the models.”

Tom Blomfield @t_blom

What every agent inherits

How it works

Four tiers, each inheriting the one above. The company file holds mission, vision, and values: the things that tell anyone, human or agent, whether a piece of work was good. Each function file carries a craft, written once and reused. Each product file defines the customer surfaces it ships. Each surface file pins one of them, such as an API or a mobile app, with the stack and constraints specific to it.

Inheritance runs upward. A surface file pulls in its product, the product pulls in its function, and the function pulls in the company, so an agent reading any single file already carries everything above it. Change a value in the company file and every agent downstream sees it on the next run. Nothing is copied between tiers, so the structure itself cannot drift. What does scatter is the day-to-day: edits land the same fact in two places, and a periodic maintenance pass sweeps it back to one home, staged as a diff a person reviews before it merges. The pass proposes; the reviewer is responsible for confirming nothing was lost.
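A minimal sketch of that inheritance, assuming each tier is a small record with a pointer to its parent. The names ContextFile and resolve, the example facts, and the rule that a nearer tier overrides a farther one are all assumptions made for illustration; the playbook does not prescribe a format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextFile:
    """One tier of context; `parent` is the tier it inherits from."""
    name: str
    facts: dict
    parent: Optional["ContextFile"] = None


def resolve(node: ContextFile) -> dict:
    """Merge facts from the company tier down to `node`.

    Assumption: when two tiers set the same key, the nearer tier wins.
    """
    chain = []
    current: Optional[ContextFile] = node
    while current is not None:
        chain.append(current)
        current = current.parent
    merged: dict = {}
    for tier in reversed(chain):  # company first, surface last
        merged.update(tier.facts)
    return merged


# Hypothetical content, not taken from any real context file.
company = ContextFile("company", {"mission": "ship useful software"})
engineering = ContextFile("engineering", {"craft": "small, reviewed changes"}, company)
billing = ContextFile("billing", {"surfaces": ["api", "mobile"]}, engineering)
billing_api = ContextFile("billing-api", {"stack": "HTTP + JSON"}, billing)

context = resolve(billing_api)
```

Because nothing is copied, editing company.facts and calling resolve again is enough for every surface to see the new value.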

Why it works

Shared context is what lets a teammate tell a good day from a bad one. Every company that scales writes its operating context down so everyone works from the same understanding. Agents are no different, and you onboard a hundred a day, each ready the moment it reads the file.

Encode that context once and every agent works from the same version of the company. One source of truth keeps a hundred agents pulling in the same direction.

One fact, one home.

The spec that scales agents, in Entrepreneur's Edge

02 / Structured Thinking

Structure what you write.

Consultants live by this: clear structure keeps the work specific and the scope steady.

“I've been on a kick about clear thinking and communication recently. It's critical for developing safe, useful models, and applications built on top of them.”

Sean Grove @sgrove, OpenAI

Three heuristics

How it works

Three habits do the work. Lead with the answer, then support it: the pyramid principle. Split a problem into parts that do not overlap and leave no gaps: MECE (mutually exclusive, collectively exhaustive). And run every task through the seven circumstances as a checklist (what, who, where, when, why, how, and with what) so it is adequately specified before any agent touches it.

The seven are a gate, not a style guide. Run a draft through them and any missing part shows itself. When a task answers all seven, it is ready to hand off.
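The gate can be sketched as a check that refuses a task until all seven circumstances are answered. This assumes a task is a plain dictionary keyed by circumstance; the function names and the example task are hypothetical.

```python
SEVEN = ("what", "who", "where", "when", "why", "how", "with_what")


def missing_circumstances(task: dict) -> list:
    """Return the circumstances the task leaves blank, in checklist order."""
    missing = []
    for key in SEVEN:
        value = task.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def ready_to_hand_off(task: dict) -> bool:
    """A task passes the gate only when nothing is missing."""
    return not missing_circumstances(task)


# Hypothetical draft task with two gaps.
draft = {
    "what": "CSV export of invoices",
    "who": "account admins",
    "where": "billing settings page",
    "why": "finance teams reconcile in spreadsheets",
    "how": "stream rows from the invoices table",
}

gaps = missing_circumstances(draft)
```

Here gaps lists "when" and "with_what", so the draft goes back to its author instead of to an agent.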

Why it works

Clarity is what carries teams, human or otherwise. A person fills the gaps in a vague request from hallway conversation; a model fills them with the most probable tokens, so the more you specify, the more it gets right. Consultants built the pyramid principle and MECE precisely because a recommendation, like a prompt, gets one shot to land clearly.

An atomic, MECE, answer-first spec reads the same to everyone, so the agent builds exactly what you meant.

Structure in, clarity out.

Full essay coming soon, in Entrepreneur's Edge

03 / Three Levels

Specify the goal and the approach.

It is a specification problem, not an AI one. Ask an agent for quicksort and it is right nearly every time, because the spec spells everything out.

“Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: it just cannot be done.”

David Marr, Vision, 1982

Why, how, and what

How it works

David Marr split any computational system into three levels, and the split decides who writes what and where it lives. The goal (what success looks like for the user) and the approach (the method and the hard constraints) live in the agent architecture, written once, durable and time-invariant. The implementation, the code, lives in the codebase and is volatile: it changes often. The architecture does not pin it; it keeps a pointer to a current example.

Write the approach down and the model has one path to follow, the same one every run. Pin the goal and the approach; let the implementation stay volatile, referenced as an example. Your half is time-invariant; the code is not.

Why it works

Models are already near-perfect at anything specified to the algorithm level. Hand one a competitive-programming problem, fully stated, and it returns a correct solution. Raw capability is not the constraint here.

Your feature is the same kind of problem, just rarely specified that completely. Pin the goal and the approach to the level a contest problem states them, and the agent builds it just as cleanly. The spec is the lever, not the model.

You write the why and the how. The agent writes the code.

Read 'Managing AI is Managing Entropy', in Entrepreneur's Edge

04 / The Spec

Make the implementation deterministic.

LLMs are probabilistic: when a brief can be built three ways, variation is inevitable. The run should be specified enough to be deterministic, so every path resolves to the same decision.

“The hardest single part of building a software system is deciding precisely what to build.”

Fred Brooks, No Silver Bullet, 1986

The specification

How it works

The spec sits between the last section's levels, past the approach but short of the code: level 2.7. The brief is the compact half every agent preloads, the goal and the approach, enough context to orient any run. The spec extends the brief and loads on demand, so a run carries light context until it needs the full detail. Where the brief leaves more than one way to build, the spec answers all seven circumstances (what, who, where, when, why, how, and with what) until only one way remains.

Five MECE categories hold those specs, one per kind of decision: product behavior, design, architecture, algorithm, and verification, so each open question has exactly one place its answer lives. The run itself makes no design decisions: when it reaches a choice with more than one defensible answer, it does not pick; it halts and surfaces the gap. You author the decision in its category and the run resumes, citing it rather than restating it. The build compiles decisions that are already made; it never makes them.
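One way to picture "halt and surface the gap" is a lookup that either cites an authored decision or raises. The class names, the SpecGap exception, and the example decisions are assumptions for illustration, not a described implementation.

```python
from enum import Enum


class Category(Enum):
    PRODUCT_BEHAVIOR = "product behavior"
    DESIGN = "design"
    ARCHITECTURE = "architecture"
    ALGORITHM = "algorithm"
    VERIFICATION = "verification"


class SpecGap(Exception):
    """Raised when the run reaches a decision nobody has authored yet."""

    def __init__(self, category: Category, question: str):
        super().__init__(f"unanswered {category.value} decision: {question}")
        self.category = category
        self.question = question


class Spec:
    def __init__(self) -> None:
        self._decisions: dict = {}

    def decide(self, category: Category, question: str, answer: str) -> None:
        """A person authors the decision in exactly one category."""
        self._decisions[(category, question)] = answer

    def cite(self, category: Category, question: str) -> str:
        """The run cites a decision; it never invents one."""
        try:
            return self._decisions[(category, question)]
        except KeyError:
            raise SpecGap(category, question) from None


# Hypothetical decisions for a list endpoint.
spec = Spec()
spec.decide(Category.ALGORITHM, "pagination", "cursor-based, 50 rows per page")

pagination = spec.cite(Category.ALGORITHM, "pagination")
try:
    spec.cite(Category.DESIGN, "empty state")
    surfaced = None
except SpecGap as gap:
    surfaced = gap  # the run stops here and hands the question to a person
```

After a person calls decide for the missing question, the same cite call succeeds and the run can resume.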

Why it works

Manufacturing solved this two centuries ago. A part filed to fit comes out different from every bench; a part cut to a toleranced drawing comes out the same from any shop, and the armories that adopted the drawing got interchangeable parts at scale. The drawing did not make the machinist better; it removed the choices the machinist had to make. A locked spec does the same for a run: the skill stays, the variance goes.

The split also draws a clean line between what people own and what agents own. Deciding what the product should do is judgment, and judgment stays human; turning a decided thing into code is the part agents already do well. Sorting every decision into its category before the run keeps each side on its own ground, and a disagreement shows up as a gap in a file instead of a surprise in a pull request.

Make the implementation deterministic.

Full essay coming soon, in Entrepreneur's Edge

05 / Design the System

Design the system for autonomy.

Build it like a value chain: modular parts with clear boundaries and clean inputs and outputs.

“The behavior of a system cannot be known just by knowing the elements of which the system is made.”

Donella Meadows, Thinking in Systems

Coupled vs composable

How it works

Decompose the system into independent, composable parts, each with one job and an explicit contract: typed inputs, typed outputs, no shared state. The contract is the same whether a human or an agent does the work: you hand either one the inputs, the expected outputs, and the single thing it owns.

Composable parts form a directed acyclic graph, a flow of steps with no loops, and a graph you can instrument. Every node carries its own health metric: does it pass its tests, does it hold its contract. You can see exactly which part needs work and fix it in place, rather than debug the whole system at once.
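A small sketch of an instrumented graph, using Python's standard graphlib (3.9+) to order the nodes. The Node shape, the per-node holds_contract check, and the pricing example are hypothetical; they stand in for whatever contract and health metric a real part would carry.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, NamedTuple, Set, Tuple


class Node(NamedTuple):
    depends_on: Set[str]
    run: Callable[[Dict[str, Any]], Any]   # upstream outputs -> this node's output
    holds_contract: Callable[[Any], bool]  # health check on that output


def run_graph(nodes: Dict[str, Node]) -> Tuple[Dict[str, Any], Dict[str, bool]]:
    """Run every node after its dependencies and record per-node health."""
    graph = {name: node.depends_on for name, node in nodes.items()}
    outputs: Dict[str, Any] = {}
    health: Dict[str, bool] = {}
    # static_order raises graphlib.CycleError if the graph has a loop.
    for name in TopologicalSorter(graph).static_order():
        node = nodes[name]
        upstream = {dep: outputs[dep] for dep in node.depends_on}
        outputs[name] = node.run(upstream)
        health[name] = node.holds_contract(outputs[name])
    return outputs, health


# Hypothetical three-node flow; each node owns exactly one value.
nodes = {
    "quantity": Node(set(), lambda _: 3, lambda out: isinstance(out, int) and out > 0),
    "unit_price": Node(set(), lambda _: 2.5, lambda out: out >= 0),
    "total": Node(
        {"quantity", "unit_price"},
        lambda up: up["quantity"] * up["unit_price"],
        lambda out: out >= 0,
    ),
}

outputs, health = run_graph(nodes)
```

The health dictionary is the instrument: a False entry names the one part that needs work.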

Why it works

Independent parts are easier to measure, test, and trust. A part with a clean contract can be handed to an agent without it needing to understand the whole system to change one piece, and you can verify that piece in isolation before it touches anything else.

Tight contracts keep each part self-contained. When a part owns exactly one thing, you can let an agent build it, test it alone, and trust the result. The behavior you want falls out of the structure you drew.

Design the road for the car.

Read the full essay in Entrepreneur's Edge

06 / SPEAR

Keep humans at the gates.

Once agents can code on their own, the highest-value thing a person does is judge the work. Move people from the inner loop to the outer loop, and keep quality high.

“Detect and fix any problem in a production process at the lowest-value stage possible.”

Andrew Grove, High Output Management

The SPEAR loop

How it works

Five phases. You scope the work and, later, you resolve it. In between, the agent runs an unattended loop: plan, execute, assess against a rubric, then go again. Two human gates bracket the loop; everything inside runs without you.

Each assess pass is stricter than the last, so the output climbs toward a passing score. The loop stops when the rubric reads ten out of ten.
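The unattended middle of the loop can be sketched as plan, execute, assess, repeat until the rubric reads ten. The function signatures, the max_rounds safety limit, and the toy stand-ins below are assumptions; a real run would call an agent and a real rubric, and this sketch does not model the rubric getting stricter on each pass.

```python
from typing import Callable, List, Tuple


def spear_loop(
    scope: str,
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[List[str]], List[str]],
    assess: Callable[[List[str]], Tuple[int, List[str]]],
    passing: int = 10,
    max_rounds: int = 20,  # assumption: a cap so a stuck run exits to a person
) -> Tuple[List[str], int]:
    """Run plan -> execute -> assess until the rubric score reaches `passing`."""
    feedback: List[str] = []
    for round_number in range(1, max_rounds + 1):
        steps = plan(scope, feedback)
        output = execute(steps)
        score, feedback = assess(output)
        if score >= passing:
            return output, round_number
    raise RuntimeError("rubric not satisfied; hand back to a person")


# Toy stand-ins so the sketch runs: each round fixes the first flagged gap.
draft: List[str] = []


def toy_plan(scope: str, feedback: List[str]) -> List[str]:
    return feedback[:1] if feedback else ["code"]


def toy_execute(steps: List[str]) -> List[str]:
    draft.extend(steps)
    return list(draft)


def toy_assess(output: List[str]) -> Tuple[int, List[str]]:
    missing = [part for part in ("code", "tests", "docs") if part not in output]
    return 10 - 3 * len(missing), missing


result, rounds = spear_loop("export endpoint", toy_plan, toy_execute, toy_assess)
```

Scope happens before the call and resolve happens after it returns; those are the two human gates.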

Why it works

Once an agent can write the code, the work that remains is judging it. SPEAR moves that judgment to the stage that made the work: the assess rubric raises the bar inside the loop, where a fix is cheap, so what reaches you is already good.

The two gates put human judgment where it matters, deciding what to build and accepting what shipped. The work in the middle stays mechanical because the design decisions were all made up front; one that surfaces mid-run exits to you as a gap instead of getting settled inside the loop. Mechanical work can run a hundred times unattended.

Scope. Plan. Execute. Assess. Resolve.

Read the full essay on SPEAR

07 / Process & Checklist

Demand proof of work.

A checklist is how everyone, including AI, gets every step right. It is why you board a flight without a second thought: the pre-flight list runs the same, every time.

“Under conditions of complexity, not only are checklists a help, they are required for success.”

Atul Gawande, The Checklist Manifesto

Proof of work

How it works

Give the agent two artifacts: the recipe, a durable process for how the work is done, and the checklist, the atomic steps that each get checked off. The recipe rarely changes; the checklist flips state on every run.

Done is when all the evidence agrees.
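A sketch of a checklist where each step is tied to evidence a machine can read. The Step shape, the artifacts dictionary, and the step names are hypothetical; the point is only that "done" is computed from evidence, not declared.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    evidence: Callable[[], bool]  # returns True only when proof exists


def run_checklist(steps: List[Step]) -> Dict[str, bool]:
    """Evaluate every step's evidence; the result is the checklist's state."""
    return {step.name: bool(step.evidence()) for step in steps}


def is_done(results: Dict[str, bool]) -> bool:
    """Done is when all the evidence agrees."""
    return all(results.values())


# Hypothetical artifacts a run might leave behind.
artifacts = {"tests_passed": True, "screenshot_path": ""}

checklist = [
    Step("tests pass", lambda: artifacts["tests_passed"]),
    Step("screenshot attached", lambda: bool(artifacts["screenshot_path"])),
]

results = run_checklist(checklist)
done = is_done(results)
```

With the screenshot missing, done is False no matter what the agent reports; attaching the file and rerunning flips the state.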

Why it works

A checklist makes the optional-feeling step non-negotiable, so it gets done under pressure. Aviation answered this with the pre-flight checklist. Restaurants answered it with the recipe. A company I ran before answered it with a checklist for every task.

The checklist is how quality scales. Tie done to evidence a machine can read, and verification runs itself.

Proof of work is the state.

Full essay coming soon, in Entrepreneur's Edge

08 / Test at the Ends

Test only at the ends.

Anything you can measure cheaply, an agent will optimize. So measure the outcome you want, and the agent optimizes for that.

“When a measure becomes a target, it ceases to be a good measure.”

Goodhart's law, as phrased by Marilyn Strathern, 1997

Proxy versus end

How it works

Drive the real interface the way a user would, assert the real output, and treat the implementation as opaque: what counts is what comes out the end. Define success as the end outcome, write it so a machine can check it, and anchor the assess rubric to that, and only that. Intermediate signals (tests green, types clean, the build compiles) are diagnostics that tell you where you are. The finish line is the outcome itself.

Test the whole surface, not a sample. This is the payoff for keeping parts small and composable: a small surface has a small span, small enough to cover completely. Cover it end to end and every case is accounted for.
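As a sketch of covering a small surface completely, the check below enumerates every input combination of a hypothetical seat_limit function and compares only its outputs against an expected table. The function, the plan names, and the numbers are invented for illustration.

```python
from itertools import product

PLANS = ("free", "pro", "enterprise")
TRIAL = (False, True)


def seat_limit(plan: str, trial: bool) -> int:
    """Hypothetical unit under test; the check below never looks inside it."""
    limit = {"free": 3, "pro": 25, "enterprise": 500}[plan]
    return min(limit, 5) if trial else limit


# The expected outcome for every point on the surface: 3 plans x 2 trial states.
EXPECTED = {
    ("free", False): 3,
    ("free", True): 3,
    ("pro", False): 25,
    ("pro", True): 5,
    ("enterprise", False): 500,
    ("enterprise", True): 5,
}


def check_surface(fn) -> list:
    """Call `fn` on every case and return (case, actual, expected) mismatches."""
    cases = list(product(PLANS, TRIAL))
    assert set(cases) == set(EXPECTED), "expected outcomes must cover the whole span"
    failures = []
    for case in cases:
        actual = fn(*case)
        if actual != EXPECTED[case]:
            failures.append((case, actual, EXPECTED[case]))
    return failures


failures = check_surface(seat_limit)
```

Because check_surface takes any callable, the implementation can be rewritten freely; only behavior at the boundary is asserted.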

Why it works

Aim an agent at the real outcome and it works toward the real outcome. Aim it at a proxy and it gives you exactly the proxy: a growth team told to lift leads lifts leads, even when revenue holds still, because leads only stood in for revenue. Point the measure at what you want, and what you measure and what you want become the same thing.

So point the rubric at the end and leave the intermediates as instruments. Measure what the user feels, cover the whole span, and the only way to move the score is to do the real work.

Measure the end, not the proxy.

In one line: black-box testing. Assert behavior at the boundary, never the implementation.

Full essay coming soon, in Entrepreneur's Edge

09 / The Flywheel

Recursive self-improvement.

The flywheel turns anything that slips past the gates into a permanent guardrail, automatically, so the system gets a little stronger every time.

“The process resembles relentlessly pushing a giant, heavy flywheel, turn upon turn, building momentum until a point of breakthrough.”

Jim Collins, Good to Great

Defects in, guardrails out

How it works

Each turn starts with a production signal: a tracked error, a monitor, a customer report. It is triaged automatically into an error (something built wrong) or an omission (something missing), then diagnosed, fixed, and captured as a permanent check. That last step is what makes it a flywheel.

Diagnose before you patch. Sort the symptom into one category with evidence, the way a clinician works from a manual, so the fix lands on the cause. A patch aimed at the symptom adds code; a fix aimed at the cause clears a whole class of problems at once, and the check you leave behind keeps it that way.
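A rough sketch of one turn: triage a signal, then capture the fix as a check that runs on every later build. The triage rule here (an existing feature means an error, a missing one means an omission), the Guardrails class, and the CSV example are simplifying assumptions, not the actual diagnosis procedure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Kind(Enum):
    ERROR = "error"        # something built wrong
    OMISSION = "omission"  # something missing


@dataclass
class Signal:
    source: str
    description: str
    feature_exists: bool  # hypothetical piece of triage evidence


def triage(signal: Signal) -> Kind:
    """Assumed rule: a broken existing feature is an error, otherwise an omission."""
    return Kind.ERROR if signal.feature_exists else Kind.OMISSION


class Guardrails:
    """Permanent checks; each resolved issue adds one."""

    def __init__(self) -> None:
        self.checks: List[Tuple[str, Callable[[dict], bool]]] = []

    def capture(self, name: str, check: Callable[[dict], bool]) -> None:
        self.checks.append((name, check))

    def run(self, build: dict) -> List[str]:
        """Return the names of the checks this build fails."""
        return [name for name, check in self.checks if not check(build)]


# Hypothetical production signal and the check left behind after the fix.
signal = Signal("customer report", "CSV export drops the header row", True)
kind = triage(signal)

guardrails = Guardrails()
guardrails.capture(
    "csv export keeps header row",
    lambda build: build["csv_export"].startswith("id,"),
)

good_build = {"csv_export": "id,amount\n1,20\n"}
regressed_build = {"csv_export": "1,20\n"}
```

Running guardrails.run on the regressed build names the captured check, so the same defect cannot ship twice unnoticed.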

Why it works

Capture each fix as a guardrail and the work compounds: every issue you resolve makes the next one less likely, and the error rate keeps falling. That is a flywheel: it spins faster the longer it runs.

SPEAR's assess loop catches errors and omissions before you ship. The flywheel catches anything that reaches production and feeds it back through the same diagnosis. Two nets, one at the gate and one in the field, and whatever the field surfaces becomes a test that guards the next build.

E&O Flywheel

In one line: an errors-and-omissions flywheel for production. Every error and every omission, once caught, becomes a permanent check.

Full essay coming soon, in Entrepreneur's Edge

10 / The Harness

Build the harness.

Put the pieces together and you have a harness: the system an agent runs inside. This is the methodology in practice.

“A bad system will beat a good person every time.”

W. Edwards Deming, The Deming Institute

Goal in, tested PR out

How it works

What the agent knows is the architecture. How work flows is SPEAR. Where work happens is the runtime: the process, the checklist, and the rubric that carry state from one iteration to the next. Wire the three together and you can hand off a goal and collect a pull request.

It runs two ways. Proactively, you scope a goal and start a run. Reactively, a failing check or a monitor fires and the same loop diagnoses the cause and repairs it. Same harness, different trigger.

Why it works

Each piece is load-bearing, and together they compound. The architecture gives the agent the company; SPEAR gives it the gates; the runtime gives it a place to do the work. Apart they are a model with a prompt; wired together they are a system that ships.

This is the methodology behind Ambiguous: more than four million lines of code in thirty days, across fifteen SaaS applications, built this way.

Every piece carries its weight.

Full essay coming soon, in Entrepreneur's Edge

See it in practice

We built Ambiguous with it.

Ambiguous is an AI-native workspace where agents and humans are coworkers.


Reach out

Happy to go deeper.

Ambiguous Workspace is my full-time focus, but I speak and advise on the software factory often. The work has been shared and featured at leading AI communities.

rwaliany@gmail.com