Touchskyer's Thinking Wall

Silicon Workforce S1E02: What Does a One-Person Engineering Team Look Like?

The first time OPC felt like a real team was during a code review.

I had four AI reviewers independently examine the same project. PM said: “Features are complete, but there’s no loading feedback — users stare at a blank screen while waiting.” Security Expert said: “Path traversal vulnerability — attackers could read system files through theme file paths.” Engineer said: “Input validation is too loose — some illegal inputs aren’t being blocked.” Skeptic said: “About that missing loading state PM mentioned — are you sure it’s not intentional? There’s a relevant comment in the code.”

Four reports. Zero overlap. One of them even challenged another’s conclusions.

This isn’t “let AI take a look.” This is a review process with role separation, evidence requirements, and a final verdict. It achieves this not through a stronger model, but through a designed pipeline.

The Chef Doesn’t Taste the Final Dish

There’s an iron rule in the kitchen: the chef who cooks the dish can’t be the one who gives the final taste test.

Not because the chef has bad taste buds. But because from the first cut, they’ve been adjusting — after tasting a dozen times, their palate has been calibrated by their own process. What they think is “just right” might be too salty, too bland, or just off for someone tasting it for the first time.

The same principle applies to writing code. You write a function, write tests for it, the tests pass, and everything feels perfect. But you’re testing what you think the code should do. If your understanding of the requirements is wrong, your tests will be wrong too, and they’ll still pass perfectly.

That’s why OPC’s own tests in the last episode were all fake — the maker judged their own work, and the tests were forever green.

OPC’s first principle came from this lesson: The one who does the work doesn’t judge their own work. The code-writing AI can’t review its own code. The test-designing AI can’t execute its own tests. The judgment-making AI can’t adjudicate its own judgment.

This isn’t philosophy. It’s architecture.

Four Roles

OPC breaks an engineering team into four roles. Not because four sounds nice, but because this is the minimum division of labor needed to ensure quality.

Builder is like a factory worker with blueprints: it doesn’t design the blueprints, it executes them. What the Builder receives are explicit acceptance criteria, a checklist that defines “done means these things are true.” Not “build a website” but “user can log in with email, session persists across refresh, logout clears session.” Each criterion has a corresponding verification method. The Builder doesn’t need to be creative; it needs to execute precisely.
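Here is a minimal sketch of what such a checklist could look like as data. The type name, field names, and the wording of each verification method are my own illustration, not OPC's actual schema; only the three criteria come from the example above.

```typescript
// Hypothetical shape for one acceptance criterion: a statement that must be
// true when the work is done, paired with the way to check it.
interface AcceptanceCriterion {
  id: string;
  statement: string;
  verify: string; // verification method (illustrative wording, not OPC's)
}

const loginCriteria: AcceptanceCriterion[] = [
  {
    id: "AC-1",
    statement: "User can log in with email",
    verify: "Submit the login form with a known account and check for a session",
  },
  {
    id: "AC-2",
    statement: "Session persists across refresh",
    verify: "Reload the page after login and check the user is still signed in",
  },
  {
    id: "AC-3",
    statement: "Logout clears session",
    verify: "Log out, reload, and check the user is signed out",
  },
];

// Returns the ids of criteria that have no verification method attached.
function missingVerification(criteria: AcceptanceCriterion[]): string[] {
  return criteria.filter((c) => c.verify.trim() === "").map((c) => c.id);
}

console.log(missingVerification(loginCriteria)); // []
```

The point of the check is simple: a criterion without a verification method is a wish, and the Builder should never receive one.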

Reviewer is like the anonymous peer reviewers of an academic paper: each one grades independently, and none can see what the others wrote. OPC always runs at least two reviewer roles, and each writes its report without access to the others. One seat is permanently reserved for the Skeptic Owner, whose entire job is to say “wait, what are you all missing?” Other reviewers might say “overall looks fine”; the Skeptic Owner’s purpose is to find the traps hidden inside “overall looks fine.”
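A rough sketch of that independence rule, under some simplifying assumptions: reviewers are plain synchronous functions here (real ones would be AI calls), and every name in the block is hypothetical. What matters is the shape: each reviewer receives the project and nothing else.

```typescript
type Role = "pm" | "security" | "engineer" | "skeptic";

interface ReviewReport {
  role: Role;
  findings: string[];
}

// A seat pairs a role with a review function. The function's only input is
// the project, so it has no way to read another reviewer's report.
interface ReviewerSeat {
  role: Role;
  review: (project: string) => string[];
}

function runIndependentReviews(project: string, seats: ReviewerSeat[]): ReviewReport[] {
  if (seats.length < 2) {
    throw new Error("at least two reviewer roles are required");
  }
  if (!seats.some((s) => s.role === "skeptic")) {
    throw new Error("the Skeptic Owner seat must always be filled");
  }
  // Reports are collected only after every reviewer has run in isolation.
  return seats.map((s) => ({ role: s.role, findings: s.review(project) }));
}
```

Isolation is enforced by the function signature rather than by asking reviewers nicely not to peek.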

Tester works like military staff officers and field commanders: two roles with a clear division. One AI designs the test cases: what scenarios to test, which edge cases, what could go wrong. Another AI executes: it takes the test plan, actually runs the commands, and collects pass/fail evidence. The designer doesn’t touch code; the executor doesn’t change the plan. Why split them? Because a test designer who also executes will unconsciously steer around their own blind spots.

Gate is the final checkpoint. It’s not AI — it’s hard-coded logic that does exactly one thing: count. Each reviewer classifies their findings as they write — Critical (fatal), Major (serious), or Minor (trivial). The Gate takes these classifications and applies rules: Critical >= 1? Automatic FAIL. Major <= 2 with zero Critical? PASS. Everything else? ITERATE — back to Builder for another round.

Why can’t the Gate be AI? Because an AI judge can be anchored by the last sentence it reads when the signals are mixed. Suppose three reviewers all say FAIL, but the last one adds “overall acceptable”: an AI might rule PASS. Counting doesn’t get anchored. Rules can’t be persuaded.
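The counting rule is small enough to write out. This is a sketch of the thresholds exactly as stated above, not OPC's actual source; the type and function names are mine.

```typescript
type Severity = "critical" | "major" | "minor";
type Verdict = "PASS" | "FAIL" | "ITERATE";

// Pure counting: the Gate never reads the prose of a report, only the labels.
function gate(findings: Severity[]): Verdict {
  const critical = findings.filter((s) => s === "critical").length;
  const major = findings.filter((s) => s === "major").length;

  if (critical >= 1) return "FAIL"; // any Critical fails outright
  if (major <= 2) return "PASS"; // zero Critical and at most two Major
  return "ITERATE"; // everything else goes back to the Builder
}

console.log(gate(["major", "minor", "minor"])); // PASS
console.log(gate(["major", "major", "major"])); // ITERATE
console.log(gate(["critical", "minor"])); // FAIL
```

Note that under these rules Minor findings never change the verdict. A reviewer's closing remark of “overall acceptable” has nowhere to enter the function at all.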

A 14-Node Pipeline

String these four roles together, and you get OPC’s full-stack flow — a complete pipeline from discussion to delivery.

Discussion → Build → Code Review → Test Design → Test Execute → Test Gate
→ Acceptance Review → Acceptance Gate → Security Audit → Audit Gate
→ E2E Testing → E2E Gate → UX Simulation → Final Gate

14 nodes. Each node has explicit inputs, outputs, and handoff protocols. The discussion node produces a spec document (not a conversation transcript); code review produces an evaluation report; test design produces a test plan; test execute produces test evidence (actual command output); gates produce one of three states: PASS / FAIL / ITERATE.

Not every project needs all 14 nodes. A simple CSS fix might only run the build-verify flow (build → review → test → gate, 4 nodes). A new feature might hit 8 nodes. Only formal pre-release reviews need all 14. Pipeline length is set by task risk, not by a one-size-fits-all template.
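As a sketch, the two flows spelled out above can be written as plain lists, with the flow chosen by the kind of task. The `TaskKind` values and `selectFlow` are hypothetical names for illustration, and the 8-node feature flow is left out because its exact nodes aren't listed here.

```typescript
// The 14 nodes, in the order listed above.
const FULL_PIPELINE: readonly string[] = [
  "Discussion", "Build", "Code Review", "Test Design", "Test Execute", "Test Gate",
  "Acceptance Review", "Acceptance Gate", "Security Audit", "Audit Gate",
  "E2E Testing", "E2E Gate", "UX Simulation", "Final Gate",
];

// The short build-verify flow, using the shorthand names from the text.
const BUILD_VERIFY: readonly string[] = ["build", "review", "test", "gate"];

// Hypothetical task kinds; a real selector would need more than two.
type TaskKind = "simple-fix" | "pre-release-review";

function selectFlow(kind: TaskKind): readonly string[] {
  return kind === "simple-fix" ? BUILD_VERIFY : FULL_PIPELINE;
}

// Gate nodes are the ones whose name ends in "Gate".
const gateNodes = FULL_PIPELINE.filter((name) => /Gate$/.test(name));
console.log(FULL_PIPELINE.length, gateNodes.length); // 14 5
```

Five of the fourteen nodes are gates, so more than a third of the full pipeline exists only to decide whether work may move forward.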

But regardless of how many nodes are used, one rule is ironclad: every node’s output must be an artifact — files, screenshots, command output, scoring reports — not an AI saying “I checked, looks fine.”

This rule solves the “aspirational theater” problem from the last episode. If a test-execute node claims “tests passed” without attaching any command output, it’s not testing — it’s an essay about testing.
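One way to make the artifact rule mechanical is to refuse any node output that arrives without evidence. The shapes below are assumptions for illustration (OPC's real handoff format isn't shown in this post); the artifact kinds mirror the four named above.

```typescript
type ArtifactKind = "file" | "screenshot" | "command-output" | "scoring-report";

interface Artifact {
  kind: ArtifactKind;
  path: string; // where the evidence is stored
}

interface NodeOutput {
  node: string;
  claim: string; // what the node says happened
  artifacts: Artifact[]; // the evidence backing that claim
}

// A claim is accepted only if at least one artifact with a real path backs it.
function hasEvidence(output: NodeOutput): boolean {
  return output.artifacts.some((a) => a.path.trim() !== "");
}

const essay: NodeOutput = {
  node: "Test Execute",
  claim: "tests passed",
  artifacts: [],
};

const evidence: NodeOutput = {
  node: "Test Execute",
  claim: "tests passed",
  artifacts: [{ kind: "command-output", path: "evidence/test-run.log" }],
};

console.log(hasEvidence(essay), hasEvidence(evidence)); // false true
```

This only checks that evidence exists, not that it is truthful. Judging the evidence is still the job of the next role in line.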

Gates Make “Good Enough” Difficult

The most valuable part of the pipeline isn’t any single role — it’s the Gates.

A workflow without Gates looks like this: you have AI write code, another AI reviews it, the review finds three issues, you look at them, decide “not serious,” mark them “fix later,” and ship. The three issues are forgotten forever.

A workflow with Gates looks like this: review finds three issues. Each reviewer classifies their findings as they write — blocker (must fix), yellow (should fix), blue (optional), false positive (misidentified), out of scope. The Gate takes all reports and does one thing: counts by classification. Classification is the reviewer’s engineering judgment; the Gate just enforces rules. You can’t say “fix later” — reviewers must label every finding, and the Gate makes its verdict from those labels.

Gates are like dams — they don’t make the rain fall better, they stop the flood from getting through.

[Image: Crystal gates in the OPC pipeline. Code must pass to move forward.]

In a real review, Round 1 reviewers found three issues: a regex that only checked the beginning but not the end (a door that was only half-closed), two tests using inconsistent coding styles (inconsistency breeds bugs), and an error handler that was too broad (swallowing every exception including ones that shouldn’t be swallowed). All three were classified as yellow (should fix) by reviewers, and the Gate counted the labels — ITERATE, back to Builder. Round 2 fixed all three. Gate again: PASS.
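Those two rounds can be replayed in a few lines. One loud assumption: this post never states how the color labels map onto the Gate thresholds, so the sketch treats blocker like Critical and yellow like Major from the Four Roles section. It also assumes false positives and out-of-scope findings are tallied but never affect the verdict. All names are hypothetical.

```typescript
type Label = "blocker" | "yellow" | "blue" | "false-positive" | "out-of-scope";
type Verdict = "PASS" | "FAIL" | "ITERATE";

interface Finding {
  summary: string;
  label: Label;
}

// Count findings per label across all reviewer reports.
function tally(findings: Finding[]): Record<Label, number> {
  const counts: Record<Label, number> = {
    blocker: 0, yellow: 0, blue: 0, "false-positive": 0, "out-of-scope": 0,
  };
  for (const f of findings) counts[f.label] += 1;
  return counts;
}

// ASSUMPTION: blocker is treated as Critical and yellow as Major.
function gateFromLabels(findings: Finding[]): Verdict {
  const counts = tally(findings);
  if (counts.blocker >= 1) return "FAIL";
  if (counts.yellow <= 2) return "PASS";
  return "ITERATE";
}

const round1: Finding[] = [
  { summary: "regex checks the beginning but not the end", label: "yellow" },
  { summary: "two tests use inconsistent coding styles", label: "yellow" },
  { summary: "error handler is too broad", label: "yellow" },
];
const round2: Finding[] = []; // all three fixed

console.log(gateFromLabels(round1), gateFromLabels(round2)); // ITERATE PASS
```

Under that mapping, three yellows is one more than the threshold allows, which is why Round 1 could not be waved through.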

Without Gates, the three findings from Round 1 would become TODOs in comments, never to be fixed. Gates make it very hard to say “pretty much done.”

1,686 Lines of Mechanical Spec

At this point you might ask: who writes the Gate rules? Who decides what’s Critical versus Major?

The answer: 1,686 lines of TypeScript code.

This code turns every fuzzy quality standard into executable checks. For example, OPC defines 10 “red flags.” This is not an open list you can add to freely but a sealed checklist (a closed enum) with exactly 10 fixed items. Four of them:

  • default-favicon: still using the browser’s default icon
  • stack-trace-visible: error pages expose full stack traces
  • broken-link: links that go nowhere
  • data-loss-on-error: user data lost when errors occur

Each red flag has different severity at different quality tiers. default-favicon is ignored at the functional tier (you’re building a CLI tool, who cares about favicons); at the polished tier it becomes a warning; at the delightful tier it becomes critical — because a product pursuing exceptional experience still using a default favicon means you haven’t polished the details at all.
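In TypeScript, a closed enum with per-tier severity might look like the sketch below. It is not the real spec: it covers only the four flags named above (the other six aren't listed in this post), and it fills in severities only for default-favicon, the one flag whose tiers are described. Type and function names are mine.

```typescript
// A closed set: adding a flag means changing this type, not appending to a list.
// Only the four flags named above appear; the real enum has ten.
type RedFlag =
  | "default-favicon"
  | "stack-trace-visible"
  | "broken-link"
  | "data-loss-on-error";

type Tier = "functional" | "polished" | "delightful";
type FlagSeverity = "ignored" | "warning" | "critical";

// Partial on purpose: severities are filled in only where the text states them.
const SEVERITY_TABLE: Partial<Record<RedFlag, Record<Tier, FlagSeverity>>> = {
  "default-favicon": {
    functional: "ignored",
    polished: "warning",
    delightful: "critical",
  },
};

// Returns undefined when this sketch has no stated severity for the flag.
function severityOf(flag: RedFlag, tier: Tier): FlagSeverity | undefined {
  const row = SEVERITY_TABLE[flag];
  return row ? row[tier] : undefined;
}

console.log(severityOf("default-favicon", "functional")); // ignored
console.log(severityOf("default-favicon", "delightful")); // critical
```

Because `RedFlag` is a union of string literals, a typo like `"defualt-favicon"` is a compile error rather than a silently ignored check.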

This is far more actionable than “score 7.5.” With 7.5 you don’t know what to fix; with “default-favicon is critical,” you know to change the icon.

The same logic extends to acceptance criteria review. A tool called criteria-lint checks 14 items: do your criteria use vague words (“fast,” “clean,” “intuitive” — these fail without quantitative metrics); do they have impossible-to-fail conditions (“should work as expected” — this is always true, meaning nothing); are there duplicates (two criteria sharing 80% of words are probably copy-paste redundancy).
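To show the flavor of these checks, here is a toy version of three of the 14. It is not criteria-lint itself, and the heuristics are my assumptions: “has a quantitative metric” is approximated as “contains a digit,” the impossible-to-fail check looks for one fixed phrase, and “sharing 80% of words” is measured against the shorter criterion's unique words.

```typescript
const VAGUE_WORDS = ["fast", "clean", "intuitive"];
const UNFALSIFIABLE_PHRASES = ["work as expected"];

// Lowercased unique words of a criterion.
function uniqueWords(text: string): string[] {
  const all: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return all.filter((w, i) => all.indexOf(w) === i);
}

// Vague word present and no number anywhere (ASSUMPTION: digit = metric).
function isVague(criterion: string): boolean {
  const hasVagueWord = uniqueWords(criterion).some((w) => VAGUE_WORDS.indexOf(w) !== -1);
  return hasVagueWord && !/\d/.test(criterion);
}

function isUnfalsifiable(criterion: string): boolean {
  const lower = criterion.toLowerCase();
  return UNFALSIFIABLE_PHRASES.some((p) => lower.indexOf(p) !== -1);
}

// Shared unique words divided by the size of the smaller word set.
function wordOverlap(a: string, b: string): number {
  const wa = uniqueWords(a);
  const wb = uniqueWords(b);
  const smaller = Math.min(wa.length, wb.length);
  if (smaller === 0) return 0;
  return wa.filter((w) => wb.indexOf(w) !== -1).length / smaller;
}

function lint(criteria: string[]): string[] {
  const problems: string[] = [];
  criteria.forEach((c, i) => {
    if (isVague(c)) problems.push(`#${i + 1}: vague wording without a metric`);
    if (isUnfalsifiable(c)) problems.push(`#${i + 1}: cannot fail`);
    for (let j = 0; j < i; j++) {
      if (wordOverlap(c, criteria[j]) >= 0.8) {
        problems.push(`#${i + 1}: near-duplicate of #${j + 1}`);
      }
    }
  });
  return problems;
}

console.log(lint([
  "Page loads fast",
  "Page loads in under 2 seconds",
  "Login should work as expected",
]));
// [ '#1: vague wording without a metric', '#3: cannot fail' ]
```

The second criterion passes because it names a number that can be measured. That is the whole trick: each rule turns a matter of taste into something a script can answer with yes or no.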

These rules don’t rely on AI judgment. They’re code. Run them, get a result — pass or fail, no “overall acceptable” middle ground.

The Difference It Makes

Before the pipeline, I had AI build a family calendar app. The result worked — but no dark mode, no loading animations, error pages displaying raw stack traces. It looked like a homework assignment.

After the pipeline, the same task went through a build-verify cycle. Reviewers caught the missing dark mode and loading states; test cases covered error handling paths; the Gate ruled ITERATE, sending it back for two more rounds. Final output: same AI, same prompts, but the result went from “homework assignment” to “something you’d actually use.”

The difference isn’t that AI got smarter. The difference is that the bad stuff got caught.

The $130 Lesson

The first full pipeline run was a 27-hour marathon. 28 AI subagents dispatched for various tasks — some reviewing code, some designing tests, some playing Devil’s Advocate to poke holes. Total consumption: 148 million tokens, costing $130.

What I learned through this process: The pipeline’s value isn’t making AI do better work — it’s preventing bad work from getting through.

AI doesn’t suddenly write better code because you have a pipeline. It still makes the same mistakes — default favicons, unhandled edge cases, always-true tests. But with the pipeline, these mistakes get caught. Before they become the user’s problem, they get caught.

It’s like a dam. A dam doesn’t make the rain fall better. It stops the flood from getting through.

OPC isn’t a tool that makes AI stronger. It’s a system that lets one person work like a team — with building, reviewing, testing, and adjudication. Every step has someone (or rather, a role) checking the previous step’s work. No one is both the player and the referee.

That’s what a one-person engineering team looks like.


