Silicon Workforce S1E01: Why AI Can Write Code but Can't Do Engineering

Previously on Silicon Team: In the pilot season, I used AI to build a complete project management tool — from requirements to deployment, AI handled the entire process. One person + AI really shipped something. But “can build it” and “built it well” are two very different things. This season’s question: would you ship AI-written code to production?

I built a program that orchestrates multiple AIs to work together — one writes code, one reviews it, one plays devil’s advocate. I called it OPC (One Person Company).

Then one day, I had OPC audit itself.

Four AI reviewers each read the code independently, unable to see each other’s reports. When the results came in, I froze: every single one judged it as failing. One found that tests were written to pass no matter what (meaning they tested nothing), one found that 10 of the 11 test files were never executed, one found security vulnerabilities, and one found architectural errors.

Four roles. Zero communication between them. All judged FAIL.

A tool built specifically to check other projects’ code quality had fake tests of its own.

That’s where this whole story begins.

The Intern Is Fast, but Doesn’t Know the Rules

Imagine you hired an intern. Codes ten times faster than you, knows every language, picks up any framework. You say “build me a website” and three minutes later a complete page appears.

Sounds perfect.

Then you discover: the tests it wrote are always green — not because the code is bug-free, but because the tests themselves are written to pass no matter what. You discover npm test only ran 1 of 11 test files; the other 10 were never executed. You discover that evaluation files can be empty — the system only checks whether the file exists, not whether it has content.

This isn’t occasional carelessness. It’s a structural blind spot in how AI writes code: it can generate code that meets requirements, but doesn’t understand why those requirements exist.

The point of a test isn’t to produce a green checkmark — it’s to turn red when code breaks. If a test is always green, it’s not a test — it’s a placebo. AI doesn’t understand this distinction because it has never experienced the terror of being woken at 3 AM by a failing test alert.
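
To make the placebo distinction concrete, here is a minimal sketch in Python (the same idea applies to a JavaScript suite run by npm test). The function apply_discount and both test names are hypothetical, invented for illustration; the bug is planted on purpose so the two tests can disagree.

```python
def apply_discount(price: float, percent: float) -> float:
    """Deliberately buggy: adds the discount instead of subtracting it."""
    return price + price * percent / 100


def placebo_test() -> bool:
    """Always green: it calls the code but never inspects the result."""
    try:
        apply_discount(100.0, 20.0)
        assert True  # cannot fail, whatever the code returns
        return True
    except AssertionError:
        return False


def real_test() -> bool:
    """Turns red when the code breaks: compares against the expected value."""
    try:
        assert apply_discount(100.0, 20.0) == 80.0
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    print("placebo test passed:", placebo_test())
    print("real test passed:", real_test())
```

Both tests exercise the same buggy function. Only the second one can ever turn red, which is the whole point of having it.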

Figure: All green on screen, but cracks beneath the surface hide the real problems.

Three Tools Were Actually the Same Thing

Before discovering this problem, I had three separate tools running — each doing something different: one managed AI coding workflows, one evaluated tool quality, one screened business ideas. After running them for weeks, one day I suddenly asked:

Wait — are these three things actually the same thing?

You’ve probably had a similar realization. Product development follows the same pattern: design mockup -> design review -> stakeholder sign-off. Writing works this way too: first draft -> editor review -> publish. Even cooking: cook the dish -> taste test -> serve.

Build, review, gate. This pattern is everywhere.

All three of my tools shared this same pattern. So they merged into one pipeline. OPC became the engine.

28 AI Employees, $130, Two Days and One Night

After the merge, the first thing was to have OPC audit itself.

I launched 4 independent reviewer agents: Engineer, Security Expert, Architect, and Devil’s Advocate (whose job is to poke holes). Each read the code independently and gave their assessment — none could see what the others wrote.

The result: 7 Critical and High severity issues. The always-true tests, the 1/11 execution rate, the empty file exploit — all caught in this round.
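
Two of these issues lend themselves to mechanical checks. The sketch below is one possible shape for them, with hypothetical function names; it assumes the list of executed test files can be obtained from the test runner's output, which is not shown here.

```python
from pathlib import Path
from typing import Iterable, List, Set


def evaluation_is_valid(path: Path) -> bool:
    """Closing the empty-file exploit: existing is not enough.

    The file must exist and contain something other than whitespace.
    """
    return path.is_file() and path.read_text(encoding="utf-8").strip() != ""


def unexecuted_tests(discovered: Iterable[str], executed: Iterable[str]) -> List[str]:
    """Return test files that exist in the project but never ran."""
    ran: Set[str] = set(executed)
    return sorted(name for name in discovered if name not in ran)


if __name__ == "__main__":
    # Hypothetical file names mirroring the 1-of-11 situation.
    discovered = [f"test_{i:02d}.js" for i in range(1, 12)]
    skipped = unexecuted_tests(discovered, ["test_01.js"])
    print(f"{len(skipped)} of {len(discovered)} test files never ran")
```

A gate built on checks like these would fail the build whenever unexecuted_tests returns a non-empty list, instead of trusting a green summary line.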

After fixing those 7, deeper problems surfaced: products OPC built were “functional” but not good — no dark mode, no smooth animations, typography using system defaults. Scores kept plateauing between 6.5 and 8.5, unable to break through.

Where was the problem? Not that AI couldn’t do these things, but that nobody asked it to. The acceptance criteria (Definition of Done) only listed functional requirements — nothing about “must have dark mode.” AI’s incentive is “pass the review,” not “make users willing to pay.”

So we designed a Quality Tier system: three levels, from “it runs” to “it’s good” to “it delights.” The tier is chosen at project start, written into configuration, and every stage of the pipeline checks against it. Not relying on AI’s self-awareness — enforced by code.
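
As a rough illustration of "enforced by code", a tier check might look like the following. The tier names follow the three levels above, but which requirement belongs to which tier is my assumption for the sake of the example, as are all identifiers.

```python
from enum import IntEnum
from typing import Dict, List, Set


class QualityTier(IntEnum):
    RUNS = 1      # "it runs"
    GOOD = 2      # "it's good"
    DELIGHTS = 3  # "it delights"


# Hypothetical mapping: the assignment of requirements to tiers is assumed.
REQUIREMENTS: Dict[QualityTier, Set[str]] = {
    QualityTier.RUNS: {"core_features"},
    QualityTier.GOOD: {"dark_mode", "custom_typography"},
    QualityTier.DELIGHTS: {"smooth_animations"},
}


def required_for(tier: QualityTier) -> Set[str]:
    """A tier inherits every requirement of the tiers below it."""
    needed: Set[str] = set()
    for level in QualityTier:
        if level <= tier:
            needed |= REQUIREMENTS[level]
    return needed


def missing_for(tier: QualityTier, delivered: Set[str]) -> List[str]:
    """List what the chosen tier demands that the build has not delivered."""
    return sorted(required_for(tier) - delivered)


if __name__ == "__main__":
    # The tier would be read from project configuration; hard-coded here.
    print(missing_for(QualityTier.GOOD, {"core_features"}))
```

Because the tier is data rather than a hope, a pipeline stage can refuse to pass a build while missing_for still returns anything.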

The whole process took two days and one night. 28 AI assistants were dispatched. Total consumption: 148 million tokens (a token is the basic unit of text an AI model processes, typically a word or a piece of a word; 148 million tokens is approximately the text content of 200 books), costing $130.

The most expensive lesson? We tried having AI simulate “would users pay for this.” We designed 5 virtual buyers, each with different backgrounds, and had them independently evaluate the product and state how much they’d pay.

The result was torn apart by all 4 reviewers: AI-generated “willingness to pay” isn’t a real economic signal — it’s just picking a random number within the price range you specified. AI can tell you “is this thing well-made” (a quality judgment), but it can’t tell you “is it worth spending money on” (a purchase decision). Because AI has no wallet, no budget constraints, no real pressure of “last month’s credit card bill was too high, better save this month.”

Writing Code ≠ Doing Engineering

After this audit, I understood something:

Writing code is translating requirements into machine-executable instructions. AI is great at this.

Doing engineering is designing a system that catches code when it breaks, that makes quality improve over time instead of fluctuating randomly, that lets one person work like a team. AI isn’t good at this — not because it’s not smart enough, but because the core of “engineering” isn’t writing code, it’s checking the people who write code.

OPC was born to solve this problem. It doesn’t let the code-writing AI judge its own code — instead, it sends another AI to check, then another AI to check the checker, and finally uses hard-coded rules to make the final verdict.

The one who does the work doesn’t judge their own work. That’s OPC’s first principle.
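
Here is a minimal sketch of what a final verdict made by hard-coded rules could look like. The Review type, the specific rules, and the agent names are all hypothetical; the point is only that the last step is plain logic, not another model's opinion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Review:
    author: str          # which agent wrote this review
    approves: bool
    evidence: List[str]  # concrete findings or checks performed


def final_verdict(code_author: str, review: Review, meta_review: Review) -> str:
    """Decide the outcome with fixed rules, checked in order."""
    if review.author == code_author:
        return "FAIL: author reviewed their own code"
    if meta_review.author in (code_author, review.author):
        return "FAIL: checker of the checker is not independent"
    if not review.evidence:
        return "FAIL: review has no evidence"
    if not (review.approves and meta_review.approves):
        return "FAIL: a reviewer rejected the change"
    return "PASS"


if __name__ == "__main__":
    review = Review("reviewer-ai", True, ["ran all test files"])
    meta = Review("meta-ai", True, ["review cites executed tests"])
    print(final_verdict("writer-ai", review, meta))
    print(final_verdict("reviewer-ai", review, meta))
```

The first rule is the first principle itself: if the reviewer and the author are the same agent, nothing else about the review is even considered.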

If you’re using AI to write code right now, remember at least this: never let the code-writing AI judge whether its own code is good. Have another person (or another AI) check it. This single rule will help you avoid 90% of the pitfalls.

Over the next 8 episodes, I’ll tell the complete story of how OPC went from a “crash-prone prototype” to a “one-person engineering team.” Every number is real. Every crash is on the record.
