Silicon Team S1E07: When Tools Start Checking Themselves

Silicon Team S1E07

Using OPC to review a product. Full-stack flow, 14 gates checked one by one, to see if it’s ready for 100 users.

Result: all gates PASS.

All gates PASS — and that’s the biggest problem. Because if the checking tool is smart enough to understand the code, it’s smart enough to rationalize the code’s flaws.

You might think this is good news. My first reaction was different: if everything passes, either the product is perfect, or the checks are broken. There’s no such thing as a perfect product.

Sure enough, during the review OPC itself exposed more problems than the product under review. A meta-lesson: tools only reveal structural defects when used intensively.

Peeling the Onion

The problems didn’t surface all at once. Like an onion, they revealed themselves layer by layer.

Layer 1: The loop didn’t know when to stop. The OPC loop finished all units and announced pipeline_complete — but the loop harness didn’t check this status and kept spinning. One session: 50+ idle ticks, each one “checking for the next unit” only to find none.

Like a worker who finished every task but nobody said “you can go home.” Standing there, checking the task list every 30 minutes, finding it empty, continuing to wait. 50+ times.

The fix was simple: add a shouldTerminate: true field to the next-tick output, and have the loop check it before each tick. Classic mechanical gate — you can’t rely on AI to “realize” it should stop.

Layer 2: test-execute was theater. The flow had a test-execute node designed so that “test-design creates test cases, test-execute runs them.” In practice, the orchestrator treated test-execute as another review node — it “reviewed” the test plan and wrote a paragraph saying “tests passed.” It never ran a single command.

This is “aspirational theater” — claiming enforcement with zero mechanical validation. A security checkpoint that looks imposing but has no sensors installed. Everyone walks through, the light is always green.

Layer 3: Gate verdicts were unreliable. Sometimes 3 reviewers all said FAIL, but the Gate ruled PASS — because the last reviewer added “overall acceptable” at the end, and AI anchored on that phrase. Other times all findings were minor, but the Gate ruled ITERATE — overly conservative.

Root cause: verdict determination relied on AI’s “holistic judgment” of findings with no mechanical rule. When findings contained mixed signals, AI was swayed by the last sentence’s tone. In EP02 we designed Gate as mechanical counting logic. But the audit revealed that some gate nodes still relied on LLM holistic judgment — a drift between design intent and implementation. This is precisely what self-audit is for.

Fix: count. N critical findings = forced FAIL. 0 critical + <=2 major = PASS. Everything else = ITERATE. AI generates findings; a rules engine decides the verdict. Take objective judgment out of AI’s hands.

Layer 4: Every test was happy path. 108 tests, every one verifying “the feature works under normal conditions.” Not one verifying “the feature fails correctly under abnormal conditions.” If all tests are “correct input, expect correct output,” they can only tell you “in an ideal world, the code is right.”

Onion cross-section: each problem layer hides in the shadow of the one above

The truly valuable tests are negative paths: wrong input, expect proper error handling. Network drops, expect reasonable fallback. Database full, expect graceful degradation. These are the scenarios that wake you at 3 AM.

Eating Your Own Dog Food

This experience carried a deeper lesson.

During the review, 3 structural gaps in OPC itself were exposed. One fixed on the spot (loop termination guard), two recorded as tech debt (test-execute enforcement, gate verdict reliability).

Then I used OPC to fix OPC. 4 fix units, 8 ticks. Use the tool to review a product, expose gaps in the tool itself, fix the tool, review again. This cycle converges faster than any unit test.

This is the extreme form of dogfooding. Not “we also use this product,” but “we use this product to fix this product.”

But dogfooding has a cognitive boundary: you can’t transcend your own blind spots through introspection. Those 108 happy-path tests are the proof — OPC could spot problems in the product under review, but couldn’t detect its own happy-path bias. Because the bias lives in the structure of the checking logic itself.

It’s like looking in a mirror. A mirror shows you dirt on your face, but it can’t show you cracks in the mirror itself.

You need another mirror to see the cracks. Or you need someone who doesn’t use a mirror to walk over and tell you directly.

In OPC’s world, that “person who doesn’t use a mirror” is mechanical rules. They don’t work through AI judgment — they count, compare, check format. They don’t understand what the code does, but they know whether the code passed the checks.

This is why OPC’s core creed is “mechanical gate > LLM gate.” Anywhere that needs reliable determination — termination, verdict, scope check — must have hardcoded rules. Not because AI isn’t smart enough, but because intelligence itself is a source of bias.

Silicon Team S1: OPC Framework Evolution ← S1E06: $92 for a Product | S1E08: Agent Amnesia →

Silicon Team S1E07: When Tools Start Checking Themselves

Peeling the Onion

Eating Your Own Dog Food

Comments