Silicon Team S1E11: After the Crash

Two cover images. Nine reviewers, all PASS. I opened them—title font shrunk 40%, avatars downsized to thumbnails, visual effects gone.

Gates all green. Output below standard.

This is the first time my safety net failed—not because AI wrote bad code, but because the system that checks code had its own bugs. EP03 covered security audits. EP07 exposed the all-PASS illusion. This episode is about what happens when the gate genuinely doesn’t catch the problem.

The answer isn’t rollback. It’s making the system learn from the crash. Three crashes taught me three rules: declarations must be enforced, silent failures must become explicit blocks, and anything that evaluates must itself be evaluated. Here’s what happened.

Crash One: Gates Green, Output Failing

OPC’s review flow ran the complete pipeline. Nine reviewer agents submitted evaluations. The gate ruled PASS. But one human glance at the cover images revealed the truth: compared to the previous version, the title font was 40% smaller, avatars had shrunk from normal to thumbnail size, and visual effects (gradients, shadows) had vanished.

Why didn’t the gate catch it? Four failures at once:

Failure 1: No skeptic in the room. None of the nine reviewers had “compare against previous version” in their mandate. They each evaluated “is this cover good?” independently. Nobody asked “is this cover better or worse than last time?” A mediocre cover passes when you judge it in isolation. Put it next to the previous version—40% font reduction is immediately obvious. This is a classic review failure mode: when every reviewer evaluates the artifact in isolation, regression is invisible. You need at least one reviewer whose explicit job is regression detection.
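A regression check of this kind can be sketched in a few lines. This is a minimal illustration, assuming each cover can be reduced to a handful of numeric layout metrics; the metric names and the 10% tolerance are hypothetical, not part of OPC.

```python
def regressions(previous: dict[str, float], current: dict[str, float],
                tolerance: float = 0.10) -> dict[str, float]:
    """Return metrics that shrank by more than `tolerance`, with the relative drop."""
    drops = {}
    for name, old in previous.items():
        new = current.get(name, 0.0)  # a metric that vanished counts as a full drop
        if old > 0 and (old - new) / old > tolerance:
            drops[name] = round((old - new) / old, 2)
    return drops


# Hypothetical metrics for the two cover versions described above.
previous_cover = {"title_font_px": 120, "avatar_px": 200, "effect_count": 3}
current_cover = {"title_font_px": 72, "avatar_px": 64, "effect_count": 0}
print(regressions(previous_cover, current_cover))
```

The point of the sketch is the signature: the check takes two versions, not one, so a regression cannot hide behind an acceptable absolute score.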

Failure 2: Declaration without enforcement. The skeptic-owner role (S0 ch03) had mandatory: true in its frontmatter. This was supposed to mean “must be included in every review.” But the orchestrator code never read that field. The flag was declared, never enforced. The safety role participated in zero reviews.

S0 described “declaration vs. enforcement” (ch02) in the context of testing. Now it surfaced in the review system—with worse consequences, because review is the last link in the quality chain.

Failure 3: Test helpers generated empty evals. Unit test helpers that created mock eval files were missing critical score fields. Some tests “passed” by validating blank data—like grading a blank exam paper and giving it full marks.
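One way to stop a blank eval from passing is to validate required fields before anything consumes the file, and to have the test helper run that same validation. A minimal sketch, assuming an eval is a JSON-like dict; the field names (reviewer, score, verdict) and the 1-10 range are illustrative assumptions, not OPC's actual schema.

```python
REQUIRED_FIELDS = ("reviewer", "score", "verdict")  # hypothetical schema


def validate_eval(ev: dict) -> list[str]:
    """Return a list of problems; an empty list means the eval is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in ev]
    score = ev.get("score")
    if "score" in ev and not (isinstance(score, (int, float)) and 1 <= score <= 10):
        problems.append(f"score out of range or not numeric: {score!r}")
    return problems


def make_mock_eval(**overrides) -> dict:
    """Test helper: build a complete eval, apply overrides, refuse invalid results."""
    ev = {"reviewer": "mock-reviewer", "score": 7.0, "verdict": "PASS"}
    ev.update(overrides)
    problems = validate_eval(ev)
    if problems:
        raise ValueError(f"mock eval is invalid: {problems}")
    return ev
```

With this shape, a helper that forgets the score field fails loudly in the helper itself instead of producing a test that passes on empty data.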

Failure 4: Parser regex had four false-positive bugs.

Arrow prefixes (→) wrongly matched as list items
Line detection order skipped severity markers
Table rows parsed as review content
Verdict lines treated as plain text

Four independent bugs that, combined, made the parser unable to correctly extract reviewer scores in certain cases.
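All four bugs are about which pattern gets to claim a line first. A minimal sketch of an order-sensitive line classifier follows; the line formats (a "Verdict: PASS" line, pipe-delimited table rows, bracketed severity tags) are assumptions made for illustration, since the real review format is not shown here.

```python
import re

# Hypothetical line formats, checked most-specific first.
VERDICT_RE = re.compile(r"^\s*verdict\s*:\s*(PASS|FAIL)\s*$", re.IGNORECASE)
TABLE_ROW_RE = re.compile(r"^\s*\|.*\|\s*$")
SEVERITY_RE = re.compile(r"^\s*(?:[-*]\s+)?\[(critical|major|minor)\]", re.IGNORECASE)
ARROW_RE = re.compile(r"^\s*→")
LIST_ITEM_RE = re.compile(r"^\s*[-*]\s+\S")


def classify_line(line: str) -> str:
    """Classify one line of a review; order of checks is the whole point."""
    if VERDICT_RE.match(line):
        return "verdict"      # must win over plain text
    if TABLE_ROW_RE.match(line):
        return "table_row"    # must not leak into review content
    if SEVERITY_RE.match(line):
        return "finding"      # checked before the generic list-item rule
    if ARROW_RE.match(line):
        return "arrow_note"   # an arrow prefix is not a list bullet
    if LIST_ITEM_RE.match(line):
        return "list_item"
    return "text"
```

Each of the four bug classes maps to one negative test: a line that looks like the generic case but belongs to a more specific one.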

The fix (7 hours, $45):

Code-level enforcement for skeptic-owner—not frontmatter declaration, but a hard check in transition code. Review without skeptic-owner? Transition refused.
Fixed test helpers to generate all required eval fields.
Fixed all four regex bugs individually.
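The first fix can be sketched as a hard check in the transition step. This is a minimal illustration under stated assumptions: role frontmatter is already parsed into dicts, and the names TransitionRefused, mandatory_roles, and check_transition are hypothetical, not OPC's actual API.

```python
class TransitionRefused(Exception):
    """Raised when a review may not advance to the gate."""


def mandatory_roles(role_configs: dict[str, dict]) -> set[str]:
    """Collect roles whose frontmatter declares mandatory: true."""
    return {name for name, cfg in role_configs.items() if cfg.get("mandatory") is True}


def check_transition(role_configs: dict[str, dict], submitted_by: set[str]) -> None:
    """Refuse the transition unless every mandatory role submitted an eval."""
    missing = mandatory_roles(role_configs) - submitted_by
    if missing:
        raise TransitionRefused(f"mandatory reviewers missing: {sorted(missing)}")
```

The declaration still lives in the config. What changes is that code reads it on every transition, so a missing skeptic-owner stops the flow instead of going unnoticed.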

Declaration ≠ Enforcement: mandatory: true in a config file. You assumed it worked. It never did.

The most dangerous class of bug in the whole system: you think the safety net is there, but it isn’t. mandatory: true in a config file feels like protection. It’s not protection until code enforces it. Every config declaration without a corresponding code check is a ticking time bomb.

Crash Two: The Harness Crashes on Itself

The second crash has a certain irony—I was using OPC to test OPC’s own code.

opc-harness synthesize is a core command that aggregates multiple reviewer evaluations into a single verdict. Normally it outputs JSON. But when the eval directory was empty (no reviewers submitted), it wrote errors to stderr and left stdout blank.

The orchestrator called synthesize, then tried JSON.parse(stdout). An empty string isn’t valid JSON, so the parse failed. But the failure was silent: the orchestrator didn’t halt. It got an error and moved on to the next step. A quality checkpoint effectively said “I can’t check,” and the system treated it as “check passed.”

How it was found: Running OPC’s own integration tests through the OPC loop, a negative-path test case (specifically designed to exercise error conditions) called synthesize on an empty directory. The silent failure surfaced immediately.

This is the value of dogfooding—bugs invisible on the happy path appear when you’re your own user. EP07 covered the all-PASS illusion from reviewer quality issues. This was deeper—the harness toolchain itself failed silently.

The fix: Modified four error exit paths to always output valid JSON on stdout. Introduced a new verdict semantic: BLOCKED—meaning “I cannot assess,” distinct from FAIL (“I assessed it, it failed”) and PASS (“I assessed it, it passed”).

BLOCKED solved a longstanding ambiguity: before this, encountering an unassessable state either crashed the system or defaulted to PASS. Now the orchestrator has a correct third option—encounter BLOCKED, pause and escalate. Not skip. Pause. This is a surprisingly common pattern in automated systems: the absence of a “can’t assess” state forces every ambiguous situation into either pass or fail, and the system’s default almost always favors “pass” because it’s less disruptive. Silence becomes consent.
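On the orchestrator side, the three-verdict rule can be sketched as follows. This is a Python illustration rather than OPC's code (the original orchestrator appears to use JavaScript's JSON.parse); the "verdict" key and the action names are assumptions.

```python
import json

VERDICTS = {"PASS", "FAIL", "BLOCKED"}


def read_verdict(stdout: str) -> str:
    """Parse harness stdout; anything unreadable becomes BLOCKED, never PASS."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return "BLOCKED"
    verdict = payload.get("verdict") if isinstance(payload, dict) else None
    if isinstance(verdict, str) and verdict in VERDICTS:
        return verdict
    return "BLOCKED"


def next_action(verdict: str) -> str:
    """Map a verdict to what the loop does next."""
    return {"PASS": "advance", "FAIL": "rework", "BLOCKED": "pause_and_escalate"}[verdict]
```

The design choice is that the default branch is BLOCKED. An empty string, malformed JSON, or an unknown verdict all land on pause and escalate, so silence can no longer be read as consent.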

New rule: CLI protocol invariant—stdout is always valid JSON, diagnostics go to stderr. Simple principle. Four exit paths violated it before the fix.
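The invariant is easiest to hold when there is exactly one place that writes to stdout. A minimal sketch, assuming evals are *.json files with a "verdict" key and using a placeholder aggregation rule (all PASS means PASS); neither assumption comes from opc-harness itself.

```python
import json
import sys
from pathlib import Path


def synthesize(eval_dir: Path) -> dict:
    """Aggregate evals; return BLOCKED when there is nothing to assess."""
    files = sorted(eval_dir.glob("*.json")) if eval_dir.is_dir() else []
    if not files:
        return {"verdict": "BLOCKED", "reason": "no evaluations found"}
    verdicts = [json.loads(f.read_text(encoding="utf-8"))["verdict"] for f in files]
    # Placeholder rule for illustration only.
    return {"verdict": "PASS" if all(v == "PASS" for v in verdicts) else "FAIL"}


def main(argv: list[str]) -> int:
    """Single exit path: diagnostics to stderr, one JSON object to stdout."""
    try:
        result = synthesize(Path(argv[1]))
    except Exception as exc:  # every error path still ends in valid JSON
        print(f"synthesize failed: {exc!r}", file=sys.stderr)
        result = {"verdict": "BLOCKED", "reason": type(exc).__name__}
    print(json.dumps(result))
    return 0 if result["verdict"] == "PASS" else 1
```

A script entry point would call sys.exit(main(sys.argv)). Because the only print to stdout sits after the try block, a missing argument, an unreadable file, or malformed JSON all produce a BLOCKED object rather than blank output.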

Crash Three: Eval Needs Its Own Eval

The first two crashes had clear failure events. The third was different—it surfaced gradually over a month.

The trigger: while doing the EP09 cover image review, I spot-checked a few reviewer scores. One reviewer gave a cover 8.5/10—“harmonious color palette, clean layout.” I opened the cover: the title was cropped outside the safe zone, completely unreadable on mobile. My score? 4/10 at best.

This wasn’t isolated. Checking several more review rounds revealed a pattern: reviewer scores often diverged from human judgment. Some reviewers rubber-stamped everything. Others hallucinated nonexistent problems and scored low.

But I had no mechanism to systematically measure how large these gaps were. I could see the cover image was bad (because I looked), but I couldn’t quantify how inaccurate the reviewer scores were overall. The eval system evaluated outputs, but nobody evaluated the eval system.

The solution: External rubric—a structured evaluation framework with dimension → criteria → evidence hierarchy. Each dimension has explicit scoring standards. Reviewers must cite concrete evidence for every score.
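The dimension → criteria → evidence hierarchy can be sketched as plain data plus one check. The class and field names below are hypothetical, chosen for illustration; the real rubric format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str
    standard: str  # what a high score on this criterion requires


@dataclass
class Dimension:
    name: str
    criteria: list[Criterion]


@dataclass
class CriterionScore:
    dimension: str
    criterion: str
    score: float
    evidence: list[str] = field(default_factory=list)


def unsupported_scores(rubric: list[Dimension], scores: list[CriterionScore]) -> list[str]:
    """Return problems: scores for unknown criteria, or scores citing no evidence."""
    known = {(d.name, c.name) for d in rubric for c in d.criteria}
    problems = []
    for s in scores:
        if (s.dimension, s.criterion) not in known:
            problems.append(f"{s.dimension}/{s.criterion}: not in rubric")
        elif not any(e.strip() for e in s.evidence):
            problems.append(f"{s.dimension}/{s.criterion}: no evidence cited")
    return problems
```

Under a rubric like this, an 8.5 justified only by “harmonious color palette” would have to name the criterion it scores and the evidence behind it, which makes a cropped title much harder to overlook.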

Key design decision: the rubric is an “informational sidecar.” It doesn’t change core eval logic. Synthesize and gate mechanics remain untouched; rubric scores are supplementary data, not replacements. Worst case: you get an extra set of possibly inaccurate reference scores. It can’t break existing gate decisions. This follows a principle I’ve come to rely on: when adding a new evaluation layer, make it an addition rather than a replacement. The moment you replace your working gate with an untested rubric, you’ve traded a known-imperfect system for an unknown-imperfect system.

Two meta-evaluation mechanisms were added:

Version drift detection: Rubric versions tracked in flow-state. If the rubric changes, old scores can’t be compared to new scores—preventing apples-to-oranges comparisons.
Convergence warning: If the same output gets scores with variance < 0.5 over three consecutive rounds, stop iterating. Why 0.5 and 3 rounds? On a 1-10 scale, variance below 0.5 means the standard deviation of the scores is under about 0.71 points, so reviewers largely agree and continuing is unlikely to surface new perspectives. Three rounds filters out coincidental convergence (two rounds could be luck). These aren’t theoretically derived numbers; they’re empirical values from running dozens of review loops.
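Both mechanisms are small enough to sketch. The version below assumes the variance is computed across reviewer scores within each round, and uses population variance; both are interpretations, since the text does not pin them down. Function names are hypothetical.

```python
from statistics import pvariance


def comparable(old_rubric_version: str, new_rubric_version: str) -> bool:
    """Version drift check: scores are comparable only under the same rubric version."""
    return old_rubric_version == new_rubric_version


def should_stop(rounds: list[list[float]], threshold: float = 0.5, window: int = 3) -> bool:
    """True when each of the last `window` rounds has score variance below `threshold`."""
    if len(rounds) < window:
        return False
    return all(pvariance(scores) < threshold for scores in rounds[-window:])


history = [
    [4.0, 8.5],       # wide disagreement
    [7.0, 7.5, 8.0],  # converging
    [7.0, 7.0, 7.5],
    [7.5, 7.5, 8.0],
]
print(should_stop(history))
```

Note that low variance only says the reviewers agree with each other. It says nothing about whether they agree with a human, which is why this warning sits alongside the rubric rather than replacing it.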

The lesson: When your system has an eval layer, ask one question—who evaluates the evaluator? If the answer is “nobody,” you have an unobservable quality risk. Meta-evaluation doesn’t need to be complex—a rubric sidecar with version tracking is enough. The key is making it exist.

The Common Pattern

Three crashes that look different—reviewer configuration, CLI protocol, evaluation framework. But they share one underlying pattern:

Error paths are the biggest blind spot.

The happy path was tested extensively: reviewers submit evaluations, synthesize aggregates, gate decides. But what if a reviewer is absent? What if synthesize gets empty input? What if the scoring standard itself is unreliable?

These questions aren’t in the “functional correctness” category. The functions work correctly on normal input. The failures are in the spaces between functions—boundary conditions, missing states, meta-level quality assurance.

EP03’s security audit examined the system from the outside. These three crashes came from the inside, with the system exposing its own blind spots during operation. The distinction matters: external audits find the bugs you can predict; internal crashes find the bugs you can’t. Both are necessary, but the internal ones are more valuable precisely because they’re unpredictable.

The repairs weren’t patches. Each one changed a rule:

Declaration → Enforcement: Configuration constraints became code-level checks
Silent failure → Explicit blocking: “Can’t judge” changed from “skip” to “pause and escalate”
Evaluate outputs → Also evaluate evaluators: A meta-evaluation layer to measure reviewer quality itself

S0 described “gates as the last line of defense” (ch03). Now I’d add: gates break too. When they do, you don’t need more gates—you need mechanisms to detect whether gates are working. That’s meta-evaluation.

The Ledger

Crash	Time	Cost	What It Fixed
Crash 1: Gates Green	7 hours	$45	4 parser fixes + mandatory role enforcement
Crash 2: Harness Crash	~3 hours	~$15	4 exit path fixes + BLOCKED verdict
Crash 3: Eval Black Box	~2 hours	~$10	External rubric + 2 meta-eval mechanisms
Total	~12 hours	~$70

Worth it? Consider the alternative: without fixing Crash 1, the next bad cover still gets nine PASS votes, and you still need human eyes, so what are the reviewers for? Without fixing Crash 2, every empty directory silently passes, and every downstream decision built on that pass is unreliable. Without fixing Crash 3, reviewer scores keep diverging from human judgment and nobody can say by how much.

Recovery isn’t rolling back to the last “good version.” Rollback addresses symptoms. True recovery makes the system learn new rules. The system after three crashes is stronger than before—not because it was tested more, but because it actually broke.

How Far Can Verification Go

The three crashes exposed a deeper question: who verifies the verification system itself?

This is meta-evaluation — evaluating the evaluators. OPC’s response is layered:

Layer 1: Mechanical rule self-checking. 60 rules are backed by 450 assertions. Rules are code, assertions verify code behavior. This layer is deterministic: pass is pass, fail is fail.

Layer 2: Statistical calibration. Cohen’s kappa measures how much two AI reviewers agree beyond what chance alone would produce. If kappa for two independent reviewers on the same code set falls below 0.6, the review criteria aren’t clear enough: that’s the rubric’s problem, not the reviewers’.
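Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each reviewer's label frequencies. A minimal sketch with made-up verdicts from two reviewers; the data is illustrative only.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative data: the reviewers agree on 8 of 10 items.
reviewer_a = ["PASS"] * 6 + ["FAIL"] * 4
reviewer_b = ["PASS"] * 5 + ["FAIL"] * 4 + ["PASS"]
kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3), "rubric needs work" if kappa < 0.6 else "ok")
```

In this example the raw agreement is 80%, but chance alone would produce 52%, so kappa comes out near 0.58 and falls under the 0.6 threshold. That gap between raw agreement and kappa is the reason to use a chance-corrected measure when most verdicts are PASS.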

Layer 3: Dogfooding loop. Use OPC for real projects → expose OPC’s problems → use OPC to fix OPC → repeat. This episode’s three crashes are the output of dogfooding cycles.

There is no Layer 4. Meta-evaluation can’t recurse infinitely. At some point, you must accept a fact: a system cannot fully verify itself. Think of it as a loose engineering analogue of Gödel’s incompleteness theorems. You can make enough verification layers with broad enough coverage that the problems slipping through become fewer and fewer, but not zero.

The S1 Ledger

Eleven episodes. One month. Approximately $2,000 in API costs — including all loop runs, reviews, test generation, and dogfooding cycles.

What did $2,000 buy? Not a perfect framework. Crashes one through three made that clear — OPC still has blind spots. $2,000 bought a repeatable quality assurance process and a set of rules that grew from practice.

Eleven episodes strung together:

EP01: Writing code ≠ engineering. AI is an intern that codes extremely fast, but it doesn’t understand “why we need tests” or “why we need reviews.” OPC’s story begins with this contradiction.

EP02: A one-person engineering team. Four roles (builder, reviewer, tester, adjudicator), 14 nodes. The one who does the work doesn’t judge their own work.

EP03: From “it runs” to “I trust it.” Security audit from 47 to 90. 60 enforcement rules. Don’t make AI better, make bad outcomes smaller.

EP04: Growing a skeleton. From hardcoded to capability contracts. Joints are extension points; you can’t hang muscles in the middle of a bone.

EP05: AI works while you sleep. Tick-based loops. AI excels at incremental polishing, not creative leaps. One human sentence > 8 hours of AI work.

EP06: $92 bought a product. Loops execute plans, they don’t generate plans. Direction is set outside the loop.

EP07: When tools check themselves. All PASS = the biggest problem. You can’t transcend your own blind spots through introspection. Mechanical gate > LLM gate.

EP08: AI ran for 8 hours and forgot who it was. Context compaction causes memory loss. Filesystem > AI memory. PreCompact / PostCompact dual hooks.

EP09: Don’t make AI better, make bad outcomes smaller. Gates are floors, not sources. AI can make ordinal judgments, not cardinal ones. Gates count — they don’t judge.

EP10: When humans should step in. 125 hours of autonomous looping, five intervention signals. Direction, context, protocol, environment, termination — autonomous ≠ unattended.

EP11 (this episode): After the crash. Gates break too. Declaration ≠ enforcement, silent failure → explicit blocking, evaluate outputs → also evaluate evaluators.

This isn’t an “AI replaces humans” story. It’s a “one person used AI to do what previously required a team” story. The difference: the human is always present. The human sets direction, makes judgment calls, guards taste. AI does execution, polishing, and the work that’s clearly defined, verifiable, and repetitive.

This process lets one person work like a team. It doesn’t rely on AI’s self-discipline — it enforces checks with code. It doesn’t assume AI will do the right thing — it assumes AI will make mistakes, then ensures the consequences of those mistakes are contained.

What will Season 2 cover? The application layer on top of OPC — once you have a reliable engineering pipeline, what can you build with it? First story: one-sentence requirement, $92, 23 hours, a family AI calendar.
