Silicon Team S2E06: Three Reviewers and Not One Noticed the Wrong Color

EP05 built the extension system — capability contracts stopped the core from getting fatter. The next question: now that the extension framework exists, what real problems can it solve?

The first answer came from the education tool’s frontend refactor. After the refactor, a standard review ran: the security, backend, and frontend reviewers all returned PASS. The code had no bugs, the logic was correct, and all tests passed.

I opened the page myself, felt something was off but couldn’t pinpoint it. After zooming in on screenshots and comparing, I found it: the blue in the new component differed from the design system’s blue by 7 degrees of hue. Nearly invisible on a small screen. But on a large screen, the drift accumulated — the entire interface felt tonally inconsistent, as if two different people had painted it.

bg-blue-600 versus bg-blue-700: one digit apart in code, seven degrees of hue apart on screen. The security reviewer doesn’t check which CSS color classes are chosen. The backend reviewer doesn’t look at frontend rendering results. The frontend reviewer examines component logic and responsive layout. Nobody’s job description includes “compare rendered output against design system color values.”

Code review has a structural blind spot: reviewers read code, they don’t look at rendered results.

This wasn’t the first time. In EP01, Tailwind’s default styles made the product look like an admin dashboard, and I only caught that by looking at it myself. EP03 added the skeptic-owner to solve direction blind spots, but the skeptic-owner doesn’t look at colors either: it cares about “is the direction right,” not “is the blue right.” A new review capability was needed, and it couldn’t rely on human eyes. I can’t open every page full-screen for a visual comparison after every commit.

Design Intelligence’s Three-Layer Architecture

Design Intelligence (DI) was the first heavyweight application of the extension system built in EP05. It’s not a generic “check if the design looks good” tool. It has three layers, each solving a specific class of design review problem.

Layer 1: Palette guardrail. Mechanical checks, no LLM involved.

Layer 2: AI smell detectors. 28 specific detectors, each checking for one design anti-pattern.

Layer 3: Reference layer. Extracting the generator’s design knowledge into consultable documentation.

The three layers progress from shallow to deep: the first counts (is this color in the allowed range?), the second pattern-matches (is this combination an anti-pattern?), the third consolidates knowledge (why is this design decision correct?).

Palette Guardrail

A design system’s colors aren’t chosen randomly. The token hierarchy has four levels (base palette, semantic colors, component colors, motion colors), each with explicit mapping relationships. What color is primary-500, which palette value does danger map to, how much does a button’s hover state shift: these are all deterministic rules.
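A minimal sketch of how such a hierarchy can be resolved mechanically. The table names, token names other than primary-500 and danger, and all hex values are hypothetical illustrations, not the project’s real tokens, and the motion level is left out for brevity.

```python
# Hypothetical token tables; every name and hex value here is illustrative only.
BASE = {"primary-500": "#2563eb", "red-600": "#dc2626"}
SEMANTIC = {"action": "primary-500", "danger": "red-600"}
COMPONENT = {"button-bg": "action", "alert-border": "danger"}


def resolve(token: str) -> str:
    """Follow component -> semantic -> base references until a raw hex value is reached."""
    seen = set()
    while not token.startswith("#"):
        if token in seen:
            raise ValueError(f"cyclic token reference: {token}")
        seen.add(token)
        for table in (COMPONENT, SEMANTIC, BASE):
            if token in table:
                token = table[token]
                break
        else:
            # The token is not registered at any level of the hierarchy.
            raise KeyError(f"unregistered token: {token}")
    return token
```

Because every step is a table lookup, `resolve("button-bg")` always lands on the same palette value, which is what makes the rules checkable without judgment.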

The palette guardrail does something simple: scan all color values in the project and compare them against the design system’s registered values. If a color isn’t in the palette’s registered range, flag a violation.

No LLM needed to judge “does this blue look nice” — mechanical comparison suffices. This philosophy aligns directly with S1E07’s mechanical gates: count, don’t judge. Whether a color value falls within the allowed range is a factual question, not an aesthetic one.

In implementation, DI scans the project’s CSS variables, Tailwind configuration, and design token files during the startup hook phase, building an allowlist of “what colors this project may use.” During the review phase, it checks file by file and reports any color values not on the allowlist. The frontend reviewer doesn’t see this layer — they’re looking at component logic, not color compliance.

If DI had been online during the education tool refactor, the 7-degree hue drift would have been mechanically caught during review — because the drifted color value isn’t in the design system’s palette. No intelligence needed, just a table and one comparison.
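The “table and one comparison” idea can be sketched as follows. This is not DI’s actual code: the function names are assumptions, the allowlist is passed in directly rather than built from CSS variables or Tailwind configuration, and only 3- and 6-digit hex colors are recognized.

```python
import re

# Matches #rgb and #rrggbb; 4- and 8-digit forms with alpha are deliberately skipped.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def normalize(hex_color: str) -> str:
    """Lowercase and expand #abc to #aabbcc so comparisons are exact."""
    h = hex_color.lower().lstrip("#")
    if len(h) == 3:
        h = "".join(ch * 2 for ch in h)
    return "#" + h


def find_violations(source: str, allowlist: set[str]) -> list[tuple[int, str]]:
    """Return (line_number, color) for every hex color that is not on the allowlist."""
    allowed = {normalize(c) for c in allowlist}
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            color = normalize(match.group())
            if color not in allowed:
                violations.append((lineno, color))
    return violations
```

Given a stylesheet where one rule uses a value outside the registered set, the function reports that line and nothing else. There is no threshold to tune and no model to call.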

28 AI Smell Detectors

The palette guardrail solves “are the colors right,” but design problems extend beyond color. Inconsistent spacing, confused font hierarchy, insufficient contrast, unusual animation duration, uneven component density — these can’t be solved by mechanical comparison because they’re not “right or wrong” questions but “is this combination good” pattern recognition questions.

DI’s second layer is 28 specific AI smell detectors, each checking for one specific design anti-pattern. Not “have an LLM look at the overall feel” — 28 independent check rules, each with explicit trigger conditions and fix suggestions.

A few examples:

Contrast detection: Does the contrast between text and background meet WCAG AA level (4.5:1)? EP01’s accessibility wall was this type of problem — Apple-style rgba(60,60,67,0.36) had a contrast ratio of just 1.7:1. Mechanical tools (axe-core) miscalculate with semi-transparent overlays; DI uses a vision language model to judge from rendered results.

Spacing consistency detection: Between same-level elements, is spacing consistent and on the 4pt grid? When writing code, it’s easy to use p-4 in one place and p-5 in another. That is a 4px difference, barely visible to the eye, but it creates a subtle disruption in overall visual rhythm.

Font hierarchy detection: Does the size and line-height relationship between title-subtitle-body remain consistent? When three different heading styles appear on one page, the design system isn’t being followed correctly.

Component density detection: Are interactive elements within a region too densely packed? Especially critical on mobile — buttons too close together cause mistaps.
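For the contrast example, the arithmetic behind the 4.5:1 threshold is the standard WCAG 2.x relative-luminance formula, sketched below. The function names are mine, and the sketch assumes a single translucent foreground over one known opaque background. Stacked translucent layers, where the effective background is unknown, are the case the article says needs rendered output instead.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb) -> float:
    """Relative luminance of an opaque (r, g, b) color."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def composite(fg_rgba, bg_rgb):
    """Alpha-blend a translucent (r, g, b, a) foreground over an opaque background."""
    *rgb, alpha = fg_rgba
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(rgb, bg_rgb))


def contrast_ratio(fg_rgb, bg_rgb) -> float:
    """WCAG contrast ratio between two opaque colors, always >= 1."""
    hi, lo = sorted((luminance(fg_rgb), luminance(bg_rgb)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


def passes_aa(fg_rgba, bg_rgb, threshold: float = 4.5) -> bool:
    """Check a translucent text color against an opaque background at WCAG AA."""
    return contrast_ratio(composite(fg_rgba, bg_rgb), bg_rgb) >= threshold
```

With these definitions, `passes_aa((60, 60, 67, 0.36), (255, 255, 255))` is false. The exact ratio depends on the background the text sits on, which the article does not state, so the sketch does not try to reproduce the 1.7:1 figure.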

Each detector runs independently, reports independently. One detector’s misjudgment doesn’t affect other detectors’ results. This design came from EP05’s circuit breaker lesson — one bad extension shouldn’t kill the whole pipeline; similarly, one inaccurate detector shouldn’t contaminate the entire design review.
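The isolation property can be sketched like this. The runner, the `Finding` type, the page dictionary shape, and both detectors are hypothetical stand-ins, not DI’s real interfaces. One detector mirrors the p-4 versus p-5 example, the other fails on purpose.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    detector: str
    message: str


def mixed_sibling_spacing(page: dict) -> list[str]:
    """Flag same-level elements that use more than one spacing value (in px)."""
    values = sorted(set(page["sibling_spacings_px"]))
    if len(values) > 1:
        return [f"siblings mix spacing values: {values}"]
    return []


def broken_detector(page: dict) -> list[str]:
    """Simulates a detector with a bug, to show that the failure stays contained."""
    raise RuntimeError("simulated detector bug")


def run_detectors(page: dict, detectors: dict):
    """Run every detector; a crash in one is recorded and never stops the others."""
    findings, errors = [], []
    for name, detect in detectors.items():
        try:
            findings.extend(Finding(name, msg) for msg in detect(page))
        except Exception as exc:  # isolate the failure to this detector only
            errors.append(Finding(name, f"detector failed: {exc}"))
    return findings, errors
```

Running both detectors on a page with spacings of 16px, 20px, and 16px yields one spacing finding and one recorded error. The broken detector neither hides the finding nor aborts the run.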

Reference Layer: Turning Generator Knowledge Into Documentation

DI’s first two layers solve “checking” — finding problems. The third layer solves a deeper problem: design knowledge locked inside generator code.

OPC has a demo generator — generate-demos.py, 3,766 lines. It knows how to generate attractive demo pages: which color schemes to use, how to lay things out, which component combinations look good together. But all this knowledge exists as if-else statements and template strings within the code.

If a different coding agent were to generate a demo, it wouldn’t know these rules. It would use its training data’s default choices — the same problem as EP01’s calendar grid.

The reference layer’s approach: extract the generator’s implicit knowledge into explicit documentation, placed in a references/ directory. The document format is structured: each design rule has what (the rule itself), why (the reasoning), example (correct and incorrect examples), and enforce (how to check mechanically).
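One way to picture a single entry in that format is a small record with the four fields. The class name, field names beyond what/why/example/enforce, and the sample rule text are assumptions for illustration. The article does not show the actual file layout under references/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignRule:
    what: str          # the rule itself
    why: str           # the reasoning behind it
    example_good: str  # a correct example
    example_bad: str   # an incorrect example
    enforce: str       # how to check it mechanically

    def to_text(self) -> str:
        """Render the rule as plain text, one labeled field per line."""
        return "\n".join([
            f"what: {self.what}",
            f"why: {self.why}",
            f"example (correct): {self.example_good}",
            f"example (incorrect): {self.example_bad}",
            f"enforce: {self.enforce}",
        ])


# Hypothetical sample entry, written for this sketch only.
palette_rule = DesignRule(
    what="Use only colors registered in the design system palette.",
    why="Off-palette shades make large screens look tonally inconsistent.",
    example_good="background: var(--primary-500)",
    example_bad="background: a raw hex value that is not in the palette",
    enforce="Scan color values and compare them against the palette allowlist.",
)
```

Keeping `enforce` as a required field pushes each rule toward something a mechanical check can verify, rather than a preference that only reads well.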

This isn’t writing a design spec — a design spec describes “what you should do.” The reference layer describes “why the generator does what it does” — its underlying reasoning process. When a new coding agent needs to generate a similar demo, it can consult the reference layer instead of reinventing the design judgments already distilled in those 3,766 lines of code.

The reference layer’s limitations are obvious: it only records my own (and my demo-generating agent’s) design preferences, not universal design principles. If someone else has different aesthetic sensibilities, the reference layer needs to be rewritten. This is an inherent constraint of a single-person tool — the tool embodies one person’s judgment standards, not industry standards.

DI as Extension System Stress Test

DI was the first heavyweight user of the extension system from EP05. It uses three hook types at once: the startup hook scans design system configuration, the pre-dispatch hook injects design scoring criteria for reviewers, and the execute hook runs the vision language model for screenshot comparison. It has six internal modules working together, plus a dependency on an external Python vision language model.
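A rough sketch of one extension occupying all three phases. The `HookRegistry` class, the phase identifiers, and the context dictionary are invented for illustration. EP05’s real extension API is not shown in this article, and the hook bodies are placeholders for the actual scanning and vision-model work.

```python
from collections import defaultdict


class HookRegistry:
    """Hypothetical registry: extensions attach functions to named phases."""

    PHASES = ("startup", "pre_dispatch", "execute")

    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, phase: str):
        if phase not in self.PHASES:
            raise ValueError(f"unknown phase: {phase}")

        def decorator(fn):
            self._hooks[phase].append(fn)
            return fn

        return decorator

    def run(self, phase: str, context: dict) -> dict:
        for fn in self._hooks[phase]:
            fn(context)
        return context


registry = HookRegistry()


@registry.on("startup")
def scan_design_system(ctx: dict) -> None:
    # Placeholder for scanning CSS variables, Tailwind config, and token files.
    ctx["allowlist"] = {"#2563eb"}


@registry.on("pre_dispatch")
def inject_criteria(ctx: dict) -> None:
    # Placeholder for the design scoring criteria handed to reviewers.
    ctx["reviewer_criteria"] = ["colors must be on the palette allowlist"]


@registry.on("execute")
def compare_screenshots(ctx: dict) -> None:
    # Placeholder for the vision language model call on rendered screenshots.
    ctx["visual_findings"] = []


def run_review() -> dict:
    """Run the three phases in order, sharing one context between them."""
    ctx: dict = {}
    for phase in HookRegistry.PHASES:
        registry.run(phase, ctx)
    return ctx
```

Even in this toy form, the later hooks depend on what the startup hook put into the shared context, which hints at how quickly coupling can build up inside a single extension.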

Building DI exposed the first boundary of capability contracts: when an extension is complex enough, it needs internal modularization itself. EP05’s extension system solved the separation between core and domain logic, but it didn’t solve complexity within domain logic. The dependencies among DI’s six internal modules mirrored the spaghetti problem the core had before EP05’s split. The complexity had simply been pushed from the core into an extension.

The solution was inelegant but effective: DI uses folders to separate modules internally, modules communicate through explicit function signatures, and each module has independent tests. This is decoupling at the level of engineering discipline, not at the framework level. For a one-person tool, discipline is enough where heavier architecture isn’t warranted.

Design Isn’t Purely Subjective

This episode’s core finding: a portion of design review can be mechanized.

Whether a color is in the palette range — factual judgment. Whether contrast is sufficient — numerical calculation. Whether spacing is consistent — pattern matching. Whether font hierarchy is confused — rule checking. None of these require aesthetic judgment, just rules and comparison.

Of course, design does have genuinely subjective parts — the emotional feel of a color scheme, the rhythm of typography, the overall tone. DI doesn’t touch those. The palette guardrail and smell detectors solve “baseline” problems: they can’t guarantee things look good, but they can guarantee certain deterministic errors don’t happen.

EP03 added the Tenth Man to watch direction, EP05 added the extension system to decouple the framework, this episode gave the tool a pair of eyes for design. Three upgrades, three different blind spots — direction, architecture, visual. Products keep exposing shortcomings; shortcomings keep driving tool evolution.

Three reviewers read the same code, all said “no problems.” Not one of them opened a browser to look.
