Silicon Team S2E01: A Family Product in 23 Hours for $92

Previously on Silicon Team: Last season we discovered that AI can write code but doesn’t understand engineering — tests were fake, security had holes, architecture was flawed. So I built OPC: multiple AIs checking and balancing each other. Verified, trustworthy. The natural next question: with a reliable AI toolchain, what products can you actually build?

“Build a family calendar. Family members log in. Image/voice/text input. LLM parses it. Notification reminders.”

One sentence. No PRD, no design spec, no competitive analysis. Just that one sentence, tossed into the OPC loop (an automated AI build pipeline — AI agents autonomously execute tasks in a plan → build → test → review cycle).

Two days later:

23 hours 9 minutes of AI execution time
$92 in API costs (Claude Opus, 95% cache hit — critical for long sessions, more on this later)
44 subagents (each an independent AI agent invocation), 35 git commits
103 vitest unit tests + 52 E2E tests all passing, zero tsc errors

A working family AI calendar MVP on screen. But the process was anything but smooth.

Night One: The Cost of a One-Sentence Spec

After receiving the requirement, the AI’s first move was scaffolding: Next.js 14 + Vercel Postgres + NextAuth (Google OAuth) + Web Push notifications. These choices were fine.

The problem was the frontend. The first version delivered a standard monthly Calendar grid — Monday through Sunday headers, one cell per day, click to pop up an event form. Functionally perfect: add, edit, delete, drag-and-drop. All tests passing.

But I knew it was wrong the moment I opened it.

Who is this for? My parents. When they open their phone, they want to see “what’s happening today” — not a Google Calendar clone. They won’t proactively check a calendar grid every day, just like you don’t proactively check a paper wall calendar. The core problem for a family calendar isn’t “where do events live” but “why would family members open this every day.”

I said one sentence: “The main screen should be an agent feed, not a calendar.”

Honestly, this wasn’t entirely the AI’s fault. The requirement said “build a family calendar” — and the AI built a calendar, correctly executing a vague spec. The root cause of the wrong direction was the requirement itself being underspecified. But AI does have a tendency: it defaults to the most common UI pattern in its training data — Calendar grid appears most frequently, so it’s the default choice. Functionally correct, directionally wrong. Both things are true: the spec should have been more specific, and AI’s default choices need human calibration.

Product direction can only come from humans. AI executes plans — it doesn’t generate them.

After 21 ticks (execution step units in the OPC loop) of work, Calendar was demoted from the main screen to a /calendar subpage.

Rebuilding: The Agent Feed

The new main screen design: today’s overview (HeroHeader) + AI push notification cards (AgentCardList) + today’s tasks (TodayTaskList) + next few days preview (UpcomingChips) + quick input bar (FabBar).

Instead of you checking the calendar, the calendar comes to you.

Image and voice input were also built — but this article focuses on the direction decision and product polish story. The multimodal input technical details are for another time.

13 OPC tasks executed, some in parallel and some in series. 35 commits were auto-committed in the loop. One classic problem surfaced: FabBar’s E2E tests called the real Claude API. A test that depends on a live LLM is slow, costs money on every run, and can fail for reasons unrelated to the code under test; the test-writing agent didn’t realize that “this isn’t a test, this is a prayer.” The review agent caught it during code review and switched the tests to page.route() mocks. This illustrates the principle that “the agent that does the work can’t evaluate its own work”: the E2E-writing agent won’t question its own choice to call a real LLM.
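A minimal sketch of the kind of mock involved. The endpoint path `/api/parse`, the `ParsedEvent` shape, and the `mockParseEndpoint` helper are hypothetical names for illustration, not the project's actual code. The two small interfaces mirror only the subset of Playwright's `page.route()` and `route.fulfill()` used here, so the sketch runs without the Playwright dependency; in a real E2E test you would pass Playwright's own `page` object.

```typescript
// Structural types mirroring the small subset of Playwright's Page/Route API used below.
interface MockableRoute {
  fulfill(response: { status: number; contentType: string; body: string }): Promise<void>;
}
interface RoutablePage {
  route(url: string, handler: (route: MockableRoute) => Promise<void>): Promise<void>;
}

// Hypothetical shape of a parsed event; the article does not show the real schema.
interface ParsedEvent {
  title: string;
  date: string; // ISO date string
  period: "morning" | "afternoon" | "evening";
}

// Intercept the (assumed) parse endpoint and answer with a canned result,
// so the E2E test never reaches a real LLM provider.
async function mockParseEndpoint(page: RoutablePage, canned: ParsedEvent): Promise<void> {
  await page.route("**/api/parse", async (route) => {
    await route.fulfill({
      status: 200,
      contentType: "application/json",
      body: JSON.stringify(canned),
    });
  });
}
```

With the route mocked, the test asserts on how the UI handles a known response instead of on whatever the model happens to return that day.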

LLM Pool: Good Enough Is Good Enough

The family calendar uses LLMs to parse natural language input (“Take mom to the dentist Saturday afternoon” → structured event). A single provider isn’t reliable enough — any API can rate-limit, go down, or enter maintenance at any time, and for a family product “parsing failed, try again later” is indistinguishable from “this thing doesn’t work.” You need fallback. Final implementation: 135 lines of TypeScript, zero external dependencies, three providers auto-switching — Anthropic Opus → Anthropic Sonnet → DashScope Qwen. First one fails, automatically switch to second; second fails, switch to third. When the entire chain fails, it reports each provider’s specific failure reason rather than a generic “service unavailable.”

Not a single any in the code. No LangChain. Why not LangChain? Because for this use case, LangChain would add abstraction complexity without adding value — a linear fallback chain across 3 providers is just a for loop, no framework needed to “manage” it. For a family product, hardcoded three-provider fallback is enough — no registry, no plugin system. This is YAGNI (You Aren’t Gonna Need It), not an architectural conviction. When a 4th provider is needed, that’s when you abstract.
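To make the "just a for loop" point concrete, here is a minimal sketch of a linear fallback chain. It is not the project's 135-line implementation: the `Provider` interface, the `complete` signature, and the error class name are assumptions made for illustration, and real provider calls are left out entirely.

```typescript
// A provider is anything that can turn a prompt into a completion string.
// This interface is an illustrative assumption, not the article's actual type.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

interface ProviderFailure {
  provider: string;
  reason: string;
}

// Thrown only when every provider in the chain has failed; it keeps one
// failure reason per provider instead of a generic "service unavailable".
class AllProvidersFailedError extends Error {
  readonly failures: ProviderFailure[];

  constructor(failures: ProviderFailure[]) {
    super("All providers failed: " + failures.map((f) => `${f.provider}: ${f.reason}`).join("; "));
    this.name = "AllProvidersFailedError";
    this.failures = failures;
  }
}

// Try each provider in order and return the first successful result.
async function completeWithFallback(providers: Provider[], prompt: string): Promise<string> {
  const failures: ProviderFailure[] = [];
  for (const provider of providers) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      failures.push({
        provider: provider.name,
        reason: err instanceof Error ? err.message : String(err),
      });
    }
  }
  throw new AllProvidersFailedError(failures);
}
```

The order of the array is the whole configuration: passing `[opus, sonnet, qwen]` gives the chain described above, and adding a fourth provider is one more array element until that stops being enough.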

Design Language: From Dashboard to Family Product

After the features were working, I opened the app and looked at it — something felt off.

Tailwind’s default styling made it look like an admin dashboard, not a family product you’d open every day. Functionally fine, but the trust wasn’t there. When you open an app, within a fraction of a second you can feel whether “this was made with care” or “this was thrown together.” That’s not mysticism — it’s the subconscious signal from visual system consistency.

I introduced an Apple-inspired design token system: a CSS variable hierarchy (accent color, three-level label hierarchy, 4pt spacing grid), backdrop-blur-xl glassmorphism, a manual dark mode toggle via the dark class, and spring-style elastic animation curves.

The Accessibility Wall

Then WCAG contrast ratios stopped us.

Apple’s secondary label uses rgba(60,60,67,0.36) — contrast ratio of only 1.7:1, far below WCAG AA standard (4.5:1). Worse: axe-core (the accessibility scanner) can’t accurately calculate contrast when semi-transparent colors layer over multiple backgrounds, producing both false positives and false negatives. False positives waste your time fixing non-issues; false negatives let real problems slip through to production. Both erode trust in your automated tooling.

Final solution: all text tokens switched to solid hex (#6e6e75 replacing rgba(60,60,67,0.36)), with rgba reserved only for decorative elements (background blur, shadows). The text color is visibly darker (from 1.7:1 to 4.65:1 contrast), but readability improved dramatically, and axe-core can now calculate correctly. The trade-off: sacrifice a bit of Apple’s original “frosted” aesthetic to ensure the toolchain can reliably enforce the accessibility floor.
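For reference, the WCAG 2.x contrast ratio can be computed by hand. The sketch below follows the published relative-luminance and contrast-ratio formulas; the helper names are my own. The article does not state which background colors its 1.7:1 and 4.65:1 figures were measured against, so any background you pass in is an assumption, and the results will not necessarily match those numbers.

```typescript
type Rgb = [number, number, number];

// Parse "#rrggbb" into 0-255 channels.
function hexToRgb(hex: string): Rgb {
  const n = parseInt(hex.replace("#", ""), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Composite a semi-transparent foreground over an opaque background.
function blend(fg: Rgb, alpha: number, bg: Rgb): Rgb {
  return [
    fg[0] * alpha + bg[0] * (1 - alpha),
    fg[1] * alpha + bg[1] * (1 - alpha),
    fg[2] * alpha + bg[2] * (1 - alpha),
  ];
}

// WCAG 2.x relative luminance of an sRGB color.
function luminance(color: Rgb): number {
  const lin = (c: number): number => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(color[0]) + 0.7152 * lin(color[1]) + 0.0722 * lin(color[2]);
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = luminance(a);
  const lb = luminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Example: compare the two text tokens against an assumed white background.
const white: Rgb = [255, 255, 255];
const translucent = contrastRatio(blend([60, 60, 67], 0.36, white), white);
const solid = contrastRatio(hexToRgb("#6e6e75"), white);
console.log(translucent.toFixed(2), solid.toFixed(2));
```

Note that the translucent token needs a `blend` step with a specific background before it has a contrast ratio at all, while the solid hex token does not. That extra dependency is what makes layered semi-transparent text hard for an automated scanner to evaluate.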

Accessibility isn’t a bonus — it’s the floor. Someone in your family might have poor eyesight. If they can’t read the interface text, this app is useless to them.

To be honest, dark mode accessibility tests were added by the review agent during code review, not from the start. If accessibility is truly “the floor,” it should have been in the initial spec. That’s something to improve.

The Ledger

Metric	Data	Notes
AI execution time	23 hours 9 minutes	Excludes human review time
API cost	$92	Claude Opus, 95% prompt cache hit
Equivalent cost without caching	$276–$460	Same 115M tokens, no cache
Human involvement	~3–4 hours	Direction review, design decisions, quality gates
Subagents	44	Including rework after direction pivot
Git commits	35	Including pre-pivot discarded work
Tests	103 vitest + 52 E2E
Deployment status	Vercel, family testing	Still iterating

$92 is the API cost, not the total cost. The 95% cache hit rate is the lifeline of long sessions: 109M of 115M tokens were read from cache, enabled by the OPC loop’s long-session architecture. If your workflow uses short conversations, cache hit rates will be much lower and costs will move toward the $276–$460 no-cache range.
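A quick arithmetic check on those figures, using only numbers quoted in this article. Per-token pricing is not given here, so the sketch does not try to derive the dollar amounts; it only shows the hit rate and how the no-cache estimate compares with the actual bill. The function and field names are my own.

```typescript
interface CacheSummary {
  hitRate: number; // fraction of tokens read from cache
  lowMultiplier: number; // low end of the no-cache estimate relative to the actual cost
  highMultiplier: number; // high end of the no-cache estimate relative to the actual cost
}

function summarizeCache(
  cachedTokens: number,
  totalTokens: number,
  actualCostUsd: number,
  noCacheLowUsd: number,
  noCacheHighUsd: number,
): CacheSummary {
  return {
    hitRate: cachedTokens / totalTokens,
    lowMultiplier: noCacheLowUsd / actualCostUsd,
    highMultiplier: noCacheHighUsd / actualCostUsd,
  };
}

// Figures from the ledger: 109M of 115M tokens cached, $92 actual, $276 to $460 without cache.
const summary = summarizeCache(109e6, 115e6, 92, 276, 460);
console.log(`cache hit rate: ${(summary.hitRate * 100).toFixed(1)}%`); // 94.8%
console.log(`no-cache estimate: ${summary.lowMultiplier}x to ${summary.highMultiplier}x the actual bill`); // 3x to 5x
```

In other words, the quoted "95%" is 109/115 rounded, and the no-cache range corresponds to paying three to five times the actual bill.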

From a product perspective, the result is a Next.js 14 full-stack app with Google OAuth login, three-way LLM fallback, Apple-inspired UI, dark mode, PWA + Web Push notifications, WCAG AA accessibility, and full test coverage. Perfect? No: the LLM Pool’s require() is tech debt (CommonJS in an ESM project), and dark mode a11y tests were added in review, not from the start.

The app was deployed to Vercel but never actually used by the family. 103 tests passing, tsc zero errors — every technical metric passed. But between MVP and something people actually use daily, there’s a gap that hasn’t been crossed. Tests can verify that features work correctly; they can’t verify that your family wants to open the app every morning.

What the OPC loop gives you isn’t perfection. It’s the execution power to go from one sentence to a running app in two days. Direction must come from you. Standards must be held by you. But all the execution work between those two things is work that AI does faster than you, and it doesn’t get tired at 3 AM. The most expensive lesson from S2E01: when the direction is wrong, the stronger the execution power, the bigger the waste. AI won’t stop itself.
