Spec to Production: My AI Workflow Skill
Ship production-quality code with AI coding agents. A 10-action workflow skill: focus, plan, spec-review, spike, ship, fix, review, done, drop, workflow. Agent-agnostic.
I've been shipping production-quality features faster than ever. Not because I write more code—but because I barely write any.
The difference isn't the AI model. It's the process around it.
Most developers use AI coding agents like a chat window. Ask for code, paste it in, ask for fixes, lose context, repeat. It ships—sometimes. But the quality is inconsistent, the process is chaotic, and nothing carries over between sessions.
I wanted a system where I define what to build, and the agent handles the entire implementation loop until the code is production-ready.
So I built one.
A structured development lifecycle for AI coding agents. 10 actions, from idea to shipped code:
| Action | What It Does |
|---|---|
| focus | Scan the codebase for what to work on next, prioritized by impact |
| plan | Create a spec with acceptance criteria and codebase impact analysis |
| spec-review | Adversarial challenge of your spec before implementation starts |
| spike | Time-boxed exploration for unknowns, go/no-go decision at the end |
| ship | Implement, test, review; loop until all ACs pass and gates are green |
| fix | Scientific debugging with hypothesis-driven investigation and anti-cascade TDD |
| review | Multi-perspective code review (9 perspectives, risk-scaled) |
| done | Final validation, retro, memory update, archive |
| drop | Abandon gracefully, preserve learnings for next time |
| workflow | Show current state and suggest next action |
The core insight: the spec is the source of truth. Everything flows from it—implementation, testing, review, validation.
Don't know what to build next? Ask the codebase with the focus action.
The agent dispatches parallel scans across your entire project—checking code quality, missing tests, security gaps, performance issues, accessibility problems. Results come back scored by impact, effort, and risk-if-ignored.
It produces a prioritized task list and creates specs in specs/backlog/ for the top items.
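To make the scoring idea concrete, here is a minimal Python sketch of turning scan findings into a prioritized list. The `Finding` type, the 1 to 5 scales, and the ranking formula are all assumptions for illustration; the skill's real scoring may work differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    title: str
    impact: int           # 1 (minor) to 5 (major), hypothetical scale
    effort: int           # 1 (quick) to 5 (large), hypothetical scale
    risk_if_ignored: int  # 1 (harmless) to 5 (dangerous), hypothetical scale

def priority(f: Finding) -> float:
    # Hypothetical formula: value gained per unit of effort.
    return (f.impact + f.risk_if_ignored) / f.effort

def prioritize(findings: list, top: int = 3) -> list:
    # Highest priority first; sorted() is stable, so ties keep scan order.
    return sorted(findings, key=priority, reverse=True)[:top]

findings = [
    Finding("Add tests for billing module", impact=4, effort=3, risk_if_ignored=4),
    Finding("Fix unvalidated upload path", impact=5, effort=2, risk_if_ignored=5),
    Finding("Rename helper functions", impact=1, effort=1, risk_if_ignored=1),
]
backlog = prioritize(findings, top=2)
```

Dividing by effort is one design choice among many: it favors cheap, high-value work, which fits a workflow scoped to what ships today.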
Instead of deciding what to work on, you let the codebase tell you.
You don't start with code. You start with a spec.
The agent reads your codebase first: existing patterns, related code, potential conflicts. Then it writes a spec with acceptance criteria and a codebase impact analysis.
The spec passes through 13 validation rules before implementation starts. Too big (>8 hours)? It gets split. Vague acceptance criteria? The agent pushes back.
No spec passes the gate if "done" isn't clearly defined.
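Two of those rules are easy to picture in code. The sketch below is an illustration only: the `Spec` shape and the GIVEN/WHEN/THEN regex used as a vagueness check are assumptions, and it covers just the size limit and the "done must be defined" rule, not all 13.

```python
import re
from dataclasses import dataclass, field

MAX_HOURS = 8  # from the text above: specs over 8 hours get split

# Hypothetical vagueness check: an AC must follow GIVEN/WHEN/THEN.
AC_PATTERN = re.compile(r"\bGIVEN\b.+\bWHEN\b.+\bTHEN\b", re.IGNORECASE | re.DOTALL)

@dataclass
class Spec:
    title: str
    estimate_hours: float
    acceptance_criteria: list = field(default_factory=list)

def validate(spec: Spec) -> list:
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if spec.estimate_hours > MAX_HOURS:
        problems.append("too big: split into smaller specs")
    if not spec.acceptance_criteria:
        problems.append("'done' is undefined: no acceptance criteria")
    for i, ac in enumerate(spec.acceptance_criteria, start=1):
        if not AC_PATTERN.search(ac):
            problems.append(f"AC {i} is vague: use GIVEN/WHEN/THEN")
    return problems

draft = Spec(
    title="Rate limiting",
    estimate_hours=12,
    acceptance_criteria=[
        "It should be fast",
        "GIVEN a user sends 100 requests in 1 minute, WHEN they send request 101, "
        "THEN they receive a 429 status",
    ],
)
problems = validate(draft)
```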
Before writing a single line of code, challenge the plan with the spec-review action.
The agent runs adversarial analysis on your spec.
Catch problems in the spec—not in production.
Not sure about the approach? Run a time-boxed exploration with the spike action.
Hard time limit (default 1h, max 4h). The output is a GO/NO-GO decision—not code. Spike code is throwaway, deleted before proceeding. The learning gets logged.
This prevents committing to an approach before you know it works.
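Once the spec is solid, run the ship action.
The agent enters a loop: implement, test, review, and repeat until all ACs pass and the gates are green.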
You're not pair programming. You're delegating. The agent handles the loop. You review the output.
Two modes adapt to context:
Testing follows an E2E-first protocol. Every acceptance criterion maps to an E2E test by default. Unit tests are reserved for pure functions only. A mock avoidance hierarchy enforces real systems over test doubles:
| Priority | Strategy |
|---|---|
| 1 (best) | Real system (test DB, sandbox API) |
| 2 | Docker/container |
| 3 | In-memory equivalent |
| 4 (last resort) | Mock (only for third-party APIs without sandbox) |
Mocking your own code—services, database, HTTP endpoints, auth—is never allowed. Every mock requires a justification comment explaining why a real system isn't available.
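The hierarchy reads naturally as a selection function. This is a minimal sketch, not the skill's implementation: the `Strategy` enum and `choose_strategy` signature are hypothetical names, and the function only encodes the rules stated above.

```python
from enum import IntEnum

class Strategy(IntEnum):
    REAL_SYSTEM = 1  # test DB, sandbox API
    CONTAINER = 2    # Docker/container
    IN_MEMORY = 3    # in-memory equivalent
    MOCK = 4         # last resort

def choose_strategy(available, third_party_without_sandbox=False, justification=""):
    """Pick the best available strategy; fall back to a mock only when allowed."""
    real_options = set(available) - {Strategy.MOCK}
    if real_options:
        return min(real_options)  # lowest number = highest priority
    if not third_party_without_sandbox:
        raise ValueError("mocking your own code is never allowed")
    if not justification.strip():
        raise ValueError("a mock requires a justification comment")
    return Strategy.MOCK

picked = choose_strategy({Strategy.IN_MEMORY, Strategy.CONTAINER})
fallback = choose_strategy(
    set(),
    third_party_without_sandbox=True,
    justification="payment vendor offers no sandbox",
)
```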
TDD is enforced structurally: every test must fail first (RED_CONFIRMED), then pass after implementation (GREEN_CONFIRMED). The skill tracks this in an E2E scenario registry—no shortcuts.
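A registry check for that ordering could look like the sketch below. The registry shape (a dict of AC ids to an ordered list of status events) is an assumption for illustration; only the two status names come from the skill.

```python
RED, GREEN = "RED_CONFIRMED", "GREEN_CONFIRMED"

def tdd_violations(registry: dict) -> list:
    """Return AC ids that have a GREEN_CONFIRMED with no earlier RED_CONFIRMED."""
    violations = []
    for ac_id, events in registry.items():
        seen_red = False
        for event in events:
            if event == RED:
                seen_red = True
            elif event == GREEN and not seen_red:
                violations.append(ac_id)
                break  # one violation per AC is enough to fail the gate
    return violations

registry = {
    "AC-1": [RED, GREEN],  # valid: failed first, then passed
    "AC-2": [GREEN],       # invalid: never observed failing
    "AC-3": [GREEN, RED],  # invalid: passed before it was seen failing
}
violations = tdd_violations(registry)
```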
Bug fixes have their own dedicated action: fix.
The agent classifies the bug (simple, complex, or highly complex) and adapts its investigation to match.
Every root cause must pass a 3-point validation before any code changes.
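Then Anti-Cascade TDD kicks in: baseline the full test suite, write a failing regression test (RED), implement the fix (GREEN), then compare the full suite to the baseline (DIFF).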
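Anti-Cascade TDD compares the failing tests after a fix against a baseline taken before it. At its core that comparison is a set difference. Here is a minimal sketch, assuming each test run is reduced to a set of failing test names (the names are made up):

```python
def new_failures(baseline_failed: set, after_fix_failed: set) -> set:
    """Tests that fail after the fix but did not fail in the baseline run."""
    return after_fix_failed - baseline_failed

# Baseline taken before touching anything: one known, unrelated failure.
baseline = {"test_legacy_export"}

# Full suite after the fix: the old failure is still there, plus a new one.
after_fix = {"test_legacy_export", "test_checkout_total"}

regressions = new_failures(baseline, after_fix)
fix_is_clean = not regressions
```

Because the regression test is written after the baseline, it also shows up as a new failure if the fix did not actually turn it green.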
The LEARN phase is what makes this compound: after fixing a bug, the agent proposes max 2 prevention rules so the same class of error doesn't recur. Rules are always proposed—never auto-applied—and target your agent's config files.
Every edit batch triggers a quick pass:
| Gate | Scope |
|---|---|
| Lint | Changed files |
| Typecheck | Changed files |
Before marking done, a full pass runs all 6 gates:
| Gate | Scope |
|---|---|
| Lint | Changed files |
| Typecheck | Full project |
| Build | Full project |
| Test | Related tests |
| E2E registry | All acceptance criteria mapped to E2E tests |
| TDD proof | Every GREEN_CONFIRMED has prior RED_CONFIRMED |
Coverage runs as an advisory gate—warns on drops over 5% but never blocks.
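An advisory gate is simply a check whose result never blocks. In this sketch the function name and return shape are hypothetical, and the 5% threshold is read as percentage points, which is an assumption:

```python
def coverage_gate(before_pct: float, after_pct: float, threshold: float = 5.0) -> dict:
    """Advisory check: may produce a warning, never blocks."""
    drop = before_pct - after_pct
    warning = f"coverage dropped {drop:.1f} points" if drop > threshold else None
    return {"blocking": False, "warning": warning}

result = coverage_gate(before_pct=82.0, after_pct=75.5)
```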
The skill auto-detects your tooling. Biome or ESLint? Vitest or Jest? pnpm, yarn, or bun? It reads your config files and figures it out.
No special setup. Production-quality validation from day one.
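One plausible way to do this kind of detection is to look for well-known config and lock files. The sketch below is an assumption about the approach, not the skill's code; the file names are common ones for each tool, not an exhaustive list.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Marker files checked in order; the first tool with a match wins.
MARKERS = {
    "linter": [
        ("biome", ["biome.json", "biome.jsonc"]),
        ("eslint", ["eslint.config.js", ".eslintrc.json", ".eslintrc.js"]),
    ],
    "test_runner": [
        ("vitest", ["vitest.config.ts", "vitest.config.js"]),
        ("jest", ["jest.config.js", "jest.config.ts"]),
    ],
    "package_manager": [
        ("pnpm", ["pnpm-lock.yaml"]),
        ("yarn", ["yarn.lock"]),
        ("bun", ["bun.lockb", "bun.lock"]),
    ],
}

def detect_tool(root: Path, candidates: list) -> Optional[str]:
    for tool, files in candidates:
        if any((root / name).exists() for name in files):
            return tool
    return None  # nothing recognized for this role

def detect_tooling(root: Path) -> dict:
    return {role: detect_tool(root, candidates) for role, candidates in MARKERS.items()}

# Demo against a throwaway directory standing in for a project root.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "biome.json").write_text("{}")
    (project / "pnpm-lock.yaml").write_text("")
    tooling = detect_tooling(project)
```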
Code review runs automatically during the ship loop—and on demand:
| Perspective | Question | When |
|---|---|---|
| Correctness | Does it do the right thing? | Always |
| Security | Is it safe? | Always |
| Reliability | Does it handle failure? | Always |
| Performance | Is it fast enough? | Always |
| DX | Is it pleasant to maintain? | Always |
| Scalability | Shared state, multi-instance? | Conditional |
| Observability | Can you debug in production? | Conditional |
| Testability | Complex logic covered? | Conditional |
| Accessibility | Keyboard, screen reader, contrast? | Conditional |
Review depth scales with risk:
| Scope | Depth |
|---|---|
| 1-2 files, low risk | Quick (5 perspectives) |
| 3-5 files | Standard |
| 6+ files or high risk | Deep (all 9) |
| Deploy context detected | Production mode |
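The table can be read as a small decision function. The precedence here is an assumption: deploy context wins, then high risk or 6+ files, and anything that is neither small-and-low-risk nor deep falls to standard. The table itself does not say how the rows combine.

```python
def review_depth(files_changed: int, risk: str = "low", deploy_context: bool = False) -> str:
    if deploy_context:
        return "production"
    if risk == "high" or files_changed >= 6:
        return "deep"      # all 9 perspectives
    if files_changed >= 3 or risk != "low":
        return "standard"
    return "quick"         # the 5 always-on perspectives

depths = [
    review_depth(2),
    review_depth(4),
    review_depth(1, risk="high"),
    review_depth(1, deploy_context=True),
]
```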
When everything passes, run the done action for final validation.
Then a retro runs automatically.
The agent proposes memory updates—coding patterns, project conventions, anti-patterns discovered. These get written to your agent config so the next session starts smarter.
Spec archives to specs/shipped/. History logged. Knowledge retained.
Sometimes a feature doesn't work out. The drop action captures why it was abandoned, preserves reusable pieces, documents "if revisited" lessons, and archives the spec to specs/dropped/.
No silent abandonment. Every dropped feature teaches the next one.
The spec defines what production-ready means before code exists. Acceptance criteria are testable—each one maps to an E2E test with TDD proof (RED_CONFIRMED before GREEN_CONFIRMED). Scope items trace to ACs. The agent validates against the spec—not against vibes.
This means quality is structural, not accidental.
Everything is scoped to what ships today. Features over 8 hours get split. The tiering system enforces it:
| Size | Ceremony |
|---|---|
| < 5 LOC | None—just do it |
| < 30 LOC | Inline comment spec |
| < 100 LOC | Mini template |
| 100+ LOC | Full spec with state machines |
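The tiers map cleanly onto a threshold lookup. This is a minimal sketch using `bisect`; how the skill estimates lines of code up front is not described here, so the `loc` input is taken as given.

```python
from bisect import bisect_right

LOC_LIMITS = [5, 30, 100]  # upper bounds (exclusive) of the first three tiers
CEREMONY = [
    "none",
    "inline comment spec",
    "mini template",
    "full spec with state machines",
]

def ceremony_for(loc: int) -> str:
    # bisect_right counts how many limits are <= loc, which is the tier index.
    return CEREMONY[bisect_right(LOC_LIMITS, loc)]

tiers = [ceremony_for(n) for n in (4, 5, 29, 30, 99, 100)]
```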
No two-week sprints. No ceremony overhead. Ideas ship the same day they're conceived.
The skill never runs git push or deploy commands. The agent handles code quality. You handle production.
This separation matters for trust. I delegate the build loop because I know the agent won't touch anything irreversible without asking.
Context gets lost. It happens. The spec file enables perfect resume.
The agent reads the spec, checks current state, picks up exactly where it left off. No "remind me what we were building" moments.
Not locked to any single tool. The same skill works with Claude Code, Codex, OpenCode, Cursor, Windsurf, Aider—any agent that reads SKILL.md files.
The AI tooling landscape shifts fast. The workflow stays portable.
The workflow skill is available at skills.sh/bntvllnt/agent-skills/workflow.
You can also copy it directly from the repo.
No config required. The skill auto-detects your project's tooling.
This isn't magic. Real trade-offs:
Overhead for small changes: A one-line typo fix doesn't need a spec. The skill detects trivial changes and skips ceremony—but sometimes you just want to edit and commit.
Learning curve: The spec format and actions take time to internalize. First week feels slower. After that, faster than before.
Agent quality varies: The loop is only as good as the agent's implementation. Complex algorithms and domain-specific code still need careful human review.
Token usage: Multi-perspective review and iterative fixing consume tokens. Worth it for production code. Overkill for throwaway scripts.
Momentum matters more than perfection.
I used to lose half my energy to process—where was I? What was I building? Did I test that edge case? Now the spec holds all state. Quality gates run automatically. The agent reviews its own code from 9 perspectives, enforces E2E-first TDD, and even learns from bugs to prevent the same class of error from recurring.
The result: better software, shipped faster, from day one.
Ship → observe → adjust. Every day.
If that resonates, the skill is at skills.sh/bntvllnt/agent-skills/workflow. The source is on GitHub.
Acceptance Criteria (ACs) — Testable conditions that define when a feature is "done." Written in GIVEN/WHEN/THEN format. Example: GIVEN a user sends 100 requests in 1 minute, WHEN they send request 101, THEN they receive a 429 status with a Retry-After header. Every scope item traces back to at least one AC. If you can't write an AC for it, it's not in scope.
Scope Items — The specific implementation tasks that fulfill ACs. Each scope item maps to one or more ACs, creating bidirectional traceability.
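Bidirectional traceability can be checked mechanically. A minimal sketch, assuming the spec has been parsed into a mapping of scope items to AC ids (the mapping shape and all names are hypothetical):

```python
def traceability_gaps(scope_to_acs: dict, all_acs: set) -> dict:
    """Find scope items with no known AC, and ACs that no scope item fulfills."""
    scope_without_ac = {
        item for item, acs in scope_to_acs.items() if not (set(acs) & all_acs)
    }
    covered = set().union(*scope_to_acs.values())
    return {
        "scope_without_ac": scope_without_ac,
        "ac_without_scope": all_acs - covered,
    }

gaps = traceability_gaps(
    scope_to_acs={
        "Add request counter": {"AC-1"},
        "Return 429 with Retry-After": {"AC-1", "AC-2"},
        "Refactor logger": set(),  # traces to nothing: not in scope
    },
    all_acs={"AC-1", "AC-2", "AC-3"},
)
```

Both result sets being empty is what "every scope item traces back to at least one AC" looks like in practice.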
Quality Gates — Automated validation checks that must pass before code is considered done. Quick pass (lint, typecheck on changed files) runs after each edit. Full pass (lint, typecheck, build, test, E2E registry, TDD proof) runs before completion. Coverage is advisory.
E2E-First Testing — Default to end-to-end tests. Unit tests are reserved for pure functions with no I/O or side effects. A mock avoidance hierarchy enforces real systems (test DB, sandbox API) over test doubles. Mocking your own code is never allowed.
Anti-Cascade TDD — Bug fix protocol that prevents fixes from introducing new failures. Baseline the full test suite, write a failing regression test (RED), implement the fix (GREEN), then compare the full suite to baseline (DIFF). Any new failures mean the fix introduced regressions.
SKILL.md — The standard file format for defining agent skills. Any AI coding agent that reads SKILL.md files can load and execute the workflow skill.