Harness Engineering for Production AI Systems
Harness engineering: Edmund Ng's journey spoke on governed AI, harness testing, and Vibe Coding for solo founders.
Published Updated 16 min read
ai-architecture · harness

Harness engineering matters when you move from demo velocity to production scrutiny. This article collects Edmund Ng's field notes on AI testing protocol, harness discipline, and the journey toward auditable AI, written for solo founders and system rule designers who cannot afford silent regressions.
Continue with these journey spokes.
Complete Vibe Coding Guide for Non-Programmers · The Phase Document System for AI · The 10/80/10 Testing Protocol for AI Governance
On this page
- What — AI testing protocol — harness engineering
- Why — production AI harness — demos lie (kindly)
- When — AI testing protocol — invest in harness
- Where — production AI harness — 10/80/10 in the stack
- How — AI testing protocol — start a harness without over-building
- What (extended): AI testing protocol, production AI harness
- Why (extended): production AI harness, AI testing protocol
- When (extended): AI testing protocol, production AI harness
Key takeaways
- Harness engineering needs written rules, not hero prompts alone.
- An AI testing protocol keeps demo speed from becoming production regret.
- Harness discipline connects this spoke to the wider governed production journey.
- Cross-link Phase docs, Harness retests, and written tradeoff logs before calling work done.
Takeaways above anchor the rest of this spoke.
What — AI testing protocol — harness engineering
Harness engineering is Edmund Ng's label for the QA and review machinery around AI systems — frozen snapshots, parallel axis review, remediation loops, smoke tiers (e.g. Playwright+), and long-run scenario packs.
It answers: "Does this system behave under scrutiny, not just in a demo?"
Prerequisite: Phase Document System.
Structured exports and harness retests matter more than demo velocity when reviewers ask for evidence.
In the What layer of this Act 2 architecture and harness spoke, teams work from an operational contract, not a marketing label. Governed exports and harness checkpoints prevent demo velocity from collapsing under multi-axis review or compliance questions. A practical test for how to test AI systems beyond demos asks three things: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Edmund Ng's field notes emphasize exportable rules and Decision Logs so that auditors six months later can follow the chain. That is the same fast-and-governed bridge Acts 1–3 teach.
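One way to make "what is frozen" checkable is to store a digest alongside the captured response. The following is a minimal sketch, not the author's implementation: `FrozenSnapshot`, `freeze_snapshot`, and `is_intact` are hypothetical names, and the response payload is invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FrozenSnapshot:
    """Immutable record of one captured response (hypothetical shape)."""
    release: str
    body: str    # canonical JSON text of the captured response
    sha256: str  # digest of body, recorded at freeze time


def freeze_snapshot(release: str, response: Any) -> FrozenSnapshot:
    """Serialize the response deterministically and record its digest."""
    body = json.dumps(response, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return FrozenSnapshot(release=release, body=body, sha256=digest)


def is_intact(snapshot: FrozenSnapshot) -> bool:
    """True if the body still matches the digest taken at freeze time."""
    current = hashlib.sha256(snapshot.body.encode("utf-8")).hexdigest()
    return current == snapshot.sha256


# Key order does not matter: both calls produce the same canonical body.
snap = freeze_snapshot("rc-1", {"answer": "ok", "steps": [1, 2, 3]})
same = freeze_snapshot("rc-1", {"steps": [1, 2, 3], "answer": "ok"})
```

Because the dataclass is frozen and the digest travels with the body, a reviewer can confirm that every lane looked at the same bytes before trusting a retest result.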
Why — production AI harness — demos lie (kindly)
The scariest bugs are the ones your demo celebrates.
API-green does not mean agent-safe. UI-green does not mean harness-green. Without harness engineering, Vibe Coding ships fast failures into production.
In the Why layer, teams work from an operational contract, not a marketing label. The practical test for when you need an AI harness layer: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
When — AI testing protocol — invest in harness
| Signal | Action |
|---|---|
| First external user / client | Minimum smoke + constitution checks |
| Multi-agent workflows | 10/80/10 parallel lanes |
| Regulated domain | Evidence + harness before scale |
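The table above can be read as a small decision rule. This sketch encodes it directly; `ProjectSignals` and `recommended_actions` are hypothetical names, and treating the rows as cumulative (several signals yield several actions) is an assumption the table itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectSignals:
    """Which signals from the table currently apply (hypothetical shape)."""
    has_external_user: bool = False
    multi_agent: bool = False
    regulated_domain: bool = False


def recommended_actions(signals: ProjectSignals) -> list[str]:
    """Return the table's action for every signal that is present."""
    actions = []
    if signals.has_external_user:
        actions.append("Minimum smoke + constitution checks")
    if signals.multi_agent:
        actions.append("10/80/10 parallel lanes")
    if signals.regulated_domain:
        actions.append("Evidence + harness before scale")
    return actions


plan = recommended_actions(ProjectSignals(has_external_user=True, regulated_domain=True))
```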
In the When layer, teams work from an operational contract, not a marketing label. The practical test for what harness engineering for AI is: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
Where — production AI harness — 10/80/10 in the stack
| Phase | Who | What |
|---|---|---|
| PRE (10%) | Frontier | Run real API once; freeze canonical snapshot |
| PARALLEL (80%) | 6–8 sub-agents | Same snapshot; one lane each; no re-execution |
| POST (10%) | Frontier | RCA, fix, retest — never stop at report |
Layer note: 10/80/10 is development/QA methodology. Runtime orchestration (governed sequential paths) is a separate layer — they do not contradict.
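The PRE and PARALLEL rows can be sketched in a few lines. This is an illustration under stated assumptions, not the author's tooling: `FakeApi` stands in for the real system, `review_lane` is a placeholder for a sub-agent, and the lane names are borrowed from the How list later in this article.

```python
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType
from typing import Callable, Mapping

LANES = ("gap", "error", "contradiction", "boundary", "over-promise", "quality")


class FakeApi:
    """Stand-in for the system under test (hypothetical); counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> dict:
        self.calls += 1
        return {"status": "ok", "reply": "Refund approved", "refund_limit": None}


def review_lane(lane: str, snapshot: Mapping) -> list[str]:
    """Placeholder reviewer; a real lane would be a sub-agent, not an if-statement."""
    findings = []
    if lane == "gap" and snapshot.get("refund_limit") is None:
        findings.append("gap: refund_limit missing from response")
    return findings


def run_pre_and_parallel(call_api: Callable[[], dict]) -> dict[str, list[str]]:
    # PRE (10%): run the real API once and wrap the result read-only.
    # MappingProxyType blocks top-level writes only; nested values stay mutable.
    snapshot = MappingProxyType(call_api())
    # PARALLEL (80%): every lane reads the same snapshot; nothing calls the API again.
    with ThreadPoolExecutor(max_workers=len(LANES)) as pool:
        results = pool.map(lambda lane: review_lane(lane, snapshot), LANES)
        report = dict(zip(LANES, results))
    # POST (10%) would follow here: root-cause analysis, fix, retest.
    return report


api = FakeApi()
report = run_pre_and_parallel(api)
```

The call counter is the point of the sketch: if any lane could trigger a second API call, lanes might review different outputs and their findings would no longer be comparable.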
In the Where layer, teams work from an operational contract, not a marketing label. The practical test for how to test AI systems beyond demos: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
How — AI testing protocol — start a harness without over-building
- Define one frozen snapshot per release candidate
- Assign lanes (gap, error, contradiction, boundary, over-promise, quality)
- Frontier consolidates — single remediation plan
- Add Playwright smoke for route-level regression
- Document results honestly — smoke tier ≠ full highway closure
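The consolidation and honest-reporting steps above might look like the following. It is a sketch under assumptions: the `Finding` shape, the three-level severity scale, and the tier names `smoke` and `long-run` are illustrative choices, not something the article specifies.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}  # assumed scale


@dataclass(frozen=True)
class Finding:
    lane: str
    issue: str
    severity: str


def consolidate(findings: list[Finding]) -> list[Finding]:
    """Merge lane findings into one plan: one entry per issue, worst severity first."""
    unique: dict[str, Finding] = {}
    for f in findings:
        kept = unique.get(f.issue)
        if kept is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[kept.severity]:
            unique[f.issue] = f
    return sorted(unique.values(), key=lambda f: (SEVERITY_ORDER[f.severity], f.issue))


def summarize(plan: list[Finding], tiers_run: set[str], tiers_defined: set[str]) -> str:
    """State which tiers ran and which did not, so smoke is not read as full coverage."""
    skipped = sorted(tiers_defined - tiers_run)
    coverage = "all defined tiers run" if not skipped else "NOT run: " + ", ".join(skipped)
    return f"{len(plan)} open item(s); tiers run: {', '.join(sorted(tiers_run))}; {coverage}"


findings = [
    Finding("gap", "missing refund limit", "medium"),
    Finding("boundary", "missing refund limit", "high"),
    Finding("quality", "tone too casual", "low"),
]
plan = consolidate(findings)
summary = summarize(plan, tiers_run={"smoke"}, tiers_defined={"smoke", "long-run"})
```

Naming the tiers that did not run keeps a green smoke result from being presented as more than it is.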
Next spoke (planned): 10/80/10 Testing Protocol.
In the How layer, teams work from an operational contract, not a marketing label. The practical test for when you need an AI harness layer: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
What (extended): AI testing protocol, production AI harness
Governed builders treat written rules, frozen snapshots, and harness retests as production requirements—not optional polish after a green demo. Edmund Ng's journey from non-programmer Vibe Coding to auditable AI systems shows why structure beats model churn when stakeholders ask how you decided, what you rejected, and what evidence you can export tomorrow.
In the extended What layer, teams work from an operational contract, not a marketing label. The practical test for what harness engineering for AI is: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
Why (extended): production AI harness, AI testing protocol
In the extended Why layer, teams work from an operational contract, not a marketing label. The practical test for how to test AI systems beyond demos: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
When (extended): AI testing protocol, production AI harness
In the extended When layer, teams work from an operational contract, not a marketing label. The practical test for when you need an AI harness layer: what is frozen before agents sweep, what gets logged at tradeoff time, and which Harness retest proves behavior instead of UI luck. Exportable rules and Decision Logs let auditors follow the chain six months later.
Summary
Harness engineering on Edmund Ng's journey means shipping with an AI testing protocol, harness retests, and evidence-friendly decisions, not one-off prompts. Models change; written rules, exportable snapshots, and governance patterns endure.
Governed builders treat written rules, frozen snapshots, and harness retests as production requirements—not optional polish after a green demo. The journey from non-programmer Vibe Coding to auditable AI shows why structure beats model churn when stakeholders ask how you decided, what you rejected, and what evidence you can export tomorrow.
What is harness engineering for AI
Edmund Ng treats each of these questions as a production gate: freeze the spec, log the tradeoff, and prove behavior with Harness retests, not demo clicks alone.
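"Log the tradeoff" can be as simple as an append-only file of structured entries. The sketch below assumes a JSON Lines file and a hypothetical `Decision` record whose fields mirror the questions this article says stakeholders ask (what was chosen, what was rejected, what evidence exists); the actual Decision Log format is not described in the source.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Decision:
    """One tradeoff entry (hypothetical fields)."""
    question: str
    chosen: str
    rejected: list[str]
    evidence: str  # e.g. a snapshot digest or retest identifier
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_decision(path: str, decision: Decision) -> None:
    """Append one entry as a single JSON line; earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(decision), sort_keys=True) + "\n")


def read_decisions(path: str) -> list[dict]:
    """Load every entry back for export or audit."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "decisions.jsonl")
    append_decision(log_path, Decision(
        question="Cache model replies?",
        chosen="No cache for rc-1",
        rejected=["24h cache"],
        evidence="snapshot rc-1",
    ))
    entries = read_decisions(log_path)
```

A plain text log like this can be exported or diffed without any tooling, which is what makes it usable months later without reconstructing chat history.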
How to test AI systems beyond demos
Solo founders in Malaysia and APAC often face professional scrutiny early. Externalizing Phase documents, Decision Logs, and smoke tiers before the demo invitation arrives is cheaper than rebuilding trust after a silent regression reaches a customer walkthrough.
When do you need an AI harness layer
Role separation matters: builder models may sweep diffs, but frontier models should audit frozen snapshots. Mixing those hats in one chat thread is how teams lose reproducibility and inherit context debt that no IDE upgrade fixes.
FAQ
What is harness engineering?
Edmund Ng answers with structure first: freeze specs, separate builder and frontier roles, and prove behavior with Harness—not demo clicks. Written rules, Phase documents, and Decision Logs let teams explain tradeoffs months later without reconstructing chat history.
What is harness engineering for AI?
For AI systems specifically, harness engineering is the QA and review machinery around the system: frozen snapshots, parallel axis review, remediation loops, smoke tiers, and long-run scenario packs. It answers whether the system behaves under scrutiny, not just in a demo.
How do you test AI systems beyond demos?
Role separation matters: builder models may sweep diffs, but frontier models should audit frozen snapshots. Mixing those hats in one chat thread is how teams lose reproducibility and inherit context debt that no IDE upgrade fixes.
When do you need an AI harness layer?
Edmund Ng ties the answer to three signals. With the first external user or client, add minimum smoke and constitution checks. With multi-agent workflows, move to 10/80/10 parallel lanes. In a regulated domain, put evidence and harness in place before scale.
Why does an AI testing protocol matter for solo founders?
Solo founders in Malaysia and APAC often face professional scrutiny early. Externalizing Phase documents, Decision Logs, and smoke tiers before the demo invitation arrives is cheaper than rebuilding trust after a silent regression reaches a customer walkthrough.
When should teams freeze specs before agent sweeps?
Before the parallel sweep starts. In the PRE phase the frontier model runs the real API once and freezes a canonical snapshot, one per release candidate; sub-agents then review that same snapshot without re-execution. Role separation follows from this: builder models may sweep diffs, but frontier models should audit frozen snapshots. Mixing those hats in one chat thread is how teams lose reproducibility and inherit context debt that no IDE upgrade fixes.
About the author

Edmund Ng — Malaysia-based solo founder, AI systems architect, and system rule designer. He ships governed AI with Vibe Coding, harness engineering, and auditable evidence chains. About · Projects · LinkedIn.
