KimiHarness Replay
Accuracy crashed to 28% after the first rewrite, then recovered to 100% by iteration 11.
A Meta-Harness search run on Symptom2Disease: 12 iterations, $0.73 total cost, and a readable rule set that stabilizes after an early over-correction.
Current Focus
Iteration 0
69% accuracy · $0.02 spent
Run ID 2026-04-23_17-21-21
Baseline
69%
Iteration 1 Dip
28%
Converged
Iter 11 · 100%
Total Cost
$0.73
Replay Timeline
See the collapse, the correction, and the eventual convergence
Iteration 1 is the hook: the first rewrite drops accuracy from 69% to 28%. The rest of the run is the system climbing back by reading failures and tightening rules.
Baseline
69%
A decent starting point before the search destabilizes.
Iteration 1 Dip
28%
The first rewrite makes the harness worse, not better.
Convergence
100%
By iteration 11, every prediction matches its label, so the confusion matrix is a clean diagonal.
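The "clean diagonal" most plausibly refers to a confusion matrix in which every example lands in the cell where the true and predicted labels match. The sketch below is only an illustration of that check, not code from the run: the function names are made up for this example and the disease labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def confusion_counts(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Counter:
    """Count (true_label, predicted_label) pairs: a sparse confusion matrix."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(counts: Counter) -> float:
    """Fraction of examples on the diagonal (true == predicted)."""
    total = sum(counts.values())
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / total if total else 0.0


def is_clean_diagonal(counts: Counter) -> bool:
    """True when no example falls in an off-diagonal cell."""
    return all(true == pred for (true, pred) in counts)


# Hypothetical labels, for illustration only (not taken from the run).
truth = ["migraine", "malaria", "acne"]
perfect = confusion_counts(truth, ["migraine", "malaria", "acne"])
flawed = confusion_counts(truth, ["migraine", "acne", "acne"])
```

With these inputs, `is_clean_diagonal(perfect)` holds and `is_clean_diagonal(flawed)` does not, because one `malaria` example was predicted as `acne`.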
Iteration Brief
What changed and why it mattered
Accuracy
69%
Failures
31
Iteration Cost
$0.02
Baseline: zero-shot single-call harness from src/mini_meta_harness/harnesses/baseline_zero_shot.py.
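For readers unfamiliar with the term, a zero-shot single-call harness generally means one model call per example, with the allowed labels in the prompt and no worked examples. The sketch below shows that shape under stated assumptions: it is not the contents of `baseline_zero_shot.py`, and `build_prompt`, `classify`, and the injected `complete` callable are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional, Sequence


def build_prompt(symptoms: str, labels: Sequence[str]) -> str:
    """Zero-shot: the prompt lists the allowed labels but gives no worked examples."""
    options = "\n".join(f"- {label}" for label in labels)
    return (
        "Pick the single most likely diagnosis for the description below.\n"
        f"Allowed labels:\n{options}\n\n"
        f"Description: {symptoms}\n"
        "Answer with one label only."
    )


def classify(
    symptoms: str,
    labels: Sequence[str],
    complete: Callable[[str], str],  # hypothetical: any function mapping prompt -> reply
) -> Optional[str]:
    """Single call: query the model once and map its reply back to a known label."""
    reply = complete(build_prompt(symptoms, labels)).strip().lower()
    for label in labels:
        if reply == label.lower():
            return label
    # Fall back to the longest label mentioned anywhere in the reply.
    mentioned = [label for label in labels if label.lower() in reply]
    return max(mentioned, key=len) if mentioned else None
```

Passing the model as a plain callable keeps the sketch independent of any particular client library. In a test, `complete` can be a stub such as `lambda prompt: "malaria"`.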
Filesystem Reads
Evidence pulled into memory
The proposer repeatedly revisits earlier traces and harness snapshots. That read pattern is the mechanism behind the later recovery.
Harness Delta
The actual code mutation
Compare the harness rewrite directly, then inspect the failures that remain at this step.
Run Metrics
The search story in one strip
The dip, the recovery, and the token-heavy memory reads all show up here, and the strip stays fixed while you scrub through iterations.
Charts: Accuracy over time · Cost per iteration · Tokens
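The headline figures can be cross-checked with a little arithmetic. The sketch below uses only numbers shown on this page. The implied evaluation-set size is an inference: it assumes the 69% baseline is exact rather than rounded and that the 31 failures are all of the misclassified examples.

```python
# Figures copied from the page.
TOTAL_COST_USD = 0.73
ITERATIONS = 12  # iterations 0 through 11
BASELINE_ACCURACY_PCT = 69
BASELINE_FAILURES = 31
DIP_ACCURACY_PCT = 28

# Average spend per iteration across the whole run.
mean_cost = TOTAL_COST_USD / ITERATIONS

# Size of the iteration 1 drop, in percentage points.
dip_points = BASELINE_ACCURACY_PCT - DIP_ACCURACY_PCT

# Assumption: failures / total == 1 - accuracy, with accuracy not rounded.
implied_eval_size = BASELINE_FAILURES * 100 / (100 - BASELINE_ACCURACY_PCT)

print(f"mean cost per iteration: ${mean_cost:.3f}")
print(f"iteration 1 drop: {dip_points} points")
print(f"implied evaluation set size: {implied_eval_size:.0f}")
```

The mean of roughly $0.06 per iteration is about three times the $0.02 baseline step, which fits the description of later iterations as token-heavy.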