KimiHarness Replay

Accuracy crashed to 28% after the first rewrite, then recovered to 97% by iteration 11.

A Meta-Harness search run on Symptom2Disease: 12 iterations, $0.73 total cost, and a readable rule set that stabilizes after an early over-correction.

Current Focus

Iteration 0

69% accuracy · $0.02 spent

kimi-k2.6

Run ID 2026-04-23_17-21-21

Baseline

69%

Iteration 1 Dip

28%

Converged

Iter 11 · 100%

Total Cost

$0.73

12 iterations · Symptom2Disease · kimi-k2.6 proposer · Source repo · Paper

Replay Timeline

See the collapse, the correction, and the eventual convergence

Iteration 1 is the hook: the first rewrite makes the harness worse. The rest of the run is the system climbing back by reading failures and tightening rules.
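The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `evaluate`, `search`, the `propose` callback, and the shape of `history` are all hypothetical names chosen for this sketch. It assumes every candidate is scored and recorded even when it is worse than its predecessor, which is what lets a dip like iteration 1 show up in the replay.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a harness maps a symptom description to a predicted label.
Harness = Callable[[str], str]
Example = Tuple[str, str]  # (symptom text, gold label)
Record = Tuple[int, float, List[Example]]  # (iteration, accuracy, failures)


def evaluate(harness: Harness, examples: List[Example]) -> Tuple[float, List[Example]]:
    """Return accuracy and the examples the harness got wrong."""
    failures = [(text, gold) for text, gold in examples if harness(text) != gold]
    accuracy = 1.0 - len(failures) / len(examples)
    return accuracy, failures


def search(
    baseline: Harness,
    propose: Callable[[List[Record]], Harness],
    examples: List[Example],
    iterations: int,
) -> List[Record]:
    """Score the baseline as iteration 0, then let the proposer rewrite the
    harness after reading the full history of earlier failures."""
    accuracy, failures = evaluate(baseline, examples)
    history: List[Record] = [(0, accuracy, failures)]
    for i in range(1, iterations):
        candidate = propose(history)  # the proposer sees every earlier record
        accuracy, failures = evaluate(candidate, examples)
        history.append((i, accuracy, failures))  # recorded even if worse
    return history
```

In this sketch nothing forces accuracy to improve from one step to the next; recovery depends entirely on what `propose` does with the failures it reads.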

Baseline

69%

A decent starting point before the search destabilizes.

Iteration 1 Dip

28%

Accuracy falls 41 points from the baseline after the first rewrite.

Convergence

100%

By the end, the search lands on a clean diagonal in the confusion matrix: every prediction matches its label.

Iteration Brief

What changed and why it mattered

baseline

Accuracy

69%

Failures

31

Iteration Cost

$0.02
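As a quick sanity check on the figures above, the arithmetic can be worked through in a few lines. The evaluation-set size is not stated on this page; the value below is only what 31 failures at 69% accuracy would imply, and the per-iteration average assumes the $0.73 total covers all 12 iterations.

```python
# Figures quoted on this page.
baseline_accuracy = 0.69
baseline_failures = 31
total_cost_usd = 0.73
iterations = 12

# Assumption: failures = (1 - accuracy) * evaluation-set size.
implied_eval_size = round(baseline_failures / (1 - baseline_accuracy))

# Mean spend per iteration; the baseline's $0.02 sits well below this.
avg_cost_per_iteration = round(total_cost_usd / iterations, 3)

print(implied_eval_size, avg_cost_per_iteration)
```

Under these assumptions the baseline would have been scored on 100 examples, and the average iteration costs about three times as much as the baseline call.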

Baseline: zero-shot single-call harness from src/mini_meta_harness/harnesses/baseline_zero_shot.py.
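For orientation, a zero-shot single-call harness might look like the sketch below. It is not the contents of baseline_zero_shot.py: the function name, the prompt wording, and the `call_model` callback are assumptions made for illustration. The only properties it tries to capture are the ones named above: no examples in the prompt, and exactly one model call per input.

```python
from typing import Callable, Sequence


def zero_shot_harness(
    symptoms: str,
    labels: Sequence[str],
    call_model: Callable[[str], str],
) -> str:
    """Classify one symptom description with a single model call.

    `call_model` is a hypothetical stand-in for whatever client sends a
    prompt to the model and returns its text reply.
    """
    prompt = (
        "Pick the single most likely disease for the symptoms below.\n"
        "Answer with exactly one label from this list: "
        + ", ".join(labels)
        + "\n\nSymptoms: "
        + symptoms
        + "\nDisease:"
    )
    reply = call_model(prompt).strip().lower()  # the one and only call
    # Map the reply back onto a known label, ignoring case.
    for label in labels:
        if label.lower() == reply:
            return label
    # Anything unrecognized is returned as-is and will simply score as wrong.
    return reply
```

A harness this thin leaves the proposer plenty of room to add rules, which is where both the early over-correction and the later recovery can come from.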

Filesystem Reads

Evidence pulled into memory

The proposer repeatedly revisits earlier traces and harness snapshots. That read pattern is the mechanism behind the later recovery.
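To make the read pattern concrete, here is a small sketch of gathering failure traces from earlier iterations before proposing a rewrite. The directory layout (`iter_*/failures.json`) and the function name are hypothetical; the real run may store traces and harness snapshots differently.

```python
import json
from pathlib import Path
from typing import Dict, List


def collect_failures(run_dir: Path) -> Dict[str, List[dict]]:
    """Load every earlier iteration's failure trace, oldest first.

    Assumes a hypothetical layout of <run_dir>/iter_<n>/failures.json,
    where each file holds a JSON list of failed examples.
    """
    evidence: Dict[str, List[dict]] = {}
    for path in sorted(run_dir.glob("iter_*/failures.json")):
        evidence[path.parent.name] = json.loads(path.read_text())
    return evidence
```

Reading the whole history, rather than only the latest step, is one way a proposer could notice that a rule introduced earlier is what caused a regression.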

Harness Delta

The actual code mutation

Compare the harness before and after the rewrite, then inspect the failures that remain at this step.

Starting point: no diff.

Run Metrics

The search story in one strip

The dip, the recovery, and the token-heavy memory reads all show up here. The charts stay in place as you scrub, so the surrounding context does not shift between iterations.

Accuracy

Accuracy over time

Cost

Cost per iteration

Tokens

Proposer tokens in and out