
Uncontaminatable benchmarks

A class of fully public benchmarks with nothing to memorize. jcode bench is the first instance.

Agent benchmarks today fail in three ways.

They saturate. A fixed task set has a fixed ceiling, and as agents improve, scores pile up against it. The benchmark stops discriminating exactly when the differences matter most, and the field moves on to the next one every year or two.

They are secret. Hidden test sets resist contamination, but nobody outside can run them, reproduce a score, or audit a grader. Results become claims you are asked to trust, and the benchmark can't be used as a public, day-to-day yardstick.

Or they are public, and therefore contaminated. Published tasks and answers leak into training corpora within months. From then on, scores measure memorization instead of ability, and nobody can tell which. Every public fixed-answer benchmark dies this death.

Secrecy and publicity look like a forced choice. They are not: design the benchmark so that memorization is worth nothing, and it can be fully public without becoming contaminated. That is the class we propose.

Definition. A benchmark is uncontaminatable when a model trained on its entire public artifact (every spec, grader, and past submission) gains nothing beyond the skill the benchmark claims to measure. There is no answer to memorize.

Under this definition, "training on the benchmark" stops being cheating, because such training is indistinguishable from acquiring the skill itself. Everything can be published. Results are reproducible by anyone. No trust in a secret test set is required.

The pillars

Six properties, each guaranteed by the task's structure, not by policing.

| Property | Meaning |
| --- | --- |
| Quantifiable | The metric falls out of the task definition: bytes, cycles under a published cost model. No rubric, no judge. |
| Deterministic | Same submission, same score, always. No sampled test set, no timing noise. |
| Analog | Scored on a continuous quality axis, not pass/fail on a task set. Pass/fail yields one bit per task and only discriminates near its threshold; an analog score produces useful information at every capability level, so one bench stays informative from weak agents to frontier. |
| Cheat-resistant | Correctness is verified exhaustively, so there is nothing to overfit. The strongest known implementation is handed out as the starting point, so there is nothing to look up. |
| Fast to iterate | Grading takes seconds. The loop is edit, grade, number. The benchmark measures how well an agent climbs, so the loop must be tight enough to climb. |
| Pure coding | The deliverable is a program doing real, useful software work. Not a disguised math puzzle. |
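To make "cycles under a published cost model" concrete, here is a minimal sketch of what a deterministic cost model could look like. The operation names, the per-operation costs, and the `program_cost` helper are all hypothetical, chosen for illustration only; the real jcode bench cost model has not been published yet.

```python
# Hypothetical cost table: cycles charged per operation kind.
# These numbers are invented for illustration, not taken from jcode bench.
COST_TABLE = {
    "add": 1,
    "and": 1,
    "shift": 1,
    "mul": 3,
    "load": 4,
    "branch": 2,
}


def program_cost(ops):
    """Sum the table cost of each operation in a straight-line op list.

    The result depends only on the op list and the table, so the same
    submission always gets the same cost: no clock, no timing noise.
    """
    return sum(COST_TABLE[op] for op in ops)


if __name__ == "__main__":
    print(program_cost(["load", "and", "add"]))
```

Because the cost is computed from the code rather than measured on hardware, anyone can recompute it and get the identical number.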

jcode bench

The first instance. Each task hands the agent a working, tested, production-grade primitive, the kind of function that ships in real libraries, together with its exhaustive verifier and a published deterministic cost model. The task is one sentence:

Make it faster. It must stay correct on every possible input. We check all of them.

Because verification is exhaustive, there is no test suite to overfit; total correctness is the test. Because the best known implementation is the starting line, copying from the internet gains zero. Because the cost model is deterministic, the score is a pure function of the code.
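A minimal sketch of exhaustive verification, assuming a toy primitive (a bit count over 16-bit inputs) so that every input can be enumerated quickly. The primitive, the 16-bit domain, and every function name here are illustrative assumptions, not actual jcode bench tasks or harness code.

```python
from typing import Callable

BITS = 16  # assumption: a toy domain small enough to enumerate in full


def reference_popcount(x: int) -> int:
    """Stand-in for the given implementation: test one bit per iteration."""
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count


def candidate_popcount(x: int) -> int:
    """A candidate rewrite: clear the lowest set bit on each iteration."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count


def verify_exhaustively(
    candidate: Callable[[int], int],
    reference: Callable[[int], int],
    bits: int = BITS,
) -> bool:
    """Return True only if candidate agrees with reference on every input."""
    return all(candidate(x) == reference(x) for x in range(1 << bits))


if __name__ == "__main__":
    print(verify_exhaustively(candidate_popcount, reference_popcount))
```

A candidate that is wrong on even a single input is rejected, which is why there is no sampled test suite for a submission to overfit.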

Scoring

Improvements to optimized code are multiplicative, so the score is logarithmic: it counts doublings of speed over the given implementation, which is the base-2 logarithm of the speedup.

0.0: the given implementation, unimproved
+1.0: twice as fast under the cost model, still correct on all inputs
+2.0: four times as fast
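The scale above can be written as a one-line function. This is a sketch under two assumptions: that "twice as fast" means half the cost under the cost model, and that the submission has already passed verification (how an incorrect submission is scored is not specified here). The function and parameter names are hypothetical.

```python
import math


def doublings_score(baseline_cost: float, candidate_cost: float) -> float:
    """Score = log2(baseline_cost / candidate_cost).

    Assumes both costs come from the same deterministic cost model and
    that the candidate is already verified correct on all inputs.
    """
    if baseline_cost <= 0 or candidate_cost <= 0:
        raise ValueError("costs must be positive")
    return math.log2(baseline_cost / candidate_cost)


if __name__ == "__main__":
    for cost in (100, 50, 25):
        print(cost, doublings_score(100, cost))
```

With a baseline cost of 100, candidate costs of 100, 50, and 25 map to 0.0, 1.0, and 2.0, matching the three reference points.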

Time is recorded, not capped. The harness tracks the best score continuously as the agent works, producing a score-over-time curve rather than a number at an arbitrary deadline. Agents are expected to become more useful over longer horizons, and the curve captures exactly that: how fast an agent climbs, and how far it can go when simply left to work. Every point above zero is improvement beyond what can be looked up.
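The score-over-time curve is a running maximum over graded submissions. The sketch below assumes the harness sees a time-ordered stream of (elapsed seconds, score) pairs for correct submissions; that event format and the function name are assumptions made for illustration, not the published harness interface.

```python
from typing import Iterable, List, Tuple


def best_score_curve(
    events: Iterable[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Return the points at which the best score so far improves.

    events: (elapsed_seconds, score) per graded, correct submission,
    in time order. The curve starts at (0.0, 0.0) because the given
    implementation scores 0.0 before the agent changes anything.
    """
    curve = [(0.0, 0.0)]
    best = 0.0
    for elapsed, score in events:
        if score > best:  # regressions and ties leave the curve unchanged
            best = score
            curve.append((elapsed, best))
    return curve


if __name__ == "__main__":
    run = [(10, 0.2), (25, 0.1), (40, 0.9), (300, 0.9), (900, 1.4)]
    print(best_score_curve(run))
```

Reading the curve at any elapsed time gives the score an agent would have earned had it been stopped there, so no single deadline has to be chosen in advance.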


jcode bench is in development. Task specs, graders, and the harness will be published in full. Nothing will be withheld: there is no hidden test set.