Uncontaminatable benchmarks
A class of fully public benchmarks with nothing to memorize. jcode bench is the first instance.
Agent benchmarks today fail in three ways.
They saturate. A fixed task set has a fixed ceiling, and as agents improve, scores pile up against it. The benchmark stops discriminating exactly when the differences matter most, and the field moves on to the next one every year or two.
They are secret. Hidden test sets resist contamination, but nobody outside can run them, reproduce a score, or audit a grader. Results become claims you are asked to trust, and the benchmark can't be used as a public, day-to-day yardstick.
Or they are public, and therefore contaminated. Published tasks and answers leak into training corpora within months. From then on, scores measure memorization instead of ability, and nobody can tell which. Every public fixed-answer benchmark dies this death.
Secrecy and publicity look like a forced choice. They are not: design the benchmark so that memorization is worth nothing, and it can be fully public without becoming contaminated. That is the class we propose.
Under this definition, "training on the benchmark" stops being cheating, because such training is indistinguishable from acquiring the skill itself. Everything can be published. Results are reproducible by anyone. No trust in a secret test set is required.
The pillars
Six properties, each guaranteed by the task's structure, not by policing.
| Property | Meaning |
|---|---|
| Quantifiable | The metric falls out of the task definition: bytes, or cycles under a published cost model. No rubric, no judge. |
| Deterministic | Same submission, same score, always. No sampled test set, no timing noise. |
| Analog | Scored on a continuous quality axis, not pass/fail on a task set. Pass/fail yields one bit per task and only discriminates near its threshold; an analog score produces useful information at every capability level, so one bench stays informative from weak agents to frontier ones. |
| Cheat-resistant | Correctness is verified exhaustively, so there is nothing to overfit. The strongest known implementation is handed out as the starting point, so there is nothing to look up. |
| Fast to iterate | Grading takes seconds. The loop is edit, grade, number. The benchmark measures how well an agent climbs, so the loop must be tight enough to climb. |
| Pure coding | The deliverable is a program doing real, useful software work. Not a disguised math puzzle. |
jcode bench
The first instance. Each task hands the agent a working, tested, production-grade primitive, the kind of function that ships in real libraries, together with its exhaustive verifier and a published deterministic cost model. The task is one sentence:
Because verification is exhaustive, there is no test suite to overfit; total correctness is the test. Because the best known implementation is the starting line, copying from the internet gains zero. Because the cost model is deterministic, the score is a pure function of the code.
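To make "total correctness is the test" concrete, here is a minimal sketch of exhaustive verification. It assumes a primitive whose input domain is small enough to enumerate. The 16-bit popcount, the function names, and the `bits` parameter are all hypothetical illustrations, not part of the jcode bench task set or harness.

```python
from typing import Callable


def reference_popcount(x: int) -> int:
    """Hypothetical starting implementation: count set bits in a 16-bit value."""
    return bin(x).count("1")


def candidate_popcount(x: int) -> int:
    """Hypothetical submission: parallel bit-summing over a 16-bit value."""
    x = x - ((x >> 1) & 0x5555)             # 2-bit fields now hold bit counts
    x = (x & 0x3333) + ((x >> 2) & 0x3333)  # 4-bit fields hold counts
    x = (x + (x >> 4)) & 0x0F0F             # each byte holds its count
    return (x + (x >> 8)) & 0x1F            # sum of both bytes, 0..16


def verify_exhaustively(
    candidate: Callable[[int], int],
    reference: Callable[[int], int],
    bits: int = 16,
) -> bool:
    """Compare candidate and reference on every input in the domain.

    Assumption: the domain is all unsigned integers of the given width,
    so there is no sampled subset a submission could overfit.
    """
    return all(candidate(x) == reference(x) for x in range(1 << bits))
```

A submission either agrees with the reference on every input or it is rejected; a candidate that passes only on commonly tested values gains nothing.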
Scoring
Improvements to optimized code are multiplicative, so the score is logarithmic: doublings of improvement over the given implementation.
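One plausible reading of "doublings of improvement" is a base-2 logarithm of the cost ratio. The sketch below assumes that lower cost is better and that both costs come from the same published cost model; the function name and exact formula are illustrative assumptions, not the published scoring rule.

```python
import math


def doublings(baseline_cost: float, candidate_cost: float) -> float:
    """Score a submission in doublings of improvement over the baseline.

    Assumed formula: log2(baseline_cost / candidate_cost).
    0.0 means no better than the given implementation,
    1.0 means half the cost, 2.0 means a quarter of the cost.
    """
    if baseline_cost <= 0 or candidate_cost <= 0:
        raise ValueError("costs must be positive")
    return math.log2(baseline_cost / candidate_cost)
```

Under this reading, cutting a cost from 1000 to 500 and cutting it from 500 to 250 each earn one point, which matches the multiplicative nature of optimization gains.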
Time is recorded, not capped. The harness tracks the best score continuously as the agent works, producing a score-over-time curve rather than a number at an arbitrary deadline. Agents are expected to become more useful over longer horizons, and the curve captures exactly that: how fast an agent climbs, and how far it can go when simply left to work. Every point above zero is improvement beyond what can be looked up.
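The score-over-time curve can be sketched as a running maximum over graded submissions. The sample format (elapsed seconds paired with a score) and the starting value of zero, which reflects that the given implementation is always available, are assumptions made for illustration rather than the harness's actual data model.

```python
from typing import List, Tuple


def best_so_far(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Turn time-ordered (elapsed_seconds, score) samples into a best-score curve.

    Assumption: the curve starts at 0.0 because the agent can always fall
    back to the starting implementation, so a regression never lowers it.
    """
    curve = []
    best = 0.0
    for elapsed, score in samples:
        best = max(best, score)
        curve.append((elapsed, best))
    return curve
```

For example, `best_so_far([(10, 0.2), (25, 0.1), (60, 0.7)])` keeps the value at 0.2 through the regression at 25 seconds and rises to 0.7 at 60 seconds.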
jcode bench is in development. Task specs, graders, and the harness will be published in full. Nothing will be withheld: there is no hidden test set.