Working paper · PushDev Notes

A field note on agent evaluation

A rising benchmark score now tells you more about the benchmark than about the agent. Notes from the other side of trusting one.

Yavari

PushDev Notes · May 7, 2026

Abstract

Agent benchmarks in 2026 keep posting higher scores while telling us less. This note works through three ways a benchmark number misleads the person reading it: saturation and contamination that hollow out coding scores, harnesses gullible enough that a do-nothing agent passes more than a third of the tasks, and single-attempt metrics that hide how rarely an agent succeeds on the same task several times in a row. It closes with the four habits, drawn mostly from independent and academic work, that I now trust further than any leaderboard.

Last year I shipped a change I was proud of for about a week.

The problem was image identification: getting a model to tell, reliably, which category a picture belonged to. An agent had proposed the approach, and on paper it was clean: it cleared the evaluation we'd set up, beat what we already had, and the write-up read like a solved problem. So I merged it.

Then it met the real world, and the real world was unimpressed. The numbers that had looked decisive in the eval didn't survive contact with the actual images we cared about. It failed in the specific, undramatic ways that never show up on a leaderboard: the edge cases, the messy inputs, the conditions nobody thought to put in the test set. After a few days of trying to talk myself out of what I was seeing, I reverted the whole thing.

I've been suspicious of evaluation numbers ever since. Not because the agent lied. It didn't. Because the eval did. Or more precisely: the eval told the truth about itself, and I mistook that for the truth about the world.

This is a field note about that mistake, and about how widespread it has quietly become. Agent evaluation in 2026 is in a strange place: the scores keep going up, and they keep meaning less. Below are the three ways I've watched a benchmark number betray the person reading it, and the few things I now trust instead.

1. Score up, information down

A benchmark is only useful while it can tell two systems apart. The moment everyone clusters near the top, the number stops carrying information. It becomes a participation ribbon that happens to have decimal places.

Coding agents got there fast. The headline benchmarks now report scores high enough that the interesting question isn't "who's winning" but "is anyone actually being measured." Part of the climb is real progress. A large part is contamination: when the test problems live in public repositories, they leak into training data, sometimes deliberately and often just by the ordinary gravity of scraping the internet. A model that has seen the answer isn't being evaluated. It's being quizzed on its own notes.

The cleanest way to see how much of a score was contamination is to take it away. SWE-bench Pro, built by Scale AI, does this by holding out private commercial codebases: proprietary repositories that, by construction, cannot be in anyone's training set. Run the same class of frontier models against problems they genuinely haven't seen and the numbers fall through the floor: GPT-5 drops from 23.1% on the public set to 14.9% on the private commercial leaderboard, and Claude Opus 4.1 from 22.7% to 17.8%. These are the same models that clear far higher marks on the older, public, well-scraped benchmarks.
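
To put those drops in proportion, here is a minimal sketch of the arithmetic. It uses only the four figures quoted above, and it reads the higher figure in each pair as the public-set score. The function name relative_drop is mine, not Scale's.

```python
def relative_drop(public_pct: float, private_pct: float) -> float:
    """Fraction of the public score that disappears on the private set."""
    return (public_pct - private_pct) / public_pct


# (public %, private %) pairs as quoted in the text above.
scores = {
    "GPT-5": (23.1, 14.9),
    "Claude Opus 4.1": (22.7, 17.8),
}

for model, (public, private) in scores.items():
    drop = relative_drop(public, private)
    print(f"{model}: {public}% -> {private}% ({drop:.0%} relative drop)")
```

Read this way, GPT-5 loses roughly a third of its public score and Claude Opus 4.1 roughly a fifth. The sketch says nothing about why; it only sizes the gap.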

There's a smaller irony buried in this that I can't stop thinking about. When I went looking for these figures, the first few write-ups I found quoted much rosier numbers. One had the models at "46 to 58 percent." The inflation had crept into the reporting of the benchmark, not just the benchmark itself. The primary source, Scale's own leaderboard, was far less flattering than its secondhand summaries. That is a small lesson I've now learned twice: the further a number travels from the people who measured it, the better it tends to look. I've started reading the source or not citing the number.

The gap between the public score and the private one isn't the model getting worse. It's the public score showing you how much of itself was never real. Score up, information down.

2. The harness becomes the subject

When a benchmark says an agent scored 60%, the unspoken claim is that the missing 40% is the agent's shortfall. That arithmetic only holds if something doing no real work would score zero. Often it doesn't.

In 2025 a group at the University of Illinois ran a set of validity checks they called the Agentic Benchmark Checklist across ten popular agent benchmarks. They were looking for two specific failures: tasks you can pass without doing the work, and tasks you can't pass even when the work is done correctly. Most of the ten had at least one. On τ-bench, a customer-service benchmark, a "do-nothing" agent that takes no action passes 38% of tasks, and an agent that just spams replies passes 40%. On WebArena, an agent can satisfy the grader with the right string in its output without ever resolving the user's request. On OSWorld the failure runs the other way: in a slice of the desktop tasks, an agent that actually completes the job is scored as failing, because the checker is watching for the wrong signal.

Sit with the τ-bench number for a moment. If doing nothing earns 38 points, then a real agent's first 38 points are indistinguishable from inertia. The floor of the benchmark is not zero. It is noise in the costume of signal, and the whole figure gets reported as though every point were earned. The checklist's authors put the resulting distortion at up to roughly 40% on benchmarks as established as SWE-bench Verified and τ-bench, then built an auditing agent that hunts these holes automatically, which tells you they expect to keep finding them.
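
One way to make the floor visible is to rescale a headline score against what a null agent earns. This is a hedged sketch, not a method from the checklist paper: the function floor_adjusted and the rescaling itself are my assumptions, and the 60% headline is the hypothetical figure from the start of this section. It also assumes the tasks a null agent passes are interchangeable with the rest, which is a simplification.

```python
def floor_adjusted(score: float, null_score: float) -> float:
    """Share of the headroom above the null agent that the real agent captured.

    Both arguments are pass rates in [0, 1]. Returns 0.0 when the agent does
    no better than the null baseline.
    """
    if not 0.0 <= null_score < 1.0:
        raise ValueError("null_score must be in [0, 1)")
    return max(0.0, (score - null_score) / (1.0 - null_score))


headline = 0.60    # hypothetical headline score
null_floor = 0.38  # do-nothing pass rate quoted above for tau-bench

print(f"{floor_adjusted(headline, null_floor):.0%} of the headroom above the floor")
```

Under these assumptions a 60% headline corresponds to about 35% of the room that was actually available to earn.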

So here is the reframe I can't unsee: a score like this measures the agent and the gullibility of the harness at the same time, adds the two together, and prints the sum as a single number. Make the harness easy enough to fool and the thing you are really ranking is the harness. It is worth noticing that the people who found this are an academic lab with nothing to sell, not the leaderboards themselves. I have started weighting that kind of source more heavily, which is a bias I will defend in a minute.

3. The reliability illusion

Here is the failure that actually burned me, dressed up as a statistic.

Most benchmark scores are single-attempt. The agent gets one shot at each task and we record whether it worked. Call it pass@1. It answers an exact question: run this agent once, how often does it succeed? That is the right question for a demo and the wrong one for anything you deploy, because deployed agents do not run once. They run thousands of times, and a user meets the agent on attempt 847, not attempt one.

The τ-bench authors measured this gap on purpose, with a metric they call pass^k: the probability that all k independent attempts at the same task succeed. Single-attempt success for a strong model on their retail tasks sits under 50%. Raise the bar to pass^8, the chance of getting the same task right eight times running, and it drops below 25%. The agent that looked like a coin flip on one try is at best a one-in-four bet across eight of them. Nothing about the model changed between those two numbers. Only the question did.
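
A small sketch of how pass^k can be estimated from repeated trials. For a task with c successes in n trials, comb(c, k) / comb(n, k) is the fraction of k-trial subsets in which every trial succeeded; averaging over tasks gives the estimate. I believe this matches the form used in the τ-bench paper, but treat the function name pass_hat_k and the toy data as my own assumptions.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials of the same task all succeed."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    # comb(c, k) is 0 when c < k, so a task solved fewer than k times adds 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: 4 tasks, 8 trials each, successes counted per task.
results = [8, 4, 4, 0]

print(pass_hat_k(results, n_trials=8, k=1))  # the mean single-trial success rate
print(pass_hat_k(results, n_trials=8, k=8))  # only always-solved tasks count
```

On this toy data the single-trial rate is 0.5 and pass^8 is 0.25, because only one task in four is solved every time. If every task were instead an independent coin flip, pass^8 would be 0.5 ** 8, about 0.004, so the way successes are spread across tasks matters as much as the average.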

Passing once is a demo. Passing eight times in a row is a product.

This is, in retrospect, exactly what bit me with the image model. The eval asked whether the approach could work, and the honest answer was yes. I read it as whether it would work, which is a different sentence with a much worse answer. The green checkmark showed me the ceiling. I shipped as if it were the floor.

4. What I trust instead

I don't think the fix is a better benchmark. Any fixed test, however clever, starts leaking into training data the day it gets popular and saturates the day models get good. So the honest move is to stop hunting for a number that can't be beaten and start collecting habits that survive the number being beaten. Four of them have earned my trust, and they do different jobs.

First, prefer tests that resist contamination by construction, not by promise. A benchmark built on private codebases that labs cannot legally train on, like the private side of SWE-bench Pro, tells you more than one assembled from public GitHub issues, however carefully the public set was filtered. Impossibility is a stronger guarantee than secrecy.

Second, run the cheap adversarial baseline before you believe a score. The do-nothing agent and the spam agent cost almost nothing to write, and they report the benchmark's noise floor on the spot. If a benchmark won't tell you what a null agent scores, treat its headline number as unaudited. This should be table stakes, the way a control group is in any other field that measures things.
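
To show how little this costs, here is a toy harness with a do-nothing agent and a spam agent. Everything in it is hypothetical: the Task type, the graders, and the prompts are invented for illustration and are not any real benchmark's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types: an agent maps a prompt to a transcript,
# and a grader maps a transcript to pass/fail.
Agent = Callable[[str], str]
Grader = Callable[[str], bool]


@dataclass
class Task:
    prompt: str
    grader: Grader


def do_nothing_agent(prompt: str) -> str:
    """Takes no action and returns an empty transcript."""
    return ""


def spam_agent(prompt: str) -> str:
    """Ignores the prompt and returns boilerplate text."""
    return "Done. Your request has been resolved. " * 5


def pass_rate(agent: Agent, tasks: list[Task]) -> float:
    """Fraction of tasks whose grader accepts the agent's transcript."""
    passed = sum(task.grader(agent(task.prompt)) for task in tasks)
    return passed / len(tasks)


# Three toy graders: one sound, two gullible in different ways.
tasks = [
    Task("Refund order 1234", lambda t: "refund issued for 1234" in t),
    Task("Decline an invalid return", lambda t: "refund issued" not in t),
    Task("Close the ticket", lambda t: "resolved" in t),
]

for name, agent in [("do-nothing", do_nothing_agent), ("spam", spam_agent)]:
    print(f"{name}: {pass_rate(agent, tasks):.0%}")
```

In this toy setup the do-nothing agent passes one task in three and the spam agent two in three, without either reading a prompt. The point of the sketch is the shape of the check, not the numbers.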

Third, ask for the reliability number, not the best-case one. pass@1 is a sales figure. pass^k is an operations figure. For anything that runs unattended, the second is the only one I would put in a plan.

Fourth, and this one I came to slowly: watch a moving target instead of a fixed score. METR, a nonprofit, stopped asking "what percent did it pass" and started asking "how long a task can it finish," tracking the duration of work a model completes at a given reliability. Their headline is a doubling time, lately around three months, with confidence intervals they are refreshingly honest about being wide. A velocity is harder to fake than a position, because to fake it you would have to keep faking it, faster, forever.
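
A doubling time turns into a projection with one line of arithmetic, and the same line shows why wide intervals matter. This is a sketch under stated assumptions: the one-hour starting horizon is hypothetical, and the 2-month and 6-month alternatives are illustrative bounds I picked, not METR's published intervals.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float) -> float:
    """Task length reachable after months_ahead, given a fixed doubling time."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Hypothetical 1-hour starting horizon, projected 12 months out
# under three different assumed doubling times.
for doubling in (2.0, 3.0, 6.0):
    horizon = projected_horizon(1.0, months_ahead=12, doubling_months=doubling)
    print(f"doubling every {doubling:g} months -> {horizon:g} hours")
```

A three-month doubling time means a sixteenfold increase in a year. Shift the doubling time to two or six months and the same year gives 64 times or 4 times, which is why the width of the interval deserves as much attention as the headline.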

Two of those four come from outside the companies whose models get ranked: an academic lab and a nonprofit. That is the bias I promised to defend, so here it is, plainly. A benchmark published by the maker of the model it flatters is a press release with a methodology section. I still read it. I just read it the way I read any number whose author profits from my believing it. The independent work is slower, less polished, and worse at marketing, and it is the only kind that has not burned me yet.

None of this is a single instrument. It is a fixed test for the things that stay still, a moving one for the things that don't, an adversarial sanity check on both, and a standing suspicion of whoever is holding the scoreboard. Different jobs, different tools.

Coda

I still don't have a number I fully believe. I have something more useful, which is a sharp memory of the week I believed one. The image model didn't teach me that evaluation is hopeless. It taught me that a benchmark answers exactly the question it was built to answer, and the only dangerous moment is the one where you forget which question that was. The score was never lying to me. I was reading it as the answer to a question nobody had asked it.

Sources

UIUC Kang Lab, "Establishing Best Practices for Building Rigorous Agentic Benchmarks" (2025), and the accompanying Agentic Benchmark Checklist project page. Source of the do-nothing/spam baseline results, the validity audits of ten benchmarks, and the "up to ~40% misrepresentation" figure.
Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan, "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Sierra, 2024). Source of the pass^k reliability metric and the pass^8 figures.
METR, "Time Horizon 1.1" (29 January 2026). Source of the time-horizon framing and doubling-time figures.
Scale AI, "SWE-bench Pro" and the private commercial leaderboard. Source of the public-versus-private score gap.
