by joshwa on 2/21/25, 5:59 PM with 116 comments
by comex on 2/21/25, 8:52 PM
For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop". But if you look at the diff, that's clearly wrong. The try-except block and the running-loop check were already there before the patch. The human patch just indented them, making them appear as both - and + lines, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
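For reference, a rough sketch of how the two patches order the checks, paraphrased from the description above rather than quoting the literal diffs (RuntimeError stands in for Django's SynchronousOnlyOperation):

    import asyncio
    import os

    def human_patch_check(message):
        # Human patch: the env-var check wraps the pre-existing loop check,
        # so the loop probe is skipped entirely when the override is set.
        if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
            try:
                asyncio.get_running_loop()
            except RuntimeError:
                pass
            else:
                raise RuntimeError(message)

    def ai_patch_check(message):
        # AI patch: the loop probe runs first; the env var is only consulted
        # when a running event loop is actually found.
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass
        else:
            if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
                raise RuntimeError(message)

With the env var set, the human ordering returns after one dict lookup while the AI ordering still probes for a loop; with it unset and no loop running, the AI ordering does only the probe while the human ordering does both.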
For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Iterating over a dictionary in Python just gives you the keys, so reversing the dict and reversing its keys view yield the same sequence; it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
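A quick check of the equivalence (this relies on dicts supporting reversed(), i.e. Python 3.8+; `d` is just a stand-in for `self.dict`):

    d = {"a": 1, "b": 2, "c": 3}
    # Both spellings walk the same keys in the same (reverse-insertion) order.
    assert list(reversed(d)) == list(reversed(d.keys())) == ["c", "b", "a"]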
Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.
by modeless on 2/21/25, 6:51 PM
This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.
by bearjaws on 2/21/25, 7:36 PM
OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try to write code with them and it's just okay.
They claim it can do PhD-level reasoning, but here I am not trusting it with basic computational thinking.
by ukFxqnLa2sBSBf6 on 2/21/25, 7:27 PM
1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?
2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.
3. For the agents writing patches, is test running part of their inner-loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how it could be, unless the tests aren't part of the repo.
by dang on 2/21/25, 9:42 PM
If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.
by semi-extrinsic on 2/21/25, 6:59 PM
Every quarter, you have a couple thousand volunteers each provide 2 GitHub issues from the past 3 months that are nontrivial to resolve and have strong test cases. Each volunteer then cross-checks 2 issues from other volunteers. The volunteers get a 1-month free subscription to some AI service in return.
This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
by optimalsolver on 2/21/25, 7:16 PM
1) No known solutions, so there's no "ground truth" dataset to train on
2) Presumably hard to solve
3) But easy to verify a solution if one is provided.
This, of course, is easier done on the STEM side of things, but how do you automatically test creativity, or philosophical aptitude?
by huac on 2/21/25, 7:34 PM
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
by perrygeo on 2/21/25, 9:08 PM
It's so vital that it's not leaked and that it's fit-for-purpose and manually assessed. These general purpose, public benchmarks based on questionable metrics are effectively worthless to assess real programming skill.
Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but vastly better than the alternatives in practice. I don't trust Claude fully but far more than OpenAI models; it's not even close. The IRL performance advantage is not reflected in any of these benchmarks.
by brap on 2/21/25, 7:26 PM
by MattDaEskimo on 2/21/25, 7:28 PM
Instead of resolving it, some leaders are further complicating what benchmarks mean.
Such as OpenAI grading their benchmarks based on "how much money they made" or "how easy a model was convinced to hand over fake money".
by otterley on 2/21/25, 7:08 PM
I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.
by 1024core on 2/21/25, 9:12 PM
Or, as in the case of LLMs and benchmarks: When a benchmark becomes a target, it ceases to be a good benchmark.
by OldGreenYodaGPT on 2/21/25, 8:40 PM
This is fine; many of my real tickets already explain the solution. A good ticket often proposes a solution or points to where to start looking.
by ionwake on 2/21/25, 8:44 PM
by shayanh on 2/21/25, 8:21 PM
To me, the analysis of SWE-Bench is a solid and informative contribution. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.
by acc_297 on 2/21/25, 6:47 PM
Is this what Hofstadter means by a strange-loop?
by alalv on 2/22/25, 8:35 AM
by htrp on 2/21/25, 8:01 PM