by zone411 on 2/18/25, 5:25 AM with 74 comments
by Tiberium on 2/18/25, 10:10 PM
by CSMastermind on 2/18/25, 11:23 PM
by Snuggly73 on 2/19/25, 6:38 AM
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems into the code, and then asks the LLM to fix them, testing against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
Looking at the Upwork problem (https://github.com/Expensify/App/issues/14958), the issue was mainly that they were using one common regex to validate across all countries, so the solution had to introduce country-specific regexes, etc.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding a comma to two countries....
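To make the difference concrete, here is a rough sketch of what country-specific validation looks like compared with a one-character regression. Everything in it is my own illustration: the patterns, the `POSTAL_PATTERNS` table, and the `BUGGY_US` pattern are hypothetical and are not taken from the Expensify repo or from the benchmark's diff.

```python
import re

# Hypothetical per-country patterns, for illustration only.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d"),
}

# Loose fallback for countries without a dedicated pattern (an assumption).
FALLBACK_PATTERN = re.compile(r"[A-Za-z0-9 -]{2,10}")

# Hypothetical regression: a stray comma inside the character class
# silently widens what the pattern accepts.
BUGGY_US = re.compile(r"[\d,]{5}(-\d{4})?")


def is_valid_postal_code(country: str, code: str) -> bool:
    """Validate a postal code against the pattern for the given country."""
    pattern = POSTAL_PATTERNS.get(country, FALLBACK_PATTERN)
    return pattern.fullmatch(code.strip()) is not None


if __name__ == "__main__":
    print(is_valid_postal_code("US", "12345"))       # well-formed US code
    print(is_valid_postal_code("US", "12,45"))       # rejected by the correct pattern
    print(BUGGY_US.fullmatch("12,45") is not None)   # accepted by the buggy pattern
```

Designing the per-country table is the real work in the original issue; spotting a stray character in an existing pattern is a much smaller task.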
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.
by runako on 2/18/25, 11:15 PM
Does this work as an experiment if the questions under test were also used to train the LLMs?
by westurner on 2/19/25, 12:26 PM
What could be costed in an Upwork or a Mechanical Turk task value?
Task Centrality or Blockingness estimation: precedence edges, tsort topological sort, graph metrics like centrality
Task Complexity estimation: story points, planning poker, relative local complexity scales
Task Value estimation: cost/benefit analysis, marginal revenue
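A minimal sketch of the first item, assuming tasks are described by a map from each task to the tasks it depends on. The task names and the `blockingness` score (a count of transitively blocked downstream tasks) are hypothetical stand-ins for whatever centrality metric you would actually use; `graphlib.TopologicalSorter` is in the Python standard library from 3.9 on.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDS_ON = {
    "design_schema": set(),
    "write_api": {"design_schema"},
    "write_ui": {"write_api"},
    "write_docs": {"write_api"},
    "release": {"write_ui", "write_docs"},
}


def execution_order(depends_on):
    """Return the tasks in an order that respects every precedence edge."""
    return list(TopologicalSorter(depends_on).static_order())


def blockingness(depends_on):
    """Count, for each task, how many other tasks it transitively blocks."""
    # Invert the edges: task -> tasks that directly depend on it.
    dependents = {}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(task)

    def downstream(task, seen):
        for nxt in dependents.get(task, ()):
            if nxt not in seen:
                seen.add(nxt)
                downstream(nxt, seen)
        return seen

    return {task: len(downstream(task, set())) for task in depends_on}


if __name__ == "__main__":
    print(execution_order(DEPENDS_ON))
    print(blockingness(DEPENDS_ON))
```

A task that blocks many others is arguably worth more than its local complexity suggests, which is one way to fold centrality into a task's price.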
by bufferoverflow on 2/18/25, 6:11 PM
by moralestapia on 2/18/25, 10:38 PM
On a non-pessimist note, I don't think the SWE role will disappear, but what's the best one could do to be prepared for this?
by comeonbro on 2/18/25, 11:10 PM
Notably missing: o3
Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
by neilv on 2/18/25, 10:43 PM
by ctoth on 2/19/25, 4:31 PM
by colesantiago on 2/18/25, 5:35 AM
OpenAI's AGI mission statement
> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."
https://openai.com/index/how-should-ai-systems-behave/
I would have to admit some humility as I sort of brought this on myself [1]
> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the 'Will Smith Eats Spaghetti' video tests
https://news.ycombinator.com/item?id=43032191
But curiously the question is still valid.
Related:
Sam Altman: 50¢ of compute of a SWE Agent can yield "$500 or $5k of work."