from Hacker News

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

by zone411 on 2/18/25, 5:25 AM with 74 comments

by Tiberium on 2/18/25, 10:10 PM
The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best for real world tasks and not some one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot which is objectively worse on code than the newer 20241022 (so-called v2).
by CSMastermind on 2/18/25, 11:23 PM
I hire software engineers off Upwork. Part of our process is a 1-hour screening take home question that we ask people to solve. We always do a main one and an alternate for each role. I've tested all of ours on each of the main models and none have been able to solve any of the screening questions yet.
by Snuggly73 on 2/19/25, 6:38 AM
First time commenter - I was so triggered by this benchmark, so I just had to come out of lurking.
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems back into the code, and then asks the LLM to fix them, testing against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
The Upwork problem (https://github.com/Expensify/App/issues/14958) was mainly that they were using one common regex to validate across all countries, so the solution had to introduce country-specific regexes etc.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding , to two countries....
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.
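To make the zip code case above concrete, here is a minimal sketch of per-country validation and of how a stray comma in one pattern would break it. The patterns, the `is_valid_zip` helper, and the "accept when no rule exists" fallback are all simplified assumptions for illustration; none of this is Expensify's code or the benchmark's actual diff.

```python
import re

# Hypothetical, simplified patterns for illustration only.
ZIP_PATTERNS = {
    "US": r"\d{5}(-\d{4})?",
    "CA": r"[A-Z]\d[A-Z] ?\d[A-Z]\d",
    "GB": r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",
}


def is_valid_zip(country, zip_code, patterns=ZIP_PATTERNS):
    """Validate zip_code against the pattern registered for country."""
    pattern = patterns.get(country)
    if pattern is None:
        # Assumption: countries without a rule are accepted as-is.
        return True
    return re.fullmatch(pattern, zip_code.strip().upper()) is not None


# One reading of the "reintroduced bug": a stray comma inside a pattern,
# so that previously valid codes no longer match.
BUGGY_PATTERNS = dict(ZIP_PATTERNS, US=r"\d{5},(-\d{4})?")

print(is_valid_zip("US", "12345"))                  # correct table
print(is_valid_zip("US", "12345", BUGGY_PATTERNS))  # table with stray comma
```

A fix for a bug of this shape is a one-character change, which is a much smaller task than designing the country-specific validation in the first place.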
by runako on 2/18/25, 11:15 PM
It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLM. (It is not clear from my scan whether the actual answers are also in the public corpus.)
Does this work as an experiment if the questions under test were also used to train the LLMs?
by westurner on 2/19/25, 12:26 PM
> By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
What could be costed into the value of an Upwork or a Mechanical Turk task?
Task Centrality or Blockingness estimation: precedence edges, tsort topological sort, graph metrics like centrality
Task Complexity estimation: story points, planning poker, relative local complexity scales
Task Value estimation: cost/benefit analysis, marginal revenue
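As a rough illustration of the "blockingness" idea in the list above, the sketch below orders a small task graph topologically and counts how many tasks sit downstream of each one. The task names and the `blockingness` helper are hypothetical, and the downstream count is only a crude stand-in for a proper centrality metric.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of its prerequisites.
DEPS = {
    "design_api": set(),
    "build_backend": {"design_api"},
    "build_frontend": {"design_api"},
    "integration_tests": {"build_backend", "build_frontend"},
    "release": {"integration_tests"},
}


def execution_order(deps):
    """Return one valid order in which prerequisites come first."""
    return list(TopologicalSorter(deps).static_order())


def blockingness(deps):
    """Count, for each task, the tasks that transitively depend on it."""
    # Invert the precedence edges: prerequisite -> direct dependents.
    dependents = {task: set() for task in deps}
    for task, prereqs in deps.items():
        for prereq in prereqs:
            dependents.setdefault(prereq, set()).add(task)

    def downstream(task, seen):
        for child in dependents.get(task, ()):
            if child not in seen:
                seen.add(child)
                downstream(child, seen)
        return seen

    return {task: len(downstream(task, set())) for task in dependents}


print(execution_order(DEPS))
print(blockingness(DEPS))
```

A task that blocks many others could arguably be priced higher than its local complexity alone suggests, which is the kind of signal a flat per-task payout does not capture.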
by bufferoverflow on 2/18/25, 6:11 PM
And how do you evaluate if the task was completed correctly? There are nearly infinite ways to solve a given software dev problem, if the problem isn't trivial (and I hope they are not benchmarking trivial problems).
by moralestapia on 2/18/25, 10:38 PM
The writing is very clearly on the wall.
On a non-pessimist note, I don't think the SWE role will disappear, but what's the best one could do to be prepared for this?
by comeonbro on 2/18/25, 11:10 PM
Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)
Notably missing: o3
Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
by neilv on 2/18/25, 10:43 PM
"SWE-Lancer", like, skewering SWEs with a lance?
by ctoth on 2/19/25, 4:31 PM
Gonna lance them SWEs like a boil!
by colesantiago on 2/18/25, 5:35 AM
Can anyone explain how this research benefits humanity for OpenAI's mission?
OpenAI's AGI mission statement
> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."
https://openai.com/index/how-should-ai-systems-behave/
I would have to admit some humility as I sort of brought this on myself [1]
> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the "Will Smith Eats Spaghetti" video tests
https://news.ycombinator.com/item?id=43032191
But curiously the question is still valid.
Related:
Sam Altman: 50¢ of compute of a SWE Agent can yield "$500 or $5k of work."
https://news.ycombinator.com/item?id=43032098
https://x.com/vitrupo/status/1889720371072696554