by 0xedb on 8/12/24, 2:39 PM with 12 comments
by mronetwo on 8/12/24, 5:03 PM
I'm just not seeing a machine that is "likely correct", constantly interrupting the "operator", as that much of a win. I have seen some software influencers reflect on how much more fun coding is after dropping the LLM assistant.
All of these feel like offerings to the Productivity God. As a salaried guy, I'll never get excited about being able to do more during my work day. It's already easy to hit my capacity.
by Bjorkbat on 8/12/24, 5:25 PM
A while back, someone on Twitter seemed to confirm that Claude 3.5 was aware of the GitHub issues inside the dataset, since it could mention them by name, but I couldn't find the original post.
30% performance on the full SWE-bench benchmark is quite the leap, but just how "real" an achievement is it? Anecdotal reports say GPT-4o is marginally better than GPT-4 Turbo at best, and yet agents leveraging the LLM still performed better.
What would happen if SWE-bench was updated, top to bottom, with completely new GitHub issues? Would all these agents just completely shit the bed?
by log101 on 8/12/24, 5:24 PM
closes the tab