from Hacker News

OpenAI o1 Results on ARC-AGI-Pub

by z7 on 9/13/24, 10:14 PM with 118 comments

by killthebuddha on 9/14/24, 3:14 AM
In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:
https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...
Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
by Stevvo on 9/14/24, 3:01 AM
"Greenblatt" shown with 42% in the bar chart is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248
So, how well might o1 do with Greenblatt's strategy?
by w4 on 9/14/24, 2:22 AM
> o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.
Sheesh. We're going to need more compute.
by fsndz on 9/14/24, 1:26 AM
As expected, I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
by GaggiX on 9/14/24, 1:15 AM
It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.
That being said, the ARC-agi test is mostly a visual test that would be much easier to beat when these models will truly be multimodal (not just appending a separate vision encoder after training) in my opinion.
I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.
by alphabetting on 9/13/24, 10:37 PM
This is best AGI benchmark out there in my opinion. Surprising results that underscore how good Sonnet is.
by mrcwinn on 9/14/24, 3:03 AM
How is Anthropic accomplishing this despite (seemingly) arriving later?What advantage do they have?
by fancyfredbot on 9/14/24, 10:02 AM
I found the level headed explanation of why log linear improvements in test score with increased compute aren't revolutionary the best part of this article. That's not to say the rest wasn't good too! One of the best articles on o1 I've read.
by benreesman on 9/14/24, 2:42 AM
The test you really want is the apples-to-apples comparison between GPT-4o faced with the same CoT and other context annealing that presumably, uh, Q* sorry Strawberry now feeds it (on your dime). This would of course require seeing the tokens you are paying for instead of being threatened with bans for asking about them.
Compared to the difficulty in assembling the data and compute and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to human click proxy" is just not at that same scale.
by Terretta on 9/14/24, 2:32 AM
TL;DR (direct quote):
“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”
“We still need new ideas for AGI.”
by ec109685 on 9/14/24, 3:31 AM
Why is this considered such a great AGI test? It seems possible to extensively train a model on the algorithms used to solve these cases, and some cases feel beyond what a human could straightforwardly figure out.
by a_wild_dandan on 9/14/24, 3:21 AM
This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.
by lossolo on 9/14/24, 2:37 PM
It seems like o1 is a lot worse than Claude on coding tasks https://livebench.ai
by perching_aix on 9/14/24, 8:15 AM
Is it possible for me, a human, to undertake these benchmarks?
by Alifatisk on 9/14/24, 11:11 AM
This is a great marketing for Anthropic
by meowface on 9/14/24, 1:34 AM
Takeaway:
>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
Scores:
>GPT-4o: 9%
>o1-preview: 21%
>Claude 3.5 Sonnet: 21%
>MindsAI: 46% (current highest score)
by bulbosaur123 on 9/14/24, 7:49 AM
Ok, I have a practical question. How do I use this o1 thing to view codebase for my game app and then simply add new features based on my prompts? Is it possible rn? How?
by devit on 9/14/24, 9:52 AM
Am I missing something or this "ARC-AGI" thing is so ludicrously terrible that it seems to be completely irrelevant?
It seems that the tasks consists of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.
The problem is of course that the transformation is not specified, so any answer is actually acceptable since one can always come up with a justification for it, and thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).
It's like those stupid tests that tell you "1 2 3 ..." and you are supposed to complete with 4, but obviously that's absurd since any continuation is valid given that e.g. you can find a polynomial that passes for any four numbers, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.
Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).
And if instead of AGI one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate language model performance by performance on predicting colored grid transformations.
For instance, since normal LLMs are not trained on colored grids, it means that any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" as the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite the fact that it is not really a better model in general.