by z7 on 9/13/24, 10:14 PM with 118 comments
by killthebuddha on 9/14/24, 3:14 AM
https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...
Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
by Stevvo on 9/14/24, 3:01 AM
So, how well might o1 do with Greenblatt's strategy?
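(For context: Greenblatt's approach was, roughly, to have GPT-4o sample thousands of candidate Python programs per task and keep only those that reproduce every training pair. A minimal sketch of that verify-and-filter idea, my own reconstruction rather than his actual code, where each candidate is Python source defining a transform(grid) function:)

    # Sketch of the sample-and-verify idea (illustrative reconstruction, not Greenblatt's code).
    def filter_candidates(candidates, train_pairs):
        survivors = []
        for src in candidates:
            scope = {}
            try:
                exec(src, scope)
                # Keep a program only if it reproduces every training output exactly.
                if all(scope["transform"](p["input"]) == p["output"] for p in train_pairs):
                    survivors.append(src)
            except Exception:
                pass  # malformed or crashing programs are simply discarded
        return survivors

    # Example: one sampled candidate that happens to implement "mirror each row".
    candidates = ["def transform(grid):\n    return [row[::-1] for row in grid]"]
    train_pairs = [{"input": [[1, 0]], "output": [[0, 1]]}]
    print(filter_candidates(candidates, train_pairs))  # the candidate survives verification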
by w4 on 9/14/24, 2:22 AM
Sheesh. We're going to need more compute.
by fsndz on 9/14/24, 1:26 AM
by GaggiX on 9/14/24, 1:15 AM
That being said, the ARC-AGI test is mostly a visual test that will be much easier to beat once these models are truly multimodal (not just a separate vision encoder bolted on after training), in my opinion.
I wonder what the graph will look like a year from now; the models have improved a lot in the last one.
by alphabetting on 9/13/24, 10:37 PM
by mrcwinn on 9/14/24, 3:03 AM
by fancyfredbot on 9/14/24, 10:02 AM
by benreesman on 9/14/24, 2:42 AM
Compared to the difficulty of assembling the data, compute, and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to a human click proxy" is just not at the same scale.
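(For a sense of what such a "straightforward, differentiable loss" could look like, here is a minimal sketch of a Bradley-Terry-style pairwise preference loss — my illustration, not OpenAI's actual training code — where an auxiliary model scores the preferred and rejected CoT framings:)

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(reward_preferred, reward_rejected):
        # Bradley-Terry style: maximize P(preferred beats rejected) = sigmoid(r_p - r_r).
        return -F.logsigmoid(reward_preferred - reward_rejected).mean()

    # The rewards would come from a scoring model applied to the two CoT framings.
    loss = pairwise_preference_loss(torch.tensor([1.2]), torch.tensor([0.3]))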
by Terretta on 9/14/24, 2:32 AM
“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”
“We still need new ideas for AGI.”
by ec109685 on 9/14/24, 3:31 AM
by a_wild_dandan on 9/14/24, 3:21 AM
by lossolo on 9/14/24, 2:37 PM
by perching_aix on 9/14/24, 8:15 AM
by Alifatisk on 9/14/24, 11:11 AM
by meowface on 9/14/24, 1:34 AM
>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
Scores:
>GPT-4o: 9%
>o1-preview: 21%
>Claude 3.5 Sonnet: 21%
>MindsAI: 46% (current highest score)
by bulbosaur123 on 9/14/24, 7:49 AM
by devit on 9/14/24, 9:52 AM
It seems that the task consists of giving the model examples of a transformation from an input colored grid to an output colored grid, and then asking it to provide the output for a given input.
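(For readers who haven't seen the format, a toy task in the spirit of ARC, written as Python data — a made-up example, not an actual ARC task — where grids are small arrays of color indices and the hidden rule is "mirror each row":)

    # Hypothetical mini-task: infer the transformation from the train pairs, apply it to the test input.
    task = {
        "train": [
            {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
            {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
        ],
        "test": [
            {"input": [[0, 5], [4, 0]]}  # expected output under the mirroring rule: [[5, 0], [0, 4]]
        ],
    }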
The problem, of course, is that the transformation is not specified, so any answer is arguably acceptable, since one can always come up with a justification for it; thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).
It's like those stupid tests that say "1 2 3 ..." and expect you to answer 4, but that's absurd: any continuation is valid, since e.g. you can find a polynomial that passes through any four points, and the test maker didn't provide any objective criterion for preferring one candidate algorithm over another.
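(Concretely: a degree-3 polynomial can be made to pass exactly through (1,1), (2,2), (3,3) and (4, y) for any y you like, e.g. with numpy:)

    import numpy as np

    # Fit a cubic exactly through (1,1), (2,2), (3,3), (4,42): four points, four coefficients.
    coeffs = np.polyfit([1, 2, 3, 4], [1, 2, 3, 42], deg=3)
    print(np.polyval(coeffs, [1, 2, 3, 4]))  # ~[1, 2, 3, 42], so "42" is a "valid" continuation too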
Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).
And if, instead of AGI, one is just trying to evaluate how well the model predicts how the average human thinks, then it makes no sense to measure a language model by its performance on predicting colored grid transformations.
For instance, since normal LLMs are not trained on colored grids, any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" to the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite not being a better model in general.