by alphabetting on 12/6/24, 7:20 PM with 56 comments
by mikeknoop on 12/6/24, 7:52 PM
One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.
But this flipped late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning like program synthesis and program search.
I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning-guided program synthesis and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.
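Roughly, "test-time training" means fine-tuning the model on a task's own demonstration pairs right before predicting its test output. A minimal sketch of the idea, assuming a Hugging Face-style causal LM (the model choice, function name, and hyperparameters here are illustrative, not any winner's actual code):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def solve_task_with_ttt(task, model_name="gpt2", steps=20):
        """Fine-tune on the task's own demos, then predict the test output."""
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

        # Each ARC task ships a few input->output demonstrations; train on them.
        model.train()
        for _ in range(steps):
            for pair in task["train"]:
                text = f"input: {pair['input']} output: {pair['output']}"
                batch = tok(text, return_tensors="pt")
                loss = model(**batch, labels=batch["input_ids"]).loss
                loss.backward()
                opt.step()
                opt.zero_grad()

        # Predict the held-out test output with the freshly adapted weights.
        model.eval()
        prompt = f"input: {task['test'][0]['input']} output:"
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=256)
        return tok.decode(out[0][ids.shape[1]:])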
And I'm proud to say that all the code and papers from this year's winners are now open source!
We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)
The ARC-AGI community keeps surprising me. From initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.
We'll be back in 2025!
by celeritascelery on 12/6/24, 8:24 PM
They do mention that some of the tasks here are susceptible to brute force and they plan to address that in ARC-AGI-2.
> nearly half (49%) of the private evaluation set was solved by at least one team during the original 2020 Kaggle competition all of which were using some variant of brute-force program search. This suggests a large fraction of ARC-AGI-1 tasks are susceptible to this kind of method and does not carry much useful signal towards general intelligence.
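"Brute-force program search" here means enumerating compositions of grid primitives until one reproduces every demonstration pair exactly. A toy sketch of the approach (the 2020 Kaggle entries used far richer DSLs; everything named here is hypothetical):

    from itertools import product

    import numpy as np

    # A handful of grid-to-grid primitives; real DSLs have dozens.
    PRIMITIVES = {
        "identity": lambda g: g,
        "flip_h": lambda g: np.fliplr(g),
        "flip_v": lambda g: np.flipud(g),
        "rot90": lambda g: np.rot90(g),
        "transpose": lambda g: g.T,
    }

    def search(train_pairs, max_depth=3):
        # train_pairs: list of (input_grid, output_grid) nested int lists.
        # Enumerate primitive compositions, shortest first, and return the
        # first one that reproduces every demonstration output exactly.
        for depth in range(1, max_depth + 1):
            for names in product(PRIMITIVES, repeat=depth):
                def program(g, names=names):
                    for n in names:
                        g = PRIMITIVES[n](g)
                    return g
                if all(np.array_equal(program(np.array(i)), np.array(o))
                       for i, o in train_pairs):
                    return names  # e.g. ("rot90", "flip_h")
        return None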
by YeGoblynQueenne on 12/6/24, 8:56 PM
So what happened to that assumption? Is it dead?
The second question I still have is about the defenses of ARC against memorisation-based, big-data approaches. I note that the second-best system is based on an LLM with "test time training" where the first two steps are:
- initial finetuning on similar tasks
- auxiliary task format and augmentations
Which is to say, a data augmentation approach. With big data comes great responsibility, and the authors of the second-best system don't disappoint: they claim that by training on more examples they achieve reasoning. So what happened to the claim that ARC is secure against big-data approaches? Is it dead?
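For the record, the augmentations in question exploit the fact that an ARC demonstration pair stays a valid pair under dihedral transforms and color relabelings, so each example can be multiplied many times over. A rough sketch of that kind of pipeline (illustrative names, not the authors' code):

    import random

    import numpy as np

    def augment(pair, rng=random):
        # pair: {"input": grid, "output": grid}, grids of ints 0-9.
        inp, out = np.array(pair["input"]), np.array(pair["output"])

        # Random dihedral transform, applied identically to input and output.
        k = rng.randrange(4)
        inp, out = np.rot90(inp, k), np.rot90(out, k)
        if rng.random() < 0.5:
            inp, out = np.fliplr(inp), np.fliplr(out)

        # Random permutation of the 10 ARC colors, shared across the pair.
        perm = list(range(10))
        rng.shuffle(perm)
        lut = np.array(perm)
        return {"input": lut[inp].tolist(), "output": lut[out].tolist()}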
by nnx on 12/7/24, 3:00 AM
Either I'm really dumb or the test is getting into captcha-like territory, where humans aren't really good at solving or deciphering the tasks anymore.
by hulium on 12/6/24, 9:26 PM
by a_wild_dandan on 12/6/24, 11:45 PM
1. Existing brute force algorithms solve 40% of this "reasoning" and "generalization" test.
2. AGI must evidently fit on a single 16GB, decade-old GPU?
3. If ARC fails blind people, it's not a reasoning test. Reasoning is independent of visual acuity. So ARC is at best a vision processing then reasoning test. SotA model "failure" is meaningless. ("But what about the other format, JSON?" Yeah, I would love to see the human solve rate on that...)
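For context on that last point: an ARC task's raw representation is JSON, integer grids (colors 0-9) under "train" and "test" keys. A made-up task in that layout, written as a Python literal:

    # Hypothetical ARC task in the public repo's layout: no pixels,
    # just nested integer arrays.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        ],
        "test": [
            {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
        ],
    }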
by jebarker on 12/7/24, 4:49 AM