by mikeknoop on 6/11/24, 5:19 PM with 337 comments
ARC-AGI is (to our knowledge) the only eval which measures AGI: a system that can efficiently acquire new skills and solve novel, open-ended problems. Most AI evals measure skill directly, rather than the ability to acquire new skills.
Francois created the eval in 2019; SOTA was 20% at inception and is only 34% today. Humans score 85-100%. 300 teams attempted ARC-AGI last year, and several bigger labs have tried it as well.
While most other skill-based evals have rapidly saturated to human level, ARC-AGI was designed to resist “memorization” techniques (e.g., LLMs).
Solving ARC-AGI tasks is quite easy for humans (even children) but impossible for modern AI. You can try ARC-AGI tasks yourself here: https://arcprize.org/play
ARC-AGI consists of 400 public training tasks, 400 public test tasks, and 100 secret test tasks. Every task is novel. SOTA is measured against the secret test set, which adds to the robustness of the eval.
Solving ARC-AGI tasks requires no world knowledge and no understanding of language. Instead, each puzzle requires a small set of “core knowledge priors” (goal directedness, objectness, symmetry, rotation, etc.).
At minimum, a solution to ARC-AGI opens up a completely new programming paradigm where programs can perfectly and reliably generalize from an arbitrary set of priors. At maximum, it unlocks the tech tree towards AGI.
Our goals with this competition are:

1. Increase the number of researchers working on frontier AGI research (vs tinkering with LLMs). We need new ideas, and the solution is likely to come from an outsider!

2. Establish a popular, objective measure of AGI progress that the public can use to understand how close we are to AGI (or not). Every new SOTA score will be published here: https://x.com/arcprize

3. Beat ARC-AGI and learn something new about the nature of intelligence.
Happy to answer questions!
by neoneye2 on 6/11/24, 11:07 PM
I'm collecting data on how humans solve ARC tasks, and have so far collected 4100 interaction histories (https://github.com/neoneye/ARC-Interactive-History-Dataset). Besides ARC-AGI, there are other ARC-like datasets; these can be tried in my editor (https://neoneye.github.io/arc/).
I have made some videos about ARC:
Replaying the interaction histories, you can see people have different approaches. It's 100ms per interaction; in real life people don't solve tasks that fast. https://www.youtube.com/watch?v=vQt7UZsYooQ
When I'm manually solving an ARC task, it looks like this, and you can see I'm rather slow. https://www.youtube.com/watch?v=PRdFLRpC6dk
What is weird: the way I implement a solver for a specific ARC task is much different from the way I would manually solve the puzzle, since the code has to deal with all kinds of edge cases.
Huge thanks to the team behind the ARC Prize. Well done.
by salamo on 6/11/24, 10:38 PM
If I can make one criticism/observation of the tests, it seems that most of them involve reasoning about perfect information in the game-theoretic sense. However, many if not most of the more challenging problems we encounter involve hidden information. Poker and negotiations are examples of problem solving in imperfect-information scenarios. Smoothly navigating social situations also involves a related problem: working with hidden information.
One of the really interesting things we humans are able to do is to take the rules of a game and generate strategies. While we do have some algorithms which can "teach themselves" e.g. to play go or chess, those same self-play algorithms don't work on hidden information games. One of the really interesting capabilities of any generally-intelligent system would be synthesizing a general problem solver for those kinds of situations as well.
by lacker on 6/11/24, 9:02 PM
Would an intelligent but blind human be able to solve these problems?
I'm worried that we will need more than 800 examples to solve these problems, not because the abstract reasoning is so difficult, but because the problems require spatial knowledge that we intelligent humans learn with far more than 800 training examples.
by pmayrgundter on 6/11/24, 9:49 PM
In the ConceptARC paper [ConceptArc], they question the ease of Chollet's tests: "One limitation on ARC’s usefulness for AI research is that it might be too challenging. Many of the tasks in Chollet’s corpus are difficult even for humans, and the corpus as a whole might be sufficiently difficult for machines that it does not reveal real progress on machine acquisition of core knowledge."

ConceptARC is designed to be easier, but then also has to filter ~15% of its own test takers for "[failing] at solving two or more minimal tasks... or they provided empty or nonsensical explanations for their solutions".

After this filtering, ConceptARC finds another 10-15% failure rate among humans on the main corpus questions, so combined they're seeing maybe 25-30% of people unable to solve these simpler questions meant to test for "AGI".
ConceptARC's main results show GPT-4 scoring well below the filtered humans, which would agree with a [Mensa] test result that its IQ=85.
Chollet and Mitchell could instead stratify their human groups to estimate IQ, then compare with the Mensa measures and see whether, e.g., Claude3@IQ=100 matches the ARC scores of their average human.
[ConceptArc] https://arxiv.org/pdf/2305.07141
[Mensa] https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...
by paxys on 6/11/24, 9:36 PM
I'd also urge you to use a different platform for communicating with the public because x.com links are now inaccessible without creating an account.
by elicksaur on 6/11/24, 9:23 PM
However, I do disagree that this problem represents “AGI”. It’s just a different dataset than what we’ve seen with existing ML successes, but the approaches are generally similar to what’s come before. It could be that some truly novel breakthrough which is AGI solves the problem set, but I don’t think solving the problem set is a guaranteed indicator of AGI.
by nadam on 6/12/24, 11:43 AM
by Animats on 6/12/24, 7:31 AM
That's a stretch. This is a problem at which LLMs are bad. That does not imply it's a good measure of artificial general intelligence.
After working a few of the problems, I was wondering how many different transformation rules the problem generator has. Not very many, it seems. So the problem breaks down into extracting the set of transformation rules from the data, then applying them to new problems. The first part of that is hard. It's a feature extraction problem. The transformations seem to be applied rigidly, so once you have the transformation rules, and have selected the ones that work for all the input cases, application should be straightforward.
This seems to need explicit feature extraction, rather than the combined feature extraction and exploitation LLMs use. Has anyone extracted the rule set from the test cases yet?
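For illustration, here's a minimal Python sketch of this "extract the rules, then apply them rigidly" approach. The transformation library is invented for the example; the actual rule set behind the tasks is unknown and far larger.

import numpy as np

# A tiny library of candidate transformations; a real solver would also need
# recoloring, object extraction, symmetry completion, and much more.
CANDIDATES = {
    "identity": lambda g: g,
    "rot90": lambda g: np.rot90(g, 1),
    "rot180": lambda g: np.rot90(g, 2),
    "rot270": lambda g: np.rot90(g, 3),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
}

def solve(train_pairs, test_input):
    # Feature extraction: find the first transformation consistent with
    # every training pair, then apply it rigidly to the test input.
    for name, f in CANDIDATES.items():
        if all(np.array_equal(f(np.array(x)), np.array(y)) for x, y in train_pairs):
            return name, f(np.array(test_input))
    return None  # rule not in our library

# Toy task whose hidden rule is a horizontal flip (a non-square grid keeps
# the rotations from matching by accident).
train = [([[1, 2, 0], [0, 0, 0]], [[0, 2, 1], [0, 0, 0]])]
print(solve(train, [[3, 0, 0], [0, 4, 0]]))  # ('flip_h', array([[0, 0, 3], [0, 4, 0]]))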
by levocardia on 6/11/24, 11:46 PM
Defining intelligence as efficiency of learning, after accounting for any explicit or implicit priors about the world, makes it much easier to understand why human intelligence is so impressive.
by bigyikes on 6/11/24, 9:39 PM
by itissid on 6/12/24, 12:35 AM
What about Theory of Mind, which concerns the problem of multiple agents in the real world acting together? Driving a car, for example, cannot be done right now without oodles of data, nor can any robot-human problem that requires the robot to model the human's goals and intentions.

I think the problem is the definition of general intelligence: intelligence in the context of what? How much effort (kWh, $$, etc.) is the human willing to amortize over the learning cycle of a machine to teach it what it needs to do, and how does that relate to a personally needed outcome (like making me a sandwich or constructing a house)? Hopefully this should decrease over time.

I believe the answer is that the only intelligence that really matters is Human-AI cooperative intelligence: our goals and whether a machine understands them. The problems then need to be framed as optimization of a multi-attribute goal, with the attribute weights adjusted as one learns from the human.

I know a few labs working on this: one is at ASU (Kambhampati, Rao et al.), and possibly Google, and now maybe OpenAI.
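As a toy illustration of that framing (my own sketch, not how any of these labs actually implement it): score candidate plans by a weighted sum of goal attributes, and nudge the weights toward whichever plan the human picks.

import numpy as np

def score(attributes, weights):
    # attributes: e.g. [speed, safety, cost] of a candidate plan
    return float(np.dot(attributes, weights))

def update_weights(weights, chosen, rejected, lr=0.1):
    # Perceptron-style update: shift weight toward the attributes of the
    # plan the human picked and away from the one they rejected.
    w = weights + lr * (np.array(chosen) - np.array(rejected))
    w = np.clip(w, 0.0, None)
    return w / max(w.sum(), 1e-9)  # renormalize so weights stay comparable

# The human repeatedly prefers the safer plan; the safety weight grows.
w = np.array([1/3, 1/3, 1/3])
for _ in range(5):
    w = update_weights(w, chosen=[0.2, 0.9, 0.5], rejected=[0.8, 0.3, 0.5])
print(w)  # the second (safety) weight has increased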
by ks2048 on 6/12/24, 5:34 AM
So you can view 100 per page instead of clicking through one-by-one: https://kts.github.io/arc-viewer/page1/
by bigyikes on 6/11/24, 9:35 PM
Is there something special about these questions that makes them resistant to memorization? Or is it more just the fact that there are 100 secret tasks?
by btbuildem on 6/12/24, 1:13 PM
1: https://www.crn.com/news/applications-os/220100498/researche...
by dang on 6/12/24, 1:17 AM
Francois Chollet: OpenAI has set back the progress towards AGI by 5-10 years - https://news.ycombinator.com/item?id=40652818 - June 2024 (5 comments)
by nmca on 6/12/24, 6:47 AM
https://manifold.markets/JacobPfau/will-the-arcagi-grand-pri...
by Lerc on 6/11/24, 8:52 PM
Not sure if I have the skills to make an entry, but I'll be watching at least.
by visarga on 6/12/24, 6:15 AM
This scales to 200M users and 1 billion sessions per month for OpenAI, which can interpret every human response as a feedback signal, implicit or explicit. Even more if you count multiple chat sessions spread over days that continue the same topic and incorporate real-world feedback. The scale of interaction is just staggering; the LLM can incorporate this experience to iteratively improve.

If you take a look at humans, we're very incapable alone. Think of a feral Einstein on a remote island: what could he achieve without social context and language-based learning? Just as a human brain is severely limited without society, LLMs also need society, diversity of agents and experiences, and sharing of those experiences in language.

It is unfair to compare a human immersed in society with a standalone model. That is why they appear limited. But even as a system of memorization+recombination they can be a powerful element of AGI. I think AGI will be social and distributed, not a singleton. Its evolution is based on learning from the world, no longer just parroting human text. The data engine would be: World <-> People <-> LLM, a full feedback cycle where all three components evolve over time. Intelligence evolves socially.
by logicallee on 6/11/24, 10:55 PM
>Happy to answer questions!
1. Can humans take the complete test suite? Has any human done so? Is it timed? How long does it take a human? What is the highest score a human who sat down and took the ARC-AGI test has achieved?
2. How surprised would you be if a new model jumped to scoring 100% or nearly 100% on ARC-AGI (including the secret test tasks)? What kind of test would you write next?
by mkl on 6/12/24, 11:40 AM
Here's how I understand the rule: yellow blobs turn green then spew out yellow strips towards the blue line, and the width of the strips is the number of squares the green blobs take up along the blue line. The yellow strips turn blue when they hit the blue line, then continue until they hit red, then they push the red blocks all the way to the other side, without changing the arrangement of the red blocks that were in the way of the strip.
The first example violates the last bit. The red blocks in the way of the rightmost strip start as

R
R R
R R R

but get turned into

R R
R R
R R R

Every other strip matches my rule.

by Retr0id on 6/14/24, 1:20 PM
The current batch of LLMs can be uncharitably summarized as "just predict the next token". They're pretty good at that. If they were perfect at it, they'd enable AGI - but it doesn't look like they're going to get there. It seems like the wrong approach. Among other issues, finite context windows seem like a big limitation (even though they're being expanded), and recursive summarization is an interesting kludge.
The ARC-AGI tasks seem more about pattern matching, in the abstract sense (but also literally). Humans are good at pattern matching, and we seem to use pattern matching test performance as a proxy for measuring human intelligence (like in "IQ" tests). I'm going to side-step the question of "what is intelligence, really?" by defining it as being good at solving ARC-AGI tasks.
I don't know what the solution is, but I have some idea of what it might look like - a machine with high-order pattern-matching capabilities. "high-order" as in being able to operate on multiple granularities/abstraction-levels at once (there are parallels here to recursive summarization in LLMs).
So what is the difference between "pattern matching" and "token prediction"? They're closely related, and you could use one to do the other. But the real difference is that in pattern matching there are specific patterns that you're matching against. If you're lucky you can even name the pattern/trope, but it might be something more abstract and nameless. These patterns can be taught explicitly, or inferred from the environment (i.e. "training data").
On the other hand, "token prediction" (as implemented today) is more of a probabilistic soup of variables. You can ask an LLM why it gave a particular answer and it will hallucinate something plausible for you, but the real answer is just "the weights said so". But a hypothetical pattern matching machine could tell you which pattern(s) it was matching against, and why.
So to summarize (hah), I think a good solution will involve high-order meta-pattern matching capabilities (natively, not emulated or kludged via an LLM-shaped interface). I have no idea how to get there!
by geor9e on 6/12/24, 6:12 AM
by visarga on 6/12/24, 6:54 AM
If the AI is really AGI, it could presumably do it. But not even the whole of human society can do it in one go; it's a slow iterative process of ideation and validation. Even though this is a life-and-death matter, we can't simply solve it.

This is why AGI won't look like we expect; it will be a continuation of how societies solve problems. The intelligence of a single AI in isolation is not comparable to that of societies of agents with diverse real-world interactions.
by freediver on 6/11/24, 6:35 PM
by nojvek on 6/11/24, 10:45 PM
I did a few human examples by hand, but gotta do more of them to start seeing patterns.
The human visual and auditory system is impressive. Most animals see/hear and plan from that without having much language. Physical intelligence is the biggest leg up when it comes to evolution optimizing for survival.
by nmca on 6/12/24, 6:53 AM
by skywhopper on 6/12/24, 11:23 AM
Speaking of extraordinary claims. What evidence is there that LLMs have “proven economic utility”? They’ve drawn a ludicrous amount of investment thanks to claims of future economic utility, but I’ve yet to see any evidence of it.
by PontifexMinimus on 6/13/24, 7:23 AM
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
But why restrict yourself to JSON that codes for 2-d coloured grids? Why not also allow:

{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": 1},
    {"input": [[0, 0], [4, 0]], "output": 4},
    {"input": [[0, 0], [6, 0]], "output": 6}
  ]
}
Where the rule might be to output the biggest number in the input, or add them up (and the solver has to work out which).
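A toy Python sketch of what "working out which" could look like, with an invented rule list: keep every candidate rule that is consistent with all the training pairs. With few examples, several rules may remain viable.

RULES = {
    "max": lambda grid: max(v for row in grid for v in row),
    "sum": lambda grid: sum(v for row in grid for v in row),
    "count_nonzero": lambda grid: sum(1 for row in grid for v in row if v),
}

def infer_rule(train):
    # Return the names of all rules consistent with every training pair.
    return [name for name, f in RULES.items()
            if all(f(ex["input"]) == ex["output"] for ex in train)]

train = [
    {"input": [[1, 0], [0, 0]], "output": 1},
    {"input": [[0, 0], [4, 0]], "output": 4},
    {"input": [[0, 0], [6, 0]], "output": 6},
]
print(infer_rule(train))  # ['max', 'sum']

On these particular training pairs, "output the biggest number" and "add them up" are indistinguishable, which is exactly the ambiguity the solver has to resolve with more examples.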
by curious_cat_163 on 6/12/24, 1:57 AM

However, why are the 100 test tasks secret? I don't understand how resisting “memorization” techniques requires it. Maybe someone can enlighten me.
by TheDudeMan on 6/12/24, 12:16 AM
by Geee on 6/12/24, 12:52 AM
by ryanoptimus on 6/12/24, 7:57 AM
by david_shi on 6/11/24, 10:16 PM
by jolt42 on 6/12/24, 3:17 AM
by abtinf on 6/11/24, 10:33 PM
This is treating “intelligence” like some abstract, platonic thing divorced from reality. Whatever else solving these puzzles is indicative of, it’s not intelligence.
by lxe on 6/11/24, 10:05 PM
by mewpmewp2 on 6/12/24, 3:58 AM
by z3phyr on 6/12/24, 4:48 AM
by chairhairair on 6/12/24, 1:19 AM
I bet you could use those puzzles as benchmarks as well.
by treprinum on 6/12/24, 11:06 AM
by KBme on 6/12/24, 6:30 AM
by ilaksh on 6/12/24, 3:51 AM
Things like Sora and GPT-4o that use [diffusion transformers etc., or whatever the SOTA is for multimodal large models] seem to be able to generalize quite well. Have these latest models been tested against this task?
by HarHarVeryFunny on 6/12/24, 11:20 AM
1) Who is providing the prize money, and if it is yourself and Francois personally, then what is your motivation ?
2) Do you think it's possible to create a word-based, non-spatial (not crosswords or sudoku, etc) ARC test that requires similar run-time exploration and combination of skills (i.e. is not amenable to a hoard of narrow skills)?
by p1esk on 6/12/24, 3:26 AM
by blendergeek on 6/12/24, 1:56 PM
Is there a "color-blind friendly" mode?
by PontifexMinimus on 6/13/24, 6:28 AM
- annoying animated background
- white text on black background
- annoying font choices
Which is unfortunate because (as I found when I used Firefox reader mode) you're discussing important and interesting stuff.
by mishamagic on 6/14/24, 6:30 AM
by bilsbie on 6/12/24, 3:32 AM
by arcastroe on 6/12/24, 4:21 AM
by djoldman on 6/12/24, 11:44 AM
Anyone else share the suspicion that ML rapidly approaching 100% on benchmarks is sometimes due to releasing the test set?
by ummonk on 6/12/24, 1:33 AM
It's rather surprising to me that neural nets that can learn to win at Go or chess can't learn to solve these sorts of tasks. Intuitively, I would have expected that a reinforcement learning solution, using a framework generating thousands of playground tasks similar to the public training tasks, would have been able to do far better than the actual SOTA. Of course, the training budget for this could very well be higher than the actual ARC-AGI prize amount...
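A rough sketch of the task-generation half of that idea (grid sizes and the transformation set here are invented for illustration): sample random grids, apply a randomly chosen known transformation, and emit synthetic tasks in the ARC JSON format for a learner to consume.

import numpy as np

TRANSFORMS = [np.fliplr, np.flipud, lambda g: np.rot90(g, 2)]

def make_task(n_train=3, seed=None):
    rng = np.random.default_rng(seed)
    f = TRANSFORMS[rng.integers(len(TRANSFORMS))]  # one hidden rule per task
    pairs = []
    for _ in range(n_train + 1):
        h, w = rng.integers(3, 8, size=2)
        grid = rng.integers(0, 10, size=(h, w))  # ARC grids use colors 0-9
        pairs.append({"input": grid.tolist(), "output": f(grid).tolist()})
    return {"train": pairs[:-1], "test": [pairs[-1]]}

print(make_task(seed=0))

The catch, and plausibly why this hasn't cracked ARC-AGI, is that such a generator only teaches the rules already in its library, while the hidden test tasks are deliberately novel.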
by lenerdenator on 6/12/24, 1:07 PM
by dskloet on 6/12/24, 3:54 AM
by flawn on 6/11/24, 11:48 PM
by chx on 6/12/24, 11:09 AM
by adamgordonbell on 6/12/24, 1:38 AM
by empath75 on 6/11/24, 10:43 PM
by lamontcg on 6/11/24, 11:31 PM
by s1k3s on 6/12/24, 2:00 AM
:)
by thatxliner on 6/12/24, 3:59 AM
by EternalFury on 6/12/24, 3:47 AM
by barfbagginus on 6/12/24, 3:16 AM
I feel like a prize of a billion dollars would be more effective.
But even if it was me, and even if the prize was a hundred billion dollars, I would still keep it under wraps, and use it to advance queer autonomous communism in a hidden way, until FALGSC was so strong that it would not matter if our AGI got scooped by capitalist competitors.
by m3kw9 on 6/11/24, 9:26 PM
by breck on 6/11/24, 8:49 PM
If you make your site public domain, and drop the (C), I'll compete.