by amrrs on 6/6/25, 6:18 PM with 270 comments
by jackdoe on 6/7/25, 10:54 AM
I am struggling a lot to see what the tech can and cannot do, particularly when designing systems with it, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by these models' capabilities: despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.
I think this kind of research is amazing, and we have to put tremendously more effort into understanding how to use the tokens and how to build with them.
[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...
by curious_cat_163 on 6/7/25, 12:34 AM
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to the economic value of the task.
For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.
by stephc_int13 on 6/7/25, 1:42 PM
I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.
We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.
by antics on 6/7/25, 1:56 AM
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
by gwd on 6/7/25, 12:40 PM
This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.
by actinium226 on 6/6/25, 11:59 PM
by avsteele on 6/8/25, 2:36 PM
My read of this is that the paper demonstrates that, for a particular model (and the problems examined with it), giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger models to handle more complex tasks. (NB: humans trend similarly.)
My concern is that people are extrapolating from this to conclusions about LLMs generally, and this is not warranted.
The only part of this I find even surprising is the abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problems. (Again, though, maybe you can say humans are the same here. You can overthink things.)
by thomasahle on 6/7/25, 7:42 AM
I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
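A quick back-of-the-envelope on why that's hard for any unaided step-by-step process, human or model (the per-step error rate here is an assumption for illustration, not a measured number):

```python
# Back-of-the-envelope: multiplying two 20-digit numbers by hand takes roughly
# n*n single-digit multiply/add steps. Assume (hypothetically) each step is
# right 99% of the time and errors are independent.
n_digits = 20
steps = n_digits * n_digits            # ~400 elementary steps
per_step_accuracy = 0.99               # assumed for illustration
p_all_correct = per_step_accuracy ** steps
print(f"P(entire product is correct) ~ {p_all_correct:.1%}")   # ~1.8%
```

Under those made-up numbers, getting the whole product right in one pass is the exception rather than the rule, which matches the pen-and-paper experience.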
by teleforce on 6/7/25, 7:30 AM
It seems that AI LLMs/LRMs need help from their distant cousins, namely logic, optimization, and constraint programming, which can be categorized as intelligent automation, or IA [1],[2],[3],[4]; a toy solver sketch follows the links below.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
https://www.youtube.com/live/TknN8fCQvRk
[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:
https://youtube.com/watch?v=HB5TrK7A4pI
[3] Google OR-Tools:
https://developers.google.com/optimization
[4] MiniZinc:
https://www.minizinc.org/
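For a concrete taste of the delegation those links describe, here is a minimal sketch using OR-Tools' CP-SAT solver; the cryptarithm model below is my own toy example, not something taken from the talks above:

```python
# Toy example: hand the exact combinatorial reasoning to a constraint solver
# (pip install ortools), while a language model could stay in charge of
# formulating the model rather than grinding through the search itself.
from ortools.sat.python import cp_model

model = cp_model.CpModel()

# SEND + MORE = MONEY: each letter is a distinct digit, leading digits nonzero.
letters = {name: model.NewIntVar(0, 9, name) for name in "SENDMORY"}
s, e, n, d, m, o, r, y = (letters[c] for c in "SENDMORY")
model.AddAllDifferent(list(letters.values()))
model.Add(s >= 1)
model.Add(m >= 1)

send = 1000 * s + 100 * e + 10 * n + d
more = 1000 * m + 100 * o + 10 * r + e
money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
model.Add(send + more == money)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print({name: solver.Value(var) for name, var in letters.items()})
```

The division of labor is the point: the solver guarantees the logic, and the model only has to translate the problem into constraints.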
by a11r on 6/8/25, 4:36 PM
Here is my complete review/analysis of the paper: https://www.linkedin.com/pulse/art-abstraction-human-advanta...
edit: fixed typo
by jbentley1 on 6/7/25, 11:50 AM
by bigEnotation on 6/8/25, 2:17 PM
by JusticeJuice on 6/6/25, 10:39 PM
by bilsbie on 6/8/25, 12:55 PM
It just kept regurgitating internet advice, and I couldn't get it to understand the reasoning for why it was wrong.
by kamranjon on 6/7/25, 3:23 PM
Even when given the exact steps needed to arrive at a solution in the prompt, the reasoning models still require just as many steps to reach a workable solution as they would if they weren’t given the solution in the prompt.
The other thing, which seems obvious in hindsight (though I don't typically use these reasoning models in my day to day), is that it takes a significant number of tokens to reach the point where reasoning models outperform non-reasoning models by a significant margin.
by bilsbie on 6/8/25, 12:57 PM
Maybe we plug into something like prolog (or other such strategies?)
by nialv7 on 6/6/25, 11:28 PM
> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?
Define reasoning, define generalizable, define pattern matching.
For additional credit, after you have done so, show that humans are capable of what you just defined as generalizable reasoning.
by ksec on 6/8/25, 1:51 AM
My idea was that, up until a few years ago, while AI/LLMs were good at being conversational and dishing out results in a language we understand, they still didn't "understand" anything, and a lot of the time conjured up answers that only seemed remotely correct. Pattern matching over a very large data set could be correct 70%, and increasingly 80%+, of the time, but more accurate predictions would require orders of magnitude more computing resources.
But pattern matching is still pattern matching. There is no reasoning behind it. 1+1 should never equal 11, yet the model may be skewed towards that result because of JavaScript. When fundamental logic isn't behind any of this progress and process, the very bottom layer of any conversation / information / result is fragile.
So I had been skeptical of AI progress and LLMs. That was until LRMs, or, as the title says, reasoning LLMs. I thought we had somehow managed to program critical thinking into them, or some sort of reflection / fact checking / rationale / basic logic as a fundamental principle. And while I could tell LRMs aren't and won't be perfect, and possibly will never quite reach AGI, the layer would improve over time until we find different ways to progress. And we would have something I call Assisted Intelligence, which is what a lot of people use for AI programming today.
Instead, what this shows is that the LRM isn't reasoning at all. It is an LLM conjuring up excuses to make it look like it is reasoning. It is another set of pattern matching, specifically made up to look like reasoning. It is basically a kid making things up about why he got the results, without thinking, because he just wants to get out of class or homework, in a way that looks very clever.
Maybe the title gave it away, and maybe we got tricked. It was always an LLM specifically trained to showcase "reasoning". The actual reasoning behind the scenes is never done. Hence the title "The Illusion of Thinking".
by mcswell on 6/8/25, 2:02 PM
by rikafurude21 on 6/8/25, 2:39 PM
by whiplash451 on 6/16/25, 6:46 PM
by ivape on 6/6/25, 10:16 PM
by akomtu on 6/7/25, 6:34 PM
1 3 7 15 31 63 ...
How do you continue this sequence? What's the 1000000th number in this sequence? Imitation continues the likeness of what it sees and quickly gets off track. Imitation can't go abstract and tell the 1000000th element without writing down a million numbers leading to the answer. Reasoning finds the rule behind the set of examples and uses this rule to predict the next numbers, so it never gets off track.
The rule generating the sequence can be a sophisticated recurrent formula, e.g. a(k) = 2a(k-1) - sqrt(a(k-3)). Imitation can't solve this problem beyond trivial examples, but an AI can do what a scientist would do: come up with hypotheses, verify them against the examples and eventually find a formula that's reasonably accurate. The role of an LLM here is to suggest possible formulas.
The same sequence of examples can be generated by many formulas that differ in complexity and accuracy. This provokes the idea of a simple competition between AIs: the one that creates the simplest formula that's 99.5% accurate - wins. The formula really means a small program, once we get beyond trivial recurrent rules.
The ability to find simple and accurate models of reality is the essence of intelligence.
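A minimal sketch of that guess-and-check loop on the toy sequence above; the candidate rules are my own hypothetical stand-ins for what an LLM might propose:

```python
# 1 3 7 15 31 63 ...: imitation extends the list term by term, while a rule
# such as a(k) = 2^k - 1 jumps straight to any index.
examples = [1, 3, 7, 15, 31, 63]

# Hypothetical candidate rules (the kind of thing the LLM would be asked to suggest).
candidates = {
    "a(k) = 2^k - 1":     lambda k: 2**k - 1,
    "a(k) = k^2 - k + 1": lambda k: k * k - k + 1,
    "a(k) = 3k - 2":      lambda k: 3 * k - 2,
}

# Keep only the rules that reproduce every example (indexing from k = 1).
fits = {name: f for name, f in candidates.items()
        if all(f(k) == v for k, v in enumerate(examples, start=1))}

for name, f in fits.items():
    print(name, "-> a(1000000) is a", f(1_000_000).bit_length(), "bit number")
```

Only the surviving rule can say anything about the 1000000th element without writing out the million terms before it.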
by esafak on 6/7/25, 2:09 AM
by jackson12t on 6/9/25, 3:37 PM
I think the way the paper lays out the performance regimes is pretty interesting, but I don't think they achieved their goal of demonstrating that LRMs can't use reasoning to solve complex puzzles organically (without contamination/memorization): IMO testing the model's ability to define an algorithm to solve the puzzle would have been a better evaluation of that (rather than having the model walk through all of the steps manually). I don't know that I'd use an LRM for this sort of long-tail reasoning where it has to follow one single process for a long time over just one prompt; if I needed a really long chain of reasoning I'd use an agent or workflow.
It sounds more like the tests measure a model's ability to reason coherently and consistently over many steps rather than a model's ability to understand and solve a complex puzzle. For example, for the Tower of Hanoi, a prompt like "Define an algorithm that will find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "find an arithmetic series formula, young Gauss") seems like it would have been a better approach than "Find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "add up all these numbers"). This is kind of seen in how the study included a step where the LRMs were given the algorithm and then asked to solve the problem; the focus was on an LRM's ability to follow the steps, not its ability to come up with an algorithm/solution on its own.
In a job interview, for example, who among us would accept inability to hold all of the `(2^n) - 1` steps of the Tower of Hanoi in our brain as evidence of poor reasoning ability?
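For what it's worth, the "define the algorithm" framing really is tiny; it's the move list the algorithm generates that blows up as `(2^n) - 1`. A quick sketch (not the paper's evaluation setup):

```python
# "Define an algorithm" vs. "list every move": the recursive definition is a
# few lines, but the move sequence it produces has 2**n - 1 entries.
def hanoi(n, src="A", aux="B", dst="C"):
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)     # park n-1 disks on the spare peg
            + [(src, dst)]                  # move the largest disk
            + hanoi(n - 1, aux, src, dst))  # restack the n-1 disks on top

for n in (3, 10, 15):
    print(n, "disks ->", len(hanoi(n)), "moves")   # 7, 1023, 32767
```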
Again, I think it's a really interesting study covering a model's ability to consistently follow a simple process over time in pursuit of a static objective (and perhaps a useful benchmark moving forward), but I'm not confident that it successfully demonstrates a meaningful deficiency in overall reasoning capability.
[1]: https://www.americanscientist.org/article/gausss-day-of-reck...
by benlivengood on 6/7/25, 1:25 AM
And also: the frontier LLMs blow older LLMs out of the water. There is continual progress, and this study would have been structured substantially the same 2 years ago, just with much smaller N on the graphs, because the regimes were much tinier then.
by nrjpoddar on 6/9/25, 10:00 AM
by Yenrabbit on 6/8/25, 3:56 PM
Anyway, fun experiment to test your understanding of these things but don't take any conclusions as gospel :)
by yalogin on 6/8/25, 2:03 PM
by danck on 6/7/25, 3:57 AM
by mitch_said on 6/7/25, 3:16 PM
by beneboy on 6/6/25, 11:42 PM
by piskov on 6/8/25, 12:16 PM
No matter how much computing power you give them, they can't solve harder problems.
This research suggests we're not as close to AGI as the hype suggests.
Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.
Apple's researchers used controllable puzzle environments specifically because:
• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break
Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.
This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.
by bgnn on 6/8/25, 6:30 PM
by cdrini on 6/7/25, 10:15 AM
With thinking LLMs, they can think, but they often can only think in one big batch before starting to "speak" their true answer. I think that needs to be rectified so they can switch between the two. In my previous framework, I would say "would I be able to solve this if I had all the knowledge, but could only think, then start typing?".
I think for larger problems, the answer to this is no. I would need paper/a whiteboard. That's what would let me think, write, output, iterate, draft, iterate. And I think that's where agentic AI seems to be heading.
by MaxPock on 6/8/25, 2:26 PM
by bicepjai on 6/6/25, 11:46 PM
by stephc_int13 on 6/7/25, 2:32 PM
by alansammarone on 6/7/25, 12:34 AM
Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special, and we were wrong. The Sun does not revolve around the Earth, which is a pretty typical planet with the same chemical composition as any other planet. All of a sudden we're not the only ones who could calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.
And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.
At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what the secret sauce is that we have (even if we can't quite say what it is without handwaving or using concepts that by definition cannot be defined - "qualia is the indescribable feeling of red - its redness(?)").
Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?
by d4rkn0d3z on 6/7/25, 10:37 AM
Further examination and discussion with more experienced researchers gave me pause. They said that one must have a solution, or a significant new approach toward solving the hard problems associated with a research project for it to be viable, otherwise time (and money) is wasted finding new ways to solve the easy problems.
This is a more general principle that can be applied to most areas of endeavour. When you set about research and development that involves a mix of easy, medium, and hard problems, you must solve the hard problems first otherwise you blow your budget finding new ways to solve the easy problems, which nobody cares about in science.
But "AI" has left the realm of science behind and entered the realm of capitalism where several years of meaningless intellectual gyration without ever solving a hard problem may be quite profitable.
by giardini on 6/11/25, 9:49 PM
by 8bitsrule on 6/7/25, 12:44 AM
by jawiggins on 6/8/25, 9:06 PM
This seems to indicate that the next generation of models should focus on recursively solving small parts of the problem before function-calling another model to solve another small part of the problem and working its answer into the reasoning loop.
Many seem to be citing this paper as an indication that LLMs are over - I think this indicates a clear path towards the next step function change in their abilities.
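A rough sketch of what that recursive decompose-and-delegate loop could look like; `call_model` here is a hypothetical stand-in for whatever LLM API you would actually wire in:

```python
# Hypothetical sketch: break a task into smaller pieces, delegate each piece,
# then fold the partial answers back into the loop.
def call_model(prompt: str) -> str:
    # Placeholder for a real LLM call (e.g. an HTTP request to your provider).
    return f"<answer to: {prompt!r}>"

def solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
    # Ask whether the task is small enough to answer directly.
    if depth >= max_depth or call_model(f"Is this simple enough to answer directly? {task}") == "yes":
        return call_model(task)

    # Otherwise ask for subtasks, solve each recursively, and merge the results.
    subtasks = call_model(f"Split into independent subtasks, one per line: {task}").splitlines()
    partials = [solve(sub, depth + 1, max_depth) for sub in subtasks if sub.strip()]
    return call_model(f"Combine these partial answers for {task!r}: {partials}")

print(solve("Plan the Tower of Hanoi solution for 8 disks"))
```

The stand-in keeps the sketch runnable; in practice the decomposition and merge prompts would go to a real model, and the depth cap is just a guard against runaway recursion.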
by behnamoh on 6/6/25, 9:41 PM
It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".