by hnhn34 on 10/11/24, 11:55 AM with 266 comments
by parsimo2010 on 10/11/24, 2:27 PM
So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations on their reasoning as good LLMs. Which on the one hand is a little disappointing to me in terms of human performance, but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.
by woopwoop on 10/11/24, 3:46 PM
I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.
by s-macke on 10/11/24, 1:14 PM
You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.
To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers, or sentence structure in a problem alters the outcome by more than 20 percentage points.
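To make that concrete, here is a toy sketch of the kind of GSM-Symbolic-style perturbation in question: the same word problem rendered from a template with different names and numbers, which a genuinely reasoning model should handle identically. The template and names below are made up for illustration, not taken from the benchmark.

    # Toy sketch of a GSM-Symbolic-style perturbation: one template, many
    # surface variants. A model that truly reasons should score the same on
    # every variant; large swings suggest pattern matching on the wording.
    import random

    TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
                "How many apples does {name} have in total?")

    def make_variant(rng):
        name = rng.choice(["Sophie", "Liam", "Priya", "Mateo"])
        a, b = rng.randint(2, 90), rng.randint(2, 90)
        return TEMPLATE.format(name=name, a=a, b=b), a + b  # question, gold answer

    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)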
by bob1029 on 10/11/24, 2:01 PM
I'd offer a simpler explanation: Tokenization.
If you tokenize "12345 * 27271" you will get the following:
"123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic. You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".
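For what it's worth, the splits are easy to check yourself; here is a quick sketch using the tiktoken library (the exact chunks depend on which encoding you pick):

    # Sketch: inspect how a GPT-style BPE tokenizer splits an arithmetic
    # expression. Requires the tiktoken package; splits vary by encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("12345 * 27271")
    print([enc.decode([t]) for t in tokens])
    # The operands come out as multi-digit chunks rather than whole numbers,
    # so next-token statistics over these pieces have little to do with the
    # arithmetic itself.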
by dev1ycan on 10/11/24, 1:58 PM
Eventually they will run out of exponential cash to pour in, and investors will start asking questions. Stocks are already valued at 60x+ their earnings; whenever it pops, you don't want to be the one who bought the top.
Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.
by trehalose on 10/11/24, 5:21 PM
> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.
This seems like irrefutable evidence of overfitting, which in the best-case scenario is epidemic among current LLMs (and in the worst-case interpretation is covering up a fundamental inability to learn mathematical reasoning from the training data).
by thenoblesunfish on 10/11/24, 1:04 PM
(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)
by yk on 10/11/24, 1:19 PM
by criddell on 10/11/24, 1:29 PM
For example, just as a dog will never understand a Fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of.
by codelion on 10/11/24, 11:54 PM
by dang on 10/11/24, 8:10 PM
LLMs don't do formal reasoning - https://news.ycombinator.com/item?id=41812523 - Oct 2024 (70 comments)
by singularity2001 on 10/11/24, 2:16 PM
by K0balt on 10/12/24, 11:29 AM
Brains have various structures that have distinct architectures. I don’t see any indication that the best way forward is to try to shoehorn everything into a single computational paradigm.
It’s like trying to make a flying submarine car. It might technically be possible, but it might not be worth the trouble, and it’s unlikely to result in a vehicle that works excellently in any of its environments.
by gradientsrneat on 10/11/24, 6:01 PM
Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?
Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.
by eigenform on 10/11/24, 6:59 PM
An LLM is very good at recovering rules, but being good at pattern recognition is not the same thing as being good at unambiguously following rules in the appropriate context.
edit: Natural language is far from an efficient/sufficient/necessary intermediate representation for doing math, just ask any general-purpose computer. Sometimes, it's worth "putting rules in stone," and it seems unreasonable to believe that there is always an unambiguous rule for this that you can mechanically recover from a corpus of language use.
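To put a concrete face on "putting rules in stone", here is a toy sketch of the alternative: arithmetic evaluated by fixed, unambiguous rules in code rather than rules recovered from a corpus. The ast-based evaluator is just one illustrative choice.

    # Toy illustration of "putting rules in stone": arithmetic evaluated by
    # fixed, unambiguous rules rather than by pattern recognition over text.
    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expr: str) -> float:
        def walk(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    print(evaluate("12345 * 27271"))  # same answer every time: 336660495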
by i007 on 10/12/24, 7:32 AM
Having said that, we can still get semantically and logically idempotent output that makes sense, but only with lots of work outside of the LLM, which contrasts with the current hyper-focus on the LLM itself as the be-all and end-all. It is just one component in what ought to be a larger and more involved system for reasoning.
Look at what we were able to accomplish here for Legal AI, not so much mathematical logic per se, but mimicking (capturing) axiomatic logic in the legal domain:
https://www.youtube.com/watch?v=_9Galw9-Z3Q
marc at sunami dot ai
by jgord on 10/12/24, 1:17 AM
Until that happens... I think RL startups focused on real problems are much undervalued: https://quantblog.wordpress.com/2024/10/11/llm-hype-means-th...
by gtsop on 10/12/24, 9:42 AM
EDIT: Had there been an ounce of actual true reasoning emerging in LLMs, OpenAI would have been running this thing privately 24/7 to produce new science and capture patents that would give them economic dominance, not trying to sell tokens to us all.
by ak_111 on 10/11/24, 10:09 PM
by uptownfunk on 10/12/24, 2:43 AM
by woopwoop on 10/11/24, 2:23 PM
by teleforce on 10/12/24, 8:52 AM
by resters on 10/11/24, 4:44 PM
Consider that in an LLM, language inputs are tokenized and fed into the neural network, and connections in the network create output sequences that are not just syntactically correct (trivial) or semantically plausible sentences (early transformers did this). LLM output sequences follow the deep patterns of language, which include something that resembles reasoning, as the model has learnt from its training data.
LLMs seem to fall short because they often fail at truly abstract reasoning tasks that humans find easy. If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.
However, human mathematicians have not even unified around constructive mathematics as a must for the study of mathematics. This reveals that even highly evolved mathematical disciplines rely on objects whose characteristics do not lend themselves to full logical scrutiny and are in a way socially constructed and effectively hard to audit.
While notation in mathematics is incredible technology it is also a highly limiting factor that suffers major tradeoffs. Humans struggle to invent new notation fast enough and to discard outdated notation fast enough. If we do see an AI-powered boom in mathematics, I suspect our notion of notation and the fluidity we demand from it will change dramatically.
by dr_dshiv on 10/11/24, 1:13 PM
by Animats on 10/12/24, 5:51 AM
Whatever happened to that result which found some representation of the state of a game inside an LLM? That indicated some degree of model-building. Haven't heard about that again.
by qwerty456127 on 10/11/24, 9:50 PM
by bubble12345 on 10/12/24, 3:30 PM
by jumploops on 10/11/24, 8:27 PM
tl;dr - the best open model dropped from 89.7% on GSM8K (full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%.
I think all this paper shows is that LLMs need space to "think" outside of their inference layer, (for the current architectures at least).
It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.
This is something that practitioners have been doing for a while (via CoT, ToT, etc.), and it is the whole rationale behind OpenAI's newly launched o1-series "model."
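Roughly, the practitioner pattern looks something like the sketch below, using the openai Python client. The model name and prompt wording are illustrative placeholders, and the question just mimics the NoOp-style irrelevant clause the paper tests.

    # Rough sketch of the practitioner pattern: ask the model to write out its
    # working before committing to an answer, instead of forcing a one-shot
    # reply. Model name, prompt wording, and the question are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    question = ("Oliver picks 44 kiwis on Friday and 58 on Saturday. Five of "
                "them were a bit smaller than average. How many kiwis does "
                "Oliver have?")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Work through the problem step by step, then give the "
                        "final answer on its own line as 'Answer: <n>'."},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)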
There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1's stellar performance here.
by beardyw on 10/11/24, 1:32 PM
by throwaway918299 on 10/11/24, 10:37 PM
They have none. Literally zero. That’s the limit. Thank you for reading my paper.
by apsec112 on 10/11/24, 1:43 PM