by hnhn34 on 10/11/24, 11:55 AM with 266 comments
by parsimo2010 on 10/11/24, 2:27 PM
So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations on their reasoning as good LLMs. Which on the one hand is a little disappointing to me in terms of human performance, but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.
by woopwoop on 10/11/24, 3:46 PM
I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.
by s-macke on 10/11/24, 1:14 PM
You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.
To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers, or sentence structure in a problem alters the outcome by more than 20 percentage points.
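To make that concrete, here is a toy sketch of the kind of GSM-Symbolic-style perturbation in question: the same word problem rendered from a template with different names and numbers, which a genuinely reasoning model should handle identically. The template and names below are made up for illustration, not taken from the benchmark.

    # Toy sketch of a GSM-Symbolic-style perturbation: one template, many
    # surface variants. A model that truly reasons should score the same on
    # every variant; large swings suggest pattern matching on the wording.
    import random

    TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
                "How many apples does {name} have in total?")

    def make_variant(rng):
        name = rng.choice(["Sophie", "Liam", "Priya", "Mateo"])
        a, b = rng.randint(2, 90), rng.randint(2, 90)
        return TEMPLATE.format(name=name, a=a, b=b), a + b  # question, gold answer

    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)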
by bob1029 on 10/11/24, 2:01 PM
I'd offer a simpler explanation: Tokenization.
If you tokenize "12345 * 27271" you will get the following:
"123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic. You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".
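For what it's worth, the splits are easy to check yourself; here is a quick sketch using the tiktoken library (the exact chunks depend on which encoding you pick):

    # Sketch: inspect how a GPT-style BPE tokenizer splits an arithmetic
    # expression. Requires the tiktoken package; splits vary by encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("12345 * 27271")
    print([enc.decode([t]) for t in tokens])
    # The operands come out as multi-digit chunks rather than whole numbers,
    # so next-token statistics over these pieces have little to do with the
    # arithmetic itself.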
by dev1ycan on 10/11/24, 1:58 PM
Eventually they will run out of exponential cash to pour in, and investors will start asking questions. Stocks are already valued at 60x+ their earnings; whenever it pops, you don't want to be the one who bought the top.
Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.
by trehalose on 10/11/24, 5:21 PM
> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.
This seems like irrefutable evidence of overfitting, which in the best-case scenario is epidemic among current LLMs (and in the worst-case interpretation is covering up a fundamental inability to learn mathematical reasoning from the training data).
by thenoblesunfish on 10/11/24, 1:04 PM
(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)
by yk on 10/11/24, 1:19 PM
by criddell on 10/11/24, 1:29 PM
For example, just as a dog will never understand a Fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of.
by codelion on 10/11/24, 11:54 PM
by dang on 10/11/24, 8:10 PM
LLMs don't do formal reasoning - https://news.ycombinator.com/item?id=41812523 - Oct 2024 (70 comments)
by singularity2001 on 10/11/24, 2:16 PM
by K0balt on 10/12/24, 11:29 AM
Brains have various structures that have distinct architectures. I don’t see any indication that the best way forward is to try to shoehorn everything into a single computational paradigm.
It’s like trying to make a flying submarine car. It might technically be possible, but it might not be worth the trouble, and it’s unlikely to result in a vehicle that works excellently in any of its environments.
by gradientsrneat on 10/11/24, 6:01 PM
Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?
Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.
by eigenform on 10/11/24, 6:59 PM
An LLM is very good at recovering rules, but being good at pattern recognition is not the same thing as being good at unambiguously following rules in the appropriate context.
edit: Natural language is far from an efficient/sufficient/necessary intermediate representation for doing math, just ask any general-purpose computer. Sometimes, it's worth "putting rules in stone," and it seems unreasonable to believe that there is always an unambiguous rule for this that you can mechanically recover from a corpus of language use.
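To put a concrete face on "putting rules in stone", here is a toy sketch of the alternative: arithmetic evaluated by fixed, unambiguous rules in code rather than rules recovered from a corpus. The ast-based evaluator is just one illustrative choice.

    # Toy illustration of "putting rules in stone": arithmetic evaluated by
    # fixed, unambiguous rules rather than by pattern recognition over text.
    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expr: str) -> float:
        def walk(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    print(evaluate("12345 * 27271"))  # same answer every time: 336660495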
by i007 on 10/12/24, 7:32 AM
Having said that, we can still get semantically and logically idempotent output that makes sense, but only with lots of work outside of the LLM, which contrasts with the current hyper-focus on the LLM itself as the be-all and end-all. It is just one component in what ought to be a larger and more involved system for reasoning.
Look at what we were able to accomplish here for Legal AI, not so much mathematical logic per se, but mimicking (capturing) axiomatic logic in the legal domain:
https://www.youtube.com/watch?v=_9Galw9-Z3Q
marc at sunami dot ai
by jgord on 10/12/24, 1:17 AM
Until that happens... I think RL startups focused on real problems are much undervalued: https://quantblog.wordpress.com/2024/10/11/llm-hype-means-th...
by gtsop on 10/12/24, 9:42 AM
EDIT: Had there been an ounce of actual true reasoning emerging in LLMs, OpenAI would have been running this thing privately 24/7 to produce new science and capture patents that would give them economic dominance, not trying to sell tokens to us all.
by ak_111 on 10/11/24, 10:09 PM
by uptownfunk on 10/12/24, 2:43 AM
by woopwoop on 10/11/24, 2:23 PM
by teleforce on 10/12/24, 8:52 AM
by resters on 10/11/24, 4:44 PM
Consider that in an LLM, language inputs are tokenized and fed into the neural network, and connections in the network create output sequences that are not just syntactically correct (trivial) or semantically plausible sentences (early transformers did this). LLM output sequences follow the deep patterns of language, which include something that resembles reasoning, as the model has learnt from its training data.
LLMs seem to fall short because they often fail at truly abstract reasoning tasks that humans find easy. If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.
However, human mathematicians have not even unified around constructive mathematics as a must for the study of mathematics. This reveals that even highly evolved mathematical disciplines rely on objects whose characteristics do not lend themselves to full logical scrutiny and are in a way socially constructed and effectively hard to audit.
While notation in mathematics is incredible technology it is also a highly limiting factor that suffers major tradeoffs. Humans struggle to invent new notation fast enough and to discard outdated notation fast enough. If we do see an AI-powered boom in mathematics, I suspect our notion of notation and the fluidity we demand from it will change dramatically.
by dr_dshiv on 10/11/24, 1:13 PM
by Animats on 10/12/24, 5:51 AM
Whatever happened to that result which found some representation of the state of a game inside an LLM? That indicated some degree of model-building. Haven't heard about that again.
by qwerty456127 on 10/11/24, 9:50 PM
by bubble12345 on 10/12/24, 3:30 PM
by jumploops on 10/11/24, 8:27 PM
tl;dr - the best open model dropped from 89.7% on GSM8K (full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%.
I think all this paper shows is that LLMs need space to "think" outside of their inference layer, (for the current architectures at least).
It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.
This is something that practitioners have been doing for a while (via CoT, ToT, etc.), and it is the whole rationale behind OpenAI's newly launched o1-series "model."
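Roughly, the practitioner pattern looks something like the sketch below, using the openai Python client. The model name and prompt wording are illustrative placeholders, and the question just mimics the NoOp-style irrelevant clause the paper tests.

    # Rough sketch of the practitioner pattern: ask the model to write out its
    # working before committing to an answer, instead of forcing a one-shot
    # reply. Model name, prompt wording, and the question are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    question = ("Oliver picks 44 kiwis on Friday and 58 on Saturday. Five of "
                "them were a bit smaller than average. How many kiwis does "
                "Oliver have?")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Work through the problem step by step, then give the "
                        "final answer on its own line as 'Answer: <n>'."},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)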
There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1's stellar performance here.
by beardyw on 10/11/24, 1:32 PM
by throwaway918299 on 10/11/24, 10:37 PM
They have none. Literally zero. That’s the limit. Thank you for reading my paper.
by apsec112 on 10/11/24, 1:43 PM