by enum on 2/9/25, 6:14 PM with 80 comments
by XCabbage on 2/9/25, 7:47 PM
As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.
Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).
(If the questions were multiple-choice, removing the need to bring candidate answers to mind, which is the main challenge for a human, then I'd agree it was a "reasoning" test.)
by rahidz on 2/9/25, 10:06 PM
Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."
Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
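For reference, the check R1 fumbles in Problem 8 is mechanical once a candidate is on the table. A minimal sketch, assuming the puzzle asks for two body parts whose joined letters become a word when the seventh letter is advanced by one (that is only inferred from the trace quoted above; the exact puzzle wording isn't reproduced here, and `shift_letter` is a made-up helper name):

```python
def shift_letter(text: str, position: int, step: int = 1) -> str:
    """Shift the letter at a 1-based position by `step` places, wrapping Z to A."""
    letters = list(text)
    ch = letters[position - 1]
    if not ch.isalpha():
        raise ValueError(f"character at position {position} is not a letter")
    base = ord("a") if ch.islower() else ord("A")
    letters[position - 1] = chr((ord(ch) - base + step) % 26 + base)
    return "".join(letters)


# Assumed candidate answer: the two body parts "foot" and "nose", joined.
candidate = "foot nose".replace(" ", "")
print(shift_letter(candidate, 7))  # footnote
```

So "foot" + "nose" works where "foot" + "note" does not: the body parts are the input, and the ordinary word is the output.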
by mkoubaa on 2/9/25, 8:30 PM
by windsignaling on 2/10/25, 4:25 AM
Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.
I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.
IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. You wouldn't use an LLM to multiply two large numbers; you'd use a calculator.
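To make the "Software 1.0" point concrete, here's a rough sketch of how little code these character-level tasks take when you operate on characters instead of tokens. The function names are just illustrative, nothing here comes from the benchmark itself:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def is_anagram(a: str, b: str) -> bool:
    """True if both strings use the same letters, ignoring case and non-letters."""
    def normalize(s: str) -> list[str]:
        return sorted(c for c in s.lower() if c.isalpha())
    return normalize(a) == normalize(b)


print(count_letter("strawberry", "r"))  # 3
print(is_anagram("listen", "silent"))   # True
```

Neither function needs any world knowledge, which is the part of these puzzles that a plain program can't supply.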
by enum on 2/9/25, 6:14 PM
by sega_sai on 2/9/25, 6:48 PM
by lokimedes on 2/9/25, 7:43 PM
I know this is a rant, sorry, just so tired of the stupidity.
by zone411 on 2/10/25, 6:32 AM
LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/
LLM Step Game: https://github.com/lechmazur/step_game
LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization
LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing
Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/
and a couple more that I haven't updated very recently.
by akomtu on 2/9/25, 9:39 PM
1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in base-26 system.
2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.
Simply imitating answers won't get you very far.
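As a sketch of what the first kind of test looks like when done mechanically: the example below multiplies two base-26 numbers, assuming the digit mapping A=0, B=1, ..., Z=25 (the comment doesn't fix a mapping, so that choice is an assumption, as are the helper names).

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed digit order: A=0 ... Z=25


def to_int(word: str) -> int:
    """Interpret an uppercase word as a base-26 number, most significant digit first."""
    value = 0
    for ch in word:
        value = value * 26 + ALPHABET.index(ch)
    return value


def from_int(value: int) -> str:
    """Render a non-negative integer as base-26 letters, without leading 'A' digits."""
    if value == 0:
        return "A"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 26)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def multiply_base26(a: str, b: str) -> str:
    """Multiply two base-26 words by converting through ordinary integers."""
    return from_int(to_int(a) * to_int(b))


print(multiply_base26("AHFG", "VRBD"))
```

Under this mapping the leading A in AHFG is a leading zero, so it is dropped from results. The point of the test is that a person (or model) has to carry out the same known procedure on unfamiliar symbols, where memorized answers don't help.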
by zinccat on 2/9/25, 7:28 PM
by scotty79 on 2/10/25, 12:15 PM
by brokensegue on 2/9/25, 11:24 PM
by bryan0 on 2/10/25, 10:49 AM
by aghilmort on 2/9/25, 6:31 PM