by JoelEinbinder on 8/17/24, 7:21 PM with 103 comments
by jsnell on 8/17/24, 7:46 PM
You might want to replace the single-page format with showing just one question at a time, and giving instant feedback after each answer.
First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
by JoelEinbinder on 8/17/24, 7:23 PM
I wouldn't call the quiz fun exactly. After playing with it a lot, I think I've been able to consistently get above 50% of questions right. I have slowed down a lot on each question, though, which is something I think LLMs have trouble doing.
by chmod775 on 8/18/24, 4:46 AM
you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information. If one could instead sort the answers by likelihood and get scored based on how high one ranked the correct answer, things would probably look better than random.
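The ranking idea could be sketched like this (a minimal illustration; the linear scoring scheme and the example words are made up, not how the quiz actually scores):

```python
# Rank-based scoring: instead of all-or-nothing, award points based on
# how highly the player ranked the correct answer among the options.
def rank_score(ranked_options, correct, n_options=4):
    """Return a score in [0, 1]: 1.0 if the correct word was ranked
    first, decreasing linearly for lower ranks."""
    rank = ranked_options.index(correct)  # 0 = top choice
    return 1.0 - rank / (n_options - 1)

# Example: the player orders four candidate words by likelihood.
guess = ["the", "a", "more", "ether"]
print(rank_score(guess, "a"))  # prints 0.6666666666666667 (ranked second)
```

Under this scheme a near-miss still earns partial credit, so humans and models that consistently rank the true word highly would separate from random guessing.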
Also I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?
Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
by layer8 on 8/17/24, 7:55 PM
by nojs on 8/18/24, 3:39 AM
This is presumably also a simple strategy for detecting AI content in general: see how many "high temperature" choices it makes.
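A rough sketch of that detection idea (hypothetical: `word_rank` is a stub standing in for a real language model's ranking of each word given its context, and the cutoff is arbitrary):

```python
# Count how often a text picks a word that a language model would rank
# low -- the "high temperature" choices. Human-written text tends to
# contain more of these than greedy LM output does.
def high_temp_fraction(tokens, word_rank, cutoff=50):
    """Fraction of tokens whose model rank exceeds `cutoff`."""
    surprising = sum(1 for i, tok in enumerate(tokens)
                     if word_rank(tokens[:i], tok) > cutoff)
    return surprising / len(tokens)

# Toy stub: rank every word the same, just to show the interface.
tokens = "the cat sat on the mat".split()
print(high_temp_fraction(tokens, lambda ctx, tok: 10))  # prints 0.0
```

A real detector would need actual token probabilities from a model, but the principle is the same: text with almost no surprising word choices looks machine-generated.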
by modeless on 8/18/24, 8:33 AM
I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design (maybe a Wordle-style daily challenge plus social sharing, etc.), I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
by anikan_vader on 8/17/24, 9:47 PM
Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
by moritzwarhier on 8/17/24, 9:12 PM
Quizzes can be magical.
Haven't seen any cooler new language-related interactive fun-project on the web since:
It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.
Sharing this with a general audience could spark funny discussions about bubbles and biases :)
by RheingoldRiver on 8/18/24, 4:09 AM
For the longer comments I understand, but for the ones where it's 1 or 2 words and many of the options are correct English phrases, I don't understand why there would be a bias towards one. Wouldn't we need a prompt here?
Also, I got bored halfway through and selected "D" for all of them
by pizza on 8/18/24, 7:34 AM
edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt4o-mini API model is doing that.
by jdthedisciple on 8/18/24, 9:17 AM
by Garlef on 8/17/24, 8:10 PM
by TacticalCoder on 8/17/24, 7:59 PM
by ChrisArchitect on 8/17/24, 9:24 PM
Who's Smarter: AI or a 5-Year-Old?
by stackghost on 8/17/24, 8:22 PM
I don't see what this has to do with being "smarter" than anything. Example:
1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____
a) ether
b) a
c) the
d) more
But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) is an equally good choice. What is there to be gained in divining which one the LLM would pick?
by Kiro on 8/17/24, 8:58 PM
by kqr on 8/18/24, 3:51 PM
by dataflow on 8/18/24, 5:12 AM
by zoklet-enjoyer on 8/17/24, 8:27 PM
Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model was llama-3-8b taking only 10 seconds!
by blitzar on 8/18/24, 8:53 AM
by silisili on 8/17/24, 7:42 PM
by akira2501 on 8/17/24, 8:08 PM
by nick3443 on 8/17/24, 11:24 PM
by greesil on 8/18/24, 3:15 PM
by shakna on 8/17/24, 8:46 PM
by lupire on 8/17/24, 11:23 PM
by wesselbindt on 8/17/24, 7:47 PM
by playingalong on 8/18/24, 10:05 AM
by fsndz on 8/18/24, 3:05 PM
by moralestapia on 8/17/24, 9:24 PM
Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
by ZoomerCretin on 8/17/24, 8:29 PM
This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.
I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
by StefanBatory on 8/17/24, 10:12 PM
On a more serious note it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
by xanderlewis on 8/17/24, 8:31 PM
by lostmsu on 8/17/24, 8:28 PM
by globular-toast on 8/18/24, 8:14 AM
by mjcurl on 8/17/24, 7:44 PM
I think I did worse when the prompt was shorter. It just becomes a guessing game then, and I find myself thinking more like a language model.
by card_zero on 8/18/24, 5:51 AM
by rlt on 8/18/24, 2:28 AM
It would be interesting to try varying it, as well as the seed.
by efilife on 8/19/24, 8:46 AM
> You scored 7/15. The best language model, mistral-7b, scored 7/15.
I guess it's a success
by lelanthran on 8/18/24, 8:19 AM
I'm pretty certain that LLMs are unable to work at all without context.
by nyrikki on 8/17/24, 8:05 PM
LLMs are effectively DAGs; in the absence of larger context, they literally have to unroll infinite possibilities into finite options.
You can unroll a cyclic graph into a DAG, but you constrict the solution space.
Take the spoken sentence:
"I never said she stole my money"
And say it multiple times with emphasis on each word and notice how the meaning changes.
That is text being a forgetful functor.
As you can describe PAC learning as compression, which is exactly equivalent to the finite-set shattering above, you can assign probabilities to next tokens.
But that is existential quantification, limited to what you can pattern-match and find in your corpus.
I guess if "smart" is defined as pattern matching and finding, it would apply.
But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.
Even if ML had never been invented, I would assume there would be mechanical methods to stack-rank next tokens from a corpus.
This isn't a case of 'smarter', but just different. If that difference is meaningful depends on context.
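A purely mechanical stack-ranking like that might look like this (a toy bigram count over a made-up corpus; no ML involved, just frequency tables):

```python
from collections import Counter, defaultdict

# Rank candidate next words purely by bigram counts from a corpus --
# the kind of mechanical, pre-ML method the comment alludes to.
def build_bigrams(corpus):
    """Map each word to a Counter of the words that follow it."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def rank_next(follows, word):
    """Candidate next words after `word`, most frequent first."""
    return [w for w, _ in follows[word].most_common()]

corpus = ("arm is becoming a name arm is becoming the choice "
          "arm is becoming the standard")
follows = build_bigrams(corpus)
print(rank_next(follows, "becoming"))  # prints ['the', 'a']
```

This is existential quantification in miniature: the ranking only ever reflects what happens to exist in the corpus, which is the limitation the comment is pointing at.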
by User23 on 8/17/24, 8:16 PM
by lemoncookiechip on 8/18/24, 5:32 PM
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 5/15
llama-2-7b: 6/15
llama-3-8b: 6/15 (Slowest Bot: 14sec)
mistral-7b: 5/15
unigram: 2/15
by fidla on 8/19/24, 4:58 PM
by drakonka on 8/18/24, 2:22 PM
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 4/15
llama-2-7b: 7/15
llama-3-8b: 7/15
mistral-7b: 7/15
unigram: 4/15
by lingualscorn on 8/18/24, 2:12 PM
by EugeneOZ on 8/17/24, 9:04 PM