from Hacker News

Are you better than a language model at predicting the next word?

by JoelEinbinder on 8/17/24, 7:21 PM with 103 comments

  • by jsnell on 8/17/24, 7:46 PM

    It's a neat idea, though not what I expected from the title talking about "smart" :)

    You might want to replace the single-page format with showing just one question at a time, and giving instant feedback after each answer.

    First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.

  • by JoelEinbinder on 8/17/24, 7:23 PM

    I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments and compete against various language models. I used llama2 to generate three alternative completions for each comment, creating a multiple-choice question. For the local language models that you are competing against, I consider them to have picked the answer with the lowest total perplexity of prompt + answer. I am able to replicate this behavior with the OpenAI models by setting a logit_bias that limits the LLM to picking only one of the allowed answers. I tried just giving the full multiple-choice question as a prompt and having it pick an answer, but that led to really poor results. So I'm not able to compare with Claude or any online LLMs that don't support logit_bias.
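
    (For the curious, here's a minimal sketch of the perplexity-based pick, assuming a Hugging Face causal LM; the helper names are illustrative, not the actual quiz code.)

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

        def total_loss(prompt: str, answer: str) -> float:
            ids = tok(prompt + " " + answer, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, labels=ids)
            # out.loss is the mean cross-entropy over the shifted tokens,
            # so multiply back out to get the total for the whole sequence
            return out.loss.item() * (ids.shape[1] - 1)

        def model_pick(prompt: str, choices: list[str]) -> str:
            # the model's "answer" is the candidate with the lowest total loss
            return min(choices, key=lambda c: total_loss(prompt, c))

    And a sketch of the logit_bias trick for the OpenAI models (the token IDs below are placeholders, not the real ones):

        from openai import OpenAI

        client = OpenAI()
        allowed = [1234, 5678, 9012, 3456]  # first token of each choice (placeholder IDs)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "the comment prefix goes here"}],
            max_tokens=1,
            # a +100 bias effectively restricts sampling to the allowed tokens
            logit_bias={str(t): 100 for t in allowed},
        )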

    I wouldn't call the quiz fun exactly. After playing with it a lot, I think I've been able to consistently get above 50% of questions right. I've slowed down a lot to answer each question, which I think is something LLMs have trouble doing.

  • by chmod775 on 8/18/24, 4:46 AM

        you: 4/15
        gpt-4o: 0/15
        gpt-4: 1/15
        gpt-4o-mini: 2/15
        llama-2-7b: 2/15
        llama-3-8b: 3/15
        mistral-7b: 4/15
        unigram: 1/15
    
    Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information.

    If one could instead sort the answers by likelihood and get scored based on how highly one ranked the correct answer, things would probably look better than random.
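
    (A toy sketch of that ranked scoring, with hypothetical names: score each question by the reciprocal rank you gave the correct answer, so a second-place guess still earns half a point.)

        def ranked_score(ranking: list[str], correct: str) -> float:
            # ranking holds your guesses ordered from most to least likely
            return 1.0 / (ranking.index(correct) + 1)

        # Putting the correct word second earns 0.5 instead of a flat 0.
        print(ranked_score(["so", "I've", "then", "maybe"], "I've"))  # 0.5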

    Also, I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?

    Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.

  • by layer8 on 8/17/24, 7:55 PM

    This is also a good test for noticing that you spend too much time reading HN comments.

  • by nojs on 8/18/24, 3:39 AM

    Nice. I found you can beat this by picking the word least likely to be selected by a language model, because it seems like the alternative choices are generated by an LLM. “Pick the outlier” is the best strategy.

    This is presumably also a simple strategy for detecting AI content in general - see how many "high temperature" choices it makes.

  • by modeless on 8/18/24, 8:33 AM

    > You scored 11/15. The best language model, llama-2-7b, scored 10/15.

    I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design (maybe a Wordle-style daily challenge plus social sharing, etc.), I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.

  • by anikan_vader on 8/17/24, 9:47 PM

    Got 8/15, best AI model got 7/15, and unigram got 1/15.

    Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.

  • by moritzwarhier on 8/17/24, 9:12 PM

    This is the best interactive website about LLMs at a meta level (so excluding prompt interfaces for actual AIs) that I've seen so far.

    Quizzes can be magical.

    I haven't seen a cooler new language-related interactive fun project on the web since:

    https://wikispeedruns.com/

    It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.

    Sharing this with a general audience could spark funny discussions about bubbles and biases :)

  • by RheingoldRiver on 8/18/24, 4:09 AM

    I don't quite understand what makes "Okay I've" more correct than "Okay so". No meaningful context was provided here, so how do we know "Okay I've" was meaningfully correct at all?

    For the longer comments I understand, but for the ones where the prompt is 1 or 2 words and many of the options are correct English phrases, I don't understand why there's a bias towards one. Wouldn't we need a prompt here?

    Also, I got bored halfway through and selected "D" for all of them.

  • by pizza on 8/18/24, 7:34 AM

    If the samples came from HN, I wonder how likely it is that the text is already part of a dataset (i.e. a Common Crawl snapshot), such that the LLMs have already seen it?

    edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt-4o-mini API model is doing that.

  • by jdthedisciple on 8/18/24, 9:17 AM

    Some of them are excerpts from a much larger context, which the LLM would be using for prediction, obviously giving them a gigantic edge.

  • by Garlef on 8/17/24, 8:10 PM

    I like it. It's a humorous reversal of the usual articles that boil down to "Look! I made the AI fail at something!"

  • by TacticalCoder on 8/17/24, 7:59 PM

    My computer can compute 573034897183834790x3019487439184798 in less than a millisecond. Doesn't make it smarter than me.

  • by ChrisArchitect on 8/17/24, 9:24 PM

  • by stackghost on 8/17/24, 8:22 PM

    This is just a test of how likely you are to generate the same word as the LLM. The LLM does not produce the "correct" next word as there are multiple correct words that fit grammatically and can be used to continue the sentence while maintaining context.

    I don't see what this has to do with being "smarter" than anything. Example:

    1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____

    a) ether

    b) a

    c) the

    d) more

    But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) is an equally good choice. What is there to be gained in divining which one the LLM would pick?

  • by Kiro on 8/17/24, 8:58 PM

    Where do the incorrect options come from?

  • by kqr on 8/18/24, 3:51 PM

    For anyone else daring the full 100-question quiz: you need to get at least a third right to be considered better than guessing by traditional statistical standards. (You'd need more than half to be better than the LLMs.)
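
    (A quick way to check that, assuming a one-sided binomial test at the usual 5% level, with 4 choices per question so chance is 25%; scipy does the work.)

        from scipy.stats import binomtest

        # Smallest score on 100 four-choice questions that beats guessing
        # (p = 0.25) at the 5% significance level, one-sided.
        for k in range(26, 50):
            if binomtest(k, n=100, p=0.25, alternative="greater").pvalue < 0.05:
                print(f"{k}/100 beats guessing")  # lands right around a third
                break
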
  • by dataflow on 8/18/24, 5:12 AM

    I got 9/15, vs. 4/15 for an LLM. I assume these are lifted from HN? Seems like an indication I should spend less time here...

  • by zoklet-enjoyer on 8/17/24, 8:27 PM

    You scored 6/15. The best language model, gpt-4o, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 2/15.

    Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model was llama-3-8b taking only 10 seconds!
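
    (For reference, a toy version of what a unigram baseline like that might look like, all names mine; it ignores the prompt entirely and just prefers the globally most common candidate.)

        from collections import Counter

        def unigram_pick(choices: list[str], corpus_words: list[str]) -> str:
            # pick whichever candidate is most frequent in some corpus,
            # without looking at the prompt at all
            freq = Counter(w.lower() for w in corpus_words)
            return max(choices, key=lambda w: freq[w.lower()])

        corpus = "the cat sat on the mat and the dog sat too".split()
        print(unigram_pick(["ether", "a", "the", "more"], corpus))  # -> "the"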

  • by blitzar on 8/18/24, 8:53 AM

    I took some mushrooms and hallucinated the answers.

  • by silisili on 8/17/24, 7:42 PM

    Was mine broken? One of my prompts was just '>'. So of course I guessed a random word. The answer key showed I got it wrong, but it showed the right answer inserted into a longer prompt. Or is that how it's supposed to work?

  • by akira2501 on 8/17/24, 8:08 PM

    Yes. I can tell you about things that happened this morning. Your language model cannot.

  • by nick3443 on 8/17/24, 11:24 PM

    This isn't really the challenge (loss function) that language models are trained on. It's not a simple next-word challenge; they get more context. See how BERT was trained for a reference.

  • by greesil on 8/18/24, 3:15 PM

    Like an ML model, I would prefer being scored with cross-entropy rather than right/wrong. I might guess wrong, but my guess might not be that far off in likelihood.

  • by shakna on 8/17/24, 8:46 PM

    So... If I picked the same results, in the same timeframe... And I don't think glue should go on pizza... Does that mean LLMs are completely useless to me?

  • by lupire on 8/17/24, 11:23 PM

    I got one of my own comments on the 15 question quiz!

  • by wesselbindt on 8/17/24, 7:47 PM

    I like the website, but it could be a bit more explicit about the point it's trying to make. Given that a lot of people tend to think of an LLM as somehow a thinking entity rather than a statistical model for guessing the most likely next word, most will probably look at these questions and think the website is broken.

  • by playingalong on 8/18/24, 10:05 AM

    I got 2/15, so worse than random choice... I guess that's partly because English is not my mother tongue.

  • by fsndz on 8/18/24, 3:05 PM

    Of course not, but that does not mean LLMs will lead to AGI. We might never build AGI in fact: https://www.lycee.ai/blog/why-no-agi-openai

  • by moralestapia on 8/17/24, 9:24 PM

    >the quintessential language model task of predicting the next word?

    Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers, and there's no objective argument to be made for which one is the best.

  • by ZoomerCretin on 8/17/24, 8:29 PM

    > 8. All of local politics in the muni I live in takes place in a forum like this, on Facebook[.] The electeds in our muni post on it; I've gotten two different local laws done by posting there (and I'm working on a bigger third); I met someone whose campaign I funded and helped run who is now a local elected. It is crazy to think you can HN-effortpost your way to changing the laws of the place you live in but I'm telling you right now that you can.

    This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.

    I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.

  • by StefanBatory on 8/17/24, 10:12 PM

    7/15, 90 seconds. I'll blame it on the fact that I'm not a native English speaker, right? Right?

    On a more serious note, it was a cool thing to go through! It seemed like something that should have been so easy at first glance.

  • by xanderlewis on 8/17/24, 8:31 PM

    I feel like I recognise the comment about tensors from HN a few days ago, haha.

  • by lostmsu on 8/17/24, 8:28 PM

    I think this is a good joke on the naysayers. But if the author is here, I would like clarification: is the user picking the next token or the next word? Because if it's the latter, I think this test is invalid.

  • by globular-toast on 8/18/24, 8:14 AM

    Everything I picked was grammatically correct, so I don't see the point. Is the point of a "language model" just to recall people's comments from the internet now?

  • by mjcurl on 8/17/24, 7:44 PM

    5/15, so the same as choosing the most common word.

    I think I did worse when the prompt was shorter. It just becomes a guessing game then, and I find myself thinking more like a language model.

  • by card_zero on 8/18/24, 5:51 AM

    The LLMs are better than me at knowing the finer probabilities of next words, and worse than me at guessing the points being made and reasoning about that.

  • by rlt on 8/18/24, 2:28 AM

    Is this with the “temperature” parameter set to 0? Most LLM chatbots set it to something higher.

    It would be interesting to try varying it, as well as the seed.
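
    (For intuition, a toy sketch of what temperature does, all names mine: logits are divided by T before the softmax, and T = 0 collapses to deterministic argmax decoding.)

        import numpy as np

        def sample_token(logits, temperature=1.0, seed=0):
            logits = np.asarray(logits, dtype=float)
            if temperature == 0:
                return int(np.argmax(logits))  # greedy: always the top token
            z = logits / temperature
            p = np.exp(z - z.max())
            p /= p.sum()  # softmax over the scaled logits
            return int(np.random.default_rng(seed).choice(len(p), p=p))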

  • by efilife on 8/19/24, 8:46 AM

    Tried to respond like an LLM would

    > You scored 7/15. The best language model, mistral-7b, scored 7/15.

    I guess it's a success

  • by lelanthran on 8/18/24, 8:19 AM

    This is a nonsense test. There is no context, so the 'next' word after the single word 'The' is effectively random.

    I'm pretty certain that LLMs are unable to work at all without context.

  • by nyrikki on 8/17/24, 8:05 PM

    7/10. This is more about set shattering than 'smarts'.

    LLMs are effectively DAGs; in the absence of larger context, they literally have to unroll infinite possibilities into finite options.

    You can unroll a cyclic graph into a DAG, but you constrict the solution space.

    Take the spoken sentence:

    "I never said she stole my money"

    Say it multiple times, emphasizing a different word each time, and notice how the meaning changes.

    That is text being a forgetful functor.

    Since you can describe this as PAC learning, or as compression (which is exactly equivalent to the finite set shattering above), you can assign probabilities to next tokens.

    But that is existential quantification, limited to your corpus and based on pattern matching and finding.

    I guess if 'smart' is defined as pattern matching and finding, it would apply.

    But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.

    Even if ML had never been invented, I would assume there would be mechanical methods to stack-rank next tokens from a corpus.

    This isn't a case of 'smarter', just different. Whether that difference is meaningful depends on context.

  • by User23 on 8/17/24, 8:16 PM

    With some brief experimentation, ChatGPT also fails this test.

  • by lemoncookiechip on 8/18/24, 5:32 PM

        you: 6/15 (336sec)
        gpt-4o: 5/15
        gpt-4: 5/15
        gpt-4o-mini: 5/15
        llama-2-7b: 6/15
        llama-3-8b: 6/15 (slowest bot: 14sec)
        mistral-7b: 5/15
        unigram: 2/15

  • by fidla on 8/19/24, 4:58 PM

    Yes, definitely.

  • by drakonka on 8/18/24, 2:22 PM

        you: 5/15
        gpt-4o: 5/15
        gpt-4: 5/15
        gpt-4o-mini: 4/15
        llama-2-7b: 7/15
        llama-3-8b: 7/15
        mistral-7b: 7/15
        unigram: 4/15

  • by lingualscorn on 8/18/24, 2:12 PM

    The only ones I got right were ones where I had read the actual HN comment…

  • by EugeneOZ on 8/17/24, 9:04 PM

    Just proves why IQ tests are worthless.