by Wilsoniumite on 3/10/25, 6:16 PM with 164 comments
by rainsford on 3/10/25, 11:02 PM
It seems like an incredibly bad outcome if we accept "AI" that's fundamentally flawed in ways similar to, if not worse than, humans, and try to work around it rather than relegating it to unimportant tasks while we work towards the standard of intelligence we'd otherwise expect from a computer.
LLMs certainly appear to be the closest to real AI that we've gotten so far. But I think a lot of that is due to the human bias that language is a sign of intelligence, and our measuring stick is unsuited to evaluating software specifically designed to mimic the human ability to string words together. We now have the unreliability of human language processes without most of the benefits that come from actual human-level intelligence. Managing that unreliability with systems designed for humans bakes in all the downsides without further pursuing the potential upsides of legitimate computer intelligence.
by tehsauce on 3/10/25, 7:49 PM
by lxe on 3/10/25, 10:37 PM
by smallnix on 3/10/25, 8:44 PM
From that it follows that LLMs are apt to reproduce all kinds of human biases, like preferring the first choice out of many, or the last out of many (primacy and recency biases). Funnily enough, the LLM might replicate these biases slightly wrong and, by doing so, produce new derived biases.
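A quick sanity check for that kind of positional bias is to shuffle the same options and count picks per position rather than per option. A minimal Python sketch (ask_model is a hypothetical stand-in for a real LLM call; the bias is simulated here):

    import random
    from collections import Counter

    def ask_model(options):
        # Hypothetical stand-in for an LLM call that picks one option.
        # Simulated with a deliberate lean toward the first position.
        return options[0] if random.random() < 0.6 else random.choice(options)

    options = ["alpha", "beta", "gamma", "delta"]
    picks_by_position = Counter()

    for _ in range(1000):
        shuffled = random.sample(options, len(options))
        choice = ask_model(shuffled)
        picks_by_position[shuffled.index(choice)] += 1

    # With no positional bias, each of the 4 positions gets ~250 picks.
    print(picks_by_position)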
by henlobenlo on 3/11/25, 12:57 AM
by bawolff on 3/11/25, 2:24 AM
Hardly a shocker. I think this says more about the experimental design than it does about AI & humans.
by markbergz on 3/10/25, 8:40 PM
The authors discuss the person 1 / doc 1 bias and the need to always evaluate each pair of items twice.
If you want to play around with this method there is a nice python tool here: https://github.com/vagos/llm-sort
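For illustration, a minimal sketch of that evaluate-both-orders idea (judge is a hypothetical stand-in for a real LLM comparison prompt, simulated here with a toy preference plus a deliberate positional bias):

    import random

    def judge(first, second):
        # Hypothetical LLM judgment returning "first" or "second".
        # Simulated: a real preference (longer string wins), plus a
        # mild bias toward whichever item is shown first.
        if random.random() < 0.2:
            return "first"
        return "first" if len(first) >= len(second) else "second"

    def debiased_compare(a, b):
        # Evaluate the pair in both orders; a winner must survive
        # the position swap, otherwise report a tie.
        forward = judge(a, b)    # a shown first
        backward = judge(b, a)   # b shown first
        if forward == "first" and backward == "second":
            return a
        if forward == "second" and backward == "first":
            return b
        return None  # order-dependent verdict: treat as a tie

    print(debiased_compare("short", "a much longer answer"))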
by jayd16 on 3/10/25, 11:03 PM
by velcrovan on 3/10/25, 7:58 PM
by isaacremuant on 3/11/25, 2:35 AM
The experiment itself is so fundamentally flawed it's hard to begin criticizing it. HN comments as a predictor of good hiring material is just as valid as social media profile artifacts or sleep patterns.
Just because you produce something with statistics (with or without LLMs) and have nice visuals and narratives doesn't mean it is valid or rigorous or "better than nothing" for decision making.
Articles like this keep making it to the top of HN because HN is behaving like reddit where the article is read by few and the gist of the title debated by many.
by le-mark on 3/10/25, 10:42 PM
by devit on 3/10/25, 7:56 PM
Although of course that behavior may indicate that the model is essentially guessing randomly rather than actually producing a signal.
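One way to quantify that: swap the order of each pair and measure how often the verdict flips; a flip rate near 50% is the signature of random guessing. A sketch (judge is again a hypothetical stand-in for an LLM call, simulated here as a pure coin flip):

    import random

    def judge(first, second):
        # Hypothetical LLM judgment ("first"/"second"), simulated
        # as a coin flip, i.e. no real signal at all.
        return random.choice(["first", "second"])

    flips = 0
    trials = 1000
    for _ in range(trials):
        forward = judge("a", "b")    # "a" shown first
        backward = judge("b", "a")   # "b" shown first
        # Map backward's verdict into forward's frame.
        backward_as_forward = "second" if backward == "first" else "first"
        if forward != backward_as_forward:
            flips += 1

    # ~50% here; a stable preference would push this toward 0%.
    print(f"flip rate: {flips / trials:.2%}")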
by jopsen on 3/10/25, 9:09 PM
by satisfice on 3/10/25, 9:41 PM
by andrewmcwatters on 3/10/25, 8:10 PM
To me it’s literally the same as testing one Markov chain against another.
by megadata on 3/10/25, 10:51 PM
It can be incredibly hard to get a person to acknowledge that they might be remotely wrong on a topic they really care about.
Or, for some people, the thought that they might be wrong about anything at all is simply blasphemy.
by oldherl on 3/11/25, 7:24 AM
by K0balt on 3/10/25, 9:26 PM
Also, often less capable of carrying on a decent conversation.
I’ve noticed a preconscious urge when talking to people to judge them against various models and quants, or to decide they are truly SOTA.
I need to touch grass a bit more, I think.
by soared on 3/10/25, 10:37 PM
by vivzkestrel on 3/11/25, 4:01 AM
by djaouen on 3/11/25, 3:10 AM
by bxguff on 3/10/25, 8:32 PM
by th0ma5 on 3/10/25, 7:44 PM
by raincole on 3/11/25, 2:36 AM
TL;DR: the author found a very, very specific bias that is prevalent in both humans and LLMs. That is it.
by mdp2021 on 3/10/25, 8:00 PM
Now: some people can't count. Some people hum between words. Some people set fire to national monuments. Reply: "Yes we knew", and "No, it's not necessary".
And: if people could lift tons, we would not have invented cranes.
Very, very often on these pages I meet people repeating "how bad people are". What they mean is "how bad people can be" - and one would have guessed these pages are especially visited by engineers, who must already be aware of the importance of technical boosts. So, besides the point that the median does not represent the whole set, there is the other point that tools are not measured by whether they merely reach mediocre results.