from Hacker News

Llama 4 Smells Bad

by alexmolas on 4/24/25, 6:49 AM with 26 comments

  • by simonw on 4/24/25, 7:32 AM

    This article's credibility suffers a little from the way it talks about GPT-4o mini:

    "just in front of GPT-4o-mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters, depending on when you ask."

    Then later:

    "On the Artificial Analysis benchmark Scout achieved the same score as GPT 4o mini. A 109B model vs a 1.5B model (allegedly). This is ABYSMAL."

    Asking models how many parameters they have doesn't make sense.

    There is absolutely no way GPT-4o mini is 1.5B. I can run a 3B model on my iPhone, but it's a fraction of the utility of GPT-4o mini.

  • by danielhanchen on 4/24/25, 7:42 AM

    There were actually multiple bugs that impacted long-context benchmarks and general inference - I helped fix some of them.

    1. RMS norm eps was 1e-6, but should be 1e-5 - see https://github.com/huggingface/transformers/pull/37418

    2. Llama 4 Scout changed RoPE settings after release - conversion script for llama.cpp had to be fixed. See https://github.com/ggml-org/llama.cpp/pull/12889

    3. vLLM and the Llama 4 team found QK Norm was normalizing across the entire Q & K, which was wrong - accuracy increased by 2%. See https://github.com/vllm-project/vllm/pull/16311 (a rough sketch of items 1 and 3 follows below this list)
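
    To make items 1 and 3 concrete, here's a minimal PyTorch-style sketch (my own illustration, not the actual transformers/vLLM code; the learned scale weight is omitted, the shapes are made up, and I'm reading item 3 as a per-head vs. whole-tensor normalization issue):

        import torch

        def rms_norm(x, eps):
            # RMSNorm over the last dimension: x * rsqrt(mean(x^2) + eps).
            # Item 1 is about the value of this eps constant (1e-5 vs 1e-6).
            return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

        # Made-up shapes: batch=2, seq=4, heads=8, head_dim=16.
        q = torch.randn(2, 4, 8, 16)

        # Roughly the bug described in item 3: normalize across the flattened
        # 8*16 = 128-wide vector, i.e. across all heads at once.
        q_wrong = rms_norm(q.reshape(2, 4, 8 * 16), eps=1e-5).reshape_as(q)

        # The fix: normalize each 16-dim head independently.
        q_right = rms_norm(q, eps=1e-5)

        print(torch.allclose(q_wrong, q_right))  # False - the outputs differ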

    If you look at https://x.com/WolframRvnwlf/status/1909735579564331016, the GGUFs I uploaded for Scout actually did better than inference providers by ~5% on MMLU Pro. https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-... has more details.

  • by lhl on 4/24/25, 8:33 AM

    While Llama 4 had a pretty bad launch (the LM Arena gaming in particular is terrible), having run my own evals on it (using the April 5 v0.8.3 vLLM release - https://blog.vllm.ai/2025/04/05/llama4.html - so before the QKNorm fix https://github.com/vllm-project/vllm/pull/16311), it seemed pretty decent to me.

    For English, on a combination of MixEval, LiveBench, IFEval, and EvalPlus, Maverick FP8 (17B active / 400B total) was about on par with DeepSeek V3 FP8 (37B/671B), and Scout (17B/109B) was punching in the ballpark of Gemma 3 27B, but not too far off Llama 3.3 70B and Mistral Large 2411 (123B).

    Llama 4 is claimed to have been trained on 10X more multilingual tokens than Llama 3, and testing on Japanese (including with some new, currently unreleased evals) the models did perform better than Llama 3, although I'd characterize their overall Japanese performance as "middle of the pack": https://shisa.ai/posts/llama4-japanese-performance/

    I think a big part of the negative reaction is that, in terms of memory footprint, Llama 4 looks more built for Meta (a large-scale inference provider) than for home users, although with the move to APUs and more efficient CPU offloading, there's still something to be said for strong capabilities at 17B active parameters during inference.
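
    Rough numbers for that memory-footprint point (my own back-of-the-envelope, not from the post or any benchmark): with an MoE, the total parameter count drives memory while the active parameter count drives per-token compute.

        # Weights-only estimate; ignores KV cache, activations, and runtime overhead.
        def weight_gb(params_billions: float, bytes_per_param: float = 1.0) -> float:
            # At FP8, weights are roughly 1 byte per parameter.
            return params_billions * bytes_per_param

        print(weight_gb(400))       # Maverick (400B total) at FP8: ~400 GB
        print(weight_gb(109))       # Scout (109B total) at FP8:    ~109 GB
        print(weight_gb(109, 0.5))  # Scout at ~4-bit:              ~55 GB
        # Every token only runs ~17B active parameters, which is why CPU/APU
        # offload of the inactive experts can still give usable speeds.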

    I think people are quick to forget that Llama 3, while not so disastrous, was much improved by 3.1. Also, the competitive landscape is pretty different now. And I think the visual capabilities are being a bit slept on, but that's probably also a case of releasing before the inference code was fully baked...

  • by GaggiX on 4/24/25, 7:32 AM

    >GPT-4o-mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters

    I have no idea how the author can remotely trust GPT-4o-mini in this case. The number of parameters is almost certainly way off.

  • by bradley13 on 4/24/25, 7:39 AM

    This seems to be a general problem at the moment. The most usable models are not the newest. The newer models (obviously, I haven't tried them all) may do better on benchmarks, but actual usability is worse.

    Creating useful LLMs required some genuine breakthroughs. It seems to me that we have reached the limits of what we can achieve with current architectures; progress will require new insights and breakthroughs.

  • by fancyfredbot on 4/24/25, 7:41 AM

    If you game the benchmark, you always get found out by your users. Yet the practice remains common in hardware: outright lies are uncommon, but misleading and cherry-picked numbers are pretty much standard practice.

    The fact that misleading benchmarks don't even drive profit at Meta didn't seem to stop them from doing the same thing, but perhaps that isn't very surprising; I imagine the internal incentives are very similar.

    Unlike the hardware companies though, gaming the benchmark in LLMs seems to involve making the actual performance worse, so perhaps there is more hope that the practice will fade away in this market.

  • by anonymousiam on 4/24/25, 3:11 PM

    [trying to confuse an android]

    Spock: Logic is a little tweeting bird chirping in a meadow. Logic is a wreath of pretty flowers which smell bad. Are you sure your circuits are registering correctly? Your ears are green.

    https://www.imdb.com/title/tt0708432/quotes/?item=qt0406609

  • by croisillon on 4/24/25, 7:28 AM

    did Meta open a time wormhole to release Llama 4 on May 5th?

  • by pixelesque on 4/24/25, 7:42 AM

    > This is a draft. Come back later for the final version.

    There are quite a few issues with the content from a factual point of view (several sibling comments mention them): it could have done with a lot more proof-reading and research, I think.

  • by simonw on 4/24/25, 7:45 AM

    The initial Llama 4 release is disappointing: the models are too big for most people to run, and not high quality enough to be worth running if you can afford the hardware.

    I'm still optimistic for Llama 4.1 and 4.2.

    Llama 3 got exciting at the 3.2 and 3.3 stages: smaller models that were distilled from the big ones and ran on a laptop (or even a phone).

    3.2 3B and 3.3 70B were really interesting models.

    I'm hopeful that we will get a Llama 4 ~25B, since that seems to be a sweet spot for laptop models right now - Gemma 3 27B and Mistral Small 3.1 (24B) are both fantastic.

  • by bambax on 4/24/25, 8:10 AM

    > Anyway, on Saturday (!) May the 5th, Cinco de Mayo, Meta released Llama 4

    Wat. We're still in April. Cinco de Abril.

  • by NanoYohaneTSU on 4/24/25, 7:45 AM

    Reminder that 1 year ago, AI tech bronies were saying that AI is only going to improve from here. It didn't. It stagnated because it's reached the peak of LLMs, as predicted.

    And it still can't create images correctly, as in actual image creation, not woven pixels with tons of artifacts.