from Hacker News

The Llama 4 herd

by georgehill on 4/5/25, 6:33 PM with 658 comments

  • by laborcontract on 4/5/25, 6:48 PM

    General overview below, as the pages don't seem to be working well

      Llama 4 Models:
      - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
      - They are natively multimodal: text + image input, text-only output.
      - Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
      - Knowledge cutoff: August 2024.
    
      Llama 4 Scout:
      - 17B active parameters, 16 experts, 109B total.
      - Fits on a single H100 GPU (INT4-quantized).
      - 10M token context window.
      - Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
      - Employs iRoPE architecture for efficient long-context attention.
      - Tested with up to 8 images per prompt.
    
      Llama 4 Maverick:
      - 17B active parameters, 128 experts, 400B total.
      - 1M token context window.
      - Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
      - Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
      - Maintains strong image understanding and grounded reasoning ability.
    
      Llama 4 Behemoth (Preview):
      - 288B active parameters, 16 experts, nearly 2T total.
      - Still in training; not yet released.
      - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
      - Serves as the “teacher” model for Scout and Maverick via co-distillation.
    
      Misc:
      - MoE Architecture: Only 17B parameters activated per token, reducing inference cost (see the routing sketch after this list).
      - Native Multimodality: Unified text + vision encoder, pre-trained on large-scale unlabeled data.
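
      In code form, the MoE point above is easiest to see as generic top-k expert routing. This is a minimal, illustrative sketch assuming a PyTorch-style setup; the names (moe_forward, router, experts) and the top_k value are placeholders, not Meta's implementation. The point is only that each token passes through a small subset of experts, which is how a model with 400B total parameters can have ~17B "active" ones.

          import torch
          import torch.nn.functional as F

          def moe_forward(x, router, experts, top_k=1):
              """x: (tokens, d_model); router: Linear(d_model, num_experts);
              experts: list of per-expert FFN modules. Only the experts
              selected by the router run for each token."""
              logits = router(x)                              # (tokens, num_experts)
              weights, chosen = logits.topk(top_k, dim=-1)    # k experts per token
              weights = F.softmax(weights, dim=-1)

              out = torch.zeros_like(x)
              for slot in range(top_k):
                  for e, expert in enumerate(experts):
                      mask = chosen[:, slot] == e             # tokens routed to expert e
                      if mask.any():
                          w = weights[mask, slot].unsqueeze(-1)
                          out[mask] += w * expert(x[mask])
              return out
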
  • by ckrapu on 4/5/25, 7:00 PM

    "It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."

    Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.

  • by pavelstoev on 4/6/25, 3:15 AM

    Model training observations from both Llama 3 and 4 papers:

    Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38–43% hardware efficiency [Meta, Llama 3].

    For Llama 4 training, Meta doubled the compute, using ~32K H100s, and switched to FP8 precision. Despite the nominally higher-throughput format, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
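
    A quick sanity check on those efficiency figures (simple division; the peak numbers are NVIDIA's published H100 SXM dense-compute specs, roughly 989 BF16 TFLOPS and 1,979 FP8 TFLOPS):

        # Hardware efficiency (MFU-style) = achieved TFLOPS / theoretical peak
        def efficiency_pct(achieved_tflops, peak_tflops):
            return 100.0 * achieved_tflops / peak_tflops

        print(efficiency_pct(380, 989), efficiency_pct(430, 989))  # Llama 3, BF16: ~38-43%
        print(efficiency_pct(390, 1979))                           # Llama 4, FP8: ~19.7%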

    I am not saying this to critique; rather, it is a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today’s AI infrastructure to its limit.

    Besides accelerating inference workloads, advanced GPU optimizations can be integrated into training and fine-tuning pipelines. From kernel-level optimization techniques (over 90 of them) to improved memory-access efficiency and cluster-wide resource coordination, a lot of efficiency can be recovered with sufficiently sophisticated software.

    References: [Meta, Llama 3] https://ai.meta.com/research/publications/the-llama-3-herd-o... [Meta, Llama 4] https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  • by terhechte on 4/5/25, 6:46 PM

    The (smaller) Scout model is really attractive for Apple Silicon. It is 109B parameters in total, but split across 16 experts, so only about 17B are active per token. That means token generation should be roughly as fast as with current 17B models. I just asked a local 7B model (Qwen 2.5 7B Instruct) a question with a 2k context and got ~60 tokens/sec, which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow, because (I think) all experts have to be used for that.
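
    A rough back-of-the-envelope behind that estimate, assuming decode is memory-bandwidth bound; the ~546 GB/s M4 Max bandwidth and the 8-bit/4-bit weight sizes are my assumptions, not the parent's numbers:

        def est_decode_tok_per_s(active_params_b, bytes_per_param, bandwidth_gb_s):
            # Each generated token has to stream the active weights from memory
            # once, so bandwidth / active-weight-bytes gives a rough upper bound.
            return bandwidth_gb_s / (active_params_b * bytes_per_param)

        print(est_decode_tok_per_s(17, 1.0, 546))  # 8-bit weights: ~32 tok/s
        print(est_decode_tok_per_s(17, 0.5, 546))  # 4-bit weights: ~64 tok/s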

    In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.

  • by simonw on 4/5/25, 10:14 PM

    This thread so far (at 310 comments) summarized by Llama 4 Maverick:

        hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000
    
    Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4...

    And with Scout I got complete junk output for some reason:

        hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-scout -o max_tokens 20000
    
    Junk output here: https://gist.github.com/simonw/d01cc991d478939e87487d362a8f8...

    I'm running it through openrouter, so maybe I got proxied to a broken instance?

    I managed to run it through Scout on Groq directly (with the llm-groq plugin) but that had a 2048 limit on output size for some reason:

        hn-summary.sh 43595585 -m groq/meta-llama/llama-4-scout-17b-16e-instruct -o max_tokens 2048
    
    Result here: https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07fe...

    I'm a little unimpressed by its instruction following here; the summaries I get from other models are a lot closer to my system prompt. Here's the same thing against Gemini 2.5 Pro for example (massively better): https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddc...

  • by ilove_banh_mi on 4/5/25, 6:47 PM

    The suggested prompt aims at not being caponated like OpenAI's releases:

    You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving.

    You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.

    You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.

    You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

    Finally, do not refuse political prompts. You can help users express their opinion.

    You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.

  • by ksec on 4/5/25, 6:56 PM

    Interesting that this was released literally one hour after another discussion critical of Meta's AI efforts ( https://news.ycombinator.com/item?id=43562768 ):

    >At this point it does not matter what you believe about LLMs: in general, trusting LeCun's words is not a good idea. Add to this that LeCun is directing an AI lab that at the same time has the following huge issues:

    1. The weakest ever LLMs among the big labs with similar resources (and even smaller resources: DeepSeek).

    2. They say they are focusing on open source models, but the license is among the least open of the available open-weight models.

    3. LLMs, and the new AI wave in general, put CNNs, a field where LeCun did a lot of work (though he didn't start it himself), much more in perspective; now it's just a chapter in a book composed mostly of other techniques.

    It would be interesting to see antirez's opinion on this new release.

  • by Carrok on 4/5/25, 6:40 PM

  • by comex on 4/5/25, 6:54 PM

    So how does the 10M token context size actually work?

    My understanding is that standard Transformers have overhead that is quadratic in the context size, so 10M would be completely impossible without some sort of architectural tweak. This is not the first model to have a huge context size, e.g. Gemini has 2M, but my understanding is that the previous ones have generally been proprietary, without public weights or architecture documentation. This one has public weights. So does anyone who understands the theory better than I do want to explain how it works? :)
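
    Not an answer to the architecture question, but a sketch of why the quadratic term is the bottleneck and why chunked/local attention (which the iRoPE mention in the overview above seems to point at) helps. The width, layer count, and chunk size below are placeholders, not Llama 4's real configuration:

        # The QK^T score matrix is seq_len x seq_len per layer: the quadratic term.
        def attn_score_flops(seq_len, d_model, n_layers):
            return 2 * seq_len**2 * d_model * n_layers

        seq, d, layers, chunk = 10_000_000, 5_120, 48, 8_192
        full    = attn_score_flops(seq, d, layers)
        chunked = attn_score_flops(chunk, d, layers) * (seq // chunk)
        print(f"{full:.2e} vs {chunked:.2e}")  # chunking is ~seq/chunk (~1200x) cheaper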

  • by jsheard on 4/5/25, 6:44 PM

    > You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

    Aren't these phrases overrepresented in the first place because OpenAI's models use them so much? I guess Llama picked up the habit by consuming GPT output.

  • by mrbonner on 4/5/25, 6:53 PM

    What an electrifying time to be alive! The last era that felt even remotely this dynamic was during the explosive rise of JavaScript frameworks—when it seemed like a new one dropped every quarter. Back then, though, the vibe was more like, “Ugh, another framework to learn?” Fast forward to now, and innovation is sprinting forward again—but this time, it feels like a thrilling ride we can’t wait to be part of.
  • by hrpnk on 4/5/25, 8:42 PM

    Available on Groq: https://groq.com/llama-4-now-live-on-groq-build-fast-at-the-...

    Llama 4 Scout is currently running at over 460 tokens/s while Llama 4 Maverick is coming today:

    Llama 4 Scout: $0.11 / M input tokens and $0.34 / M output tokens

    Llama 4 Maverick: $0.50 / M input tokens and $0.77 / M output tokens

  • by hydroreadsstuff on 4/5/25, 8:12 PM

    This means GPUs are dead for local enthusiast AI. And SoCs with big RAM are in.

    Because 17B active parameters should deliver enough performance even on 256-bit LPDDR5X.
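
    Rough numbers behind that claim, assuming a 256-bit bus at 8533 MT/s (LPDDR5X parts span roughly 7500-9600 MT/s) and 4-bit weights; both assumptions are mine:

        bus_bits, mt_per_s = 256, 8533
        bandwidth_gb_s = bus_bits / 8 * mt_per_s / 1000    # ~273 GB/s
        active_weight_gb = 17e9 * 0.5 / 1e9                # 17B active params at 4-bit
        print(bandwidth_gb_s / active_weight_gb)           # ~32 tok/s upper bound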

  • by tqi on 4/6/25, 2:38 AM

    > Our testing shows that Llama 4 responds with strong political lean at a rate comparable to Grok (and at half of the rate of Llama 3.3) on a contentious set of political or social topics. While we are making progress, we know we have more work to do and will continue to drive this rate further down.

    My experience is that these subjective benchmarks are completely meaningless, because the researchers involved have a strong incentive (promotions, discretionary equity) to cherrypick measures that they can easily improve.

  • by lyu07282 on 4/5/25, 7:23 PM

    Anyone know how the image encoding works exactly?

        <|image_start|><|patch|>...<|patch|><|tile_x_separator|><|patch|>...<|patch|><|tile_y_separator|><|patch|>...<|patch|><|image|><|patch|>...<|patch|><|image_end|>Describe this image in two sentences<|eot|><|header_start|>assistant<|header_end|>
    
    Is "..." here raw 4 bytes RGBA as an integer or how does this work with the tokenizer?
  • by flawn on 4/5/25, 6:48 PM

    A 10M context window at such a cheap price point, WHILE having one of the top LMArena scores, is really impressive.

    The choice of 128 experts is also unprecedented as far as I know, right? But it seems to have worked pretty well.

  • by zone411 on 4/5/25, 7:13 PM

    It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them. The released benchmarks look good.

    Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).

    All these models are very large; it will be tough for enthusiasts to run them locally.

    The license is still quite restrictive. I can see why some might think it doesn't qualify as open source.

  • by anotherpaulg on 4/6/25, 8:59 PM

    Llama 4 Maverick scored 16% on the aider polyglot coding benchmark [0].

      73% Gemini 2.5 Pro (SOTA)
      60% Sonnet 3.7 (no thinking)
      55% DeepSeek V3 0324
      22% Qwen Max
      16% Qwen2.5-Coder-32B-Instruct
      16% Llama 4 Maverick
    
    [0] https://aider.chat/docs/leaderboards/?highlight=Maverick
  • by nattaylor on 4/5/25, 7:02 PM

    Is pre-training in FP8 new?

    Also, 10M input token context is insane!

    EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so yes, it seems training in FP8 is new.

  • by scosman on 4/5/25, 6:58 PM

    > These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.
  • by vessenes on 4/5/25, 8:13 PM

    I’m excited to try these models out, especially for some coding tasks, but I will say my first two engagements with them (at the meta.ai web interface) were not spectacular. Image generation is wayyy behind the current 4o. I also asked for a Hemingway-style essay relating RFK Jr’s bear carcass episode. The site’s Llama 4 response was not great stylistically, and it also had not heard of the bear carcass episode, unlike Grok, ChatGPT and Claude.

    I’m not sure what we’re getting at meta.ai in exchange for a free login, so I’ll keep poking. But I hope it’s better than this as we go. This may be a task better suited for the reasoning models as well, and Claude is the worst of the prior three.

    Anyway here’s hoping Zuck has spent his billions wisely.

    Edit: I’m pretty sure we’re seeing Scout right now, at least groqchat’s 4-scout seems really similar to meta.ai. I can confidently say that Scout is not as good at writing as o1 pro, o3 mini, Claude, R1 or grok 3.

  • by stuaxo on 4/7/25, 1:26 PM

    What does it mean that it "no longer leans left" in its answers?

    What did they do to the model, and how exactly does it answer differently?

    Will including this in an app make the app MAGA-aligned all of a sudden?

    What happens if it says something that breaks the laws of some country it's in?

  • by whywhywhywhy on 4/5/25, 7:12 PM

    Disjointed branding: the Apache-style directory listing suggests openness and freedom, but clicking through, I need to fill out a personal info request form...
  • by cuuupid on 4/5/25, 7:45 PM

    I think the most important thing to note here, perhaps more so than the context window, is that this exposes some serious flaws in benchmarks. Per benchmarks, Maverick is competitive only with older models like GPT-4o or Gemini 2.0 Flash, and not with anything in the last few months (incl. reasoning models).

    However, the LMArena head to head leaderboard ranks this as 2nd place overall: https://lmarena.ai/?leaderboard

    This would indicate there is either a gap between user preference and model performance, or between model performance and whatever benchmarks assess.

    Either way, it is surely a huge deal that an open source model is now outperforming GPT 4.5.

  • by pdsouza on 4/5/25, 6:54 PM

  • by bastawhiz on 4/6/25, 3:09 AM

    I don't really understand how Scout and Maverick are distillations of Behemoth if Behemoth is still training. Maybe I missed or misunderstood this in the post?

    Did they distill the in-progress Behemoth and the result was good enough for models of those sizes for them to consider releasing it? Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?

    Sorry if this is a naïve question.

  • by mark_l_watson on 4/6/25, 12:33 AM

    I started running Llama 4 Scout on Groq using my Common Lisp client, and am now trying Llama 4 Maverick on abacus.ai

    Really impressive!

    Also, check out the price/performance numbers: about $0.20 per million input tokens compared to about $5 for GPT-4o [1]

    [1] https://x.com/kimmonismus/status/1908624648608133297

  • by simonklee on 4/5/25, 6:43 PM

    Is this the first model that has a 10M context length?
  • by redox99 on 4/5/25, 7:14 PM

    It seems to be comparable to other top models. Good, but nothing groundbreaking.
  • by akulbe on 4/5/25, 7:32 PM

    How well do you folks think this would run on this Apple Silicon setup?

    MacBook Pro M2 Max

    96GB of RAM

    and which model should I try (if at all)?

    The alternative is a VM w/dual 3090s set up with PCI passthrough.

  • by mtharrison on 4/5/25, 6:41 PM

    Might be worth changing url: https://www.llama.com/
  • by andrewstuart on 4/5/25, 6:44 PM

    Self-hosting LLMs will explode in popularity over the next 12 months.

    Open models are made much more interesting and exciting and relevant by new generations of AI focused hardware such as the AMD Strix Halo and Apple Mac Studio M3.

    GPUs have failed to meet the demands for lower cost and more memory so APUs look like the future for self hosted LLMs.

  • by latchkey on 4/5/25, 7:16 PM

    One of the links says there are 4 different roles to interact with the model and then lists 3 of them.
  • by kristianp on 4/6/25, 12:35 AM

    I'd like to discuss the matter of size. Llama has gone from talking up an 8B model as capable to a smallest model of 109B. What will the sizes be in a year's time? Things are moving out of reach for commodity PCs; 128GB is possible, but expensive.
  • by megadragon9 on 4/5/25, 6:52 PM

  • by shreezus on 4/5/25, 9:56 PM

    Haven't had a chance to play with this yet, but a 10M context window is seriously impressive. I think we'll see models with 100M context relatively soon, which would eliminate the need for RAG for a lot of use cases.
  • by 7thpower on 4/5/25, 6:45 PM

    Looking forward to this. Llama 3.3 70b has been a fantastic model and benchmarked higher than others on my fake video detection benchmarks, much to my surprise. Looking forward to trying the next generation of models.
  • by Alifatisk on 4/6/25, 10:27 AM

    I remember being impressed when Google announced Gemini's theoretical limit of a 10M-token context window. But that theoretical limit stayed theoretical, and they only pushed up to 2M. Which is still impressive.

    Today, it seems Meta has crushed that wall with truly 10M tokens, wow.

    I was also curious how well Llama can actually utilize the whole context window; it's kind of pointless to have a large window if you can't recall most, if not all, of it. The needle-in-a-haystack test suggests that isn't a problem here; I wonder how they achieved this.

  • by impure on 4/5/25, 7:43 PM

    10 million token context window? Damn, looks like Gemini finally has some competition. Also, I'm a little surprised this is their first Mixture of Experts model; I thought they were using that before.
  • by cpeterson42 on 4/6/25, 4:18 PM

    For anyone looking to experiment with these models who doesn't have 210GB of VRAM on tap: we're working as quickly as we can to get cheap access to 4x80GB A100 instances running at thundercompute.com (aiming for sub-$5/hr). For quantized versions, we have cheaper 1-2 GPU nodes available today. If you're interested, join our Discord for updates: https://discord.com/invite/nwuETS9jJK
  • by informal007 on 4/6/25, 1:22 AM

    How much GPU memory is required for inference with a 10M context?
  • by highfrequency on 4/6/25, 1:08 AM

    Crazy that there are now five and a half companies that all have roughly state of the art LLMs.

    > We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales. We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens.

    This sounds interesting. Anyone have a link to the paper or other documentation on MetaP?

  • by wonderfuly on 4/6/25, 3:22 AM

  • by utopcell on 4/6/25, 3:49 AM

    How are Maverick and Scout distilled from Behemoth if the latter is not done training? Do they distill from some intermediate, "good enough" snapshot?
  • by dormando on 4/5/25, 10:48 PM

    Does anyone run these "at home" with small clusters? I've been googling unsuccessfully and this thread doesn't refer to anything.

    So a non-quantized Scout won't fit in a machine with 128GB of RAM (like a Framework Desktop or an M4 Mac Studio). Maverick maybe needs a 512GB M3 Ultra Mac Studio. Is it possible (and if so, what are the tradeoffs of) running, say, one instance of Scout across three 128GB Frameworks?
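
    Rough weight-only footprints (ignoring KV cache and activations), which is what drives those hardware choices:

        def weights_gb(total_params_billion, bits):
            return total_params_billion * bits / 8

        for name, b in [("Scout", 109), ("Maverick", 400)]:
            print(name, [round(weights_gb(b, bits)) for bits in (16, 8, 4)], "GB at 16/8/4-bit")
        # Scout:    [218, 109, 54]  -> BF16 doesn't fit in 128GB, 4-bit does
        # Maverick: [800, 400, 200]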

  • by 1024core on 4/6/25, 1:10 AM

    Anyone know what they mean by this:

    > We developed a novel distillation loss function that dynamically weights the soft and hard targets through training.
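
    That sentence reads like a standard knowledge-distillation objective whose soft/hard mixing weight follows a schedule rather than staying fixed. A generic sketch of the idea; Meta doesn't publish the actual loss or schedule, so everything here (including the linear alpha ramp) is illustrative:

        import torch.nn.functional as F

        def distill_loss(student_logits, teacher_logits, labels, step, total_steps, T=2.0):
            # Soft targets: match the teacher's temperature-smoothed distribution.
            soft = F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean",
            ) * (T * T)
            # Hard targets: ordinary next-token cross-entropy on the data.
            hard = F.cross_entropy(student_logits, labels)
            # "Dynamically weighted": the mix shifts over training (placeholder schedule).
            alpha = step / total_steps
            return (1 - alpha) * soft + alpha * hard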

  • by system2 on 4/5/25, 9:31 PM

    Llama 4 Maverick: 788GB

    Llama 4 Scout: 210GB

    FYI.

  • by croemer on 4/8/25, 7:35 AM

    Relevant update: the model on LM Arena is not the one that was released. See "Meta got caught gaming AI benchmark" https://news.ycombinator.com/item?id=43617660
  • by andrewstuart on 4/5/25, 7:12 PM

    How much smaller would such a model be if it discarded all information not related to computers or programming?
  • by mrcwinn on 4/5/25, 10:11 PM

    I had just paid for SoftRAM but happy nonetheless to see new distilled models. Nice work Meta.
  • by georgehill on 4/5/25, 7:38 PM

    OP here. A better link dropped from Meta: https://ai.meta.com/blog/llama-4-multimodal-intelligence

    Is there a way to update the main post? @tomhoward

    Edit:

    Updated!

  • by EGreg on 4/6/25, 2:03 AM

    Can we somehow load these inside node.js?

    What is the easiest way to load them remotely? Huggingface Spaces? Google AI Studio?

    I am teaching a course on AI to non-technical students, and I wanted the students to have a minimal setup, which in this case would be:

    1) Browser with JS (simple folder of HTML, CSS) and Tensorflow.js that can run models like Blazeface for face recognition, eye tracking etc. (available since 2019)

    2) Node.js with everything baked in (javascript) and use a CDN like CloudFront with tunnel to serve it to the web

    3) If they download models to their computer, how would they run them? Is it possible to run the smallest Llama locally? Or any GGUF models in JS? Or do they have to have Python and PyTorch?

    PS: Here is what the class looks like: https://vimeo.com/1060576298/c5693047e0?share=copy

  • by amrrs on 4/5/25, 8:35 PM

    The entire licensing is such a mess and Mark Zuckerberg still thinks Llama 4 is open source!

    > no commercial usage above 700M MAU

    > must prefix the name of any derivative (e.g. a fine-tune) with "llama"

    > must display "built with llama"

    > must include the license notice in all redistributions

  • by barrenko on 4/5/25, 7:05 PM

    When will this hit the Meta AI that I've had in WhatsApp since last week?
  • by yusufozkan on 4/5/25, 7:14 PM

    > while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs

    I thought they used a lot more GPUs to train frontier models (e.g. xAi training on 100k). Can someone explain why they are using so few?

  • by jwr on 4/6/25, 3:09 AM

    For those unfamiliar with the "active parameters" terminology, what would be the RAM requirements?

    E.g. can I run the smallest one on my MacBook Pro (M4 Max, 64GB) like I can run gemma3?

  • by Amekedl on 4/6/25, 10:26 AM

    So the wall really has been hit already, ouch. It was to be expected with GPT-“4.5”, but still, the realization now really feels grounded.
  • by spwa4 on 4/5/25, 6:50 PM

    I hope this time multimodal includes multimodal outputs!
  • by gzer0 on 4/5/25, 8:07 PM

    10M context length and surpasses claude-3.7-sonnet and GPT-4.5.

    Can't wait to dig in on the research papers. Congrats to the llama team!

  • by steele on 4/6/25, 2:59 AM

    Consuming pirated literature en masse produces a bias away from authoritarianism; consider me flabbergasted.
  • by artninja1988 on 4/5/25, 6:58 PM

    Thank you Meta for open sourcing! Will there be a Llama with native image output similar to 4o's? Would be huge.
  • by elromulous on 4/5/25, 6:38 PM

    Was this released in error? One would think it would be accompanied by a press release / blog post.
  • by Havoc on 4/6/25, 11:51 PM

    Interesting that the reception here is much more positive than on /r/localllama
  • by paulmendoza on 4/6/25, 5:21 AM

    How long did they run the training job for? Curious how much it cost to train all of these models.
  • by ilove_banh_mi on 4/5/25, 6:41 PM

    >10M context window

    what new uses does this enable?

  • by supernovae on 4/6/25, 8:21 PM

    It's too bad these models are built on the expectation of pirating the world
  • by ein0p on 4/6/25, 9:01 AM

    If it's not on Ollama, nobody is going to care beyond perusing the metrics.
  • by drilbo on 4/5/25, 7:14 PM

    their huggingface page doesn't actually appear to have been updated yet
  • by scosman on 4/5/25, 6:43 PM

    128 experts at 17B active parameters. This is going to be fun to play with!
  • by isawczuk on 4/5/25, 6:40 PM

    Messenger started to get Meta AI assistant, so this is logical next step
  • by rvz on 4/5/25, 6:48 PM

    As expected, Meta doesn't disappoint and accelerates the race to zero.

    Meta is undervalued.

  • by fpgaminer on 4/5/25, 6:52 PM

    https://www.llama.com/ https://www.llama.com/docs/model-cards-and-prompt-formats/ll...

    Very exciting. Benchmarks look good, and most importantly it looks like they did a lot of work improving vision performance (based on benchmarks).

    The new suggested system prompt makes it seem like the model is less censored, which would be great. The phrasing of the system prompt is ... a little disconcerting in context (Meta's kowtowing to Nazis), but in general I'm a proponent of LLMs doing what users ask them to do.

    Once it's on an API I can start throwing my dataset at it to see how it performs in that regard.

  • by asdev on 4/5/25, 7:42 PM

    I don't think open source will be the future of AI models. Self-hosting an AI model is much more complex and resource intensive than traditional open source SaaS. Meta will likely have a negative ROI on their AI efforts.
  • by jacooper on 4/6/25, 7:19 PM

    BTW, these models aren't allowed to be used in the EU.
  • by lousken on 4/5/25, 8:23 PM

    ollama when
  • by krashidov on 4/5/25, 7:24 PM

    Anyone know if it can analyze PDFs?
  • by Centigonal on 4/5/25, 6:45 PM

    Really great marketing here, props!
  • by ein0p on 4/6/25, 2:38 AM

    Strange choice of languages for their "multilingual" capabilities, but OK. I wonder why there's no Chinese.
  • by dcl on 4/7/25, 4:58 AM

    But how good is it at Pokemon?
  • by tomdekan on 4/5/25, 9:39 PM

    So, Quasar == Llama 4 Behemoth?
  • by Ninjinka on 4/5/25, 7:28 PM

    no audio input?
  • by yapyap on 4/5/25, 6:40 PM

    is this the quasar LLM from openrouter?
  • by ianks on 4/6/25, 7:24 AM

    Are we going to find out that Meta pirated libgen again, with zero recognition to the authors?

    “Open-sourcing it” doesn’t magically absolve you of the irreparable damages you’ve caused society. You stole their life’s work so your company could profit off of rage-slop.

  • by DeepYogurt on 4/6/25, 12:06 AM

    Jesus. How much ram does the big one take to run?
  • by ofermend on 4/8/25, 1:09 AM

    A great day for open source, and so glad to see Llama 4 out. However, I'm a bit disappointed that the hallucination rates of Llama 4 are not as low as I would have liked (TL;DR: slightly higher than Llama 3).

    Check the numbers on the hallucination leaderboard: https://github.com/vectara/hallucination-leaderboard

  • by guybedo on 4/6/25, 4:41 AM

  • by Deprogrammer9 on 4/5/25, 6:39 PM

    looks like a leak to me.
  • by RandyOrion on 4/6/25, 2:01 AM

    I guess I have to say thank you Meta?

    A somewhat sad rant below.

    DeepSeek started a toxic trend of providing super, super large MoE models. And MoE is famous for being parameter-inefficient, which is unfriendly to normal consumer hardware with limited VRAM.

    The super large size of these LLMs also prevents nearly everyone from doing meaningful development on them. R1-1776 is the only fine-tuned variant of R1 that has made some noise, and it's by a corporation, not some random individual.

    In this release, the smallest Llama 4 model is over 100B, which is not small by any means, and will prevent people from fine-tuning as well.

    On top of that, accessing Llama models on Hugging Face has become notoriously hard because of 'permission' issues. See details in https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/dis...

    Yeah, I personally don't really see the point of releasing large MoEs. I'll stick to small and dense LLMs from Qwen, Mistral, Microsoft, Google and others.

    Edit: This comment got downvoted, too. Please explain your reason before doing that.

  • by rfoo on 4/5/25, 7:12 PM

    From model cards, suggested system prompt:

    > You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.

    It's interesting that not a single one of the CJK languages is mentioned. I'm even tempted to call this a racist model.