by lawrencechen on 4/1/24, 2:17 AM with 451 comments
by speps on 4/1/24, 8:00 AM
> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS
If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency entirely and going directly to Vulkan or Metal compute shaders. Am I correct?
by bottlepalm on 4/1/24, 2:52 AM
by marshallward on 4/1/24, 2:05 PM
The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.
Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.
Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
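For anyone who hasn't seen what that looks like in practice, here's a rough sketch of the idea, written in C++ rather than Fortran (the names and dimensions are made up for illustration): one OpenMP pragma parallelizes the outer loop of a naive GEMM, and the compiler's own unroller and vectorizer handle the rest.

```cpp
// Naive single-precision GEMM: C = alpha*A*B + beta*C (row-major).
// A single OpenMP pragma parallelizes over rows of C; flags like
// -O3 -funroll-all-loops let the compiler unroll/vectorize the inner loop.
void sgemm(int m, int n, int k, float alpha, const float *A,
           const float *B, float beta, float *C) {
#pragma omp parallel for
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int l = 0; l < k; ++l)
        acc += A[i * k + l] * B[l * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}
```

Compile with something like `g++ -O3 -fopenmp -funroll-all-loops`; the Fortran equivalent is a single `!$omp parallel do` above the outer loop of `SGEMM`.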
by ajtulloch on 4/1/24, 3:32 AM
by TimPC on 4/1/24, 4:49 PM
by aaronscott on 4/1/24, 5:10 PM
This is great. I love the idea of measuring performance differences in “years of Moore’s law.”
Twenty years puts the delta in an easy-to-understand framework.
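The conversion is just a logarithm: assuming the textbook doubling period of roughly two years, a speedup S corresponds to about 2·log2(S) "Moore years". A trivial sketch (the 1000x input is only an example, not a figure from the article):

```cpp
#include <cmath>
#include <cstdio>

// Convert a raw speedup factor into "years of Moore's law",
// assuming the classic doubling period of ~2 years.
double moore_years(double speedup, double doubling_period_years = 2.0) {
  return std::log2(speedup) * doubling_period_years;
}

int main() {
  std::printf("%.1f years\n", moore_years(1000.0));  // prints ~19.9
}
```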
by wokwokwok on 4/1/24, 4:05 AM
While running TinyLlama does indeed count as running a language model, I’m skeptical that its capabilities match what most people would consider the baseline requirement to be useful.
Running a 10-parameter model is also “technically” running an LM, and I can do that by hand with a piece of paper.
That doesn’t mean “you don’t need a computer to run an LM”…
I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.
I don’t realllly believe you can do a lot of useful LLM work on a Pi.
by tiffanyh on 4/1/24, 12:29 PM
I wonder if we’ll end up in a situation like rendered movies.
Where big studios like Pixar use CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).
by ein0p on 4/1/24, 5:07 PM
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
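For anyone curious what that estimate looks like, it's just watts divided by sustained FLOP/s. A sketch with made-up but plausible round numbers (none of these figures come from the article or this thread):

```cpp
#include <cstdio>

// Energy per floating-point operation:
//   pJ/FLOP = watts / (FLOP/s) * 1e12
double pj_per_flop(double watts, double flops_per_sec) {
  return watts / flops_per_sec * 1e12;
}

int main() {
  // Illustrative round numbers only, not measurements:
  std::printf("CPU: %.0f pJ/FLOP\n", pj_per_flop(200.0, 1e12));   // ~200
  std::printf("GPU: %.0f pJ/FLOP\n", pj_per_flop(300.0, 50e12));  // ~6
}
```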
by AbuAssar on 4/1/24, 12:05 PM
"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
by pama on 4/1/24, 3:04 AM
by saagarjha on 4/3/24, 2:04 AM
Clearly nobody actually tried this, because on XNU if you fork bomb the system it reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
by none_to_remain on 4/1/24, 4:24 AM
I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it
by jongjong on 4/1/24, 5:18 AM
To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspect the gain was only marginal because, the way my library is set up, it has to load all the values from RAM into the GPU for each forward and backpropagation pass of every layer during training. I believe there is a way to keep memory allocated on the GPU itself, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).
But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected at least a 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time, which was a relatively deep network. It makes sense, since the layers cannot be parallelized: the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have and the less you benefit from GPU acceleration... And unfortunately, deep networks also happen to be the ones that tend to perform best for a lot of use cases.
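The serial-dependency point is easy to see in code: each layer's forward pass consumes the previous layer's output, so extra depth adds serial steps that no amount of GPU parallelism removes (only the work inside a layer parallelizes). A minimal sketch with made-up types, not taken from any particular library:

```cpp
#include <vector>

struct Layer {
  // Stand-in forward pass; in a real library this is a matmul plus an
  // activation, and on a GPU it may also mean copying the input from
  // host RAM to device memory on every call.
  std::vector<float> forward(const std::vector<float> &in) const {
    return in;  // identity, just to expose the data dependency
  }
};

std::vector<float> run(const std::vector<Layer> &layers,
                       std::vector<float> x) {
  // Layers must run one after another: layer i needs layer i-1's output,
  // so a deeper network means more serial steps (and, if activations
  // aren't kept resident on the GPU, one round-trip transfer per layer).
  for (const Layer &layer : layers)
    x = layer.forward(x);
  return x;
}
```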
by kiratp on 4/1/24, 3:08 AM
by politelemon on 4/1/24, 6:02 AM
Will definitely be giving this a try.
by aniijbod on 4/1/24, 3:17 AM
by kristianp on 4/1/24, 5:50 AM
by s_Hogg on 4/1/24, 1:02 PM
by miki123211 on 4/1/24, 12:23 PM
by mijoharas on 4/1/24, 9:32 AM
> I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.
I assume it's something project specific rather than being able to get the disassembly for an arbitrary section of code or something?
It seems very handy, so I'd love to see the implementation (I couldn't find anything by googling).
by hrkfmud50k on 4/1/24, 3:12 PM
780 GFLOPS is the iGPU spec. Is this a valid comparison?
by moffkalast on 4/1/24, 9:56 AM
Odd how there were no Mistral 7B benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went and re-tested it myself on a Pi 5 8GB.
llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second
llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second
It does seem to inch closer to the speed you get with BLAS acceleration, which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory throughput that it saturates the required compute with just 3 threads. So while fancy kernels will make it more efficient, they won't save you from that fundamental bandwidth limit. The Pi Foundation messed up going with a 32-bit memory bus, simple as.
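The bandwidth ceiling is easy to sanity-check: during generation every weight has to be streamed from RAM once per token, so tokens/s can't exceed bandwidth divided by model size. With my own ballpark figures of ~17 GB/s for the Pi 5's 32-bit LPDDR4X-4267 and ~4.1 GB for a Q4-quantized Mistral 7B (neither number is from this thread), the ceiling is about 4 tokens/s, so 2.3 to 2.6 is already a decent fraction of it:

```cpp
#include <cstdio>

// Upper bound for a memory-bandwidth-bound decoder: every weight is
// read once per generated token, so tokens/s <= bandwidth / model size.
int main() {
  double bandwidth_gb_s = 17.0;  // assumed Pi 5 LPDDR4X-4267, 32-bit bus
  double model_gb = 4.1;         // assumed Q4-quantized Mistral 7B
  std::printf("ceiling: %.1f tok/s\n", bandwidth_gb_s / model_gb);  // ~4.1
}
```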
by isusmelj on 4/1/24, 9:23 AM
by 1-6 on 4/1/24, 2:58 AM
by bee_rider on 4/1/24, 4:07 AM
by column on 4/2/24, 9:08 AM
by Ono-Sendai on 4/1/24, 4:58 AM
by rbnsl on 4/1/24, 6:15 PM
by DrNosferatu on 4/3/24, 12:39 PM
by yieldcrv on 4/1/24, 4:26 PM
by Dobiasd on 4/3/24, 8:43 AM
by discordance on 4/1/24, 3:00 AM
Can anyone here answer why this is?
by arendtio on 4/1/24, 3:43 PM
Edit: After the download I did a simple chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile
by seangrogg on 4/1/24, 6:54 AM
by m3kw9 on 4/1/24, 1:30 PM
by 6r17 on 4/1/24, 11:31 AM
by JohnnyHerz on 4/1/24, 7:05 PM
by tubs on 4/1/24, 2:03 PM
by aimonster2 on 4/1/24, 12:19 PM
by wtallis on 4/1/24, 6:00 AM
by 4bpp on 4/1/24, 12:31 PM
by pknerd on 4/1/24, 8:25 AM
by sublimefire on 4/1/24, 12:21 PM
My friend suggested nominating Justine for open source contributions in an internal Microsoft programme (the winner takes $10k). They did not even want to add her to the list of potential nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think about supporting OSS.
by tomp on 4/1/24, 11:56 AM