by lawrencechen on 4/1/24, 2:17 AM with 451 comments
by speps on 4/1/24, 8:00 AM
> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS
If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency entirely and going directly to Vulkan or Metal compute shaders. Am I correct?
by bottlepalm on 4/1/24, 2:52 AM
by marshallward on 4/1/24, 2:05 PM
The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.
Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.
Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
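For anyone who hasn't seen what that looks like in practice, here's a rough sketch of the idea, written in C++ rather than Fortran (the names and dimensions are made up for illustration): one OpenMP pragma parallelizes the outer loop of a naive GEMM, and the compiler's own unroller and vectorizer handle the rest.

```cpp
// Naive single-precision GEMM: C = alpha*A*B + beta*C (row-major).
// A single OpenMP pragma parallelizes over rows of C; flags like
// -O3 -funroll-all-loops let the compiler unroll/vectorize the inner loop.
void sgemm(int m, int n, int k, float alpha, const float *A,
           const float *B, float beta, float *C) {
#pragma omp parallel for
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int l = 0; l < k; ++l)
        acc += A[i * k + l] * B[l * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}
```

Compile with something like `g++ -O3 -fopenmp -funroll-all-loops`; the Fortran equivalent is a single `!$omp parallel do` above the outer loop of `SGEMM`.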
by ajtulloch on 4/1/24, 3:32 AM
by TimPC on 4/1/24, 4:49 PM
by aaronscott on 4/1/24, 5:10 PM
This is great. I love the idea of measuring performance differences in “years of Moore’s law.”
Twenty years puts the delta in an easy-to-understand framework.
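The conversion is just a logarithm: assuming the textbook doubling period of roughly two years, a speedup S corresponds to about 2·log2(S) "Moore years". A trivial sketch (the 1000x input is only an example, not a figure from the article):

```cpp
#include <cmath>
#include <cstdio>

// Convert a raw speedup factor into "years of Moore's law",
// assuming the classic doubling period of ~2 years.
double moore_years(double speedup, double doubling_period_years = 2.0) {
  return std::log2(speedup) * doubling_period_years;
}

int main() {
  std::printf("%.1f years\n", moore_years(1000.0));  // prints ~19.9
}
```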
by wokwokwok on 4/1/24, 4:05 AM
While running TinyLlama does indeed count as running a language model, I’m skeptical that its capabilities match what most people would consider the baseline requirement to be useful.
Running a 10-parameter model is also “technically” running an LM, and I can do that by hand with a piece of paper.
That doesn’t mean “you don’t need a computer to run an LM”…
I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.
I don’t realllly believe you can do a lot of useful LLM work on a Pi.
by tiffanyh on 4/1/24, 12:29 PM
I wonder if we’ll end up in a situation like rendered movies.
Where big studios like Pixar use CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).
by ein0p on 4/1/24, 5:07 PM
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
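For anyone curious what that estimate looks like, it's just watts divided by sustained FLOP/s. A sketch with made-up but plausible round numbers (none of these figures come from the article or this thread):

```cpp
#include <cstdio>

// Energy per floating-point operation:
//   pJ/FLOP = watts / (FLOP/s) * 1e12
double pj_per_flop(double watts, double flops_per_sec) {
  return watts / flops_per_sec * 1e12;
}

int main() {
  // Illustrative round numbers only, not measurements:
  std::printf("CPU: %.0f pJ/FLOP\n", pj_per_flop(200.0, 1e12));   // ~200
  std::printf("GPU: %.0f pJ/FLOP\n", pj_per_flop(300.0, 50e12));  // ~6
}
```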
by AbuAssar on 4/1/24, 12:05 PM
"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
by pama on 4/1/24, 3:04 AM
by saagarjha on 4/3/24, 2:04 AM
Clearly nobody actually tried this, because on XNU if you fork bomb the system it reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
by none_to_remain on 4/1/24, 4:24 AM
I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it
by jongjong on 4/1/24, 5:18 AM
To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspect the gain was only marginal because, the way my library is set up, it has to load all the values from RAM into the GPU for each forward and backpropagation pass of every layer during training. I believe there is a way to keep memory allocated on the GPU itself, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).
But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected at least a 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time, which was a relatively deep network. It makes sense, since the layers cannot be parallelized: the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have and the less you benefit from GPU acceleration... And unfortunately, deep networks also happen to be the ones that tend to perform best for a lot of use cases.
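The serial-dependency point is easy to see in code: each layer's forward pass consumes the previous layer's output, so extra depth adds serial steps that no amount of GPU parallelism removes (only the work inside a layer parallelizes). A minimal sketch with made-up types, not taken from any particular library:

```cpp
#include <vector>

struct Layer {
  // Stand-in forward pass; in a real library this is a matmul plus an
  // activation, and on a GPU it may also mean copying the input from
  // host RAM to device memory on every call.
  std::vector<float> forward(const std::vector<float> &in) const {
    return in;  // identity, just to expose the data dependency
  }
};

std::vector<float> run(const std::vector<Layer> &layers,
                       std::vector<float> x) {
  // Layers must run one after another: layer i needs layer i-1's output,
  // so a deeper network means more serial steps (and, if activations
  // aren't kept resident on the GPU, one round-trip transfer per layer).
  for (const Layer &layer : layers)
    x = layer.forward(x);
  return x;
}
```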
by kiratp on 4/1/24, 3:08 AM
by politelemon on 4/1/24, 6:02 AM
Will definitely be giving this a try.
by aniijbod on 4/1/24, 3:17 AM
by kristianp on 4/1/24, 5:50 AM
by s_Hogg on 4/1/24, 1:02 PM
by miki123211 on 4/1/24, 12:23 PM
by mijoharas on 4/1/24, 9:32 AM
> I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.
I assume it's something project specific rather than being able to get the disassembly for an arbitrary section of code or something?
It seems very handy, so I'd love to see the implementation (I couldn't find anything by googling).
by hrkfmud50k on 4/1/24, 3:12 PM
780 GFLOPS is the iGPU spec. Is this a valid comparison?
by moffkalast on 4/1/24, 9:56 AM
Odd how there were no Mistral 7B benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went and re-tested it myself on a Pi 5 8GB.
llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second
llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second
It does seem to inch closer to the speed you get with BLAS acceleration, which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory throughput that it saturates the required compute with just 3 threads. So while fancy kernels will make it more efficient, they won't save you from that fundamental bandwidth limit. The Pi Foundation messed up going with a 32-bit memory bus, simple as.
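The bandwidth ceiling is easy to sanity-check: during generation every weight has to be streamed from RAM once per token, so tokens/s can't exceed bandwidth divided by model size. With my own ballpark figures of ~17 GB/s for the Pi 5's 32-bit LPDDR4X-4267 and ~4.1 GB for a Q4-quantized Mistral 7B (neither number is from this thread), the ceiling is about 4 tokens/s, so 2.3 to 2.6 is already a decent fraction of it:

```cpp
#include <cstdio>

// Upper bound for a memory-bandwidth-bound decoder: every weight is
// read once per generated token, so tokens/s <= bandwidth / model size.
int main() {
  double bandwidth_gb_s = 17.0;  // assumed Pi 5 LPDDR4X-4267, 32-bit bus
  double model_gb = 4.1;         // assumed Q4-quantized Mistral 7B
  std::printf("ceiling: %.1f tok/s\n", bandwidth_gb_s / model_gb);  // ~4.1
}
```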
by isusmelj on 4/1/24, 9:23 AM
by 1-6 on 4/1/24, 2:58 AM
by bee_rider on 4/1/24, 4:07 AM
by column on 4/2/24, 9:08 AM
by Ono-Sendai on 4/1/24, 4:58 AM
by rbnsl on 4/1/24, 6:15 PM
by DrNosferatu on 4/3/24, 12:39 PM
by yieldcrv on 4/1/24, 4:26 PM
by Dobiasd on 4/3/24, 8:43 AM
by discordance on 4/1/24, 3:00 AM
Can anyone here answer why this is?
by arendtio on 4/1/24, 3:43 PM
Edit: After the download I did a simple chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile
by seangrogg on 4/1/24, 6:54 AM
by m3kw9 on 4/1/24, 1:30 PM
by 6r17 on 4/1/24, 11:31 AM
by JohnnyHerz on 4/1/24, 7:05 PM
by tubs on 4/1/24, 2:03 PM
by aimonster2 on 4/1/24, 12:19 PM
by wtallis on 4/1/24, 6:00 AM
by 4bpp on 4/1/24, 12:31 PM
by pknerd on 4/1/24, 8:25 AM
by sublimefire on 4/1/24, 12:21 PM
My friend suggested nominating Justine for open source contributions in an internal Microsoft programme (the winner takes $10k). They did not even want to add her to the list of potential nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think about supporting OSS.
by tomp on 4/1/24, 11:56 AM