by dataminer on 12/20/23, 1:46 PM with 83 comments
by phh on 12/20/23, 4:21 PM
After some thought, this does make sense for ReLU, because half of the function is constant: a neuron is "cold" if its ReLU-ed output is 0 most of the time. So I checked whether ReLU is common in LLMs; the original LLaMA doesn't use it. After (re-)reading the GitHub page, it turns out this only works on ReLU models. There is a group of people "fine-tuning" (I would rather call it re-training, since you start by breaking the model?) models to use ReLU to enable that sparsity: https://huggingface.co/SparseLLM
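A minimal sketch of the kind of thing I mean (my own toy code, not from the PowerInfer repo; the 90% cutoff is made up): record each FFN neuron's ReLU output over some calibration tokens and call the neurons that are almost always exactly zero "cold":

    import torch

    def classify_neurons(relu_out: torch.Tensor, cold_fraction: float = 0.9):
        # relu_out: (num_tokens, num_neurons) ReLU activations from calibration data.
        # A neuron is "cold" if it outputs exactly 0 for >= cold_fraction of tokens.
        zero_rate = (relu_out == 0).float().mean(dim=0)
        cold = zero_rate >= cold_fraction
        return cold, ~cold

    # toy usage: 1024 tokens, 11008 FFN neurons (LLaMA-7B-ish FFN width),
    # with a different fake bias per neuron so zero rates actually vary
    acts = torch.relu(torch.randn(1024, 11008) - torch.rand(11008) * 3)
    cold_mask, hot_mask = classify_neurons(acts)
    print(f"{cold_mask.float().mean():.0%} of neurons are cold")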
So this is sadly not applicable to just any model you find on the internet, but it sounds like great progress anyway. Possibly this shifts the trade-offs back toward bigger models with "less ideal" activations. I'm also curious about the legal impact (the USA and EU regulations refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)
I think a possible avenue for future research here is keeping the original activation (like LLaMA keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1.0 at 8 bits the activation is equivalent to -infinity, and thus the neuron is cold).
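A rough illustration of that (again my own sketch; the -1.0 cutoff and 90% threshold are arbitrary): below some saturation point, SiLU(x) = x * sigmoid(x) is close to zero, so you can treat the gate as "off" there and classify neurons by how often they sit in that region:

    import torch
    import torch.nn.functional as F

    def cold_by_saturation(gate_preact: torch.Tensor,
                           saturation: float = -1.0,
                           cold_fraction: float = 0.9):
        # gate_preact: (num_tokens, num_neurons) pre-activations of the SwiGLU gate.
        # Below `saturation`, SiLU(x) = x * sigmoid(x) is near zero, so treat the
        # neuron as "off" for that token, mimicking ReLU-style sparsity.
        off_rate = (gate_preact < saturation).float().mean(dim=0)
        return off_rate >= cold_fraction

    # sanity check on how much we'd actually be throwing away at that cutoff
    print(F.silu(torch.tensor(-1.0)).item())   # about -0.27: small, but not exactly 0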
by 127 on 12/20/23, 3:29 PM
by Const-me on 12/20/23, 11:27 PM
Runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.
by brucethemoose2 on 12/20/23, 3:19 PM
For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU, then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.
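Rough picture of the difference, as I understand it (my own toy sketch, not either project's actual code; a real engine would also predict which cold neurons to skip entirely): llama.cpp assigns whole layers to the GPU, whereas a neuron-level split keeps the frequently-firing FFN rows on the GPU and leaves the rarely-firing ones on the CPU:

    import torch

    gpu = "cuda" if torch.cuda.is_available() else "cpu"

    # llama.cpp-style: whole layers go to one device or the other
    n_layers, n_gpu_layers = 32, 20
    layer_device = [gpu if i < n_gpu_layers else "cpu" for i in range(n_layers)]

    # neuron-level split: within a single FFN, hot rows live on the GPU
    def split_ffn(weight: torch.Tensor, hot_mask: torch.Tensor):
        # weight: (num_neurons, hidden); hot_mask: bool per neuron
        return weight[hot_mask].to(gpu), weight[~hot_mask]

    def ffn_up(x: torch.Tensor, hot_w: torch.Tensor, cold_w: torch.Tensor):
        # hot half computed on the GPU, cold half on the CPU; outputs come back
        # in permuted neuron order here, a real kernel would scatter them back
        # into place (and skip most of the cold rows altogether).
        hot_out = (x.to(gpu) @ hot_w.T).cpu()
        cold_out = x @ cold_w.T
        return torch.cat([hot_out, cold_out], dim=-1)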
Hopefully the "cold" neurons eventually get offloaded to the IGP instead?
Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... seems like that would only help old AMD Macs, unless I'm missing something?
by jupp0r on 12/20/23, 4:11 PM
by EwanG on 12/20/23, 3:20 PM
We have tested PowerInfer on the following platforms:
x86-64 CPU (with AVX2 instructions) on Linux
x86-64 CPU and NVIDIA GPU on Linux
Apple M Chips on macOS (As we do not optimize for Mac, the performance improvement is not significant now.)
And new features coming soon:
Mistral-7B model
Metal backend for sparse inference on macOS
by peter_d_sherman on 12/21/23, 5:27 PM
Brilliant!
by modeless on 12/20/23, 4:39 PM
by superkuh on 12/20/23, 9:24 PM
by causality0 on 12/20/23, 9:08 PM
by nextaccountic on 12/20/23, 8:21 PM
Does this mean that it runs on both the CPU and GPU at the same time, and is faster than a CPU-only or a GPU-only implementation on the same device?
edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?
by PoignardAzur on 12/21/23, 3:04 PM
by robwwilliams on 12/21/23, 4:02 AM
by ComputerGuru on 12/20/23, 4:21 PM
by ekianjo on 12/20/23, 3:32 PM
by coder543 on 12/20/23, 3:19 PM