from Hacker News

High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

by dataminer on 12/20/23, 1:46 PM with 83 comments

  • by phh on 12/20/23, 4:21 PM

    Took me a while to understand what their "hot" and "cold" neurons meant, since in most of the ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).

    After some thought, with ReLU it does make sense: half of the function is constant, so you can say a neuron is "cold" if its ReLU-ed output is often 0. So I checked whether ReLU was common in LLMs; the original Llama doesn't use ReLU. But after (re-)reading the GitHub page, this actually only works on ReLU models. It turns out there is a group of people "fine-tuning" (I would rather call that re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM

    So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this shifts the trade-offs back toward bigger models with "less ideal" activations. I'm also curious what the legal impact would be (since the USA and the EU refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)

    I think a possible avenue for future research in that area is keeping the original activation (like Llama keeping SwiGLU) but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that this activation function, below -1.0 at 8 bits, is equivalent to -infinity, and thus the neuron is cold).
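
    To make the "often 0" intuition concrete, here is a minimal sketch of ranking neurons by firing rate over a profiling set and calling the top slice "hot" (illustrative only; the 20% cut-off and the PyTorch-style tensors are my assumptions, not what PowerInfer's predictors actually do):

      import torch

      def classify_neurons(w_in, hidden_states, hot_fraction=0.2):
          # hidden_states: (num_tokens, d_model) activations from a profiling run
          # w_in: (d_ff, d_model) first FFN projection of one layer
          pre_act = hidden_states @ w_in.T                   # (num_tokens, d_ff)
          fired = (torch.relu(pre_act) > 0).float().mean(0)  # per-neuron firing rate
          order = torch.argsort(fired, descending=True)
          n_hot = int(hot_fraction * fired.numel())
          return order[:n_hot], order[n_hot:]                # hot indices, cold indices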

  • by 127 on 12/20/23, 3:29 PM

    Running uncensored Mixtral on this would be really nice: more than 3-bit quantization on a 4090.
  • by Const-me on 12/20/23, 11:27 PM

    Since they mentioned they’re working on Mistral-7B, I’d like to note that my GPU-only implementation of Mistral uses slightly over 5GB of VRAM: https://github.com/Const-me/Cgml

    Runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.

  • by brucethemoose2 on 12/20/23, 3:19 PM

    This is super cool.

    For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.

    Hopefully the "cold" neurons eventually get offloaded to the IGP instead?

    Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I am missing something?
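
    For reference, the layer splitting described above is roughly this simple (a toy sketch with made-up names, not actual llama.cpp code); the neuron-level approach instead splits within each layer:

      def split_by_layer(n_layers, n_gpu_layers):
          # llama.cpp-style offload: a contiguous block of whole layers goes to
          # the GPU, the remaining layers stay on the CPU.
          cut = min(n_gpu_layers, n_layers)
          return list(range(cut)), list(range(cut, n_layers))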

  • by jupp0r on 12/20/23, 4:11 PM

    From my understanding, this implementation needs some knowledge about the model itself to determine which parts to place in system memory vs. which parts to place in GPU memory. Can this ideally be computed automatically, or will future models expose some sort of interface for placement algorithms like this to help automate it? If the algorithm needs to be adapted for each model architecture, maintaining this project is going to be a lot of work.
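
    A back-of-the-envelope version of such a placement pass could be a greedy fill against the VRAM budget (purely illustrative; the per-neuron byte cost and firing rates are assumed inputs, and this is not PowerInfer's actual placement policy):

      def place_neurons(firing_rate, bytes_per_neuron, vram_budget_bytes):
          # Greedy placement: the most frequently firing ("hot") neurons go to
          # the GPU until the VRAM budget runs out; the rest stay in system
          # memory and are computed on the CPU.
          order = sorted(range(len(firing_rate)),
                         key=lambda i: firing_rate[i], reverse=True)
          gpu, cpu, used = [], [], 0
          for i in order:
              if used + bytes_per_neuron <= vram_budget_bytes:
                  gpu.append(i)
                  used += bytes_per_neuron
              else:
                  cpu.append(i)
          return gpu, cpu
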
  • by EwanG on 12/20/23, 3:20 PM

    The important stuff from the readme (if you're not looking to tinker with it directly):

    We have tested PowerInfer on the following platforms:

    x86-64 CPU (with AVX2 instructions) on Linux

    x86-64 CPU and NVIDIA GPU on Linux

    Apple M Chips on macOS (As we do not optimize for Mac, the performance improvement is not significant now.)

    And new features coming soon:

    Mistral-7B model

    Metal backend for sparse inference on macOS

  • by peter_d_sherman on 12/21/23, 5:27 PM

    >"This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."

    Brilliant!
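
    As a toy illustration of the quoted idea, the first projection of a ReLU FFN could be partitioned by rows, with the hot rows resident on the GPU and the cold rows on the CPU (a PyTorch-style sketch under assumed names; PowerInfer's actual engine is of course far more involved):

      import torch

      def hybrid_ffn_forward(x, w_hot_gpu, w_cold_cpu, hot_idx, cold_idx, d_ff):
          # x: (batch, d_model) on the CPU; w_hot_gpu holds the hot rows of the
          # FFN input projection on the GPU, w_cold_cpu the cold rows on the CPU.
          out = torch.zeros(x.shape[0], d_ff)
          out[:, hot_idx] = torch.relu(x.to(w_hot_gpu.device) @ w_hot_gpu.T).cpu()
          out[:, cold_idx] = torch.relu(x @ w_cold_cpu.T)
          return out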

  • by modeless on 12/20/23, 4:39 PM

    Everyone compares against llama.cpp because it's easy mode. Llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.
  • by superkuh on 12/20/23, 9:24 PM

    This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models rather than just the four they've done it with. Looking through the page and code, it doesn't seem like the tools for that step are included. Guess I'll wait on this one a bit. Hopefully these features will eventually be merged back into llama.cpp as options, since this is based on the normal llama.cpp code (i.e., not just using the ggml matrix lib).
  • by causality0 on 12/20/23, 9:08 PM

    All the "consumer-grade GPUs" terminology makes it seem like you could run it on a variety of cards, but like so many of these posts, is this a 4090 exclusive?
  • by nextaccountic on 12/20/23, 8:21 PM

    > Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

    Does this mean that it runs on both the CPU and GPU at the same time, and is therefore faster than a CPU-only or GPU-only implementation on the same device?

    edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?

  • by PoignardAzur on 12/21/23, 3:04 PM

    This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.
  • by robwwilliams on 12/21/23, 4:02 AM

    Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
  • by ComputerGuru on 12/20/23, 4:21 PM

    It’s not too much faster than exllama2 with flash attention, no?
  • by ekianjo on 12/20/23, 3:32 PM

    How much of a speed increase do we get on CPU-only configurations? Has anyone tested it in such cases?
  • by coder543 on 12/20/23, 3:19 PM

    "Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)