from Hacker News

21.2× faster than llama.cpp? plus 40% memory usage reduction

by helloericsf on 6/12/24, 9:58 PM with 14 comments

  • by worstspotgain on 6/12/24, 10:34 PM

    The speed improvement only applies to models that don't fit entirely in memory, i.e. once llama.cpp is memory-starved it degrades to roughly 20x slower than PowerInfer-2.

    However, this scheme does reduce memory usage by 40%, which means models roughly 67% bigger fit in the same memory (quick check below the quote). It's a quality improvement, not a performance one.

    > For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
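
    A rough back-of-the-envelope check of that 67% figure (my arithmetic, not the paper's):

      # If the same model needs only 60% of the memory, a model
      # 1 / (1 - 0.40) ~= 1.67x as large fits instead, i.e. ~67% bigger.
      reduction = 0.40
      capacity_factor = 1 / (1 - reduction)
      print(f"{capacity_factor:.2f}x the parameters in the same memory "
            f"(~{(capacity_factor - 1) * 100:.0f}% bigger)")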

  • by russianGuy83829 on 6/12/24, 10:32 PM

    It seems like this can’t run all models, and needs custom, specially trained variants: “We introduce two new models: TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B. These models are sparsified versions of Mistral and Mixtral […]. Notably, our models are trained with just 150B tokens within just 0.1M dollars”.

    It remains to be seen how good these custom models are.

  • by x0n on 6/12/24, 10:27 PM

    The hyperbollocks marketingspeak in the summary paragraph put me off:

    "The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations."

    Ahem, what? Let's overload a biological construct "neuron" to imbue it with magical technopowers and then derive the rest of our BS from this. No sale.
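
    For what it's worth, stripped of the jargon, the "neurons" here are just individual rows and columns of the FFN weight matrices: the PowerInfer / TurboSparse line of work predicts which of them will actually fire after the activation and only reads and multiplies those, grouping them into "clusters" so the flash reads stay coarse enough to cache and pipeline. A rough, hypothetical sketch of that core idea in NumPy (the names, shapes, and predictor are made up for illustration, not the paper's API):

      import numpy as np

      def sparse_ffn(x, W_up, W_down, predicted_active):
          """FFN layer computed only over neurons predicted to be active.

          x:                (d,)   input hidden state
          W_up:             (n, d) up-projection; row i is "neuron" i
          W_down:           (d, n) down-projection
          predicted_active: indices a (hypothetical) predictor expects to fire
          """
          # Only the predicted-active rows/columns are touched. In PowerInfer-2
          # these would be grouped into clusters, cached, and pipelined so the
          # storage I/O overlaps with compute.
          h = np.maximum(W_up[predicted_active] @ x, 0.0)  # ReLU on active neurons
          return W_down[:, predicted_active] @ h

      # Toy usage with random weights and a fake "prediction" that ~12% fire.
      d, n = 64, 256
      rng = np.random.default_rng(0)
      x = rng.standard_normal(d)
      W_up, W_down = rng.standard_normal((n, d)), rng.standard_normal((d, n))
      active = rng.choice(n, size=n // 8, replace=False)
      y = sparse_ffn(x, W_up, W_down, active)
      print(y.shape)  # (64,)

    The quoted paragraph's "heterogeneous computation, memory, and I/O" part is presumably about scheduling those cluster reads and multiplies across a phone's CPU/NPU and flash storage; whether the prose needed to be that purple is another question.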