by helloericsf on 6/12/24, 9:58 PM with 14 comments
by worstspotgain on 6/12/24, 10:34 PM
However, this scheme does reduce memory usage by 40%, which means a model roughly 67% larger fits in the same memory budget (arithmetic spelled out below). It's a quality improvement, not a performance one.
> For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
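Spelling out the arithmetic behind the 67% figure, as a back-of-the-envelope check that assumes memory scales linearly with parameter count (the assumption is mine, not the paper's):

```latex
% Memory use drops by r = 0.40, so the same RAM budget holds a model larger by
\[
  \frac{1}{1 - r} \;=\; \frac{1}{0.60} \;\approx\; 1.67,
\]
% i.e. roughly 67% more parameters in the same memory.
```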
by helloericsf on 6/12/24, 9:58 PM
by russianGuy83829 on 6/12/24, 10:32 PM
It remains to be seen how good these custom models are.
by x0n on 6/12/24, 10:27 PM
"The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations."
Ahem, what? Let's overload a biological construct "neuron" to imbue it with magical technopowers and then derive the rest of our BS from this. No sale.
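For context on the jargon: in this line of work a "neuron" is just one intermediate unit of an FFN layer (one row of the up-projection, one column of the down-projection), and the idea is to predict which of them fire for a given token and compute, and fetch from storage, only those, grouped into small clusters. Below is a minimal NumPy sketch of that general idea; the names, cluster size, and exact (hence expensive) predictor are invented for illustration and are not a description of PowerInfer-2's actual engine.

```python
import numpy as np

# Toy illustration of "neuron cluster" sparse FFN computation, in the spirit of
# the quoted description. A "neuron" is one row of w_up / one column of w_down;
# a "cluster" is a contiguous group of CLUSTER such neurons. All names here are
# invented for this sketch.
CLUSTER = 32  # neurons per cluster (arbitrary choice for the example)

def predict_active_clusters(x, w_up, threshold=0.0):
    """Mark a cluster active if any of its neurons' pre-activations exceed the
    threshold. A real system would use a cheap learned predictor instead of
    computing the full pre-activation as done here."""
    pre = w_up @ x                                   # (n_neurons,)
    active = pre > threshold                         # ReLU-style activity test
    n_clusters = len(active) // CLUSTER
    return [c for c in range(n_clusters)
            if active[c * CLUSTER:(c + 1) * CLUSTER].any()]

def sparse_ffn(x, w_up, w_down):
    """Compute only the clusters predicted active; skip (and, on a phone,
    avoid even loading) the rest. In ReLU-style LLMs most FFN neurons are
    inactive for a given token; random weights in this toy are not sparse."""
    y = np.zeros(w_down.shape[0])
    for c in predict_active_clusters(x, w_up):
        rows = slice(c * CLUSTER, (c + 1) * CLUSTER)
        h = np.maximum(w_up[rows] @ x, 0.0)          # ReLU on the active cluster
        y += w_down[:, rows] @ h                     # accumulate its contribution
    return y

# Usage: a small 512 -> 2048 -> 512 FFN with random weights
rng = np.random.default_rng(0)
d, n = 512, 2048
w_up = rng.standard_normal((n, d)) * 0.02
w_down = rng.standard_normal((d, n)) * 0.02
x = rng.standard_normal(d)
print(sparse_ffn(x, w_up, w_down)[:4])
```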