from Hacker News

Run Llama2-70B in Web Browser with WebGPU Acceleration

by ruihangl on 7/24/23, 5:39 PM with 6 comments

  • by brucethemoose2 on 7/24/23, 5:55 PM

    Apache TVM is super cool in theory. It's fast thanks to autotuning, and it supports tons of backends: Vulkan, Metal, WASM + WebGPU (probed in the sketch after this comment), FPGAs, weird mobile accelerators, and such. It supports quantization, dynamic shapes, and other cool features.

    But... It isn't used much outside MLC? And MLC's implementations are basically demos.

    I dunno why. AI inference communities are dying for fast multiplatform backends without the fuss of PyTorch.
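
    For reference, the WASM + WebGPU backend mentioned above ultimately runs on the browser's standard WebGPU API. A minimal probe of that API as a TypeScript sketch (assuming the @webgpu/types type definitions are installed; nothing here is TVM- or MLC-specific):

      // Standard WebGPU feature probe. Every name below is part of the
      // browser's own WebGPU API, not TVM or MLC.
      async function probeWebGPU(): Promise<void> {
        if (!("gpu" in navigator)) {
          console.log("WebGPU not available in this browser");
          return;
        }
        const adapter = await navigator.gpu.requestAdapter();
        if (adapter === null) {
          console.log("No suitable GPU adapter found");
          return;
        }
        const device = await adapter.requestDevice();
        // Limits matter for large models: a 70B model needs generous
        // buffer-size limits from the adapter.
        console.log("maxBufferSize:", device.limits.maxBufferSize);
      }

      probeWebGPU();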

  • by ruihangl on 7/24/23, 5:39 PM

    Runs purely in the web browser. Generates 6.2 tok/s on an Apple M2 Ultra with 64 GB of memory.
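
    The demo is built on MLC's WebLLM stack (TVM-compiled kernels dispatched through WebGPU). A minimal sketch of driving it from a page, assuming the @mlc-ai/web-llm package's ChatModule API; the model ID is illustrative, not copied from the demo:

      import { ChatModule } from "@mlc-ai/web-llm";

      async function main(): Promise<void> {
        // Model weights are fetched and cached by the browser; all
        // inference then runs client-side on the GPU via WebGPU.
        const chat = new ChatModule();

        // Illustrative ID for a 4-bit-quantized Llama-2-70B build; the
        // real ID would come from the demo's model list.
        await chat.reload("Llama-2-70b-chat-hf-q4f16_1");

        const reply = await chat.generate("Why compile models with TVM?");
        console.log(reply);
      }

      main();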