from Hacker News

Run Llama2-70B in Web Browser with WebGPU Acceleration

by ruihangl on 7/24/23, 5:39 PM with 6 comments

  • by brucethemoose2 on 7/24/23, 5:55 PM

    Apache TVM is super cool in theory. It's fast thanks to autotuning, and it supports tons of backends: Vulkan, Metal, WASM + WebGPU (probed in the sketch after this comment), FPGAs, weird mobile accelerators, and such. It supports quantization, dynamic shapes, and other cool features.

    But... It isn't used much outside MLC? And MLC's implementations are basically demos.

    I dunno why. AI inference communities are dying for fast multiplatform backends without the fuss of PyTorch.
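
    For reference, the WASM + WebGPU backend mentioned above ultimately runs on the browser's standard WebGPU API. A minimal probe of that API as a TypeScript sketch (assuming the @webgpu/types type definitions are installed; nothing here is TVM- or MLC-specific):

      // Standard WebGPU feature probe. Every name below is part of the
      // browser's own WebGPU API, not TVM or MLC.
      async function probeWebGPU(): Promise<void> {
        if (!("gpu" in navigator)) {
          console.log("WebGPU not available in this browser");
          return;
        }
        const adapter = await navigator.gpu.requestAdapter();
        if (adapter === null) {
          console.log("No suitable GPU adapter found");
          return;
        }
        const device = await adapter.requestDevice();
        // Limits matter for large models: a 70B model needs generous
        // buffer-size limits from the adapter.
        console.log("maxBufferSize:", device.limits.maxBufferSize);
      }

      probeWebGPU();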

  • by ruihangl on 7/24/23, 5:39 PM

    Runs purely in the web browser. Generates 6.2 tok/s on an Apple M2 Ultra with 64 GB of memory.
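
    The demo is built on MLC's WebLLM stack (TVM-compiled kernels dispatched through WebGPU). A minimal sketch of driving it from a page, assuming the @mlc-ai/web-llm package's ChatModule API; the model ID is illustrative, not copied from the demo:

      import { ChatModule } from "@mlc-ai/web-llm";

      async function main(): Promise<void> {
        // Model weights are fetched and cached by the browser; all
        // inference then runs client-side on the GPU via WebGPU.
        const chat = new ChatModule();

        // Illustrative ID for a 4-bit-quantized Llama-2-70B build; the
        // real ID would come from the demo's model list.
        await chat.reload("Llama-2-70b-chat-hf-q4f16_1");

        const reply = await chat.generate("Why compile models with TVM?");
        console.log(reply);
      }

      main();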