from Hacker News

Fast Llama inference in pure, modern Java

by mukel on 10/14/24, 8:12 AM with 3 comments

  • by mukel on 10/14/24, 8:12 AM

    Features:
    - Single file, no dependencies
    - GGUF format parser
    - Llama 3 tokenizer
    - Support for Llama 3, 3.1 (ad-hoc RoPE scaling), and 3.2 (tied word embeddings)
    - Fast matrix-vector multiplication routines for Q4_0 and Q8_0 quantized tensors using Java's Vector API
    - GraalVM Native Image support
    - AOT model preloading for instant time-to-first-token
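
    The Q8_0 matrix-vector routine mentioned above can be sketched in plain Java. This is not the project's actual code: the real implementation uses the Vector API for SIMD, while this scalar version (with hypothetical names) only illustrates the Q8_0 block layout, i.e. groups of 32 int8 quants sharing one float scale, so the effective weight is scale * q:

```java
// Minimal scalar sketch (assumption: not the project's code) of a
// matrix-vector multiply over Q8_0-quantized weights.
// Q8_0 stores weights in blocks of 32 int8 values with one float
// scale per block; the dequantized weight is scale * quant.
public class Q8MatVec {
    static final int BLOCK = 32;

    // quants: rows*cols int8 weights, row-major
    // scales: one float per 32-weight block (rows*cols/BLOCK entries)
    // x:      dense input vector of length cols
    static float[] matVec(byte[] quants, float[] scales, float[] x,
                          int rows, int cols) {
        float[] y = new float[rows];
        int blocksPerRow = cols / BLOCK;
        for (int r = 0; r < rows; r++) {
            float sum = 0f;
            for (int b = 0; b < blocksPerRow; b++) {
                int base = r * cols + b * BLOCK;
                float blockSum = 0f;
                // Accumulate the int8 dot product, then apply the
                // per-block scale once (cheaper than per-element).
                for (int i = 0; i < BLOCK; i++) {
                    blockSum += quants[base + i] * x[b * BLOCK + i];
                }
                sum += scales[r * blocksPerRow + b] * blockSum;
            }
            y[r] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        // Tiny example: a 2x32 matrix where every quant is 1,
        // row 0 has scale 1.0 and row 1 has scale 0.5.
        int rows = 2, cols = 32;
        byte[] q = new byte[rows * cols];
        java.util.Arrays.fill(q, (byte) 1);
        float[] scales = {1.0f, 0.5f};
        float[] x = new float[cols];
        java.util.Arrays.fill(x, 1.0f);
        float[] y = matVec(q, scales, x, rows, cols);
        System.out.println(y[0] + " " + y[1]); // 32.0 16.0
    }
}
```

    The SIMD version replaces the inner loop with Vector API lanes, which is where the speedup comes from; the block structure stays the same.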
  • by mukel on 10/14/24, 8:19 AM

  • by nunobrito on 10/15/24, 9:44 PM

    Quite good stuff. I'd been looking for something like this for a long time.