from Hacker News

llama.cpp now supports StarCoder model series

by wsxiaoys on 9/18/23, 4:06 PM with 1 comment

  • by wsxiaoys on 9/18/23, 4:06 PM

    For the 1B version of the model, it operates at approximately 100 tokens per second when decoding with Metal on an Apple M2 Max.

    llama_print_timings:        load time =   114.00 ms
    llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
    llama_print_timings: prompt eval time =   107.79 ms /    22 tokens (    4.90 ms per token,   204.11 tokens per second)
    llama_print_timings:        eval time =  1315.10 ms /   127 runs   (   10.36 ms per token,    96.57 tokens per second)
    llama_print_timings:       total time =  1427.08 ms
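
    For anyone who wants to try a run like this, a rough sketch (assuming a Metal-enabled build of llama.cpp from that era; the model path is a placeholder for your own GGUF conversion of StarCoder-1B):

    # Build with Metal support on macOS (Apple Silicon)
    LLAMA_METAL=1 make

    # Run the model, offloading to the GPU with -ngl;
    # the .gguf path below is hypothetical
    ./main -m models/starcoder-1b.gguf -p "def fib(n):" -n 128 -ngl 1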

    (Disclaimer: I submitted the PR)