by physicsgraph on 12/18/23, 2:57 AM with 24 comments
by antirez on 12/18/23, 8:18 AM
Also, there are better models than the one suggested: Mistral at 7B parameters, Yi if you want to go larger and happen to have 32GB of memory, and Mixtral MoE, which is the best but requires too much memory right now for most users.
by upon_drumhead on 12/18/23, 6:17 AM
> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.
But then they download the models from Hugging Face. I don't understand how these end up smaller. Or do they modify them locally?
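(The usual answer to this kind of question is post-download quantization: the full-precision weights are fetched and then converted to a lower-precision format on the user's machine. Here is a minimal sketch of that idea in Python; the per-tensor int8 scheme and names are illustrative assumptions, not TinyChatEngine's actual pipeline or on-disk format.)

    # Illustrative sketch of post-download weight quantization,
    # NOT TinyChatEngine's actual format or conversion code.
    import numpy as np

    def quantize_int8(weights: np.ndarray):
        """Symmetric per-tensor int8 quantization: store int8 values plus a
        single float scale, shrinking a float32 tensor to ~1/4 of its size."""
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover an approximation of the original float32 weights."""
        return q.astype(np.float32) * scale

    if __name__ == "__main__":
        w = np.random.randn(4096, 4096).astype(np.float32)  # a "downloaded" fp32 layer
        q, scale = quantize_int8(w)
        print("fp32 bytes:", w.nbytes, "int8 bytes:", q.nbytes)  # roughly 4x smaller
        print("max abs error:", np.abs(w - dequantize_int8(q, scale)).max())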
by dkjaudyeqooe on 12/18/23, 11:04 AM
My relatively old i5-8600 CPU (6 cores at 3.10 GHz, 32 GB of memory) gives me about 150-250 ms per token on the default model, which is perfectly usable.
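(For context, a quick back-of-the-envelope conversion of those reported latencies into throughput; the numbers below are just 1000 / ms-per-token.)

    # Convert the reported per-token latency to tokens/second.
    for ms_per_token in (150, 250):
        print(f"{ms_per_token} ms/token ~ {1000 / ms_per_token:.1f} tokens/s")
    # 150 ms/token ~ 6.7 tokens/s; 250 ms/token ~ 4.0 tokens/s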