by jmorgan on 9/26/23, 4:29 PM with 54 comments
Over the last few months I've been working with some folks on a tool named Ollama (https://github.com/jmorganca/ollama) to run open-source LLMs like Llama 2, Code Llama and Falcon locally, starting with macOS.
The biggest ask since then has been "how can I run Ollama on Linux?" with GPU support working out of the box. Setting up and configuring CUDA and then compiling and running llama.cpp (which is a fantastic library and runs under the hood) can be quite painful across different combinations of Linux distributions and Nvidia GPUs. The goal for Ollama's Linux version was to automate this process to make it easy to get up and running.
This is the first Linux release! There's still lots to do, but I wanted to share it here to see what everyone thinks. Thanks to everyone who has given it a try and sent feedback!
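For anyone who wants to try it, getting going on Linux is meant to be roughly a one-liner plus a model pull (a sketch, assuming the install-script URL below is still current):
# download and run the install script (sets up the binary and GPU support where available)
curl https://ollama.ai/install.sh | sh
# pull a model and start chatting
ollama run llama2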
by brucethemoose2 on 9/26/23, 4:50 PM
I saw this on HN before, but I thought it was another from-scratch llama implementation... Which is fine, but much less interesting to me, as a from-scratch implementation is probably not as fast or feature-packed as llama.cpp or the TVM implementation.
Keeping up with llama.cpp's rapid evolution is very difficult, and there's a need for projects like this.
by sqs on 9/26/23, 5:38 PM
As an app dev, we have 2 choices:
(1) Build our own support for LLMs, GPU/CPU execution, model downloading, inference optimizations, etc.
(2) Just tell users "run Ollama" and have our app hit the Ollama API on localhost (or shell out to `ollama`).
Obviously choice 2 is much, much simpler. There are some things in the middle, like less polished wrappers around llama.cpp, but Ollama is the only one that 100% of the people I've told about it have been able to install without any problems.
That's huge because it's finally possible to build real apps that use local LLMs—and still reach a big userbase. Your userbase is now (pretty much) "anyone who can download and run a desktop app and who has a relatively modern laptop", which is a big population.
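To give a concrete idea, hitting the local API is a single HTTP call (a minimal sketch, assuming Ollama's default port of 11434):
# ask the locally running Ollama server for a completion; the response streams back as JSON lines
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'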
I'm really excited to see what people build on Ollama.
(And Ollama will simplify deploying server-side LLM apps as well, but right now from participating in the community, it seems most people are only thinking of it for local apps. I expect that to change when people realize that they can ship a self-contained server app that runs on a cheap AWS/GCP instance and uses an Ollama-executed LLM for various features.)
[1] Shameless plug for the WIP PR where I'm implementing Ollama support in Cody, our code AI app: https://github.com/sourcegraph/cody/pull/905.
by ForkMeOnTinder on 9/26/23, 6:07 PM
Getting started was literally as easy as:
pacman -S ollama
ollama serve
ollama run llama2:13b 'insert prompt'
You guys are doing the lord's work here.
by jrm4 on 9/26/23, 5:29 PM
by dang on 9/26/23, 6:32 PM
Show HN: Ollama – Run LLMs on your Mac - https://news.ycombinator.com/item?id=36802582 - July 2023 (94 comments)
(About this, see https://news.ycombinator.com/showhn.html)
by biddit on 9/26/23, 5:13 PM
As a solutions developer who isn't so much interested in training models as in leveraging them in a pipeline, I hadn't bothered to try running anything locally due to the complexity of setup, even with llama.cpp. You enabled me to be up and running in just a few minutes.
by sestinj on 9/26/23, 6:34 PM
Also curious, do you plan to support speculative sampling if/when the feature is merged into llama.cpp? Excited about the possibility of running a 34b at high speeds on a standard laptop
by jerrysievert on 9/26/23, 5:02 PM
For those who haven't used Ollama: being able to specify how a model behaves via a "Modelfile" is pretty darned awesome. I have a chef, a bartender, and a programmer that I use, personally.
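To make that concrete, here's a minimal sketch of such a Modelfile (the bartender persona and parameter values are just illustrative; see Ollama's Modelfile docs for the full syntax):
# write a Modelfile that layers a persona on top of a base model
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 0.7
SYSTEM You are a friendly bartender who suggests drinks and explains how to mix them.
EOF
# build a named model from it, then chat with it
ollama create bartender -f Modelfile
ollama run bartender 'What should I make with gin and lime?'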
by aftbit on 9/26/23, 6:22 PM
https://github.com/vllm-project/vllm
by kelvie on 9/26/23, 5:56 PM
Somewhat related note -- does anyone know what the performance differences are for GPU-only inference using this loader (llama.cpp + GGUF/GGML models) vs exllama using GPTQ? My understanding is that exllama/GPTQ gets a lot higher tok/s on a consumer GPU like a [34]090.
It would save me many gigabytes of test downloads if someone knew.
by WiSaGaN on 9/27/23, 1:39 AM
This is either for backup purposes or to share model files with other applications. Those model files are large!
by binarymax on 9/26/23, 5:38 PM
by hathym on 9/27/23, 6:31 AM
by aglazer on 9/26/23, 9:58 PM
by politelemon on 9/26/23, 6:43 PM
by agilob on 9/26/23, 6:17 PM