from Hacker News

Ollama for Linux – Run LLMs on Linux with GPU Acceleration

by jmorgan on 9/26/23, 4:29 PM with 54 comments

Hi HN,

Over the last few months I've been working with some folks on a tool named Ollama (https://github.com/jmorganca/ollama) to run open-source LLMs like Llama 2, Code Llama and Falcon locally, starting with macOS.

The biggest ask since then has been "how can I run Ollama on Linux?", with GPU support out of the box. Setting up and configuring CUDA and then compiling and running llama.cpp (which is a fantastic library and runs under the hood) can be quite painful on different combinations of Linux distributions and Nvidia GPUs. The goal for Ollama's Linux version was to automate this process to make it easy to get up and running.
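
Something like the following should be all it takes on a machine with a supported Nvidia GPU (see the README for the exact install command; this is just a sketch):

  # download and run the install script, then pull and run a model
  curl https://ollama.ai/install.sh | sh
  ollama run llama2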

This is the first Linux release! There's still lots to do, but I wanted to share it here to see what everyone thinks. Thanks to anyone who has given it a try and sent feedback!

  • by brucethemoose2 on 9/26/23, 4:50 PM

    Oh, this is a llama.cpp frontend. Y'all should have led with that!

    I saw this on HN before, but I thought it was another from-scratch llama implementation... Which is fine, but much less interesting to me, as a from-scratch implementation is probably not as fast or feature-packed as llama.cpp or the TVM implementation.

    Keeping up with llama.cpp's rapid evolution is very difficult, and there's a need for projects like this.

  • by sqs on 9/26/23, 5:38 PM

    Ollama is awesome. I am part of a team building a code AI application[1], and we want to give devs the option to run it locally instead of only supporting external LLMs from Anthropic, OpenAI, etc. Those big remote LLMs are incredibly powerful and probably the right choice for most devs, but it's good for devs to have a local option as well—for security, privacy, cost, latency, simplicity, freedom, etc.

    As app devs, we have 2 choices:

    (1) Build our own support for LLMs, GPU/CPU execution, model downloading, inference optimizations, etc.

    (2) Just tell users "run Ollama" and have our app hit the Ollama API on localhost (or shell out to `ollama`).

    Obviously choice 2 is much, much simpler. There are some things in the middle, like less polished wrappers around llama.cpp, but Ollama is the only thing that 100% of the people I've told about it have been able to install without any problems.
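
    For example, choice 2 boils down to little more than an HTTP call. A rough sketch, assuming the default localhost port (11434) and the /api/generate endpoint from the Ollama docs:

      curl http://localhost:11434/api/generate -d '{
        "model": "codellama",
        "prompt": "Explain what this function does: ..."
      }'

    The response streams back as one JSON object per line, which is easy to forward straight to the UI.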

    That's huge because it's finally possible to build real apps that use local LLMs—and still reach a big userbase. Your userbase is now (pretty much) "anyone who can download and run a desktop app and who has a relatively modern laptop", which is a big population.

    I'm really excited to see what people build on Ollama.

    (And Ollama will simplify deploying server-side LLM apps as well, but right now from participating in the community, it seems most people are only thinking of it for local apps. I expect that to change when people realize that they can ship a self-contained server app that runs on a cheap AWS/GCP instance and uses an Ollama-executed LLM for various features.)

    [1] Shameless plug for the WIP PR where I'm implementing Ollama support in Cody, our code AI app: https://github.com/sourcegraph/cody/pull/905.

  • by ForkMeOnTinder on 9/26/23, 6:07 PM

    Huge fan of ollama. Although this is the first official Linux release, I've been using it on Linux already for a few months now with no issues (through the Arch package, which builds from source).

    Getting started was literally as easy as:

      pacman -S ollama
      ollama serve
      ollama run llama2:13b 'insert prompt'
    
    You guys are doing the lord's work here

  • by jrm4 on 9/26/23, 5:29 PM

    Very cool. Does anyone know exactly how out of luck us AMD folk are? I know there are efforts out there, but I'm kind of hoping for something "as easy as this"?

  • by dang on 9/26/23, 6:32 PM

    It looks like great work but this isn't different enough from the recent Show HN to make a new Show HN:

    Show HN: Ollama – Run LLMs on your Mac - https://news.ycombinator.com/item?id=36802582 - July 2023 (94 comments)

    (about this see https://news.ycombinator.com/showhn.html)

  • by biddit on 9/26/23, 5:13 PM

    I’ve been using this on my MacBook Pro for the last couple weeks and want to say thank you!

    As a solutions developer not so much interested in training models but leveraging them in a pipeline, I hadn’t bothered to try to run anything locally due to the complexity of setup, even with llama.cpp. You enabled me to be up and running in just a few minutes.

  • by sestinj on 9/26/23, 6:34 PM

    This is a huge deal, congrats! We've had a ton of users asking how to run their own LLMs on Linux, and the unfortunate answer was always that the existing options were slightly complicated. Having a single click-to-download option is going to open this up for so many more people! If anyone is looking for a way to use Ollama inside VS Code, one option (what I've been working on) is https://continue.dev

    Also curious, do you plan to support speculative sampling if/when the feature is merged into llama.cpp? Excited about the possibility of running a 34b at high speeds on a standard laptop

  • by jerrysievert on 9/26/23, 5:02 PM

    this is awesome, congrats on an amazingly useful release!

    for those that haven't used ollama, being able to specify how a model behaves via a "modelfile" is pretty darned awesome. I have a chef, a bartender, and a programmer that I use, personally.
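
    for the curious, a persona is just a short Modelfile plus `ollama create`. rough sketch from memory of the docs, so double-check the syntax:

      # hypothetical "bartender" persona; Modelfile syntax as documented by Ollama
      printf '%s\n' \
        'FROM llama2' \
        'PARAMETER temperature 0.8' \
        'SYSTEM """You are a friendly bartender. Keep answers short."""' > Modelfile

      ollama create bartender -f Modelfile
      ollama run bartender 'what pairs well with a spicy dinner?'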

  • by aftbit on 9/26/23, 6:22 PM

    How does this compare to vLLM or exllama? Can it run llama2 30B on one 3090 24G or 70B on two 3090 24G?

    https://github.com/vllm-project/vllm

    https://github.com/turboderp/exllama

    https://github.com/turboderp/exllamav2

  • by kelvie on 9/26/23, 5:56 PM

    Amazing! I use text-generation-webui to play with LLMs, but was always jealous of this much simpler interface.

    Somewhat related note -- does anyone know what the performance differences are for GPU-only inference using this loader (llama.cpp + GGUF/GGML models) vs exllama using GPTQ? My understanding is that exllama/GPTQ gets a lot higher tok/s on a consumer GPU like a [34]090.

    It would save me many gigabytes of test downloads if someone knew.

  • by WiSaGaN on 9/27/23, 1:39 AM

    Ollama is awesome. I am, however, still waiting for support for controlling the model cache location: https://github.com/jmorganca/ollama/issues/153

    This is either for backup purposes or to share model files with other applications. Those model files are large!
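
    In the meantime a symlink can work as a crude workaround. Rough sketch, assuming the model store lives under ~/.ollama when you run `ollama serve` as your own user (it may live elsewhere for the packaged/systemd install):

      # hypothetical paths: move the store to a bigger disk and leave a symlink behind
      mv ~/.ollama /mnt/bigdisk/ollama
      ln -s /mnt/bigdisk/ollama ~/.ollama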

  • by binarymax on 9/26/23, 5:38 PM

    Congrats on the launch! I'll give it a try. I've been using vLLM on Linux so far but have wanted to be able to use a ggml backend - have you done any perf comparisons?

  • by hathym on 9/27/23, 6:31 AM

    There is also https://faraday.dev/

  • by aglazer on 9/26/23, 9:58 PM

    Ollama is fantastic. Thanks for building it!

  • by politelemon on 9/26/23, 6:43 PM

    How is WSL2 able to work with GPUs?

  • by agilob on 9/26/23, 6:17 PM

    On Nvidia GPU*