from Hacker News

Guide to running Llama 2 locally

by bfirsh on 7/25/23, 4:58 PM with 170 comments

  • by shortrounddev2 on 7/26/23, 12:46 AM

    For my fellow Windows shills, here's how you actually build it on Windows:

    Preliminary steps:

    1. (For Nvidia GPU users) Install cuda toolkit https://developer.nvidia.com/cuda-downloads

    2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...

    In Windows Terminal with Powershell:

        git clone https://github.com/ggerganov/llama.cpp
        cd llama.cpp
        mkdir build
        cd build
        cmake .. -DLLAMA_CUBLAS=ON
        cmake --build . --config Release
        cd bin/Release
        mkdir models
        mv Folder\Where\You\Downloaded\The\Model .\models
        .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
    
    `-DLLAMA_CUBLAS=ON` builds with CUDA support (via cuBLAS)

    `2> $null` redirects the debug messages printed to stderr to $null so they don't spam your terminal

    Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:

        function llama {
            .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
        }
    
    Adjust your paths as necessary. It has a tendency to talk to itself.
  • by jawerty on 7/26/23, 12:20 AM

    Some of you may have seen this, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.

    In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU (a rough sketch of that kind of setup follows below).

    Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
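
    A minimal sketch of that kind of QLoRA-style setup with PEFT, assuming the Hugging Face stack; the model ID and hyperparameters here are illustrative, not the exact code from the stream:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        model_id = "meta-llama/Llama-2-7b-hf"  # assumes you've accepted Meta's license on HF

        # Load the frozen base model in 4-bit to keep the memory footprint small.
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb_config, device_map="auto"
        )
        model = prepare_model_for_kbit_training(model)

        # Attach small trainable LoRA adapters instead of updating all the weights.
        lora_config = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()  # typically well under 1% of the parameters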

  • by andreyk on 7/25/23, 7:05 PM

    This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android)

    Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
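
    For reference, a minimal sketch of querying a running text-generation-inference server from Python; the port and generation parameters are assumptions, and it presumes you've already started the server (e.g. via its Docker image) with a Llama 2 model:

        import requests

        # Ask the TGI server's /generate endpoint for a completion.
        resp = requests.post(
            "http://127.0.0.1:8080/generate",
            json={
                "inputs": "Explain llamas to a five-year-old.",
                "parameters": {"max_new_tokens": 200, "temperature": 0.7},
            },
        )
        print(resp.json()["generated_text"])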

  • by krychu on 7/25/23, 10:48 PM

    Self-plug. Here’s a fork of the original llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

    https://github.com/krychu/llama

    It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
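
    The CPU/MPS selection boils down to the usual PyTorch device check; this is a sketch of the general pattern, not the fork's exact code:

        import torch

        # Prefer the Apple-silicon GPU (MPS backend) when available, else fall back to CPU.
        device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
        x = torch.randn(4, 4, device=device)  # model weights/tensors get allocated here
        print(device, x.device)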

  • by thisisit on 7/25/23, 9:02 PM

    The easiest way I found was to use GPT4All. Just download and install, grab a GGML version of Llama 2, and copy it to the models directory in the installation folder. Fire up GPT4All and run.
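
    If you'd rather script it than use the GUI, the gpt4all Python bindings follow the same pattern; the model filename below is an assumption and just needs to match a GGML file the bindings can find:

        from gpt4all import GPT4All

        # Assumes the GGML file is discoverable by the bindings (or pass model_path="...").
        model = GPT4All("llama-2-13b-chat.ggmlv3.q4_0.bin")
        print(model.generate("Hello, how are you, llama?", max_tokens=200))
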
  • by rootusrootus on 7/25/23, 7:08 PM

    For most people who just want to play around and are using MacOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
  • by Der_Einzige on 7/25/23, 8:36 PM

    The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
  • by aledalgrande on 7/26/23, 1:21 AM

    Don't remember if the grammar support has been merged into llama.cpp yet, but it would be the first step toward having Llama + Stable Diffusion locally output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them, though, to build e.g. a webpage.
  • by guy98238710 on 7/25/23, 9:35 PM

    > curl -L "https://replicate.fyi/install-llama-cpp" | bash

    Seriously? Pipe a script from someone's website directly to bash?

  • by jossclimb on 7/26/23, 6:28 AM

    Seems to be a better guide here (without the risky curl):

    https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...

  • by ericHosick on 7/26/23, 3:21 AM

    The LLM is impressive (llama2:13b) but appears to be heavily restricted in what you are allowed to do with it.

    I tried to get it to generate a JSON object about the movie The Matrix and the model refused.

  • by oaththrowaway on 7/25/23, 8:53 PM

    Off topic: is there a way to use one of these LLMs to ingest data from a SQLite database and then ask it questions about that data?
  • by maxlin on 7/25/23, 10:22 PM

    I might be missing something. The article asks me to run a bash script on Windows.

    I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?

    I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.

  • by nonethewiser on 7/26/23, 12:32 AM

    Maybe obvious to others, but the one-line install command with curl is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 Max.
  • by nravic on 7/25/23, 11:56 PM

    Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#runnin...
  • by TheAceOfHearts on 7/25/23, 10:53 PM

    How do you decide which model variant to use? There are a bunch of quant-method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.

    [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML

  • by prohobo on 7/27/23, 7:36 AM

    What peeves me is that none of the models say how much RAM/VRAM they need to run. Just list minimum specs, please!
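
    As a rough rule of thumb you can estimate it yourself from the parameter count and the quantization width, plus some headroom for the context/KV cache. The numbers below are ballpark assumptions, not published specs:

        def approx_memory_gb(params_billions, bits_per_weight, overhead_gb=1.5):
            """Very rough estimate: quantized weights plus headroom for context/KV cache."""
            weight_gb = params_billions * bits_per_weight / 8
            return weight_gb + overhead_gb

        print(approx_memory_gb(7, 4.5))    # 7B  q4_0 -> roughly 5-6 GB
        print(approx_memory_gb(13, 4.5))   # 13B q4_0 -> roughly 8-9 GB
        print(approx_memory_gb(13, 16))    # 13B fp16 -> roughly 27-28 GB
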
  • by sva_ on 7/25/23, 10:31 PM

    If you just want to do inference/mess around with the model and have a 16 GB GPU, then this [0] is enough to paste into a notebook. You need to have access to the HF models, though.

    0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
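
    The gist of that notebook approach, sketched with the transformers library; the model ID and 8-bit loading are assumptions, and it requires bitsandbytes plus access to the gated repo on HF:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: accept the license on HF first

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            load_in_8bit=True,   # ~7 GB of VRAM for the 7B model, well within 16 GB
            device_map="auto",
        )

        inputs = tokenizer("Tell me a short llama fact.", return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=100)
        print(tokenizer.decode(out[0], skip_special_tokens=True))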

  • by handelaar on 7/25/23, 9:12 PM

    Idiot question: if I have access to sentence-by-sentence, professionally translated foreign-language-to-English text in gigantic quantities, and I fed the originals as prompts and the translations as completions...

    ... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
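
    For what it's worth, the data shape described above is just prompt/completion pairs; here is a sketch of writing them out as a fine-tuning dataset, where the file name, example sentences, and prompt template are all made up:

        import json

        # Hypothetical parallel pairs of source sentences and professional translations.
        pairs = [
            ("Bonjour, comment allez-vous ?", "Hello, how are you?"),
            ("Le lama traverse la rue.", "The llama crosses the street."),
        ]

        with open("translation_finetune.jsonl", "w", encoding="utf-8") as f:
            for source, translation in pairs:
                record = {
                    "prompt": f"Translate to English:\n{source}\n\nTranslation:",
                    "completion": f" {translation}",
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")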

  • by alvincodes on 7/26/23, 12:10 AM

    I appreciate their honesty when it's in their interest that people use their API rather than run it locally.
  • by nomand on 7/25/23, 8:31 PM

    Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
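
    Most local setups don't do this automatically; the usual trick is to keep the transcript yourself and prepend it to each new prompt, subject to the context window. A sketch of the idea, where `run_model` is a hypothetical stand-in for whatever backend you call:

        # Keep a running transcript and feed it back in on every turn.
        history = []

        def chat(user_message, run_model):
            history.append(f"User: {user_message}")
            prompt = "\n".join(history) + "\nAssistant:"
            reply = run_model(prompt)  # e.g. a call into your local llama.cpp bindings
            history.append(f"Assistant: {reply}")
            return reply
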
  • by synaesthesisx on 7/25/23, 10:15 PM

    This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
  • by boffinAudio on 7/26/23, 2:10 PM

    I need some hand-holding... I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?
  • by RicoElectrico on 7/25/23, 10:45 PM

        curl -L "https://replicate.fyi/windows-install-llama-cpp"
    
    ... returns 404 Not Found
  • by theLiminator on 7/26/23, 12:31 AM

    Is it possible to do hybrid inference if I have a 24 GB card with the 70B model? I.e., offload some of it to my RAM?
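
    llama.cpp's CUDA build can offload only part of the layers to the GPU and run the rest from system RAM on the CPU. Through the llama-cpp-python bindings that looks roughly like this; the model path and layer count are assumptions to tune for your card:

        from llama_cpp import Llama

        # Offload as many layers as fit in VRAM; the remainder stays in system RAM.
        llm = Llama(
            model_path="./models/llama-2-70b-chat.ggmlv3.q4_0.bin",
            n_gpu_layers=40,  # lower this if you run out of VRAM
            n_ctx=2048,
        )
        print(llm("Q: What is a llama? A:", max_tokens=64)["choices"][0]["text"])
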
  • by amelius on 7/26/23, 10:38 AM

    As someone with too little spare time, I'm curious: what are people using this for, other than research?
  • by technological on 7/26/23, 5:35 AM

    Did anyone build a PC for running these models, and which build do you recommend?
  • by TastyAmphibian on 7/27/23, 1:38 PM

    I'm still curious about the hype behind Llama 2.
  • by politelemon on 7/25/23, 10:33 PM

    Llama.cpp can run on Android too.