from Hacker News

Guide to running Llama 2 locally

by bfirsh on 7/25/23, 4:58 PM with 170 comments

  • by shortrounddev2 on 7/26/23, 12:46 AM

    For my fellow Windows shills, here's how you actually build it on Windows:

    Preliminary steps:

    1. (For Nvidia GPU users) Install cuda toolkit https://developer.nvidia.com/cuda-downloads

    2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...

    In Windows Terminal with Powershell:

        git clone https://github.com/ggerganov/llama.cpp
        cd llama.cpp
        mkdir build
        cd build
        cmake .. -DLLAMA_CUBLAS=ON
        cmake --build . --config Release
        cd bin/Release
        mkdir models
        mv Folder\Where\You\Downloaded\The\Model .\models
        .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
    
    `-DLLAMA_CUBLAS=ON` builds with CUDA support (via cuBLAS)

    `2> $null` redirects the debug messages printed to stderr to $null so they don't spam your terminal

    Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:

        function llama {
            .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
        }
    
    Adjust your paths as necessary. It has a tendency to talk to itself.
  • by jawerty on 7/26/23, 12:20 AM

    Some of you may have seen this, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.

    In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU (a rough sketch of that kind of setup follows below).

    Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
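
    A minimal sketch of that kind of QLoRA-style setup with PEFT, assuming the Hugging Face stack; the model ID and hyperparameters here are illustrative, not the exact code from the stream:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        model_id = "meta-llama/Llama-2-7b-hf"  # assumes you've accepted Meta's license on HF

        # Load the frozen base model in 4-bit to keep the memory footprint small.
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb_config, device_map="auto"
        )
        model = prepare_model_for_kbit_training(model)

        # Attach small trainable LoRA adapters instead of updating all the weights.
        lora_config = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()  # typically well under 1% of the parameters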

  • by andreyk on 7/25/23, 7:05 PM

    This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android)

    Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
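
    For reference, a minimal sketch of querying a running text-generation-inference server from Python; the port and generation parameters are assumptions, and it presumes you've already started the server (e.g. via its Docker image) with a Llama 2 model:

        import requests

        # Ask the TGI server's /generate endpoint for a completion.
        resp = requests.post(
            "http://127.0.0.1:8080/generate",
            json={
                "inputs": "Explain llamas to a five-year-old.",
                "parameters": {"max_new_tokens": 200, "temperature": 0.7},
            },
        )
        print(resp.json()["generated_text"])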

  • by krychu on 7/25/23, 10:48 PM

    Self-plug. Here’s a fork of the original llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

    https://github.com/krychu/llama

    It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
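
    The CPU/MPS selection boils down to the usual PyTorch device check; this is a sketch of the general pattern, not the fork's exact code:

        import torch

        # Prefer the Apple-silicon GPU (MPS backend) when available, else fall back to CPU.
        device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
        x = torch.randn(4, 4, device=device)  # model weights/tensors get allocated here
        print(device, x.device)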

  • by thisisit on 7/25/23, 9:02 PM

    The easiest way I found was to use GPT4All. Just download and install, grab a GGML version of Llama 2, and copy it to the models directory in the installation folder. Fire up GPT4All and run.
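
    If you'd rather script it than use the GUI, the gpt4all Python bindings follow the same pattern; the model filename below is an assumption and just needs to match a GGML file the bindings can find:

        from gpt4all import GPT4All

        # Assumes the GGML file is discoverable by the bindings (or pass model_path="...").
        model = GPT4All("llama-2-13b-chat.ggmlv3.q4_0.bin")
        print(model.generate("Hello, how are you, llama?", max_tokens=200))
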
  • by rootusrootus on 7/25/23, 7:08 PM

    For most people who just want to play around and are using MacOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
  • by Der_Einzige on 7/25/23, 8:36 PM

    The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
  • by aledalgrande on 7/26/23, 1:21 AM

    Don't remember if the grammar support has been merged into llama.cpp yet, but it would be the first step toward having Llama + Stable Diffusion locally output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them, though, to build e.g. a webpage.
  • by guy98238710 on 7/25/23, 9:35 PM

    > curl -L "https://replicate.fyi/install-llama-cpp" | bash

    Seriously? Pipe a script from someone's website directly to bash?

  • by jossclimb on 7/26/23, 6:28 AM

    Seems to be a better guide here (without the risky curl):

    https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...

  • by ericHosick on 7/26/23, 3:21 AM

    The LLM is impressive (llama2:13b) but appears to be heavily restricted in what you are allowed to do with it.

    I tried to get it to generate a JSON object about the movie The Matrix and the model refused.

  • by oaththrowaway on 7/25/23, 8:53 PM

    Off topic: is there a way to use one of these LLMs to ingest data from a SQLite database and then ask it questions about that data?
  • by maxlin on 7/25/23, 10:22 PM

    I might be missing something. The article asks me to run a bash script on Windows.

    I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?

    I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.

  • by nonethewiser on 7/26/23, 12:32 AM

    Maybe obvious to others, but the one-line install command with curl is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 Max.
  • by nravic on 7/25/23, 11:56 PM

    Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#runnin...
  • by TheAceOfHearts on 7/25/23, 10:53 PM

    How do you decide which model variant to use? There are a bunch of quant-method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.

    [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML

  • by prohobo on 7/27/23, 7:36 AM

    What peeves me is that none of the models say how much RAM/VRAM they need to run. Just list minimum specs, please!
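
    As a rough rule of thumb you can estimate it yourself from the parameter count and the quantization width, plus some headroom for the context/KV cache. The numbers below are ballpark assumptions, not published specs:

        def approx_memory_gb(params_billions, bits_per_weight, overhead_gb=1.5):
            """Very rough estimate: quantized weights plus headroom for context/KV cache."""
            weight_gb = params_billions * bits_per_weight / 8
            return weight_gb + overhead_gb

        print(approx_memory_gb(7, 4.5))    # 7B  q4_0 -> roughly 5-6 GB
        print(approx_memory_gb(13, 4.5))   # 13B q4_0 -> roughly 8-9 GB
        print(approx_memory_gb(13, 16))    # 13B fp16 -> roughly 27-28 GB
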
  • by sva_ on 7/25/23, 10:31 PM

    If you just want to do inference/mess around with the model and have a 16 GB GPU, then this [0] is enough to paste into a notebook. You need to have access to the HF models, though.

    0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
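
    The gist of that notebook approach, sketched with the transformers library; the model ID and 8-bit loading are assumptions, and it requires bitsandbytes plus access to the gated repo on HF:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: accept the license on HF first

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            load_in_8bit=True,   # ~7 GB of VRAM for the 7B model, well within 16 GB
            device_map="auto",
        )

        inputs = tokenizer("Tell me a short llama fact.", return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=100)
        print(tokenizer.decode(out[0], skip_special_tokens=True))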

  • by handelaar on 7/25/23, 9:12 PM

    Idiot question: if I have access to sentence-by-sentence, professionally translated foreign-language-to-English text in gigantic quantities, and I fed the originals as prompts and the translations as completions...

    ... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
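
    For what it's worth, the data shape described above is just prompt/completion pairs; here is a sketch of writing them out as a fine-tuning dataset, where the file name, example sentences, and prompt template are all made up:

        import json

        # Hypothetical parallel pairs of source sentences and professional translations.
        pairs = [
            ("Bonjour, comment allez-vous ?", "Hello, how are you?"),
            ("Le lama traverse la rue.", "The llama crosses the street."),
        ]

        with open("translation_finetune.jsonl", "w", encoding="utf-8") as f:
            for source, translation in pairs:
                record = {
                    "prompt": f"Translate to English:\n{source}\n\nTranslation:",
                    "completion": f" {translation}",
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")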

  • by alvincodes on 7/26/23, 12:10 AM

    I appreciate their honesty when it's in their interest that people use their API rather than run it locally.
  • by nomand on 7/25/23, 8:31 PM

    Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
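
    Most local setups don't do this automatically; the usual trick is to keep the transcript yourself and prepend it to each new prompt, subject to the context window. A sketch of the idea, where `run_model` is a hypothetical stand-in for whatever backend you call:

        # Keep a running transcript and feed it back in on every turn.
        history = []

        def chat(user_message, run_model):
            history.append(f"User: {user_message}")
            prompt = "\n".join(history) + "\nAssistant:"
            reply = run_model(prompt)  # e.g. a call into your local llama.cpp bindings
            history.append(f"Assistant: {reply}")
            return reply
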
  • by synaesthesisx on 7/25/23, 10:15 PM

    This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
  • by boffinAudio on 7/26/23, 2:10 PM

    I need some hand-holding... I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?
  • by RicoElectrico on 7/25/23, 10:45 PM

        curl -L "https://replicate.fyi/windows-install-llama-cpp"
    
    ... returns 404 Not Found
  • by theLiminator on 7/26/23, 12:31 AM

    Is it possible to do hybrid inference if I have a 24 GB card with the 70B model? I.e., offload some of it to my RAM?
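
    llama.cpp's CUDA build can offload only part of the layers to the GPU and run the rest from system RAM on the CPU. Through the llama-cpp-python bindings that looks roughly like this; the model path and layer count are assumptions to tune for your card:

        from llama_cpp import Llama

        # Offload as many layers as fit in VRAM; the remainder stays in system RAM.
        llm = Llama(
            model_path="./models/llama-2-70b-chat.ggmlv3.q4_0.bin",
            n_gpu_layers=40,  # lower this if you run out of VRAM
            n_ctx=2048,
        )
        print(llm("Q: What is a llama? A:", max_tokens=64)["choices"][0]["text"])
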
  • by amelius on 7/26/23, 10:38 AM

    As someone with too little spare time, I'm curious: what are people using this for, other than research?
  • by technological on 7/26/23, 5:35 AM

    Did anyone build a PC for running these models, and which build do you recommend?
  • by TastyAmphibian on 7/27/23, 1:38 PM

    I'm still curious about the hype behind Llama 2.
  • by politelemon on 7/25/23, 10:33 PM

    Llama.cpp can run on Android too.