by bfirsh on 7/25/23, 4:58 PM with 170 comments
by shortrounddev2 on 7/26/23, 12:46 AM
Before steps:
1. (For Nvidia GPU users) Install cuda toolkit https://developer.nvidia.com/cuda-downloads
2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...
In Windows Terminal with Powershell:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd bin/Release
mkdir models
mv Folder\Where\You\Downloaded\The\Model .\models
.\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
`-DLLAMA_CUBLAS=ON` builds with CUDA (cuBLAS) support. `2> $null` redirects the debug messages printed to stderr to a null file so they don't spam your terminal.
Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:
function llama {
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
}
Adjust your paths as necessary. It has a tendency to talk to itself.
by jawerty on 7/26/23, 12:20 AM
In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU.
Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
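For anyone curious what that looks like in code, here is a minimal QLoRA-style sketch with transformers + peft + bitsandbytes; the model id, LoRA targets, and hyperparameters below are my assumptions, not necessarily what the video uses:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    base = "meta-llama/Llama-2-13b-hf"  # assumed model id

    # Load the base model in 4-bit so the 13B weights fit in roughly 13 GB of VRAM
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

    # Train only small LoRA adapters on top of the frozen, quantized base model
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the parameters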
by andreyk on 7/25/23, 7:05 PM
Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I am sure there are other things that could be covered.
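For reference, once a text-generation-inference server is running (launched per that repo's README), it exposes a simple HTTP API. A hedged sketch of calling it; the port and generation parameters are assumptions:

    import requests

    # Assumes a text-generation-inference server is already running locally on port 8080
    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": "Explain what a llama is in one sentence.",
            "parameters": {"max_new_tokens": 64, "temperature": 0.7},
        },
    )
    print(resp.json()["generated_text"])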
by krychu on 7/25/23, 10:48 PM
https://github.com/krychu/llama
It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
by thisisit on 7/25/23, 9:02 PM
by rootusrootus on 7/25/23, 7:08 PM
by Der_Einzige on 7/25/23, 8:36 PM
by aledalgrande on 7/26/23, 1:21 AM
by guy98238710 on 7/25/23, 9:35 PM
Seriously? Pipe a script from someone's website directly to bash?
by jossclimb on 7/26/23, 6:28 AM
https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...
by ericHosick on 7/26/23, 3:21 AM
I tried to get it to generate a JSON object about the movie The Matrix and the model refused.
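One thing worth checking before blaming the model: the chat-tuned Llama 2 weights are quite sensitive to the prompt template. Here's a rough sketch of building the documented [INST]/<<SYS>> format with a system prompt that asks for JSON only; whether it actually fixes this particular refusal is untested:

    system = "You are a helpful assistant. Respond only with a valid JSON object, no commentary."
    user = "Give me a JSON object describing the movie The Matrix (title, year, director, main_cast)."

    # Llama 2 chat prompt template as documented by Meta / Hugging Face
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    print(prompt)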
by oaththrowaway on 7/25/23, 8:53 PM
by maxlin on 7/25/23, 10:22 PM
I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?
I'm currently paying $15 a month for a personal translation/summarizer project's ChatGPT queries. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super affected by delay, so lower token rates might work too.
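For a pipeline like that, one option is llama.cpp's bundled server example: build it the same way as main.exe, start it with your model, and point the translation/summarizer code at it instead of the OpenAI API. A sketch of the client side, assuming the server is running on its default port 8080; the endpoint and field names follow the server example's README at the time, so double-check against your build:

    import requests

    def summarize(text: str) -> str:
        # llama.cpp's server example exposes a /completion endpoint on localhost
        resp = requests.post(
            "http://127.0.0.1:8080/completion",
            json={"prompt": f"Summarize the following text:\n\n{text}\n\nSummary:",
                  "n_predict": 200},
        )
        return resp.json()["content"]

    print(summarize("Llama 2 is a family of open large language models released by Meta."))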
by nonethewiser on 7/26/23, 12:32 AM
by nravic on 7/25/23, 11:56 PM
by TheAceOfHearts on 7/25/23, 10:53 PM
by prohobo on 7/27/23, 7:36 AM
by sva_ on 7/25/23, 10:31 PM
0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
by handelaar on 7/25/23, 9:12 PM
... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
by alvincodes on 7/26/23, 12:10 AM
by nomand on 7/25/23, 8:31 PM
by synaesthesisx on 7/25/23, 10:15 PM
by boffinAudio on 7/26/23, 2:10 PM
by RicoElectrico on 7/25/23, 10:45 PM
curl -L "https://replicate.fyi/windows-install-llama-cpp"
... returns 404 Not Found
by theLiminator on 7/26/23, 12:31 AM
by amelius on 7/26/23, 10:38 AM
by technological on 7/26/23, 5:35 AM
by TastyAmphibian on 7/27/23, 1:38 PM
by politelemon on 7/25/23, 10:33 PM