from Hacker News

Run Llama 13B with a 6GB graphics card

by rain1 on 5/14/23, 12:35 PM with 266 comments

  • by rahimnathwani on 5/14/23, 5:31 PM

    On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.

    My system:

    GPU: NVidia RTX 2070S (8GB VRAM)

    CPU: AMD Ryzen 5 3600 (16GB RAM)

    Here's the performance difference I see:

    CPU only (./main -t 12)

      llama_print_timings:        load time = 15459.43 ms
      llama_print_timings:      sample time =    23.64 ms /    38 runs   (    0.62 ms per token)
      llama_print_timings: prompt eval time =  9338.10 ms /   356 tokens (   26.23 ms per token)
      llama_print_timings:        eval time = 31700.73 ms /    37 runs   (  856.78 ms per token)
      llama_print_timings:       total time = 47192.68 ms
    
    
    GPU (./main -t 12 -ngl 22)

      llama_print_timings:        load time = 10285.15 ms
      llama_print_timings:      sample time =    21.60 ms /    35 runs   (    0.62 ms per token)
      llama_print_timings: prompt eval time =  3889.65 ms /   356 tokens (   10.93 ms per token)
      llama_print_timings:        eval time =  8126.90 ms /    34 runs   (  239.03 ms per token)
      llama_print_timings:       total time = 18441.22 ms
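
    For anyone driving this through the llama-cpp-python bindings rather than ./main, the same offload is exposed as n_gpu_layers; a minimal sketch, assuming the bindings were built with cuBLAS (as in the PSA further down) and an illustrative model path:

      from llama_cpp import Llama

      # n_gpu_layers mirrors ./main's -ngl flag: how many transformer layers
      # to keep in VRAM; the rest are evaluated on the CPU.
      llm = Llama(model_path="./models/llama-13b-q4_0.bin", n_gpu_layers=22)
      out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
      print(out["choices"][0]["text"])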
  • by naillo on 5/14/23, 1:31 PM

    This is cool, but are people actually getting stuff done with these models? I'm enthusiastic about their potential too, but after playing with one for a day I'm at a loss for what to use it for at this point.
  • by holoduke on 5/14/23, 5:20 PM

    Why doesn't AMD or Intel release a medium-performance GPU with at least 128GB of memory at a good consumer price? These models mainly need lots of memory to do a single pass over the weights; throughput could be a bit slower. An Nvidia 1080 with 256GB of memory would run all these models fast, right? Or am I forgetting something here?
  • by peatmoss on 5/14/23, 2:09 PM

    From skimming, it looks like this approach requires CUDA and thus is Nvidia only.

    Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?

  • by marcopicentini on 5/14/23, 2:08 PM

    What do you use to host these models (like Vicuna, Dolly, etc.) on your own server and expose them via an HTTP REST API? Is there a Heroku-like service for LLMs?

    I am looking for an open source model to do text summarization. OpenAI is too expensive for my use case because I need to pass lots of tokens.
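
    One low-lift option, assuming a quantized GGML model plus llama-cpp-python, is to wrap the model in a small FastAPI app and put that behind your usual reverse proxy; a minimal sketch (the model path and route name are illustrative):

      from fastapi import FastAPI
      from pydantic import BaseModel
      from llama_cpp import Llama

      app = FastAPI()
      # Illustrative path; point it at whichever quantized GGML model you use.
      llm = Llama(model_path="./models/vicuna-7b-q4_0.bin", n_ctx=2048)

      class SummarizeRequest(BaseModel):
          text: str

      @app.post("/summarize")
      def summarize(req: SummarizeRequest):
          prompt = f"Summarize the following text:\n\n{req.text}\n\nSummary:"
          out = llm(prompt, max_tokens=256, temperature=0.2)
          return {"summary": out["choices"][0]["text"].strip()}

      # Run with: uvicorn server:app --port 8000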

  • by syntaxing on 5/14/23, 5:14 PM

    This update is pretty exciting; I'm gonna try running a large model (65B) with a 3090. I have run a ton of local LLMs, but the hardest part is finding out the prompt structure. I wish there were some sort of centralized database that explains it.
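
    For reference, the two most common formats (the Alpaca and Vicuna v1.1 conventions) are small enough to keep in a dict, though fine-tunes vary, so the model card is still the authority:

      # Common instruction-tuned prompt formats (verify against each model card).
      PROMPT_TEMPLATES = {
          "alpaca": (
              "Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n\n"
              "### Instruction:\n{instruction}\n\n### Response:\n"
          ),
          "vicuna": (
              "A chat between a curious user and an artificial intelligence "
              "assistant.\n\nUSER: {instruction}\nASSISTANT:"
          ),
      }

      prompt = PROMPT_TEMPLATES["vicuna"].format(instruction="Explain 4-bit quantization.")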
  • by tikkun on 5/14/23, 1:59 PM

  • by Ambix on 5/15/23, 10:34 AM

    No need to convert models; 4-bit LLaMA versions for GGML v2 are available here:

    https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main

  • by mozillas on 5/14/23, 2:31 PM

    I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

    Worked OK for me with the default context size; 2048, like you see in most examples, was too slow for my taste.
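
    If you load it through llama-cpp-python instead of the llama.cpp CLI, the equivalent knob is n_ctx; a sketch, with an illustrative path and assuming 512 as the default window:

      from llama_cpp import Llama

      # A smaller context window means less memory and faster prompt evaluation
      # on an 8GB machine; 2048 is the model's full window.
      llm = Llama(model_path="./models/ggml-vic7b-q4_0.bin", n_ctx=512)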

  • by yawnxyz on 5/15/23, 3:43 AM

    Could someone please share a good resource for building a machine from scratch, for doing simple-ish training and running open-source models like Llama? I'd love to run some of these and even train them from scratch, and I'd love to use that as an excuse to drop $5k on a new machine...

    Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...

  • by rahimnathwani on 5/15/23, 5:16 AM

    PSA:

    If you're using oobabooga/text-generation-webui then you need to:

    1. Re-install llama-cpp-python with support for CUBLAS:

      CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall
    
    2. Launch the web UI with the --n-gpu-layers flag, e.g.

      python server.py --model gpt4-x-vicuna-13B.ggml.q5_1.bin --n-gpu-layers 24
  • by sroussey on 5/14/23, 9:21 PM

    I wish this used the WebGPU C++ library instead; then it could be used on any GPU hardware.
  • by hhh on 5/14/23, 1:44 PM

    The instructions are a bit rough. The Micromamba thing doesn't work, and they don't say how to install it… you have to clone llama.cpp too.
  • by tarr11 on 5/14/23, 4:07 PM

    What is the state of the art on evaluating the accuracy of these models? Is there some equivalent to an “end to end test”?

    It feels somewhat recursive since the input and output are natural language and so you would need another LLM to evaluate whether the model answered a prompt correctly.

  • by bitL on 5/14/23, 1:37 PM

    How about reloading parts of the model as the inference progresses instead of splitting it into GPU/CPU parts? Reloading would be memory-limited to the largest intermediate tensor cut.
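
    That idea is essentially layer streaming; a toy sketch of it, assuming PyTorch and a CUDA device (the per-layer host-to-device copy is what usually makes this slower than a fixed CPU/GPU split):

      import torch
      import torch.nn as nn

      # Toy stand-in for a transformer: a stack of large linear layers held in CPU RAM.
      layers = [nn.Linear(4096, 4096) for _ in range(32)]

      def forward_streaming(x: torch.Tensor) -> torch.Tensor:
          # Stream one layer at a time through the GPU: upload weights, run, evict.
          # Peak VRAM is roughly one layer's weights plus the activation tensor,
          # but every pass re-pays the PCIe transfer for each layer.
          x = x.to("cuda")
          for layer in layers:
              layer.to("cuda")
              x = layer(x)
              layer.to("cpu")
          return x.cpu()

      out = forward_streaming(torch.randn(1, 4096))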
  • by akulbe on 5/15/23, 12:00 AM

    I've only ever been a consumer of ChatGPT/Bard. Never set up any LLM stuff locally, but the idea is appealing to me.

    I have a ThinkStation P620 w/ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there) and there's 512GB of RAM on the box.

    Does this need to be bare metal, or can it run in a VM?

    I'm currently running RHEL 9.2 w/KVM (as a VM host) with light usage so far.

  • by qwertox on 5/14/23, 10:15 PM

    If I really want to do some playing around in this area, would it be good to get an RTX 4000 SFF, which has 20 GB of VRAM but is a low-power card (something I want since it would be running 24/7 and energy prices are pretty bad in Germany)? Or would it make more sense to buy an Apple product with an M2 chip, which apparently is good for these tasks because it shares memory between CPU and GPU?
  • by ranger_danger on 5/14/23, 6:38 PM

    Why can't these models run on the GPU while using CPU RAM for the storage? That way people with performant-but-memory-starved GPUs could still utilize the GPU's better compute while having enough RAM to hold the model. I know it is possible to back GPU objects with system RAM.
  • by anshumankmr on 5/14/23, 3:52 PM

    How long before it runs on a 4 gig card?
  • by MuffinFlavored on 5/15/23, 12:27 AM

    How many "B" (billions of parameters) is ChatGPT GPT-4?
  • by BlackLotus89 on 5/15/23, 10:28 AM

    This only uses llama, correct? So the output should be the same as if you were only using llama.cpp. Am I the only one who doesn't get nearly the same quality of output from a quantized model compared to running on a GPU? Some models I tried get astounding results when running on a GPU, but produce only "garbage" when running on a CPU. Even when not quantized down to 4-bit, llama.cpp just doesn't compare for me. Am I alone with this?
  • by dclowd9901 on 5/14/23, 4:49 PM

    Has anyone tried running encryption algorithms through these models? I wonder if it could be trained to decrypt.
  • by dinobones on 5/14/23, 6:20 PM

    What is HN’s fascination with these toy models that produce low quality, completely unusable output?

    Is there a use case for them I’m missing?

    Additionally, don’t they all have fairly restrictive licenses?

  • by blendergeek on 5/14/23, 11:19 PM

    Is there a way to run any of these with only 4GB of VRAM?
  • by alg_fun on 5/14/23, 9:11 PM

    Wouldn't it be faster to use RAM as swap for VRAM?
  • by avereveard on 5/14/23, 3:15 PM

    Or just download oobabooga/text-generation-webui and any prequantized variant, and be done.
  • by s_dev on 5/14/23, 1:32 PM

    [deleted]
  • by ACV001 on 5/14/23, 7:14 PM

    The future is this: these models will run on smaller and smaller hardware, eventually on your phone, watch, or embedded devices. The revolution is here and is inevitable, similar to how computers evolved. We are still lucky that these models have no consciousness. Once they gain consciousness, that will mark the appearance of a new species (superior to us, if anything). Also, luckily, they have no physical bodies and cannot replicate, so far...