by andrewon on 8/6/23, 11:40 PM with 82 comments
by visarga on 8/7/23, 6:28 AM
The enthusiasm around it reminds me of JavaScript framework wars of 10 years ago - tons of people innovating and debating approaches, lots of projects popping up, so much energy!
by kordlessagain on 8/7/23, 1:38 PM
To run a `vllm`-backed Llama 2 7B model[1], start a Debian 11 spot instance with one Nvidia L4 GPU on a g2-standard-8 with 100 GB of SSD disk (ignoring the advice to use a CUDA installer image):
sudo apt-get update -y
sudo apt-get install build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
sudo apt-get install python3-pip -y
sudo pip install --upgrade huggingface_hub
# skip using token as git credential
huggingface-cli login  # paste token from HF[2] for Meta model access
sudo pip install vllm # ~8 minutes
Then, edit the test code for a 7B Llama 2 model (paste into llama.py):
from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("The capital of Brazil is called")
print(output)
Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.
[1] https://vllm.readthedocs.io/en/latest/models/supported_model...
[2] https://huggingface.co/settings/tokens
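For reference, llm.generate returns a list of RequestOutput objects rather than a plain string, so a slightly fuller llama.py sketch (same model, with illustrative SamplingParams; the sampling settings are my assumption, not part of the original recipe) might look like:
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, max_tokens=64)  # illustrative sampling settings
outputs = llm.generate(["The capital of Brazil is called"], params)
for out in outputs:
    # each RequestOutput carries the prompt and one or more generated completions
    print(out.prompt, out.outputs[0].text)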
by jurmous on 8/7/23, 7:14 AM
See table 10 (page 22) of the whitepaper for the numbers: https://ai.meta.com/research/publications/llama-2-open-found...
Are there other downloadable models which can be used in a multilingual environment that people here are aware of?
by jmorgan on 8/7/23, 11:15 AM
More projects in this space:
- llama.cpp, a fast, low-level runner (with bindings in several languages; a minimal example via its Python bindings is sketched below)
- llm by Simon Willison, which supports different backends and has a really elegant CLI interface
- The MLC.ai and Apache TVM projects
Previous HN discussion of an article by the great folks at Replicate that might be helpful: https://news.ycombinator.com/item?id=36865495
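As a rough illustration of the llama.cpp route, here is a minimal sketch using its llama-cpp-python bindings (the model path is hypothetical; any locally downloaded, quantized Llama 2 file should work):
from llama_cpp import Llama

# hypothetical path to a quantized Llama 2 model downloaded beforehand
llm = Llama(model_path="./models/llama-2-7b.ggmlv3.q4_0.bin")
out = llm("The capital of Brazil is called", max_tokens=32)  # returns an OpenAI-style completion dict
print(out["choices"][0]["text"])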
by simonw on 8/7/23, 6:12 AM
by jawerty on 8/7/23, 2:06 PM
Here’s the stream - https://www.youtube.com/live/LitybCiLhSc?feature=share
One is with LoRA and the other QLoRA; I also do a breakdown of each fine-tuning method. I wanted to make these since I myself have had issues running LLMs locally, and Colab is the cheapest GPU I can find, haha.
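For anyone who hasn't seen the two methods before, a minimal LoRA setup with Hugging Face peft looks roughly like the sketch below (base model and hyperparameters are illustrative, not taken from the stream); QLoRA is the same idea with the base model first loaded in 4-bit via bitsandbytes before attaching the adapters:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# illustrative base model; gated behind Meta's access request on HF
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling factor applied to the adapters
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter weights are trainable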
by SOLAR_FIELDS on 8/7/23, 12:58 PM
Unfortunately, with all of the hype, it seems that unless you have a REALLY beefy machine, the better 70B model feels out of reach for most to run locally, leaving the 7B and 13B as the only viable options outside of some quantization trickery. Or am I wrong about that?
I want to focus more on larger context windows, since RAG seems to have a lot of promise, so the 7B with a giant context window seems like the better path to explore rather than focusing on getting the 70B to work locally.
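On the quantization point: one common trick, sketched here with transformers plus bitsandbytes (the model name and settings are assumptions, and even in 4-bit the 70B still wants on the order of 35-40 GB of GPU memory for the weights alone), is to load the model in 4-bit:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # assumes HF access has been granted
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # spread layers across available GPUs (and CPU if needed)
)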
by carom on 8/7/23, 7:38 AM
Running on a 3090. The 13B chat model quantized to fp8 is giving about 42 tok/s.
by growt on 8/7/23, 8:10 AM
by ktaube on 8/7/23, 7:28 AM
I've tried Inference Endpoints and Replicate, but both would cost more than just using the OpenAI offering.
by MediumOwl on 8/7/23, 11:39 AM
by brucethemoose2 on 8/7/23, 6:04 AM
...But I am also a bit out of the loop. For instance, I have not kept up with the CFG/negative prompt or grammar implementations in the UIs.
by gorenb on 8/7/23, 7:51 AM
by Manidos on 8/7/23, 11:46 AM
by bigcloud1299 on 8/8/23, 3:24 AM
by KaoruAoiShiho on 8/7/23, 2:31 PM