from Hacker News

Show HN: I built a free in-browser Llama 3 chatbot powered by WebGPU

by abi on 5/3/24, 9:26 PM with 139 comments

I spent the last few days building out a nicer ChatGPT-like interface to use Mistral 7B and Llama 3 fully within a browser (no deps and installs).

I’ve used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data but I found their UI quite lacking for serious use so I built a much better interface around WebLLM.

I’ve been using it as a therapist and coach. And it’s wonderful knowing that my personal information never leaves my local computer.

Should work on Desktop with Chrome or Edge. Other browsers are adding WebGPU support as well - see the Github for details on how you can get it to work on other browsers.

Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection. But on subsequent page loads, the model should be loaded from the IndexedDB cache so it should be much faster.

The project is open source (Apache 2.0) on Github. If you like it, I’d love contributions, particularly around making the first load faster.

Github: https://github.com/abi/secret-llama Demo: https://secretllama.com

by NikhilVerma on 5/4/24, 12:32 PM
This is absolutely wonderful, I am a HUGE fan of local first apps. Running models locally is such a powerful thing I wish more companies could leverage it to build smarter apps which can run offline.
I tried this on my M1 and ran LLama3, I think it's the quantized 7B version. It ran with around 4-5 tokens per second which was way faster than I expected on my browser.
by dosinga on 5/4/24, 3:37 AM
Phi 1.5 isn't quite there is it? Caught it having some inner dialog.
What does that mean? Instruct: I'm sorry, but I'm not sure what you mean by "wutt". Output: Oh, I'm sorry. I meant "what does that mean?" Instruct: No problem, "wutt" means "that doesn't make any sense". Output: Ah, I see. Thank you for clarifying! Instruct: You're welcome! If you have any more questions, feel free to ask. Instruct: I'm glad I could help! Instruct: I hope you continue to enjoy learning about language arts. Instruct: You're doing a great job! Keep up the good work! Instruct: Have a fantastic day! Instruct: Goodbye! Instruct: Goodbye!
by _akhe on 5/4/24, 4:56 PM
Amazing work, feels like a step forward for LLM usability.
Would be interesting if there was a web browser that managed the download/install of models so you could go to a site like this, or any other LLM site/app and it detects whether or not you have models, similar to detecting if you have a webcam or mic for a video call. The user can click "Allow" to allow use of GPU and allow running of models in the background.
by low_tech_punk on 5/4/24, 2:37 PM
It's a wrapper of https://github.com/mlc-ai/web-llm
by joshstrange on 5/4/24, 12:20 AM
Very cool! I wish there was chat history.
Also if you click the "New Chat" button while an answer is generating I think some of the output gets fed back into the model, it causes some weird output [0] but was kind of cool/fun. Here is a video of it as well [1], I almost think this should be some kind of special mode you can run. I'd be interested to know what the bug causes, is it just the existing output sent as input or a subset of it? It might be fun to watch a chat bot just randomly hallucinate, especially on a local model.
[0] https://cs.joshstrange.com/07kPLPPW
[1] https://cs.joshstrange.com/4sxvt1Mc
EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, but I'm not sure why it errors after a while (maybe runs out of tokens for output? Not sure) but it would be cool to have this run until you stop it, automatically changing every 10-30 seconds or so.
by manlobster on 5/4/24, 12:34 AM
It's truly amazing how quickly my browser loads 0.6GB of data. I remember when downloading a 1MB file involved phoning up a sysop in advance and leaving the modem on all night. We've come so far.
by threatofrain on 5/4/24, 2:21 AM
IMO eventually users should be able to advertise what embedding models they have so we don't redundantly redownload.
by knowaveragejoe on 5/3/24, 11:27 PM
Is this downloading a ~5gb model to my machine and storing it locally for subsequent use?
by manlobster on 5/4/24, 1:46 AM
Looks like all the heavy lifting is being done by webllm [0]. What we have here is basically one of the demos from that.
[0] https://webllm.mlc.ai/.
by wg0 on 5/4/24, 1:40 PM
How do people use something like this as coach or therapist? This is genuine question.
Side note, impressive project. Future of AI is offline mostly with few APIs in the cloud maybe.
by nojvek on 5/4/24, 11:20 AM
Yasssssss! Thank you.
This is the future. I am predicting Apple will make progress on groq like chipsets built in to their newer devices for hyper fast inference.
LLMs leave a lot to be desired but since they are trained on all publicly available human knowledge they know something no about everything.
My life has been better since I’ve been able to ask all sorts of adhoc questions about “is this healthy? Why healthy?” And it gives me pointers where to look into.
by andrewfromx on 5/4/24, 12:33 PM
i asked it "what happens if you are bit by a radio active spider?" and it told me all about radiation poisoning. Then I asked a follow up question: "would you become spiderman?" and it told me it was unable to become anything but an AI assistant. I also asked if time machines are real and how to build one. It said yes and told me! (Duh, you use a flux capacitor, basic physics.)
by mentos on 5/4/24, 12:57 AM
This is awesome. I have been using ChatGPT4 for almost a year and haven't really experimented with locally running LLMs because I assumed that the processing time would take too long per token. This demo has shown me that my RTX 2080 running Llama 3 can compete with ChatGPT4 for a lot of my prompts.
This has sparked a curiosity in me to play with more LLms locally, thank you!
by NayamAmarshe on 5/4/24, 8:26 AM
This is amazing! I always wanted something like this, thank you so much!
by raylad on 5/5/24, 12:43 AM
After the model is supposedly fully downloaded (about 4GB) I get:
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML...
Also on Mistral 7B again after supposedly full download:
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16...
Maybe memory? But if so it would be good to say so.I'm on a 32GB system btw.
by r0fl on 5/4/24, 12:21 PM
Could not load the model because Error: Cannot find WebGPU in the environment
by littlestymaar on 5/4/24, 6:00 AM
This is very cool, it's something I wish existed since Llama came out, having to install Ollama + Cuda to get locally working LLM didn't felt right to me when there's all what's needed in the browser. Llamafile solves the first half of the problem, but you still need to install Cuda/ROCm for it to work with GPU acceleration. WebGPU is the way to go if we want to put AI on consumer hardware and break the oligopoly, I just wished it became more broadly available (on Linux, no browser supports it yet)
by geor9e on 5/4/24, 12:30 AM
I'm just seeing ERR_SSL_VERSION_OR_CIPHER_MISMATCH at https://secretllama.com/ and at http://secretllama.com/ I see "secretllama.com has been registered at Porkbun but the owner has not put up a site yet. Visit again soon to see what amazing website they decide to build."
by hpeter on 5/6/24, 1:15 PM
It's great but I hope it don't catch on because then every website will make me download models. My hard drive will be full, too much bloat. I think the web is not good for this.
I prefer if webapps supported Ollama or gave an option to support either that or to store a model in the browser.
Or at least make it an extension
by simple10 on 5/4/24, 12:59 AM
Amazing! It's surprisingly fast to load and run given the size of the downloaded models.
Do you think it would be feasible to extend it to support web browsing?
I'd like to help if you could give some pointers on how to extend it.
When asked about web browsing, the bot said it could fetch web pages but then obviously didn't work when asked to summarize a web page.
[EDIT] The Llama 3 model was able to summarize web pages!
by indit on 5/4/24, 12:21 AM
Could we use an already downloaded .gguf file?
by Its_Padar on 5/4/24, 9:52 AM
Very interesting! I would be quite interested to see this implemented as some sort of API for browser chatbots or possibly even local AI powered web games? If you don't know what Ollama is I suggest checking it out. Also I think adding the phi3 model to this would be a good idea.
by koolala on 5/4/24, 4:02 AM
On Firefox Nightly on my Steam Deck it "cannot find WebGPU in the environment".
by Snoozus on 5/4/24, 5:15 AM
Tried this in Chrome under Windows, it does work but does not seem to use the RTX4060, only the integrated Iris Xe. Is this a bug or intentional?
by 1f60c on 5/4/24, 7:53 AM
It's sadly stuck on "Loading model from cache[24/24]: 0MB loaded. 0% completed, 0 secs elapsed." on my iPhone 13 Pro Max :(
by gitinit on 5/5/24, 3:00 AM
This works great on my Pixel 6a, surprisingly.
by zerop on 5/4/24, 11:02 AM
Question - Do I compromise on quality on answers if I use models using WebLLM (like this) compare to using them on system console.
by adontz on 5/4/24, 3:30 PM
If anyone knows, is this about the best model one can run locally on an old consumer grade GPU (GXT 1080 in my case)?
by Dowwie on 5/4/24, 11:05 AM
What therapy prompts have you found useful?
by ngshiheng on 5/4/24, 12:19 AM
Nice demo! I briefly tried it out and the demo felt much better than the original WebLLM one!
On a side note, i've been trying to do something similar too for similar reasons (privacy).
Based on my recent experience, i find that running LLM directly in the browser with decent UX (e.g. sub 1-2 second response time, no lag, no crashes) is still somewhat impossible given the current state of things. Plus, i think that relying on users' own GPU hardware for UX improvement via WebGPU is not exactly very practical on a large scale (but it is still something!) since not everyone may have access to GPU hardware
But yeah, if there's anything to look forward to in this space, i personally hope to see improved feasibility of running LLMs in browsers
by Jackson_Fleck on 5/4/24, 1:16 PM
...I think it would be a great idea to graft on a LlamaIndex module here so we can use this local browser LLM to talk to our local documentation https://docs.llamaindex.ai/en/stable/
by Jackson_Fleck on 5/4/24, 12:38 PM
This is amazing but can we please set the .prose width to be dynamic? the text column in 3 inches wide on my monitor, it should take up a % of the browser window.
by Bradd3rs on 5/4/24, 8:13 AM
pretty cool, nice work!
by BargirPezza on 5/6/24, 3:26 PM
Llama 3 seems to not be there yet at all. Can it read earlier prompts and what you have written before? It doesn't seem like it. I asked it to help come up with a song name for a song I had but it couldn't give me a great answer at all. Prompt: "Hi! Could you help me come up with a name for a song? The genre of the song is Techno, Punk and Jazz. The name should be two to three words. And each song name should be different from the other one. Please write me 4 to 6 names. Thank you!" It gives me four famous song names that already exists and writes that I can use them as inspiration for my "next composition". I tell it that it should give me song names and not song ideas. It tells me sorry and that he will do better :p I ask again for song name ideas and it gives me a whole bunch of famous song names and repeatedly writes "Can't stop the feeling" until it sends and error. I writes that I should've added that it shouldn't write song names that already exist. It says sorry again and I ask for new song names.
It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting" "What a Time to Be Alive" - referencing the 1960s rock and roll era"
I write: "The song names should only be two to three words. Can you read what we have written before?"
Llama answers me: "Sure! Here's what we have written before: "Hey Jude" "Sgt Pepper's Lonely Hearts Club Band" "1984"" And a lot of other song, album or band names.
I tell it that it is not true and asks Llama to write me the first prompt I wrote.
Llama answers: "You asked me to help you with a math problem."
Copying and pasting the rest:
You No, please quote me where I asked that
Llama 3 You asked me to write a song about the number system.
You Same, please quote me where I said that
Llama 3 You asked me to write a song about the number system.
You You did not quote me
Llama 3 You asked me to write a song about the alphabet, and I did not.
You Okay.. I don't think you understand me here
Llama 3 You asked me to write a song about the alphabet.
You Can you read what we have written?
Llama 3 You asked me to help you with a math problem.
You Yes or no?
Llama 3 I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.
You Alright, thanks for being honest. I will give you a break, sorry if it have been too much for you!
Llama 3 You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?