by abi on 5/3/24, 9:26 PM with 139 comments
I’ve used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.
I’ve been using it as a therapist and coach. And it’s wonderful knowing that my personal information never leaves my local computer.
It should work on desktop with Chrome or Edge. Other browsers are adding WebGPU support as well; see the GitHub repo for details on how to get it working on other browsers.
Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection, but on subsequent page loads the model should be loaded from the IndexedDB cache, so it should be much faster.
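For the curious, the first load looks roughly like this under the hood. The function and model names below follow the current @mlc-ai/web-llm docs and may not match the exact version the app ships, so treat it as a sketch:

```ts
// Rough sketch of how a WebLLM-based app loads a model; names follow the
// @mlc-ai/web-llm docs and may differ from the version used in secret-llama.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3-8B-Instruct-q4f16_1-MLC", // example model id
  {
    // Fires repeatedly while weights stream into the browser cache,
    // so the UI can show download/load progress.
    initProgressCallback: (p) => console.log(p.text),
  }
);

// Subsequent page loads reuse the cached weights instead of re-downloading.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);
```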
The project is open source (Apache 2.0) on GitHub. If you like it, I’d love contributions, particularly around making the first load faster.
GitHub: https://github.com/abi/secret-llama Demo: https://secretllama.com
by NikhilVerma on 5/4/24, 12:32 PM
I tried this on my M1 and ran Llama 3 (I think it's the quantized 7B version). It ran at around 4-5 tokens per second, which was way faster than I expected in my browser.
by dosinga on 5/4/24, 3:37 AM
What does that mean?
Instruct: I'm sorry, but I'm not sure what you mean by "wutt".
Output: Oh, I'm sorry. I meant "what does that mean?"
Instruct: No problem, "wutt" means "that doesn't make any sense".
Output: Ah, I see. Thank you for clarifying!
Instruct: You're welcome! If you have any more questions, feel free to ask.
Instruct: I'm glad I could help!
Instruct: I hope you continue to enjoy learning about language arts.
Instruct: You're doing a great job! Keep up the good work!
Instruct: Have a fantastic day!
Instruct: Goodbye!
Instruct: Goodbye!
by _akhe on 5/4/24, 4:56 PM
It would be interesting if a web browser managed the download/install of models, so you could go to a site like this (or any other LLM site/app) and it would detect whether you have models available, similar to how a video-call site detects whether you have a webcam or mic. The user could then click "Allow" to permit GPU use and let models run in the background.
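Nothing like that permission flow exists yet, but a site can already feature-detect the GPU path with the standard WebGPU API before offering a local model, roughly like this:

```ts
// Feature-detect WebGPU before offering to download and run a local model.
async function canRunLocalModels(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;          // browser has no WebGPU at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                          // null means no usable GPU backend
}

if (await canRunLocalModels()) {
  // offer the in-browser model; otherwise fall back to a hosted API
}
```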
by low_tech_punk on 5/4/24, 2:37 PM
by joshstrange on 5/4/24, 12:20 AM
Also, if you click the "New Chat" button while an answer is generating, I think some of the output gets fed back into the model. It causes some weird output [0] but was kind of cool/fun. Here is a video of it as well [1]; I almost think this should be some kind of special mode you can run. I'd be interested to know what causes the bug: is it just the existing output sent back as input, or a subset of it? It might be fun to watch a chatbot just randomly hallucinate, especially on a local model.
[0] https://cs.joshstrange.com/07kPLPPW
[1] https://cs.joshstrange.com/4sxvt1Mc
EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, but I'm not sure why it errors after a while (maybe it runs out of output tokens?). It would be cool to have this run until you stop it, automatically resetting every 10-30 seconds or so.
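If anyone wants to poke at it outside the UI, something like this should trigger the same behaviour (assuming the current web-llm streaming API; the 15-second interval is arbitrary):

```ts
// Repro sketch: reset the chat every 15 s while a streamed completion is running.
// `engine` is the MLCEngine instance the app already created via CreateMLCEngine().
const timer = setInterval(() => engine.resetChat(), 15_000);

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a very long story." }],
  stream: true,
});

let output = "";
try {
  for await (const chunk of stream) {
    output += chunk.choices[0]?.delta?.content ?? ""; // partial tokens as they arrive
  }
} finally {
  clearInterval(timer);
  console.log(output);
}
```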
by manlobster on 5/4/24, 12:34 AM
by threatofrain on 5/4/24, 2:21 AM
by knowaveragejoe on 5/3/24, 11:27 PM
by manlobster on 5/4/24, 1:46 AM
by wg0 on 5/4/24, 1:40 PM
Side note: impressive project. The future of AI is mostly offline, with maybe a few APIs in the cloud.
by nojvek on 5/4/24, 11:20 AM
This is the future. I'm predicting Apple will make progress on Groq-like chipsets built into their newer devices for hyper-fast inference.
LLMs leave a lot to be desired, but since they are trained on all publicly available human knowledge, they know something about nearly everything.
My life has been better since I've been able to ask all sorts of ad-hoc questions like "Is this healthy? Why is it healthy?" and get pointers on where to look.
by andrewfromx on 5/4/24, 12:33 PM
by mentos on 5/4/24, 12:57 AM
This has sparked a curiosity in me to play with more LLMs locally, thank you!
by NayamAmarshe on 5/4/24, 8:26 AM
by raylad on 5/5/24, 12:43 AM
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML...
Also on Mistral 7B again after supposedly full download:
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16...
Maybe memory? But if so, it would be good to say so. I'm on a 32GB system, btw.
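In case it's a corrupted cache rather than memory, clearing the stored artifacts forces a clean re-download. I'm not sure what WebLLM names its databases, so the filter below is a guess:

```ts
// Guesswork troubleshooting: delete whatever IndexedDB databases look like
// WebLLM caches so the next load re-downloads the weights from scratch.
const dbs = await indexedDB.databases();          // supported in Chrome/Edge
for (const db of dbs) {
  if (db.name && /webllm|mlc/i.test(db.name)) {   // name filter is an assumption
    indexedDB.deleteDatabase(db.name);
  }
}
```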
by r0fl on 5/4/24, 12:21 PM
by littlestymaar on 5/4/24, 6:00 AM
by geor9e on 5/4/24, 12:30 AM
by hpeter on 5/6/24, 1:15 PM
I'd prefer it if web apps supported Ollama, or gave an option to use either that or a model stored in the browser.
Or at least make it an extension
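Talking to a local Ollama instance from a page is mostly just its local HTTP API; a rough sketch below (the model name is an example, and Ollama needs OLLAMA_ORIGINS set to allow the page's origin):

```ts
// Minimal sketch of using a local Ollama server as the backend instead of an
// in-browser model. Assumes Ollama is running on its default port and that
// OLLAMA_ORIGINS permits requests from this page's origin.
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",                               // example model name
    messages: [{ role: "user", content: "Hello!" }],
    stream: false,
  }),
});
const data = await res.json();
console.log(data.message.content);
```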
by simple10 on 5/4/24, 12:59 AM
Do you think it would be feasible to extend it to support web browsing?
I'd like to help if you could give some pointers on how to extend it.
When asked about web browsing, the bot said it could fetch web pages, but then it obviously didn't work when asked to summarize a web page.
[EDIT] The Llama 3 model was able to summarize web pages!
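If it helps, one naive way to add it would be to fetch the page yourself and feed the extracted text to the model as context; a sketch, assuming `engine` is the WebLLM engine the app already holds (CORS means many sites would need a proxy):

```ts
// Hypothetical helper: fetch a page, strip it to text, ask the model to summarize.
// Many sites block cross-origin fetches, so a small proxy is usually needed.
async function summarizeUrl(engine: any, url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const text = new DOMParser()
    .parseFromString(html, "text/html")
    .body.innerText.slice(0, 8_000);               // keep it within the context window
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: `Summarize this page:\n\n${text}` }],
  });
  return reply.choices[0].message.content;
}
```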
by indit on 5/4/24, 12:21 AM
by Its_Padar on 5/4/24, 9:52 AM
by koolala on 5/4/24, 4:02 AM
by Snoozus on 5/4/24, 5:15 AM
by 1f60c on 5/4/24, 7:53 AM
by gitinit on 5/5/24, 3:00 AM
by zerop on 5/4/24, 11:02 AM
by adontz on 5/4/24, 3:30 PM
by Dowwie on 5/4/24, 11:05 AM
by ngshiheng on 5/4/24, 12:19 AM
On a side note, I've been trying to do something similar for similar reasons (privacy).
Based on my recent experience, I find that running LLMs directly in the browser with decent UX (e.g. sub-1-2-second response time, no lag, no crashes) is still more or less impossible given the current state of things. Plus, I think that relying on users' own GPU hardware for UX improvements via WebGPU is not exactly practical at scale (but it is still something!), since not everyone has access to GPU hardware.
But yeah, if there's anything to look forward to in this space, I personally hope to see improved feasibility of running LLMs in browsers.
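One thing a page can at least do today is peek at the WebGPU adapter limits before committing to a big model; a rough heuristic (the model names and the 1 GiB threshold are just placeholders, and I don't think WebLLM does this for you):

```ts
// Crude heuristic: inspect adapter limits before choosing a model size.
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) {
  console.log("No WebGPU adapter; fall back to a hosted API or CPU path.");
} else {
  const maxBind = adapter.limits.maxStorageBufferBindingSize;
  const pick = maxBind >= 1 << 30 ? "an 8B q4 model" : "a ~1B q4 model";
  console.log(`Device allows ${maxBind}-byte bindings; try ${pick}.`);
}
```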
by Jackson_Fleck on 5/4/24, 1:16 PM
by Jackson_Fleck on 5/4/24, 12:38 PM
by Bradd3rs on 5/4/24, 8:13 AM
by BargirPezza on 5/6/24, 3:26 PM
It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting" "What a Time to Be Alive" - referencing the 1960s rock and roll era"
I write: "The song names should only be two to three words. Can you read what we have written before?"
Llama answers me: "Sure! Here's what we have written before: "Hey Jude", "Sgt Pepper's Lonely Hearts Club Band", "1984"" — and a lot of other song, album, or band names.
I tell it that that is not true and ask Llama to write out the first prompt I wrote.
Llama answers: "You asked me to help you with a math problem."
Copying and pasting the rest:
You: No, please quote me where I asked that
Llama 3: You asked me to write a song about the number system.
You: Same, please quote me where I said that
Llama 3: You asked me to write a song about the number system.
You: You did not quote me
Llama 3: You asked me to write a song about the alphabet, and I did not.
You: Okay.. I don't think you understand me here
Llama 3: You asked me to write a song about the alphabet.
You: Can you read what we have written?
Llama 3: You asked me to help you with a math problem.
You: Yes or no?
Llama 3: I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.
You: Alright, thanks for being honest. I will give you a break, sorry if it have been too much for you!
Llama 3: You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?