by redman25 on 5/10/25, 3:39 AM with 104 comments
by dust42 on 5/10/25, 6:55 AM
25 t/s prompt processing
63 t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B already gives very decent output, describing different images pretty well. Steps to reproduce:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 12 --clean-first
# download model and mmproj files...
build/bin/llama-server \
--model gemma-3-4b-it-Q4_K_M.gguf \
--mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface. Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that multimodal is not supported by the model.
I used the official ggml-org/gemma-3-4b-it-GGUF quants; I expect the unsloth quants from danielhanchen to be a bit faster.
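The web UI isn't the only way in: llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch for sending a local image to it, assuming the multimodal server accepts OpenAI-style image_url content parts with base64 data URIs (photo.jpg and the prompt are placeholders):
IMG_B64=$(base64 < photo.jpg | tr -d '\n')   # placeholder image
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF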
by danielhanchen on 5/10/25, 5:10 AM
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
I made some quants with vision support - literally run:
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is no longer needed for the Metal backend (for CUDA it still is); llama.cpp now auto offloads to the GPU by default. -1 means all layers are offloaded to the GPU.
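If you don't want the interactive chat, the same CLI can also be scripted for a single prompt. A minimal sketch, assuming the --image/-p flags carried over from the older llava CLI (cat.png and the prompt are placeholders):
# one-shot: describe a single image and exit
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
  --image cat.png -p "Describe this image in one sentence."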
by ngxson on 5/10/25, 6:51 AM
This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!
llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
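A rough sketch of that surveillance idea, using one of the SmolVLM servers above: grab a single frame from the camera with ffmpeg every few seconds and send it to the server, reusing the same OpenAI-style request shape as the curl example earlier in the thread (the RTSP URL, interval, and prompt are placeholders):
while true; do
  # grab one frame from the camera (placeholder RTSP URL)
  ffmpeg -loglevel error -rtsp_transport tcp -i rtsp://camera.local/stream -frames:v 1 -y frame.jpg
  IMG_B64=$(base64 < frame.jpg | tr -d '\n')
  # ask the local model about the frame (same request shape as the curl example above)
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Is there a person in this image? Answer yes or no."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF
  sleep 10
done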
by simonw on 5/10/25, 4:17 AM
by banana_giraffe on 5/10/25, 5:40 AM
Very nice for something that's self-hosted.
by simonw on 5/10/25, 6:31 AM
On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:
unzip llama-b5332-bin-macos-arm64.zip
cd build/bin
sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:
./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
by thenthenthen on 5/10/25, 10:06 AM
by nico on 5/10/25, 4:48 AM
Any benefit on a Mac with Apple silicon? Any experiences anyone could share?
by dr_kiszonka on 5/10/25, 9:47 AM
Use case: I am working on a hobby project that uses TS/React as the frontend. I can use local or cloud LLMs in VSCode, but even those with vision require that I take a screenshot and paste it into the chat. Ideally, I would want it all automated until some stop criterion is met (even if only n iterations). But even an extension that would screenshot a preview and paste it into the chat (triggered by a keyboard shortcut) would be a big time-saver.
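A rough sketch of that screenshot step, assuming macOS: bind something like this to a global keyboard shortcut, let screencapture grab the preview, and pass it to the one-shot CLI from earlier in the thread (the path, model, and prompt are placeholders; feeding the answer back into the editor is left out):
# -i lets you interactively pick the window/region to capture
screencapture -i /tmp/preview.png
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
  --image /tmp/preview.png \
  -p "Review this rendered page and list any visual or layout problems."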
by a_e_k on 5/10/25, 8:06 AM
by gryfft on 5/10/25, 4:20 AM
by threeme3 on 5/18/25, 4:02 AM
by yieldcrv on 5/10/25, 5:58 PM
They’re still doing text and math tests on every new model because it’s so bad
by behnamoh on 5/10/25, 5:07 AM
by jacooper on 5/10/25, 10:08 AM
by buyucu on 5/10/25, 6:30 AM
by mrs6969 on 5/10/25, 7:15 AM
Just trying to understand; awesome work so far.
by bsaul on 5/10/25, 7:41 AM
by nurettin on 5/10/25, 6:23 AM
by nikolayasdf123 on 5/10/25, 8:22 AM
by babuloseo on 5/11/25, 4:03 AM
by gitroom on 5/10/25, 6:29 AM