from Hacker News

Gemma3 – The current strongest model that fits on a single GPU

by brylie on 3/12/25, 7:48 AM with 138 comments

  • by archerx on 3/12/25, 8:33 AM

    I have tried a lot of local models. I have 656GB of them on my computer so I have experience with a diverse array of LLMs. Gemma has been nothing to write home about and has been disappointing every single time I have used it.

    Models that are worth writing home about are:

    EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.

    Rocinante-12B-v2i - Fun for stories and D&D

    Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks

    OpenThinker-7B - Good and fast reasoning

    The DeepSeek distills - Able to handle more complex tasks while still being fast

    DeepHermes-3-Llama-3-8B - A really good vLLM

    Medical-Llama3-v2 - Very interesting but be careful

    Plus more but not Gemma.

  • by danielhanchen on 3/12/25, 11:36 AM

    I wrote a mini guide on running Gemma 3 at https://docs.unsloth.ai/basics/tutorial-how-to-run-gemma-3-e...!

    The recommended settings according to the Gemma team are:

    temperature = 0.95

    top_p = 0.95

    top_k = 64

    Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
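
    For example, here is a minimal sketch of passing those settings through Ollama's HTTP API from Python (this assumes a local Ollama server on the default port and that the GGUF above has already been pulled; the prompt is only illustrative):

        import requests

        # Query a local Ollama server with the Gemma team's recommended
        # sampling settings; the model tag matches the GGUF mentioned above.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M",
                "prompt": "Explain in one sentence why double BOS tokens are a problem.",
                "stream": False,
                "options": {"temperature": 0.95, "top_p": 0.95, "top_k": 64},
            },
        )
        print(resp.json()["response"])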

  • by swores on 3/12/25, 8:32 AM

    See the other HN submission (for the Gemma3 technical report doc) for a more active discussion thread - 50 comments at the time of writing.

    https://news.ycombinator.com/item?id=43340491

  • by iamgopal on 3/12/25, 10:43 AM

    Small models should be trained on specific problems in specific languages, and should be built one upon another, the way containers work. I see a future where a factory or home has a local AI server hosting many highly specific models, continuously trained by super-large LLMs on the web and connected via network to all instruments and computers to basically control the whole factory. I also see a future where all machinery comes with an AI-readable language for its own functioning. An HTTP-like AI protocol for two-way communication between a machine and an AI. Lots of possibilities.
  • by antirez on 3/12/25, 9:23 AM

    After reading the technical report, make the effort to download the model and run it against a few prompts. In 5 minutes you'll understand how broken LLM benchmarking is.
  • by smcleod on 3/12/25, 11:09 AM

    No mention of how well it's claimed to perform with tool calling?

    The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.

  • by mythz on 3/12/25, 8:29 AM

    Not sure if anyone else experiences this, but ollama downloads start off strong and then the last few MBs take forever.

    Finally just finished downloading (gemma3:27b). It requires the latest version of Ollama, but it's now working, and I'm getting about 21 tok/s on my local 2x A4000.

    From my few test prompts it looks like a quality model; I'm going to run more tests comparing it against mistral-small:24b to see if it will become my new local model. A rough sketch of how I measure throughput is below.
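
    If you want to reproduce the tok/s number, here is a rough sketch using Ollama's HTTP API from Python (assumes a local server; the non-streaming response reports eval_count and eval_duration, which give tokens per second directly):

        import requests

        # Ask the local Ollama server for a short completion and compute
        # generation speed from the timing fields in the response.
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "gemma3:27b", "prompt": "Write a haiku about GPUs.", "stream": False},
        ).json()
        # eval_duration is reported in nanoseconds
        print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.1f} tok/s")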

  • by elif on 3/12/25, 10:08 AM

    Good job Google. It is kinda hilarious that 'open'AI seems to be the big player least likely to release any of their models.
  • by wtcactus on 3/12/25, 8:57 AM

    The claim of “strongest” (what does that even mean?) seems moot. I don’t think a multimodal model is the way to go for single, home GPUs.

    I would much rather have specifically tailored models to use in different scenarios, which could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with parts of the model targeting image generation when all I want to do is write code.

  • by singularity2001 on 3/12/25, 12:39 PM

    How does it compare to OlympicCoder 7B [0], which allegedly beats Claude Sonnet 3.7 in the International Olympiad in Informatics [1]?

    [0] https://huggingface.co/open-r1/OlympicCoder-7B?local-app=vll...

    [1] https://pbs.twimg.com/media/GlyjSTtXYAAR188?format=jpg&name=...

  • by tarruda on 3/12/25, 11:49 AM

    My usual non-scientific benchmark is asking it to implement the game Tetris in python, and then iterating with the LLM to fix/tweak it.

    My prompt to Gemma 27b (q4) on open webui + ollama: "Can you create the game tetris in python?"

    It immediately starts writing code. After the code is finished, I notice something very strange: it starts a paragraph like this:

    " Key improvements and explanations:

         Clearer Code Structure:  The code is now organized into a Tetris class, making it much more maintainable and readable.  This is essential for any non-trivial game.
    "

    Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.

    I also notice a very obvious error: in the `if __name__ == '__main__':` block, it tries to instantiate a `Tetris` class, even though the class it created is named "TetrisGame".

    Nevertheless, I try to run it and paste back the `NameError: name 'Tetris' is not defined` error along with the stack trace specifying the line. Gemma then gives me this response:

    "The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"

    It then continues with a generic explanation of how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.
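
    For reference, the mistake reduces to something like this (illustrative names only, not Gemma's actual output):

        # Minimal reduction of the mismatch; the names are illustrative.
        class TetrisGame:
            def run(self):
                print("game loop would start here")

        if __name__ == "__main__":
            # The generated script did the equivalent of `Tetris()` here,
            # which raises NameError because only TetrisGame is defined;
            # the fix is a one-line rename.
            game = TetrisGame()
            game.run()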

  • by sigmoid10 on 3/12/25, 8:26 AM

    These bar charts are getting more disingenuous every day. This one makes it seem like Gemma3 ranks as nr. 2 on the arena just behind the full DeepSeek R1. But they just cut out everything that ranks higher. In reality, R1 currently ranks as nr. 6 in terms of Elo. It's still impressive for such a small model to compete with much bigger models, but at this point you can't trust any publication by anyone who has any skin in model development.
  • by leumon on 3/12/25, 9:20 AM

    In my opinion QwQ is the strongest model that fits on a single GPU (an RTX 3090, for example, in Q4_K_M quantization, which is the standard in Ollama)
  • by aravindputrevu on 3/12/25, 10:36 AM

    I'm curious. Is there any value in doing these OSS models?

    Suddenly, after reasoning models, it looks like OSS models have lost their charm.

  • by chaosprint on 3/12/25, 11:10 AM

    How does this compare with qwq 32B?
  • by wewewedxfgdf on 3/12/25, 11:06 AM

    Discrete GPUs are finished for AI.

    They've had years to provide the needed memory but can't/won't.

    The future of local LLMs is APUs such as Apple M series and AMD Strix Halo.

    Within 12 months everyone will have relegated discrete GPUs to the AI dustbin and be running 128GB to 512GB of delicious local RAM with vastly more RAM than any discrete GPU could dream of.

  • by tekichan on 3/12/25, 10:32 AM

    I found DeepSeek better for trivial tasks.
  • by casey2 on 3/12/25, 11:53 AM

    coalma3
  • by axiosgunnar on 3/12/25, 11:18 AM

    PSA: DO NOT USE OLLAMA FOR TESTING.

    Ollama silently (!!!) drops messages if the context window is exceeded (instead of, you know, just erroring? who in the world made this decision).

    The workaround until now was to (not use ollama or) make sure to only send a single message. But now they seem to silently truncate single messages as well, instead of erroring! (this explains the sibling comment where a user could not reproduce the results locally).

    Use LM Studio, llama.cpp, openrouter or anything else, but stay away from ollama!
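
    If you are stuck on Ollama anyway, one bit of damage control is to set num_ctx explicitly in the request options so at least you know which window you are actually getting (the default has historically been just 2048 tokens). A rough sketch, assuming a local server and an already-pulled gemma3:27b:

        import requests

        # Pin the context window explicitly; anything beyond num_ctx is
        # still truncated, but at least the limit is one you chose.
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={
                "model": "gemma3:27b",
                "messages": [{"role": "user", "content": "your long prompt here"}],
                "stream": False,
                "options": {"num_ctx": 8192},
            },
        )
        print(resp.json()["message"]["content"])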

  • by tarruda on 3/12/25, 9:31 AM

    Is "OpenAI" the only AI company that hasn't released any model weights?