from Hacker News

A Practical Guide to Running Local LLMs

by philk10 on 3/11/25, 1:41 PM with 18 comments

  • by simonw on 3/11/25, 3:29 PM

    This is a good guide. Ollama and Llamafile are two of my top choices for this.

    On macOS it's worth investigating the MLX ecosystem. The easiest way to do that right now is using LM Studio (free but proprietary), or you can run the MLX libraries directly in Python. I have a plugin for my LLM CLI tool that uses MLX here: https://simonwillison.net/2025/Feb/15/llm-mlx/
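
    For illustration, a minimal sketch of running an MLX model directly in Python, assuming the mlx-lm package (pip install mlx-lm); the model name is just an example from the mlx-community collection:

        # Load a quantized model from the mlx-community Hugging Face org
        # and generate a short completion on Apple Silicon.
        from mlx_lm import load, generate

        model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
        text = generate(model, tokenizer, prompt="Write a haiku about local LLMs.", verbose=True)
        print(text)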

  • by filoeleven on 3/11/25, 4:22 PM

    I found Mozilla's Transformer Lab quite nice to use, for very small and dabbling values of "use" at least. It encapsulates the setup and interaction with local LLMs into an app, and that feels more comfortable to me than using them from the CLI.

    Upon getting a model up and running though, I quickly realized that I really have no idea what to use it for.

    https://transformerlab.ai/

  • by tegiddrone on 3/11/25, 3:06 PM

    Is there a spreadsheet out there benchmarking local LLMs and hardware configs? I want to know if I should even bother with my Coffee Lake Xeon server or if it is something to consider for my next gaming rig.

  • by smjburton on 3/11/25, 4:44 PM

    > Let’s be clear. It’s going to be a long time before running a local LLM will produce the type of results that you can get from querying ChatGPT or Claude. (You would need an insanely powerful homelab to produce that kind of results).

    Has anyone experimented with local LLMs and compared the output to ChatGPT or Claude? The article mentions that they use local LLMs when they're not overly concerned with the quality or response time, but what are some other limitations or differences when running these models locally?

  • by FloatArtifact on 3/11/25, 4:26 PM

    What we need is a platform for benchmarking hardware for AI models: with X hardware you can get Y tokens per second, with Z latency for context (prompt) pre-fill. So, a standard testing methodology per model with user-supplied benchmarks. Yes, I recognize there's going to be some variability based on different versions of the software stack and encoders.

    The end-user experience should start with selecting the models of interest to run, then output hardware builds with price tracking for components.
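
    As a rough sketch of that kind of per-model measurement, here is what it could look like against Ollama's local HTTP API (the timing fields come from Ollama's /api/generate response and are reported in nanoseconds; the model name is just an example):

        import requests

        # One non-streaming generation request, so the timing stats come back in a single JSON blob.
        stats = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.2", "prompt": "Summarise the plot of Hamlet.", "stream": False},
            timeout=600,
        ).json()

        prefill_s = stats["prompt_eval_duration"] / 1e9   # context (prompt) pre-fill time
        decode_s  = stats["eval_duration"] / 1e9          # time spent generating tokens
        tok_per_s = stats["eval_count"] / decode_s        # generation throughput

        print(f"prefill: {stats['prompt_eval_count']} tokens in {prefill_s:.2f}s")
        print(f"decode:  {tok_per_s:.1f} tokens/s over {stats['eval_count']} tokens")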

  • by adultSwim on 3/11/25, 4:26 PM

    Ollama and llama.cpp are easy to get running. Does anyone have notes on performance of other systems that are harder to install, such as TensorRT-LLM?
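
    For Ollama, "easy" really is something like the following sketch, assuming the official ollama Python client (pip install ollama) and a local Ollama server already running; the model name is just an example:

        import ollama

        ollama.pull("llama3.2")  # downloads the model on first use
        reply = ollama.generate(model="llama3.2", prompt="Say hello in one sentence.")
        print(reply["response"])
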
  • by zoogeny on 3/11/25, 9:29 PM

    I'm sorry to use HN as Google or even ChatGPT, but are these systems just for LLMs?

    I'm wondering about multi-modal models, or generative models (like image diffusion models). For example, for noise removal from audio files: how hard would it be to find open models that could be fine-tuned for that purpose, and how easy would they be to run locally?