from Hacker News

Ask HN: How to increase LLM inference speed?

by InkCanon on 6/15/25, 10:08 AM with 1 comment

Hi HN,

I'm building software that has a very tight feedback loop with the user. One part involves a short (few hundred tokens) response from an LLM. This is by far the biggest UX problem - currently DeepSeek's total response time can reach 10 seconds, which is horrific. Would it be practical to get that down to maybe ~2 seconds? The LLM is only asked to rephrase a short text (while preserving its meaning), so it does not need to be SOTA. On the whole, faster inference matters much more than model quality.
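
For reference, here's roughly what the call and timing look like. This is a minimal sketch assuming DeepSeek's OpenAI-compatible endpoint (https://api.deepseek.com, model "deepseek-chat") and the openai Python SDK; streaming doesn't shorten total generation time, but the first tokens show up much sooner than the full ~10s, which is what the user actually feels:

    import os
    import time

    from openai import OpenAI

    # Minimal sketch, assuming DeepSeek's OpenAI-compatible API and the
    # `openai` Python SDK; swap base_url/model for any other provider.
    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    text = "The meeting has been moved to Friday, please plan accordingly."

    start = time.perf_counter()
    first_token = None
    parts = []

    # stream=True does not make generation faster, but it lets the UI
    # render words as they arrive instead of waiting for the whole reply.
    stream = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": f"Rephrase this, preserving its meaning: {text}"}],
        max_tokens=300,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter()
            parts.append(chunk.choices[0].delta.content)

    print(f"time to first token: {first_token - start:.2f}s")
    print(f"total time:          {time.perf_counter() - start:.2f}s")
    print("".join(parts))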

  • by cranberryturkey on 6/15/25, 10:14 AM

    you need a faster GPU, but that only works for self-hosted LLMs (i.e. ollama/huggingface). A rough sketch of the self-hosted route is below.
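
    For example, a small local model through ollama's REST API can handle a short rephrasing prompt (a rough sketch, assuming an ollama server on the default port with a small model such as llama3.2 already pulled):

        import requests

        # Rough sketch: assumes `ollama serve` is running locally and a small
        # model has been pulled beforehand (e.g. `ollama pull llama3.2`).
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": "Rephrase this, preserving its meaning: "
                          "The meeting has been moved to Friday.",
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        print(resp.json()["response"])

    Whether this beats ~2 seconds depends entirely on the GPU and the model size, so it's worth benchmarking on the actual hardware first.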