from Hacker News

Show HN: AutoThink – Boosts local LLM performance with adaptive reasoning

by codelion on 5/28/25, 2:39 AM with 68 comments

I built AutoThink, a technique that makes local LLMs reason more efficiently by adaptively allocating computational resources based on query complexity.

The core idea: instead of giving every query the same "thinking time," classify queries as HIGH or LOW complexity and allocate thinking tokens accordingly. Complex reasoning gets 70-90% of tokens, simple queries get 20-40%.

I also implemented steering vectors derived from Pivotal Token Search (originally from Microsoft's Phi-4 paper) that guide the model's reasoning patterns during generation. These vectors encourage behaviors like numerical accuracy, self-correction, and thorough exploration.

Results on DeepSeek-R1-Distill-Qwen-1.5B:

- GPQA-Diamond: 31.06% vs 21.72% baseline (+43% relative improvement)

- MMLU-Pro: 26.38% vs 25.58% baseline

- Uses fewer tokens than baseline approaches

Works with any local reasoning model - DeepSeek, Qwen, custom fine-tuned models. No API dependencies.

The technique builds on two things I developed: an adaptive classification framework that can learn new complexity categories without retraining, and an open source implementation of Pivotal Token Search.

Technical paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327

Code and examples: https://github.com/codelion/optillm/tree/main/optillm/autoth...

PTS implementation: https://github.com/codelion/pts

I'm curious about your thoughts on adaptive resource allocation for AI reasoning. Have you tried similar approaches with your local models?

by codelion on 5/28/25, 2:40 AM
The motivation for AutoThink came from watching how current reasoning models waste computation - they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs. This seemed obviously inefficient.
The breakthrough was combining two techniques I'd been working on separately: adaptive classification (which can learn new categories without retraining) and an open source implementation of Pivotal Token Search from Microsoft's Phi-4 paper. When I put them together with dynamic token budgeting, the performance gains were much better than expected.
What surprised me most was that the technique actually uses fewer tokens on average while improving performance. The adaptive allocation means simple queries finish faster, offsetting the extra computation on complex ones.
A few technical notes:
- The steering vectors are small (typically <1MB per pattern) and add minimal memory overhead
- Classification adds about 10ms latency, which is negligible
- Target layer selection matters - I found middle layers (15-20) work best for most models
I'd love feedback on:
- Have you tried similar adaptive approaches with your models?
- What other reasoning patterns would be useful to steer toward?
- Ideas for automatically detecting the optimal target layer?
Thanks for checking it out! Happy to answer any questions about the implementation or results.
by bufferoverflow on 5/28/25, 6:12 AM
But how do you classify a question as high vs low complexity? Some seemingly simple questions can turn out to be very very complex. For example, integer solution to
```
    x³ + y³ + z³ = 42 
```
took over a hundred years of compute time to find.
Or another seemingly simple equation with positive integers x,y,z
```
    x/(y+z)+y/(z+x)+z/(x+y) = 4
```
requires elliptic curve knowledge, and the solution is huge
```
    x = 154476802108746166441951315019919837485664325669565431700026634898253202035277999

    y = 36875131794129999827197811565225474825492979968971970996283137471637224634055579

    z = 4373612677928697257861252602371390152816537558161613618621437993378423467772036
```
(Solution is discussed here: https://www.quora.com/How-do-you-find-the-positive-integer-s...)
by NiloCK on 5/28/25, 12:53 PM
I, too, built a POC autothink shortly after the Claude 3.7 release that included the `extended thinking` toggle. It's literally also called autothink:
https://github.com/NiloCK/autothink
https://www.paritybits.me/think-toggles-are-dumb/
My own version took a first pass with an LLM whose job was to assign a 0-100 complexity rating, and then there was more or less a linear scaling of the allocated thinking budget.
The OP effort here is obviously higher grade, and I'm really tickled to see quantitative results. Well done.
by nssnsjsjsjs on 5/28/25, 3:22 AM
This is an obvious optimisation. Surprised this isn't been done already. Good job writing it up and showing how it can be done.
by mentalgear on 5/28/25, 9:43 AM
It's great how small models help small teams and individual researchers everywhere now compete with big AI labs by allowing them to demonstrate new innovative approaches on small experiments.
Also, as small language models (SML) become more competent, it's amazing what they can do on-device !
by CMay on 5/28/25, 3:52 PM
In terms of reasoning models like QwQ or Qwen 3 I didn't waste too much time trying to improve their results aside from coming up with various ways to constrain their reasoning token output with prompts.
Even though Gemma 3 27B QAT is not a reasoning model, it's so good at instruction following and being used in LLM chains/routes that it can be used for classifying/language optimization steps before instructing it how to reason about the prompt in the next step. You can even have it output intermediate answers interspersed between multiple think tags in the same response. In many ways for these models I just define thinking as any tokens that are helping the model arrive at the conclusion, but are not fully formed parts of the answer.
Instructing it to use certain words (tokens) and types of phrasing preferentially is something that is known to improve results in general, not just in LLMs and I've seen improved results by encouraging certain types of language to be used. AutoThink using the highest performing tokens out of a dataset _could_ be a nice way to optimize towards that in a more general way.
It seems like there's a risk of using so many pivotal tokens that it almost overfits responses to benchmark questions, though. So, while I have personally seen careful word/token selection improve result quality and also see it as a potential low cost high return optimization, I'd still want to see how AutoThink generalizes.
by vintermann on 5/28/25, 6:46 AM
If host models for others, then sure, I'm happy to save some computation time for really simple queries. Sure the cost is that the model will be effectively dismissive of questions it judges to be "easy", but I'm not the one carrying that cost I suppose.
However, for a local model, answering my own queries? That's the last thing I want. I already spent way too much money on that GPU, might as well get use out of it.
by GENIXUS on 5/28/25, 3:33 PM
I’m very new to the world of LLMs and AI, but this project really caught my attention.
From what I understood, AutoThink helps the AI “think more wisely” by adjusting how much effort it spends based on how hard the question is. That makes a lot of intuitive sense — like how people don’t spend 10 minutes figuring out what 2+2 is, but do take time with tricky problems.
Even though I don’t know the technical parts (like token budgeting or steering vectors), it’s fascinating to see how these methods can make the AI both faster and smarter at the same time.
Thanks for sharing — I’m definitely going to follow this kind of work more closely from now on.
by shah_akshat on 5/28/25, 4:13 AM
Surprised this didn't exist. Great work @codelion
by SamScout on 5/28/25, 7:52 PM
Great food for thought! We will discuss this approach as we find our evolving AI-crawler should ideally be able to recognize when a site we visit needs more vs. less queries.
For context, we're samaritanscout.org a search engine that is attempting to provide a comprehensive view into all local volunteering opportunities posted on a range of nonprofit websites.
by casenmgreen on 5/28/25, 9:39 AM
It seems to me inadvisable to say "think" and "reason", because those words have particular meanings, and those particular meanings are not in use by LLMs.
They are a computing method, where we can choose to use more or less run time (and so processor time), to generate results.
by lostmsu on 5/28/25, 12:37 PM
Official https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-... says GPQA-Diamond is 33.8
by Dowwie on 5/28/25, 10:19 AM
Hey, this is really interesting. What are the features you used to measure the reasoning complexity? In other words, how does one evaluate a query during classification?
by shwouchk on 5/28/25, 7:34 AM
Very interesting, thanks for sharing!
FWIW gemini explicitly told me that it ranks question difficulty from 1 to 100 and depending on the bin allocates more or less resources to answering it
by transfire on 5/28/25, 3:45 AM
That’s awesome!
Now have it mark blocks of text on or off, so it can ignore irrelevant, or worse erroneous material — no need to include it in the context window.
by shirman on 6/1/25, 11:52 PM
Hi, it does not work with llama.cpp right?
by danielhanchen on 5/28/25, 4:22 AM
Super cool and the results look pretty solid as well! Will give it a try!
by keeganpoppen on 5/28/25, 5:03 AM
i have definitely observed a similar pattern in the Big Label Foundation Models… so, i’m glad to see it in this realm too <3
by MagicMoonlight on 5/28/25, 7:07 AM
You didn’t invent this. Models like o3 already do it, that’s why the amount of thinking time varies.