by Jasondells on 1/28/25, 11:14 AM
An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
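For the curious, min_p itself is simple enough to sketch; here's a toy Python version (my own illustration, not the authors' or llama.cpp's actual implementation):

    import numpy as np

    # min_p sampling: drop every token whose probability falls below min_p times
    # the top token's probability, renormalize, then sample. Pruning that
    # low-probability tail is what helps keep a heavily quantized model from
    # wandering off into "Pygame's Pygame's Pygame's" loops.
    def sample_min_p(logits, min_p=0.1, rng=np.random.default_rng(0)):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()
        probs = np.where(keep, probs, 0.0)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    print(sample_min_p(np.array([2.0, 1.5, -3.0, -5.0])))  # only the two plausible tokens can ever win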
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
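To illustrate the flavor of that mixed-precision idea, here's a rough sketch (layer names and bit widths are my guesses at the shape of the recipe, not Unsloth's actual config):

    # Route the bulk of the MoE expert weights down to ~1.58 bits and keep the
    # quantization-sensitive layers wider. Purely illustrative.
    def pick_bits(layer_name):
        if "down_proj" in layer_name:   # called out as sensitive in the post
            return 4
        if any(k in layer_name for k in ("self_attn", "embed", "lm_head")):
            return 6                    # my assumption for attention/embeddings
        return 1.58                     # everything else: the MoE bulk

    for name in ("model.layers.3.mlp.experts.7.down_proj",
                 "model.layers.3.mlp.experts.7.gate_proj",
                 "model.layers.3.self_attn.q_proj"):
        print(f"{name} -> {pick_bits(name)} bits")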
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
by apples_oranges on 1/28/25, 10:15 AM
Random observation 1: I was running DeepSeek yesterday on my Linux box with an RTX 4090, and I noticed that the models need to fit into VRAM (24GB in my case) or they are simply slow. So the Apple shared-memory architecture has an advantage here: a 192GB Mx Ultra can load and process large models efficiently.
Random observation 2: It's time to cancel the OpenAI subscription.
by mtrovo on 1/28/25, 11:09 AM
Wow, an 80% reduction in size for DeepSeek-R1 is just amazing! It's fantastic to see such large models becoming more accessible to those of us who don't have access to top-tier hardware. This kind of optimization opens up so many possibilities for experimenting at home.
I'm impressed by the 140 tokens per second with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid-sized shops to use for local applications. This is a huge win for people working on agents that need the low latency only local models can provide.
by raghavbali on 1/28/25, 9:15 AM
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.
This is a really interesting insight (although other works cover this as well). I'm particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
by brap on 1/28/25, 10:57 AM
As someone who is out of the loop, what’s the verdict on R1? Was anyone able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?
It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.
by DogRunner on 1/28/25, 8:31 PM
>For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.
Oh nice! So I can try it in my local "low power/low cost" server at home.
My home system runs a Ryzen 5500 + 64GB RAM + 7x RTX 3060 12GB,
so 64GB RAM plus 84GB VRAM.
I don't want to brag, just point to solutions for us tinkerers with a small budget and high energy costs.
Such a system can be built for around 1600 euro, and power consumption is around 520 watts.
I started with an AM4 board (B450 chipset) and one used RTX 3060 12GB, which costs around 200 euro used if you are patient.
Every additional GPU is connected with a PCIe riser/extender to give the cards enough space.
After a while I replaced those risers with a single PCIe x4 to 6x PCIe x1 extender.
It runs pretty nicely. Awesome for learning and gaining experience.
by cubefox on 1/28/25, 12:20 PM
For anyone wondering why "1.58" bits: 2^1.58496... = 3. The weights have one of the three states {-1, 0, 1}.
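Quick check in Python:

    import math
    print(math.log2(3))  # 1.584962500721156 -> ~1.58 bits per ternary weight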
by tarruda on 1/28/25, 9:13 AM
Would be great if the next generation of base models were designed to run within 128GB of VRAM when 8-bit quantized (which would put them in the consumer hardware class).
For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep good performance while still being possible to run on 128GB RAM MacBooks.
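Rough back-of-envelope for why that could fit (my numbers, purely illustrative):

    # At 8-bit quantization each weight takes ~1 byte, so total parameters in
    # billions is roughly the model size in GB. Active parameters set the speed,
    # but the *total* parameter count is what has to fit in memory.
    total_params_b = 110   # hypothetical total size of such an MoE, in billions
    bytes_per_param = 1.0  # 8-bit weights
    overhead_gb = 10       # rough allowance for KV cache, activations, runtime
    print(f"~{total_params_b * bytes_per_param + overhead_gb:.0f} GB")  # ~120 GB, tight but plausible on a 128GB machine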
by TheTaytay on 1/28/25, 1:08 PM
Danielhanchen, your work is continually impressive. Unsloth is great, and I'm repeatedly amazed at your ability to get up to speed on a new model within hours of its release, and often fix bugs in the default implementation. At this point, I think serious labs should give you a few hours' head start just to iron out their kinks!
by afro88 on 1/28/25, 12:53 PM
The size reduction while keeping the model coherent is incredible. But I'm skeptical of how much effectiveness was retained. Flappy Bird is well known and the kind of thing a non-reasoning model could get right. A better test would be something off the beaten path that R1 and o1 get right but other models don't.
by hendersoon on 1/28/25, 12:57 PM
The size reduction is impressive but unless I missed it, they don't list any standard benchmarks for comparison so we have no way to tell how it compares to the full-size model.
by amusingimpala75 on 1/28/25, 12:45 PM
> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.
Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs with no source form "open-source"?
by miohtama on 1/28/25, 10:39 AM
Flappy Bird in Python is the new Turing test
by ThePhysicist on 1/28/25, 9:27 AM
In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?
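To make the question concrete, the layer-wise (pipeline-parallel) idea I mean is roughly this kind of assignment, with activations handed from one device to the next (toy sketch, made-up numbers; real serving stacks add tensor parallelism, batching and KV-cache management on top):

    NUM_LAYERS = 61  # illustrative layer count
    NUM_GPUS = 8

    def layer_slices(num_layers, num_gpus):
        # Assign contiguous runs of layers to devices, as evenly as possible.
        per_gpu, extra = divmod(num_layers, num_gpus)
        start = 0
        for gpu in range(num_gpus):
            count = per_gpu + (1 if gpu < extra else 0)
            yield gpu, start, start + count - 1
            start += count

    for gpu, first, last in layer_slices(NUM_LAYERS, NUM_GPUS):
        print(f"GPU {gpu}: layers {first}-{last}")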
by ggm on 1/28/25, 9:59 PM
If I invested in a 100x machine because I needed 100x to run a model, and somebody shows how 10x can work, haven't I just become the holder of ten 10x machines, and therefore already made the capex needed to exploit this new market?
I cannot understand why "OpenAI is dead" has legs: repurpose the hardware and data and it can run multiple instances of the more efficient model.
by xiphias2 on 1/28/25, 5:04 PM
Has it been tried on 128GB M4 MacBook Pro? I'm gonna try it, but I guess it will be too slow to be usable.
I love the original DeepSeek model, but the distilled versions are too dumb usually. I'm excited to try my own queries on it.
by Pxtl on 1/28/25, 6:25 PM
Is there any good quick summary of what's special about DeepSeek? I know it's OSS and incredibly efficient, but news laymen are saying it's trained purely on AI info instead of using a corpus of tagged data... which, I assume, means it's somehow extracting weights or metadata or something from other AIs. Is that it?
by Dwedit on 1/28/25, 10:47 PM
Is this actually 1.58 bits (log base 2 of 3)? I've heard of another "1.58-bit" model that actually used 2 bits instead. "1.6 bit" is easy enough: you can pack five 3-state values into a byte by using the values 0-242. Then unpacking is easy: you divide and modulo by 3 up to five times (or use a lookup table).
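The packing is simple enough to sketch in Python (toy illustration, not how any GGUF kernel actually stores it):

    # Pack five ternary digits (0, 1, 2 standing for -1, 0, +1) into one byte:
    # 3**5 = 243 <= 256, so every combination fits in the values 0-242.
    def pack5(trits):
        value = 0
        for t in reversed(trits):
            value = value * 3 + t
        return value  # 0..242

    def unpack5(byte):
        trits = []
        for _ in range(5):
            trits.append(byte % 3)
            byte //= 3
        return trits

    assert unpack5(pack5([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]
    # 8 bits / 5 trits = 1.6 bits per weight, just above log2(3) ~ 1.585.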
by danesparza on 1/28/25, 4:56 PM
Just ask it about Taiwan (not kidding). I'm not sure I can trust a model that has such a focused political agenda.
by MyFirstSass on 1/28/25, 1:54 PM
Is this akin to the quants already being done to various models when you download a GGUF at 4 bits, for example, or is this variable layer compression something new that could also make existing smaller models smaller, so we can fit more into, say, 12 or 16GB of VRAM?
by beernet on 1/28/25, 2:59 PM
Big fan of unsloth, they have huge potential, but they could definitely use some experienced GTM people, IMO. The pricing page and the messaging there are really not good.
by slewis on 1/28/25, 4:53 PM
It would be really useful to see these evaluated across some of the same evals that the original R1 and deepseek's distills were evaluated on.
by patleeman on 1/28/25, 10:33 PM
Incredible work by the Unsloth brothers again. It’s really cool to see bitnet quantization implemented like this.
by CHB0403085482 on 1/28/25, 11:20 AM
DeepSeek R1 in a nutshell
youtube.com/watch?v=Nl7aCUsWykg
by upghost on 1/28/25, 9:28 AM
Thanks for the run instructions, unsloth. Deepseek is so new it's been breaking most of my builds.
by indigodaddy on 1/28/25, 4:02 PM
Is there any small DS or qwen model that could run on say an M4 Mac Mini Standard (16G) ?
by techwiz137 on 1/28/25, 1:07 PM
How can you have a bit and a half exactly? It doesn't make sense.
by mclau156 on 1/28/25, 3:48 PM
Is the new LLM benchmark to create flappy bird in pygame?
by CodeCompost on 1/28/25, 11:12 AM
Can I run this on ollama?
by homarp on 1/28/25, 9:50 AM
by petesergeant on 1/28/25, 11:28 AM
It is going to be truly fucking revolutionary if open-source models are, and continue to be, able to challenge the state of the art. My big philosophical concern is that AI locks Capital into an absolutely supreme and insurmountable lead over Labour, concentrated in the hands of oligarchs, and the possibility of a future where that's not the case feels amazing. It pleases me greatly that this has Trump riled up too, because I think it means he's much less likely to allow existing US model-makers to build moats: even as a man who I don't think believes in very much, he is absolutely unwilling to let the Chinese get the drop on him over this.
by sylware on 1/28/25, 11:16 AM
site is javascript walled
80%? On 2 H100 only? To get near chatgpt 4? Seriously? The 671B version??
by bluesounddirect on 1/28/25, 1:45 PM
Hi, small comment: please remember that in China many things are sponsored or subsidized by the government. "We [China] can do it for less...", "it's cheaper in China..." often just means the government gave us a pile of cash and help to get here.
I 100% expect some downvotes from the CCP.