by mohsen1 on 2/25/25, 6:20 AM
> For extreme performance, we discover and use an out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better.
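For anyone curious what that looks like in practice, here is a minimal sketch (not DeepSeek's actual code) of wrapping such a load in inline PTX from CUDA C++; the wrapper name and the .v4.s32 vector width are illustrative assumptions, and it presumes a Hopper (sm_90) target:

    // Illustrative sketch only, not DeepEP's implementation: a non-coherent
    // global load that skips L1 allocation and hints a 256-byte L2 prefetch.
    // Per the quote above, combining .nc with volatile GPU memory is formally
    // undefined behavior; it is only reported to hold up on Hopper.
    #include <cuda_runtime.h>

    __device__ __forceinline__ int4 ld_nc_no_allocate(const int4* ptr) {
        int4 v;
        asm volatile(
            "ld.global.nc.L1::no_allocate.L2::256B.v4.s32 {%0, %1, %2, %3}, [%4];"
            : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
            : "l"(ptr)
            : "memory");
        return v;
    }

In a real build one would presumably hide this behind a compile-time switch so it can fall back to an ordinary __ldg / ld.global.nc path on architectures where the combination has not been validated.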
by pama on 2/25/25, 3:40 AM
I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse-engineer correctly based on the papers. I hope the releases this week start a renaissance in the use of MoE as a baseline for academic models.
by ofou on 2/25/25, 3:02 AM
You gotta love these guys; they're really pushing the open-source frontier for all of us. Thanks for sharing.
by breadwinner on 2/25/25, 1:08 PM
Zuckerberg should stop claiming Meta is open-sourcing AI (they are even running TV ads) when they are only releasing the weights, not the code. Only DeepSeek is doing real OSS AI.
by helloericsf on 2/25/25, 2:27 AM
- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping
X:
https://x.com/deepseek_ai/status/1894211757604049133
by ur-whale on 2/25/25, 12:51 PM
The incentive behind DeepSeek's work may very well be the wrong one (something along the lines of a state-sponsored attempt to shrink the US first-mover advantage in AI to nil), but the net result for everyone on the planet is simply fantastic.
So even in the worst case (doing this for the wrong reasons): thank you, DeepSeek, for actually doing what OpenAI lied through its teeth to the whole world about doing for years.
You rock.
by rvz on 2/25/25, 4:25 AM
Round 2 of open-source releases from an actual "Open AI™" company, licensed under MIT.
Once again, DeepSeek is more open than the $157B+ company that claims to be "Open".
Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.
The objective is not to be squeezed in the middle of the race to zero.
by yieldcrv on 2/25/25, 12:25 PM
So while the US is chasing GPU receipts in Singapore just to ensure DeepSeek was using only H800s, the rest of the world can run these optimizations on full H100s?
All while we pretend that H100s were difficult to get or access because of US sanctions and the hubris of believing those edicts blanket the globe?
Am I understanding this correctly?
by deyiao on 2/25/25, 4:01 AM
Is the PTX that everyone was looking forward to included this time?
by wbsun on 2/27/25, 5:03 AM
This feels like the 80s/90s, when people hacked assembly or hunted for undocumented instructions to squeeze performance out of the CPU. Until one day compilers are optimized enough, or GPUs powerful enough, that such tricks won't make much difference anymore, as with CPUs nowadays :D
by Bimos on 2/25/25, 2:44 AM
The PTX instructions they talked about in the tech report should correspond to the code here, right?
by kennyloginz on 2/25/25, 7:56 AM
Spring showers bring May flowers!
by deyiao on 2/25/25, 5:41 AM
Now it includes the highly anticipated PTX! Of course, I don’t understand it, but I’ve already clicked the star and even the fork button, which basically means I’ve mastered it, right? I feel incredibly powerful right now...