by helloericsf on 2/24/25, 1:37 AM with 108 comments
by refibrillator on 2/24/25, 7:40 AM
https://github.com/vllm-project/vllm/releases/tag/v0.7.1
MHA is still faster in the low-QPS regime, apparently.
https://neuralmagic.com/blog/enhancing-deepseek-models-with-...
Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.
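A minimal NumPy sketch (my own illustration, not code from the linked proof) of why that conversion is possible: the KV heads a GQA layer caches can be reinterpreted as an MLA-style latent of the same size, with a block-structured up-projection that reproduces GQA exactly; MLA then generalizes by letting that up-projection be an arbitrary dense matrix. Sizes are toy values and RoPE handling is ignored for simplicity.

```python
import numpy as np

hidden, n_q_heads, n_kv_heads, d = 256, 8, 2, 32   # toy sizes, not DeepSeek's
rng = np.random.default_rng(0)

W_k = rng.standard_normal((hidden, n_kv_heads * d))  # GQA key projection
x = rng.standard_normal(hidden)                       # one token's hidden state

# --- GQA view: cache n_kv_heads key heads; each query head reads its group's head ---
k_gqa = (x @ W_k).reshape(n_kv_heads, d)              # cached: n_kv_heads * d numbers
group = np.repeat(np.arange(n_kv_heads), n_q_heads // n_kv_heads)
k_per_q_gqa = k_gqa[group]                            # shape (n_q_heads, d)

# --- MLA view: cache a latent c_kv of the same size, expand it with a per-head W_up ---
c_kv = x @ W_k                                        # cached latent: n_kv_heads * d numbers
W_up = np.zeros((n_q_heads, n_kv_heads * d, d))
for h in range(n_q_heads):
    g = group[h]                                      # GQA = block-selection up-projection
    W_up[h, g * d:(g + 1) * d, :] = np.eye(d)
k_per_q_mla = np.einsum('l,hlk->hk', c_kv, W_up)

assert np.allclose(k_per_q_gqa, k_per_q_mla)
# Same per-token cache, identical outputs; the value path works the same way.
# MLA is strictly more general because W_up can be any dense matrix per head,
# while GQA is restricted to this replicate-within-group structure.
```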
by helloericsf on 2/24/25, 1:38 AM
by FL33TW00D on 2/24/25, 10:32 AM
If DeepSeek R1 had used standard MHA, it would need ~1749 KB per token of KV cache storage. That means once a conversation reaches ~46,000 tokens, the KV cache alone would exceed the entire 80 GB memory of a single H100.
With MLA, each token consumes only ~125 KB, so you can reach ~640,000 tokens (roughly 2x the length of Ulysses) before overflowing.
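For reference, the arithmetic behind those numbers; a back-of-the-envelope sketch assuming the 80 GB of HBM on one H100 and the per-token KV sizes quoted above, and ignoring the memory the weights and activations would also need:

```python
HBM_BYTES = 80e9            # 80 GB on one H100 (assumed)
MHA_KV_PER_TOKEN = 1749e3   # ~1749 KB per token with standard MHA (figure quoted above)
MLA_KV_PER_TOKEN = 125e3    # ~125 KB per token with MLA (figure quoted above)

print(f"MHA: {HBM_BYTES / MHA_KV_PER_TOKEN:,.0f} tokens before overflow")  # ~45,700
print(f"MLA: {HBM_BYTES / MLA_KV_PER_TOKEN:,.0f} tokens before overflow")  # 640,000
```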
by ur-whale on 2/24/25, 11:18 AM
https://verticalserve.medium.com/group-query-attention-58283...
by eigenvalue on 2/24/25, 6:07 AM
by imranq on 2/24/25, 4:20 PM
by mohsen1 on 2/24/25, 3:25 AM
by rob_c on 2/24/25, 11:21 AM
(Showing my lack of breadth of knowledge in the ecosystem (s))
by behnamoh on 2/24/25, 3:25 AM
by mclau156 on 2/24/25, 2:40 PM
by syntex on 2/24/25, 4:31 PM
by rvz on 2/24/25, 4:12 AM
There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the people who are able to build or adapt projects like this, which go deep into hardware systems, will be the most sought after.
Unlike the horrendous JS or even TS slop across GitHub, which is extremely easy for an AI to generate correctly.
You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.
by m3kw9 on 2/24/25, 5:34 AM
by deyiao on 2/24/25, 3:00 AM