from Hacker News

DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs

by helloericsf on 2/24/25, 1:37 AM with 108 comments

  • by refibrillator on 2/24/25, 7:40 AM

    vLLM supports MLA for Deepseek models as of 3 weeks ago. 3x higher generation throughput and 10x token memory capacity.

    https://github.com/vllm-project/vllm/releases/tag/v0.7.1
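
    As a point of reference, a minimal sketch of running a DeepSeek checkpoint through vLLM (assuming vLLM >= 0.7.1, where the MLA path is picked for DeepSeek models per the release linked above; the model name and sampling settings are illustrative only):

      # Minimal vLLM sketch; the assumption here is that DeepSeek checkpoints
      # route through the MLA backend by default in vLLM >= 0.7.1.
      from vllm import LLM, SamplingParams

      llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
      params = SamplingParams(temperature=0.7, max_tokens=128)

      out = llm.generate(["Summarize multi-head latent attention."], params)
      print(out[0].outputs[0].text)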

    MHA is apparently still faster in the low-QPS regime.

    https://neuralmagic.com/blog/enhancing-deepseek-models-with-...

    Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.

    https://arxiv.org/pdf/2502.07864

  • by helloericsf on 2/24/25, 1:38 AM

    X: https://x.com/deepseek_ai/status/1893836827574030466

    BF16 support, paged KV cache (block size 64), 3000 GB/s memory-bound and 580 TFLOPS compute-bound on H800.
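
    For context, the kernel is meant to be called per layer inside a decode loop: metadata is computed once per step, then the paged-KV attention call runs per layer. The snippet below paraphrases the usage sketched in the repo's README; the argument names follow my reading of that snippet, and the tensor shapes are assumptions for illustration, not values checked against the kernel.

      # Paraphrase of the FlashMLA README usage; shapes below are assumed.
      import torch
      from flash_mla import get_mla_metadata, flash_mla_with_kvcache

      batch, s_q, h_q, h_kv, d, dv = 4, 1, 128, 1, 576, 512   # assumed MLA decode shapes
      num_blocks, block_size = 64, 64                          # paged KV cache, block size 64

      q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
      kvcache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
      block_table = torch.arange(batch * 16, dtype=torch.int32, device="cuda").view(batch, 16)
      cache_seqlens = torch.full((batch,), 1000, dtype=torch.int32, device="cuda")

      # Scheduling metadata: computed once per decode step from the cached lengths.
      tile_scheduler_metadata, num_splits = get_mla_metadata(
          cache_seqlens, s_q * h_q // h_kv, h_kv
      )

      # Per-layer decode attention against the paged KV cache.
      o, lse = flash_mla_with_kvcache(
          q, kvcache, block_table, cache_seqlens, dv,
          tile_scheduler_metadata, num_splits, causal=True,
      )
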
  • by FL33TW00D on 2/24/25, 10:32 AM

    It seems to me that MLA will become the standard from here on out.

    If DeepSeek R1 had used standard MHA, it would need 1749 KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache exceeds the entire 80 GB HBM capacity of a single H100.

    Using MLA, each token consumes 125 KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
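
    Taking those per-token figures at face value, a quick script reproduces the token counts (assuming 80 GB of H100 HBM and decimal KB/GB):

      # Reproduce the parent's estimates from its per-token KV cache sizes.
      # 1749 KB (MHA) and 125 KB (MLA) are the parent's figures; 80 GB is the
      # H100's HBM capacity, with KB/GB treated as decimal units.
      HBM_BYTES = 80e9

      for name, kb_per_token in [("MHA", 1749), ("MLA", 125)]:
          tokens = HBM_BYTES / (kb_per_token * 1e3)
          print(f"{name}: ~{tokens:,.0f} tokens before the cache fills the GPU")

      # MHA: ~45,740 tokens    MLA: ~640,000 tokens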

  • by ur-whale on 2/24/25, 11:18 AM

    For those who wonder ... it's somewhat likely that MLA means Multi-head Latent Attention.

    https://verticalserve.medium.com/group-query-attention-58283...

    https://paperswithcode.com/method/multi-head-attention
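
    For intuition, the core idea (as described in the DeepSeek-V2 paper) is to cache a single small latent vector per token and up-project it into every head's keys and values at attention time. A toy sketch with made-up dimensions, omitting the decoupled RoPE component:

      # Toy sketch of MLA's low-rank KV compression; dimensions are invented
      # for illustration and the decoupled RoPE path is omitted.
      import torch

      d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

      W_dkv = torch.randn(d_model, d_latent) / d_model ** 0.5            # down-projection
      W_uk  = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to K
      W_uv  = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to V

      h = torch.randn(4, 32, d_model)   # (batch, seq, hidden) hidden states
      c_kv = h @ W_dkv                  # (batch, seq, d_latent) -- only this is cached
      k = (c_kv @ W_uk).view(4, 32, n_heads, d_head)
      v = (c_kv @ W_uv).view(4, 32, n_heads, d_head)

      # Cache cost per token: d_latent floats instead of 2 * n_heads * d_head.
      print(d_latent, "vs", 2 * n_heads * d_head)   # 128 vs 1024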

  • by eigenvalue on 2/24/25, 6:07 AM

    Nice, probably saved a bunch of FANG devs a lot of hours of work trying to knock this off.
  • by imranq on 2/24/25, 4:20 PM

    Dang, only forward passes. The real secret was in the backward pass! I was also curious to learn how they implemented the DualPipe scheduler.
  • by mohsen1 on 2/24/25, 3:25 AM

    I'm confused. Weren't there sanctions against Chinese companies covering Hopper GPUs? Are they just admitting that they had access to H100s despite the US sanctions?!
  • by rob_c on 2/24/25, 11:21 AM

    Great work. Any plans to integrate with PyTorch or TF, I wonder?

    (Showing my lack of breadth of knowledge in the ecosystem(s).)

  • by behnamoh on 2/24/25, 3:25 AM

    Open AI is back!
  • by mclau156 on 2/24/25, 2:40 PM

    Was really hoping we could get flash games back with AI
  • by syntex on 2/24/25, 4:31 PM

    What can I do with that?
  • by rvz on 2/24/25, 4:12 AM

    This is the minimum bar that I expect very elite programmers to be striving for in the age of AI. DeepSeek should be studied as an example, and this is only the first of many projects from them.

    There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the ones who are able to build or adapt projects like this, which go deep into hardware systems, will be the most sought after.

    Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.

    You've got until 2030 to decide. My advice is to study the codebases of PyTorch (the backends), DeepSeek, tinygrad, and ggml.

  • by m3kw9 on 2/24/25, 5:34 AM

    MHGA: making Hopper great again
  • by deyiao on 2/24/25, 3:00 AM

    I heard their inference framework's costs are way lower than typical deployment methods. Can this be verified from this open-source project? How does it stack up against vLLM or llama.cpp?