from Hacker News

Every Flop Counts: Scaling a 300B LLM Without Premium GPUs

by bretpiatt on 3/24/25, 12:48 PM with 9 comments

  • by flowerthoughts on 3/28/25, 6:43 AM

    They never mention what hardware they're on.

    Table 1 is the closest thing. It lists specs for six devices: 120-989 TFLOPS and 64-96 GB of RAM.

    An RTX 5090 is about 105 TFLOPS (FP32).

    https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216

  • by rahen on 3/28/25, 10:19 AM

    I'm pretty surprised by the claimed memory usage for 300B parameters (Table 1). If we compare similar models:

    - Llama 3.1 with 405B parameters: 2 TB of memory (FP32), 500 GB (FP8)

    - DeepSeek R1 with 671B parameters: 1.3 TB (scaling linearly, around 600 GB for 300B parameters)

    Ling claims no more than 96 GB of memory, most likely for inference. That's far more than a 20% reduction. Am I missing something?
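
    A minimal back-of-envelope sketch of the weight-memory arithmetic behind these figures. It counts parameter bytes only (activations, optimizer state, and KV cache are ignored), so it's a lower bound and an assumption on my part, not the paper's accounting:

        # Rough weight memory: parameters * bytes per parameter.
        # Real usage is higher (activations, optimizer state, KV cache).
        BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

        def weight_memory_gb(n_params: float, dtype: str) -> float:
            """Approximate weight memory in GB for n_params parameters."""
            return n_params * BYTES_PER_PARAM[dtype] / 1e9

        for name, n in [("Llama 3.1 405B", 405e9),
                        ("DeepSeek R1 671B", 671e9),
                        ("Ling 300B", 300e9)]:
            print(f"{name}: fp32 ~ {weight_memory_gb(n, 'fp32'):,.0f} GB, "
                  f"fp8 ~ {weight_memory_gb(n, 'fp8'):,.0f} GB")

        # Llama 3.1 405B:   fp32 ~ 1,620 GB, fp8 ~ 405 GB
        # DeepSeek R1 671B: fp32 ~ 2,684 GB, fp8 ~ 671 GB
        # Ling 300B:        fp32 ~ 1,200 GB, fp8 ~ 300 GB

    Even at FP8, 300B parameters need roughly 300 GB for weights alone, so the 96 GB in Table 1 presumably describes per-device memory (as the device specs in the comment above suggest), not the total model footprint.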

  • by vednig on 3/30/25, 6:17 AM

    They've shared some interesting optimization techniques for bigger LLMs, that's all; these aren't exactly low-powered devices in the power-consumption sense. Still a good read.

  • by osti on 3/28/25, 3:26 AM

    I think this is the one where they train an LLM without NVIDIA GPUs.