from Hacker News

Llama 405B at 506 tokens/second on an H200

by moondistance on 10/14/24, 1:21 AM with 5 comments

  • by EgoIncarnate on 10/14/24, 2:58 AM

    not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"
  • by 7e on 10/14/24, 2:49 AM

    And this is why nobody submits MLPerf against NVIDIA.
  • by moondistance on 10/14/24, 1:21 AM

    Significant further optimizations. FP8!
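
For context on the comparison quoted in the first comment: tensor parallelism shards each layer's weights across all eight GPUs, which compute every layer together, while pipeline parallelism splits the layer stack into eight sequential stages, one per GPU. Below is a minimal sketch of the two eight-GPU layouts, assuming vLLM as the serving stack; vLLM and the model ID are illustrative choices, not necessarily what the article benchmarked.

    # Hypothetical eight-GPU configurations; vLLM and the model ID are
    # assumptions for illustration, not the article's benchmark setup.
    # In practice you would pick one layout or the other, not both at once.
    from vllm import LLM

    # Tensor parallelism: each layer's weight matrices are sharded across
    # all eight GPUs, so every GPU participates in every layer.
    llm_tp = LLM(model="meta-llama/Llama-3.1-405B-Instruct",
                 tensor_parallel_size=8)

    # Pipeline parallelism: the layer stack is cut into eight sequential
    # stages, one per GPU, with activations passed from stage to stage.
    llm_pp = LLM(model="meta-llama/Llama-3.1-405B-Instruct",
                 pipeline_parallel_size=8)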