from Hacker News

Llama 405B at 506 tokens/second on an H200

by moondistance on 10/14/24, 1:21 AM with 5 comments

  • by EgoIncarnate on 10/14/24, 2:58 AM

    not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"
  • by 7e on 10/14/24, 2:49 AM

    And this is why nobody submits MLPerf against NVIDIA.
  • by moondistance on 10/14/24, 1:21 AM

    Significant further optimizations. FP8!
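
For context on the comparison quoted in the first comment: tensor parallelism shards each layer's weights across all eight GPUs, which compute every layer together, while pipeline parallelism splits the layer stack into eight sequential stages, one per GPU. Below is a minimal sketch of the two eight-GPU layouts, assuming vLLM as the serving stack; vLLM and the model ID are illustrative choices, not necessarily what the article benchmarked.

    # Hypothetical eight-GPU configurations; vLLM and the model ID are
    # assumptions for illustration, not the article's benchmark setup.
    # In practice you would pick one layout or the other, not both at once.
    from vllm import LLM

    # Tensor parallelism: each layer's weight matrices are sharded across
    # all eight GPUs, so every GPU participates in every layer.
    llm_tp = LLM(model="meta-llama/Llama-3.1-405B-Instruct",
                 tensor_parallel_size=8)

    # Pipeline parallelism: the layer stack is cut into eight sequential
    # stages, one per GPU, with activations passed from stage to stage.
    llm_pp = LLM(model="meta-llama/Llama-3.1-405B-Instruct",
                 pipeline_parallel_size=8)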