Comments Page - Llama 405B 506 tokens/second on an H200

EgoIncarnate a year ago
Not "an H200". The article says: "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"
- FanaHOVA a year ago
  Title on HN is wrong. The article says GPUs and it's referring to one of their 8xH200 boxes.
7e a year ago
And this is why nobody submits MLPerf against NVIDIA.
- greenknight a year ago
  It's weird, I looked up whether AMD has any benchmarks on the 405B for the MI300X, and came across this one: https://dstack.ai/blog/amd-mi300x-inference-benchmark/#token...
  From my understanding, it can get up to around 2500 tokens/s? Both are 8x units (H200 and MI300X).
moondistance a year ago
Significant further optimizations. FP8!