I'm waiting for the AI bubble to burst so I can buy some of those at a fraction of the price :)
This is interesting, I thought the data center came online weeks ago
I assume others are already running 100k-GPU training clusters, but this article claims xAI's Colossus is the most powerful computer ever built, and that no other company has been able to connect this many GPUs together due to networking limitations. But haven't other companies purchased more GPUs? Are they simply running them as smaller separate clusters?
Smaller clusters are common because it depends on what you are trying to accomplish. It's the same problem as any distributed system: going wider with a map only works if you can perform a timely reduce. With checkpointing you run into things like network storage being too slow, and so on. There's also the fact that you might want to work on/with many models, and what you want for production (inference) is different from what you're trying to accomplish in development (training).
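The map/reduce tradeoff above can be sketched with a toy cost model. All the constants below are made up purely for illustration; real clusters have far more complicated communication patterns (e.g. ring or tree all-reduce over dedicated fabrics), but the shape of the curve is the point:

```python
# Toy model of one training step across N data-parallel workers.
# "Map": compute is sharded, so it shrinks as workers are added.
# "Reduce": gradients must be combined every step, and that
# communication cost grows with worker count.
# All numbers are illustrative, not measurements of any real system.

def step_time(n_workers, compute_total=100.0, per_hop_latency=0.05):
    compute = compute_total / n_workers        # map: scales down with width
    reduce_cost = per_hop_latency * n_workers  # reduce: grows with width
    return compute + reduce_cost

for n in (1, 10, 100, 1000):
    print(f"{n:>5} workers -> step time {step_time(n):.2f}")
```

Under this (hypothetical) model, step time improves up to a point and then gets *worse* as the reduce dominates, which is why simply buying more GPUs doesn't automatically yield one giant useful cluster.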
I think Meta’s cluster will be slightly larger than 100k GPUs next month, scaling to 350k by end of year, and double that sometime next year, according to their own info.
I haven’t seen any info on those being in a single cluster. They seem to just be referring to total GPUs. It’s currently almost impossible to create a single cluster of GPUs that size.
Ah, correct! It seemed weird to me that they would have figured that out while others had not.