I'm waiting for the AI bubble to burst so I can buy some of those at a fraction of the price :)
This is interesting, I thought the data center came online weeks ago
I assume others are already running 100k-GPU training clusters, but this article claims xAI's Colossus is the most powerful computer ever built, and that no other company has been able to connect this many GPUs together due to networking limitations. But haven't other companies purchased more GPUs? Are they simply running them as smaller separate clusters?
Smaller clusters are common because it depends on what you are trying to accomplish. It's the same problem as any distributed system: going wider with a map only works if you can perform a timely reduce. With checkpointing you run into things like network storage being too slow, and so on. There's also the fact that you might want to work on/with many models, and what you want for production (inference) is different from what you're trying to accomplish in development (training).
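The map/reduce tradeoff above can be sketched with a toy cost model. All the constants below are made up purely for illustration; real clusters have far more complicated communication patterns (e.g. ring or tree all-reduce over dedicated fabrics), but the shape of the curve is the point:

```python
# Toy model of one training step across N data-parallel workers.
# "Map": compute is sharded, so it shrinks as workers are added.
# "Reduce": gradients must be combined every step, and that
# communication cost grows with worker count.
# All numbers are illustrative, not measurements of any real system.

def step_time(n_workers, compute_total=100.0, per_hop_latency=0.05):
    compute = compute_total / n_workers        # map: scales down with width
    reduce_cost = per_hop_latency * n_workers  # reduce: grows with width
    return compute + reduce_cost

for n in (1, 10, 100, 1000):
    print(f"{n:>5} workers -> step time {step_time(n):.2f}")
```

Under this (hypothetical) model, step time improves up to a point and then gets *worse* as the reduce dominates, which is why simply buying more GPUs doesn't automatically yield one giant useful cluster.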
I think Meta’s cluster will be slightly larger than 100k GPUs next month, scaling to 350k by end of year, and double that sometime next year, according to their own info.
I haven’t seen any info on those being in a single cluster. They seem to just be referring to total GPUs. It’s currently almost impossible to create a single cluster of GPUs that size.
Ah, correct! It seemed weird to me that they would have figured that out while others had not.