Comments Page - 100x defect tolerance: How we solved the yield problem

« Back 100x defect tolerance: How we solved the yield problemcerebras.aiSubmitted by jwan584 9 months ago

ChuckMcM 9 months ago
I think this is an important step, but it skips over that 'fault tolerant routing architecture' means you're spending die space on routes vs transistors. This is exactly analogous to using bits in your storage for error correcting vs storing data.
That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage it benefits from every core is the same and you don't need to get to every core directly (pin limiting).
In the early 2000's I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.
While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage. I appreciate the the amount of money people are willing to put at risk here allow for folks to try these "out of the box" kinds of ideas.
[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.
[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.
- dogcomplex 9 months ago
  Of course many people are going to collectively lose trillions, AI's a very highly hyped industry with people racing into it without an intellectual edge and any temporary achievement by any one company will be quickly replicated and undercut by another using the same tools. Economic success of the individuals swarming on a new technology is not a guarantee whatsoever, nor is it an indicator of the impact of the technology.
  Just like the dotcom bubble, AI is gonna hit, make a few companies stinking rich, and make the vast majority (of both AI-chasing and legacy) companies bankrupt. And it's gonna rewire the way everything else operates too.
  idiotsecant 9 months ago
  >it's gonna rewire the way everything else operates too.
  This is the part that I think a lot of very tech literate people don't seem to get. I see people all the time essentially saying 'AI is just autocomplete' or pointing out that some vaporware ai company is a scam so surely everyone is.
  A lot of it is scams and flash in the pan. But a few of them are going to transform our lives in ways we probably don't even anticipate yet, for good and bad.
  Retric 9 months ago
  I’m not so sure it’s going to even do that much. People are currently happy to use LLM’s, but the outputs aren’t accurate and don’t seem to be improving quickly.
  A YouTuber watch regularly includes questions they asked Chat GPT and very single time there’s a detailed response in the comments showing how the output is wildly wrong from multiple mistakes.
  I suspect the backlash from disgruntled users is going to hit the industry hard and these models are still extremely expensive to keep updated.
  Thews 9 months ago
  Using function calls for correct answer lookup already practically eliminates this, it's not wide spread yet, but the ease of doing it is already practical for many.
  New models aren't being trained specifically on single answers which will only help.
  The expense for the larger models is something to be concerned about. Small models with function calls is already great, especially if you narrow down what they are being used for. Not seeing their utility is just a lack of imagination.
  dogcomplex 9 months ago
  Right. And any particular question people think AIs are bad at also has a comments section of people who have run better crafted prompts that do the job just fine. The consensus is heading more towards "well damn, actually LLMs might be all we need" rather than "LLMs are just a stepping stone" - but either way, that's fine, cuz plenty of more advanced architecture uses are on their way (especially error correction / consistency frameworks).
  I dont believe there are any significant academic critiques doubting this. There are a lot of armchair hot takes, and perceptions that this stuff isn't improving up to their expectations, but those are pretty divorced from any rigorous analysis of the field, which is still improving at staggeringly fast rates compared to any other field of research. Aint no wall, folks.
  Retric 9 months ago
  “Crafting a better prompt” is often simply spinning an RNG again and again until you end up with an answer that happens to be good enough.
  In the real world if you know the correct answer you don’t need to ask the question. A self driving car that needs you to pay attention isn’t self driving.
  Any system can get canned response, the value of AI is completely in its ability to handle novelty without hand holding. And none of these systems actually do that even vaguely well in practice rather than providing response that are vaguely close to correct.
  If I ask for a summary of an article and it gets anything wrong in the article that’s a 0 because now I need to read the article to know what it said. Arguably the value is actually negative here.
  dogcomplex 9 months ago
  Any time prompt crafting matters is just when demonstrating the current edge of capabilities - next iteration, you can get away with a much more general/primitive prompt. Those are just people countering the "gotcha" arguments people try to levy against LLMs, showing that even now those tasks can be done with a good prompt. Anytime it's a practical concern though - just wait a little longer for the next model to smooth that out.
  You don't have to pay attention, that's the point. You can code without reading code now. Sure you gotta tell it what the app looks like with each iteration - but again, that's temporary til the next model comes out with good enough vision to assess that itself. None of this is permanently planning on requiring human interaction - it's just early days and these are progressing through mediums one at a time.
  They're not canned responses either. They're bespoke mixtures of all the various elements of the current environment/context translated to an answer. It certainly handles novelty - that's the whole point. They certainly handle plenty of novelty - like entire mediums of text and images - to expert levels. I think you're just being greedy for more, here.
  As for consistency and avoiding error? There are benchmarks for that. There are error checking methods. Those are all steadily improving too, and are already well-consistent on easier topics/mediums. It would be foolish to think that's innately impossible from AI for remaining ones.
  ithkuil 9 months ago
  Dollars are not lost; they are just very indirectly invested into gpu makers (and energy providers)
  Melomomololo 9 months ago
  [dead]
- girvo 9 months ago
  > Xilinx was still aggressively suing people who put SERDES ports on FPGAs
  This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!
  ChuckMcM 9 months ago
  Well this was the patent they were threatening with as I recall (https://patents.google.com/patent/US20030023912A1/en) and there was this one too: https://patents.google.com/patent/US5576554A/en
  Basically the "secret sauce" of the startup recruiting me was that they were going to do wafer scale FPGAs that could be tiled together to build arbitrarily complex systems like military phased array radars and such. All very hush hush but apparently they had recruited some key talent from Xilinx which was annoying Xilinx.
  nroize 9 months ago
  Not OP but I was curious too. Here's all I could find that seemed related: https://www.businesswire.com/news/home/20200121005582/en/Xil...
- enragedcacti 9 months ago
  Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me a ~6% yield as most trials had active core counts in the high 800,000s
  ChuckMcM 9 months ago
  I could guess that it helps with heat dissipation/management. But I don't know. That guess is from looking at the list of patents[1] they have.
  [1] https://patents.justia.com/assignee/cerebras-systems-inc
  projektfu 9 months ago
  They did mention that they stash extra cores to enable the re-routing. Those extra cores are presumably unused when not routed in.
  enragedcacti 9 months ago
  That was my first thought but based on the rerouting graphic it seems like the extra cores would be one or two rows and columns around the border which would only account for ~4000 cores.
  projektfu 9 months ago
  If the system were broken down into more subdivisions internally, there would be more cores dedicated to replacement. It seems like it could be more difficult to reroute an entire row or column of cores on a wafer than a small block. Perhaps, also, they are building in heavy redundancy for POC and in the future will optimize the number of cores they expect to lose.
- __Joker 9 months ago
  "While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"
  Can you please explain more why you think so ?
  Thank you.
  mschuster91 9 months ago
  It's a hype cycle with many of the hypers and deciders having zero idea about what AI actually is and how it works. ChatGPT, while amazing, is at its core a token predictor, it cannot ever get to an AGI level that you'd assume to be competitive to a human, even most animals.
  And just as every other hype cycle, this one will crash down hard. The crypto crashes were bad enough but at least gamers got some very cheap GPUs out of all the failed crypto farms back then, but this time so much more money, particularly institutional money, is flowing around AI that we're looking at a repeat of Lehman's once people wake up and realize they've been scammed.
  dsign 9 months ago
  Those glorified token predictors are the missing piece in the puzzle of general intelligence. There is a long way to go still in putting all those pieces together, but I don't think any of the steps left are in the same order of "we need a miracle breakthrough".
  That said, I believe that this is going one of two ways: we use AI to make things materially harder for humans, in a scale from "you don't get this job" to "oops, this is Skynet", with many unpleasant stops in the middle. By the amount of money going into AI right now and most of the applications I'm seeing being hyped, I don't think we have have any scruples with this direction.
  The other way this can go, and Cerebras is a good example, is that we increase our compute capability and our AI-usefulness to a point where we can fight cancer and stop/revert aging, both being a computational problem at this point. Even if most people don't realize it, or most people have strong moral objections to this outcome and don't even want to talk about it, so it probably won't happen.
  In simpler words, I think we want to use AI to commit species suicide :-)
  Shorel 9 months ago
  I'm sure there are more missing pieces.
  We are more than Broca's areas. Our intelligence is much more than linguistic intelligence.
  However, and this is also an important point, we have built language models far more capable than any language model a single human brain can have.
  Makes me shudder in awe of what's going to happen when we add the missing pieces.
  idiotsecant 9 months ago
  Yes, I sometimes wonder if what we're witnessing in our lifetimes is the next stage of the 'bootstrapping' of life into a more complex form. If we might be the mitochondria contributing our little piece to the cell that comes after.
  KronisLV 9 months ago
  > And just as every other hype cycle, this one will crash down hard.
  Isn't that an inherent problem with pretty much everything nowadays: crypto, blockchain, AI, even the likes of serverless and Kubernetes, or cloud and microservices in general.
  There's always some hype cycle where the people who are early benefit and a lot of people chasing the hype later lose when the reality of the actual limitations and the real non-inflated utility of each technology hits. And then, a while later, it all settles down.
  I don't think the current "AI" is special in any way, it's just that everyone tries to get rich (or benefit in other ways, as in the microservices example, where you still very much had a hype cycle) quick without caring about the actual details.
  anon373839 9 months ago
  > I don't think the current "AI" is special in any way
  As someone who loves to pour ice water on AI hype, I have to say: you can't be serious.
  The current AI tech has opened up paths to develop applications that were impossible just a few years ago. Even if the tech freezes in place, I think it will yield substantial economic value in the coming years.
  It's very different from crypto, the main use case for which appears to be money laundering.
  pixelfarmer 9 months ago
  > Even if the tech freezes in place, I think it will yield substantial economic value in the coming years.
  The question is, where will this "economic value" be? Because "economic value" and actual progress that helps society are two very different things. For example, if someone wants to hire people, they can use "AI" to sift through the applications. But people looking for a job can also use "AI" to write their applications. In the end you may have created "economic value", but its an arms race and a waste of resources at the core, more digital paperwork, a waste of compute. So the actual value is not positive, it is negative. And we see that in many places where this so called "AI" is supposed to help.
  Does it mean it is entirely useless? No, but the field of applications where it actually makes sense and has an overall net benefit is way smaller than many believe.
  Plus, there are different types of neural networks in use for decades already. Look at OCR, for example, where the commercial OCR software switched to neural networks around the mid 90s already. So it is not that "AI" as such is bad, just that this generative neural network stuff is overly hyped by people who have absolutely no clue about it, but have to hop on the bandwagon to not be left out and keep shareholder values up, because most of these shareholders are equally stupid about the whole issue. Its a circus that burns many resources, money that could have created way more value in other areas.
  idiotsecant 9 months ago
  >the main use case for which appears to be money laundering.
  You say tomato, I say freedom from the tyranny of fiat power structures.
  carlmr 9 months ago
  >It's very different from crypto, the main use case for which appears to be money laundering.
  Which has substantial economic value (for certain groups of people).
  lazide 9 months ago
  According to this random estimate, black market economy alone in just the US is worth ~ $2 trillion/yr. [https://www.investopedia.com/terms/u/underground-economy.asp]
  Roughly 11-12% of GDP.
  In many countries, black+grey market is larger than the ‘white’ market. The US is notoriously ‘clean’ compared to most (probably top 10).
  Even in the US, if you suddenly stopped 10-12% of GDP we’re talking ‘great depression’ levels of economic pain.
  Honestly, the only reason Crypto isn’t bigger IMO is because there is such a large and established set of folks doing laundering in the ‘normal’ system, and those work well enough there is not nearly as much demand as you’d expect.
  carlmr 9 months ago
  >Honestly, the only reason Crypto isn’t bigger IMO is because there is such a large and established set of folks doing laundering in the ‘normal’ system, and those work well enough there is not nearly as much demand as you’d expect.
  Correct me if I'm wrong. Doesn't the public ledger make it really bad for this purpose when enough is known about some of the actors interacting on the chain here? It's basically an exact recording that everybody can see, of who interacted with whom, based on pseudonyms, right?
  lazide 9 months ago
  You are not wrong.
  Cash is superior in the ability to have intermediate transactions hidden. However it has similar issues in that bills are serialized, and I believe scanned/tracked - or at least there is the capability to do so. So in theory, when moving it out of and back into the system, it can be tied to real identities not just pseudonyms too.
  Though frankly, it seems like no one really cares most of the time.
  KronisLV 9 months ago
  > The current AI tech has opened up paths to develop applications that were impossible just a few years ago.
  My argument is that if it's advertised as a direct precursor to AGI based on wishful thinking and people don't know any better, then it's no different to claims about how putting blockchain technology in X industry will solve all of its problems.
  I use LLMs daily and don't scoff at AI generated imagery or use cases like agentic systems, but there absolutely is a similar hype cycle to every other innovation out there where people are borderline delusional in the initial stages (Kubernetes will solve all of our issues, moving to cloud and microservices will solve all of our issues, the blockchain will...), before the limitations crystallize and we know what each technology is good or bad at.
  Though maybe that says more about human nature than the technology itself.
  > It's very different from crypto, the main use case for which appears to be money laundering.
  That's akin to saying "The main use case for AI appears to be stealing people's art and even for writers and others it seems to be firing people to replace them with soulless AI generated slop."
  I'd even argue that there's nothing wrong with the technologies themselves, be it LLMs, AI for image, video, audio generation, blockchain and crypto, or whatever. The problems arise based on how the technologies are used, or in my argument above - how they're touted as the solution to all the problems. Some people profit a lot, others collide with reality and their limitations at speed.
  In other words, if the technology will generate 100 billion USD of actual value but people are betting on 500 billion USD, then clearly we have a bit of an issue.
  carlhjerpe 9 months ago
  Both Kubernetes and serverless (FaaS) is here to stay. Microservices is just an excuse to build shit software.
  KronisLV 9 months ago
  > Both Kubernetes and serverless (FaaS) is here to stay.
  *in environments and projects where they are a good fit
  > Microservices is just an excuse to build shit software.
  *in environments and projects where they are a bad fit
  idiotsecant 9 months ago
  All the big LLMs are no longer just token predictors. They are beginning to incorporate memory, chain of thought, and other architectural tricks that use the token predictor in novel ways to produce some startlingly useful output.
  It's certainly the case that an LLM alone cannot achieve AGI. As a component of a larger system though? That remains to be seen. Maybe all we need to do is duct tape a limbic system and memory onto an LLM and the result is something sort of like an AGI.
  It's a little bit like saying that a ball bearing can't possibly ever be an internal combustion engine. While true, it's sidestepping the point a little bit.
  Shorel 9 months ago
  While I basically agree with everything you say, I have to add some caveats:
  ChatGPT, while being as far from true AGI as the Elisa chatbot written in Lisp, is extraordinarily more useful, and being used for many things that previously required humans to write the bullshit, like lobbying and propaganda.
  And Crypto... right now BTC is at an historical highest. It could even go higher. And it will eventually crash again. It's the nature of that beast.
  immibis 9 months ago
  Why do you think that an AGI can't be a token predictor?
  Shorel 9 months ago
  By analogy with human brains: Because our own brains are far more than the Broca's areas in them.
  Evolution selects for efficiency.
  If token prediction could work for everything, our brains would also do nothing else but token prediction. Even the brains of fishes and insects would work like that.
  The human brain has dedicated clusters of neurons for several different cognitive abilities, including face recognition, line detection, body parts self perception, 3D spatial orientation, and so on.
  myrmidon 9 months ago
  > Evolution selects for efficiency.
  I think this is a poor argument here. From an evolutionary point of view, our brains are optimized to:
  - Provide fine-motor control to craft weapons and tools (enhancing adaptibility and enabling us to hunt way outside our weight class)
  - Communicate/coordinate effectively in small groups
  - Do sensor processing and the above with a low energy budget
  Our brains are *not* selected to be minimum-complexity intelligences, and a lot of what our brain does is completely useless for AGI building (motor control, sensor processing, ...).
  Furthermore, the cost/complexity (from a evolutionary PoV) is a totally different beast from what complexity means to us.
  Just consider flight as an example: A fruitfly is an insanely simple and straightforward beast, but to us, a biochemically fuelled, beating-wing design is still infeasibly complicated. If our approach to flight had been to ape after how nature does it in detail, we likely still would not have planes.
  I do agree that todays LLMs still have clear architectural flaws that we need to overcome (online learning being a very glaring one), but, to pick up the flight analogy, we might well have the main wing structure already down, and we won't necessarily have to make those wings beat to get into the air...
  Shorel 9 months ago
  Just because there are some parts of our brains that are not needed for an AGI...
  Doesn't mean that there aren't some part of our brains that are needed for an AGI, and are not present in the current crop of LLM.
  bcrl 9 months ago
  I think it means that AGIs trained in different ways will have different strengths, just like people.
  immibis 9 months ago
  What do our brains do that isn't token prediction?
  They receive information about photons and air vibrations and control muscles, okay. If a human brain was hooked up the way ChatGPT was, only to text input and output, would that make it not intelligent?
  Shorel 9 months ago
  > What do our brains do that isn't token prediction?
  I am planning a masters and phd on that question, so give me a few years to answer.
  mschuster91 9 months ago
  Because an LLM _by definition_ cannot even do basic maths (well, except if you're OpenAI and cheat your way around it by detecting if the user asks a simple math question).
  I'd expect an actually "general" intelligence Thing to be able to be as versatile in intellectual tasks as a human is - and LLMs are reasonably decent at repetition, but cannot infer something completely new from the data it has.
  versteegen 9 months ago
  Define "by definition".
  Because this statement really makes no sense. Transformers are perfectly capable (and capable of perfectly) learning mathematical functions, given the necessary working-out space, e.g. for long division or for algebraic manipulation. And they can learn to generalise from their training data very well (although very data-inefficiently). That's their entire strength!
  dogcomplex 9 months ago
  Yet they can get silver medal PhD level competition math scores.
  Perhaps your "definition" should be simply that LLMs have temporarily seen limitations in their ability to natively do math unassisted by an external memory, but are exceptionally good at very advanced math when they can compensate for their lossy short-term attention memory...
  CamperBob2 9 months ago
  it cannot ever get to an AGI level that you'd assume to be competitive to a human, even most animals.
  Suppose you turn out to be wrong. What would convince you?
  ChuckMcM 9 months ago
  It could diagram a sentence it had never seen.
  karolist 9 months ago
  It does that all the time even now. Play with temp, top_k, min_p and nothing is preventing you from getting combinations and order of tokens previously not encountered in history.
  CamperBob2 9 months ago
  A language model can generate valid sentences that have never existed before -- not just syntactically or statistically valid sentences in a Markovian sense, but semantically valid ones. How could it possibly do that without an implicit awareness of all aspects of language, including grammar and parts of speech?
  That implicit awareness is indistinguishable from knowledge, at least to the extent it yields correct results.
  And the fact that it was inferred from examples rather than taught explicitly is indistinguishable from intelligence.
  kragen 9 months ago
  I did eventually get an LLM to produce what seems to be a correct diagram of a sentence it had never seen, but it took about ten tries. Grammatical analysis seems to have happened correctly every time, but getting to a usable diagram was difficult.
  (I know that it's generally rude to include LLM output in HN comments, but in this case I think it's essential supporting material to elevate the discussion of LLM capabilities above "yes it is", "no it isn't".)
  First, I just asked ChatGPT (GPT-4o mini, what they offer anonymous free users):
  > Please diagram the sentence, "If my mother's road had been cultural, my uncle's sloth squad residue would have compiled a periodic meadow of night."
  I just invented that sentence, picking several words with a random number generator, but of course I can't prove to you that I didn't copy it from some book you haven't seen. (For that, you'd have to invent a sentence yourself, or at least provide me with some words to include in it.) I hope you can at least agree that it's a highly unlikely sentence for anyone ever to have uttered before!
  After talking to itself for a few paragraphs, it arrived at the following grammatical analysis:
  Step 4: Diagramming the sentence Main Clause: Subject: "my uncle's sloth squad residue" Verb: "would have compiled" Direct Object: "a periodic meadow of night" "a periodic meadow" (noun phrase) "of night" (prepositional phrase modifying "meadow") Conditional Clause: Subject: "my mother's road" Verb: "had been" (linking verb) Predicate Adjective: "cultural"
  This is exactly correct. From that it produced the following ASCII art "diagram", which looks like nonsense to me, but maybe it's in some kind of sentence-diagramming formalism I haven't seen:
  If ______/______ | | my mother's had been road | cultural my uncle's sloth squad residue | would have compiled | a periodic meadow of night | of night
  I told it the diagram looked like nonsense, and it produced this one, which looks worse:
  If _______/______ | | my mother's road had been | | cultural (Main Clause) | my uncle's sloth squad residue | would have compiled | a periodic meadow | of night
  I asked GPT-4 (paid) the same question, and it gave me another exactly correct grammatical analysis in words:
  1. **Identify the main clauses**: - "If my mother's road had been cultural" is a dependent clause (specifically, a conditional clause). - "my uncle's sloth squad residue would have compiled a periodic meadow of night" is the main clause. 2. **Start with the main clause**: - Subject: "my uncle's sloth squad residue" - "my uncle's" modifies "sloth squad," which in turn modifies "residue." - Predicate: "would have compiled" - This includes the modal auxiliary verb ("would"), perfect auxiliary ("have"), and main verb ("compiled"). - Direct object: "a periodic meadow of night" - “a” is an article modifying “meadow,” - “periodic” modifies “meadow,” - “of night” is a prepositional phrase modifying “meadow.” 3. **Diagram the dependent clause**: - Subject: "road" - Modified by possessive noun phrase “my mother’s” - Predicate: "had been" - Complement/Adjective: "cultural"
  (I feel like this is definitely "a level that you'd assume to be competitive to a human", in mschuster91's phrase; most native English speakers can't do that.)
  But then it came up with the following incorrect diagram:
  +---------------------------------------------------+ | Main Clause | +---------------------------------------------------+ / | \ / | \ / | \ [My uncle's] [sloth squad] [residue] \ / / \ / / \ / / would have compiled ------------------- \ \ \ [a periodic meadow of night] / | \ / | \ a periodic [of night] +---------------------+ | Dependent Clause | +---------------------+ If | road ___|_____ | | my mother's had been cultural
  Telling it the diagram was nonsense didn't help.
  But my conclusion from this is not that GPT-4 can't analyze the grammatical structure of sentences it had never seen. It's that GPT-4 gets "confused" when it tries to draw ASCII art, like a person trying to tell you how to draw it over the telephone, but not quite as bad. I've seen similarly abominable results when it tried to draw circuit schematics and guitar tablatures as ASCII art. I tried asking it for a graphviz diagram, which was syntactically correct graphviz code but didn't express the sentence structure properly.
  I tried simplifying the problem with the prompt, 'Please provide a syntactic tree showing the syntactic derivation of the sentence, "My uncle's sloth squad residue would have compiled a periodic meadow of night," using graphviz syntax and with individual words at the leaf nodes of the tree.' which yielded a slightly incorrect diagram. (I'm still sure nobody has ever uttered this sentence before.) On being told, "This diagram is not quite right.", GPT-4 produced the following diagram, which I believe is correct except for an extra "s" on "uncle" (you can see this in your browser at https://is.gd/Axufjx):
  digraph G { node [shape=plaintext]; S [label="S"]; NP [label="NP"]; VP [label="VP"]; S -> NP; S -> VP; // Noun Phrase breakdown DP_my [label="DP"]; DP_my -> My; UnclePossessive [label="NP"]; UnclePossessive -> Uncles; UnclePossessive -> ApostropheS; NP -> DP_my; NP -> UncleCompound; UncleCompound [label="NP"]; SlothSquadResidue[label = "N'"]; SlothSquadResidue-> Sloth; SlothSquadResidue-> SquadResidue; SquadResidue[label = "N'"] SquadResidue-> Squad; SquadResidue-> Residue; UncleCompound-> UnclePossessive; UncleCompound-> SlothSquadResidue; // Verb Phrase breakdown ModalWould[label = "Modal"]; ModalWould-> Would; AuxHave[label = "Aux"]; AuxHave-> Have; VCompiled[label = "V"]; VCompiled-> Compiled; ObjectNP[label = "NP"]; // Direct object noun phrase ArticleA[label = "D"]; ArticleA-> A; MeadowPhrase[label = "N'"]; // 'Meadow' modified by 'periodic' AdjPeriodic[label = "Adj"]; AdjPeriodic-> Periodic; MeadowWord[label = N]; MeadowWord -> Meadow; MeadowPhrase -> AdjPeriodic; MeadowPhrase -> MeadowWord; PPofNight[label = PP]; // Prepositional phrase 'of night' PrepOf[label=P] PrepOf-> Of; NightNounPhrase[label=N] NightNounPhrase-> Night; PPofNight -> PrepOf; PPofNight -> NightNounPhrase; ObjectNP -> ArticleA; ObjectNP -> MeadowPhrase; ObjectNP -> PPofNight; VP -> ModalWould; VP -> AuxHave; VP -> VCompiled VP -> ObjectNP }
  Claude produced a reasonably serviceable diagram on the first try with that prompt:
  digraph SyntacticTree { node [shape=plaintext]; S -> NP1; S -> VP; NP1 -> DET1 [label="Det"]; NP1 -> NP2 [label="NP"]; NP2 -> NP3 [label="NP"]; NP2 -> N1 [label="N"]; NP3 -> N2 [label="N"]; NP3 -> POS [label="Pos"]; VP -> AUX1 [label="Aux"]; VP -> AUX2 [label="Aux"]; VP -> V [label="V"]; VP -> NP4 [label="NP"]; NP4 -> DET2 [label="Det"]; NP4 -> ADJ [label="Adj"]; NP4 -> N3 [label="N"]; NP4 -> PP [label="PP"]; PP -> P [label="P"]; PP -> NP5 [label="NP"]; NP5 -> N4 [label="N"]; DET1 -> "My"; N2 -> "uncle"; POS -> "'s"; N1 -> "sloth"; N1 -> "squad"; N1 -> "residue"; AUX1 -> "would"; AUX2 -> "have"; V -> "compiled"; DET2 -> "a"; ADJ -> "periodic"; N3 -> "meadow"; P -> "of"; N4 -> "night"; }
  On being told, I think incorrectly, "This diagram is not quite right.", it produced a worse diagram.
  So LLMs didn't perform nearly as well on this task as I thought they would, but they also performed much better than you thought they would.
  emkee 9 months ago
  Having only taken one syntax class for fun in college, I find this pretty impressive. Generating syntax trees was never a trivial task for me (but I was just a CS major who needed a credit). Slightly related, but I have also never had ChatGPT successfully generate ASCII art, even with extensive conversation.
  stonemetal12 9 months ago
  If you ask it to draw a dinosaur it does an okay brontosaurs.
  __ / _) _.----._/ / / / __/ ( | ( | /__.-'|_|--|_|
  Asking for a Tyrannosaurus Rex gives you more or less the same brontosaurs:
  __ / _) _.----._/ / / / __/ ( | ( | /__.-'|_|--|_|
  kragen 9 months ago
  Yeah, I think it would be very challenging for most people. It did considerably better with Graphviz than with ASCII art, but it still had trouble with the transition from a perfectly correct and highly nuanced verbal grammatical analysis to Graphviz. I think this is pretty convincing evidence against ChuckMcM's implicit position. It's weaker evidence against mschuster91's explicit position because parsing is something computers have been doing for a long time, so it doesn't imply any new capabilities.
  I'm pretty sure there are part-of-speech tagging parsers using fairly shallow statistics that could also have produced an equivalently good sentence diagram. https://corenlp.run/ seems to produce a correct parse, though in a different format.
  CamperBob2 9 months ago
  (I know that it's generally rude to include LLM output in HN comments, but in this case I think it's essential supporting material to elevate the discussion of LLM capabilities above "yes it is", "no it isn't".)
  You just have to be prepared to take a karma hit for it. The audience here does not consist largely of 'hackers', but seems to skew toward the sort of fearful, resentful reactionaries that hacker culture traditionally opposes.
  I will say I wouldn't peg ChuckMcM as being one of the reactionaries, though. That would be an unpleasant surprise.
  As far as the diagram goes, my guess is that sentence diagrams were underrepresented in the training corpus. Diagramming sentences was already out of fashion when I was in school in the 1980s -- in fact, I don't recall ever having done it. The model is struggling much the same way you'd expect a grade-school student (or me, I guess) to struggle upon being asked to perform the task for the first time.
  Knowing when to say "I don't know how to do that" is still a foundational weakness of LLMs, but I don't expect it to remain unaddressed for long. We will see improvement in that area, sooner or later. The anklebiters will respond by moving their goalposts and hitting the downvote button as usual.
  kragen 9 months ago
  ASCII art Reed–Kellogg sentence diagrams are probably hard to find anywhere, and Graphviz can't really express Reed–Kellogg diagrams. But Reed and Kellogg published their somewhat ad-hoc diagram language in 01877, 78 years before what we now call "linguistics" was known in the West thanks to Chomsky's work in 01955. These are among the reasons I thought it might be a good idea to use the form of sentence diagrams used by linguists instead of the more compact Reed–Kellogg diagrams.
  ChuckMcM 9 months ago
  I would guess you're not asking a serious question here but if you were feel free to contact me, it's why I put my email address in my profile.
  bigdict 9 months ago
  Why are you assuming bad faith?
  ChuckMcM 9 months ago
  What gave you the impression I was assuming bad faith? It's off topic to the discussion (which is fine) but can be annoying in the middle of an HN thread.
  ossopite 9 months ago
  Without offering any opinion on its merits, if you think justifying this controversial claim is off topic, then so is the claim and you shouldn't have written it.
  bigdict 9 months ago
  > What gave you the impression I was assuming bad faith?
  You said "I would guess you're not asking a serious question here"
  kragen 9 months ago
  You said, "I would guess you're not asking a serious question here," which is to say, you were guessing that the question was asked in bad faith. Or, at any rate, you would, if for some reason the question came up, for example in deciding how to answer it. Which is what you were doing. That is to say, you did guess that it was asked in bad faith. Given the minimal amount of evidence available (12 words and a nickname "__Joker") I think it's reasonable to describe that guess as an assumption. Ergo, you were assuming bad faith.
  ripped_britches 9 months ago
  It was a direct quote from your original comment
  bruce343434 9 months ago
  You brought it up...
  __Joker 9 months ago
  Really sorry, if the question came as snarky or if otherwise. Those were not my intent.
  Related to AI given all around noise, really wanted to understand kind of contrarian view of monetary aspects.
  Once again, apologies if the question seems frivolous.
- undefined 9 months ago
  [deleted]
ajb 9 months ago
So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, this is massively outweighed by the fact that they only get 46222 sq mm useable area out of the wafer, as opposed to 56247 that the H100 gets - because they are using a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm!
Not sure how that's a win.
Unless the rest of the wafer is useable for some other customer?
- nine_k 9 months ago
  It's a win because they have to test one chip, and don't have to spend resources on connecting the chiplets. The latter costs a lot (though it has other advantages). I suspect that a chiplet-based device with total 900k cores would just be not viable due to the size constraints.
  If their routing around the defects is automated enough (given the highly regular structure), it may be a massive economy of efforts on testing and packaging the chip.
- ungreased0675 9 months ago
  Why does it have to be a square? There’s no need to worry about interchangeable third-party heat sink compatibility. Is it possible to make it an irregular polygon instead of square?
- kristjansson 9 months ago
  Additional wafer area would be a marginal increase in performance (+~20% core core best case) but increases the complexity of their design, and requires they figure out how to package/connect/house/etc. a non-standard shape. A wafer scale chip is already a huge tech risk, why spend more novelty budget on nonessential weirdness?
- Scaevolus 9 months ago
  Why does their chip have to be rectangular, anyways? Couldn't they cut out a (blocky) circle too?
  Qwertious 9 months ago
  You need a rectilinear polygon that tessellates, and has the fewest sides possible to minimize the number of cuts necessary. And it would probably help the cutting if the shape is entirely convex, so that cuts can overshoot a bit without damaging anything.
  That suggests a rectangle is the only possible shape.
  CorrectHorseBat 9 months ago
  If it's just one chip per wafer, why even bother cutting?
  timerol 9 months ago
  Why does it need to tessellate if there's only one chip per wafer?
  nine_k 9 months ago
  Rather I wonder why do they even need to cut the extra space, instead of putting something there. I suppose that the structure of the device is highly rectangular from the logical PoV, so there's nothing useful to put there. I suspect smaller unrelated chips can be produced on these areas along the way.
  guyzero 9 months ago
  I've never cut a wafer, but I assume cutting is hard and single straight lines are the easiest.
  sroussey 9 months ago
  I wonder if you could… just not cut the wafer at all??
  ryao 9 months ago
  I suspect this would cause alignment issues since you could literally rotate it into the wrong position when doing soldering. That said, perhaps they could get away with cutting less and using more.
  CorrectHorseBat 9 months ago
  They already have a notch or flat for alignment, which is much more critical during the lithography process than during soldering.
  Dylan16807 9 months ago
  If you want to have nice straight edges to clamp into place, then you only need to shave off four slivers. You can lose a couple percent instead of more than a third.
  sharpesttool 9 months ago
  You just need a sharpie to mark the top.
  daedrdev 9 months ago
  That's the idea in the article. Just one big chip. But the reason why it's normally done is that there is a pretty high defect rate, so cutting if every wafer has 1-2 defects you still get (X-1.5) devices per wafer. In the article thy go into how they avoid this problem (I think its better fault tolerance, at a cost)
  gpm 9 months ago
  The article shows them using a single maximally sized square portion of a circular wafer.
  I think the proposal you're responding to is "just use the whole circular wafer without cutting out a square".
  axus 9 months ago
  Might be jumping in without reading, but the chips you cut out of the wafer have to be delivered to physically different locations.
  ajb 9 months ago
  Normally yes. But they're using a whole wafer for a single chip! So it's actually a good idea.
  I guess the issue is how do you design your routing fabric to work in the edge regions.
  Actually I wonder how they are exposing this wafer. Normal chips are exposed in a rectangular batch called a reticle. The reticle mask has repeated patterns across it, and it is then exposed repeatedly across the wafer. So either they have to make a reticle mask the full size of the wafer, which sounds expensive, or they somehow have to precisely align reticle exposures so that the joined edges form valid circuits.
  yannyu 9 months ago
  The cost driver for fabbing out wafers is the number of layers and the number of usable devices per wafer. Higher layer count increases cost and tends to decrease yield, and more robust designs with higher yields increase usable devices per wafer. If circles or other shapes could help with either of those, they would likely be used. Generally the end goal is to have the most usable devices per wafer, so they'll be packed as tightly as possible on the wafer so as to have the highest potential output.
  Scaevolus 9 months ago
  Right, but they're making just one usable device per wafer already.
  undefined 9 months ago
  [deleted]
- olejorgenb 9 months ago
  Is the wafer itself so expensive? I assume they don't pattern the unused area, so the process should be quicker?
  addaon 9 months ago
  > I assume they don't pattern the unused area
  I’m out of date on this stuff, so it’s possible things have changed, but I wouldn’t make that assumption. It is (used to be?) standard to pattern the entire wafer, with partially-off-the-wafer dice around the edges of the circle. The reason for this is that etching behavior depends heavily on the surrounding area — the amount of silicon or copper whatever etched in your neighborhood affects the speed of etching for you, which effects line width, and (for a single mask used for the whole wafer) thus either means you need to have more margin on your parameters (equivalent to running on an old process) or have a higher defect right near the edge of the die (which you do anyway, since you can only take “similar neighborhood” so far). This goes as far as, for hyper-optimized things like SRAM arrays, leaving an unused row and column at each border of the array.
  kurthr 9 months ago
  All the process steps are limited by wafers for hour. Lithography (esp EUV) might be slightly faster, but that's not 30% of total steps, since you generally have deposit and etch/implant for every lithography step.
  It's close to a dead loss in process cost.
  yannyu 9 months ago
  > I assume they don't pattern the unused area, so the process should be quicker?
  The primary driver of time and cost in the fabrication process is the number of layers for the wafers, not the surface area, since all wafers going through a given process are the same size. So you generally want to maximize the number of devices per wafer, because a large part of your costs will be calculated at the per-wafer level, not a per-device level.
  mattashii 9 months ago
  Yes, but isn't a big driver of layer costs the cost of the machines to build those layers?
  For patterning, a single iteration could be (example values, no actual values used, probably only ballpark accuracy) on a 300M$ EUV machine with 5-year write off cycle, patterns on average 180 full wafers /hour. Excluding energy usage and service time, each wafer that needs full patterning would cost ~38$. If each wafer only needed half the area patterned, the lithography machine might only spend half its usual time on such a wafer, and that could double the throughput of the EUV machine, halving the write-off based cost component of such a patterning step.
  Given that each layer generally consists of multiple patterning steps, a 10-20% reduction in those steps could give a meaningful reduction in time spent in the machines whose time spend on the wafer depends on the used wafer area.
  This of course doesn't help reduce time in polishing or etching (and other steps that happen with whole wafers at a time), so it won't be as straightforward as % reduction in wafer area usage == % reduction in cost, but I wouldn't be surprised if it was a meaningful percentage.
  yannyu 9 months ago
  > Yes, but isn't a big driver of layer costs the cost of the machines to build those layers?
  Let's say the time spent in lithography step is linear the way you're describing. Even with that, the deposition step beforehand is surface area independent and would be applied across the entire wafer, and takes just as long if not longer than the lithography.
  Additionally, if you were going to build a fab ground up for some specific purpose, then you might optimize the fab for those specific devices as you lay out. But most of these companies are not doing that and are simply going through TSMC or a similar subcontractor. So you've got an additional question of how far TSMC will go to accommodate customers who only want to use half a wafer, and whether that's the kind of project they could profitably cater to.
  olejorgenb 9 months ago
  Yes, but my understanding is that the wafer is exposed in multiple steps, so there would still be less exposure steps? Probably insignificant compared to all the rest though. (Etching, moving the wafer, etc.)
  EDIT: to clarify - I mean the exposure of one single pattern/layer is done in multiple steps. (https://en.wikipedia.org/wiki/Photolithography#Projection)
  yannyu 9 months ago
  The number of exposure steps would be unrelated to the (surface area) size of die/device that you're making. In fact, in semiconductor manufacturing you're typically trying to maximize the number of devices per wafer because it costs the same to manufacture 1 device with 10 layers vs 100 devices with 10 layers on the same wafer. This goes so far as to have companies or business units share wafers for prototyping runs so as to minimize cost per device (by maximizing output per wafer).
  Also, etching, moving, etc is all done on the entire wafer at the same time generally, via masks and baths. It's less of a pencil/stylus process, and more of a t-shirt silk-screening process.
  gpm 9 months ago
  > This goes so far as to have companies or business units share wafers for prototyping runs so as to minimize cost per device
  Can this be done in production? Is there a chance that the portion of the wafer cerebras.ai can't fit their giant square in is being used for production of some other companies chips?
  pulvinar 9 months ago
  There's also no reason they couldn't pattern that area with some other suitable commodity chips. Like how sawmills and butchers put all cuts to use.
  sitkack 9 months ago
  Often those areas are used for test chips and structures for the next version. They are effectively free, so you can use them to test out ideas.
  ajb 9 months ago
  Good question. I think the wafer has a cost per area which is fairly significant, but I don't have any figures. There has historically been a push to utilise them more efficiently, eg by building fabs that can process larger wafers. Although mask exposure would be per processed area, I think that there are also some proportion of processing time which is per wafer, so the unprocessed area would have an opportunity cost relating to that.
  kristjansson 9 months ago
  AIUI Wafer marginal cost is lower than you'd expect. I had $50k in my head, quick google indicates[1] maybe <$20k at AAPL volumes? Regardless seems like the economics for Cerebras would strongly favor yield over wafer area utilization.
  [1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...
  undefined 9 months ago
  [deleted]
  georgeburdell 9 months ago
  They probably pattern at least next nearest neighbors for local uniformity. That’s just litho though. The rest of the process is done all at once on the wafer
- sroussey 9 months ago
  It’s a win if you can use the wafer as opposed to throwing it away.
  kristjansson 9 months ago
  A win is a manufacturing process that results in a functioning product. Wafers, etc. aren't so scarce as to demand every mm2 be used on every one every time.
NickHoff 9 months ago
Neat. What about power density?
An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the cerebras chip has the same power density, that means a cerebras TDP of 37.8 kW.
That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?
amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams
energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ
time = 154 kJ / 39.8 kW = 3.9 seconds
This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
- Paul_Clayton 9 months ago
  The enthalpy of vaporization of water (at standard pressure) is listed by Wikipedia[1] as 2.257 kJ/g, so boiling 462 grams would require an additional 1.04 MJ, adding 26 seconds. Cerebras claims a "peak sustained system power of 23kW" for the CS-3 16 Rack Unit system[2], so clearly the power density is lower than for an H100.
  [1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/
  twic 9 months ago
  On a tangent: has anyone built an active cooling system which operates in a partial vacuum? At half atmospheric pressure, water boils at around 80 C, which i believe is roughly the operating temperature for a hard-working chip. You could pump water onto the chip, have it vapourise, taking away all that heat, then take the vapour away and condense it at the fan end.
  This is how heat pipes work, i believe, but heat pipes aren't pumped, they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes. Are they called something else?
  It's also not a refrigerator, because those use a pump to pressurise the coolant in its gas phase, whereas here you would only be pumping the water.
  pants2 9 months ago
  No need to bother with a partial vacuum when ethanol boils at around 80 C as well and doesn't destroy electronics. I'm not aware of any active cooling systems utilizing this though.
  pezezin 9 months ago
  May I introduce you to the glorious vodka cooled PC? https://www.youtube.com/watch?v=IYTJfLyo_vE
  ddxxdd 9 months ago
  I could argue that ethanol has 1/3 the latent heat of vaporization of water, and would boil off 3 times quicker. However, what ultimately matters is the rate of heat transfer, so my nitpick may be irrelevant.
  TehCorwiz 9 months ago
  I found this review from 2019 of mechanically pumped heat pipe technologies. I skimmed the intro. Looks like it already has a foothold in aerospace.
  https://www.sciencedirect.com/science/article/abs/pii/S13594...
  Dylan16807 9 months ago
  > This is how heat pipes work, i believe, but heat pipes aren't pumped, they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes.
  Do you have a particular benefit in mind that a pump would help with?
- buildbot 9 months ago
  A Very Fancy cooling engine: https://www.eetimes.com/powering-and-cooling-a-wafer-scale-d...
- jwan584 9 months ago
  A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...
- throwup238 9 months ago
  The machine that actually holds one of their wafers is almost as impressive as the chip itself. Tons of water cooling channels and other interesting hardware for cooling.
- flopsamjetsam 9 months ago
  Minor correction, the keynote video says ~20 kW
- lostlogin 9 months ago
  If rack mounted, you are ending up with something like a reverse power station.
  So why not use it as an energy source? Spin a turbine.
  kristjansson 9 months ago
  If you let the chip actual boil enough water to run a turbine you're going to have a hard time keeping the magic smoke inside. Much better to run at reasonable temps and try to recover energy from the waste heat.
  ericye16 9 months ago
  What if you chose a refrigerant with a lower boiling point?
  kristjansson 9 months ago
  That's basically the principle of binary cycle[1] generators. However for data center waste heat recovery, I'd think you'd want to use a more stable fluid for cooling, and then pump it to a separate closed-loop binary-cycle generator. No reason to make your datacenter cooling system also deal with high pressure fluids, and moving high pressure working fluid from 1000s of chips to a turbine of sufficient size, etc.
  [1]: https://en.wikipedia.org/wiki/Binary_cycle
  renhanxue 9 months ago
  There's a bunch of places in Europe that use waste heat from datacenters in district heating systems. Same thing with waste heat from various industrial processes. It's relatively common practice.
  sebzim4500 9 months ago
  If my very stale physics is accurate then even with perfect thermodynamic efficiency you would only recover about a third of the energy that you put into the chips.
  dylan604 9 months ago
  1/3 > 0, so even if you don't get a $0 energy bill I'd venture that any company that could get 1/3 of energy bill would be happy
  bentcorner 9 months ago
  I'm aware of the efficiency losses but I think it would be amusing to use that turbine to help power the machine generating the heat.
  twic 9 months ago
  Hey, we're building artificial general intelligence, what's a little perpetual motion on the side?
highfrequency 9 months ago
To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.
Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.
oksurewhynot 9 months ago
I live in a small city/large town that has a large number of craft breweries. I always marveled at how these small operations were able to churn out so many different varieties. Turns out they are actually trying to make their few core recipes but the yield is so low they market the less consistent results as...all that variety I was so impressed with.
bee_rider 9 months ago
> Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.
That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.
I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?
Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?
IshKebab 9 months ago
TSMC also have a manufacturing process used by Tesla's Dojo where you can cut up the chips, throw away the defective ones, and then reassemble working ones into a sort of wafer scale device (5x5 chips for Dojo). Seems like a more logical design to me.
- bee_rider 9 months ago
  Is this similar to a chiplet design? Chiplets have been a thing for a while, so I assume Cerebras avoided them on purpose.
  IshKebab 9 months ago
  I don't think so - chiplets are much smaller and I think the process is different.
- ryao 9 months ago
  I had been under the impression that Nvidia had done something similar here, but they did not talk about deploying the space saving design and instead only talked about the server rack where all of the chips on the mega wafer normally are.
  https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...
  wmf 9 months ago
  That shield is just a prop that looks nothing like the real product. The NVL72 rack doesn't use any wafer-scale-like packaging.
  ryao 9 months ago
  It would be nice if they made it real. The cost savings from not needing so much material should be fantastic.
- mhh__ 9 months ago
  Amazing. I clicked a button in the azure deployment menu today...
ilaksh 9 months ago
I assume people are aware, but Cerebras has a web demo and API which is open to try and it is 2000 tokens per second for Llama 3.3 70b and 1000 tokens per second for Llama 3.1 405b.
https://cerebras.ai/inference
Neywiny 9 months ago
Understanding that there's inherent bias by them being competitors of the other companies, but still this article seems to make some stretches. If you told me you had an 8% core defect rate reduced 100x, I'd assume you got to close to 99% enablement. The table at the end shows... Otherwise.
They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.
- sfink 9 months ago
  I think you're missing the point. The comparison is not between 93% and 92%. The comparison is between what they're getting (93%) and what you'd get if you scaled up the usual process to the core size they're using (0%). They are doing something different (namely: a ~whole wafer chip) that isn't possible without massively boosting the intra-chip redundancy. (The usual process stops working once you no longer have any extra dies to discard.)
  > Despite having built the world’s largest chip, we enable 93% of our silicon area, which is higher than the leading GPU today.
  The important part is building the largest chip. The icing on the top is that the enablement is not lower. Which it would be without the routing-to-spare-cores magic sauce.
  And the differing terminology is because they're talking about differing things? You could call an SM a core, but it kind of contains (heterogeneous) cores itself. (I've no idea whether intra-SM cores can be redundant to boost yield.) A die is the part you break off and build a computer out of, it may contain a bunch of cores, a wafer can be broken up into multiple dies but for Cerebras it isn't.
  If NVIDIA were to go and build a whole-wafer die, they'd do something similar. But Cerebras did it and got it to work. NVIDIA hasn't gotten into that space yet, so there's no point in building a product that you can't sell to a consumer or even a data center that isn't built around that exact product (or to contain a Balrog).
  Neywiny 9 months ago
  I think I'll still stand by my viewpoint. They said:
  > On the Cerebras side, the effective die size is a bit smaller at 46,225mm2. Applying the same defect rate, the WSE-3 would see 46 defects. Each core is 0.05mm2. This means 2.2mm2 in total would be lost to defects.
  So ok they claim that they should see (46225-2.2)/46225 = 99.995%. Doing the same math for their Nvidia numbers it's 99.4%. And yet in practice neither approach got to these numbers. Nowhere near it. I just feel like the whole article talks about all this theory and numbers and math of how they're so much better but in practice it's meaningless.
  So what I'm not seeing is why it'd be impossible for all the H100s on a wafer to be interconnected and call it a day. You'd presumably get 92/93 = 98.9% of the performance and, here's the kicker, no need to switch to another architecture. I didn't know where your 0% number came from. Nothing about this article says that a competitor doing the same scaling to wafer scale would get 0%, just a marginal decrease in how many cores made it through fab.
  Fundamentally I am not convinced from this article that Cerebras has done something in their design that makes this possible. All I'm seeing is that it'd perform 1% faster.
  Edit: thinking a bit more on it, to me it's like they said TSMC has a guy with a sledgehammer who smashes all the wafers and their architecture snaps a tiny bit cleaner. But they haven't said anything about firing the guy with the sledgehammer. Their paragraph before the final table says that this whole exercise is pretty much meaningless because their numbers are made up about competitors and they aren't even the right numbers to be using. Then the table backs up my paraphrase.
  fspeech 9 months ago
  There is nothing inherently good about wafer scale. It's actually harder to dissipate heat and enable hybrid bonding with DRAM. So the gp is entirely correct that you need to actually show higher silicon utilization to be even considered as being something worthwhile.
exabrial 9 months ago
I have a dumb question. Why isn't silicon sold in cubes instead of cylinders?
- amelius 9 months ago
  The silicon ingots have a rotating production process that results in cylinders, not bricks.
  exabrial 9 months ago
  fascinating, I figured it was something like that. maybe we should produce hexagonal, instead of square, chip designs
- kryptiskt 9 months ago
  Crystalline silicon is produced with the Czochralski process (https://en.wikipedia.org/wiki/Czochralski_method), which produces a round ingot. So you'd have to cut away perfectly fine silicon to make something squarish.
- bigmattystyles 9 months ago
  no matter how you orient a circle on a plane, it's the same
anonymousDan 9 months ago
Very interesting. Am I correct in saying that fault tolerance here is with respect to 'static' errors that occur during manufacturing and are straightforward to detect before reaching the customer? Or can these failures potentially occur later on (and be tolerated) during the normal life of the chip?
aaroninsf 9 months ago
The number of people ITT this thread who have absorbed the world-weary AI-is-a-bubble skepticism...
I'm just gonna say, with serene certainty,
the economic order we inhabit going through phase change is certain. From certain myopic perspectives we can shoehorn that into a narrative of cyclical patterns in the tech industry or financial markets etc etc.
This is not going to be that. No more than the transformation of American retail can be shoehorned to kind of look like it used if you don't know anything at all about what contemporary international trade and logistics and oligopoly actually mean in terms of what is coming into your home from where and why it is or isn't cheap.
Where we'll be in 10, 20, years is literally unimaginable today; and trying to navigate that wrt traditional landmarks... oof.
larsrc 9 months ago
How do these much smaller cores compare in computing power to the bigger ones? They seem to implicitly claim that a core is a core is a core, but surely one gets something extra out of the much bigger one?
trhway 9 months ago
56K mm2 vs 46K mm2. I wonder why they wouldn’t use the smart routing/etc to use more fitting shape than square and thus use more of the wafer.
TowerTall 9 months ago
Ever heard the old joke story about an American buyer told the Japanese manufacture how many incorrectly made bolts were acceptable per lot of a thousand bolts? Maybe 2 or 3 in 1,000?
So the Japanese didn't have any incorrectly made bolts in their manufacturing process so they just added two or three bad ones to every batch to please the Americans.
bigmattystyles 9 months ago
When I was a kid, I used to get intel keychains with a die in acrylic - good job to whoever thought of that to sell the fully defective chips.
- dylan604 9 months ago
  wow, fancy with the acrylic. lots of places just place a chip (I'm more familiar with RAM sticks) on a keychain and call it a day.
  kragen 9 months ago
  Those aren't just a chip; they're an epoxy package with a leadframe and a chip inside it. To put just a chip on a keychain, you'd have to drill a hole through it, which is difficult because silicon is so brittle—almost like drilling a hole in glass. Then, when someone put it onto a keyring, the keyring would form a lever that applies a massive force to the edge of the brittle hole, shattering the brittle silicon. Potting the chip in acrylic resin is a much cheaper solution that works better.
  bigmattystyles 9 months ago
  they're all over eBay, I just checked - the one I was thinking of, that I think I had is going for $150 - the things you get rid of....
  bradyd 9 months ago
  Electronic Goldmine sells entire scrapped 200mm wafers for $15 or less
  https://theelectronicgoldmine.com/search?options%5Bprefix%5D...
ashvardanian 9 months ago
The AMD comparison may not be accurate. The 96 core AMD CPU takes multiple such dies (eight if I remember correctly) and separate IO chiplets. The total surface area listed should be much larger.
aurareturn 9 months ago
Bear case on Cerebras: https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...
Note: This author is heavily invested in Nvidia.
RecycledEle 9 months ago
IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to laser out bad cores.
abrookewood 9 months ago
Looking at the H100 on the left, why is the chip yield (72) based on a circular layout/constraint? Why do they discard all of the other chips that fall outside the circle?
- donavanm 9 months ago
  AFAIK all wafer ingots are cylinders, which means the wafers themselves are a circular cross section. So manufacturing is binpacking rectangles in to a circle. Plus different effects/defects in the chips based on the distance from the edge of the wafer.
  So I believe its the opposite: why are they representing the larger square and implying lower yield off the wafer in space that doesnt practically exist?
- flumpcakes 9 months ago
  Because the circle is the physical silicon. Any chips that fall outside the circle are only part of a full chip. They will be physically missing half the chip.
- therealcamino 9 months ago
  That's just the shape of the wafer. I don't know why the diagram continued the grid outside it.
iataiatax10 9 months ago
The yield problem is not surprising they found a solution. Maybe they could elaborate more on the power distribution and dissipation problem?
gunalx 9 months ago
My biggest question is who are the buyers?
- asdasdsddd 9 months ago
  mostly 1 ai company in the middle east last I heard
jstrong 9 months ago
I would like a workstation with 900k cores. lmk when these things are on ebay.
- riskable 9 months ago
  Just need that 20kW connection to your energy provider.
  jstrong 9 months ago
  a man can dream
Fokamul 9 months ago
Anyone has some picture how it is looks like inside these servers?
hoseja 9 months ago
Why square chip? Make it an octagon or something.
lofaszvanitt 9 months ago
A well written, easy to understand article.
wendyshu 9 months ago
What's yield?
- wmf 9 months ago
  It's the fraction of usable product from a manufacturing process.
- elpocko 9 months ago
  [flagged]
wizzard0 9 months ago
this is an important reminder that all digital electronics is really analog but with good correction circuitry.
and run-time cpu and memory error rates are always nonzero too, though orders of magnitude lower than chip yield rates
- nine_k 9 months ago
  CPUs may be very digital inside, but DRAM and flash memory are highly analog, especially MLC flash. DDR4 even has a dedicated training mode [1], during which DRAM and the memory controller learn the quirks of particular data lines and adjust to them, in order to communicate reliably.
  [1]: https://www.systemverilog.io/design/ddr4-initialization-and-...
tweetpeekai 9 months ago
[dead]
bcatanzaro 9 months ago
This is a strange blog post. Their tables say:
Cerebras yields 46225 * .93 = 43000 square millimeters per wafer
NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer
I don't know if their numbers are correct but it is a strange thing for a startup to brag that it is worse than a big company at something important.
- saulpw 9 months ago
  Being within striking distance of SOTA while using orders of magnitude fewer resources is worth bragging about.
ryao 9 months ago
> Take the Nvidia H100 – a massive GPU weighing in at 814mm2. Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.
Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.
- jjk166 9 months ago
  Redundant cores lead to a fault tolerant chip.
  ryao 9 months ago
  ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This on the other hand is merely redundant to handle manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, it is not going to tolerate that.
  jjk166 9 months ago
  Redundancy is a method of providing fault tolerance, the existence of other methods doesn't make it less fault tolerant.
  Nothing is tolerant to all possible faults. Fault tolerance refers to being able to tolerate specific types of faults under specific conditions.
  Fault tolerant is the proper term for this.
  ryao 9 months ago
  I think it would have been better to write redundant. It is more specific.