Looks like LLM inference will follow the same path as Bitcoin: CPU -> GPU -> FPGA -> ASIC.
LLM inference is usually a small task built into some other program you're running, right? An office suite with a sentence-suggestion feature (probably a good use for an LLM) would be… mostly office suite, with a little LLM inference sprinkled in.
So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.
I really doubt it. Bitcoin mining is a fixed workload: just massive amounts of SHA-256. On the other hand, ASICs for accelerating matrix/tensor math already exist. LLM architecture is far from fixed and is still being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.
Google's TPU is an ASIC and performs competitively. Also, Tesla and Meta are building something, AFAIK.
Although I doubt you could get a lot better, as GPUs already have half the die area reserved for matrix multiplication.
It depends on your precise definition of ASIC. The FPGA thing here would be analogous to an MSIC, where M = model.
Building a chip for one specific model is clearly a different thing from what a TPU is.
Maybe we'll start seeing MSICs soon.
Is there any particular reason you'd want to use an FPGA for this? Unless your problem space is highly dynamic (e.g. prototyping) or you're making products in vanishingly low quantities for a price-insensitive market (e.g. military), an ASIC is always going to be better.
There doesn't seem to be much flux in the low-level architectures used for inference at this point, so you may as well commit to an ASIC, as is already happening with Apple, Qualcomm, etc. building NPUs into their SoCs.
You can open-source your FPGA designs for wider collaboration with the community. Also, an FPGA is the starting step for making any modern digital chip.
(1) Academics can make an FPGA design but not an ASIC, and (2) an FPGA is a first step toward making an ASIC.
This specific project looks like a case of "we have this platform for automotive and industrial use, running Llama on the dual-core ARM CPU is slow but there's an FPGA right next to it". That's all the justification you really need for a university project.
Not sure how useful this is for anyone who isn't already locked into this specific architecture. But it might be a useful benchmark or jumping-off point for more useful FPGA-based accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
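To make the 1.58-bit point concrete, here's a minimal sketch (mine, not from the project) of why ternary-weight LLMs map so well onto simple FPGA/ASIC datapaths: each weight is -1, 0, or +1 (log2(3) ≈ 1.58 bits), so a matrix-vector product needs only adds and subtracts, no multipliers, and multipliers are the expensive part of the datapath.

```python
def ternary_matvec(weights, x):
    """weights: rows of ternary values in {-1, 0, +1}; x: input vector.
    Hypothetical illustration -- real accelerators would do this in
    parallel fixed-point hardware, but the key property is the same:
    no multiplications anywhere in the inner loop."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # add instead of multiply
            elif w == -1:
                acc -= xi   # subtract instead of multiply
            # w == 0: skip the element entirely
        out.append(acc)
    return out

print(ternary_matvec([[1, 0, -1], [1, 1, 0]], [2.0, 3.0, 4.0]))  # [-2.0, 5.0]
```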
Model architecture changes fast. Maybe it will slow down.
You gotta prototype the thing somewhere. If the LLM algorithms turn out to be pretty mature, I suspect accelerators of all kinds will be baked into silicon, especially for inference.
That's the thing though, we're already there. Every new consumer ARM and x86 ASIC is shipping with some kind of NPU, the time for tentatively testing the waters with FPGAs was a few years ago before this stuff came to market.
But the NPU might be poorly suited to your model or workload, or just poorly designed.
like this? https://www.d-matrix.ai/product/
4 times as efficient as on the SoC's low-end ARM cores, so many times less efficient than on modern GPUs, I guess?
Not that I was expecting GPU-like efficiency from a fairly small-scale FPGA project. Nvidia engineers spent thousands of man-years making sure that stuff works well on GPUs.