Great article!
For context: this WebGPU version achieves ~17% of the M2's peak theoretical performance. With CUDA (e.g. cuBLAS), you can reach ~75% of peak performance for the same matrix config (without tensor cores).
> you can reach ~75% of peak performance for the same matrix config
Not on the same computer; CUDA doesn't run on the integrated GPU of the Apple M2 Pro.
Yes, that's why I was focusing on the percentage of peak hardware performance, not actual FLOPS.
That's exactly what I was wondering. That cannot be.
Probably more relevant here is that a single CPU core on that computer exceeds 1 TFLOP/s on GEMM with plenty of margin using a single lib call, and leaves the rest of the CPU cores and all of the GPU free to do other work.
Nope, no Apple CPU core has such performance.
That single lib call must have used the AMX accelerator, which is separate from the cores and shared by a group of cores.
So the AMX accelerator's performance may be greater than that of all the CPU cores together. AFAIK, some Apple CPUs have one AMX accelerator for the big cores and another for the small cores, but in any case you can't expect that because you got 1 TFLOP/s running the program on 1 core you will get much more running it on multiple cores, because all cores of the same type use the same shared accelerator.
One interesting thing about these newfangled matrix/AI/ML accelerators that is very rarely mentioned on the internets: they only deliver that many TFLOPS because they operate at very low precision.
NVIDIA tensor cores support int8, a couple of versions of FP16 (BF16 and the standard IEEE one), and FP19, which they call TensorFloat-32. I think Intel AMX only supports int8 and BF16.
None of them supports FP32, let alone FP64, input numbers, which makes them completely useless for traditional GEMM stuff.
https://github.com/corsix/amx indicates that Apple's AMX supports up to 64-bit FP, but I don't see any performance metrics. They also have the ANE, which is the low-precision ML-focused accelerator.
I wasn't aware there are two completely different things from different companies both called AMX. I assumed it was this AMX: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
Apple's version is indeed interesting. I wonder why Apple hasn't exposed it to programmers, or implemented a BLAS library on top of that thing?
> I wonder why Apple hasn't exposed it to programmers, or implemented a BLAS library on top of that thing?
Using the Accelerate framework (which includes Apple's BLAS) is the only supported way for programmers to access the AMX. Reverse engineering the instruction set to access it directly is discouraged, because it's not a documented stable interface.
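For anyone curious what that looks like, here is a minimal sketch going through Accelerate's CBLAS interface (the matrix sizes, layout, and flags here are just illustrative; link with -framework Accelerate):

    // Multiplies two 1024x1024 row-major FP32 matrices: C = A * B.
    // On Apple silicon, Accelerate is how you end up on the AMX without touching its ISA.
    #include <Accelerate/Accelerate.h>
    #include <vector>

    int main() {
        const int M = 1024, N = 1024, K = 1024;
        std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A.data(), K,   // alpha, A, lda
                          B.data(), N,   // B, ldb
                    0.0f, C.data(), N);  // beta, C, ldc
        return 0;
    }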
It’s because Apple’s AMX was an early, in-house built version of what was eventually released by Arm as SME. Apple adopted SME in the M4 and dropped AMX, but as long as you were using their Accelerate framework instead of directly writing AMX code (which they told people not to do), you wouldn’t notice.
Now that they're using "standard" SME, it shouldn't be a problem to write SME assembly opcodes directly, although I suspect Apple's own documentation is still probably sparse. I'm not aware of any way to use intrinsics or something slightly higher level than inline ASM but lower level than the Accelerate framework.
Apple's matrix unit supports FP16, 32, and 64 sources and accumulators.
Right. So, like I said, using one CPU core, you can exceed 1 TFLOP/s, leaving all the other CPU cores and the GPU free for other work.
Your initial claim was ambiguous.
It sounded like you claimed that using only one core you already reach 1 TFLOP/s, implying that you could reach more than that by using more cores, which is false.
Now you have clarified that your actual claim is that it is good that a single core can already reach the maximum throughput of the shared matrix operation accelerator.
This is correct, but there is no essential difference between this and a Zen 5 CPU that reaches the same throughput using only half of its cores, while leaving the other half free for other tasks.
A top-of-the-line AMD Zen 5 core can sustain ~80 GFLOPS at FP64 and ~160 GFLOPS at FP32 using AVX-512, 2x FMA units, and ~5 GHz of clock frequency.
This is way, way lower than what you claim the M2 Pro is capable of, and since I'm comparing it against a state-of-the-art datacenter CPU, I'm curious how you got to this number?
An M2 Pro core runs at a much lower frequency, which seems to be around ~3.4 GHz. And I couldn't find any information about the supported SVE vector widths or the number of FMA units.
OP implemented that stuff with WebGPU. At runtime, that weird language compiles into a compute shader, which computes that stuff on the GPU.
I believe the point being made was that this could be done on the CPU faster than was achieved here.
Yeah, but not on a single core.
In my desktop computer, I have a Ryzen 7 8700G CPU, which has 8 Zen 4 cores, 4.2 GHz base frequency, 65 W TDP. Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFLOPS per core. You'd need all 8 cores to achieve a theoretical 1 TFLOP/s.
BTW, the integrated GPU inside the same 8700G processor can theoretically do 8.2 TFLOPS FP32.
> Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFLOPS per core.
Isn't it the case that Zen 4 doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units?
Because of this, a single AVX-512 instruction will occupy both FMA units, and therefore I think the theoretical limit for a single Zen 4 core should be half of the 134 GFLOPS number?
One FMA counts as two floating-point operations: one multiplication and one addition.
According to uops.info, Zen 4 cores can do two 8-wide FMA instructions per cycle, or one 16-wide FMA per cycle. See VFMADD132PS (YMM, YMM, YMM) and VFMADD132PS (ZMM, ZMM, ZMM) respectively; the throughput column is labelled TP. That's where the 32 FLOP/cycle number comes from.
> doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units
That's correct, AVX-512 doesn't deliver more FLOPs on that CPU. The throughput of 32-byte FMA and 64-byte FMA is the same: 32 FLOP/cycle for FP32 numbers.
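For anyone following along, putting the numbers from this thread together (counting an FMA as 2 FLOP, and using the 8700G's 4.2 GHz base clock from above):

    2 FMA instructions/cycle × 8 FP32 lanes × 2 FLOP per FMA = 32 FLOP/cycle
    32 FLOP/cycle × 4.2 GHz ≈ 134 GFLOP/s per Zen 4 core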
> One FMA counts as two floating-point operations: one multiplication and one addition.
Right. This is where the discrepancy comes from. I counted FMA as a single FLOP.
I am told the numbers above require the core to have a matrix multiply unit (such as SME).
how are you running CUDA on the integrated Apple silicon GPU these days?
You are not.
75% can't be the best we can do. What would reach 100% or nearly 100%? Handcoded assembly?
With GPUs it's not uncommon to run out of memory bandwidth before you max out the theoretical FLOPS. They may have a ton of bandwidth but it's never enough.
That can lead you to some pretty counter-intuitive optimizations because it's often faster to do more compute work if it means you touch less memory in the process.
> That can lead you to some pretty counter-intuitive optimizations because it's often faster to do more compute work if it means you touch less memory in the process.
It is not specific to GPUs: this kind of optimization is pretty common on CPUs too, where latency kills you and 200 cycles spent on compute can actually be faster than a single cache miss to fetch data. This is pretty common for many SIMD algorithms, actually.
Memory is currently lagging behind compute on almost every type of modern hardware, and it will very likely get worse, not better.
For sufficiently large GEMM you should never run out of bandwidth before you max out FLOPS if your blocking is organized correctly, because the arithmetic scales like O(n^3) while the memory access scales like O(n^2).
In theory, yes. In practice you will probably be forced to tile your GEMM and incur the penalty of redundant memory accesses.
Sure, but still, on each tile you do O(k^3) compute with O(k^2) memory, and you generally arrange things so that at least one tile is in L1 and at least one other is in L2/LLC (using CPU idioms), so again you have plenty of bandwidth. Typical choices of k are in the ballpark of ~32, and a 32:1 compute-to-memory ratio is just fine on most hardware, especially if some of those accesses are coming from fast memory.
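To make the blocking argument concrete, here is a rough sketch of a single-level cache-blocked SGEMM in plain C++ (the names and the TILE constant are just placeholders; row-major, n assumed divisible by TILE, no vectorization, packing, or edge handling):

    #include <cstddef>

    constexpr std::size_t TILE = 32;

    // Accumulates A*B into C. Each (i0, k0, j0) block performs TILE^3 multiply-adds
    // while touching only ~3*TILE^2 distinct elements, so compute outpaces traffic.
    void sgemm_blocked(std::size_t n, const float *A, const float *B, float *C) {
        for (std::size_t i0 = 0; i0 < n; i0 += TILE)
            for (std::size_t k0 = 0; k0 < n; k0 += TILE)
                for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                    for (std::size_t i = i0; i < i0 + TILE; ++i)
                        for (std::size_t k = k0; k < k0 + TILE; ++k) {
                            const float a = A[i * n + k];  // reused across the whole j loop
                            for (std::size_t j = j0; j < j0 + TILE; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }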
I don't think so? It is too late for me to actually do the math on this, but if you take the degenerate case where the tile size is literally 1 element, then you will do as many loads as arithmetic operations. Thus I would expect any sort of fixed-size tiling (which you are resigned to, since your caches are of limited size) to require O(n^3) loads?
Shouldn't the roofline inform capacity assessments?
Sure, but rooflines don't account for stuff like memory granularity. Not only do you have to do a lot of FLOPs per byte to achieve the necessary arithmetic intensity, you also have to access those bytes in a coalesced way. I.e., you want to access consecutive bytes, which are ideally already in registers.
The parameters of the matrix multiply, such as the size of the matrices, impose some limits to how close you can get to the peak theoretical performance in a particular GPU. Not all possible matrix multiplies are equally valuable to optimize a priori, so the hardware is designed to perform best on problems that are financially significant, such as modern LLMs.
As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?
> As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?
Why not? We do so for cryptographic primitives and video codecs. And why are you talking about “thousands of kernels”? AI programs only need a small number of different kernels, so it doesn't sound intractable.
> AI programs only need a small number of different kernels
That is not the case. What appears to be a simple matmul operation actually requires these libraries to select which specific kernel, out of the many available internally, to execute.
If you are curious to learn more, NVIDIA open-sourced a library called CUTLASS some years ago. And remember, that is only what they are willing to open source.
Is that really different from AV codecs in terms of scale though?
Yes: you can peek under the hood of cuBLAS and notice that it has dozens of kernels for different problem sizes. It's not generally the case that when you do H.264 at a different CRF you have a completely different tiling strategy that you have to implement.
I am not at liberty to discuss more than that.
thanks! and yes definitely not at CUDA levels :)
Can you explain why you did the naive algorithm here and not any of the fast matrix multiplication ones that trade multiplications for more additions? Just for educational purposes or is there a performance benefit in the technique?
Because those algorithms are generally not worth implementing even though their algorithmic complexity is theoretically lower.
at least on my m2, the compiled kernel ends up using fast math anyways so using WGSL's fma didn't change anything about the actual kernel that gets run
inglor is probably referring to Strassen or Coppersmith–Winograd.
Last I checked, the extra memory accesses really hurt in a lot of cases, especially for the more complex ones, but I'm no expert.
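For reference on what is being traded: Strassen replaces one n×n multiplication with 7 multiplications of size n/2 (instead of the naive 8), at the cost of extra additions and temporary matrices. The recurrence T(n) = 7·T(n/2) + O(n^2) solves to O(n^log2(7)) ≈ O(n^2.81). Those extra additions and temporaries are exactly the memory overhead mentioned above, and they push the crossover point where Strassen beats a well-tuned O(n^3) kernel to fairly large matrices.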
oh in that case it was because i didn't know about them :) something to try next!
For a very deep dive into the subject, this is a great writeup:
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)
(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)
this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried
there are a few things that i wasn't able to figure out how to get access to/i wasn't sure if they were possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.
i had a hard time finding information on whether that's even possible with my M2/Metal and the general memory access patterns. it seems like CUDA does have better documentation in this regard
That's a nice tutorial, but just to be clear: it is not a deep dive in any sense. It's just the bog-standard tricks. It doesn't cover MMA and WMMA, which today are table stakes for fast matmul. It also doesn't cover software pipelining. It's basically a good summary of the basics.
It’s a deep dive as of like 2015 probably. I don’t know if anyone has done something similar for modern GEMMs. Maybe the CUTLASS or Colfax people?
To clarify the title: TFLOP/s is the unit the author goes after, not TFLOP. People in the threads compare CUDA performance on GPUs to WebGPU performance: please recall that the H100 has a theoretical performance of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that performance.
H100 can do well over 1500 TFLOPS in fp16.
Which H100 and how much over 1500 TFLOP/s?
The datasheet for the H100 SXM seems to indicate that it can only do ~1000 TFLOP/s peak.
I just went to Nvidia’s site and downloaded the data sheet: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor.... It says 1600/1900 in half precision?
Read the fine print: "With sparsity". They double the claimed throughput by assuming that half of the FLOPs can be skipped.
I also recently went through the specs and noticed "with sparsity", but I didn't quite understand what it specifically refers to. The premise is that a lot of weights in matmul operations will be zero or insignificant (also known as sparse matrices), and in that case the A100/H100 has circuitry that can boost the throughput up to 2x, essentially "skipping" half of the FLOPS, as you say.
I am not an expert in LLMs, but I don't think you end up with a significant share of zeroed weights (~50%) in a converged network, so I think it is safe to say that the theoretical throughput for 99% of cases is really ~800 TFLOPS and not the ~1600 TFLOPS advertised.
Oh, that is really annoying. Thanks for catching that!
There are two populations of people reading the NVIDIA specs (and now you've switched groups). If NVIDIA ever changes their marketing strategy and the asterisk comes to denote something else, there might be a third population, because I know a lot of people who I suspect will keep dividing those starred FLOP/s by two :-)
I wrote something similar a while back: https://github.com/FL33TW00D/wgpu-mm
Also does quantized matmuls.
would be fun to do a leaderboard of some specific size (e.g. 4096x4096x4096) just to get all the code and tricks in one spot for folks to learn about things
You can do slightly better fairly easily I think. See here for example https://github.com/AnswerDotAI/gpu.cpp/pull/35
A couple of years ago I wanted to do about the same thing in HLSL, for a Direct3D 11.0 compute shader. Here's the fastest version I managed to make back then: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...
As you can see, I implemented 32×32 tiling, using thread groups of 32×8 threads and two groupshared buffers to load tiles of the input matrices, and I accumulate numbers into local variables, 32 / 8 = 4 accumulators per thread.
What's the perf like?
Sorry, I have not benchmarked against cuBLAS or Eigen or similar, I did that thing for ML inference.
I have implemented a profiler on top of D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tweaked the compute shader to minimize the time reported by these queries for my specific use case.
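For anyone who hasn't used those queries, the pattern is roughly the following (just a sketch: the function name is made up, it assumes valid device/context pointers, and it omits error handling):

    #include <d3d11.h>

    // Measures GPU time around a dispatch using D3D11 timestamp queries.
    void profileDispatch(ID3D11Device *device, ID3D11DeviceContext *context) {
        D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
        D3D11_QUERY_DESC tsDesc = { D3D11_QUERY_TIMESTAMP, 0 };
        ID3D11Query *disjoint, *tsBegin, *tsEnd;
        device->CreateQuery(&disjointDesc, &disjoint);
        device->CreateQuery(&tsDesc, &tsBegin);
        device->CreateQuery(&tsDesc, &tsEnd);

        context->Begin(disjoint);
        context->End(tsBegin);              // timestamp queries only use End()
        // ... issue the compute shader Dispatch() being measured here ...
        context->End(tsEnd);
        context->End(disjoint);

        // Wait for the results, then convert GPU ticks to milliseconds.
        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
        while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
        UINT64 t0 = 0, t1 = 0;
        context->GetData(tsBegin, &t0, sizeof(t0), 0);
        context->GetData(tsEnd, &t1, sizeof(t1), 0);
        if (!dj.Disjoint) {
            double ms = double(t1 - t0) * 1000.0 / double(dj.Frequency);
            (void)ms;                       // report/aggregate it however you like
        }
        tsEnd->Release(); tsBegin->Release(); disjoint->Release();
    }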
WebGPU doesn't seem to talk about bank conflicts, hiding some hardware details that might be necessary to write the best kernel. Will it be able to match the perf of CUDA on the same hardware?
WebGPU cannot even come close, unfortunately, since they don't have support for hardware specific memory or warp-level primitives (like TMA or tensor cores). It's not like it gets 80% of the perf; it gets < 30% of peak perf for anything related to compute-heavy matrix multiplications.
> don't have support for hardware specific memory
I have no experience with WebGPU but if you mean group shared memory, I think the support is available. See the demo: https://compute.toys/view/25
i tried using workgroup shared memory and found it slower than just recomputing everything in each thread although i may have been doing something dumb
i'm excited to try subgroups though: https://developer.chrome.com/blog/new-in-webgpu-128#experime...
I've heard the WebGPU workgroup wants to close the gap on tensor core support.
you're definitely right, 80% was a bit of an overestimation, especially with respect to CUDA
it would be cool to see if there's some way to get better access to those lower-level primitives but i would be surprised
it does seem like subgroup support is a step in the right direction though!
great question, to me WebGPU sits a hair higher level than CUDA or Vulkan. so you don't have the exact same level of control, but you can get to 80% of its performance without having to write different kernels specific to the hardware
WebGPU will make Web maps even more competitive than they are already.
The smoothness of an iPhone map zoom, on any device.
> The smoothness of an iPhone map zoom, on any device.
Any device except an iPhone, until Apple finally gets around to shipping WebGPU in Safari. Any year now...
Safari is officially enabling support for WebGPU in iOS 18.2, which is rolling out within the first weeks of December.
Where'd you hear that? It's not listed here:
https://developer.apple.com/documentation/safari-release-not...
Source is here, from a Unity WebGPU thread. Look at the comment from October 27 from Brendan Duncan, a Unity employee: https://discussions.unity.com/t/early-access-to-the-new-webg...
"I have found that WebGPU is enabled by default now with iOS 18.2. Apple has been working in the open on WebGPU. The WebKit source code has their latest WebGPU work in it. What hasn’t been known is their release schedule, but now with 18.2 it’s looking very promising that it will be on by default in that version."
Source?
Edit: I just pressed “Reset All to Defaults” under “WebKit Feature Flags” on my device running 18.2 beta, and the switch for WebGPU is on!! <3