PyTorch Helion (pytorch.org)
Submitted by jarbus 6 days ago
  • markush_ 5 hours ago

    Interesting choice from PyTorch to release yet another DSL. On the positive side, it's one more point in the design space; on the other hand, it makes it even more difficult to choose the right technology among Triton, Gluon, CuTe, ThunderKittens, and a few others.

    • darknoon 4 hours ago

      The developers also gave a talk about Helion on GPU Mode: https://www.youtube.com/watch?v=1zKvCLuvUYc

      • ballpug an hour ago

        Compiling a kernel after assemblage in low-level object oriented languages either uses stable kernel or the cargo fuzzed raw_spinlock code.

        Helion abstracts the syntax and design for calculating λ-functions, which converts the language into a kernel config.

        • bobajeff 5 hours ago

          It's good to see more effort toward making things not device-specific, but I only see benchmarks for NVIDIA B200 and AMD MI350X. Also, what's the experience of using one of these Python DSLs like? Are the tools good enough to make code completion, jump to definition, setting breakpoints, watching variables, copying as an expression, etc. nice?

          • saagarjha an hour ago

            Generally you are unlikely to get Python-level debugging for code that is going to run on GPUs.

          • brap 6 hours ago

            Asking as someone who is really out of the loop: how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level, and the infra would be mostly commoditized.

            • embedding-shape 5 hours ago

              > how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level

               Every time the high-level architectures of models change, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, that require the lower-level parts to be created and optimized.

              • westurner 2 hours ago

                How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics? How reconfigurable are TPUs? Are there any standardized feature flags for TPUs yet?

                Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?

                From https://news.ycombinator.com/item?id=45775181 re: current TPUs in 2025; "AI accelerators" :

                > How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?

              • brrrrrm 5 hours ago

                 a recent wave of interest in bitwise-equivalent execution led to a lot of kernels at this level getting pumped out.

                new attention mechanisms also often need new kernels to run at any reasonable rate

                 there's definitely a breed of frontend-only ML dev that dominates the space, but a lot of novel exploration needs new kernels

              • bwfan123 3 hours ago

                 I don't get the point of Helion as compared to its alternatives like Gluon.

                 For best performance, I would presume one needs low-level access to hardware knobs. And these kernel primitives are written once and reused. So what is the point of a DSL that dumbs things down as a wrapper around Triton?

                • krapht 3 hours ago

                  Funny, I feel the same way about Triton. Performant Triton looks like CUDA (but with tiles!) except it's ten times harder to debug since it doesn't have the tooling NVIDIA provides.

                  If I had to run on AMD I'd rather deal with their hipify tooling.

                  • saagarjha an hour ago

                    Performant Triton programs are usually simpler and shorter than their CUDA equivalents. This alone makes it easier to write, and I would argue that it helps with debugging too because the model provides a lot more guarantees on how your code executes. That said, some of the tooling is notably poor (such as cuda-gdb support).
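
                      As a rough illustration of the brevity (a sketch following the standard vector-add pattern from Triton's tutorials; the names and the block size here are just placeholders):

                          import torch
                          import triton
                          import triton.language as tl

                          @triton.jit
                          def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
                              # each program instance handles one BLOCK_SIZE-wide tile
                              pid = tl.program_id(axis=0)
                              offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
                              mask = offsets < n_elements  # guard the ragged last tile
                              x = tl.load(x_ptr + offsets, mask=mask)
                              y = tl.load(y_ptr + offsets, mask=mask)
                              tl.store(out_ptr + offsets, x + y, mask=mask)

                          def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
                              out = torch.empty_like(x)
                              n = x.numel()
                              add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
                              return out

                      This is only the trivial case; the relative simplicity shows up more in things like matmuls, where tile-level loads replace a lot of manual shared-memory staging.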

                    • krapht 11 minutes ago

                        Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing the reasons why algorithms are the way they are: for example, why we do a split-k decomposition when doing a matrix multiplication, or why we load this particular data into shared memory at this particular time, with some overlapping subset in registers.

                      Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the developer perf and debug tooling is absolutely not worth the tradeoff.

                      For me I'd rather deal with Jax or Numba, and if that still wasn't enough, I would jump straight to CUDA.

                      It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.

                • mshockwave 2 hours ago

                   Is it normal to spend 10 minutes on tuning nowadays? Do we need to spend another 10 minutes after changing the code?

                  • singularity2001 3 hours ago

                     Anything, as long as I don't have to touch proprietary CUDA and mpx

                    • saagarjha an hour ago

                      You’ll need an execution backend.

                    • jarbus 2 hours ago

                      I posted this 5 days ago, how did this resurface?

                    • uoaei 4 hours ago

                         Tangential question related to the example kernel: in GPU programming, is it idiomatic/standard to initialize the out array as zeros rather than empty? Are the performance savings negligible?

                      • saagarjha an hour ago

                        It saves a kernel launch and memory bandwidth for a fill kernel. If you’re going to overwrite the data anyway, why bother?

                        • porridgeraisin an hour ago

                           They did make it empty, actually:

                          >> out = torch.empty([m, n], dtype=x.dtype, device=x.device)

                           The accumulator is initialized to zero since, well, they have to add stuff into it:

                          >> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)

                          > idiomatic

                           No, as far as I have seen they generally try not to initialize if it's not necessary.

                          > overhead

                           There is the memory bandwidth point, as you might expect. But additionally, when using high-level interfaces like PyTorch, writing torch.zeros(512, 512) launches a whole kernel (tens of micros) just for that line. So that's CPU -> GPU -> back to CPU, and then it does the next line, where it goes to the GPU again and uses that memory. So in these cases you make sure to avoid it if it's in a hot path. Ideally you want the second kernel to do the initialization itself; when you write CUDA C++ yourself, this is how you typically do it. Helion, being a compiler, might be doing this optimization, but runtime-based torch clearly can't.
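
                           A minimal sketch of the two patterns being contrasted (shapes and the use of mm/addmm here are illustrative, not taken from the Helion example):

                               import torch

                               m, n, k = 1024, 1024, 1024
                               x = torch.randn(m, k, device="cuda")
                               w = torch.randn(k, n, device="cuda")

                               # Pattern 1: zero-fill first, then accumulate into it.
                               # torch.zeros launches a separate fill kernel (plus launch latency)
                               # before the compute kernel ever runs.
                               out = torch.zeros(m, n, device="cuda")
                               out.addmm_(x, w)  # out = out + x @ w, reads the zeros it just wrote

                               # Pattern 2: allocate without initializing; the compute kernel
                               # writes every element of `out`, so the fill kernel is skipped.
                               out = torch.empty(m, n, device="cuda")
                               torch.mm(x, w, out=out)

                           In a fused kernel, the accumulator lives in registers and is zeroed there, which is roughly what the hl.zeros line quoted above is doing.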

                        • doctorpangloss 4 hours ago

                          Is contributing to Triton so bad? It looks like the blocker is usually LLVM.

                          • saagarjha an hour ago

                            It’s not that bad, but I’m not sure why this is relevant?

                          • dachworker 6 hours ago

                             I'm super excited to give this one a spin. It seems like a neat idea: Triton, but simpler and with automatic autotuning. My head is spinning with options right now. I love how everyone was hyping up CUDA this and CUDA that a couple of years ago, and now CUDA is all but irrelevant. There's now so many different and opinionated takes on how you should write high performant accelerator cluster code. I love it.

                             It's also kind of ironic that right now in 2025 we have all this diversity in tooling, but at the same time the ML architecture space has collapsed entirely and everyone is just using transformers.

                            • embedding-shape 5 hours ago

                              > CUDA that a couple of years ago, and now CUDA is all but irrelevant

                               What? CUDA won't be irrelevant for years even if all the competitors figure out the holy grail; the ecosystem doesn't suddenly migrate overnight. People learning CUDA today will continue to find jobs and opportunities across the sector for the near future without any worries.

                              > but at the same time, the ML architecture space has collapsed entirely and everyone is just using transformers.

                               That's also not true; the ML space is still growing, and there is plenty happening outside of transformers, but it requires you to actually look and pay attention, not just browse the HN and r/localllama front pages.

                               Overall, these do not seem to be sentiments coming from someone inside the ML space, but rather from an onlooker's perspective.

                              • pjmlp 4 hours ago

                                In what alternative reality is that the case?

                                • almostgotcaught 6 hours ago

                                  > and now CUDA is all but irrelevant.

                                  Lol this is so wrong it's cringe.

                                  > There's now so many different and opinionated takes on how you should write high performant accelerator cluster code. I love it.

                                  There are literally only 2: SIMT (ie the same as it always was) and tiles (ie Triton). That's it. Helion is just Triton with more auto-tuning (Triton already has auto-tuning).
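
                                   For reference, Triton's existing auto-tuning is just a decorator over the kernel; a sketch with made-up configs:

                                       import triton
                                       import triton.language as tl

                                       @triton.autotune(
                                           configs=[
                                               triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
                                               triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
                                           ],
                                           key=["n_elements"],  # re-benchmark the configs when this argument changes
                                       )
                                       @triton.jit
                                       def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
                                           pid = tl.program_id(axis=0)
                                           offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
                                           mask = offsets < n_elements
                                           x = tl.load(x_ptr + offsets, mask=mask)
                                           tl.store(out_ptr + offsets, x * alpha, mask=mask)

                                       # launch: the grid depends on whichever config the tuner picks
                                       # scale_kernel[lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)](x, out, 2.0, n)

                                   Enumerating and searching that config space automatically, rather than listing configs by hand, is the part Helion leans into harder, as I understand it.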

                                  • the__alchemist 4 hours ago

                                    Even for non-ML things like chem simulations: CUDA (and cuFFT) are more pleasant to use than Vulkan Compute and vkFFT.

                                    • ozgrakkurt 4 hours ago

                                       I just learned the graphics API of Vulkan; I can't imagine anything being less pleasant than Vulkan

                                      • porridgeraisin an hour ago

                                        Yeah it's quite something. If anyone wants a preview, here's the triangle hello world in vulkan: https://gist.github.com/Overv/7ac07356037592a121225172d7d78f...

                                         But then again, I've heard that it's this low-level because it's meant for engine developers.

                                        • simlevesque 32 minutes ago

                                          Oh wow that's horrible.