• gormen 3 hours ago

    Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.

    • ottah 2 hours ago

      It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.

      • brendanashworth 4 hours ago

        Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

        [1] https://shap.readthedocs.io/en/latest/

        • dwohnitmok 3 hours ago

          SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.

          SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.
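          For concreteness, here is what exact Shapley attribution looks like on a toy game (a sketch of the underlying math, not SHAP's actual implementation): the inner loops enumerate every coalition of the remaining players, which is exactly where the exponential blowup comes from.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    # Exact Shapley attribution: weight each player's marginal
    # contribution over every coalition of the other players.
    # Enumerating all 2^(n-1) subsets per player is the exponential
    # cost that makes this hopeless at LLM scale.
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = set(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

# Toy "model": a coalition's value is the sum of its members' weights.
# In an additive game like this, each player's Shapley value is exactly
# its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = shapley_values(list(weights), lambda s: sum(weights[p] for p in s))
print(phi)  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

          Even this 3-player toy touches 8 coalitions; at billions of parameters the sampled coalitions cover a vanishing fraction of the space.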

          It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than neurons in earlier layers, and the ways they compose can completely confuse SHAP.

          It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.
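          A tiny numpy illustration of the superposition idea (my own toy example, not anything from this model's write-up): pack three feature directions into a two-dimensional hidden space, and no per-coordinate attribution can cleanly assign credit to one feature.

```python
import numpy as np

# Three unit-length "feature" directions crammed into two dimensions.
directions = np.array([
    [1.0, 0.0],
    [-0.5, np.sqrt(3) / 2],
    [-0.5, -np.sqrt(3) / 2],
])

features = np.array([1.0, 0.0, 1.0])  # features 0 and 2 are active
hidden = features @ directions        # the superposed representation

# Reading features back out by projection mixes them together:
# scoring the two hidden coordinates independently (as a
# parameter-by-parameter method must) cannot say which of the
# three features was responsible.
readout = hidden @ directions.T
print(hidden, readout)
```

          The readout shows interference on the inactive feature, which is the price of superposition and the reason you need something that splits a coordinate into interpretable parts.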

          I don't know the specifics of what this particular model's approach is.

          But SHAP unfortunately does not work for LLMs at all.

        • pbmango 4 hours ago

          This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it is either solved, or too out of reach to bother stopping and thinking about.

          • great_psy 4 hours ago

            Maybe I’m not creative enough to see the potential, but what value does this bring?

            Given the example I saw about CRISPR, what does this model give in its output over a different, non-explaining model? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?

            I find that LLM outputs are subtly wrong, not obviously wrong.

            • voidhorse 4 hours ago

              It makes the black box slightly more transparent. Knowing more in this regard lets us be more precise: you go from prompt-tweaking witchcraft and divination toward something more like science and precise method.

              • great_psy 4 hours ago

                Can this method be extended to go down to the sentence level?

                In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to the paragraph or sentence level to show what influences the answer?

                • rickydroll 2 hours ago

                  Your question should be "Can it drill down to show the paragraphs or sentences that influence the answer?"

                  I believe that the plagiarism complaint about LLMs comes from the assumption that there is a one-to-one relationship between training data and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.

                  • great_psy an hour ago

                    The example on the website shows one-to-many as well: Wikipedia, an arXiv article, etc., along with a ratio of how much each influences that chunk of the answer.

            • umairnadeem123 2 hours ago

              the practical value here is for regulated domains. in healthcare and finance you often can't deploy a model at all unless you can explain why it made a specific decision. token-level attribution that traces back to training data sources could satisfy audit requirements that currently block LLM adoption entirely.

              curious how the performance compares to a standard llama 8b on benchmarks - interpretability usually comes with a quality tax.

              • snowhale 2 hours ago

                the quality tax framing might actually undersell the value in regulated domains. if a hospital system can't deploy without explainability, a model that scores 95% and can trace its reasoning beats one that scores 97% and can't. the baseline isn't 'interpretable model vs better model' -- it's 'interpretable model vs no model at all.'

                • luulinh90s 2 hours ago

                  In the "Performance" section of the post (https://www.guidelabs.ai/post/steerling-8b-base-model-releas...), the authors show the model lags behind Llama 8B, but it's worth noting that Llama 8B was trained with >2x more compute (see the FLOPs axis).

                • in-silico an hour ago

                  Either I'm missing something or this is way overstated.

                  Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

                  They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

                  1: https://thezvi.substack.com/p/the-most-forbidden-technique
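                  If that reading is right, the wiring is roughly the following (a hypothetical numpy sketch of my interpretation of the post, not Guide Labs' actual code; all names and sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: final hidden state -> sparse autoencoder -> LM head.
# The LM head only sees the SAE's reconstruction, so in principle
# each prediction can be read off the sparse concept vector z.
d_model, d_concepts, vocab = 16, 64, 100
W_enc = rng.normal(size=(d_model, d_concepts)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_concepts, d_model)) / np.sqrt(d_concepts)
W_lm = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)

def forward(h):
    z = np.maximum(h @ W_enc, 0.0)  # concept activations (ReLU keeps them sparse-ish)
    h_hat = z @ W_dec               # SAE reconstruction of the hidden state
    logits = h_hat @ W_lm           # LM head sees only the reconstruction
    return logits, z

h = rng.normal(size=(d_model,))
logits, z = forward(h)
# The worry in [1]: a loss that trains z to *match* labelled concepts
# can make z look interpretable without z being what causally drives
# the logits.
```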

                  • rvz 4 hours ago

                    Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use-cases that are currently off limits.

                    We'll see.