Twitter thread from Albert Gu, a primary author: https://nitter.net/_albertgu/status/1731727672286294400?s=20
"Quadratic attention has been indispensable for information-dense modalities such as language... until now.
Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried."
This paper introduces a new class of models called selective state space models (S6 or selective SSMs). The key ideas and results are:
1. SSMs are a type of recurrent model that can scale linearly with sequence length, making them more efficient than Transformers. However, prior SSMs struggled with discrete sequence modeling tasks like language.
2. This paper augments SSMs with a "selection mechanism" that allows the model dynamics to depend on the input, giving it the ability to selectively remember or forget information. This makes SSMs effective on tasks requiring discrete reasoning.
3. They design an efficient parallel scan algorithm to implement selective SSMs on GPUs. Despite being recurrent, it achieves up to 5x higher throughput than Transformers in benchmarks (a toy sketch of the scan trick follows this list).
4. They simplify prior SSM architectures into a new model called Mamba. On language modeling, Mamba matches or exceeds Transformers of 2-3x its size, while retaining linear scaling. It also achieves state-of-the-art results on audio, genomics, and synthetic tasks requiring long-term reasoning.
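To make point 3 concrete: the recurrence h_t = a_t * h_{t-1} + b_t is a composition of affine maps, and composing affine maps is associative, so the sequence can be combined in a parallel, tree-structured scan rather than a strictly sequential loop. A toy NumPy sketch of that trick (an illustration only, not the paper's kernel, which fuses the full selective SSM in CUDA):

```python
import numpy as np

# Each step is the affine map h -> a*h + b, represented as the pair (a, b).
# Composing two such maps gives another affine map, and the composition
# is associative, which is what makes a parallel scan possible.
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)  # apply `left`, then `right`

def scan_recurrence(a, b):
    # Sequential reference implementation; a GPU version applies
    # `combine` over a logarithmic-depth tree instead of this loop.
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))  # identity map
    hs = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        hs.append(acc[1])  # acc[1] is h_t for initial state 0
    return np.stack(hs)
```

Because `combine` is associative, the loop above can be replaced by a tree of `combine` calls, which is where the parallel speedup comes from.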
This work makes SSMs truly competitive with Transformers through selectivity and efficient engineering. Mamba matches or beats Transformers on major modalities while being substantially more efficient in computation, memory, and scaling to long sequences. If replicated, it's arguably the first linear-time architecture with Transformer-quality performance!!
Can't wait to see the code!
If you're going to use ChatGPT to write comments, you should preface the comment by saying so.
I don't want Hacker News to be a place where people just copy and paste ChatGPT output, taking credit for it, and potentially adding hallucinations to the ecosystem here.
When I read comments I assume they are human written, that someone put effort into that comment, and when someone is strongly asserting something (as your comment does), that they have sufficient authority to match the strength of the assertion.
ChatGPT comments always sound VERY sure of themselves, as though they are 100% right, but I know from experience that it can be very wrong, especially on more complex topics. So I think it's poor form to post this kind of comment, with its strong assertions, without saying it was copied and pasted, as though it were your own original work. It's a modern form of plagiarism, really, and it lowers the overall quality of Hacker News while polluting it with generated text.
Yeah, as the other comment here indicates, I used Claude to generate an initial summary, which I verified and edited before posting. I am teaching myself deep learning and I use Claude to analyze papers before diving in. Many papers are hard to get through otherwise - and that’s true I think even if you are an expert!
You won't teach yourself by running papers through Claude, and you wouldn't need to if you worked from first principles rather than rushing.
I beg to differ. If I start with a summary that was produced by a model that has an excellent broad understanding of all prior work, it can help to fill in gaps so that when I do read the paper later, I parse it better. I have ADHD and a learning disability. I guess your mileage may vary.
In fairness to the GP, there's another comment from them from about an hour ago explaining how the answer was produced.
tl;dr Claude drafting + manual checking and cleanup.
> If replicated, it's arguably the first linear-time architecture with Transformer-quality performance!!
RetNet was published in August (although I only learned of it last week): https://arxiv.org/abs/2307.08621
How does this differ from RNNs and their gating mechanism?
I got a really similar summary from GPT-4 by putting the paper abstract in the context window with a prompt to summarize. Did you write that summary yourself?
I got Claude to write a draft and then I scanned the paper to make sure it didn't make any egregious errors. I cleaned up its output, removing some bullets that I felt didn't add anything beyond what the abstract had already covered. I also prompted it to contrast this development with earlier work.
Am I understanding correctly that Mamba SSMs can be used for autoregressive sequence modeling, on both continuous and discrete data?
The paper claims that Mamba outperforms Transformers on modeling audio waveforms. Once a signal gets into a computer, it is technically always discrete, but I think what you’re referring to here is the ability to model continuous signals like audio and discrete data such as written language. Am I reading you right?
Yes I am indeed talking about that distinction.
If there's really no catch, then this is a groundbreaking paper: unambiguously beating Transformers by a margin on all the metrics they tested.
Some catches I'm looking for: it remains to be seen whether the generations are high quality (not just low perplexity) and whether the architecture scales beyond 3B, but it's definitely super promising.
See section 4 of the paper, 'Empirical Evaluation.' It meets or exceeds the performance of similarly sized Transformers on various benchmarks such as HellaSwag and WinoGrande. Seems very promising.
Hi og_kalu! I agree. Let me add that the authors of RWKV and Linear Attention deserve some credit too, given the connection between those linear RNNs and these new "selective" SSM models. See my comment here: https://news.ycombinator.com/item?id=38529943
Oh for sure. Didn't mean to imply otherwise!
:-)
This looks great. Thank you Gu and Dao!
For those who are unaware, previous "state-space models" (SSMs) took the classic state space model from control theory[a], parameterized its matrices, and applied it in discrete time to tokens in a sequence. Many people, including me, got excited by SSMs, given their linear cost... but they failed to live up to the hype.
The key difference now is that, instead of parameterizing the transformation matrices as before, the authors propose computing them from the data. In other words, instead of applying the same transformation at each step in time (which the authors call "linear time invariance"), they now dynamically compute and apply a different transformation at each step in time. Think of it as a new kind of nonlinear state-space model in which the linear maps are dynamically computed from the data at each step. The whole thing is very similar to RWKV[b], which, as the authors point out, can be formulated as a composition of these new "selective" SSMs. And RWKV itself is based on Linear Attention.[c] There's a lot to process in this paper.
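For concreteness, here is a minimal toy of that idea in PyTorch. This is my own sketch, not the paper's actual parameterization (which also discretizes the dynamics with a learned, input-dependent step size):

```python
import torch
import torch.nn as nn

class ToySelectiveRecurrence(nn.Module):
    # Toy "selective" recurrence: the per-channel decay a_t and write
    # strength b_t are computed from the input at every step, so the
    # model can choose to remember or forget based on what it sees.
    # A classic LTI SSM would use the same fixed a and b at every step.
    def __init__(self, d_model: int):
        super().__init__()
        self.to_a = nn.Linear(d_model, d_model)
        self.to_b = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            xt = x[:, t]
            a = torch.sigmoid(self.to_a(xt))  # input-dependent forget/decay
            b = self.to_b(xt)                 # input-dependent write strength
            h = a * h + b * xt                # dynamics vary with the input
            ys.append(h)
        return torch.stack(ys, dim=1)
```

Freezing `a` and `b` to constants recovers the time-invariant case; letting them depend on `x` is the "selection."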
More importantly, by establishing a connection between these linear-time DNNs and a well-understood model from control theory, the authors open the door to some exciting possibilities: Can these models be extended to the continuous-time setting? Which classic results from control theory apply to them? Can we draw from classic theory to improve our ability to reason about the behavior of these models? Other authors may have been the first to show these new linear RNNs can work well, but Gu, Dao, and others out of Chris Ré's lab deserve credit for making the connection to classic models.
I'm looking forward to diving in!
---
[a] https://en.wikipedia.org/wiki/State-space_representation#Lin...
[b] https://arxiv.org/abs/2305.13048
[c] https://arxiv.org/abs/2006.16236
Still chewing but my takeaways thus far:
SSM models' strengths are with continuous data like audio and video; they struggle with discrete data like text and DNA. This newest architecture uses a selection mechanism to address the weaknesses around discrete data, at some cost in performance on continuous tasks, shown empirically here with audio. The empirical exploration was limited to smaller models; performance at larger scales has yet to be explored in practice.
I found my deepest understanding of the selection mechanism came from struggling with the discretization in section 2, followed by the deeper explanations of the variables involved in 3.5.2. This video gives excellent background on SSMs, along with a detailed walk-through of the paper itself.[a]
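For anyone else stuck on that step: with a diagonal A, the zero-order-hold discretization in section 2 reduces to elementwise operations. A rough NumPy rendering of my reading (the step size `delta` is one of the quantities the selection mechanism makes input-dependent):

```python
import numpy as np

# Zero-order-hold discretization of x'(t) = A x(t) + B u(t) for diagonal A
# (entries of A assumed nonzero; in practice they are negative reals):
#   Abar = exp(delta * A),  Bbar = (exp(delta * A) - 1) / A * B
def discretize_zoh(A, B, delta):
    Abar = np.exp(delta * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar  # discrete step: h_t = Abar * h_{t-1} + Bbar * u_t
```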
I am still coming to understand S4 and SSMs in general, but the video suggested this annotated explainer, which has been helping a lot.[b]
I would also point out section 3.1 and its discussion of the tradeoffs between compression and effectiveness as particularly interesting.
I do wonder how many different GPUs / hardware architectures will be able to execute the optimizations that are described as critical. I think the nature of the optimizations is the part of the paper I understand least well.
The paper taken at face value looks very exciting. The promise of a very large context window with 5x throughput for inference would be huge if it proves to scale well. I do wonder whether it will make sense to train SSMs without this selection mechanism for specific continuous use cases, where the non-selective variant seems to perform better, or whether other architectures will prove to serve those cases better.
----
[a] https://www.youtube.com/watch?v=ouF-H35atOY
The distinction I remember from the paper is that discrete data (text, DNA) could be modeled effectively with only real-valued components in the state, while continuous data (audio, video) benefited from complex numbers. Evidently this summarizes the authors' statements characterizing existing work and preexisting wisdom on SSM/S4 models.
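A toy way to see that: in a one-dimensional recurrence h_t = a * h_{t-1}, a real `a` only decays, while a complex `a` decays and oscillates, which intuitively suits audio-like continuous signals. A quick sketch (my own illustration, not from the paper):

```python
import numpy as np

# Impulse response of h_t = a * h_{t-1}: real a gives pure decay,
# complex a = r * exp(i*theta) gives a decaying oscillation.
a_real = 0.95
a_complex = 0.95 * np.exp(1j * 0.3)
h_real, h_complex = 1.0, 1.0 + 0j
for t in range(8):
    h_real *= a_real
    h_complex *= a_complex
    print(t, f"{h_real:+.3f}", f"{h_complex.real:+.3f}")
```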
https://youtubetranscript.com/?v=ouF-H35atOY (same video, I'd watched previous to seeing this hn post)
> Other model details the authors note: most prior state space models use complex numbers in their state, but it has been empirically observed that completely real-valued state space models seem to work fine, and possibly even better in some settings, so they use real values as the default, which works well for all but one of their tasks.

Just following this, another impressive snippet:

> It succeeds on test sequence lengths of up to a million tokens, which is 4,000 times longer than it saw during training, while none of the other methods it's compared to generalize beyond twice their training length.
Yeah, I was also confused by the discretization step. The video is great, thank you for sharing!
>letting the SSM parameters be functions of the input
That’s what the attention mechanism has going for it. Different weights for different sets of tokens.
Yes, the authors claim their mechanism is more computationally efficient (both asymptotically and on current hardware) than Transformer attention, while playing much the same role.
(A smaller later dup thread: https://news.ycombinator.com/item?id=38606590)
I've been doing web-related engineering for the past few years and recently became interested in machine learning. Taking a random perusal through the code for this [1] feels daunting. So many single-letter variables; it looks like old JS soup code. It makes me hesitant to leave Typescript and even Rust (for web tooling).
[1] https://github.com/state-spaces/mamba/blob/main/mamba_ssm/op...
It corresponds to the paper's formulas as a courtesy. While descriptive comments or fuller function signatures might be nicer, it is still reasonably easy to follow along from the paper's description! (Although most of the complexity is hidden away in selective_scan_cuda.fwd().)
Implementation of selective_scan_cuda: https://github.com/state-spaces/mamba/blob/6dbfc4553a98c81e1...
Aha, thanks, that is helpful. Any chance you know of a guide that walks through a paper and code side by side?
Hailing Jeremy Howard and his team at Fast.ai to make it so!
The variable names A, B, C, D, and related assume knowledge of the state space formulation, where these are the conventional names of the matrices in the state space equations.
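For reference, a minimal sketch of the textbook equations those letters name (discrete-time step; names and shapes are illustrative):

```python
import numpy as np

# Textbook state space step:
#   h_t = A @ h_{t-1} + B @ u_t   # A: state transition, B: input map
#   y_t = C @ h_t + D * u_t       # C: output map, D: skip/feedthrough
#                                 #    (a scalar skip term here)
def ssm_step(A, B, C, D, h, u):
    h = A @ h + B @ u
    y = C @ h + D * u
    return h, y
```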
Thanks, that's helpful.
> It makes me hesitant to leave Typescript and even Rust (for web tooling).
To be clear, this is not what a "typical" Python application or script looks like at all. While there's definitely an argument to be made about Python's immature type support (especially compared to TypeScript or Rust), don't use this as an example of Python's readability.
This was written to closely follow the math equations, not to be a maintainable piece of software. It makes a lot more sense from a mathematics/academic perspective, but not from a software development perspective.
Strong typing support would do nothing for readability here except make this harder to read. Types don't fix the naming, commenting, formatting, etc. that make this so hard to read.
Well, this is cutting-edge ML code written in PyTorch. I wouldn't worry about not understanding something like this; you start with scikit-learn first.
This looks like a monad.