Were RNNs all we needed? (arxiv.org)
Submitted by beefman 6 hours ago
  • xnx 5 hours ago

    It's a curse and a blessing that discussion of topics happens in so many different places. I found this comment on Twitter/X interesting: https://x.com/fchollet/status/1841902521717293273

    "Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)

    Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."

    • islewis 5 hours ago

      > "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."

      I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:

      > ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512

      Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.

      • teruakohatu 4 hours ago

        > Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.

        This! Not just fastest but with the lowest resources in total.

        Fully connected neural networks are universal function approximators. Technically we don’t need anything but an FNN, but the memory requirements and speed would be abysmal, far beyond the realm of practicality.

        • actionfromafar an hour ago

          Unless we could build chips in 3D?

          • foota 15 minutes ago

            Not even then: a truly fully connected network would have super-exponential runtime (it would take N^N time to evaluate).

        • byearthithatius 2 hours ago

          > finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale

          Not to him; he runs the ARC challenge. He wants a new approach entirely, something capable of few-shot learning of out-of-distribution patterns ... somehow

        • sakras 2 hours ago

          I figured this was pretty obvious given that MLPs are universal function approximators. A giant MLP could achieve the same results as a transformer. The problem is the scale - we can’t train a big enough MLP. Transformers are a performance optimization, and that’s why they’re useful.

          • ctur 9 minutes ago

            Architecture matters because while deep learning can conceivably fit a curve with a single, huge layer (in theory... Universal approximation theorem), the amount of compute and data needed to get there is prohibitive. Having a good architecture means the theoretical possibility of deep learning finding the right N dimensional curve becomes a practical reality.

            Another thing about the architecture is we inherently bias it with the way we structure the data. For instance, take a dataset of (car) traffic patterns. If you only track the date as a feature, you miss that some events follow not just the day-of-year pattern but also holiday patterns. You could learn this with deep learning with enough data, but if we bake it into the dataset, you can build a model on it _much_ simpler and faster.

            So, architecture matters. Data/feature representation matters.
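
            A minimal sketch of the kind of feature baking described above (the holiday set and feature names are made up purely for illustration):

              from datetime import date

              # Hypothetical holiday list; in practice this would come from a calendar source.
              HOLIDAYS = {date(2024, 1, 1), date(2024, 7, 4), date(2024, 12, 25)}

              def featurize(d: date) -> dict:
                # Expose day-of-year, day-of-week, and an explicit holiday flag so the
                # model doesn't have to rediscover holidays from raw dates.
                return {
                  "day_of_year": d.timetuple().tm_yday,
                  "day_of_week": d.weekday(),
                  "is_holiday": int(d in HOLIDAYS),
                }

              print(featurize(date(2024, 7, 4)))  # {'day_of_year': 186, 'day_of_week': 3, 'is_holiday': 1}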

            • ants_everywhere 3 hours ago

              > is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)

              (Somewhat) fun and (somewhat) related fact: there's a whole cottage industry of "is all you need" papers https://arxiv.org/search/?query=%22is+all+you+need%22&search...

              • TaurenHunter 3 hours ago

                Reminds me of the "Considered Harmful" articles:

                https://meyerweb.com/eric/comment/chech.html

                • jprete 3 hours ago

                  I wonder if there's something about tech culture - or tech people - that encourages them to really, really like snowclones.

                  • observationist 2 hours ago

                    Yes. Do stuff that other people have been successful doing. Monkey see, monkey do - it's not a tech people thing, it's a human thing.

                    Tech just happens to be most on display at the moment - because tech people are building the tools and the parameters and the infrastructure handling all our interactions.

              • acchow 4 hours ago

                What it will come down to is computational efficiencies. We don’t want to retrain once a month - we want to retrain continuously. We don’t want one agent talking to 5 LLMs. We want thousands of LLMs all working in concert.

                • pbhjpbhj 15 minutes ago

                  Sounds like something that has unsustainable energy costs.

                  • ActorNightly 3 hours ago

                      This, and also the way models are trained has to be rethought. Backprop is good for figuring out complex function mappings, but not for storing information.

                  • Lerc 3 hours ago

                    I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.

                    On the other hand, while "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." is true, a sufficiently expressive mechanism may not be computationally or memory efficient. As both are constraints on what you can actually build, it's not whether the architecture can produce the result, but whether a feasible/practical instantiation of that architecture can produce the result.

                    • wongarsu 3 hours ago

                      One big thing that bells and whistles do is limit the training space.

                      For example when CNNs took over computer vision that wasn't because they were doing something that dense networks couldn't do. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly transformers are great because they allow us to train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs a lot faster to train they are actually pretty good. Training speed and efficiency remains the big bottleneck, not the actual expressiveness of the architecture
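
                        To put rough numbers on that edge-removal point (a toy comparison; the layer sizes here are arbitrary):

                          import torch.nn as nn

                          # One dense layer over a flattened 224x224 RGB image vs. one 3x3 conv layer.
                          dense = nn.Linear(224 * 224 * 3, 1024)             # every input connected to every unit
                          conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # small kernel reused at every position

                          count = lambda m: sum(p.numel() for p in m.parameters())
                          print(count(dense))  # ~154 million parameters
                          print(count(conv))   # 1,792 parameters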

                      • dheera 3 hours ago

                        I mean, transformer-based LLMs are RNNs, just really really really big ones with very wide inputs that maintain large amounts of context.

                        • immibis 3 hours ago

                          No. An RNN has an arbitrarily-long path from old inputs to new outputs, even if in practice it can't exploit that path. Transformers have fixed-size input windows.

                          • og_kalu 2 hours ago

                              You can't have a fixed state and an arbitrarily long path from the input. Well, you can, but then it's meaningless, because you fundamentally cannot keep stuffing information of arbitrary length into a fixed state. RNNs effectively have fixed-size input windows.

                            • immibis 2 hours ago

                                The path is arbitrarily long, not wide. It is possible for an RNN to be made that remembers the first word of the input, no matter how long the input is. This is not possible with a transformer, so we know they are fundamentally different.

                              • quotemstr 2 hours ago

                                But an RNN isn't going to remember the first token of input. It won't know until it sees the last token whether that first token was relevant after all, so it has to learn token-specific update rules that let it guess how long to hold what kinds of information. (In multi-layer systems, the network uses ineffable abstractions rather than tokens, but the same idea applies.)

                                What the RNN must be doing reminds me of "sliding window attention" --- the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.

                                • OkayPhysicist 2 hours ago

                                    The most popular RNNs (the ones that were successful enough for Google Translate and the like) actually had this behavior baked into the architecture: LSTMs, or Long Short-Term Memory networks.

                            • dheera 2 hours ago

                              A chunk of the output still goes into the transformer input, so the arbitrarily-long path still exists, it just goes through a decoding/encoding step.

                          • fsndz 4 hours ago

                              after reading this paper, I am now convinced we will need more than curve fitting to build AGI: https://medium.com/@fsndzomga/there-will-be-no-agi-d9be9af44...

                            • josh-sematic 3 hours ago

                              One reason why I'm excited about o1 is that it seems like OpenAI have cracked the nut of effective RL during training time, which takes us out of the domain of just fitting to the curve of "what a human would have said next." I just finished writing a couple blog posts about this; the first [1] covers some problems with that approach and the second [2] talks about what alternatives might look like.

                              [1] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t... [2] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t...

                              • acchow an hour ago

                                > After reading this paper, I am now

                                Is this your paper?

                                • ahzhou 3 hours ago

                                  Author: @fandzomga Username: fsndz

                                  Why try to funnel us to your paywalled article?

                                  • swolchok 4 hours ago

                                    paper is paywalled; just logging into Medium won't do it

                                  • xpl 4 hours ago

                                    I would like to read it, but it's under a paywall.

                                  • vineyardmike 3 hours ago

                                    TLDR: “statistically fitting token output is not the same as human intelligence, and human intelligence and AGI are contradictory anyways (because humans make mistakes)”

                                    Saved you the paywall click to the poorly structured medium article :)

                                  • quantadev 2 hours ago

                                    Most LLMs aren't even using a "curve" yet at all, right? All they're using is a series of linear equations because the model weights are a simple multiply and add (i.e. basic NN Perceptron). Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.

                                     I think future NNs will probably be more adaptive than this, perhaps with some Perceptrons using sine wave functions or other kinds of math functions beyond just the linear "y=mx+b".

                                    It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".

                                    • OkayPhysicist 2 hours ago

                                      The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so

                                      Ax = y

                                      Adding another layer is just multiplying a different set of weights times the output of the first, so

                                      B(Ax)= y

                                      If you remember your linear algebra course, you might see the problem: that can be simplified

                                      (BA)x = y

                                      Cx = y

                                      Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.

                                      To prevent this collapse, a nonlinear function must be introduced between each layer.
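
                                      A quick numerical way to see the collapse (a toy check with arbitrary sizes, in PyTorch just for convenience):

                                        import torch
                                        from torch import nn

                                        torch.manual_seed(0)
                                        A = nn.Linear(4, 4, bias=False)
                                        B = nn.Linear(4, 4, bias=False)
                                        C = nn.Linear(4, 4, bias=False)
                                        x = torch.randn(4)

                                        # Two stacked linear layers collapse into one whose weight is the product of the two.
                                        with torch.no_grad():
                                          C.weight.copy_(B.weight @ A.weight)
                                        print(torch.allclose(B(A(x)), C(x), atol=1e-6))              # True

                                        # Inserting a nonlinearity between the layers breaks the equivalence.
                                        print(torch.allclose(B(torch.relu(A(x))), C(x), atol=1e-6))  # False (in general)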

                                      • quantadev an hour ago

                                        Right. All the squashing is doing is keeping the output of any neuron in a range of below 1.

                                        But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.

                                        • nazgul17 an hour ago

                                          The non linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.

                                          > nothing but linearity

                                          No, if you have non linearities, the NN itself is not linear. The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.

                                          • quantadev 2 minutes ago

                                            > The non linearities are not there primarily to keep the outputs in a given range

                                            Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?

                                            All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.

                                  • bob1029 4 hours ago

                                    > Transformers required ~2.5x more training steps to achieve comparable performance, overfitting eventually.

                                    > RNNs are particularly suitable for sequence modelling settings such as those involving time series, natural language processing, and other sequential tasks where context from previous steps informs the current prediction.

                                    I would like to draw an analogy to digital signal processing. If you think of the recurrent-style architectures as IIR filters and feedforward-only architectures as FIR filters, you will likely find many parallels.

                                    The most obvious to me being that IIR filters typically require far fewer elements to produce the same response as an equivalent FIR filter. Granted, the FIR filter is often easier to implement/control/measure in practical terms (fixed-point arithmetic hardware == ML architectures that can run on GPUs).

                                    I don't think we get to the exponential scary part of AI without some fundamentally recurrent architecture. I think things like LSTM are kind of an in-between hack in this DSP analogy - You could look at it as FIR with dynamic coefficients. Neuromorphic approaches seem like the best long term bet to me in terms of efficiency.
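
                                    A toy illustration of the IIR/FIR part of that analogy (the coefficient and lengths are arbitrary): a single feedback coefficient produces an impulse response that takes many feedforward taps to match.

                                      import numpy as np

                                      a = 0.9                        # one-pole IIR: y[n] = x[n] + a * y[n-1]
                                      x = np.zeros(64)
                                      x[0] = 1.0                     # impulse input

                                      # IIR: one recursive coefficient (analogous to a recurrent state)
                                      y_iir = np.zeros_like(x)
                                      for n in range(len(x)):
                                        y_iir[n] = x[n] + (a * y_iir[n - 1] if n > 0 else 0.0)

                                      # FIR: needs many taps to match the same response (analogous to a fixed window)
                                      taps = a ** np.arange(32)      # truncated impulse response
                                      y_fir = np.convolve(x, taps)[:len(x)]

                                      print(np.max(np.abs(y_iir[:32] - y_fir[:32])))  # ~0: they agree until the FIR window runs out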

                                    • trott 3 hours ago

                                      My feeling is that the answer is "no", in the sense that these RNNs wouldn't be able to universally replace Transformers in LLMs, even though they might be good enough in some cases and beat them in others.

                                      Here's why.

                                      A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history. But what is an RNN to do? While the length of its context is unlimited, the amount of information the model retains about it is bounded by whatever is in its hidden state at any given time.

                                      Relevant: https://arxiv.org/abs/2402.01032

                                      • mkaic 3 hours ago

                                        The counterargument here is that you can just scale the size of the hidden state sufficiently such that it can hold compressed representations of whatever-length sequence you like. Ultimately, what I care about is whether RNNs could compete with transformers if FLOPs are held constant—something TFA doesn't really investigate.

                                        • psb217 3 hours ago

                                          Well, that's what the Transformer already does... One problem with the scaling you're describing is that there would be a massive amount of redundant information stored in the hidden activations when training the RNN. The hidden state at each time step t in the sequence would need to contain all info that (i) could be useful for predicting the token at time t and (ii) could be useful for predicting tokens at times >t. (i) is obvious, and (ii) holds since all information about the past is transferred to future predictions through the current hidden state. In principle, Transformers can avoid storing redundant info in multiple hidden states at the cost of having to maintain and access (via attention) a larger hidden state at test/eval time.

                                          • mkaic 2 hours ago

                                            > there would be a massive amount of redundant information stored in hidden activations

                                            Is there a way to prove this? One potential caveat that comes to mind for me is that perhaps the action of lerping between the old state and the new could be used by the model to perform semantically meaningful transformations on the old state. I guess in my mind it just doesn't seem obvious that the hidden state is necessarily a collection of "redundant information" — perhaps the information is culled/distilled the further along in the sequence you go? There will always be some redundancy, sure, but I don't think that such redundancy necessarily means we have to use superlinear methods like attention.

                                        • slashdave an hour ago

                                          > the amount of information the model retains about it is bounded by whatever is in its hidden state

                                          This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.

                                          • phkahler 2 hours ago

                                            >> A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history.

                                              Which isn't necessary. If you say "translate the following to German" instead, all it needs is to remember the task at hand and a much smaller amount of recent input. Well, that and the ability to output in parallel with processing input.

                                            • DoctorOetker 43 minutes ago

                                              Also, a lightweight network could do a first pass to identify tasks, instructions, constraints etc, and then a second pass could use the RNN.

                                              Consider the flood fill algorithm or union-find algorithm, which feels magical upon first exposure.

                                              https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori...

                                              Having 2 passes can enable so much more than a single pass.

                                              Another alternative could be to have a first pass make notes in a separate buffer while parsing the input. The bandwidth of the note taking and reading can be much much lower than that required for fetching the billions of parameters.

                                              • og_kalu 2 hours ago

                                                It's necessary for arbitrary information processing if you can forget and have no way to "unforget".

                                                  A model can decide to forget something that turns out to be important for some future prediction. A human can go back and re-read/listen etc. A transformer is always re-reading, but an RNN can't and is fucked.

                                                • magicalhippo an hour ago

                                                    That's just because we twisted its arm. One could for example feed the reversed input after, i.e. abc|cba where | is a special token. That would allow it to react to any part of the message.

                                                • trott an hour ago

                                                  People did something similar to what you are describing 10 years ago: https://arxiv.org/abs/1409.0473

                                                  But it's trained on translations, rather than the whole Internet.

                                              • mkaic 3 hours ago

                                                I strongly enjoy the simplicity of their "minGRU" architecture. It's basically just:

                                                  import torch
                                                  from torch import nn

                                                  class MinGRU(nn.Module):
                                                    def __init__(self, token_size, hidden_state_size):
                                                      super().__init__()
                                                      self.token_to_proposal = nn.Linear(token_size, hidden_state_size)
                                                      self.token_to_mix_factors = nn.Linear(token_size, hidden_state_size)

                                                    def forward(self, previous_hidden_state, current_token):
                                                      # Both the proposed state and the gate depend only on the current token.
                                                      proposed_hidden_state = self.token_to_proposal(current_token)
                                                      mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
                                                      # Gated lerp between the proposal and the previous hidden state.
                                                      return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
                                                
                                                And since the proposed hidden states and mix factors for each layer are both only dependent on the current token, you can compute all of them in parallel if you know the whole sequence ahead of time (like during training), and then combine them in linear time using parallel scan.
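
                                                To make that concrete, here is a minimal sketch of the scan trick (plain arithmetic rather than the paper's log-space formulation): the update is an affine recurrence h_t = a_t * h_{t-1} + b_t, where a_t is the mix factor and b_t the gated proposal (both computed from the current token alone), and composing affine maps is associative, so all prefixes can be combined in logarithmically many parallel steps.

                                                  import torch

                                                  def sequential_scan(a, b, h0):
                                                    # h_t = a_t * h_{t-1} + b_t, one step at a time (how inference runs)
                                                    h, out = h0, []
                                                    for t in range(len(a)):
                                                      h = a[t] * h + b[t]
                                                      out.append(h)
                                                    return torch.stack(out)

                                                  def parallel_scan(a, b, h0):
                                                    # Hillis-Steele scan over affine maps: (a1, b1) followed by (a2, b2) is (a1*a2, a2*b1 + b2)
                                                    A, B, shift = a.clone(), b.clone(), 1
                                                    while shift < len(a):
                                                      A_prev, B_prev = torch.ones_like(A), torch.zeros_like(B)
                                                      A_prev[shift:], B_prev[shift:] = A[:-shift], B[:-shift]
                                                      A, B = A_prev * A, A * B_prev + B  # each of the ~log2(T) rounds is fully parallel
                                                      shift *= 2
                                                    return A * h0 + B

                                                  T, d = 16, 4
                                                  mix, proposal = torch.rand(T, d), torch.randn(T, d)  # stand-ins for the two per-token linear outputs
                                                  a, b, h0 = mix, (1 - mix) * proposal, torch.zeros(d)
                                                  print(torch.allclose(sequential_scan(a, b, h0), parallel_scan(a, b, h0), atol=1e-5))  # True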

                                                The fact that this is competitive with transformers and state-space models in their small-scale experiments is gratifying to the "best PRs are the ones that delete code" side of me. That said, we won't know for sure if this is a capital-B Breakthrough until someone tries scaling it up to parameter and data counts comparable to SOTA models.

                                                One detail I found really interesting is that they seem to do all their calculations in log-space, according to the Appendix. They say it's for numerical stability, which is curious to me—I'm not sure I have a good intuition for why running everything in log-space makes the model more stable. Is it because they removed the tanh from the output, making it possible for values to explode if calculations are done in linear space?

                                                EDIT: Another thought—it's kind of fascinating that this sort of sequence modeling works at all. It's like if I gave you all the pages of a book individually torn out and in a random order, and asked you to try to make a vector representation for each page as well as instructions for how to mix that vector with the vector representing all previous pages — except you have zero knowledge of those previous pages. Then, I take all your page vectors, sequentially mix them together in-order, and grade you based on how good of a whole-book summary the final vector represents. Wild stuff.

                                                FURTHER EDIT: Yet another thought—right now, they're just using two dense linear layers to transform the token into the proposed hidden state and the lerp mix factors. I'm curious what would happen if you made those transforms MLPs instead of singular linear layers.

                                                • slashdave an hour ago

                                                  Log space is important if the token probabilities span a large range of values (powers). There is a reason that maximum likelihood fitting is always performed with log likelihoods.

                                                  • immibis 2 hours ago

                                                    This architecture, on the surface, seems to preclude the basic function of recognizing sequences of tokens. At the very least, it seems like it should suffer from something like the pumping lemma: if [the ][cat ][is ][black ] results in the output getting close to a certain vector, [the ][cat ][is ][black ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get even closer to that vector and nowhere close to a "why did you just repeat the same sentence three times" vector? Without non-linear mixing between input token and hidden state, there will be a lot of linear similarities between similar token sequences...

                                                    • mkaic 2 hours ago

                                                      Counterpoint: the hidden state at the beginning of ([the][cat][is][black]) x 3 is (probably) initialized to all zeros, but after seeing those first 4 tokens, it will not be all zeros. Thus, going into the second repetition of the sentence, the model has a different initial hidden state, and should exhibit different behavior. I think this makes it possible for the model to learn to recognize repeated sequences and avoid your proposed pitfall.

                                                  • charlescurt123 an hour ago

                                                    I find the entire field lacking when it comes to long-horizon problems. Our current, widely used solution is to scale, but we're nowhere near achieving the horizon scales even small mammal brains can handle. Our models can have trillions of parameters, yet a mouse brain would still outperform them on long-horizon tasks and efficiency. It's something small, simple, and elegant—an incredible search algorithm that not only finds near-optimal routes but also continuously learns on a fixed computational budget.

                                                    I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.

                                                    The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than focusing on perfect weights. The higher dimensionality you can provide to a model, the greater its theoretical storage capacity. This could resemble a two-layer model—one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.

                                                    When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates these minima by reconfiguring the model when needed, we could theoretically develop a single model with near-infinite local minima—and therefore, higher-dimensional memory. This may sound wild, but consider the fact that the human brain potentially creates and disconnects thousands of new connections in a single day. Could it be that these connections steer our internal loss landscape between different minima we need throughout the day?

                                                    • imjonse 5 hours ago

                                                      To their credit, the authors (Y. Bengio among them) end the paper with the question, not suggesting they know the answer. These models are very small even by academic standards so any finding would not necessarily extend to current LLM scales. The main conclusion is that RNN class networks can be trained as efficiently as modern alternatives but the resulting performance is only competitive at small scale.

                                                      • phkahler 4 hours ago

                                                        >> These models are very small even by academic standards so any finding would not necessarily extend to current LLM scales.

                                                        Emphasis on not necessarily.

                                                        >> The main conclusion is that RNN class networks can be trained as efficiently as modern alternatives but the resulting performance is only competitive at small scale.

                                                        Shouldn't the conclusion be "the resulting competitive performance has only been confirmed at small scale"?

                                                      • tehsauce 6 hours ago

                                                        I haven’t gone through the paper in detail yet but maybe someone can answer. If you remove the hidden state from an rnn as they say they’ve done, what’s left? An mlp predicting from a single token?

                                                        • bunderbunder 5 hours ago

                                                          They didn't remove the hidden state entirely, they just removed it from the input, forget and update gates. I haven't digested the paper either, but I think that in the case of a GRU this means that the hidden state update masking (z_t and r_t in the paper's formulas) only depends on the new input, not the input plus the prior hidden state.
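
                                                          In code terms the change is roughly this (a sketch; the weight names are mine, not the paper's):

                                                            import torch
                                                            from torch import nn

                                                            d = 8                                  # illustrative size
                                                            x_t, h_prev = torch.randn(d), torch.randn(d)
                                                            W_z, U_z = nn.Linear(d, d), nn.Linear(d, d)

                                                            # Standard GRU update gate: needs h_{t-1}, so timesteps must be processed in order.
                                                            z_standard = torch.sigmoid(W_z(x_t) + U_z(h_prev))

                                                            # minGRU-style gate: depends only on the input, so gates for a whole sequence can be computed at once.
                                                            z_min = torch.sigmoid(W_z(x_t))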

                                                          • jfcoa 5 hours ago

                                                            It doesn't completely remove it, it removes certain dependencies on it so that it can be computed by parallel scan, there is still a hidden state. It bears some similarity to what was done with Mamba.

                                                            • statusfailed 5 hours ago

                                                              I only had a quick look, but it looks like they tweaked the state update so the model can be run with parallel scan instead of having to do it sequentially.

                                                              • _0ffh 3 hours ago

                                                                The trick is to make sure the recursive dependency stays linear, that's how you enable parallel training.

                                                              • limapedro 2 hours ago

                                                                 This is such an interesting paper. Sadly they don't have big models; I'd like to see a model trained on TinyStories or even C4, since it should be faster than the transformer variant, and see how it compares.

                                                                • m11a 5 hours ago

                                                                  It’d be nice to see more of how this compares to Mamba. Looks like, in performance, they’re not leagues apart and it’s just a different architecture, not necessarily better or worse?

                                                                  • marcosdumay 5 hours ago

                                                                    R == Recurrent

                                                                    In theory, the answer to the question should be "yes"; they are Turing complete.

                                                                    The real question is about how to train them, and the paper is about that.

                                                                    • baanist 4 hours ago

                                                                      Why aren't AI researchers automating the search for efficient architectures?

                                                                        • ks2048 3 hours ago

                                                                          • ActorNightly 3 hours ago

                                                                            There has been some work, but the problem is that it's such a massive search space. Philosophically speaking, if you look at how humans came into existence, you could make an argument that the process of evolution from basic lifeforms can be represented as one giant compute per minute across all of earth, where genetic selection happens and computation proceeds to the next minute. That's a fuckload of compute.

                                                                          In more practical terms, you would imagine that an advanced model contains some semblance of a CPU to be able to truly reason. Given that CPUs can be all NAND gates (which take 2 neurons to represent), and are structured in a recurrent way, you fundamentally have to rethink how to train such a network, because backprop obviously won't work to capture things like binary decision points.

                                                                          • baanist 2 hours ago

                                                                            I thought the whole point of neural networks was that they were good at searching through these spaces. I'm pretty sure OpenAI is pruning their models behind the scenes to reduce their costs because that's the only way they can keep reducing the cost per token. So their secret sauce at this point is whatever pruning AI they're using to whittle the large computation graphs into more cost efficient consumer products.

                                                                          • kelseyfrog 3 hours ago

                                                                              The search space is far too wide and difficult to parameterize, and there is a wide gap between effective and ineffective architectures - i.e. a very small change can make a network effectively DOA.

                                                                            • hedgehog 3 hours ago

                                                                              Notably architecture search was popular for small vision nets where the cost of many training runs was low enough. I suspect some of the train-then-prune approaches will come back, but even there only by the best funded teams.

                                                                          • jjtheblunt 4 hours ago

                                                                            What are you saying is Turing-complete?

                                                                        • dsamarin 5 hours ago

                                                                          The name of the paper contrasts with that of the paper that spawned the Transformer architecture, which itself is a reference to the song "All You Need Is Love" by the Beatles. https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

                                                                          • vundercind 4 hours ago

                                                                            I eagerly await the backlash to suggesting any one thing is all you need, the first shot of which shall surely be titled: “‘All you need’ Considered Harmful”

                                                                            • ants_everywhere 3 hours ago

                                                                              Surely the universe is all you need though

                                                                          • logicchains 4 hours ago

                                                                            The model in the paper isn't a "real" RNN due to making it parallelizable, for the same reasons described in https://arxiv.org/abs/2404.08819 , and hence is theoretically less powerful than a "real" RNN (it struggles at some classes of problems that RNNs traditionally excel at). On the other hand, https://arxiv.org/abs/2405.04517 contains a "real" RNN component, which demonstrates a significant improvement on the kind of state-tracking problems that transformers struggle with.

                                                                            • robertsdionne 2 hours ago

                                                                              These are real RNNs, they still depend upon the prior hidden state, it’s just that the gating does not. The basic RNN equation can be parallelized with parallel prefix scan algorithms.

                                                                            • fhdsgbbcaA 2 hours ago

                                                                              We really need a [preprint] flag for unreviewed papers.

                                                                              • lgessler 28 minutes ago

                                                                                IMHO reviews are almost indistinguishable from noise at the AI conferences I'm familiar with these days anyway, so I don't see much of a value add.

                                                                              • adamnemecek 4 hours ago

                                                                                Yes, all machine learning can be interpreted in terms of approximating the partition function.

                                                                                This is obvious when one considers the connections between Transformers, RNNs, Hopfield networks and the Ising model, a model from statistical mechanics which is solved by calculating the partition function.

                                                                                This interpretation provides us with some very powerful tools that are commonplace in math and physics but which are not talked about in CS & ML.

                                                                                I'm working on a startup http://traceoid.ai which takes this exact view. Our approach enables faster training and inference, interpretability and also scalable energy-based models, the Holy Grail of machine learning.

                                                                                Join the discord https://discord.com/invite/mr9TAhpyBW or follow me on twitter https://twitter.com/adamnemecek1

                                                                                • hydrolox 5 hours ago

                                                                                  Betteridge's law of headlines?

                                                                                  • woah 5 hours ago

                                                                                    For paper titles, the law is that the answer is always "yes"

                                                                                    • bunderbunder 5 hours ago

                                                                                      Not always, I think?

                                                                                      Opinions probably differ, for example, on John Backus's paper "Can programming be liberated from the Von Neumann style?" Many fans of functional programming would say the answer is yes, but Backus himself expressed less enthusiasm in interviews later in his life.

                                                                                      I think the important point, though, is that academic papers and newspaper articles are not the same, and titles in the form of questions function differently in the two domains. Journalists tend to use titles like these to dissemble and sensationalize. When academics use these kinds of titles for peer-reviewed articles, it's because they really are asking an honest question. Backus was doing it in his paper. The authors of this paper are doing the same. They end the paper by re-iterating the question before launching into a discussion of the limitations that prevent them from reaching any firm conclusions on the answer to this question.

                                                                                      • nephanth 5 hours ago

                                                                                        More like "we aren't sure, but we have good reasons not to exclude the possibility"

                                                                                    • hiddencost 5 hours ago

                                                                                      Note Yoshua Bengio in the author list. This shouldn't be taken lightly.

                                                                                      • auggierose 5 hours ago

                                                                                        And this is where science breaks down.

                                                                                        • hotspot_one 5 hours ago

                                                                                          Not really, because

                                                                                          1) Yoshua's reputation would take a hit if this paper were bullshit, so he has extrinsic motivation to make it good.

                                                                                          2) Yoshua has enough experience in the field to know what is going on in the field; you don't have to ask if he forgot about a certain architecture or the work of a certain research group which would contradict his findings -- if such work exists and is credible, it is very likely to be discussed in the paper.

                                                                                          3) This test answers something a leader in the field thinks is important enough for them to work on, else he wouldn't be involved.

                                                                                          Also note, the poster said the paper shouldn't be taken lightly. That doesn't mean we need to take it blindly. It only means we cannot dismiss it out of hand, if we have a different view we would need substantive arguments to defend our view.

                                                                                          I've overturned the field leader several times in science, but that's only because I acknowledged what they got right and that they were indeed the person who got it right.

                                                                                          • auggierose 4 hours ago

                                                                                            > It only means we cannot dismiss it out of hand, if we have a different view we would need substantive arguments to defend our view.

                                                                                            You will need to do that anyway, no matter if Yoshua is on the paper, or not. I understand that people have limited bandwidth, and so they need shortcuts, and they need to justify these shortcuts to themselves somehow (of course the justifications are nonsense). Maybe AI will help here.

                                                                                            • DAGdug 4 hours ago

                                                                                              > I've overturned the field leader several times in science

                                                                                              Either that makes you a field leader yourself, or you did it for trivial things, or you’re BSing. Which one is it?

                                                                                              • exe34 3 hours ago

                                                                                                there's a big space between leader and trivial. it's entirely possible to point out the top leader in your field is wrong on ten things over a career, without becoming the top leader yourself.

                                                                                        • PunchTornado 4 hours ago

                                                                                          To me this is further evidence that these LLMs learn only to speak English, but that there is no reasoning at all in them. You can simplify a lot and obtain the same results, even though we know how complex the brain is.

                                                                                          • quantadev 2 hours ago

                                                                                            Every LLM expert on the planet agrees LLMs are doing "reasoning". No one says they have feelings or qualia, but we all know there's definitely genuinely artificial reasoning happening.

                                                                                            What LLMs have shown both Neuroscience and Computer Science is that reasoning is a mechanical process (or can be simulated by mechanical processes) and is not associated purely with consciousness.