• Ratelman 4 hours ago

    So MiniMax just "open-sourced" their model (quotes because it comes with a custom license I haven't read through yet), with a context length of 4 million tokens and a 100% score on the needle-in-a-haystack test. It uses lightning attention - so still attention, just a variation? Does that make this paper less groundbreaking than its authors hoped, or am I missing something fundamental here? Can this approach scale better? Does it train more efficiently? The test-time inference is amazing - is that what sets it apart, rather than the long-context capability itself? Will it hallucinate less because it stores long-term memory more efficiently, using what it has remembered in context instead of making up facts?
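    For what it's worth, my rough (possibly wrong) understanding is that lightning attention is a tiled, I/O-aware implementation of linear attention: still "attention", but with the softmax kernel dropped so the computation can be carried as a fixed-size running state instead of a full T x T score matrix. A toy numpy sketch of that contrast, purely illustrative and nothing like MiniMax's actual kernel:

        # Softmax attention materializes a T x T score matrix;
        # linear attention keeps only a d x d running state.
        import numpy as np

        T, d = 6, 4                                  # toy sequence length and head dim
        rng = np.random.default_rng(0)
        Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

        # 1) Standard causal softmax attention: O(T^2) scores.
        scores = Q @ K.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
        P = np.exp(scores - scores.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        out_softmax = P @ V

        # 2) Causal linear attention: per-step cost is O(d^2), independent of T,
        #    which is what makes multi-million-token contexts feasible.
        phi = lambda x: np.maximum(x, 0.0) + 1e-6    # simple positive feature map
        S = np.zeros((d, d))                         # running key-value state
        z = np.zeros(d)                              # running normalizer
        out_linear = np.zeros((T, d))
        for t in range(T):
            S += np.outer(phi(K[t]), V[t])
            z += phi(K[t])
            out_linear[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z + 1e-6)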

    • marmaduke 2 hours ago

      Similar to RWKV7's new (sub-quadratic) attention mechanism, which models values from keys as v ≈ kS' and does an in-context descent on ||v - kS'||^2/2 (where the state matrix S is one attention head), explained more by the author here https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...

      And I tried to unpack it a bit here: https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
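      A minimal numpy sketch of that reading (my own simplification, written with column vectors so the model is v ≈ Sk, and leaving out RWKV7's decay and gating terms entirely): the state matrix S is just an online least-squares learner that takes one gradient step on ||v - Sk||^2/2 per token.

          # Toy illustration of "state matrix as in-context gradient descent";
          # not the actual RWKV7 update rule (no decay/gating here).
          import numpy as np

          d = 8                                # head dimension (toy size)
          rng = np.random.default_rng(1)
          S = np.zeros((d, d))                 # recurrent state, one attention head
          lr = 0.5                             # in-context "learning rate"

          for t in range(16):                  # stream of (key, value) pairs, i.e. tokens
              k = rng.standard_normal(d)
              k /= np.linalg.norm(k)           # normalized key
              v = rng.standard_normal(d)       # value to be memorized
              err = v - S @ k                  # residual of the reconstruction v ≈ Sk
              # one SGD step on ||v - Sk||^2 / 2: the gradient w.r.t. S is -err k^T
              S += lr * np.outer(err, k)

          # reading the memory: a query retrieves a value-like vector from the state
          q = rng.standard_normal(d)
          retrieved = S @ q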

      • amai 5 hours ago

        I wonder why the authors felt they needed to use drop caps in this paper. It is a distraction and seems to value style over content.

        • 331c8c71 4 hours ago

          They were not-so-secretly hoping their paper would go straight into the history books :) One could check the authors' other papers to verify.

        • OutOfHere 17 minutes ago

          What irks me is when authors use only a needle-in-a-haystack test to assess long-context ability. Humans do a lot more than that when working with a large context: they repeatedly go back and forth over parts of it; it's not a single pass.

          • suninsight 5 hours ago

            Key questions:

            1. The key data point seems to be Figure 6a, which compares performance on BABILong and shows Titans at ~62% versus GPT-4o-mini at ~42% at a 100k sequence length.

            However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?

            2. There is no example provided of the Neural Memory Module in action. That would be my first question for the authors.

            • tigershark 4 hours ago

              The biggest model they used has only 760M parameters, and it outperforms models an order of magnitude larger.

            • groceryheist 6 hours ago

              Is it just me, or does this seem like big news?

              • 331c8c71 5 hours ago

                Same here (saw it yesterday), but I haven't parsed the technical details yet, tbh.

                • quotemstr 6 hours ago

                  A lot of ML papers that sound revolutionary end up being duds