So MiniMax just "open-sourced" their model (I put it in quotes because they use a custom license that I haven't read through), but it has a context length of 4 million tokens and scored 100% on the needle-in-a-haystack problem. It uses lightning attention, so still attention, just a variation? So this is potentially not as groundbreaking as the publishers of the paper hoped, or am I missing something fundamental here? Can this scale better? Does it train more efficiently? The test-time inference is amazing; is that what sets this apart, and not necessarily the long-context capability? Will it hallucinate a lot less because it stores long-term memory more efficiently, and thus won't make up facts but rather use what it has remembered in context?
Similar to RWKV7's new (sub-quadratic) attention mechanism, which models values from keys as v ≈ kS and does an in-context descent on ||v - kS||^2/2 (where the state matrix S is one attention head), explained more by the author here: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...
and I tried to unpack it a bit here: https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
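For what it's worth, here is a tiny numeric sketch of that reading (my own toy version in plain numpy, not RWKV-7's actual kernels, and ignoring its decay and data-dependent learning rates): treat the state S as a fast-weight matrix and take one gradient step per token on 0.5 * ||v - kS||^2.

    # Toy single-head state update, viewed as in-context SGD (illustrative only).
    import numpy as np

    d = 8                                # head dimension (made up for the demo)
    rng = np.random.default_rng(0)
    S = np.zeros((d, d))                 # the "state" / fast-weight matrix

    def state_update(S, k, v, lr=0.5):
        # Loss L = 0.5 * ||v - k @ S||^2, so dL/dS = -outer(k, v - k @ S).
        # One SGD step nudges k @ S toward v; real RWKV-7 also has
        # data-dependent decay / learning rates, omitted here.
        err = v - k @ S
        return S + lr * np.outer(k, err)

    # Feed a short "context" of (k, v) pairs, then read out with a query.
    for _ in range(32):
        k = rng.standard_normal(d)
        v = rng.standard_normal(d)
        S = state_update(S, k, v)

    q = rng.standard_normal(d)
    readout = q @ S                      # O(d^2) per token, no KV cache

The nice property is that the memory stays a fixed-size matrix per head no matter how long the context is, which is where the sub-quadratic claim comes from.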
I wonder why the authors felt they needed to use drop caps in this paper. It is a distraction and seems to value style over content.
What irks me is when authors use only a needle-in-the-haystack test to assess long-context ability. Humans do a lot more than this when working with a large context: they repeatedly go back and forth over parts of it; it's not a simple single pass.
From the title I thought this was talking about cramming the night before an exam. ;-) Or, if it's an open-book exam, learning during the exam as one goes through the textbook.
Is it just me, or does this seem like big news?
Key questions:
1. The key data point seems to be Figure 6a, where performance on BABILong is compared and Titans is claimed to reach ~62%, versus ~42% for GPT-4o-mini, at a 100k sequence length.
However, GPT-4o and Claude are missing from this comparison; maybe because they perform better?
2. There is no example provided of the Neural Memory Module in action. This is the first question I would ask of this paper.
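To make that concrete, here is my best guess at what an "in action" example might look like: a rough toy reconstruction of the update rule as I read it from the paper, with a linear memory standing in for their MLP and made-up gate constants eta/theta/alpha. This is not the authors' code.

    # Toy test-time neural memory: momentum SGD on an associative loss
    # ||M(k) - v||^2 with a forgetting gate (illustrative only).
    import numpy as np

    d = 16
    rng = np.random.default_rng(1)
    M = np.zeros((d, d))        # memory parameters (a linear map k -> v here)
    S = np.zeros_like(M)        # running "surprise" (gradient momentum)

    def memory_step(M, S, k, v, eta=0.6, theta=0.02, alpha=0.01):
        grad = np.outer(M @ k - v, k)    # d/dM of 0.5 * ||M k - v||^2
        S = eta * S - theta * grad       # past surprise plus momentary surprise
        M = (1 - alpha) * M + S          # weight decay acting as forgetting
        return M, S

    # "In action": stream key/value pairs through the memory, then query it.
    pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(64)]
    for k, v in pairs:
        M, S = memory_step(M, S, k, v)

    k0, v0 = pairs[0]
    print(np.linalg.norm(M @ k0 - v0))   # smaller means better recall of that pair

Even a minimal worked example like this in the paper would have made the mechanism much easier to evaluate.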
How are the references sorted?
If this were that good, why would Google release it?