• zopper a day ago

    Surprised this isn't getting more attention. It is one of those papers that is very elegant and simple, yet very effective.

    • forrestp a day ago

      It's expensive in this field to verify other people's work. There are a few other papers from the last 3 years with the same high-level idea that call the anchor tokens something different -- Gist tokens is the only name I personally remember, but you can follow the citation chains back.

      Those other papers sounded like a godsend, but they have deficits that you only find out about if you try to use them against non-cherry-picked use-cases. I think they are, on average, getting better over time though.

      They call out their limitations at the bottom of the paper. For these kinds of models, it would be nice to see them probing & measuring the main weakness of compressive memory: producing exact outputs. That would be things like retrieving multiple items out of the context exactly, arithmetic, or copy-pasting high-entropy bits (e.g. where a basic n-gram model can't bias you out of the blurry pieces).
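
      A rough sketch of the kind of probe I mean (a hypothetical eval harness, not anything from the paper; model.generate is a stand-in interface): plant a high-entropy string in a long context, let the compressive memory do its thing, then check for a verbatim copy.

          import secrets

          def exact_recall_probe(model, n_trials=100):
              """Hypothetical check: can a compressive-memory model copy a
              high-entropy string out of a long context verbatim?"""
              hits = 0
              for _ in range(n_trials):
                  secret = secrets.token_hex(16)  # 32 random hex chars; no n-gram prior helps here
                  context = ("filler text " * 500) + f"The access key is {secret}. " + ("filler text " * 500)
                  answer = model.generate(context + "\nRepeat the access key exactly:")
                  hits += (secret in answer)      # exact match, not fuzzy
              return hits / n_trials              # a lossless-context baseline should score ~1.0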

      The other side of it is that training for some of these architectures is often hard to reproduce -- it can be highly unstable and both difficult + expensive to dial in on a real-world model. We see their best training run, not the 500 runs where they changed hyperparameters b/c the loss kept exploding randomly (compare this to text-only llama-esque architectures, which are wildly stable at training time / predictable / easy to invest in, and whose hyperparams are easy to find from prior art).

      I think we are still many papers away from something ready-for-prod on this concept, but I am personally optimistic.

    • wantsanagent a day ago

      Someone explain to me how this isn't reinventing LSTMs please.

      • toxik a day ago

        I don’t understand why you think they are even similar. This is still doing pairwise attention.

        • wantsanagent a day ago

          An LSTM takes a series of values and uses a combination of gates to decide which information is critical to hold on to or forget as the sequence unfolds. That is a compressive technique: it removes the requirement of having all previous sequence information available at the time of a particular inference.
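
          For concreteness, the gating I mean, as a bare-bones numpy sketch (these are just the standard LSTM equations, nothing specific to this paper):

              import numpy as np

              def sigmoid(x):
                  return 1.0 / (1.0 + np.exp(-x))

              def lstm_step(x, h_prev, c_prev, W, U, b):
                  """One standard LSTM step: gates decide what to forget from the
                  cell state, what new information to write, and what to expose."""
                  z = W @ x + U @ h_prev + b                    # all four gate pre-activations at once
                  i, f, o, g = np.split(z, 4)
                  i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
                  c = f * c_prev + i * np.tanh(g)               # forget old info, admit new info
                  h = o * np.tanh(c)                            # compressed summary carried forward
                  return h, c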

          This paper "compress[es] sequence information into an anchor token," which is then used at inference time to reduce the information required for prediction and to speed that prediction up. They do this via "continually pre-training the model to compress sequence information into the anchor token."
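
          A toy sketch of how I read the anchor mechanism (my own interpretation, not the paper's code; the mask layout and anchor position are made up for illustration): positions after the anchor can attend to the anchor but not to the raw prefix, so continual pre-training forces prefix information through the anchor.

              import torch

              def anchor_attention_mask(seq_len, anchor_pos):
                  """Toy causal mask for anchor-style compression: positions after the
                  anchor keep normal causal attention among themselves and to the anchor,
                  but cannot see the raw pre-anchor tokens the anchor is meant to summarize."""
                  mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # plain causal mask
                  mask[anchor_pos + 1:, :anchor_pos] = False  # hide the compressed prefix from the suffix
                  return mask  # True = allowed to attend

          The analogy I'm drawing: both end up with a fixed-size summary standing in for the raw history; an LSTM builds it in its recurrent state at every step, while this builds it in the anchor token's representation.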