• impossiblefork 12 days ago

    I think this is very interesting, especially the per-layer embeddings.

    Having more than one embedding is something I've tried myself, but not separate ones for each layer.

    I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
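
    A toy PyTorch version of that guess might look like the sketch below (entirely my own guess at the shape of it, names made up, position encoding left out, not the real implementation):

        import torch
        import torch.nn as nn

        class PerLayerEmbeddingBlock(nn.Module):
            """One block of the guessed architecture: each layer has its own token embedding table."""

            def __init__(self, vocab_size, d_model, n_heads):
                super().__init__()
                self.embed_l = nn.Embedding(vocab_size, d_model)  # this layer's own table
                self.mlp = nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ln = nn.LayerNorm(d_model)

            def forward(self, h, token_ids):
                # h_{l+1} = Attn(MLP(h_l) + embed_l(token_ids)); position encoding omitted here
                x = self.ln(self.mlp(h) + self.embed_l(token_ids))
                out, _ = self.attn(x, x, x, need_weights=False)
                return out

    Then the model is just a stack of these, each block with its own embed_l table, which is why it should be quick to test on a toy dataset.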

    • 3abiton 12 days ago

      Any resources or suggestions to learn about this? The field is moving too fast, my poor brain can't keep up.

      • impossiblefork 11 days ago

        Basically, you'd familiarize yourself with transformers by implementing different variants of them on different toy datasets and changing them around according to your own ideas.

        Then you'd figure out a set of toy tasks that you like and think are important.

        In this particular case you'd take something like NanoGPT and edit model.py: in class GPT's __init__, change the nn.Embedding in the self.transformer ModuleDict to a ModuleList of nn.Embedding (one per layer); then change the for loop at line 180 of forward to loop over a range and add x = x + self.transformer.wte[i](idx) before each block. Something like that, I think.
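
        Concretely, something like this (untested sketch against my memory of NanoGPT's model.py, inside GPT.__init__ and GPT.forward; the exact names could be off):

            # in __init__ (sketch): one embedding table per layer instead of a single wte
            self.transformer.wte = nn.ModuleList(
                [nn.Embedding(config.vocab_size, config.n_embd) for _ in range(config.n_layer)]
            )

            # in forward (sketch): mix in that layer's embedding before each block
            x = self.transformer.drop(self.transformer.wte[0](idx) + pos_emb)
            for i in range(len(self.transformer.h)):
                if i > 0:
                    x = x + self.transformer.wte[i](idx)  # per-layer embedding of the same tokens
                x = self.transformer.h[i](x)
            x = self.transformer.ln_f(x)

        You'd also have to sort out the weight tying, since NanoGPT ties lm_head.weight to wte.weight; with a ModuleList you'd tie it to one of the tables or just drop the tying.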

        I haven't tried yet though (I've got a terrible cold, so I am on social media instead of doing anything sensible).

        • impossiblefork 10 days ago

          Also, this particular thing didn't work on my toy problems. It might still be good though.

      • limoce 12 days ago

        > https://preview.redd.it/wca7kzfq5w2f1.png?width=1190&format=...

        The "4x gated residual streams" look quite weird. Is there any paper or technical report on this?

        • 3abiton 12 days ago

          While PLE is quite innovative, the interesting part is that they released their [apk on github](https://github.com/google-ai-edge/gallery) rather than just linking to the Play Store. Interesting choice.