I think this is very interesting, especially the per-layer embedding part.
Having more than one embedding is something I've tried myself, but not separate ones for each layer.
I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
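Something like this rough PyTorch sketch, all of it my guess (the block layout, the names, and the plain nn.MultiheadAttention are assumptions, and I've skipped the "position encoding baked in" part to keep it short):

```python
# Minimal sketch of the guessed per-layer-embedding block above; everything
# here is an assumption about how it might work, not the actual implementation.
import torch
import torch.nn as nn

class GuessedPLEBlock(nn.Module):
    def __init__(self, vocab_size, d_model, n_head):
        super().__init__()
        self.embed_l = nn.Embedding(vocab_size, d_model)  # this layer's own token embedding
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

    def forward(self, h, token_ids):
        # h_{l+1} = Attention(MLP(h_l) + embed_l(token_ids)), as guessed above
        x = self.mlp(h) + self.embed_l(token_ids)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

# tiny smoke test
h = torch.randn(2, 16, 64)
ids = torch.randint(0, 100, (2, 16))
print(GuessedPLEBlock(100, 64, 4)(h, ids).shape)  # torch.Size([2, 16, 64])
```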
Any resources or suggestions to learn about this? The field is moving too fast, my poor brain can't keep up.
Basically, you'd familiarize yourself with transformers by implementing different variants of them and modifying them with your own ideas on various toy datasets.
Then you'd figure out a set of toy tasks that you like and think are important.
In this particular case you'd take something like NanoGPT, open model.py, go to class GPT, and in __init__ change the nn.Embedding in the self.transformer ModuleDict to a ModuleList of nn.Embedding, one per layer. Then in forward you'd change the for loop over blocks (around line 180) to loop over a range and add x = x + self.transformer.wte[i](idx) inside it, something like that I think.
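A self-contained toy version of that edit would look roughly like this (names mirror NanoGPT's wte/wpe/h/ln_f, but I'm paraphrasing from memory rather than copying model.py, and I've used nn.TransformerEncoderLayer with no causal mask just to keep it short):

```python
# Toy version of the edit described above: one nn.Embedding per layer in a
# ModuleList, added to the residual stream before each block runs. Names
# mirror NanoGPT's (wte, wpe, h, ln_f) but this is a from-memory sketch.
import torch
import torch.nn as nn

class ToyPerLayerGPT(nn.Module):
    def __init__(self, vocab_size=256, block_size=64, n_layer=4, n_head=4, n_embd=128):
        super().__init__()
        self.transformer = nn.ModuleDict(dict(
            wte=nn.ModuleList([nn.Embedding(vocab_size, n_embd) for _ in range(n_layer)]),
            wpe=nn.Embedding(block_size, n_embd),
            h=nn.ModuleList([
                nn.TransformerEncoderLayer(n_embd, n_head, 4 * n_embd,
                                           batch_first=True, norm_first=True)
                for _ in range(n_layer)
            ]),
            ln_f=nn.LayerNorm(n_embd),
        ))
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.transformer.wpe(pos)                 # start from position embeddings only
        for i, block in enumerate(self.transformer.h):
            x = x + self.transformer.wte[i](idx)      # this layer's own token embedding
            x = block(x)                              # no causal mask, just a sketch
        return self.lm_head(self.transformer.ln_f(x))

logits = ToyPerLayerGPT()(torch.randint(0, 256, (2, 32)))  # shape (2, 32, 256)
```

In the real model.py you'd also have to deal with the wte/lm_head weight tying, e.g. drop it or tie lm_head to one of the per-layer embeddings.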
I haven't tried yet though (I've got a terrible cold, so I am on social media instead of doing anything sensible).
Also, this particular thing didn't work on my toy problems. It might still be good though.
> https://preview.redd.it/wca7kzfq5w2f1.png?width=1190&format=...
"4x gated residual streams" look quite weird. Is there any paper or technique report for this?
While PLE is quite innovative, the interesting part is that they released their [APK on GitHub](https://github.com/google-ai-edge/gallery) instead of just linking to the Play Store. Interesting choice.