I wonder how much of the improvement is owed to each change. I've also never heard of "Muon - Momentum Orthogonalized by Newton-Schulz" being used.
EDIT: there's a bit more info on his twitter - https://x.com/kellerjordan0
It looks like he created this optimizer. It only works on 2D matrices.
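For anyone wondering what "orthogonalized by Newton-Schulz" could mean in practice, here is a rough sketch. It assumes the textbook cubic Newton-Schulz iteration, which is not necessarily the variant the repo uses, and the function name `newton_schulz_orthogonalize` and the example matrix are made up for illustration. The iteration is built from matrix products, which would explain the 2D-only restriction.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the nearest semi-orthogonal matrix to a 2D array m.

    Hypothetical helper: uses the classic cubic Newton-Schulz iteration,
    not necessarily the exact scheme used by Muon.
    """
    if m.ndim != 2:
        raise ValueError("expected a 2D matrix")
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Each step pushes every singular value s toward 1 via s -> 1.5*s - 0.5*s**3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Stand-in for a momentum buffer of a 4x3 weight matrix (made-up values).
update = np.array([[2.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 3.0],
                   [0.0, 2.0, 1.0]])
ortho = newton_schulz_orthogonalize(update)
```

After enough steps the columns of `ortho` are approximately orthonormal, so `ortho.T @ ortho` is close to the identity. An optimizer built this way would presumably apply the orthogonalized matrix as the update in place of the raw momentum.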
Just needs a Zero To Hero series episode offering line-by-line commentary to follow along on why each choice was made over the alternatives.
Cool work. No license?
So it compresses info better.
That is literally intelligence.
Seems like this is a modded NanoGPT, not the original.
Yes. It’s literally called “Modded-NanoGPT”.