Oh hey, I wrote this. Been a long time. I had the lucky of break of working in machine translation / parsing when the most important invention of the century happened in my niche field.
I'm pretty interested in the intersection of code / ML. If that's your thing here are some other writing you might be interested in.
* Thinking about cuda: http://github.com/srush/gpu-puzzles
* Tensors considered harmful: https://nlp.seas.harvard.edu/NamedTensor
* Differentiating SVG: https://srush.github.io/DiffRast/
* Annotated S4: https://srush.github.io/annotated-s4/
Recently moved back to industry, so haven't had a chance to write in a while.
Actually realize this is to the modern version not the original. So props to Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman who rewrote this one.
I loved the GPU puzzles, after completing all of them, I wished there were more. Learnt a bunch in process.
This is aawesome, thanks for the links and the write ups!
When getting to the attention part, I really wish people stop describing it as Key Query Value. There is nothing special about Key or Query or Value in the sense of their implied function in the transformer. The KQV matrices themselves are computed by multiplying the input vector by learned weights, which are arbitrarily random matrices that come together in the end to the correct result. I.e it doesn't matter if you have 26 or 34 for a final result of 12.
The thing that makes transformers work is multi dimensionality in the sense that you are multiplying matricies by matricies instead of computing dot products on vectors. And because matrix multiplication is effectively sums of dot products, you can represent all of the transformer as wide single layer perceptron sequences (albeit with a lot of zeros), but mathematically they would do the same thing.
I'd disagree as the K, Q and V have distinct functions within the attention calculation. In particular when you're considering decode (next token calculation during inference which follows the initial prefill stage that processes the prompt). For decode you have a single Q vector (relating to the in progress token) and multiple K and V vectors (your context, i.e. all tokens that have already been computed).
> you can represent all of the transformer as wide single layer perceptron sequences
This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input. The same mathematical function can be used to compute attention however there are no static weights. You've got your attention scores on one side and the V matrix on the other side.
Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism given they inherently have static weights and they can't directly multiply two inputs (or directly multiply two internal activations). Given an MLP is a general function approximater I guess a sufficiently large number of them could get close enough?
wow - this is really well made! i've been doing research w/ Transformer-based audio/speech models and this is made with incredible detail. Attention as a concept itself is already quite unintuitive for beginners due to is non-linearity, so this also explains it very well
> Attention as a concept itself is already quite unintuitive
Once you realize that Attention is really just a re-framing of Kernel Smoothing it becomes wildly more intuitive [0]. It also allows you to view Transformers as basically learning a bunch of stacked Kernels which leaves them in a surprisingly close neighborhood to Gaussian Processes.
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
Nice read
> I'd be grateful for any pointers to an example where system developers (or someone else in a position to know) have verified the success of a prompt extraction.
You can try this yourself with any open source llm setup that lets you provide a system prompt no? Just give it a prompt, ask the model the prompt ,and see if it matches.
gpt-oss is trained to refuse so it wont share (you can provide system prompt on lmstudio)
It’s a very popular article that has been around for a long time!
It's so good it is worth revisiting often