If you use the Cursor IDE: the folks who wrote it talked about their use of speculative decoding to make "Apply" faster on the Lex Fridman podcast last month.
It's on YouTube, and you can also find it on Spotify and other podcast platforms.
For those who prefer text: it seems they use a weaker but faster model to draft the "predicted output" / speculation, which the stronger model then only has to verify. Pretty smart.
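If it helps, here's the shape of the idea in toy Python. The draft_next/target_next callables are made-up stand-ins for a fast/weak and slow/strong model; a real system verifies all the draft tokens in one batched forward pass of the big model, which is where the speedup comes from:

    def speculative_decode(prompt, draft_next, target_next, k=4, max_len=24):
        """Greedy speculative decoding over token lists.

        draft_next / target_next are hypothetical callables
        (token list -> next token). A real implementation checks
        all k draft tokens in a single batched forward pass.
        """
        tokens = list(prompt)
        while len(tokens) < max_len:
            # Fast model guesses k tokens ahead.
            draft = []
            for _ in range(k):
                draft.append(draft_next(tokens + draft))
            # Big model checks the guesses; keep the longest matching prefix.
            for i, guess in enumerate(draft):
                expected = target_next(tokens + draft[:i])
                if guess != expected:
                    # First wrong guess: the verify step still yields the
                    # big model's own token, then we re-speculate from here.
                    tokens.extend(draft[:i] + [expected])
                    break
            else:
                tokens.extend(draft)  # all k accepted: k tokens, ~1 verify pass
        return tokens[:max_len]

    # Toy demo: the draft model agrees with the target except every 7th token.
    target = lambda ts: len(ts)
    draft = lambda ts: len(ts) if len(ts) % 7 else -1
    print(speculative_decode([0], draft, target))  # [0, 1, 2, ..., 23]

(Real implementations often use rejection sampling on the token probabilities rather than exact token matching, so the output distribution still matches the big model's.)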
Fireworks AI has a blog post about it: https://fireworks.ai/blog/cursor
I found the OpenAI page to be more interesting: https://platform.openai.com/docs/guides/latency-optimization...
It's incredibly well written. I can see this being very helpful for newcomers.
As for the Predicted Outputs feature, it looks incredibly useful in a few of my pipelines. Can't wait to test it out.
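If anyone else is curious what a call looks like, this is my reading of the docs (untested sketch): you pass the text you expect the model to mostly reuse via the `prediction` parameter, and matching stretches of output come back much faster.

    from openai import OpenAI

    client = OpenAI()

    # The file we want lightly edited; most of it should survive unchanged.
    original = open("models.py").read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Rename the `username` field to `email`."},
            {"role": "user", "content": original},
        ],
        # Tokens that match this prediction can be accepted cheaply;
        # wherever the model diverges, it pays normal generation cost.
        prediction={"type": "content", "content": original},
    )

    print(resp.choices[0].message.content)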
This is like the likely() and unlikely() macros in the Linux kernel! (Those expand to GCC's __builtin_expect, hinting which branch is probable.) Huge speedup if you're right; small penalty if you're not.
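To put toy numbers on that asymmetry (assuming a draft/guess pass costs ~10% of a full pass — purely illustrative, not measured):

    # Illustrative cost model for guess-then-verify speculation.
    draft_cost, verify_cost, k = 0.1, 1.0, 4  # k tokens speculated per round

    round_cost = k * draft_cost + verify_cost  # k cheap guesses + 1 big verify
    best = k / round_cost    # all k guesses right: k tokens for 1.4 units
    worst = 1 / round_cost   # first guess wrong: verify still yields 1 token
    print(f"best {best:.2f}x, worst {worst:.2f}x vs. one-token-per-pass baseline")
    # best 2.86x, worst 0.71x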
Any recommendations for high-level overview/learning resources about this? It seems interesting, but like most Linux internals, things get real technical, real quick.