There's like an interesting systems article here, but at this point I'd rather they just gave me the prompt they used to generate it, so I can read it interactively in my own GPT5.5 session.
Hard to read article. The writing is curiously more robotic and repetitive than those written by AI.
It's strange - like someone went for brevity, but without the usual exercise of packing meaning into each sentence. There's a lot of fluff in the shape of serious writing, lol.
i like how even though i can parse most of it, it still sounds like technically accurate technobabble; could be inspiration for a tv show :D
Take this with a grain of salt as I am new to this, but IMHO for establishing the memory hierarchy once and for all, it would be more helpful to present some abstract theory that covers:
* Prefill (time to first token, TTFT) vs decode (time between tokens, TBT, aka 1/tps)
* The various ways to schedule the computation, and the roles of runtime vs driver
* The scenarios and choices, taking into account traffic patterns, whether you are an inference service or doing batch or claw whatnot.
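To make the first bullet concrete, here is a minimal sketch of how TTFT and TBT fall out of per-token timestamps. The function name `latency_metrics` and the sample timestamps are made up for illustration; they don't come from any particular inference engine.

```python
def latency_metrics(request_sent, token_times):
    """Return (ttft, mean_tbt, tps) from wall-clock timestamps in seconds.

    request_sent: when the prompt was submitted.
    token_times:  arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Prefill cost shows up as the wait before the first token.
    ttft = token_times[0] - request_sent
    # Decode cost shows up as the gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else float("nan")
    tps = 1.0 / mean_tbt if gaps and mean_tbt > 0 else float("nan")
    return ttft, mean_tbt, tps


# Hypothetical request: first token after 0.8 s, then one token every 50 ms.
ttft, tbt, tps = latency_metrics(0.0, [0.8, 0.85, 0.9, 0.95])
print(f"TTFT={ttft:.2f}s  TBT={tbt * 1000:.0f}ms  ~{tps:.0f} tokens/s")
```

A long prompt mostly moves TTFT, while batch size and memory bandwidth mostly move TBT, which is why the two are worth reporting separately.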
I generally don’t fill my context with enough stuff that this becomes a problem. I don’t think more data = better on the token side. Instead I’d be researching with focused prompts or subagents and surfacing only relevant context to a primary agent.
ok, so for anyone who's not played with local models and watched what's going on with the KV cache:
1. You send your prompt, and nowadays whatever harness you're using sends a whole mess of context along with it: available skills, tools, guardrails, etc. The inference engine tokenizes all of it and runs the tokens through the model in one parallel pass, computing the attention keys and values for each token. This is the "Prompt Processing" (prefill) speed and it's the fastest portion of inference per token, but it is essentially "buffering" before any output shows up. These keys and values can be cached.
2. The engine then generates the next tokens, more slowly, one at a time (decode); the keys and values for these are cached as well, I think.
Crucially: the KV cache typically lives in GPU memory (VRAM); it is not a separate caching layer on slower storage, and if it were, that'd make it extremely slow, because it's storing keys and values for _all_ the tokens in a conversation. So, like any cache, eviction has to occur to free up the VRAM needed for other requests.
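For a rough sense of why eviction is unavoidable, here's some back-of-the-envelope arithmetic. The model shape below (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration picked for illustration, not a claim about any specific hosted model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Hypothetical model shape (an assumption, see above).
per_token = kv_cache_bytes(1, n_layers=80, n_kv_heads=8, head_dim=128)
total = kv_cache_bytes(500_000, n_layers=80, n_kv_heads=8, head_dim=128)

print(per_token / 1024, "KiB per token")
print(round(total / 2**30, 1), "GiB for a 500k-token conversation")
```

Under those assumptions a single 500k-token conversation would need roughly 150 GiB of cache, which is more than one GPU's VRAM, so an idle conversation can't stay resident for long.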
So if you had a conversation an hour ago, in the cloud, it's doubtful any of that cache still exists, so if you got up to 500k tokens, you're going through step #1 again for the whole context; if you're going turn by turn immediately, only the new message goes through #1 and you can mostly skip to #2.
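A toy model of that hit-or-miss behaviour, as I understand it. Everything here is a simplification: `ToyKVCache` is a made-up class, it tracks one entry per conversation with plain LRU eviction, and real engines usually cache at the granularity of token blocks or shared prefixes.

```python
from collections import OrderedDict


class ToyKVCache:
    """Tracks how many tokens of each conversation are still cached."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.entries = OrderedDict()  # conversation id -> cached token count

    def tokens_to_prefill(self, conv_id, context_len):
        """Return how many tokens must go through prefill for this turn."""
        cached = min(self.entries.pop(conv_id, 0), context_len)
        todo = context_len - cached
        self.entries[conv_id] = context_len  # re-insert as most recently used
        self._evict()
        return todo

    def _evict(self):
        # Drop least recently used conversations until we fit again.
        # (A single oversized entry is kept, to keep the toy simple.)
        while sum(self.entries.values()) > self.capacity and len(self.entries) > 1:
            self.entries.popitem(last=False)


cache = ToyKVCache(capacity_tokens=600_000)
first_turn = cache.tokens_to_prefill("a", 500_000)   # cold: full prefill
next_turn = cache.tokens_to_prefill("a", 500_200)    # warm: only the new tokens
cache.tokens_to_prefill("b", 300_000)                # someone else pushes "a" out
after_evict = cache.tokens_to_prefill("a", 500_400)  # cold again: full prefill
print(first_turn, next_turn, after_evict)
```

The immediate follow-up only pays for 200 new tokens, while the same conversation after eviction pays for all 500,400 again.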
So some of the reports in March about the whole token gen allowance suddenly disappearing within hours were likely a KV cache/billing issue: they were charging you as if you were generating all those tokens for every back and forth. Whether it was a bug in billing vs a bug in programming, who knows.
The trouble is that the traditional webserver tricks of proxy caching and load balancing that helped scale the web don't work here! Your conversation with 100k context has to return to the same cluster, maybe even the same GPU, to benefit from the extraordinarily fast KV cache reuse.
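One way to picture the routing constraint is session affinity: instead of round-robin, hash the conversation ID so the same conversation keeps landing on the same worker. This is only a sketch under my own assumptions; `pick_worker` and the worker names are hypothetical, and I don't know how any given provider actually routes requests.

```python
import hashlib


def pick_worker(conversation_id, workers):
    """Map a conversation to a worker deterministically.

    Uses a stable hash (not Python's per-process randomized hash())
    so every router instance makes the same choice.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]  # hypothetical worker names
first = pick_worker("conv-123", workers)
second = pick_worker("conv-123", workers)
print(first, second)  # same worker both times
```

Plain modulo hashing like this remaps most conversations whenever the worker list changes; consistent hashing reduces that churn, at the cost of a bit more bookkeeping.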