You can exploit this and build your own stream of streams for interesting use cases: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond... (screen in action: https://chrlschn.dev/img/need-for-speed/generation-example.g...)
Most interesting is combining this with web components and having GPT directly output streams with small web components.
That gif is really cool! I built a Python package magentic [0] which similarly parses the LLM streamed output and allows it to be used before it is finished being generated. There are plenty of use cases / prompts that can be refactored into a "generate list, then generate for each item" pattern to take advantage of this speedup from concurrent generation.
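A minimal sketch of that pattern with asyncio, using stand-in coroutines in place of real LLM calls (generate_items and expand are hypothetical, not magentic's API): each per-item generation is kicked off as soon as that item appears in the stream, rather than after the whole list has finished.

```python
import asyncio

async def generate_items():
    # Stand-in for an LLM stream that yields list items as they are parsed.
    for item in ["intro", "body", "conclusion"]:
        await asyncio.sleep(0)  # simulate streaming latency
        yield item

async def expand(item):
    # Stand-in for a per-item LLM call.
    await asyncio.sleep(0)
    return f"section about {item}"

async def main():
    # Start each per-item generation as soon as the item arrives,
    # instead of waiting for the full list to finish streaming.
    tasks = []
    async for item in generate_items():
        tasks.append(asyncio.create_task(expand(item)))
    return await asyncio.gather(*tasks)

sections = asyncio.run(main())
print(sections)
```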
When you ask for JSON data with streaming enabled, you will notice that the in-flight response is incomplete and standard JSON libraries reject it with parse errors. You have to wait for the entire stream to complete.
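For example, feeding a truncated chunk to the standard library's json module fails immediately:

```python
import json

# A truncated chunk, as you might see mid-stream.
partial = '{"items": [{"name": "first"}, {"na'
try:
    json.loads(partial)
    parsed = True
except json.JSONDecodeError:
    parsed = False
print("parseable:", parsed)
```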
To solve this problem I tried to define a spec and built a lib for it:
- [lib] https://github.com/st3w4r/openai-partial-stream/tree/main
- [spec] https://github.com/st3w4r/openai-partial-stream/blob/main/sp...
Very interesting. I tried to solve this problem too, and my code parses incomplete JSON allowing partial values and fully complete values to be accessed.
Why do you wait for the entire stream to be complete? Some objects in the JSON structure can be shown to be complete before the stream ends.
Yeah, it's an interesting problem to solve. The library is designed to parse incomplete json without waiting for the stream to finish.
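A minimal stdlib sketch of that idea (not the library's actual implementation): use `json.JSONDecoder.raw_decode` to pull out the top-level array elements that have fully arrived, stopping at the first element that is still incomplete.

```python
import json

def completed_elements(partial):
    """Return the fully parsed elements of a (possibly incomplete) JSON array."""
    decoder = json.JSONDecoder()
    i = partial.find("[")
    if i == -1:
        return []
    i += 1
    out = []
    while True:
        # Skip whitespace and separators between elements.
        while i < len(partial) and partial[i] in " \t\n\r,":
            i += 1
        if i >= len(partial) or partial[i] == "]":
            break
        try:
            value, i = decoder.raw_decode(partial, i)
        except json.JSONDecodeError:
            break  # this element is still streaming in
        out.append(value)
    return out

print(completed_elements('[{"id": 1}, {"id": 2}, {"id'))
```

Call it on the growing buffer after each chunk arrives and compare against what you have already emitted.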
I’ve been using the ijson Python library for that - I have notes on that here: https://til.simonwillison.net/json/ijson-stream
Pydantic also have support for parsing partial JSON. https://docs.pydantic.dev/latest/concepts/json/#partial-json...
from pydantic_core import from_json
partial_json_data = '["aa", "bb", "c'
result = from_json(partial_json_data, allow_partial=True)
print(result)
#> ['aa', 'bb']
You can also use their `jiter` package directly if you don't otherwise use pydantic. https://github.com/pydantic/jiter/tree/main/crates/jiter-pyt...

That's neat, I hadn't seen that. The docs were lacking, so I submitted a PR: https://github.com/pydantic/jiter/pull/143
Nice, it looks like a good library to build on top of. I like the available events: start_map, end_map, etc. I did try a JS library with similar events, but it lacked the granularity to cover use cases for individual fields rather than an entire item. I'll keep this one in mind if I do JSON streaming in Python.
These are great. I've been working on trying to get markup working with streaming and it's a seemingly hard problem. This should help with figuring it out!
Awesome, works great! Love the modes "Real-time", "progressive", etc.
Thanks! Yeah, creating an abstraction over the raw JSON and how you want to use it in your code makes it more practical.
Here’s a bit more info on generating streams like this: https://parnassus.co/building-a-copilot-1-server-fundamental...
I’m slowly building a copilot stack, and end up wrapping multiple layers of streaming: SSE as in this article, parsed on the fly as it streams from JSON (i.e. parsing incomplete, invalid JSON), parsed on the fly as it streams to extract Markdown, parsed on the fly as it streams to format that Markdown and render it. You can read about this here: https://parnassus.co/building-a-copilot-2-parsing-converting...
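A toy sketch of what wrapping layers like that can look like as chained generators, where each layer consumes the previous one incrementally (the `"delta"` payload shape here is invented for illustration, not any particular API):

```python
import json

def sse_events(chunks):
    """Layer 1: reassemble complete SSE 'data:' payloads from raw network chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.splitlines():
                if line.startswith("data: "):
                    yield line[len("data: "):]

def text_deltas(payloads):
    """Layer 2: parse each JSON payload and extract the text fragment."""
    for payload in payloads:
        if payload != "[DONE]":
            yield json.loads(payload)["delta"]

# Simulated network chunks, deliberately split at awkward places.
chunks = ['data: {"delta": "Hel', 'lo"}\n\ndata: {"del',
          'ta": " world"}\n\ndata: [DONE]\n\n']
text = "".join(text_deltas(sse_events(chunks)))
print(text)
```

A Markdown-extracting layer would slot in the same way, consuming `text_deltas` one fragment at a time.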
Interesting subject, but came here to comment that you are “doing the lord’s work” by writing an LLM tool for Delphi developers. All six of them! (i kid) Best of luck with Owl.
Thank you! I plan to expand it to other under-served languages. Delphi is a fun starting one :)
I have been working with streaming LLMs over Server-Sent Events. SSE provides a very simple interface to work with, but you can feel it was never designed for this use case. As mentioned in the blog post:
> Annoyingly these can't be directly consumed using the browser EventSource API because that only works for GET requests, and these APIs all use POST.
It was not designed for sending data in the request that opens the connection, so you will struggle to use this streaming approach with frameworks and libraries built around the EventSource API.
EventSource is really really limited. However, you can instead use Fetch via something like https://github.com/Azure/fetch-event-source to consume SSEs.
This looks very good. The Fetch API is a nice one, so leveraging it sounds perfect. Thanks for the link.
OpenAI streaming has many peculiarities at production scale.
E.g. you will occasionally get “half-chunks” which are not parseable on their own and must be concatenated with the previous or subsequent chunk before parsing.
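A hedged sketch of the buffering this forces on you, assuming each logical payload is one JSON object (the chunk contents here are invented): keep appending raw reads until the accumulated buffer parses.

```python
import json

def parse_stream(raw_chunks):
    """Concatenate half-chunks until each payload parses as valid JSON."""
    pending = ""
    for raw in raw_chunks:
        pending += raw
        try:
            obj = json.loads(pending)
        except json.JSONDecodeError:
            continue  # half-chunk: wait for the next piece
        pending = ""
        yield obj

# One payload arrives split across two network reads.
halves = ['{"choices": [{"text": "a', 'bc"}]}']
parsed = list(parse_stream(halves))
print(parsed)
```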
I do actually wonder if it's more efficient to use something like MessagePack instead of using JSON. It's a lot of strings so it may not matter too much I guess.
Should really be titled "streaming output", as full-duplex streaming isn't mentioned at all. That'd be necessary for low-latency things like speech, etc.
Do you know of any public APIs from LLM vendors that do that?
As far as I know the ChatGPT voice chat API isn’t public.
Looks like a ton of wasted data on extraneous fields