You can exploit this and build your own stream of streams for interesting use cases: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond... (screen in action: https://chrlschn.dev/img/need-for-speed/generation-example.g...)
Most interesting is combining this with web components and having GPT directly output streams with small web components.
That gif is really cool! I built a Python package magentic [0] which similarly parses the LLM streamed output and allows it to be used before it is finished being generated. There are plenty of use cases / prompts that can be refactored into a "generate list, then generate for each item" pattern to take advantage of this speedup from concurrent generation.
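A minimal sketch of that pattern with asyncio, using stand-in coroutines in place of real LLM calls (generate_items and expand are hypothetical, not magentic's API): each per-item generation is kicked off as soon as that item appears in the stream, rather than after the whole list has finished.

```python
import asyncio

async def generate_items():
    # Stand-in for an LLM stream that yields list items as they are parsed.
    for item in ["intro", "body", "conclusion"]:
        await asyncio.sleep(0)  # simulate streaming latency
        yield item

async def expand(item):
    # Stand-in for a per-item LLM call.
    await asyncio.sleep(0)
    return f"section about {item}"

async def main():
    # Start each per-item generation as soon as the item arrives,
    # instead of waiting for the full list to finish streaming.
    tasks = []
    async for item in generate_items():
        tasks.append(asyncio.create_task(expand(item)))
    return await asyncio.gather(*tasks)

sections = asyncio.run(main())
print(sections)
```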
When you ask for JSON data with streaming enabled, you will notice that the in-flight response is incomplete and standard JSON libraries reject it with parse errors. You have to wait for the entire stream to complete.
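For example, feeding a truncated chunk to the standard library's json module fails immediately:

```python
import json

# A truncated chunk, as you might see mid-stream.
partial = '{"items": [{"name": "first"}, {"na'
try:
    json.loads(partial)
    parsed = True
except json.JSONDecodeError:
    parsed = False
print("parseable:", parsed)
```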
To solve this problem I tried to define a spec and built a lib for it:
- [lib] https://github.com/st3w4r/openai-partial-stream/tree/main
- [spec] https://github.com/st3w4r/openai-partial-stream/blob/main/sp...
Very interesting. I tried to solve this problem too, and my code parses incomplete JSON allowing partial values and fully complete values to be accessed.
Why do you wait for the entire stream to be complete? Some objects in the JSON structure can be shown to be complete before the stream ends.
Yeah, it's an interesting problem to solve. The library is designed to parse incomplete json without waiting for the stream to finish.
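A minimal stdlib sketch of that idea (not the library's actual implementation): use `json.JSONDecoder.raw_decode` to pull out the top-level array elements that have fully arrived, stopping at the first element that is still incomplete.

```python
import json

def completed_elements(partial):
    """Return the fully parsed elements of a (possibly incomplete) JSON array."""
    decoder = json.JSONDecoder()
    i = partial.find("[")
    if i == -1:
        return []
    i += 1
    out = []
    while True:
        # Skip whitespace and separators between elements.
        while i < len(partial) and partial[i] in " \t\n\r,":
            i += 1
        if i >= len(partial) or partial[i] == "]":
            break
        try:
            value, i = decoder.raw_decode(partial, i)
        except json.JSONDecodeError:
            break  # this element is still streaming in
        out.append(value)
    return out

print(completed_elements('[{"id": 1}, {"id": 2}, {"id'))
```

Call it on the growing buffer after each chunk arrives and compare against what you have already emitted.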
I’ve been using the ijson Python library for that - I have notes on that here: https://til.simonwillison.net/json/ijson-stream
Pydantic also have support for parsing partial JSON. https://docs.pydantic.dev/latest/concepts/json/#partial-json...
from pydantic_core import from_json
partial_json_data = '["aa", "bb", "c'
result = from_json(partial_json_data, allow_partial=True)
print(result)
#> ['aa', 'bb']
You can also use their `jiter` package directly if you don't otherwise use pydantic. https://github.com/pydantic/jiter/tree/main/crates/jiter-pyt...

That's neat, I hadn't seen that. The docs were lacking, so I submitted a PR: https://github.com/pydantic/jiter/pull/143
Nice, it looks like a good library to build on top of. I like the available events: start_map, end_map, etc. I did try a JS library with similar events, but it lacked the granularity to cover use cases for individual fields rather than an entire item. I'll keep this one in mind if I do JSON streaming in Python.
These are great. I've been working on trying to get markup working with streaming and it's a seemingly hard problem. This should help with figuring it out!
Awesome, works great! Love the modes "Real-time", "progressive", etc.
Thanks! Yeah, creating an abstraction over the raw JSON and how you want to use it in your code makes it more practical.
Here’s a bit more info on generating streams like this: https://parnassus.co/building-a-copilot-1-server-fundamental...
I’m slowly building a copilot stack, and end up wrapping multiple layers of streaming: SSE as in this article, parsed on the fly as it streams from JSON (i.e. parsing incomplete, invalid JSON), parsed on the fly as it streams to extract Markdown, parsed on the fly as it streams to format that Markdown and render it. You can read about this here: https://parnassus.co/building-a-copilot-2-parsing-converting...
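A toy sketch of what wrapping layers like that can look like as chained generators, where each layer consumes the previous one incrementally (the `"delta"` payload shape here is invented for illustration, not any particular API):

```python
import json

def sse_events(chunks):
    """Layer 1: reassemble complete SSE 'data:' payloads from raw network chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.splitlines():
                if line.startswith("data: "):
                    yield line[len("data: "):]

def text_deltas(payloads):
    """Layer 2: parse each JSON payload and extract the text fragment."""
    for payload in payloads:
        if payload != "[DONE]":
            yield json.loads(payload)["delta"]

# Simulated network chunks, deliberately split at awkward places.
chunks = ['data: {"delta": "Hel', 'lo"}\n\ndata: {"del',
          'ta": " world"}\n\ndata: [DONE]\n\n']
text = "".join(text_deltas(sse_events(chunks)))
print(text)
```

A Markdown-extracting layer would slot in the same way, consuming `text_deltas` one fragment at a time.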
Interesting subject, but came here to comment that you are “doing the lord’s work” by writing an LLM tool for Delphi developers. All six of them! (i kid) Best of luck with Owl.
Thank you! I plan to expand it to other under-served languages. Delphi is a fun starting one :)
I have been working with streaming LLMs over Server-Sent Events. SSE provides a very simple interface to work with, but you can feel it was never designed for this use case. As mentioned in the blog post:
> Annoyingly these can't be directly consumed using the browser EventSource API because that only works for GET requests, and these APIs all use POST.
It was not designed for sending data in the request that opens the connection, so you will struggle to use this streaming approach with frameworks and libraries built around the EventSource API.
EventSource is really really limited. However, you can instead use Fetch via something like https://github.com/Azure/fetch-event-source to consume SSEs.
This looks very good. The Fetch API is a nice one, so leveraging it sounds perfect. Thanks for the link.
OpenAI streaming has many peculiarities at production scale.
E.g. you will occasionally get “half-chunks” which are not parseable on their own and must be concatenated with the previous or subsequent chunk before parsing.
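A hedged sketch of the buffering this forces on you, assuming each logical payload is one JSON object (the chunk contents here are invented): keep appending raw reads until the accumulated buffer parses.

```python
import json

def parse_stream(raw_chunks):
    """Concatenate half-chunks until each payload parses as valid JSON."""
    pending = ""
    for raw in raw_chunks:
        pending += raw
        try:
            obj = json.loads(pending)
        except json.JSONDecodeError:
            continue  # half-chunk: wait for the next piece
        pending = ""
        yield obj

# One payload arrives split across two network reads.
halves = ['{"choices": [{"text": "a', 'bc"}]}']
parsed = list(parse_stream(halves))
print(parsed)
```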
I do actually wonder if it's more efficient to use something like MessagePack instead of using JSON. It's a lot of strings so it may not matter too much I guess.
Should really be titled "streaming output", as full-duplex streaming isn't mentioned at all. That'd be necessary for low-latency things like speech, etc.
Do you know of any public APIs from LLM vendors that do that?
As far as I know the ChatGPT voice chat API isn’t public.
Looks like a ton of wasted data on extraneous fields