We were investigating speech-to-speech for a project a while back and estimated that building an end-to-end solution with the previous approach would take us weeks at best for an MVP (because the pipeline was basically: speech -> Whisper STT model -> text, retrieval, API calls, etc. -> prompt -> LLM -> text -> TTS model -> speech). If this works as advertised it could cut the amount of work required quite significantly; excited to try it out (when it's available in Europe, that is…).
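For a sense of what that "long" pipeline looks like in code, here's a minimal sketch, assuming the openai-whisper package for STT and the OpenAI SDK for the LLM and TTS legs (file names, model choices, and the single-turn prompt are all placeholders; the retrieval/API-call step is elided):

```python
import whisper
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech -> text (local Whisper model)
stt_model = whisper.load_model("base")
user_text = stt_model.transcribe("input.wav")["text"]

# 2. Text (retrieval, API calls, etc. would slot in here) -> prompt -> LLM
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_text}],
)
reply = completion.choices[0].message.content

# 3. Text -> speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("reply.mp3")
```

Every hop adds latency and a failure mode, which is exactly why collapsing it into one speech-to-speech model is appealing.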
It's not a production-grade thing, but Home Assistant has this pipeline built in, and you can swap out any of the three steps:
* STT
* LLM
* TTS
It's pretty cool to be able to swap one of the parts, run some tests, then change another part.
Again, it's nothing you would use directly for a product, but it's fairly easy to test your pipeline by plugging different components into each stage. (Also, HA provides each component out of the box if you want it to handle STT/TTS and just test your LLM.)
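This isn't HA's actual API, just the shape of the swappable-stage idea sketched in Python: three interchangeable interfaces (all names here are hypothetical) and a pipeline that composes whichever implementations you're currently testing:

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def run_pipeline(audio: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """Run one voice turn; swap any single stage independently to A/B test it."""
    return tts.synthesize(llm.reply(stt.transcribe(audio)))
```

To compare setups you'd call something like `run_pipeline(audio, stt=WhisperSTT(), llm=LocalLLM(), tts=AzureTTS())` and change exactly one argument at a time (those class names are made-up placeholders).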
Add VAD to this list and it's basically the same stack I'm running on mobile phones (on-device). It doesn't beat OpenAI's voice chat in terms of speed or intelligence, but it's fun.
The LLM part isn't great, of course, due to the small model size. Still experimenting with different models/tweaks until I'm satisfied enough with the overall result on a recent-ish iPhone/Pixel.
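For reference, the VAD step is usually just a gate in front of the STT model so you only transcribe actual speech. A minimal sketch with the webrtcvad package, assuming 16 kHz 16-bit mono PCM (the library only accepts 10/20/30 ms frames at 8/16/32/48 kHz):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (lenient) .. 3 (strict)

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample


def speech_frames(pcm: bytes):
    """Yield only the frames the VAD flags as speech, to feed the STT model."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i : i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```

On-device this matters twice over: it keeps the STT model from burning battery on silence and gives you end-of-utterance detection for turn-taking.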
For what it's worth, I built an MVP with that pipeline in about 3 days, using the Azure AI Speech service and SDK. It worked pretty well despite the obviously long pipeline you described.
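For anyone curious, the Azure legs of such an MVP look roughly like this. This is a sketch from the public Python SDK, not my actual code; the key, region, and the LLM step in the middle are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")

# STT leg: capture one utterance from the default microphone
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
user_text = recognizer.recognize_once().text

# ... retrieval / prompt / LLM call goes here ...
reply = user_text  # placeholder for the LLM's answer

# TTS leg: speak the reply through the default speaker
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async(reply).get()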
It's frustrating that things like this get released by oAI, but you still can't use voice in the web app, or any of the advanced voice model features, without essentially emulating a phone.
It's hard to know who oAI is working for -- is it a developer resource group or an actual customer-facing business? It feels like they don't know, either.