• nickthegreek 3 hours ago

    The live test on https://play.ai/ didn't work for me in Firefox. Swapped to Chrome and it worked quickly. I cloned my voice in 30s and was instantly talking to myself. This would easily fool most people who know me. Wild stuff.

    • Mizza 2 hours ago

      What's SOTA for open source or on-device right now?

      I tried building a babelfish with o1, but the transcription in languages other than English is useless. When it gets the transcription right, the translations are pretty much perfect and the voice responses are super fast, but without good transcription it's kind of useless. So close!

      • kabirgoel 17 minutes ago

        I work at Cartesia, which operates a TTS API similar to Play [1]. I’d be willing to venture a guess and say that our TTS model, Sonic, is probably SoTA for on-device, but don't quote me on that claim. It's the same model that powers our API.

        Sonic can be run on a MacBook Pro. Our API sounds better, of course, since that's running the model on GPUs without any special tricks like quantization. But subjectively the on-device version is good quality and real-time, and it possesses all the capabilities of the larger model, such as voice cloning.

        Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)

        [1] https://cartesia.ai/sonic [2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886

        • diggan 2 hours ago

          I was literally just looking at that today, and the best one I came across was F5-TTS: https://swivid.github.io/F5-TTS/

          The only thing missing (for me) is "emotion tokens" instead of forcing the entire generation to use a single emotion; the generated voice is a bit too robotic otherwise.

          • moffkalast an hour ago

            > based on flow matching with Diffusion Transformer

            Yeah, that's not gonna be realtime. It's really odd that we currently have two options: VITS/Piper, which runs at a ludicrous speed on a CPU and is kinda OK, and these slightly more natural versions a la StyleTTS2 that take 2 minutes to generate a sentence with CUDA acceleration.

            Like, is there a middle ground? Maybe inverting one of the smaller whispers or something.
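
            For anyone who wants to poke at the fast end of that tradeoff, here's a minimal sketch driving Piper from Python. It assumes pip install piper-tts and a downloaded en_US-lessac-medium voice; the CLI flags follow Piper's README, so treat the exact invocation and paths as assumptions.

              import subprocess

              # Pipe a sentence into the piper CLI; it synthesizes a WAV using the ONNX voice model.
              text = "Checking how fast CPU-only synthesis really is."
              subprocess.run(
                  ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "out.wav"],
                  input=text.encode("utf-8"),
                  check=True,  # raise if piper exits non-zero
              )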

            • modeless an hour ago

              StyleTTS2 is faster than realtime

              • gunalx 42 minutes ago

                Bark?

            • amrrs 2 hours ago
              • refulgentis 2 hours ago

                I'm not sure what you mean exactly: this is TTS, but it sounds like you're expecting an answer about transcription.

                So it's hard to know both which category you'd like to hear about and, if you do mean transcription, what your baseline is.

                Whisper is widely regarded as the best in the free camp, but I wouldn't be surprised to see a paper on a model claiming better WER, or a much bigger model.

                If you meant you tried realtime 4o from OpenAI, and not o1*, it uses Whisper for transcription on the server, so I don't think you'll see much gain from trying Whisper yourself. My next try would be the Google Cloud APIs, but they're paid, and with regard to your question re: open source SOTA, the underlying model isn't open.

                But also, if you did mean 4o, the transcription shouldn't matter for output quality, since the model is taking in the voice directly (I verified their claim by noticing that when there are errors in the transcription, it still answers correctly).

                * I keep messing these two up when talking about it, and it seems unlikely you meant o1 because it has a long synchronous delay before any part of the answer is available, and doesn't take in audio.

                If you did mean o1, then I'd use realtime 4o for TTS and have it do the translation natively, as it will be unaffected by the transcription errors you're facing now.
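
                If you want to sanity-check the transcription leg on its own first, here's a minimal sketch with the open-source Whisper package. The model size, file name, and language code are placeholders; pinning the source language instead of relying on auto-detect tends to help a lot on non-English audio.

                  import whisper  # pip install openai-whisper

                  model = whisper.load_model("medium")  # the tiny/base models degrade quickly on non-English speech

                  # Transcribe in the source language; forcing language= skips flaky auto-detection.
                  result = model.transcribe("clip_es.wav", language="es")
                  print(result["text"])

                  # Or have Whisper translate straight to English in a single pass.
                  translated = model.transcribe("clip_es.wav", task="translate")
                  print(translated["text"])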

              • Yenrabbit 2 hours ago

                Quite disconcerting to have a low-latency chat with something that sounds like you! Can recommend the experience, very thought-provoking.

                • gyre007 2 hours ago

                  This is awesome! Over the summer I wrote API clients for both Go [1] and Rust [2], as we were using Play at my job at the time and there were only Python and Node SDKs.

                  [1] https://github.com/milosgajdos/go-playht [2] https://github.com/milosgajdos/playht_rs

                  • lyjackal 2 hours ago

                    Is there any way to use the TTS on its own? I maintain an Obsidian TTS plug-in and am starting to add new TTS providers (it's just been OpenAI thus far). From the documentation at https://docs.play.ai/documentation/get-started/introduction, their API seems to couple the TTS to an LLM for building conversational agents. It seems like it would be nice to use it standalone as just TTS.

                  • Aeolun an hour ago

                    That’s 12 times cheaper than the OpenAI models though. Those are already very good, so I can’t really see myself using this.

                    I really want a good on-device model though.

                    • BoppreH an hour ago

                      In the video demo, Play 3.0 mini (on the left) incorrectly claims that the other AI missed a word.

                      How does that end up in an announcement? Do people not notice, or not care? Or are they trying to show realistic mistakes?

                      • DevX101 3 hours ago

                        Has anyone done a comparison of combined speech-to-text and TTS vs. speech-to-speech for creating audio-only interfaces? Particularly curious about latency and the quality of the audio output.

                      • phkahler 3 hours ago

                        Sounds quite good, but this prompt is NOT what I'd expect an automated system to feed into it:

                        “I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.“

                        Phone numbers and other things were read nicely, but apparently a string of alphanumerics for an order number isn't handled well yet.

                        • diggan 2 hours ago

                          > Phone numbers and others were read nicely

                          The phone numbers were not read naturally at all. A human would have read a grouping of 123-456-789 like "123", "456", "789", but instead the model generated something like "123", "45", "6789". Listen to the RSVP example again and you'll know what I mean. The pacing is generally off for normal text too, but it's extra noticeable for the numbers.

                          My hunch would be that it's because of tokenization, but I wouldn't be able to say that's the issue for sure. Sounds like it though :)
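
                          If it is a text-normalization/tokenization problem, one workaround is to pre-group digit runs yourself before sending text to the TTS. Rough sketch below; the 3-digit grouping and the comma separator are just one convention, not anything Play documents.

                            import re

                            def group_digits(text: str) -> str:
                                """Rewrite long digit runs as comma-separated 3-digit groups so the TTS pauses where a human would."""
                                def _regroup(match: re.Match) -> str:
                                    digits = re.sub(r"\D", "", match.group(0))
                                    return ", ".join(digits[i:i + 3] for i in range(0, len(digits), 3))
                                # Match runs of 8+ characters made up of digits, spaces, and dashes.
                                return re.sub(r"\d[\d\- ]{6,}\d", _regroup, text)

                            print(group_digits("Call us at 123-456-789 to confirm."))
                            # -> "Call us at 123, 456, 789 to confirm."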

                          • amrrs 3 hours ago

                            Sorry, do you mean the audio for this text is not good?

                            “I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.“

                            I thought this was included in the demo; it seemed okay!

                            • BoorishBears 3 hours ago

                              Most of these prompts come from LLMs, so it's trivial to instruct them to provide a string that's broken out like that.

                              Also not the end of the world to process stuff like this with a regex.

                              Most of these newer TTS models require this type of formatting to reliably state long strings of numbers and IDs.
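
                              A regex pass along these lines works as a rough sketch; the ID heuristic and the partial NATO table are illustrative choices, not anything the Play API requires.

                                import re

                                # Partial NATO alphabet; extend to all 26 letters for real use.
                                NATO = {"A": "Alpha", "B": "Bravo", "C": "Charlie", "X": "X-ray", "Z": "Zulu"}

                                def spell_out_ids(text: str) -> str:
                                    """Expand mixed alphanumeric IDs into 'A as in Alpha, 1, 2, 3, ...' readouts."""
                                    def _expand(match: re.Match) -> str:
                                        parts = []
                                        for ch in match.group(0):
                                            if ch.isalpha():
                                                parts.append(f"{ch.upper()} as in {NATO.get(ch.upper(), ch.upper())}")
                                            else:
                                                parts.append(ch)
                                        return ", ".join(parts)
                                    # Heuristic: a token of 6+ word characters containing both a digit and a letter.
                                    return re.sub(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w{6,}\b", _expand, text)

                                print(spell_out_ids("Your product ID is A123B567Z890X."))
                                # -> "Your product ID is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, ..."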

                            • dulldata 3 hours ago

                              Demo video if you don't want to go through the announcement: https://www.youtube.com/watch?v=DusTj5NLC9w

                              Good with numbers mostly!

                              • CommanderData 29 minutes ago

                                Is there a way to train this on common AI voices from video games/movies? I'd very much like a voice assistant to sound like Father/Mother from Alien or Dead Space.

                                • siscia an hour ago

                                  I honestly wanted to try to use it, but their pricing was quite off-putting.

                                  • c0brac0bra an hour ago

                                    Yes. I think $0.05/min is a high multiple of what other agent-oriented realtime TTS products are charging.

                                  • KaoruAoiShiho an hour ago

                                    Is this better than 11labs?

                                    • Asjad 4 hours ago

                                      Play 3.0 mini sounds like a game-changer for real-time multilingual TTS with its speed and voice cloning capabilities

                                      • treesciencebot 4 hours ago

                                        Much faster than OpenAI's real-time mode, wow! Quality seems to be on par if not better as well.

                                        • samsepi0l121 3 hours ago

                                          Did we watch the same video? OpenAI's model is faster, and the quality is far better.

                                        • codetrotter 3 hours ago

                                          Hey Alexa, Google “Play”!

                                          • lostmsu 2 hours ago

                                            Is this one open in any way? If no, why would anyone use it over OpenAI?

                                            • gorkemyurt 3 hours ago

                                              wow! latency is insane