• simiones 2 days ago

    I read the comments praising these voices as very life like, and went to the page primed to hear very convincing voices. That is not at all what I heard though.

    The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

    The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample, that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).
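
    A rough sketch of point (2), not a claim about how VibeVoice works internally: if you label each character by its Unicode script, an English + Chinese sentence splits into obvious runs, while the French + English phrase above is a single Latin run with no hint of where the language changes. The function names here are made up for illustration.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Coarse script label for one character, based on its Unicode name."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "han"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def split_by_script(text):
    """Group consecutive characters that share the same script label."""
    return [(label, "".join(chars)) for label, chars in groupby(text, key=script_of)]


def letter_scripts(text):
    """Script labels of the letter runs only, in order of appearance."""
    return [label for label, _ in split_by_script(text) if label != "other"]


mixed_script = letter_scripts("Hello 你好 world")
same_script = letter_scripts("Peut-être que cela n'aurait pas été si simple if it had been")

print(mixed_script)      # the script boundary doubles as a language boundary
print(set(same_script))  # only one script, so the text itself gives no such cue
```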

    And, of course, the singing part is painfully bad, I am very curious why they even included it.

    • Uehreka 2 days ago

      Their comments about the singing and background music are odd. It’s been a while since I’ve done academic research, but something about those comments gave me a strong “we couldn’t figure out how to make background music go away in time for our paper submission, so we’re calling it a feature” vibe as opposed to a “we genuinely like this and think it’s a differentiator” vibe.

      • phildougherty 2 days ago

        Totally felt the same way! Singing happens spontaneously? What?

        • lyu07282 2 days ago

          They mention that in the FAQ here: https://github.com/microsoft/VibeVoice/tree/main?tab=readme-...

          > In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.

          It's not a bug, it's a feature! Okaaaaay

      • jstummbillig 2 days ago

        Is there any better model you can point at? I would be interested in having a listen.

        There are people – and it does not matter what it's about – that will overstate the progress made (and others will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.

        • Uehreka 2 days ago

          It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering.

          However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.

          If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M

          • refulgentis 2 days ago

            There's a certain know-nothing feeling I get that makes me worried if we start at the link (which has data showing it > ElevenLabs quality), jump to "eh, it's actually worse than anything I've heard in the last 2 years", and end up at "none are as good as ElevenLabs" - the recommendation and commentary on it, of course, has nothing to do with my feeling, cheers

            • sandreas 2 days ago

              What is your opinion about F5-TTS or Fish-TTS?

              • brettpro 2 days ago

                I recently implemented Fish for a project and found it adequate for TTS but wildly impressive in voice cloning. My POC originally required 3-10 audio samples but I removed the minimum because it could usually one shot it.

                The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit and miss, and the devs have gone dark unfortunately.
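
                For anyone wondering what "correct chunking" means here, this is a minimal sketch of the general idea, not Fish's actual inference code: split the input on sentence boundaries and pack whole sentences into chunks under a size limit, so each chunk can be synthesized and streamed without cutting a sentence in half. The `max_chars` limit and the function name are assumptions for illustration.

```python
import re


def chunk_text(text, max_chars=200):
    """Yield chunks made of whole sentences, each at most max_chars where possible.

    A single sentence longer than max_chars is emitted as its own chunk
    rather than being split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: flush what we have so far.
            yield current
            current = sentence
        else:
            current = candidate
    if current:
        yield current


for chunk in chunk_text("One. Two. Three.", max_chars=9):
    print(chunk)  # each chunk would be handed to the TTS model in turn
```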

            • lynx97 2 days ago

              I cobbled together llm-tts to run as many local (and remote) TTS models as I could find and get working.

              https://github.com/mlang/llm-tts

              Strictly speaking, even music generation fits the usage pattern: text in, audio out.

              llm-tts is far from complete, but it makes it relatively "easy" to try a few models in a uniform way.

              • nipponese 2 days ago

                Not OS or local, but just try ChatGPT Voice Conversation mode. To my ears, it's a generation ahead of these VibeVoice samples.

                • riquito 2 days ago

                  Probably not even the best ones, but among some recent models I find Dia and Orpheus more natural

                  - http://dia-tts.com/

                  - https://github.com/canopyai/Orpheus-TTS

                  • popalchemist a day ago

                    Higgs Audio v2 is currently SOTA in OSS TTS.

                    • satellite2 2 days ago

                      Elevenlabs v3 (not local)

                      • whimsicalism 2 days ago

                        i think orpheus and sesame sound better

                      • rcarmo 2 days ago

                        One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works.

                        • watsonmusic 2 days ago

                          bonus usage

                        • IshKebab 2 days ago

                          I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago.

                          • selkin 2 days ago

                            Results correlate with investment, and there’s more of it in synthesizing female-coded voices. As for why female-coded voices get more investment, we all know; the only difference is in attitude towards that (the correct answer, of course, is “it sucks”).

                            • recursive 2 days ago

                              We all know? Female voices have better intelligibility? That's my guess anyway.

                              • kadoban 2 days ago

                                There's a lot of money and effort spent in satisfying the sexual desires of (predominantly straight) men. There's not typically quite as much interest in doing the same for women.

                                For example I've been looking at models and loras for generating images, and the boards are _full_ of ones that will generate women well or in some particular style. Quite often at least a couple of the preview images for each are hidden behind a button because they contain nudity. Clearly the intent is that they are at least able to generate porn containing women. There's a small handful that are focused on men and they're very aware of it, they all have notes lampshading how oddball they are to even exist.

                                I would expect that this is not as pronounced an effect in the world generating speech, but it must still exist.

                                • lacy_tinpot 2 days ago

                                  I think this is a very lazy kind of cultural analysis. The reason female voices are being chosen over male ones is a little more multifaceted than just SEX. Heterosexual women also tend to prefer female voices over male ones.

                                  Female voices are often rated as being clearer, easier to understand, "warmer", etc.

                                  Why this is the case is still an open question, but it's definitely more complex than just SEX.

                                  • kadoban 2 days ago

                                    I don't think that this is the only factor, I just suspect that it is _a_ factor.

                                    • lacy_tinpot 20 hours ago

                                      >There's not typically quite as much interest in doing the same for women.

                                      Women also prefer female voices.

                                    • selkin 2 days ago

                                      That you consider it sex (rather than gender), is exactly why there’s a preference for female coded voices. Consider where we do hear male recorded voices used as default.

                                      • recursive 2 days ago

                                        Overloaded term. It was a reference to the parent's reference.

                                        > satisfying the sexual desires of

                                        So, "sex" as a reference to "sexual desires". In English, it just so happens that "sex" has other meanings, but those weren't in play at the time.

                                        • akimbostrawman a day ago

                                          How the hell would you determine someone's self-assigned social gender based on their voice, which is a result of their physical sex?

                                          • pylotlight 2 days ago

                                            woosh

                                      • selkin 2 days ago

                                        If you don't know, it's on you to learn. If you do know and prefer to make an asshole of yourself, that's also on you.

                                  • odie5533 2 days ago

                                    It's good but not the best free model. I find Chatterbox to be more realistic, with no robotic sound and better (though not perfect) intonation.

                                    • lastdong 2 days ago

                                      Chatterbox sounds great, their demo page is a good introduction: https://resemble-ai.github.io/chatterbox_demopage/

                                      • eaglehead 2 days ago

                                        I agree. We switched from elevenlabs to chatterbox (hosted on Resemble.ai) and it is much much cheaper and better.

                                      • echelon 2 days ago

                                        This is close to SOTA emotional performance, at least the female voices.

                                        I trust the human scores in the paper. At least my ear aligns with that figure.

                                        With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models.

                                        • kamranjon 2 days ago

                                          Hmmmm… what is your opinion on the examples showcased here vs the ones on the Dia demo page?

                                          https://yummy-fir-7a4.notion.site/dia

                                          I am not sure why but I find the pacing of the parakeet based models (like Dia) to be much more realistic.

                                          • watsonmusic 2 days ago

                                            11labs is facing a real competitor

                                          • iansinnott 2 days ago

                                            The English/Mandarin section was VERY impressive. The accents of both the woman speaking English and the man speaking Chinese were spot on. Both sound very convincingly like they are speaking a second language, which anyone here can hear from the Chinese woman speaking English voice. I'd like to add that the foreigner speaking Chinese was also spot on.

                                            • skripp 2 days ago

                                              The male Chinese speakers had THICK American accents. Nothing really wrong with the language, but think the stereotype German speaking English. That was kind of strange to me.

                                              • ascorbic 2 days ago

                                                I think it's because it was using the American voice for it. Conversely the female voice in the Mandarin conversation spoke English with a Chinese accent.

                                              • mclau157 2 days ago

                                                ElevenLabs has a much more convincing voice model

                                                • sys32768 2 days ago

                                                  They also offer an AI Voice Changer that will take a recording and transform it into a different voice but retain the cadence and intonation.

                                                  • DrBenCarson 2 days ago

                                                    Open source?

                                                    • watsonmusic 2 days ago

                                                      it's not oss

                                                    • johanyc 2 days ago

                                                      The Chinese is good. The Mandarin to English example she sounds native. The English to Mandarin sounds good too but he does have an English speaker's accent, which I think is intentional.

                                                      • MengerSponge 2 days ago

                                                        > (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that

                                                        https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

                                                      • giancarlostoro 2 days ago

                                                        I really hope someone within Microsoft is naming their open source coding agent Microsoft VibeCode. Let this be a thing. It's either that or "Lo" then you can have Lo work with Phi, so you can Vibe code with Lo Phi.

                                                        https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

                                                        • simiones 2 days ago

                                                          Knowing the history of Microsoft marketing, it will either be called something like "Microsoft Copilot Code Generator for VSCode" or something like "Zunega"...

                                                          • giancarlostoro 2 days ago

                                                            Well don't forget "Microsoft SQL" ;) They'll name something as though they invented it and now have the worst possible way to google it.

                                                            • kelvinjps10 2 days ago

                                                              For me it doesn't sound like they invented it, but that it's Microsoft's version of SQL. idk, but I hate Microsoft's version of anything

                                                              • loloquwowndueo 2 days ago

                                                                “Microsoft Word” haha reminds me of the old joke : “Microsoft Works” is an oxymoron.

                                                                • giancarlostoro 2 days ago

                                                                  Oh my goodness, I forgot about "Microsoft Works" you just shot me back in time to the 2000s

                                                                  • esafak 2 days ago

                                                                    You misquoted Microsoft "Works"

                                                                  • parineum 2 days ago

                                                                    Just like MariaDB sounds as though they invented databases, right?

                                                                  • cush 2 days ago

                                                                    Later renamed to Microsoft Zune, a handheld AI companion that lives in your pocket

                                                                    • polytely 2 days ago

                                                                      GitHub Dotnet Copilot Code Generator for VSC (new)

                                                                      • datadrivenangel 2 days ago

                                                                        (preview)

                                                                      • yellowapple 2 days ago

                                                                        Microsoft Copilot .NET for Workgroups

                                                                        • airstrike 2 days ago

                                                                          Now I need a new project just so I can call it Zunega... lmao

                                                                        • watsonmusic 2 days ago

                                                                          genius

                                                                        • malnourish 2 days ago

                                                                          This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is.

                                                                          • heeton 2 days ago

                                                                            I'm no audio engineer either, but those computer voices sound "saw-tooth"y to me.

                                                                            From what I understand, it's more basic models/techniques that are undersampling, so there is a series of audio pulses which give it that buzzy quality. Better models produce smoother output.

                                                                            https://www.perfectcircuit.com/signal/difference-between-wav...

                                                                            • codebastard 2 days ago

                                                                              I would describe it as blocky: if we visualise the sound wave, it seems to be without peaks, cut off at the top and bottom, producing a metallic, boxy echo.

                                                                              • jofzar 2 days ago

                                                                                Yeah it sounds super low bitrate to me, reminds me of someone on Bluetooth microphone

                                                                              • lvncelot 2 days ago

                                                                                After hearing them myself, I think I know what you mean. The voices get a bit warbly and sound at times like they are very mp3-compressed.

                                                                              • davorak a day ago

                                                                                Any insight on why the code and the large model were removed? Some copies are floating around and are MIT licensed. In cases like this I do not know why the projects are yanked. If the project was mistakenly released under MIT, copied elsewhere, is any damage control possible by yanking the copies you have control over? Mostly seems like bad PR, if minor.

                                                                                • androiddrew a day ago

                                                                                  Ok anyone have a link to the code and weights?

                                                                                  • fivestones a day ago

                                                                                    Wondering this too.

                                                                                  • strangescript 2 days ago

                                                                                    The male voices seem much worse than the female voices, borderline robotic. Every sample on their website starts with a female voice. They clearly are aware of the issue.

                                                                                    • jsomedon 2 days ago

                                                                                      I felt the same, male voice feels kinda artificial.

                                                                                    • aargh_aargh 2 days ago

                                                                                      Is there a current, updated list (ideally, a ranking) of the best open weights TTS models?

                                                                                      I'm actually more interested in STT (ASR) but the choices there are rather limited.

                                                                                      • Uehreka 2 days ago

                                                                                        Yes: https://huggingface.co/models?pipeline_tag=text-to-speech

                                                                                        Generally if a model is trending on that page, there’s enough juice for it to be worth a try. There’s a lot of subjective-opinion-having in this space, so beyond “is it trending on HF” the best eval is your own ears. But if something is not trending on HF it is unlikely to be much good.

                                                                                        • odie5533 2 days ago

                                                                                          Best TTS: VibeVoice, Chatterbox, Dia, Higgs, F5 TTS, Kokoro, Cosy Voice, XTTS-2.

                                                                                          • kroaton a day ago

                                                                                            Unmute.sh (same team as Kokoro) gets slept on, but it's really good.

                                                                                          • xnx 2 days ago

                                                                                            Click leaderboard in the hamburger menu: https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2

                                                                                            • prophesi 2 days ago

                                                                                              Is there a way to filter out hosted models? The top three winners currently are all proprietary as far as I can tell.

                                                                                              edit: Ah, there's a lock icon next to the name of each proprietary model.

                                                                                              • odie5533 2 days ago

                                                                                                That's a highly incomplete comparison

                                                                                              • watsonmusic 2 days ago

                                                                                                yes the best

                                                                                              • TheAceOfHearts 2 days ago

                                                                                                Unfortunately it's not usable if you're GPU-poor. Couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66 second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor the best TTS model I've tried so far is Kokoro.
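
                                                                                                To put those two numbers in perspective, here is the arithmetic as a tiny sketch. It only uses the 832 s and 66 s figures above; the function name is made up.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    return generation_seconds / audio_seconds


rtf = real_time_factor(832, 66)
print(f"{rtf:.1f}x slower than real time")  # prints "12.6x slower than real time"
```

                                                                                                Anything above 1.0 means generation cannot keep up with playback, so at this speed streaming is out of the question on that CPU.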

                                                                                                Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass.
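
                                                                                                A toy sketch of that intermediate step, purely hypothetical: no TTS model mentioned in this thread is known to accept this format, and the `Segment` fields, the 400 ms default pause, and the rule-based `annotate` function are all stand-ins for what would presumably be a learned annotator. The point is only that the annotated form is plain data you can inspect and edit before synthesis.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    pause_after_ms: int = 0   # hypothetical control: silence after the segment
    emphasis: bool = False    # hypothetical control: stress this segment


def annotate(text):
    """Toy rule-based stand-in for a learned annotation step."""
    segments = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            segments.append(
                Segment(
                    text=sentence,
                    pause_after_ms=400,               # arbitrary default
                    emphasis=sentence.endswith("!"),  # crude heuristic
                )
            )
    return segments


segments = annotate("Wait! Come back.")
segments[1].pause_after_ms = 800  # the user tweaks a detail before synthesis
for segment in segments:
    print(segment)  # in a real pipeline these would be passed to the TTS model
```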

                                                                                                • tempodox 2 days ago

                                                                                                  This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

                                                                                                  • NitpickLawyer 2 days ago

                                                                                                    > with acceptable quality

                                                                                                    Compared to IBM's Stephen Hawking's chair, maybe. But Apple TTS is not acceptable quality in any modern understanding of SotA, IMO.

                                                                                                    • selkin 2 days ago

                                                                                                      Different use cases:

                                                                                                      If you need a non-visual output of text, SotA is a waste of electrons.

                                                                                                      If you want to try and mimic a human speaker, then it ain’t.

                                                                                                      Question is why would you need to have the computer sound more human, except for “because I can”.

                                                                                                      • NitpickLawyer 2 days ago

                                                                                                        I tried listening to audiobooks generated with tts. It takes me out of it most of the time, and I lose focus. That podcast thing from google was the first time I felt like I could listen to an entire thing without feeling the uncanny valley thing. And I knew it was genAI. So I'm looking for that, but for my content. Grab a bunch of articles (long form, deeply researched) and "podcast" them but with natural voices, sans hype. Or books. Have them ready when I'm out and about.

                                                                                                        • andrew_lettuce 2 days ago

                                                                                                          The Google podcasts are so cringey positive it emotionally pains me. Nobody finds pineapple on pizza that amazing.

                                                                                                          • lagniappe 2 days ago

                                                                                                            >Nobody finds pineapple on pizza that amazing

                                                                                                            We can't be friends

                                                                                                        • crazygringo 2 days ago

                                                                                                          Audiobooks and other material you want to listen to (articles, blog posts, etc.).

                                                                                                          There's a lot of stuff I don't have time to sit down and read, but want to listen to while I cook/laundry/shower/drive/etc.

                                                                                                          Often recordings don't exist. Or when they do, an audiobook just has a bad voiceover artist, or one that just rubs you the wrong way.

                                                                                                          The more human text-to-speech sounds, the easier and less distracting it is to listen to. There's real value in it, it's not "because I can".

                                                                                                          You know how it's nicer to read in 300 dpi instead of 72 dpi? Or in Garamond rather than Courier? Or in Helvetica rather than Comic Sans? It's like that, only for speech.

                                                                                                          • Ukv 2 days ago

                                                                                                            > Question is why would you need to have the computer sound more human

                                                                                                            I think translation would be a big use - maybe translating your voice to another language while maintaining emotion and intonation, or dubbing content (videos, movies, podcasts, ...) that isn't otherwise available in your native language.

                                                                                                            Traditional non-ML TTS for longer content like podcasts or audiobooks seems like it'd become grating to the point of being unlistenable, or at least a significantly worse experience. Stands to benefit from more natural sounding voices that can place emphasis in the right places.

                                                                                                            Since Stephen Hawking was brought up, there are likely also people with voice-impairing illnesses who would like to speak in their own voice again (in addition to those who are fine with a robotic voice). Or alternatively, people who are uncomfortable with their natural voice and want to communicate closer to how they wish to be perceived.

                                                                                                            Could also potentially be used for new forms of interactive media that aren't currently feasible - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.

                                                                                                    • Insanity 2 days ago

                                                                                                      What an odd name to me, because "Vibe" is, in my mind, equal to somewhat poor quality. Like "Vibe Coding". But that's probably just some bias on my side.

                                                                                                      • mxfh 2 days ago

                                                                                                        Vibe coding just became a term this spring. I doubt that the substantial parts, like giving it a project code name and getting company approval for this research project, started after that. It's not like vibe has a negative connotation in general yet.

                                                                                                        • Insanity 2 days ago

                                                                                                          'Vibe' as a word / product was definitely less common though. I kinda doubt that 'VibeVoice' is _not_ a consequence of 'VibeCode'.

                                                                                                          But I do agree with you in that generally there's probably no negative connotation (yet).

                                                                                                        • andrew_lettuce 2 days ago

                                                                                                          Vibe always meant "specific feel" and makes sense related to AI coding "by touch" vs. understanding what's actually happening. It's just the results have now made the word pejorative.

                                                                                                        • baxuz 2 days ago

                                                                                                          Looking forward to the day when TTS and speech recognition will work for Croatian, or other less prevalent languages.

                                                                                                          It seems that it's only variants of English, Spanish and Chinese which are somewhat working.

                                                                                                          • lukax 2 days ago

                                                                                                            Have you tried Soniox for speech recognition? It supports Croatian. Or are you just looking for self-hosted open-source models? Soniox is very cheap ($0.1/h for async, $0.12/h for real-time) and you get $200 free credits on signup.

                                                                                                            https://soniox.com/

                                                                                                            Disclaimer: I used to work for Soniox

                                                                                                            • baxuz 2 days ago

                                                                                                              I meant in general purpose tools from Google and Apple. Most of this assistant and "AI" stuff is practically useless for me because I refuse to talk to my devices in English.

                                                                                                              In Android Auto / CarPlay I can't even get voice guidance that works properly, much less reading notifications, or composing a reply using STT

                                                                                                          • rafaelmn 2 days ago

                                                                                                            The Spontaneous Emotion dialog sounds like a team member venting through LLMs.

                                                                                                            They could have skipped the singing part, it would be better if the model did not try to do that :)

                                                                                                          • Meneth 2 days ago

                                                                                                            Open-source, eh? Where's the training data, then?

                                                                                                            • Joel_Mckay 2 days ago

                                                                                                              Scraped data is often full of copyright, usage-agreement, and privacy-law violations.

                                                                                                              Making it "open" would be unwise for a commercial entity. =3

                                                                                                              • zoobab 2 days ago

                                                                                                                Open source is being abused to not provide the actual source. Stop this.

                                                                                                                • Joel_Mckay 2 days ago

                                                                                                                  A lot of code has multiple FOSS licenses that are not contaminating like GPL. GPL violations do occur on code, but have nothing to do with the training data.

                                                                                                                  For example, many academic data sets are not public domain, and can't be used in a commercial context. A GPL claim on that data is often an argument of which thief showed up first.

                                                                                                                  Rule #24: A lawyer's Strategic Truth is to never lie, but also to avoid voluntarily disclosing information that may help opponents.

                                                                                                                  Thus, a business will never disclose they paid a fool to break laws for them... =3

                                                                                                                  • nullc 2 days ago

                                                                                                                    Perhaps, but it is not Open Source in the traditional sense if they do not provide the preferred form for modifications.

                                                                                                                    • Joel_Mckay 2 days ago

                                                                                                                      There are also some weird OSS license rules that only trip the disclosure obligation when distributing the build to end users.

                                                                                                                      Indeed, these adversarial behaviors do not follow the spirit of FOSS community standards. If a project started as FOSS, then FOSS it should remain. =3

                                                                                                            • crvdgc 2 days ago

                                                                                                              Very impressive that it can reproduce the Mandarin accent when speaking English and English accent when speaking Mandarin.

                                                                                                              • stuffoverflow 2 days ago

                                                                                                                VibeVoice-Large is the first local TTS that can produce convincing Finnish speech with little to no accent. I tinkered with it yesterday and was pleasantly surprised at how good the voice cloning is and how it "clones" the emotion in the speech as well.

                                                                                                                • data-ottawa 2 days ago

                                                                                                                  Looks like the repo went private

                                                                                                                  https://github.com/microsoft/VibeVoice

                                                                                                                  I was trying to get this working on strix halo.

                                                                                                                  • lxe 2 days ago

                                                                                                                    There are 2 "best" TTS models out right now: HiggsAudio and VibeVoice. I found that Higgs is both faster and much higher fidelity than Vibe. Can't speak to expressiveness, but don't sleep on it.

                                                                                                                    • mpaepper 2 days ago

                                                                                                                      Unfortunate naming, given that 7 months ago I gave the name vibevoice to my repo, which does open-source, locally running speech-to-text:

                                                                                                                      https://github.com/mpaepper/vibevoice

                                                                                                                      • ndkap 2 days ago

                                                                                                                        Here is AI being as close as possible to the most animated person I know and here I am sounding robotic in every conversation I have, despite my best efforts to sound otherwise. Sometimes, I just wish I could have an AI speak for me

                                                                                                                        • glenstein 2 days ago

                                                                                                                          Very good and I could see how I might believe they are real people if I let my guard down. The male voice sounded a little sedated though and there was a smoothness to it that could be samey over long stretches.

                                                                                                                          Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is.

                                                                                                                          • regularfry 2 days ago

                                                                                                                            Ok, this is nit-picking, but it's very obvious that the sample voices these were trained with were captured in different audio environments. There's noticeable reverb on the male voice that's not there on the other.

                                                                                                                            So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.

                                                                                                                            • cush 2 days ago

                                                                                                                              To me this is like early generative AI art, where the images came out very "smooth" and visually buttery, except here it's the voices that have no timbre. Intonation issues aside, these models could use a touch of vocal fry and some body to be more believable.

                                                                                                                              • viggity 2 days ago

                                                                                                                                I feel like this is a step in the right direction, but a lot of emotive text-to-speech models are only changing the duration and loudness of each word, the timing/pauses are better too.

                                                                                                                                I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

                                                                                                                                • watsonmusic 2 days ago

                                                                                                                                  this model is superb

                                                                                                                                • bityard 2 days ago

                                                                                                                                  I thought the name sounded familiar. I'm guessing it's no relation to this project, which has been around for 7 months? https://github.com/mpaepper/vibevoice

                                                                                                                                  • faxmeyourcode 2 days ago

                                                                                                                                    I tried the colab notebook that they link to and couldn't replicate the quality for whatever reason. I just swapped out the text and let it run on the introduction paragraph of Metamorphosis by Franz Kafka and it seemingly could not handle the intricacies.

                                                                                                                                    • wewewedxfgdf 2 days ago

                                                                                                                                      I'm really hoping one day there will be a TTS that does really nice British accents - I've surveyed them all deeply, none do.

                                                                                                                                      Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.

                                                                                                                                      • xp84 16 hours ago

                                                                                                                                        I’m just a yank, but a lot of the AI-voiced videos on YouTube that I’ve been listening to while I’m falling asleep lately have British voices that sound quite nice to me.

                                                                                                                                        • specproc 2 days ago

                                                                                                                                          I'd like one that really nails Brummie.

                                                                                                                                        • bazlan 2 days ago

                                                                                                                                          Sad to not see vui on the comparisons!

                                                                                                                                          A 100M podcast model

                                                                                                                                          https://huggingface.co/spaces/fluxions/vui-space

                                                                                                                                          • ementally 2 days ago

                                                                                                                                            they vibecoded their demo website? the text is invisible on Firefox.

                                                                                                                                            • double_one 2 days ago

                                                                                                                                              Same problem here. A quick refresh solved it for me — maybe try that?

                                                                                                                                              • recursive 2 days ago

                                                                                                                                                Works for me

                                                                                                                                              • anarticle 2 days ago

                                                                                                                                                The first example sounds like a cry for help.

                                                                                                                                                Some of them have tone wobbles which iirc was more common in early TTS models. Looks like the huge context window is really helping out here.

                                                                                                                                                • qwertytyyuu 2 days ago

                                                                                                                                                  Woah, they even imitate the western Chinese accent well

                                                                                                                                                  • baal80spam 2 days ago

                                                                                                                                                    Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

                                                                                                                                                    • x187463 2 days ago

                                                                                                                                                      The giveaway is they will never talk over each other. Only one speaker at a time, consistently.

                                                                                                                                                      • tracker1 2 days ago

                                                                                                                                                        Fair enough... though it would be possible to generate each turn separately, introduce stuttering/pauses at the beginning and end of statements, then edit the output so the turns overlap.

                                                                                                                                                        You'd probably want to do something similar for the stereo balance anyway... having each speaker's input offset from center instead of straight mono.
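
A rough sketch of that post-processing idea, assuming each speaker's turn is already available as a list of mono float samples. The function names `pan` and `overlay` are made up for illustration and are not part of VibeVoice. Each speaker is panned slightly off center with constant-power gains, then the second turn starts a little before the first one ends so the speakers briefly talk over each other.

```python
import math


def pan(mono, position):
    """Constant-power pan of a mono signal into (left, right) frames.

    position: -1.0 is hard left, 0.0 is center, +1.0 is hard right.
    """
    angle = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    left_gain, right_gain = math.cos(angle), math.sin(angle)
    return [(s * left_gain, s * right_gain) for s in mono]


def overlay(turn_a, turn_b, overlap):
    """Mix two stereo turns so turn_b starts `overlap` frames before turn_a ends."""
    start_b = max(len(turn_a) - overlap, 0)
    total = max(len(turn_a), start_b + len(turn_b))
    out = [[0.0, 0.0] for _ in range(total)]
    for i, (left, right) in enumerate(turn_a):
        out[i][0] += left
        out[i][1] += right
    for i, (left, right) in enumerate(turn_b):
        out[start_b + i][0] += left
        out[start_b + i][1] += right
    return [tuple(frame) for frame in out]


# Hypothetical input: two 4-sample turns of constant amplitude.
speaker_a = pan([1.0] * 4, -0.3)  # slightly left of center
speaker_b = pan([1.0] * 4, 0.3)   # slightly right of center
mixed = overlay(speaker_a, speaker_b, overlap=2)
print(len(mixed))  # 6 frames: 4 + 4 - 2 overlapping
```

Summing overlapping frames can exceed full scale, so a real pipeline would likely also need gain reduction or a limiter before writing the audio out.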

                                                                                                                                                        • kaptainscarlet 2 days ago

                                                                                                                                                          Also the lack of stutter and perfect flow of speech are a dead giveaway

                                                                                                                                                          • kridsdale1 2 days ago

                                                                                                                                                            And longer pause between turns than humans would do.

                                                                                                                                                          • tracker1 2 days ago

                                                                                                                                                          Yeah, a lot of the TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago. Aside: Installing a sound card (unused) on a Windows server just to be able to generate TTS was interesting. It was required by the platform, even though it wasn't actually used.

                                                                                                                                                          I generally don't like a lot of the AI-generated slop that's starting to pop up on YouTube these days... I did enjoy some of the Reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates/ages/timelines are just off as far as story generation goes, and really should be human-tweaked. As to the voice gen, the way it says a year or measurement is just not how English speakers (US or otherwise) speak.

                                                                                                                                                          • ml_basics 2 days ago

                                                                                                                                                            what's the relationship between this work and the recently announced voice models from Microsoft AI? https://microsoft.ai/news/two-new-in-house-models/

                                                                                                                                                            • ehutch79 2 days ago

                                                                                                                                                              The examples are kind of off-putting. We're definitely in uncanny valley territory here.

                                                                                                                                                              • nextworddev 2 days ago

                                                                                                                                                                Still haven’t found anything better than kokoro tts. Anyone know something better?

                                                                                                                                                                • weeb 2 days ago

                                                                                                                                                                  does anyone know of recent TTS options that let you specify IPA rather than written words? Azure lets you do this, but something local (and better than existing OS voices) would be great for my project.
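
For reference, the Azure route goes through SSML's standard `phoneme` element with `alphabet="ipa"`. Below is a minimal sketch that builds such a document with the Python standard library. The helper name `ipa_ssml` and the default voice name are assumptions for illustration, and nothing here implies that VibeVoice or any local model accepts SSML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def ipa_ssml(parts, voice="en-US-JennyNeural", lang="en-US"):
    """Build an SSML string from plain strings and (text, ipa) tuples.

    The voice name is an assumed example value; substitute one that the
    target service actually offers.
    """
    body = []
    for part in parts:
        if isinstance(part, tuple):
            text, ipa = part
            # quoteattr() adds the surrounding quotes and escapes the value.
            body.append(
                f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>"
                f"{escape(text)}</phoneme>"
            )
        else:
            body.append(escape(part))
    return (
        f"<speak version=\"1.0\" xmlns=\"{SSML_NS}\" xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{''.join(body)}</voice></speak>"
    )


ssml = ipa_ssml(["I say ", ("tomato", "təˈmɑːtəʊ"), "."])
ET.fromstring(ssml)  # raises ParseError if the result is not well-formed XML
print(ssml)
```

The written word stays inside the element as a fallback, so an engine that ignores `phoneme` still has something to read.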

                                                                                                                                                                  • andybug 2 days ago

                                                                                                                                                                    I'm using Kokoro via https://github.com/remsky/Kokoro-FastAPI. It has a `generate_audio_from_phonemes()` endpoint that I'm sure maps to the Kokoro library if you want to use it directly.

                                                                                                                                                                    My usage is for Chinese, but the phonemes it generated looked very much like IPA.

                                                                                                                                                                  • egorfine 2 days ago

                                                                                                                                                                    [deleted - I'm an idiot]

                                                                                                                                                                    • x187463 2 days ago

                                                                                                                                                                      Whisper is speech-to-text. VibeVoice is text-to-speech.

                                                                                                                                                                      • mpeg 2 days ago

                                                                                                                                                                        There is a text-to-speech version of whisper, but IMHO the quality is much worse than the demos of this model.

                                                                                                                                                                        • x187463 2 days ago

                                                                                                                                                                          Are you referring to this?

                                                                                                                                                                          https://github.com/WhisperSpeech/WhisperSpeech

                                                                                                                                                                          Or is there some OpenAI official Whisper TTS?

                                                                                                                                                                          • mpeg 2 days ago

                                                                                                                                                                            Yep, nothing official that I know, but that one is fairly popular so maybe they were referring to it (although AFAIK it's not frontier?)

                                                                                                                                                                        • egorfine 2 days ago

                                                                                                                                                                          I stand corrected

                                                                                                                                                                      • tehlike 2 days ago

                                                                                                                                                                      The comments in the HTML code are in Chinese, which is very interesting.

                                                                                                                                                                        • swiftcoder 2 days ago

                                                                                                                                                                          Ah, yes, the Furious 7 soundtrack. Definitely something everyone recalls

                                                                                                                                                                          • closewith 2 days ago

                                                                                                                                                                            The most popular song of the year from one of the most popular movie franchises that had been in the global news due to the death of its star. Probably the most memorable song from a soundtrack of the century so far.

                                                                                                                                                                            • agos 2 days ago

                                                                                                                                                                            I'm Just Ken (Barbie), Skyfall, Let It Go (Frozen), Remember Me (Coco), Happy (Despicable Me 2), and Shallow (A Star Is Born) are all arguably wayyyyy more memorable and these are just off the top of my head. We've had quite a few memorable songs in soundtracks this millennium.

                                                                                                                                                                            edit: I had forgotten about Jai Ho (Slumdog Millionaire) and Lose Yourself (8 Mile)

                                                                                                                                                                              • closewith 2 days ago

                                                                                                                                                                                It's obviously subjective, but in terms of numbers the only contender in that list is Let It Go, which had about 1/3rd the reach.

                                                                                                                                                                                Nothing on that list - movies or songs - had the cultural impact of Furious 7 or See You Again.

                                                                                                                                                                                • ascorbic 2 days ago

                                                                                                                                                                                  And most recently "Golden"

                                                                                                                                                                            • throwaw12 2 days ago

                                                                                                                                                                              Will there be support for SSML, to give more control over the conversation?

                                                                                                                                                                              • lagniappe 2 days ago

                                                                                                                                                                                Bots should never sing.

                                                                                                                                                                                • Havoc 2 days ago

                                                                                                                                                                                  MIT license - very nice!

                                                                                                                                                                                  • ComputerGuru 2 days ago

                                                                                                                                                                                    The application of known FOSS licenses to what is effectively a binary-only release is misleading and borderline meaningless.

                                                                                                                                                                                    • Havoc 2 days ago

                                                                                                                                                                                      It is an unfortunate recycling of an existing regime that no doubt offends Stallman to his very core, but I wouldn't call it meaningless.

                                                                                                                                                                                      If you're in a company and need a model which one do you think you're getting past compliance & legal - the one that says MIT or the one that says "non-commercial use only"?

                                                                                                                                                                                    • em-bee 2 days ago

                                                                                                                                                                                      what does that mean in this context? it seems to depend on an LLM. so can i run this completely offline? if i have to sign up and pay for an LLM to make it work, then it's not really more useful than any other non-free system

                                                                                                                                                                                      • watsonmusic 2 days ago

                                                                                                                                                                                        Microsoft is cool

                                                                                                                                                                                      • agos 2 days ago

                                                                                                                                                                                        seemingly supports only English, Indian and Chinese

                                                                                                                                                                                        • plingamp 2 days ago

                                                                                                                                                                                          Indian and Chinese are not languages

                                                                                                                                                                                          • agos a day ago

                                                                                                                                                                                            I'm very aware of this. The project does not specify more than an in- and zh- prefix.

                                                                                                                                                                                            • ascorbic 2 days ago

                                                                                                                                                                                              Voices, not languages. The "English" one is American though.

                                                                                                                                                                                          • cush 2 days ago

                                                                                                                                                                                            I tried using the demo but it just errors out

                                                                                                                                                                                            • amelius 2 days ago

                                                                                                                                                                                              I tried some TTS models a while ago, but I noticed that none of them allowed to put markup statements in the text. For example, it would be nice to do something like:

                                                                                                                                                                                                   Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]
                                                                                                                                                                                              
                                                                                                                                                                                              etc.

                                                                                                                                                                                              In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor.
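
To make the idea concrete, here is a minimal sketch of how a front end could split such a line into spoken text and cues. The bracket-tag syntax and the `split_markup` helper are hypothetical; nothing here is an API of VibeVoice or of any other TTS model, and how an engine would act on the cues is left open.

```python
import re

# Hypothetical bracket-tag syntax taken from the example above.
# No real TTS engine is assumed to understand these tags.
TAG = re.compile(r"\[([a-z]+)\]")


def split_markup(text):
    """Split text into ('text', ...) and ('tag', ...) pairs, in order."""
    parts = []
    pos = 0
    for match in TAG.finditer(text):
        # Plain text between the previous tag and this one.
        chunk = text[pos:match.start()].strip()
        if chunk:
            parts.append(("text", chunk))
        parts.append(("tag", match.group(1)))
        pos = match.end()
    # Whatever follows the last tag.
    tail = text[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


line = "Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]"
for kind, value in split_markup(line):
    print(kind, value)
```

Each `text` piece would go to the synthesizer as usual, and each `tag` piece would have to be mapped to whatever style or sound-effect control the engine offers, if it offers any.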

                                                                                                                                                                                            • lyu07282 a day ago

                                                                                                                                                                                              Did they delete the repo? It's 404 for me now: https://github.com/microsoft/VibeVoice

                                                                                                                                                                                              • RealtyDAO a day ago

                                                                                                                                                                                                they must have removed it.. been down for hrs.

                                                                                                                                                                                              • sciencesama 2 days ago

                                                                                                                                                                                                Need this for mac

                                                                                                                                                                                                • double_one 2 days ago

                                                                                                                                                                                                  I tried it on my MacBook Pro — works great!

                                                                                                                                                                                                • watsonmusic 2 days ago

                                                                                                                                                                                                  one of the best models built by Microsoft

                                                                                                                                                                                                  • enigma101 a day ago

                                                                                                                                                                                                    only microsoft could come up with such a name rofl

                                                                                                                                                                                                    • defrost a day ago

                                                                                                                                                                                                      Lippy got vetoed.