>A man talking as water splashes and gurgles and a motor engine hums in the background.
This is the first time I heard AI Simlish. I wonder what the training data was. Seems like the work was done by Johns Hopkins and Tencent, but the fake AI language sounds... Indic? Are there other examples of AI generating speech in... hallucinated languages?
> Are there other examples of AI generating speech in... hallucinated languages?
Sure: https://suno.com/song/0c05e4bd-5879-4e1d-9bdd-555d76569501
No chance that it's getting ancient Sumerian correct.
Of course, you can't just mathematically derive a language that isn't in the training set.
Except Sanskrit, naturally.
If you go to their demo[0] and type in a prompt asking it to say something (e.g. a person says "hello"), it seems to hallucinate a response in a made-up language ... maybe, I don't speak every language.
I can't wait for a new round of moral panic to set in over this. Some of us remember the hysteria over playing rock music backwards.
Simlish is the first thing that came to my mind too.
Elevenlabs started rolling out a generator for very basic sound effects. Using it made me wonder what the application for things like this would be. If it were realtime it could be used for games, but then there is the lack of predictable quality control.
For (cinematic) sound design the quality is not nearly good enough yet. For simple home-style videos, dozens of (more fun) options exist - foley, free sound libraries, freesound.org, going out with a phone and recording stuff.
>Using it made me wonder what the application for things like this would be.
Almost certainly in video shorts or high volume video content, I would think.
Same as image generation. When it gets to a certain quality level, it's much faster to describe what you want than to search for it.
People don't realize that an entire job field exists today around creating these sounds in post for videos and movies. As this sort of model improves, that field is basically gone.
> As this sort of model improves, that field is basically gone.
Why should this field be any different from all the other fields? Some (copywriters) have already been completely disrupted. It's like pointing at climate change and saying we really ought to be doing something about it. Well, we won't.
Depends on quality. Birds in a stream: are the birds accurate to the movie's location? Is the fidelity good enough? Does it match the movie? I don't think it's as simple as prompt and done.
Classic "code and weights released at X." But when you go to the repo at X there's nothing there and possibly never will be.
The quality of the audio is giving me these vibes:
https://www.youtube.com/watch?v=ngZ0K3lWKRc
Hayao Miyazaki, 7 years ago, on AI generated motion capture.
(Shrug) Art, like science, advances one funeral at a time.
I don't know. "Hayao Miyazaki bad" is a loser idea. It is insane to me that this board will waste millions of characters litigating the open-sourceyness of licenses, but when it comes to using their own ears and forming the kind of gut opinion millions of people make every day: no.
For music specifically, until someone invents a viable alternative to Spotify - which is the same as inventing an audience that pays more for music - I am not sure that even a good EzAudio, or anything that generated unlimited, high-quality music, would change much.
Good music generation will strengthen, not weaken, Spotify. It will change who gets paid by Spotify, in that the sons and daughters of record and banking executives can truly be talentless, but it will not change who is doing the paying.
Anyway, isn't this the status quo right now? There are maybe 10,000x more high-quality music tracks than an individual could ever practicably listen to in a lifetime. And it does make Spotify a better value every day.
"A man yells, slams a door and then speaks."
These are hilarious.