from Hacker News

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

by blacktechnology on 9/24/24, 3:16 AM with 17 comments

  • by maxglute on 9/24/24, 7:59 AM

    >A man talking as water splashes and gurgles and a motor engine hums in the background.

    This is the first time I've heard AI Simlish. I wonder what the training data was. Seems like the work was done by Johns Hopkins and Tencent, but the fake AI language sounds... Indic? Are there other examples of AI generating speech in... hallucinated languages?

  • by tigermafia on 9/24/24, 10:55 AM

    ElevenLabs started rolling out a generator for very basic sound effects. Using it made me wonder what the application for things like this would be. If it were realtime it could be used for games, but then there is the lack of predictable quality control.

    For (cinematic) sound design the quality is not nearly good enough yet. For simple home-style videos dozens of (more fun) options exist: foley, free sound libraries, freesound.org, going out with a phone and recording stuff.

  • by cchance on 9/24/24, 7:06 PM

    People don't realize that there's an entire job field creating these sounds today in post for videos and movies. As this sort of model improves, that field's basically gone.

  • by zaptrem on 9/24/24, 6:14 PM

    Classic "code and weights released at X." But when you go to the repo at X there's nothing there and possibly never will be.

  • by doctorpangloss on 9/24/24, 3:58 PM

    The quality of the audio is giving me these vibes:

    https://www.youtube.com/watch?v=ngZ0K3lWKRc

    Hayao Miyazaki, 7 years ago, on AI generated motion capture.

  • by owenpalmer on 9/24/24, 4:56 PM

    "A man yells, slams a door and then speaks."

    These are hilarious.