from Hacker News

BASE TTS: The largest text-to-speech model to-date

by jcuenod on 2/14/24, 7:09 PM with 78 comments

  • by qwertox on 2/14/24, 9:43 PM

    Interesting. Just a couple of hours ago I came across MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of their TTS in English (sadly no other languages available).

    If this year becomes the year when high quality Open Source TTS and ASR models appear that can run in real-time on an Nvidia RTX 40x0 or 30x0, then that would be great. On CPU even better.

    Also note the Ethical Statement on BASE TTS:

    > An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

    [0] https://github.com/metavoiceio/metavoice-src

    [1] https://ttsdemo.themetavoice.xyz/

  • by minimaxir on 2/14/24, 7:38 PM

    The emotion examples are interesting. One of the most obvious current indicators of AI-generated voices/voice cloning is a lack of emotion and range, which makes them objectively worse than professional voice actors, unless a lack of emotion and range is the desired voice direction.

    But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.

  • by oersted on 2/14/24, 10:41 PM

    The Spanish voice has an interesting accent: 85% Castilian (from Spain) pronunciation, with a few unexpected Latin American tonalities and phonemes (especially "s") sprinkled in.

    I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain vs Latin America for socioeconomic reasons; Spain's population is obviously much smaller.

  • by IronWolve on 2/14/24, 7:30 PM

    A while ago, when Amazon offered unlimited free use of its neural TTS (with a limit on text length), I was converting an ebook to an audiobook. It was amazing how lifelike it sounded, including the inflections of the voice. Amazing.

    Amazon really had the best-sounding TTS I've heard, compared to paid Microsoft and Google. Hands down better. But the technology is getting better for open source; I'd expect that in a year or 2, home use will be on par in quality with paid services.

    I can't wait for real-time video translation, so shows with non-English subs can be translated into English speech. You can do it now with some services: upload a video and the language, voice, and mouth movements are converted to any language.

  • by LarsDu88 on 2/14/24, 8:28 PM

    Sounds about as good as ElevenLabs.io. Hopefully if this ships on AWS, it will support SSML tags. I used ElevenLabs.io for all the voices in my VR game (https://roguestargun.com), but it's still lacking on the emotion front, which is all one-shot.

  • by solarized on 2/15/24, 5:34 AM

    From the ethical statement.

    > However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

    Another irony. ElevenLabs has already SaaS-ed this feature. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?

  • by unsupp0rted on 2/14/24, 7:28 PM

    > Echoing the widely-reported "emergent abilities" of Large Language Models when trained on increasing volume of data, we show that BASE TTS variants built with 10k+ hours start to exhibit advanced understanding of texts that enable contextually appropriate prosody.

  • by revenga99 on 2/14/24, 7:30 PM

    Wow. I could see this threatening audiobook narrators. However, I would still prefer a real narrator to this in its current state. I think what it might be missing is different voices/accents for different characters.

  • by mrfakename on 2/14/24, 8:13 PM

    Sadly, they didn't release the code or models.

  • by maxglute on 2/14/24, 7:50 PM

    Are there any decent TTS models that can be run locally and plug into existing software like SAPI without too much lag?

  • by sebmellen on 2/14/24, 11:04 PM

    Open question: does anyone know of a TTS model which can synchronize the output to an SRT or other subtitle file?

  • by nshm on 2/14/24, 9:58 PM

    Err, I deeply respect the Amazon TTS team, but this paper and synthesis are... You publish the paper in 2024 and include YourTTS in your baselines to look better. Come on! XTTS2 is around!

    The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore; it is more like a telephony recording.

  • by SparkyMcUnicorn on 2/14/24, 7:58 PM

    > ... capable of mimicking speaker characteristics with just a few seconds of reference audio ... we have decided against open-sourcing this model as a precautionary measure.

    Disappointed yet again.

  • by mrfakename on 2/17/24, 8:43 PM

    Looks like the website (amazon-ltts-paper.com) now redirects to amazon.science. They took out the "Ethical Statement" section. (The original page can still be accessed from the Wayback Machine: https://web.archive.org/web/20240215005705/https://amazon-lt...)

  • by JanSt on 2/14/24, 9:30 PM

    I would love an API for this. Any information on availability?

  • by precompute on 2/15/24, 6:38 PM

    Ah, so that's where all the Alexa recordings went.

  • by somesun on 2/18/24, 1:31 AM

    Is there any open-source library that can reach the quality of Microsoft TTS and supports multiple languages?