by jcuenod on 2/14/24, 7:09 PM with 78 comments
by qwertox on 2/14/24, 9:43 PM
If this year becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.
Also note the Ethical Statement on BASE TTS:
> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
by minimaxir on 2/14/24, 7:38 PM
But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.
by oersted on 2/14/24, 10:41 PM
I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain over Latin America for socioeconomic reasons, even though Spain's population is obviously much smaller.
by IronWolve on 2/14/24, 7:30 PM
Amazon really had the best-sounding TTS I've heard compared to paid Microsoft and Google. Hands down better. But the technology is getting better for open source; I'd expect that in a year or two, home use will be on par in quality with paid services.
I can't wait for real-time video translation, so shows with non-English subs can be translated into English speech. You can do it now with some services: upload a video and the language, voice, and mouth movements are converted to any language.
by LarsDu88 on 2/14/24, 8:28 PM
by solarized on 2/15/24, 5:34 AM
> However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
Another irony. ElevenLabs has already SaaS-ed this feature. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?
by unsupp0rted on 2/14/24, 7:28 PM
by revenga99 on 2/14/24, 7:30 PM
by mrfakename on 2/14/24, 8:13 PM
by maxglute on 2/14/24, 7:50 PM
by sebmellen on 2/14/24, 11:04 PM
by nshm on 2/14/24, 9:58 PM
The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore; it sounds more like a telephony recording.
by SparkyMcUnicorn on 2/14/24, 7:58 PM
Disappointed yet again.
by mrfakename on 2/17/24, 8:43 PM
by JanSt on 2/14/24, 9:30 PM
by precompute on 2/15/24, 6:38 PM
by somesun on 2/18/24, 1:31 AM