from Hacker News

NaturalSpeech: End-to-end text to speech synthesis with human-level quality

by phsilva on 5/17/22, 8:46 PM with 149 comments

by webmaven on 5/17/22, 9:54 PM
Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating samples as a correlated group, distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.
And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. Nevertheless:
The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.
The recorded human samples vary more between samples (by which I mean the sample as a whole may be emphasized with a bit of extra stress or a small raising or lowering of tone compared to the other samples), and within the sample there is a bit more emphasis on a word or two or slight variance in the length of pauses, mostly appropriate in context (as in, it is similar to what I, a non-professional[0], would have emphasized if I were being asked to record these).
In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.
Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (e.g. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.
[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.
by causality0 on 5/18/22, 4:41 AM
It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.
by thorum on 5/18/22, 6:21 AM
See also the recently published Tortoise TTS, which IMO sounds even better: https://github.com/neonbjb/tortoise-tts
by DonHopkins on 5/18/22, 2:06 AM
Nice pitch envelopes. But it's a bit uncanny that natural human pitch envelopes encode and express what you understand and intend to convey about the meaning of the words you're saying, and what you want to emphasize about each individual word, emotionally. Like how you'll say a word you don't really mean sarcastically. It can figure out it's a question because the sentence ends in a question mark, and it raises the pitch at the end, but it can't figure out what the meaning or point of the question is, and which words to emphasize and stress to convey that meaning. (Not a criticism of this excellent work, just pointing out how hard a problem it is!)
For example, compare "rebuke and abash": in the NaturalSpeech sample, one goes down like she's sure and the other goes up like she's questioning, whereas in the recording, they are both more balanced and emphasized as equally important words in the sentence. And the pause after "insolent" in "insolent and daring" sounds uneven compared to the recording, which emphasizes the pair of words more equally and tightly.
Jiminy Glick interviews (and does an impression of) Jerry Seinfeld:
https://www.youtube.com/watch?v=AE2utktZ92Y
by jcims on 5/18/22, 2:35 AM
There's no way I'll find it, but somewhere along the way there was a collection of samples in which one of these contemporary model-based speech synthesizers (possibly WaveNet or Tacotron) was forced to output data with no useful text (I can't remember if it was just noise or literally zero input). The synthesizer just started creating weird breathy pops and purrs and gibberish utterances. Some of them sounded like panic breathing, and it was one of the more jarring things I've heard in quite some time.
This isn't exactly it but it's very close - https://www.deepmind.com/blog/wavenet-a-generative-model-for... CTRL+F 'babbling'
by sandreas on 5/18/22, 6:57 AM
For German users, I can recommend taking a look at
https://www.thorsten-voice.de/
https://github.com/thorstenMueller/Thorsten-Voice
where someone contributed a huge set of his voice samples and a tutorial / script collection to build a pretty decent TTS model LOCALLY.
Quality-wise it is not as good as the samples in the article, but it's free and pretty easy to follow for a tech enthusiast.
by Loeffelmann on 5/18/22, 5:56 AM
What's a good TTS cloud service that has anything even close to these voices? I looked at the Google and Amazon ones and was pretty disappointed.
by big_fan on 5/18/22, 4:44 AM
Industry research lab claims human parity on end-to-end text-to-speech and releases a web page with five samples as proof? Microsoft, you're a little late to the party - Google has been using this playbook for 5 years!
by explorigin on 5/18/22, 2:46 PM
It's clear that their dataset contains a lot of newscasts. I wouldn't call this "natural" speech. But it certainly has an application for replacing newscasters/announcers.
by themodelplumber on 5/18/22, 4:20 AM
Reminds me of how good the choir instruments are these days. https://youtu.be/ulK3_o7OyEk?t=392
by midjji on 5/18/22, 7:48 AM
While every sample they provide is suspiciously similar to the human version (indicating overtraining, either on the samples or on a single voice), where I would have expected a different, if still human-quality, voice from a fully functional system, this tech is coming, and soon. And when it does, voice acting will no longer prevent videogames from having complex stories, and we will find out if the industry is still capable of making them. Looking forward to it :)
by Quequau on 5/18/22, 10:00 AM
I don't suppose anyone could recommend a good text-to-speech for Linux?
Command line is fine but it would be much better if it could trivially take clipboard content for input. The last time I looked I found stuff that wasn't that great and was pretty inconvenient.
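A minimal sketch of the clipboard-to-speech pipeline described above, assuming an X11 session with `xclip` and `espeak-ng` installed (both are assumptions, not something named in this comment; on Wayland a tool such as `wl-paste` would take the place of `xclip`). The helper names `build_pipeline` and `speak_clipboard` are hypothetical.

```python
import shutil
import subprocess


def build_pipeline(clipboard_tool="xclip", tts_tool="espeak-ng"):
    """Return two argv lists: one that prints the clipboard, one that speaks stdin.

    Assumes xclip-style flags for the clipboard tool and an espeak-style
    `--stdin` flag for the speech tool.
    """
    read_cmd = [clipboard_tool, "-o", "-selection", "clipboard"]
    speak_cmd = [tts_tool, "--stdin"]
    return read_cmd, speak_cmd


def speak_clipboard():
    """Read the current clipboard contents and pipe them to the speech tool."""
    read_cmd, speak_cmd = build_pipeline()
    # Fail early with a readable message if either tool is missing.
    for cmd in (read_cmd, speak_cmd):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} is not installed")
    # Capture the clipboard as raw bytes, then feed it to the TTS process.
    text = subprocess.run(read_cmd, capture_output=True, check=True).stdout
    subprocess.run(speak_cmd, input=text, check=True)


# Usage (requires the tools above): call speak_clipboard(), e.g. from a
# script bound to a keyboard shortcut.
```

Binding a script that calls `speak_clipboard()` to a hotkey would give the "copy, then listen" workflow; the voice quality is that of whichever engine is plugged in, not of the samples in the article.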
by est31 on 5/18/22, 2:05 AM
Very subtle differences can be heard, but I have my headphones on. For example, in the last example, "borne" and "commission" seem to have some kind of artificial noise inside the "b" and "c" sounds. The "th" in "clothing" sounds artificial too. Still, it's extremely impressive, and probably in 90% of settings, people won't be able to find a difference at all. It even does breaths: "scientific certainty <breath> that".
by justinlloyd on 5/18/22, 3:47 AM
This is pretty impressive work, except for this one: "who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars"
Both the NaturalSpeech and the human said pretty much every word in that sentence completely incorrectly for the context of the words. It is the difference between "the car Seat" and "the car seat". "It's pronounced Ore-garh-no" to paraphrase the insufferable Hermione Granger.
by exebook on 5/18/22, 9:20 AM
One thing I've noticed is that I can hear the human inhale before they continue speaking. I'm curious whether the TTS of the future should have this feature too.
by microtherion on 5/18/22, 3:32 AM
Good quality overall, though it's difficult to tell from a small, hand-picked set of examples (which appear to come from the training data, too; have the corresponding recordings been included in the voice build or held out?).
There is a rather obvious problem with the stress on "warehouses", and a more subtle problem with "warrants on them", where it's difficult to get the stress pattern just right.
by vtts on 5/20/22, 2:06 AM
The Text-To-Speech service by https://vtts.xyz is the perfect choice for anyone who needs an instant human-sounding voiceover for their commercial or non-commercial projects. Got a product to sell online? Why not transform your boring text into a natural-sounding voiceover and impress your customers? What about adding a voiceover to your animation or instructional video? It will make it sound more professional and engaging! Our human-sounding voices add inflections that make them sound natural, and our custom text editor makes it easy to get exactly what you want from both Male & Female voices, with over 30 different tones, including Serious, Joyful & Normal.
by DantesKite on 5/18/22, 1:39 AM
That is crazy. Any way I can start using this soon? I have a backlog of articles I’d love to listen to.
by lvl102 on 5/18/22, 9:13 AM
Microsoft/Nuance has been doing great in this area. I am very impressed with TTS on Windows. It makes proofing documents that much easier. I do think there is a need for some type of markup (akin to sheet music) for supervised learning.
by sebringj on 5/18/22, 2:55 AM
You could tell the difference in that the AI pronounced "Hussars" correctly whereas the human reader did not. Without adding in our human error, our AI-trained version will be the more educated one for certain going forward.
by jollybean on 5/18/22, 11:24 AM
It'd be nice if we could input our own text because otherwise these things are subject to a lot of training corpus and other biases.
Sounds really good though.
by p1necone on 5/18/22, 9:41 AM
This kind of stuff is going to be amazing for indie gamedevs. I want a model trained for "powerful narrator voice" and villain speeches.
by coolspot on 5/18/22, 4:04 PM
> We train our proposed system on 8 NVIDIA V100 GPUs with 32G memory
Sounds like openly reproducing this result is within independent researchers’ reach.
by msluyter on 5/18/22, 2:07 PM
Totally tangential comment:
You can click play on any/all of the samples simultaneously, resulting in a neat sonic effect vaguely reminiscent of Steve Reich's famous "Come Out." [1]
[1] https://www.youtube.com/watch?v=g0WVh1D0N50 (skip to like 7 minutes in to get the idea)
by sriku on 5/18/22, 2:14 AM
I wish for the "naturalspeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise the fact that we may not be able to tell them apart by ear (in a blindfold test) doesn't tell us much about how good it is as a speech synth engine with that evidence alone.
by IYasha on 5/18/22, 9:38 PM
As a daily TTS user, sometimes I'm even fine with espeak quality for system messages. But one thing concerns me more than the beauty of the voice: the ability to process mixed-language text and abbreviations. And I don't see these problems addressed in this project.
by vtts on 5/20/22, 2:12 AM
Do you want to check how it works? You can test the standard voices and advanced neural voices at this URL: https://vtts.xyz/home/tryme
by UltraViolence on 5/23/22, 8:00 PM
I'm not sure what to make of this. The TTS output seems identical to the recording.
Why don't they use this tech to recreate some dead actor's speech, for example?
by qgin on 5/18/22, 1:42 AM
Unbelievable. This has traversed the uncanny valley and come out the other side.
by infinitone on 5/18/22, 4:05 AM
This is definitely human-level quality. In fact, the synthesized versions pronounce some words better than the human. Kudos to MSFT! I think they've been in the game the longest too...
Edit: is the Nuance acquisition compounding yet?
by colordrops on 5/18/22, 4:09 AM
I actually think the TTS voice is better sounding than the human's voice.
by hooloovoo_zoo on 5/18/22, 4:15 AM
Cadence still seems way off for the AI. Maybe it’s going word by word?
by PedroBatista on 5/18/22, 12:39 PM
Very cool, but..
What's the end game here? Because I cannot use it, I cannot buy it, and this seems like more than just a scientific paper.
So what's the objective here?
by karmasimida on 5/18/22, 5:27 AM
I think we have reached the stage of AI development where I am no longer surprised or excited by results like these.
by baxuz on 5/18/22, 8:46 AM
Is there any TTS engine which isn't based on English?
I'd love to be able to use an assistant device in Croatian in my lifetime.
by skykooler on 5/18/22, 4:02 AM
Wow, this is the first speech synthesis I've seen on here where I thought I was listening to a human at first.
by wrycoder on 5/18/22, 3:04 AM
Is this available in other languages yet?
by hrdwdmrbl on 5/18/22, 12:26 PM
[Take my money meme] I want this for articles and books now, please
by nwatab on 5/18/22, 1:10 PM
Sounds nice. I'm interested in building a business based on TTS.
by blueflow on 5/18/22, 10:24 AM
What the fuck is end-to-end text? My Bullshit-O-Meter is off the charts. End of what? I only know end-to-end encryption.
by SemanticStrengh on 5/18/22, 8:02 AM
Can someone please upload the results? On https://paperswithcode.com/sota/text-to-speech-synthesis-on-...