by interweb on 5/16/20, 4:54 AM with 101 comments
by crazygringo on 5/16/20, 8:51 PM
But I'm very curious what the emotional "parameters" are. There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.
Obviously it's far too complex to infer from text alone, so I'm curious how the team has simplified it. What are the emotional dimensions you can specify, and how did they choose those dimensions over others? Are they geared towards the kind of "everyday" expression in a normal conversation between friends, or towards the more "dramatic" or "high comedy" of intense situations that much of film and TV lean towards?
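(The demo doesn't document its parameterization, but a common simplification in affective computing is Russell's circumplex model, which collapses discrete emotion labels onto two continuous axes: valence and arousal. A toy sketch of that idea; the labels come from the list above and the coordinates are purely illustrative, not Sonantic's:)

```python
# Russell's circumplex model: emotions as points on a valence axis
# (negative to positive) and an arousal axis (calm to excited).
# Coordinates below are made up for illustration.
EMOTIONS = {
    "serious-romantic": (0.7, 0.3),
    "reassuring":       (0.6, 0.1),
    "sarcastic":        (-0.3, 0.4),
    "irritated":        (-0.6, 0.6),
    "choking-up":       (0.2, 0.8),
}

def nearest_emotion(valence: float, arousal: float) -> str:
    """Return the discrete label closest to a continuous (valence, arousal) point."""
    return min(
        EMOTIONS,
        key=lambda e: (EMOTIONS[e][0] - valence) ** 2
                      + (EMOTIONS[e][1] - arousal) ** 2,
    )
```

Under a model like this, "hundreds of variables" reduce to a couple of sliders, which is exactly the trade-off the question is probing.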
by sonantic on 5/16/20, 8:17 PM
Thanks for your thoughts and feedback thus far! I'd be happy to answer questions (within reason) about our latest cry demo / emotional TTS! Feel free to fire away on this thread.
by spaceprison on 5/16/20, 6:42 PM
The same goes for subtitles: she'd be perfectly fine with a robot voice for the actors if they sounded real enough, like this.
Game changer.
by ArneVogel on 5/16/20, 6:04 PM
by vessenes on 5/17/20, 12:28 AM
I was just mucking around with Nvidia's latest, called Flowtron, and I know from that experience that there's a significant amount of work between getting a tech demo out and launching a usable product, whether API-based or with some visual workflow like your video shows.
One thing worth considering on the commercialization front is whether the core offering is the workflow niceties around your engine, the engine-as-API, or both. I'm just a random person on the internet, so take these thoughts with a large grain of salt, but it seems to me that prioritizing integrations with, say, Unity, Unreal Engine, video compositing tools, or blog-posting tools are all interesting and viable market paths. The underlying networks are going to keep improving for some time, so you're really trying to buy some long-term customers.
Some stuff that's obvious, but I can't resist:
Off the top of my head, I could imagine using this to massively reduce the cost of developing games, for script writers pulling comps together, for creating audio versions of my own writing, for better IoT applications inside the home... I'd really love to be able to play with this.
There still isn't a truly non-annoying virtual assistant voice; when the first Tacotron paper came out, I was hopeful I would see more prosody embedded in assistants by now, but the longer we live with Siri and Google, the more sensitive I think we become to their shortcomings. I have a preference for passive/ambient communication and updates, so I would place a really high value on something that could politely interrupt or say hello with information.
At any rate, congratulations, this is cool. :)
by diminish on 5/16/20, 5:49 PM
We'll soon be able to create emotionally expressive YouTube videos with synthetic actors.
by yc-kraln on 5/17/20, 12:07 AM
The comment: I noticed that your demo video also had "emotional" video layered on top of the dialogue. This could be considered manipulative; perhaps consider sharing a naked version so we could attempt to interpret the emotion based solely on the text-to-speech engine.
The question: You mention you met at EF. I was wondering if, beyond bringing you together, you found EF to be worth the cost of admission?
by microtherion on 5/16/20, 6:00 PM
by schoolornot on 5/16/20, 6:46 PM
by jariel on 5/16/20, 7:39 PM
It would be nice to be able to try it with actual text inputs right on the page; that this doesn't exist is a tiny flag.
Working with voice actors is a great choice: there isn't any 'pure' TTS that's good enough in the most general sense, so having an actual voice actor as a working basis will help.
Perhaps small game houses can just use something off the shelf, while big houses can use a customized voice; then, if they have to make tweaks or changes, they don't have to do a whole new production.
by microcolonel on 5/17/20, 1:44 AM
Next time, be honest about what you have when presenting it; every human with functioning ears is attuned to the sound of speech. This sort of technology would be amazing for narrative video games, even with the less-than-perfect vocoding.
by amelius on 5/16/20, 6:34 PM
by voiper1 on 5/16/20, 8:52 PM
Amazon Polly's English voice, Matthew, is pretty nice, but Polly doesn't have Hebrew. Neither does Google. Bing has some attribution requirement that I haven't fully investigated.
by DenisM on 5/16/20, 6:37 PM
I wonder if attaching this to a modern-day ELIZA would improve Turing test scores? Emotional load can reduce the requirement for semantic coherence.
by tomByrer on 5/16/20, 10:52 PM
If so, do you have plans for a Web Speech API plugin? I'm about to release a reader demo based around it. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
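(For context, the standard Web Speech API's synthesis side only exposes coarse prosody knobs on `SpeechSynthesisUtterance` -- pitch, rate, volume -- nothing like per-line emotional direction, so a plugin would be a genuine step up. A minimal sketch of today's baseline; the option-clamping helper is pulled out as a pure function so it can run outside a browser:)

```javascript
// Pure helper: merge and clamp utterance options to the ranges the
// Web Speech API spec defines. Testable without a browser.
function utteranceOptions({ pitch = 1, rate = 1, volume = 1 } = {}) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  return {
    pitch: clamp(pitch, 0, 2),    // spec range: 0 to 2
    rate: clamp(rate, 0.1, 10),   // spec range: 0.1 to 10
    volume: clamp(volume, 0, 1),  // spec range: 0 to 1
  };
}

// Browser-only: build an utterance and queue it on the global synthesizer.
function speak(text, opts) {
  const u = new SpeechSynthesisUtterance(text);
  Object.assign(u, utteranceOptions(opts));
  window.speechSynthesis.speak(u);
  return u;
}
```

Usage in a page would be e.g. `speak("Hello", { pitch: 1.3, rate: 0.9 })`; an emotional-TTS plugin would presumably replace those three scalars with richer controls.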
by aasasd on 5/17/20, 2:11 AM
by wishinghand on 5/19/20, 5:40 AM
by diskmuncher on 5/18/20, 10:47 PM
Obvious application: H-anime. Reduced parameters for the "emotion" as well.
by hyperpallium on 5/17/20, 12:08 AM
by sarabande on 5/16/20, 8:53 PM
by Animats on 5/16/20, 8:39 PM
by moron4hire on 5/16/20, 8:49 PM
by blattimwind on 5/16/20, 6:48 PM
by dequalant on 5/16/20, 10:45 PM
by terrycody on 5/17/20, 2:27 AM
I want to know the price and when we can use it in production.
by cemregr on 5/16/20, 10:32 PM
by dejongh on 5/17/20, 12:42 PM
by maxdo on 5/16/20, 5:48 PM