by interweb on 5/16/20, 4:54 AM with 101 comments
by crazygringo on 5/16/20, 8:51 PM
But I'm very curious what the emotional "parameters" are. There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.
Obviously it's far too complex to infer from text alone, so I'm curious how the team has simplified it. What are the emotional dimensions you can specify, and how did they choose those dimensions over others? Are they geared towards the kind of "everyday" expression in a normal conversation between friends, or towards the more "dramatic" or "high comedy" of intense situations that much of film and TV lean towards?
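(The demo doesn't document its parameterization, but a common simplification in affective computing is Russell's circumplex model, which collapses discrete emotion labels onto two continuous axes: valence and arousal. A toy sketch of that idea; the labels come from the list above and the coordinates are purely illustrative, not Sonantic's:)

```python
# Russell's circumplex model: emotions as points on a valence axis
# (negative to positive) and an arousal axis (calm to excited).
# Coordinates below are made up for illustration.
EMOTIONS = {
    "serious-romantic": (0.7, 0.3),
    "reassuring":       (0.6, 0.1),
    "sarcastic":        (-0.3, 0.4),
    "irritated":        (-0.6, 0.6),
    "choking-up":       (0.2, 0.8),
}

def nearest_emotion(valence: float, arousal: float) -> str:
    """Return the discrete label closest to a continuous (valence, arousal) point."""
    return min(
        EMOTIONS,
        key=lambda e: (EMOTIONS[e][0] - valence) ** 2
                      + (EMOTIONS[e][1] - arousal) ** 2,
    )
```

Under a model like this, "hundreds of variables" reduce to a couple of sliders, which is exactly the trade-off the question is probing.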
by sonantic on 5/16/20, 8:17 PM
Thanks for your thoughts and feedback thus far! I'd be happy to answer questions (within reason) about our latest cry demo / emotional TTS! Feel free to fire away on this thread.
by spaceprison on 5/16/20, 6:42 PM
The same goes for subtitles: she'd be perfectly fine with a robot voice for the actors if they sounded real enough, like this.
Game changer.
by ArneVogel on 5/16/20, 6:04 PM
by vessenes on 5/17/20, 12:28 AM
I was just mucking around with Nvidia's latest, called Flowtron, and I know from that experience that there's a significant amount of work between getting a tech demo out and launching a usable product, whether API-based or with some visual workflow like your video shows.
One thing worth considering on the commercialization front is whether the core offering is the workflow niceties around your engine, the engine-as-API, or both. I'm just a random person on the internet, so take these thoughts with a large grain of salt, but it seems to me that prioritizing integrations with, say, Unity, Unreal Engine, video compositing tools, or blog-posting tools are all interesting and viable market paths. The underlying networks are going to keep improving for some time, so you're really trying to buy some long-term customers.
Some stuff that's obvious, but I can't resist:
Off the top of my head, I could imagine using this to massively reduce the cost of developing games, for script writers pulling comps together, for creating audio versions of my own writing, for better IoT applications inside the home... I'd really love to be able to play with this.
There still isn't a truly non-annoying virtual assistant voice; when the first Tacotron paper came out, I was hopeful I would see more prosody embedded in assistants by now, but the longer we live with Siri and Google, the more sensitive I think we become to their shortcomings. I have a preference for passive/ambient communication and updates, so I would place a really high value on something that could politely interrupt or say hello with information.
At any rate, congratulations, this is cool. :)
by diminish on 5/16/20, 5:49 PM
We'll soon be able to create emotionally expressive YouTube videos with synthetic actors.
by yc-kraln on 5/17/20, 12:07 AM
The comment: I noticed that your demo video also had "emotional" video layered on top of the dialogue. This could be considered manipulative; perhaps consider sharing a naked version so we could attempt to interpret the emotion based solely on the text-to-speech engine.
The question: You mention you met at EF. I was wondering if, beyond bringing you together, you found EF to be worth the cost of admission?
by microtherion on 5/16/20, 6:00 PM
by schoolornot on 5/16/20, 6:46 PM
by jariel on 5/16/20, 7:39 PM
It would be nice to be able to try it with actual text inputs right on the page; that this doesn't exist is a tiny flag.
Working with voice actors is a great choice: there isn't any 'pure' TTS that's good enough in the most general sense, so having an actual voice actor as a working basis will help.
Perhaps small game houses can just use something off the shelf, while big houses can use a customized voice; then, if they have to make tweaks or changes, they don't have to do a whole new production.
by microcolonel on 5/17/20, 1:44 AM
Next time, be honest about what you have when presenting it; every human with functioning ears is attuned to the sound of speech. This sort of technology would be amazing for narrative video games, even with the less-than-perfect vocoding.
by amelius on 5/16/20, 6:34 PM
by voiper1 on 5/16/20, 8:52 PM
Amazon Polly's English voice, Matthew, is pretty nice, but Polly doesn't have Hebrew. Neither does Google. Bing has some attribution requirement that I haven't fully investigated.
by DenisM on 5/16/20, 6:37 PM
I wonder if attaching this to a modern-day ELIZA would improve Turing test scores? Emotional load can reduce the requirement for semantic coherence.
by tomByrer on 5/16/20, 10:52 PM
If so, do you have plans for a Web Speech API plugin? I'm about to release a reader demo based around it. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
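(For context, the standard Web Speech API's synthesis side only exposes coarse prosody knobs on `SpeechSynthesisUtterance` -- pitch, rate, volume -- nothing like per-line emotional direction, so a plugin would be a genuine step up. A minimal sketch of today's baseline; the option-clamping helper is pulled out as a pure function so it can run outside a browser:)

```javascript
// Pure helper: merge and clamp utterance options to the ranges the
// Web Speech API spec defines. Testable without a browser.
function utteranceOptions({ pitch = 1, rate = 1, volume = 1 } = {}) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  return {
    pitch: clamp(pitch, 0, 2),    // spec range: 0 to 2
    rate: clamp(rate, 0.1, 10),   // spec range: 0.1 to 10
    volume: clamp(volume, 0, 1),  // spec range: 0 to 1
  };
}

// Browser-only: build an utterance and queue it on the global synthesizer.
function speak(text, opts) {
  const u = new SpeechSynthesisUtterance(text);
  Object.assign(u, utteranceOptions(opts));
  window.speechSynthesis.speak(u);
  return u;
}
```

Usage in a page would be e.g. `speak("Hello", { pitch: 1.3, rate: 0.9 })`; an emotional-TTS plugin would presumably replace those three scalars with richer controls.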
by aasasd on 5/17/20, 2:11 AM
by wishinghand on 5/19/20, 5:40 AM
by diskmuncher on 5/18/20, 10:47 PM
Obvious application: H-anime. Reduced parameters for the "emotion" as well.
by hyperpallium on 5/17/20, 12:08 AM
by sarabande on 5/16/20, 8:53 PM
by Animats on 5/16/20, 8:39 PM
by moron4hire on 5/16/20, 8:49 PM
by blattimwind on 5/16/20, 6:48 PM
by dequalant on 5/16/20, 10:45 PM
by terrycody on 5/17/20, 2:27 AM
I want to know the price and when we can use it in production.
by cemregr on 5/16/20, 10:32 PM
by dejongh on 5/17/20, 12:42 PM
by maxdo on 5/16/20, 5:48 PM