by CyborgCabbage on 7/18/23, 12:29 PM with 42 comments
by bradrn on 7/18/23, 1:20 PM
The voice can be modeled using two main components. The vocal cords are a periodic source of sound, which is then filtered by the mouth and tongue to produce vowel sounds [0]. The filter can be modeled as a set of band-pass filters, each of which lets through a specific band of frequencies — these are called ‘formants’ in acoustic phonetics. Different vowel sounds are produced by combining formants at different pitches in a systematic way [1]. You can hear this yourself by very slowly moving your mouth from saying an ‘eeeee’ sound to an ‘ooooo’ sound: if you listen carefully, you can hear one formant changing pitch while the others stay the same. (I like [2] as an intro to this kind of stuff.)
The ‘voder’ works by having one key for each band-pass filter, each covering a different frequency band. Pressing multiple keys adds the resulting sounds, producing an output sound with distinct formants. If you use the right formants, the resulting sound is very similar to that produced by a human mouth saying a specific vowel! Software such as the vowel editor in Praat [3] takes it further by allowing selection of formants from a standard vowel chart. (There's a small Web Audio sketch of this source-filter idea after the footnotes below.)
[0] Consonantal sounds are a bit more complicated, since they tend to involve various different noise sources and transient disturbances of the sound. For instance, /ʃ/ (the ‘sh’ sound) is noise of a lower frequency than /s/. I can’t work out how Harper produced the difference between those two sounds in the video — it seems to be impossible to do this with the live demo. In fact, any sort of pitch control seems to be impossible in the demo.
[1] This is how overtone singing and throat singing work! Selectively amplifying one formant gives the impression that you’re singing that note at the same time as the ‘base’ pitch. In fact, if you do that, your vocal cords are producing a pitch plus all its overtones, while your mouth is enhancing one overtone while filtering out all the others.
[2] https://newt.phys.unsw.edu.au/jw/voice.html
[3] https://www.fon.hum.uva.nl/praat/ — probably also available from your favourite Linux distro!
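Here's a minimal Web Audio sketch of the source-filter idea, if you want to hear it in a browser console (not the Voder's mechanism, just an illustration; the formant frequencies are rough textbook values, not measured ones):

```js
// Play a crude vowel: a sawtooth 'glottal source' run through one band-pass
// filter per formant. The formant values are approximate and purely illustrative.
function playVowel(formants = [730, 1090, 2440]) { // roughly an 'ah' vowel
  const ctx = new AudioContext();

  // Periodic source standing in for the vocal cords.
  const source = ctx.createOscillator();
  source.type = 'sawtooth';
  source.frequency.value = 110;

  // The 'mouth': each formant is a fairly narrow band-pass filter on the same source.
  for (const freq of formants) {
    const filter = ctx.createBiquadFilter();
    filter.type = 'bandpass';
    filter.frequency.value = freq;
    filter.Q.value = 10;
    source.connect(filter);
    filter.connect(ctx.destination);
  }

  source.start();
  source.stop(ctx.currentTime + 1); // play for one second
}
```

Calling playVowel([270, 2290, 3010]) instead should sound closer to /i/.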
by jvm___ on 7/18/23, 2:09 PM
by kypro on 7/18/23, 1:48 PM
I was thinking about the complexity of expression in TTS voice synthesizers recently and it struck me just how difficult a problem that is.
To be as expressive as a human the AI model would need to fully "understand" the context of what is being said. Consider how a phrase like "I hate you" can be said in a loving way between friends sharing a joke at each other's expense, vs being said with anger or in sadness.
It got me wondering if all sufficiently complex problems require models to be generally intelligent – at least in the sense that they have deep, nuanced models of the world.
For example, perhaps for a self-driving car to be as "good" as a human it actually needs to be generally intelligent, in the sense that it needs to understand that it's appropriate to drive differently in an emergency situation vs a leisurely weekend drive through a scenic part of town. When driving through my city after 8PM on the weekend I tend to drive slower and more cautiously because I know drunk people often walk out in front of my car – would a good self-driving car not need to understand these nuances of the world too?
This is interesting because it highlights just how important human understanding is to accurately conveying expression in a voice synthesizer. While I'd argue modern voice synthesizers have been more intelligible than this for some time, the expressiveness of this machine has probably only recently been rivalled by state-of-the-art AI models.
by JKCalhoun on 7/18/23, 1:01 PM
I've played with the SP0256 speech synthesis IC and found constructing intelligible words challenging even with all the phonemes available on that silicon.
This extended video has me thinking it probably was legit though:
by bsza on 7/18/23, 6:58 PM
[0] https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_spea...
by joezydeco on 7/18/23, 2:24 PM
by slmnsmk on 7/18/23, 1:22 PM
Hey I know the person who made this!
Thanks for sharing, it really was a labor of love. I remember Griffin being super excited about how it turned out. They are really passionate about the World's Fair!
by lacrimacida on 7/18/23, 1:06 PM
by userbinator on 7/18/23, 1:29 PM
by jcpst on 7/18/23, 12:50 PM
by lbriner on 7/18/23, 4:16 PM
I heard a sample of "Say, good afternoon radio audience", then the Voder produced something very similar, but listening to it without the prompt you would have to guess what it meant.
A Derren Brown kind of trick :-)
by mwcampbell on 7/18/23, 2:39 PM
by JoeDaDude on 7/18/23, 2:36 PM
BTW, there was one fellow who built one, something I'd like to try someday. See his recreation here:
by colanderman on 7/18/23, 1:49 PM
by Minor49er on 7/19/23, 1:01 AM
by fbdab103 on 7/19/23, 2:10 AM
by chaosprint on 7/18/23, 1:12 PM
I have added this to my feature list for https://glicol.org
the source code looks fairly straightforward. very cool
```js
function makeFormantNode(ctx, f1, f2) {
  // Sawtooth source standing in for the vocal cords, pitched at 110 Hz.
  const sinOsc = ctx.createOscillator();
  sinOsc.type = 'sawtooth';
  sinOsc.frequency.value = 110;
  sinOsc.start();

  // Band-pass filter covering the f1..f2 band; Q = centre frequency / bandwidth.
  const bandPass = ctx.createBiquadFilter();
  bandPass.type = 'bandpass';
  bandPass.frequency.value = (f1 + f2) / 2;
  bandPass.Q.value = ((f1 + f2) / 2) / (f2 - f1);

  // Gain starts at zero; pressing the key ramps it up, releasing ramps it down.
  const gainNode = ctx.createGain();
  gainNode.gain.value = 0.0;

  sinOsc.connect(bandPass);
  bandPass.connect(gainNode);
  gainNode.connect(ctx.destination);

  return {
    start() {
      gainNode.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
    },
    stop() {
      gainNode.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
    },
    panic() {
      gainNode.gain.cancelScheduledValues(ctx.currentTime);
      gainNode.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
    },
  };
}

function makeSibilanceNode(ctx) {
  // Looping noise buffer for the unvoiced (hissing) sounds.
  const buffer = ctx.createBuffer(1, NOISE_BUFFER_SIZE, ctx.sampleRate);
  const data = buffer.getChannelData(0);
  for (let i = 0; i < NOISE_BUFFER_SIZE; ++i) {
    data[i] = Math.random();
  }

  const noise = ctx.createBufferSource();
  noise.buffer = buffer;
  noise.loop = true;

  const noiseFilter = ctx.createBiquadFilter();
  noiseFilter.type = 'bandpass';
  noiseFilter.frequency.value = 5000;
  noiseFilter.Q.value = 0.5;

  const noiseGain = ctx.createGain();
  noiseGain.gain.value = 0.0;

  noise.connect(noiseFilter);
  noiseFilter.connect(noiseGain);
  noiseGain.connect(ctx.destination);
  noise.start();

  return {
    start() {
      noiseGain.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
    },
    stop() {
      noiseGain.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
    },
    panic() {
      noiseGain.gain.cancelScheduledValues(ctx.currentTime);
      noiseGain.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
    },
  };
}

function initialize() {
  audioCtx = new (window.AudioContext || window.webkitAudioContext)();
  // One filtered-buzz node per key, plus the noise source on the space bar.
  audioNodes['a'] = makeFormantNode(audioCtx, 0, 225);
  audioNodes['s'] = makeFormantNode(audioCtx, 225, 450);
  audioNodes['d'] = makeFormantNode(audioCtx, 450, 700);
  audioNodes['f'] = makeFormantNode(audioCtx, 700, 1000);
  audioNodes['v'] = makeFormantNode(audioCtx, 1000, 1400);
  audioNodes['b'] = makeFormantNode(audioCtx, 1400, 2000);
  audioNodes['h'] = makeFormantNode(audioCtx, 2000, 2700);
  audioNodes['j'] = makeFormantNode(audioCtx, 2700, 3800);
  audioNodes['k'] = makeFormantNode(audioCtx, 3800, 5400);
  audioNodes['l'] = makeFormantNode(audioCtx, 5400, 7500);
  audioNodes[' '] = makeSibilanceNode(audioCtx);
}
```
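The snippet doesn't show how the keyboard drives those start/stop methods; I'd guess the wiring is roughly something like this (my assumption, not part of the quoted source):

```js
// Hypothetical key handling: pressing a key opens that band's gain,
// releasing it closes the gain again.
document.addEventListener('keydown', (e) => {
  const node = audioNodes[e.key];
  if (node && !e.repeat) node.start();
});
document.addEventListener('keyup', (e) => {
  const node = audioNodes[e.key];
  if (node) node.stop();
});
```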
by zzzeek on 7/18/23, 1:57 PM