by kwindla on 6/26/24, 9:51 PM with 99 comments
I'm convinced that voice is going to be a bigger and bigger part of how we all interact with generative AI. But one thing that's hard, today, is building voice bots that respond as quickly as humans do in conversation. A 500ms voice-to-voice response time is just barely possible with today's AI models.
You can get down to 500ms if you: host transcription, LLM inference, and voice generation all together in one place; are careful about how you route and pipeline all the data; and the gods of both wifi and vram caching smile on you.
Here's a demo of a 500ms-capable voice bot, plus a container you can deploy to run it yourself on an A10/A100/H100 if you want to:
https://fastvoiceagent.cerebrium.ai/
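For a rough idea of what "route and pipeline all the data" means in practice, here is a minimal sketch of the streaming hand-off; function names like stream_llm_tokens and speak are hypothetical placeholders, not the actual stack behind the demo:

    import asyncio
    import re

    # Hypothetical streaming stages -- stand-ins, not the demo's actual stack.
    async def stream_llm_tokens(prompt):
        # Pretend streaming LLM client that yields tokens as they arrive.
        for tok in ["Sure, ", "I ", "can ", "help ", "with ", "that. ",
                    "What ", "else ", "do ", "you ", "need?"]:
            await asyncio.sleep(0.01)
            yield tok

    async def speak(sentence):
        # Stand-in for streaming TTS plus audio playback.
        print(f"TTS <- {sentence!r}")

    async def respond(user_utterance):
        # The pipelining trick: hand the first complete sentence to TTS
        # immediately instead of waiting for the whole LLM reply, so
        # time-to-first-audio is roughly llm ttfb + first sentence + tts ttfb.
        buffer = ""
        async for token in stream_llm_tokens(user_utterance):
            buffer += token
            match = re.match(r"(.+?[.!?])\s+(.*)", buffer, re.S)
            if match:  # sentence aggregation: flush on end-of-sentence punctuation
                await speak(match.group(1))
                buffer = match.group(2)
        if buffer.strip():
            await speak(buffer.strip())

    asyncio.run(respond("hello"))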
We've been collecting lots of metrics. Here are typical numbers (in milliseconds) for all the easily measurable parts of the voice-to-voice response cycle.
macOS mic input 40
opus encoding 30
network stack and transit 10
packet handling 2
jitter buffer 40
opus decoding 30
transcription and endpointing 200
llm ttfb 100
sentence aggregation 100
tts ttfb 80
opus encoding 30
packet handling 2
network stack and transit 10
jitter buffer 40
opus decoding 30
macOS speaker output 15
----------------------------------
total ms 759
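Summing the rows is a quick way to see where the time actually goes; grouping the table's values (an illustrative check, nothing more), the transport overhead is mostly fixed and the AI stages dominate:

    # Quick check of the table above (values in ms), grouped by kind of work.
    transport = [40, 30, 10, 2, 40, 30, 30, 2, 10, 40, 30, 15]  # mic, codecs, network, buffers, speaker
    ai_stages = [200, 100, 100, 80]  # transcription+endpointing, llm ttfb, sentence aggregation, tts ttfb
    print(sum(transport) + sum(ai_stages))  # 759
    print(sum(ai_stages))                   # 480 -- the model stages dominate the budget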
Everything in AI is changing all the time. LLMs with native audio input and output capabilities will likely make it easier to build fast-responding voice bots soon. But for the moment, I think this is the fastest possible approach/tech stack.
by firefoxd on 6/27/24, 7:32 AM
I worked on an AI for customer service. Our agent brought the average response time down from 24-48 hours to mere seconds.
One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."
The customer responded "thank you so much" and gave us a perfect CSAT score. Speed trumps everything, even when you make such a horrible mistake.
by vessenes on 6/27/24, 11:50 AM
One thing this speed makes me think is that for some chat workflows you'll need/get to have a kind of multi-step approach — essentially, a quick response, during which a longer data / info / RAG query can be farmed out, then the informative result picks up.
Humans work like this; we use lots of filler words as we sort of get going responding to things.
Right now, most workflows seem to be just one-shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low-latency responses is probably something like: [3s of Llama 8b in your ears] -> query -> [55s of Llama 70b/GPT4/whatever you want, informed by the query].
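A minimal sketch of that fast-acknowledgement-while-the-slow-path-runs pattern; all the function names and timings below are made-up placeholders, not a specific vendor's API:

    import asyncio

    # Made-up placeholders for a small fast model, a retrieval step, and a big slow model.
    async def small_model_ack(user_text):
        await asyncio.sleep(0.2)   # fast time-to-first-byte
        return "Good question -- give me a second to pull that up."

    async def retrieve(user_text):
        await asyncio.sleep(1.0)   # RAG / data / info query
        return ["doc snippet 1", "doc snippet 2"]

    async def large_model_answer(user_text, context):
        await asyncio.sleep(2.0)   # slower, better model
        return f"Full answer, informed by {len(context)} retrieved snippets."

    async def handle_turn(user_text):
        # Kick off the slow path immediately...
        retrieval = asyncio.create_task(retrieve(user_text))
        # ...while the small model produces the filler the user hears first.
        print(await small_model_ack(user_text))
        context = await retrieval
        print(await large_model_answer(user_text, context))

    asyncio.run(handle_turn("what's our refund policy for damaged items?"))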
Very cool, thank you for sharing this.
by luke-stanley on 6/27/24, 9:06 AM
There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.
GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.
I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!
I do wonder how needed or optimal a single combined model is for latency and cost optimisation.
The breakdown provided is interesting.
I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?
by _def on 6/27/24, 10:02 AM
"Oh I think I figured out your secret!"
"Please tell me"
"You achieve the short response times by keeping a short context"
"You're absolutely right"
by andrewstuart on 6/27/24, 7:00 AM
Apple's Siri still can't allow me to have a conversation in which we aren't tripping over each other and pausing and flunking and the whole thing degrades into me hoping to get the barest minimum from it.
by etherealG on 7/3/24, 3:20 PM
https://www.youtube.com/live/hm2IJSKcYvo
hn discussion here: https://news.ycombinator.com/item?id=40866569
by dijit on 6/27/24, 5:50 AM
I think you hit a very important nail on the head here; I feel like that scene in I, Robot where the protagonist talks to the hologram, or in the movie "A.I." where the protagonist talks to an encyclopaedia called "Dr. Know"
by andrewmcwatters on 6/27/24, 6:33 PM
Tangentially related, I remember years ago, when Stadia and other cloud gaming products were being released, doing these calculations and showing a buddy of mine that even in the best-case scenario you'd always have enough input latency to make casual multiplayer FPS games over cloud gaming services infeasible, or at least uncomfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch reaction times.
The same math holds up today because of a combination of fundamental limits and state-of-the-art limits.
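As a rough illustration of that back-of-the-envelope math (these are assumed round numbers, not measurements of any particular service):

    # Illustrative click-to-photon budget for cloud gaming (assumed round numbers, in ms).
    budget = {
        "input device + OS": 10,
        "uplink to data center": 20,
        "game simulation + render (one 60 fps frame)": 16,
        "video encode": 5,
        "downlink": 20,
        "video decode": 5,
        "display/compositor": 8,
    }
    print(sum(budget.values()))  # ~84 ms before any jitter, Wi-Fi retries, or congestion
    # The network and codec hops alone are ~50 ms that local play never pays.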
by mmcclure on 6/27/24, 8:14 AM
Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).
by c0brac0bra on 6/27/24, 11:26 AM
I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.
To get the latency reduction of several hundred milliseconds, how much more would it be for "average" usage?
by amluto on 6/28/24, 11:26 AM
> jitter buffer [40ms]
Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.
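A sketch of what "feed the frames as you receive them" could look like, assuming a hypothetical streaming-transcriber interface (the demo's actual plumbing may well differ):

    import asyncio

    # Hypothetical streaming-STT client; real APIs (Deepgram etc.) differ in the details.
    class StreamingTranscriber:
        async def send_audio(self, frame: bytes):
            pass  # push a 20-40 ms audio frame to the model as soon as it arrives

    async def receive_frames():
        # Stand-in for the receive path: yields frames in arrival order,
        # possibly late or out of order -- no 40 ms jitter buffer in front of it.
        for _ in range(5):
            await asyncio.sleep(0.02)
            yield b"\x00" * 640  # one 20 ms frame of 16 kHz, 16-bit mono PCM

    async def main():
        stt = StreamingTranscriber()
        async for frame in receive_frames():
            # The model doesn't need smooth playback timing, so forward immediately;
            # any reordering or gap handling has to happen here (or be tolerated).
            await stt.send_audio(frame)

    asyncio.run(main())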
by andruby on 6/27/24, 12:32 PM
And this was from a mobile connection in Europe, with a shown latency of just over 1s.
by ftth_finland on 6/27/24, 11:15 AM
Perfect comprehension and no problem even with bad accents.