from Hacker News

Show HN: Voice bots with 500ms response times

by kwindla on 6/26/24, 9:51 PM with 99 comments

Last year when GPT-4 was released I started making lots of little voice + LLM experiments. Voice interfaces are fun; there are several interesting new problem spaces to explore.

I'm convinced that voice is going to be a bigger and bigger part of how we all interact with generative AI. But one thing that's hard, today, is building voice bots that respond as quickly as humans do in conversation. A 500ms voice-to-voice response time is just barely possible with today's AI models.

You can get down to 500ms if you: host transcription, LLM inference, and voice generation all together in one place; are careful about how you route and pipeline all the data; and the gods of both wifi and vram caching smile on you.
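
A simplified sketch of the shape of that pipeline (not the actual code; the STT/LLM/TTS stages are stubbed out, but the point is that text streams through every stage and audio is flushed at sentence boundaries rather than after the full completion):

  import asyncio

  async def transcribe(audio_frames):
      # Stub: a real STT stage emits transcript fragments as frames arrive.
      async for frame in audio_frames:
          yield f"[transcript of {frame}]"

  async def llm_tokens(prompt):
      # Stub: a real LLM streams tokens; time to first token matters most here.
      for token in ["Hello", " there", ".", " How", " are", " you", "?"]:
          yield token

  async def tts(sentence):
      # Stub: a real TTS stage streams audio for one sentence at a time.
      return f"<audio for {sentence!r}>"

  async def pipeline(audio_frames):
      async for text in transcribe(audio_frames):
          sentence = ""
          async for token in llm_tokens(text):
              sentence += token
              # Flush to TTS at sentence boundaries instead of waiting for
              # the full completion; this is the "sentence aggregation" stage
              # in the table below.
              if token in {".", "?", "!"}:
                  print(await tts(sentence))
                  sentence = ""

  async def main():
      async def mic():
          yield "frame-0"
      await pipeline(mic())

  asyncio.run(main())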

Here's a demo of a 500ms-capable voice bot, plus a container you can deploy to run it yourself on an A10/A100/H100 if you want to:

https://fastvoiceagent.cerebrium.ai/

We've been collecting lots of metrics. Here are typical numbers (in milliseconds) for all the easily measurable parts of the voice-to-voice response cycle.

  macOS mic input                 40
  opus encoding                   30
  network stack and transit       10
  packet handling                  2
  jitter buffer                   40
  opus decoding                   30
  transcription and endpointing  200
  llm ttfb                       100
  sentence aggregation          100
  tts ttfb                        80
  opus encoding                   30
  packet handling                  2
  network stack and transit       10
  jitter buffer                   40
  opus decoding                   30
  macOS speaker output           15
  ----------------------------------
  total ms                       759
Everything in AI is changing all the time. LLMs with native audio input and output capabilities will likely make it easier to build fast-responding voice bots soon. But for the moment, I think this is the fastest possible approach/tech stack.
  • by firefoxd on 6/27/24, 7:32 AM

    Well that was fast. Kudos, really neat. Speed trumps everything else. I only noticed the robotic voice after I read the comments.

    I worked on an AI for customer service. Our agent took the average response time from 24-48 hours down to mere seconds.

    One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."

    The customer responded "thank you so much" and gave us a perfect CSAT score. Speed trumps everything, even when you make such a horrible mistake.

  • by vessenes on 6/27/24, 11:50 AM

    This is so, so good. I like that it seems to be a teaser app for Cerebrium, if I understand it correctly. It has real killer-app potential. My tests from an iPad ranged from 1400ms down to 400ms reported latency; at the low end, it felt very fluid.

    One thing this speed makes me think is that for some chat workflows you’ll need/get to have kind of a multi-step approach — essentially, a quick response, during which a longer data / info / RAG query can be farmed out, then the informative result picks up.

    Humans work like this; we use lots of filler words as we sort of get going responding to things.

    Right now, most workflows seem to be just one-shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low-latency responses is probably something like: [3s of Llama 8b in your ears] -> query -> [55s of Llama 70b/GPT-4/whatever you want, informed by query] (see the sketch below).
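
    A hypothetical sketch of that hand-off (fast_model, slow_model, and rag_query are made-up stand-ins):

      import asyncio

      async def fast_model(user_input):
          # Stand-in for the small, fast model's instant acknowledgement.
          return "Good question, give me a second..."

      async def rag_query(user_input):
          await asyncio.sleep(2.0)  # stand-in for a slow retrieval round-trip
          return "retrieved context"

      async def slow_model(user_input, context):
          await asyncio.sleep(3.0)  # stand-in for the big model's latency
          return f"Informed answer using {context}."

      async def respond(user_input):
          # Kick off retrieval plus the big model immediately...
          async def informed():
              return await slow_model(user_input, await rag_query(user_input))
          informed_task = asyncio.create_task(informed())

          # ...and cover the wait with the small model's filler response.
          print(await fast_model(user_input))
          print(await informed_task)

      asyncio.run(respond("What's our refund policy?"))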

    Very cool, thank you for sharing this.

  • by luke-stanley on 6/27/24, 9:06 AM

    There's a cross-platform browser VAD module: https://github.com/ricky0123/vad. It's an ONNX port of Silero's VAD network. By cross-platform, I mean it works in Firefox too. It doesn't need a WebRTC session to work, just microphone access, so it's simpler. I'm curious about browsers providing this as a native option too.
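
    For comparison, the underlying Silero network is also easy to run server-side in Python via its documented torch.hub entry point (a minimal sketch; the audio file is a placeholder):

      import torch

      # Load Silero VAD from torch.hub; the browser module linked above
      # wraps an ONNX port of this same network.
      model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
      get_speech_timestamps, _, read_audio, _, _ = utils

      wav = read_audio("utterance.wav", sampling_rate=16000)  # placeholder path
      speech = get_speech_timestamps(wav, model, sampling_rate=16000)
      print(speech)  # [{'start': ..., 'end': ...}, ...] in samples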

    There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.

    GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.

    I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

    I do wonder how needed or optimal a single combined model is for latency and cost optimisation.

    The breakdown provided is interesting.

    I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?

  • by _def on 6/27/24, 10:02 AM

    This was fun to try out. Earlier this week I tried june-va, and the long response time kind of killed the usefulness. Getting fast responses is a great feature; this feels much more like a conversation. Funnily enough, I asked it to tell me a story and it answered with only one sentence at a time, requiring me to say "yes", "aha", "please continue" to get the next line. Then we had the following funny conversation:

    "Oh I think I figured out your secret!"

    "Please tell me"

    "You achieve the short response times by keeping a short context"

    "You're absolutely right"

  • by mdbackman on 6/27/24, 10:43 AM

    Very, very impressive! It's incredibly fast, maybe too fast, but I think that's the point. What's most impressive though is how the VAD and interruptions are tuned. That was, by far, the most natural sounding conversation I've had with an agent. Really excited to try this out once it's available.
  • by az226 on 6/27/24, 8:11 AM

    Your marketing says 500 but your math says 759.
  • by trueforma on 6/27/24, 9:21 PM

    I too am excited about voice inferencing. I wrote my own WebSocket faster-whisper implementation before OpenAI's GPT-4o release. They steamrolled my interview coach concept https://intervu.trueforma.ai and my sales pitch coach https://sales.trueforma.ai. I defaulted to a push-to-talk implementation, as I couldn't get VAD to work reliably. I run it all on a LattePanda :)

    I was looking to implement Groq's hosted Whisper, and I love the idea of having an uncensored Llama 3 on Groq as the LLM, as I'm tired of the boring corporate conversations. I hope to reduce my latency and learn from your examples. Kudos on your efforts. I wish I could try the demo, but it seems to be oversubscribed, as I can't get in to talk to the bot. I'm sure my LattePanda would melt if just 3 people tried to inference at the same time :)
  • by asjir on 6/27/24, 2:56 PM

    Personally, I use https://github.com/foges/whisper-dictation with llama-70b on Groq. I start talking, navigate to the website, and by the time it's loaded and I've picked llama-70b, I've finished talking, so there's zero overhead. I read much faster than I listen, so it works perfectly for me.
  • by geofffox on 6/27/24, 4:55 AM

    I use Firefox... still.
  • by andrewstuart on 6/27/24, 7:00 AM

    Damned impressive.

    Apple's Siri still can't let me have a conversation in which we aren't tripping over each other, pausing, and fumbling, with the whole thing degrading into me hoping to get the barest minimum from it.

  • by realyashnag on 6/27/24, 3:28 PM

    This was scary fast. Neat interface and (almost) indistinguishable from a human over the phone / internet. Kudos @cerebrium.ai.
  • by spuz on 6/27/24, 9:23 AM

    It's not exactly clear: is this a voice-to-voice model or a voice-to-text-to-voice model? OpenAI claim that when it's finally released, their GPT-4o audio model will be a lot faster at conversations because there's no delay converting from audio to text and back to audio again. I'm also looking forward to using voice models for language learning.
  • by etherealG on 7/3/24, 3:20 PM

    Moshi by Kyutai seems to have beaten your approach by about 500ms, and they're going to release it open source.

    https://www.youtube.com/live/hm2IJSKcYvo

    HN discussion here: https://news.ycombinator.com/item?id=40866569

  • by dijit on 6/27/24, 5:50 AM

    I’m genuinely shocked by how conversational this is.

    I think you hit a very important nail on the head here; I feel like that scene in I, Robot where the protagonist talks to the hologram, or in the movie “AI” where the protagonist talks to an encyclopaedia called “Dr. Know”.

  • by andrewmcwatters on 6/27/24, 6:33 PM

    I love it when engineers worth their salt actually do the back-of-the-envelope calculations for latency, etc.

    Tangentially related: I remember, years ago when Stadia and other cloud gaming products were being released, doing such calculations and showing a buddy of mine that even in the best-case scenario you'd always have enough input latency to make casual multiplayer FPS games over cloud gaming services infeasible, or rather uncomfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch-gameplay reaction times.
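
    The envelope math looks roughly like this (every number is an illustrative assumption, not a measurement):

      # Back-of-the-envelope cloud gaming input latency.
      stages_ms = {
          "input capture + client processing": 10,
          "uplink network": 15,
          "server game tick + render": 20,
          "video encode": 10,
          "downlink network": 15,
          "video decode": 10,
          "display scanout": 10,
      }
      print(sum(stages_ms.values()), "ms button-to-photon")  # ~90 ms before
      # any jitter, vs. the local path where the network legs and the video
      # encode/decode simply don't exist.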

    The same math holds up today because of a combination of fundamental limits and state-of-the-art limits.

  • by anonzzzies on 6/27/24, 5:23 AM

    This is pretty amazing; it’s very fast indeed. I don’t really care about the voice response sounding robotic; low latency is more important for whatever I do. And you can interrupt it too. Lovely.
  • by SubiculumCode on 6/27/24, 2:12 PM

    A chatbot that interrupts me even faster. Sorry for the sarcasm. Maybe I'm just slow, but when I'm trying to formulate a question on the spot, I pause a lot. Having the chatbot jump in and interrupt is frustrating. Humans recognize the difference between someone still planning to say something and someone who's finished. I even tried to give it a rule that it shouldn't respond until I said "The End", and of course it couldn't follow that instruction.
  • by mmcclure on 6/27/24, 8:14 AM

    Wow, Kwin, you’ve outdone yourself! The speed makes an even bigger difference than I expected going in.

    Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).

  • by hackerbob on 6/27/24, 7:21 AM

    This is indeed fast! There also seems to be no issue interrupting it while it's speaking. Is this using WebRTC echo cancellation to avoid microphone and speaker audio mix-ups?
  • by c0brac0bra on 6/27/24, 11:26 AM

    I've been developing with Deepgram for a while, and this is one of the coolest demos I've seen with it!

    I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.

    To get the latency reduction of several hundred milliseconds, how much more would it cost for "average" usage?

  • by amluto on 6/28/24, 11:26 AM

    Maybe silly question:

    > jitter buffer [40ms]

    Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.

  • by yjftsjthsd-h on 6/27/24, 4:03 AM

    Dumb question - I see 2 opus encodes and decodes for a total of around 120ms; is opus the fastest option?
  • by yalok on 6/27/24, 11:45 AM

    You may be double-counting the opus encoding/decoding delay - usually you can run it with a 20ms frame, and both the encoder and decoder take less than 1ms of real time for their operation, so it should be ~21ms instead of 30+30ms for one direction.
  • by _DeadFred_ on 6/27/24, 10:35 PM

    This is super cool. Thanks for sharing. And I'm excited it encouraged others to share; I'm looking forward to spending some time this weekend looking at the different ways people in this thread implemented solutions.
  • by jaybrendansmith on 6/27/24, 3:55 AM

    This thing is incredible. It finished a sentence I was saying.
  • by gsjbjt on 6/28/24, 9:41 PM

    That's awesome - can you say anything about what datasets this was trained on? I assume something specifically conversational?
  • by tamimio on 6/29/24, 10:38 PM

    Or we could say the low latency counts as good listening skills! It was fast, but it occasionally interrupted me to answer.
  • by andruby on 6/27/24, 12:32 PM

    This is really good. I'm blown away by how important the speed is.

    And this was from a mobile connection in Europe, with a displayed latency of just over 1s.

  • by aussieguy1234 on 6/27/24, 3:00 AM

    Fast yes, but the voice sounds robotic.
  • by p_frank on 6/27/24, 2:28 PM

    Amazing to see the metrics for each part involved! I've wondered why you couldn't introduce a small sound that plays over the waiting time - like an "hmm" to mask a few hundred ms of the response time. It could be pregenerated (say, 500 different versions) and played 200ms after the user's last input.
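
    A toy sketch of that idea (generate_reply and play are hypothetical stand-ins):

      import asyncio
      import random

      FILLERS = [f"hmm_{i}.opus" for i in range(500)]  # pregenerated clips

      async def generate_reply(user_audio):
          await asyncio.sleep(0.7)  # stand-in for the real pipeline's latency
          return "reply.opus"

      def play(clip):
          print(f"playing {clip}")  # stand-in for actual audio playback

      async def respond(user_audio):
          reply = asyncio.create_task(generate_reply(user_audio))
          await asyncio.sleep(0.2)  # 200 ms after the user's last input...
          if not reply.done():
              play(random.choice(FILLERS))  # ...cover the gap with an "hmm"
          play(await reply)

      asyncio.run(respond("user-utterance"))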
  • by sumedh on 6/27/24, 10:43 AM

    This is very impressive; my kid and I had fun talking about space.
  • by ftth_finland on 6/27/24, 11:15 AM

    This is excellent!

    Perfect comprehension and no problem even with bad accents.

  • by isoprophlex on 6/27/24, 8:52 AM

    Jesus fuck that's fast, and I had no idea speed mattered that much. Incredible. Feels like an entirely different experience than the 5+ seconds of latency with OpenAI.
  • by spark_chicken on 6/27/24, 2:21 PM

    I have tried it - it is really fast! I know making a real-time voice bot with such low latency is not easy. Which LLM did you use, and how large an LLM does it take to make the conversation efficient?
  • by preciousoo on 6/27/24, 5:51 AM

    This is so cool!
  • by Borborygymus on 6/28/24, 10:41 AM

    It /was/ nice and quick. Thanks for putting the demo online. It was also quick to tell me complete nonsense: apparently 7122 is the atomic number of Barium.