from Hacker News

Show HN: An open source framework for voice assistants

by kwindla on 5/13/24, 5:21 PM with 39 comments

I've been obsessed for the past ~year with the possibilities of talking to LLMs. I built a bunch of one-off prototypes, shared code on X, started a Meetup group in SF, and co-hosted a big hackathon. It turns out that there are a few low-level problems that everybody building conversational/real-time AI needs to solve on the way to building/shipping something that works well: low-latency media transport, echo cancellation, voice activity detection, phrase endpointing, pipelining data between models/services, handling voice interruptions, swapping out different models/services.
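
For a concrete picture of the "pipelining data between models/services" problem, here is a minimal sketch of the general pattern using plain asyncio queues. This is not Pipecat's actual API; the stage names and placeholder calls are illustrative assumptions only.

    import asyncio

    # Hypothetical sketch of a frame pipeline (not Pipecat's API): each stage
    # reads from an input queue, does its work, and pushes results downstream.

    async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue):
        while True:
            audio = await audio_in.get()                      # raw audio from the transport
            await text_out.put(f"<transcript of {len(audio)} bytes>")  # placeholder STT call

    async def llm_stage(text_in: asyncio.Queue, reply_out: asyncio.Queue):
        while True:
            user_text = await text_in.get()
            await reply_out.put(f"echo: {user_text}")         # placeholder LLM call

    async def tts_stage(reply_in: asyncio.Queue, audio_out: asyncio.Queue):
        while True:
            reply = await reply_in.get()
            await audio_out.put(reply.encode())               # placeholder TTS call

    async def main():
        q_mic, q_text, q_reply, q_speaker = (asyncio.Queue() for _ in range(4))
        stages = [
            asyncio.create_task(stt_stage(q_mic, q_text)),
            asyncio.create_task(llm_stage(q_text, q_reply)),
            asyncio.create_task(tts_stage(q_reply, q_speaker)),
        ]
        await q_mic.put(b"\x00" * 320)                        # one fake 10 ms audio frame
        print(await q_speaker.get())                          # reply audio ready to send back
        for s in stages:
            s.cancel()

    asyncio.run(main())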

On the theory that something like a LlamaIndex or LangChain for real-time/conversational AI would be useful, a few of us started working on a Python library for voice (and multimodal) AI assistants/agents.

So ... Pipecat: a framework for building things like personal coaches, meeting assistants, story-telling toys for kids, customer support bots, virtual friends, and snarky social bots.

Most of the core contributors to Pipecat so far work together at our day jobs. This has been a kind of "20% time" thing at our company. But we're serious about welcoming all contributions. We want Pipecat to support any and all models, services, transport layers, and infrastructure tooling. If you're interested in this stuff, please check it out and let us know what you think. Submit PRs. Become a maintainer. Join the Discord. Post cool stuff. Post funny stuff when your voice agent goes completely off the rails (as mine sometimes do).

  • by awenix on 5/13/24, 6:59 PM

    Nice to see an open source implementation. I have been seeing many startups get into this space, like https://www.retellai.com/, https://fixie.ai/, etc. They always end up needing speech-to-speech models (the current approach seems to be speech → text → text → speech, with multiple agents: one listening and one speaking). Excited to see how this plays with the recently announced GPT-4o.
  • by ilaksh on 5/13/24, 6:12 PM

    This is great, but we really need an open source audio-to-audio model like the one they demoed. Does anyone know of anything like that?

    Edit: someone found one: https://news.ycombinator.com/item?id=40346992

  • by johnmaguire on 5/13/24, 6:17 PM

    Siri came out in October 2011. Amazon Alexa made its debut in November 2014. Google Assistant's voice-activated speakers were released in May 2016.

    From what I can tell, Siri is still a dumpster fire that nobody is willing to use. And I have no personal experience with Alexa, so I can't speak to it. But I do have a few Google Home speakers and an Android phone, and I have seen no major improvements in years. In fact, it has gotten worse - for example, you can no longer add items directly to AnyList[0], only Google Keep.

    Or, as an incredibly simple example of something I thought we'd get a long time ago, it's still unable to interpret two-part requests, e.g. "please repeat that but louder," or "please turn off the kitchen and dining room lights."

    I find voice assistants very useful - especially when driving, lying in bed, cooking, or when I'm otherwise preoccupied. Yet they have stagnated almost since their debut. I can only imagine nobody has found a viable way to monetize them.

    What will it take to get a better voice assistant for consumers? Willow[1] doesn't seem to have taken off.

    [0] https://help.anylist.com/articles/google-assistant-overview/

    [1] https://heywillow.io/

    edit: I realize I hijacked your thread to dump something that's been on my mind lately. Pipecat looks really cool, and I hope it takes off! I hope to get some time to experiment this weekend.

  • by userhacker on 5/14/24, 3:59 AM

    Just made https://feycher.com, which is similar but has real-time lip syncing as well. Let me know if you are interested and we can chat.
  • by xan_ps007 on 5/13/24, 7:24 PM

    We're also building Bolna, an open source voice orchestration framework: https://github.com/bolna-ai/bolna
  • by russ on 5/13/24, 7:49 PM

    LiveKit Agents, which OpenAI uses in voice mode, is also open source:

    https://github.com/livekit/agents

  • by orliesaurus on 5/14/24, 3:11 AM

    The whole VAD thing is very interesting; keen to learn more about how it works, especially with multiple speakers!
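
    (For background, frame-level VAD libraries such as py-webrtcvad classify short fixed-size chunks of PCM audio as speech or non-speech; endpointing and multi-speaker handling are layered on top of that. A minimal sketch, assuming 16 kHz mono 16-bit audio and the py-webrtcvad package:)

      import webrtcvad  # pip install webrtcvad

      # Minimal frame-level VAD sketch. webrtcvad accepts 10/20/30 ms frames of
      # 16-bit mono PCM at 8, 16, 32, or 48 kHz and labels each frame speech or not.
      SAMPLE_RATE = 16000
      FRAME_MS = 30
      FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 480 samples * 2 bytes = 960

      vad = webrtcvad.Vad(2)  # aggressiveness 0 (permissive) .. 3 (strict)

      def speech_flags(pcm: bytes):
          """Yield one True/False per 30 ms frame: does this frame contain speech?"""
          for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
              yield vad.is_speech(pcm[start:start + FRAME_BYTES], SAMPLE_RATE)

      # One second of digital silence should come out as all non-speech frames.
      print(list(speech_flags(b"\x00" * SAMPLE_RATE * 2)))
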
  • by canadiantim on 5/13/24, 6:02 PM

    Very cool, great work! I can def see myself using this when I start building in that direction.
  • by 35mm on 5/14/24, 2:30 PM

    How would I go about using this to live translate phone calls?
  • by bamazizi on 5/13/24, 6:10 PM

    I wonder how the just-announced "GPT-4o" with real-time voice impacts projects like this?

    The demo on real-time multi language translation conversation blew me away!