by plurby on 10/1/24, 5:45 PM with 97 comments
by qwertox on 10/1/24, 6:28 PM
> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling(opens in a new window), which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.
-
This sounds really interesting, and I see great use cases for it. However, I'm wondering whether the API provides a text transcription of both the input and the output, so that I can store the data directly in a database without needing to transcribe the audio separately.
-
Edit: Apparently it does.
It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (presumably several of them, in real time),
and `response.done` [1] with the response text.
[0] https://platform.openai.com/docs/api-reference/realtime-serv...
[1] https://platform.openai.com/docs/api-reference/realtime-serv...
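The two event types mentioned above can be filtered out of the event stream and written to a database. A minimal sketch of that filtering step follows; the event type names come from the linked docs, but the exact payload field names (`transcript`, `response.output[].content[]`, etc.) are assumptions that should be checked against the current API reference.

```python
import json

def extract_transcripts(raw_events):
    """Collect user input and model output text from serialized Realtime API events.

    raw_events: iterable of JSON strings as received over the WebSocket.
    Returns (input_transcripts, output_transcripts).
    """
    inputs, outputs = [], []
    for raw in raw_events:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "conversation.item.input_audio_transcription.completed":
            # Transcription of what the user said (field name assumed).
            inputs.append(event.get("transcript", ""))
        elif kind == "response.done":
            # The completed response; pull any text or audio transcript parts
            # (payload shape assumed from the docs, verify before relying on it).
            for item in event.get("response", {}).get("output", []):
                for part in item.get("content", []):
                    text = part.get("transcript") or part.get("text")
                    if text:
                        outputs.append(text)
    return inputs, outputs
```

In a real client this function would sit inside the WebSocket receive loop, with the returned strings inserted into the database alongside a conversation ID.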
by siva7 on 10/1/24, 7:03 PM
by ponty_rick on 10/1/24, 6:46 PM
Why not use an array of key-value pairs if you want to maintain ordering without breaking traditional JSON rules?
[ {key1:value1}, {key2:value2} ]
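The shape suggested above parses under any standard JSON implementation, and flattening it back into ordered pairs is a one-liner loop. A small sketch (the `to_pairs` helper name is my own):

```python
import json

def to_pairs(array_of_objects):
    """Flatten [{k: v}, ...] into an ordered list of (key, value) tuples,
    preserving the array order regardless of how the parser treats objects."""
    pairs = []
    for obj in array_of_objects:
        pairs.extend(obj.items())
    return pairs

doc = '[{"key1": "value1"}, {"key2": "value2"}]'
pairs = to_pairs(json.loads(doc))
# pairs == [("key1", "value1"), ("key2", "value2")]
```

The trade-off is a clunkier shape for consumers, since plain key lookup now requires scanning the array (or building a dict, which discards the ordering guarantee you wanted in the first place).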
by serjester on 10/1/24, 6:45 PM
It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried around five YC startups and I don't think anyone has really solved this.
There's the very real risk of vendor lock-in, but quickly scanning the docs, it seems like a pretty portable implementation.
by alach11 on 10/1/24, 9:37 PM
by thenameless7741 on 10/1/24, 6:01 PM
- Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/
- Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...
- Prompt Caching in the API: https://openai.com/index/api-prompt-caching/
- Model Distillation in the API: https://openai.com/index/api-model-distillation/
Docs updates:
- Realtime API: https://platform.openai.com/docs/guides/realtime
- Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision
- Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching
- Model Distillation: https://platform.openai.com/docs/guides/distillation
- Evaluating model performance: https://platform.openai.com/docs/guides/evals
Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396
- New prompt generator on https://playground.openai.com
- Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)
Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582
- Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except in the EU).
by 101008 on 10/1/24, 7:41 PM
The two examples shown at DevDay are things I don't really want to do in the future. I don't want to talk to anybody, and I don't want to wait for an answer delivered in human form. That's why I order my food through an app or WhatsApp, and why I prefer to buy my tickets online. On the rare occasions I call to order food, it's because I have an unusual question or request (can I pick it up in X minutes? Can you prepare it a different way?)
I hope we don't start seeing apps using conversation as an interface, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, have different accents, are in noisy environments, etc.), while clicking or typing works almost the same for everyone (or at least is much more standardized than talking).
by superdisk on 10/1/24, 6:47 PM
by minimaxir on 10/1/24, 6:57 PM
> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
> The Realtime API uses both text tokens and audio tokens. Text input is priced at $5 per 1M tokens and text output at $20 per 1M tokens. Audio input is priced at $100 per 1M tokens and audio output at $200 per 1M tokens. This works out to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be priced the same.
As usual, OpenAI failed to emphasize the real game-changer at their Dev Day: audio output from the standard generation API.
This has severe implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
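The quoted per-token and per-minute audio prices can be cross-checked with quick arithmetic; the implied tokens-per-minute figures below are inferred from the stated prices, not from any official spec:

```python
# Stated prices from the quote above.
AUDIO_IN_PER_1M = 100.0    # USD per 1M audio input tokens
AUDIO_OUT_PER_1M = 200.0   # USD per 1M audio output tokens
COST_IN_PER_MIN = 0.06     # USD per minute of audio input
COST_OUT_PER_MIN = 0.24    # USD per minute of audio output

# Implied audio token throughput per minute of speech:
tokens_in_per_min = COST_IN_PER_MIN / AUDIO_IN_PER_1M * 1_000_000    # 600.0
tokens_out_per_min = COST_OUT_PER_MIN / AUDIO_OUT_PER_1M * 1_000_000  # 1200.0

def call_cost(minutes_in, minutes_out):
    """Approximate USD cost of a voice session, counting audio tokens only."""
    return minutes_in * COST_IN_PER_MIN + minutes_out * COST_OUT_PER_MIN

# A 10-minute conversation with equal talk time on both sides:
# call_cost(5, 5) == 5 * 0.06 + 5 * 0.24 == 1.50
```

So a realistic voice session runs on the order of a few cents to a couple of dollars, with output audio dominating the bill at four times the input rate.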
by N_A_T_E on 10/1/24, 8:50 PM
by simonw on 10/2/24, 6:31 PM
by modeless on 10/1/24, 8:27 PM
by sammyteee on 10/1/24, 7:05 PM
by nielsole on 10/1/24, 6:22 PM
I guess this is using their "old" turn-based voice system?
by cedws on 10/4/24, 10:33 AM
by og_kalu on 10/1/24, 7:19 PM
Audio output is in the API now, but you lose image input. Why? That's a shame.
by jbaudanza on 10/2/24, 1:24 AM
by hidelooktropic on 10/1/24, 6:02 PM
by lysecret on 10/1/24, 7:08 PM
by bigcat12345678 on 10/1/24, 5:56 PM