from Hacker News

OpenAI DevDay 2024 live blog

by plurby on 10/1/24, 5:45 PM with 97 comments

  • by qwertox on 10/1/24, 6:28 PM

    > The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

    > Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.

    -

    This sounds really interesting, and I see great use cases for it. However, I'm wondering if the API provides a text transcription of both the input and output, so that I can store the data directly in a database without needing to transcribe the audio separately.

    -

    Edit: Apparently it does.

    It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I guess a couple of them in real-time)

    and `response.done` [1] with the response text.

    [0] https://platform.openai.com/docs/api-reference/realtime-serv...

    [1] https://platform.openai.com/docs/api-reference/realtime-serv...
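    The transcription flow described above can be sketched as a small event dispatcher. The two event names come from the linked docs; the payload shapes and the handler itself are assumptions for illustration, not verified against the API:

```python
import json

def handle_realtime_event(message: str):
    """Route a raw Realtime API WebSocket message to a transcript string.

    Returns the user's input transcript, the assistant's output text,
    or None for event types we don't store.
    """
    event = json.loads(message)
    etype = event.get("type")

    if etype == "conversation.item.input_audio_transcription.completed":
        # Transcript of the user's audio input, ready to persist.
        return event.get("transcript")

    if etype == "response.done":
        # Assumed shape: response.output[].content[] items carrying either
        # a "text" field (text content) or a "transcript" field (audio).
        parts = []
        for item in event.get("response", {}).get("output", []):
            for content in item.get("content", []):
                text = content.get("text") or content.get("transcript")
                if text:
                    parts.append(text)
        return " ".join(parts) or None

    return None
```

    In a real client these messages would arrive over the persistent WebSocket connection; both returned strings could then be written to the database directly.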

  • by siva7 on 10/1/24, 7:03 PM

    I've never seen a company consistently ship groundbreaking features at this speed. I really wonder how their teams work. It's unprecedented in my 15 years in software.
  • by ponty_rick on 10/1/24, 6:46 PM

    > 11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.

    Why not use an array of key value pairs if you want to maintain ordering without breaking traditional JSON rules?

    [ {"key1": "value1"}, {"key2": "value2"} ]
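    For what it's worth, most modern parsers do preserve key order in practice (Python dicts have guaranteed insertion order since 3.7), but the array-of-pairs form makes the order part of the data itself rather than a parser detail. A quick illustration:

```python
import json

# A plain object: key order survives here only because Python's json
# module preserves insertion order -- the JSON spec doesn't guarantee it.
obj = json.loads('{"reasoning": "think first", "answer": "42"}')

# An array of single-pair objects: the order is explicit in the data
# model, so any conforming parser must preserve it.
pairs = json.loads('[{"reasoning": "think first"}, {"answer": "42"}]')
ordered_keys = [key for item in pairs for key in item]
print(ordered_keys)  # ['reasoning', 'answer']
```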

  • by serjester on 10/1/24, 6:45 PM

    The eval platform is a game changer.

    It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.

    There's the very real risk of vendor lock-in, but from a quick scan of the docs it seems like a pretty portable implementation.

  • by alach11 on 10/1/24, 9:37 PM

    It's pretty amazing that they made prompt caching automatic. It's rare that a company gives a 50% discount without the customer explicitly requesting it! Of course... they might be retaining some margin, judging by their discount being 50% vs. Anthropic's 90%.
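    To make the gap concrete, here is the arithmetic on cached input tokens, using a hypothetical base price of $2.50 per 1M input tokens (the base rate is an assumption for illustration; only the 50% and 90% discounts come from the comment):

```python
def cached_price(base_per_million: float, discount: float) -> float:
    """Effective $ per 1M cached input tokens after a discount."""
    return base_per_million * (1 - discount)

base = 2.50  # hypothetical base input price, $ per 1M tokens
print(cached_price(base, 0.50))  # 50% off (OpenAI-style automatic caching)
print(cached_price(base, 0.90))  # 90% off (Anthropic-style explicit caching)
# The 90% scheme ends up 5x cheaper on cached tokens at the same base price.
```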
  • by thenameless7741 on 10/1/24, 6:01 PM

    Blog updates:

    - Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/

    - Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...

    - Prompt Caching in the API: https://openai.com/index/api-prompt-caching/

    - Model Distillation in the API: https://openai.com/index/api-model-distillation/

    Docs updates:

    - Realtime API: https://platform.openai.com/docs/guides/realtime

    - Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision

    - Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching

    - Model Distillation: https://platform.openai.com/docs/guides/distillation

    - Evaluating model performance: https://platform.openai.com/docs/guides/evals

    Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396

    - New prompt generator on https://playground.openai.com

    - Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)

    Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582

    - Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except in the EU).

  • by 101008 on 10/1/24, 7:41 PM

    I understand the Realtime API voice novelty, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.

    The two examples shown at DevDay are things I don't really want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in human form. That's why I order my food through an app or WhatsApp, and why I prefer to buy my tickets online. In the rare case that I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it a different way?)

    I hope we don't start seeing apps using conversation as an interface, because it would be really horrible (leaving aside the fact that a lot of people struggle to express themselves, plus different accents, noisy environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).

  • by superdisk on 10/1/24, 6:47 PM

    Holy crud, I figured they would guard this for a long time and I was really salivating to make some stuff with it. The doors are wide open for all sorts of stuff now, Advanced Voice is the first feature since ChatGPT initially came out that really has my jaw on the floor.
  • by minimaxir on 10/1/24, 6:57 PM

    From the Realtime API blog post: https://openai.com/index/introducing-the-realtime-api/

    > Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

    > The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.

    As usual, OpenAI failed to emphasize the real game-changer at their Dev Day: audio output from the standard generation API.

    This has major implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
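    The quoted per-minute figures can be sanity-checked against the token rates; this is back-of-the-envelope arithmetic using only the numbers from the quote:

```python
# Quoted Realtime API audio pricing.
AUDIO_IN_PER_M = 100.0   # $ per 1M audio input tokens
AUDIO_OUT_PER_M = 200.0  # $ per 1M audio output tokens
PER_MIN_IN = 0.06        # quoted $ per minute of audio input
PER_MIN_OUT = 0.24       # quoted $ per minute of audio output

# Tokens per minute implied by each pair of figures.
tokens_per_min_in = PER_MIN_IN / (AUDIO_IN_PER_M / 1_000_000)    # ~600
tokens_per_min_out = PER_MIN_OUT / (AUDIO_OUT_PER_M / 1_000_000) # ~1200

def audio_cost(minutes_in: float, minutes_out: float) -> float:
    """Estimated $ cost of a conversation at the quoted per-minute rates."""
    return minutes_in * PER_MIN_IN + minutes_out * PER_MIN_OUT

# A 15-minute call with 10 minutes of user audio and 5 minutes of replies:
print(round(audio_cost(10, 5), 2))  # 1.8
```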

  • by N_A_T_E on 10/1/24, 8:50 PM

    I just need their API to be faster. 15-30 seconds per request using 4o-mini isn't good enough for responsive applications.
  • by simonw on 10/2/24, 6:31 PM

    For anyone who’s interested, I’ve written up details of how the underlying live blog system works here: https://til.simonwillison.net/django/live-blog
  • by modeless on 10/1/24, 8:27 PM

    I didn't expect an API for advanced voice so soon. That's pretty great. Here's the thing I was really wondering: Audio is $.06/min in, $.24/min out. Can't wait to try some language learning apps built with this. It'll also be fun for controlling robots.
  • by sammyteee on 10/1/24, 7:05 PM

    Loving these live updates, keep em coming! Thanks Simon!
  • by nielsole on 10/1/24, 6:22 PM

    > The first big announcement: a realtime API, providing the ability to use WebSockets to implement voice input and output against their models.

    I guess this is using their "old" turn-based voice system?

  • by cedws on 10/4/24, 10:33 AM

    WebSockets for realtime? WS is TCP based, wouldn’t it be better to use something UDP based if you want to optimise for latency?
  • by og_kalu on 10/1/24, 7:19 PM

    Image output for 4o in the API would be very nice, but I'm not sure if that's at all in the cards.

    Audio output is in the API now, but you lose image input. Why? That's a shame.

  • by jbaudanza on 10/2/24, 1:24 AM

    Interesting choice of a 24kHz sample rate for PCM audio. I wonder if the model was trained on 24kHz audio, rather than the usual 8/16kHz for ML models.
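    For scale: raw PCM bandwidth depends only on sample rate, channel count, and bit depth. A quick sketch (16-bit mono is an assumption; the comment only mentions the 24kHz sample rate):

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int = 1,
                         bits_per_sample: int = 16) -> int:
    """Byte rate of uncompressed PCM: rate * channels * bytes per sample."""
    return sample_rate_hz * channels * bits_per_sample // 8

print(pcm_bytes_per_second(24_000))  # 48000 bytes/s, ~2.9 MB per minute
print(pcm_bytes_per_second(16_000))  # 32000 bytes/s at the more common 16kHz
```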
  • by hidelooktropic on 10/1/24, 6:02 PM

    Any word on increased weekly caps on o1 usage?
  • by lysecret on 10/1/24, 7:08 PM

    Using structured outputs for generative UI is such a cool idea. Does anyone know of some cool web demos related to this?
  • by bigcat12345678 on 10/1/24, 5:56 PM

    Seems mostly standard items so far.