from Hacker News

Ask HN: Best practice for deploying LLM API with streaming

by wonderfuly on 2/3/25, 3:30 PM with 0 comments

In LLM applications, a common pattern is: the browser sends a request to the application's backend API, the backend calls the LLM provider's API (e.g. OpenAI), and the response is streamed back to the browser.
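For concreteness, the backend piece I mean looks roughly like this; a minimal sketch assuming FastAPI and the official OpenAI Python SDK, where the /chat route, the payload shape, and the model name are just placeholders:

    # Minimal streaming proxy sketch (assumes FastAPI + the openai 1.x SDK).
    # The route, payload shape, and model are illustrative placeholders.
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from openai import OpenAI

    app = FastAPI()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    @app.post("/chat")
    async def chat(payload: dict):
        prompt = payload.get("prompt", "")

        def token_stream():
            # Forward the upstream stream to the browser chunk by chunk.
            stream = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                stream=True,
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content
                if delta:
                    yield delta

        return StreamingResponse(token_stream(), media_type="text/plain")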

I've noticed this brings new deployment challenges that not many people are talking about: a streaming response can stay open for several minutes (especially with reasoning models), which is quite different from traditional API requests that complete in just a few seconds. At the same time, we don't want in-flight requests to be interrupted when deploying a new version.
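The kind of thing I'm imagining is draining on SIGTERM: stop routing new traffic to the old version, let in-flight streams run to completion behind a generous deadline, and only then shut it down. A framework-agnostic sketch, where the active_streams counter, readiness flag, and deadline are all illustrative:

    # Drain-on-SIGTERM sketch; names and values are illustrative and not tied
    # to any particular framework or orchestrator.
    import asyncio
    import signal

    active_streams = 0            # incremented/decremented around each streaming response
    ready = True                  # what the load balancer's readiness check reads
    DRAIN_DEADLINE_SECONDS = 600  # longer than the longest expected stream


    async def drain_and_exit() -> None:
        global ready
        ready = False  # readiness check starts failing; no new traffic arrives
        loop = asyncio.get_running_loop()
        deadline = loop.time() + DRAIN_DEADLINE_SECONDS
        while active_streams > 0 and loop.time() < deadline:
            await asyncio.sleep(1)  # let ongoing streams finish
        # only now does the old version actually exit


    def install_sigterm_handler(loop: asyncio.AbstractEventLoop) -> None:
        loop.add_signal_handler(
            signal.SIGTERM,
            lambda: asyncio.ensure_future(drain_and_exit()),
        )

But that still leaves questions about how long the drain window should be and what the load balancer / orchestrator side should look like.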

How did you guys do it?