by wonderfuly on 2/3/25, 3:30 PM
I've noticed this brings new deployment challenges that not many people are talking about: a streaming response can last several minutes (especially with reasoning models), which is quite different from traditional API requests that complete in a few seconds. At the same time, we don't want in-flight requests to be interrupted when deploying a new version.
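To frame the question: the baseline I'm considering is plain graceful shutdown with a long drain window, roughly like the Go sketch below. Everything in it is illustrative, not what we actually run: the /stream handler just stands in for relaying model tokens, and the 10-minute drain deadline is a guess at the longest stream.

    package main

    import (
        "context"
        "fmt"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    // streamHandler is a stand-in for an endpoint that relays model tokens.
    func streamHandler(w http.ResponseWriter, r *http.Request) {
        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }
        w.Header().Set("Content-Type", "text/event-stream")
        for i := 0; i < 300; i++ { // pretend a response streams for ~5 minutes
            select {
            case <-r.Context().Done(): // client disconnected
                return
            case <-time.After(time.Second):
                fmt.Fprintf(w, "data: chunk %d\n\n", i)
                flusher.Flush()
            }
        }
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/stream", streamHandler)
        srv := &http.Server{Addr: ":8080", Handler: mux}

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("listen: %v", err)
            }
        }()

        // Block until the deploy tooling asks us to stop (SIGTERM on most platforms).
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
        <-stop

        // Shutdown stops accepting new connections immediately but waits for
        // in-flight streams to finish, up to the deadline (10 minutes is an
        // assumed upper bound on response length).
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("drain deadline hit, some streams were cut: %v", err)
        }
    }

The open question for me is whether people just crank the grace period way up (e.g. terminationGracePeriodSeconds on Kubernetes) so the old version can drain its slowest reasoning stream while new traffic goes to the new version, or whether there's something smarter.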
How did you guys do it?