by yubozhao on 6/16/22, 6:26 PM with 42 comments
by kroolik on 6/16/22, 7:42 PM
Why? You are doing the inference in the same request, which is synchronous from the perspective of the caller. The request can be memory-intensive or CPU-intensive, and the issue is that you can't efficiently handle all of those workloads on a single machine without being bottlenecked by Python.
I would say the problem is your approach of trying to use the webapp hammer for all the different flavors of nails in your system, using a language that isn't suited for concurrency. What I would do is decouple the validation/interface logic from your models via a queue (see the sketch below). This way you can scale your capacity according to workload and make sure the workload runs on the hardware most relevant to the job.
I have a feeling that throwing a webapp at the problem might not solve your root issue, only delay it.
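Roughly what I mean, as a minimal sketch: the web process only validates and enqueues, and a separate worker pulls jobs and runs the model. I'm using Redis + rq here just as an example broker/worker pair; the endpoint and the inference stub are placeholders, not anything specific to your setup.

    # illustrative only; swap the stub for a real model call
    from fastapi import FastAPI
    from redis import Redis
    from rq import Queue

    app = FastAPI()
    inference_queue = Queue("inference", connection=Redis())

    def run_inference(payload: dict) -> dict:
        # Executed by a separate `rq worker inference` process, which can
        # live on whatever hardware (GPU box, big-memory node) suits the model.
        return {"prediction": sum(payload.get("features", []))}

    @app.post("/predict")
    async def predict(payload: dict):
        # The web process never blocks on the model; it just hands off the job.
        job = inference_queue.enqueue(run_inference, payload)
        return {"job_id": job.id}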
by detroitcoder on 6/16/22, 7:53 PM
All models at scale eventually need to be executed by an async queue processor, which is fundamentally different from a request/response REST API. For simplicity, managing this outside of the process making the web request will help you debug issues when people start asking why they are getting 502 responses. If you are forced to use Python for this, I would always suggest going to celery/huey/dramatiq as an immediate next step after the REST API MVP. I hear Celery is getting better, but I have run into issues over the years, so it pains me to recommend it.
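To make that concrete, here is a rough Celery sketch; the broker URL, task body, and calling convention are placeholders, not a vetted setup:

    # tasks.py -- run the worker with: celery -A tasks worker
    from celery import Celery

    celery_app = Celery(
        "tasks",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )

    @celery_app.task
    def predict(features):
        # Runs in a worker process, isolated from the web server;
        # replace this stub with the real model call.
        return sum(features) / len(features)

    # In the FastAPI handler:
    #   result = predict.delay(features)
    #   return {"task_id": result.id}
    # The client then polls a /results/{task_id} endpoint (or you push via webhook).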
by beckingz on 6/16/22, 7:53 PM
Yup. There's a huge amount of work you need to do to cover the whole ML lifecycle, and FastAPI doesn't support that out of the box the way a full-fledged ML platform does.
But you probably don't actually want a full ML platform, because they're all opinionated, and if you try to fight them it's often worse than just serving the model as an API via FastAPI...
by ttymck on 6/16/22, 8:15 PM
It would've been delightful to see "instantiate a runner in your existing Starlette application". I don't want to instantiate a Bento service. Perhaps I can mount the Bento service on the Starlette application, along the lines of the sketch below?
Apologies if I am still grossly misunderstanding. I tried to look through some of the _internal codebase to see how the Runner is implemented, but the constructor signatures are very complex and the indirection to RunnerMethod had me cross-eyed.
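Something like this is what I was picturing; note that `svc.asgi_app` is my guess at how the service might be exposed as an ASGI app, not a confirmed BentoML API:

    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Mount, Route
    import bentoml

    svc = bentoml.Service("my_model_service")  # existing Bento service

    async def health(request):
        return JSONResponse({"ok": True})

    app = Starlette(routes=[
        Route("/health", health),
        Mount("/model", app=svc.asgi_app),  # hypothetical attribute
    ])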
by anderskaseorg on 6/16/22, 7:21 PM
This confuses me. How is that FastAPI’s fault? Can’t you just asynchronously delegate them to a concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor? What does Starlette provide here that FastAPI doesn’t? If the FastAPI limitations are due to ASGI, shouldn’t Starlette have the same limitations?
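Something along these lines, as a minimal sketch; the predict function is a stand-in for real inference:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    from fastapi import FastAPI

    app = FastAPI()
    pool = ProcessPoolExecutor(max_workers=4)

    def predict(features):
        # CPU-bound work runs in a worker process, so it doesn't block
        # the event loop or fight the server over the GIL.
        return sum(x * x for x in features)

    @app.post("/predict")
    async def predict_endpoint(features: list[float]):
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(pool, predict, features)
        return {"result": result}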
by sgt101 on 6/16/22, 7:16 PM
by isoprophlex on 6/16/22, 8:01 PM
by lmeyerov on 6/16/22, 7:28 PM
For model serving, we were thinking Triton (native vs. Python server), as it targets a tightly scoped problem and is optimized for it: any perf comparison there?
by andrewstuart on 6/16/22, 7:32 PM
by timliu99 on 6/16/22, 6:34 PM