from Hacker News

Ask HN: ML Practitioners, how do you serve “large” models in production?

by dankle on 8/31/23, 5:43 AM with 0 comments

We are about to start serving a large model in a production setting. We have a long history of serving smaller ML models in torch/tf/sklearn and in those cases we typically bundle the model in a docker image along with a fastapi backend to serve it in k8s (GKE in our case). It's been working well for us over the years.
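For context, a minimal sketch of that existing pattern, assuming the weights are baked into the image at build time and loaded once at import; the path, model class, and request shape are placeholders, not our actual service:

```python
# Sketch: model file COPY'd into the docker image, loaded at import, served by FastAPI.
# Assumes /app/model.pt was saved with torch.save(model, ...) and its class is importable here.
import torch
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "/app/model.pt"  # baked into the image at build time

model = torch.load(MODEL_PATH, map_location="cpu")
model.eval()

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Single forward pass; no batching or GPU handling in this sketch.
    with torch.no_grad():
        out = model(torch.tensor([req.inputs]))
    return {"outputs": out.tolist()}
```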

Now that a model can be 10+ GB, or for some LLMs even 100+ GB, we can't package it in a docker image anymore. How are those of you running these models in production serving them? Some options we're looking at include:

1. Model in a storage bucket and a custom fastapi backend, reading the model from the bucket at pod startup (rough sketch after this list)
2. Model on a persistent disk mounted via a PVC, with a custom fastapi backend reading the model from disk at pod startup (faster than reading from a bucket)
3. Install KServe in our k8s cluster and commit to their best practices
4. Vertex AI Endpoints
5. HF Inference Endpoints
6. idk, bento? Other tools we haven't considered?
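Option 1 is close to the current setup, except the download moves from the image build to pod startup. A hedged sketch, assuming GCS via google-cloud-storage, a made-up bucket/object path, and a local emptyDir or PVC mount; in practice the readiness probe should only pass once the load has finished so k8s doesn't route traffic to a pod still pulling 10+ GB:

```python
# Sketch of option 1: image ships without weights; pull them from a bucket at startup.
from contextlib import asynccontextmanager
from pathlib import Path

import torch
from fastapi import FastAPI
from google.cloud import storage

BUCKET = "my-model-bucket"          # hypothetical bucket name
BLOB = "llm/v3/model.pt"            # hypothetical object path
LOCAL = Path("/models/model.pt")    # emptyDir or PVC mount in the pod spec

state: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Download once per pod, then load; the server only accepts traffic after this returns.
    if not LOCAL.exists():
        LOCAL.parent.mkdir(parents=True, exist_ok=True)
        storage.Client().bucket(BUCKET).blob(BLOB).download_to_filename(str(LOCAL))
    state["model"] = torch.load(LOCAL, map_location="cpu")
    state["model"].eval()
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

# The /predict endpoint is the same as in the sketch above, just reading from `state["model"]`.

@app.get("/healthz")
def healthz():
    # Suitable readiness-probe target; only reachable after the startup download/load completes.
    return {"ready": "model" in state}
```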

So how do you folks do it? What has worked well, and what are the pitfalls when going from small ≈2 GB models to 10+ GB models?