by dankle on 8/31/23, 5:43 AM with 0 comments
Now that a model is 10+ GB, or for some LLMs even 100+ GB, we can't package it in a Docker image anymore. For those of you running these models in production, how are you serving them? Some options we're looking at:
1. Model in a storage bucket, custom FastAPI backend, read the model from the bucket at pod startup (rough sketch after this list)
2. Model on a persistent disk mounted via a PVC, custom FastAPI backend, read the model from disk at pod startup (faster than reading from a bucket)
3. Install KServe in our k8s cluster and commit to their best practices
4. Vertex AI Endpoints
5. HF Inference Endpoints
6. idk, Bento?

Are there other tools we haven't considered?
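For option 1, roughly what we have in mind is something like this (a minimal sketch only, assuming GCS and a Hugging Face-style model directory; MODEL_BUCKET, MODEL_PREFIX, and the /generate endpoint are placeholders, not anything we've settled on):

    import os
    from contextlib import asynccontextmanager

    from fastapi import FastAPI
    from google.cloud import storage
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_BUCKET = os.environ["MODEL_BUCKET"]  # placeholder bucket name
    MODEL_PREFIX = os.environ["MODEL_PREFIX"]  # placeholder path prefix, e.g. "llama-13b/"
    LOCAL_DIR = "/models/current"

    state = {}

    def download_model():
        # Pull every file under the prefix into a local dir at pod startup.
        client = storage.Client()
        bucket = client.bucket(MODEL_BUCKET)
        os.makedirs(LOCAL_DIR, exist_ok=True)
        for blob in bucket.list_blobs(prefix=MODEL_PREFIX):
            if blob.name.endswith("/"):  # skip directory placeholder objects
                continue
            dest = os.path.join(LOCAL_DIR, os.path.relpath(blob.name, MODEL_PREFIX))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            blob.download_to_filename(dest)

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Startup: fetch the weights from the bucket, then load them once per pod.
        download_model()
        state["tokenizer"] = AutoTokenizer.from_pretrained(LOCAL_DIR)
        state["model"] = AutoModelForCausalLM.from_pretrained(LOCAL_DIR)
        yield
        state.clear()

    app = FastAPI(lifespan=lifespan)

    @app.post("/generate")
    def generate(prompt: str):
        inputs = state["tokenizer"](prompt, return_tensors="pt")
        output = state["model"].generate(**inputs, max_new_tokens=128)
        return {"text": state["tokenizer"].decode(output[0], skip_special_tokens=True)}

Option 2 would be the same app minus the download step, with LOCAL_DIR pointing at the PVC mount, which is why we expect it to start faster.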
So how do you folks do it? What has worked well, and what are the pitfalls when going from small ≈2 GB models to 10+ GB models?