by lewq on 12/24/23, 6:26 PM with 1 comments
It includes a GPU scheduler that does fine-grained GPU memory scheduling: Kubernetes can only allocate whole GPUs, whereas we schedule per GB of GPU memory, packing both inference and fine-tuning jobs onto the same fleet. That lets us fit model instances into GPU memory so we can trade off user-facing latency against VRAM utilization.
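The core idea of per-GB packing can be sketched as a best-fit bin-packing pass. This is just an illustrative sketch, not Helix's actual scheduler; the `GPU` and `schedule` names and the best-fit heuristic are my assumptions:

```python
from dataclasses import dataclass

@dataclass
class GPU:
    # hypothetical model of one GPU in the fleet, tracked per GB
    name: str
    total_gb: int
    used_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.total_gb - self.used_gb

def schedule(jobs, gpus):
    """Place each job (name, needed_gb) on the fitting GPU with the
    least free memory (best-fit, largest jobs first), so big
    contiguous chunks stay free for large model instances."""
    placements = {}
    for name, needed_gb in sorted(jobs, key=lambda j: -j[1]):
        candidates = [g for g in gpus if g.free_gb >= needed_gb]
        if not candidates:
            placements[name] = None  # no capacity: queue or scale out
            continue
        best = min(candidates, key=lambda g: g.free_gb)
        best.used_gb += needed_gb
        placements[name] = best.name
    return placements
```

With two 24 GB GPUs, a 16 GB inference instance and three 8 GB fine-tune jobs all fit on the same fleet, which whole-GPU allocation couldn't do.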
It's a pretty simple stack: a control plane plus a fat container that runs anywhere you can get hold of a GPU (e.g. runpod).
Architecture: https://docs.helix.ml/docs/architecture
Demo walkthrough showing runner dashboard: https://docs.helix.ml/docs/overview
Run it yourself: https://docs.helix.ml/docs/controlplane
Roast me!