by ridgeflex on 1/10/21, 5:45 AM with 0 comments
My current workflow is to launch each deep learning experiment in its own tmux session on a machine with multiple GPUs. However, I often run into these issues:
1. Some experiments fail due to run-time errors, and tmux allows them to fail silently
2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment
3. If many GPUs are close to full, I have to fall back to running experiments sequentially, waiting until experiment_i finishes before starting experiment_i+1
4. When running different experiments, I have to manually estimate how much GPU memory each experiment will consume before I can deploy it onto a GPU (a rough memory check is sketched after this list)
5. When doing a particularly tedious task (e.g. a hyper-parameter search), there are often on the order of a hundred experiments, which becomes extremely difficult to maintain manually with tmux
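To make point 4 concrete, the kind of check I mean is roughly the following (a minimal Python sketch; it assumes nvidia-smi is on the PATH, and free memory is only a crude proxy for what an experiment will actually need):

    import subprocess

    def free_gpu_memory_mib():
        # Free memory per GPU in MiB, as reported by nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [int(line) for line in out.strip().splitlines()]

    print(free_gpu_memory_mib())  # e.g. [10240, 512, 8192]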
Ideally, a solution for this workflow would be a tool that could 1) profile the memory consumption of a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs as needed, and 4) track progress and send notifications for all experiments.
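To sketch what I mean, here is a toy, single-machine version of points 2-4 (Python; the experiment names, commands, and memory estimates are hypothetical, and it assumes one job per GPU and that CUDA_VISIBLE_DEVICES is enough to pin a job to a GPU):

    import os
    import subprocess
    import time
    from collections import deque

    # Hypothetical queue of (name, shell command, rough memory need in MiB).
    QUEUE = deque([
        ("lr1e-3", "python train.py --lr 1e-3", 6000),
        ("lr1e-4", "python train.py --lr 1e-4", 6000),
    ])
    MAX_RETRIES = 2

    def free_mib():
        # Free memory per GPU in MiB, straight from nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        return [int(x) for x in out.strip().splitlines()]

    running = {}   # gpu_id -> (name, cmd, need_mib, Popen handle)
    retries = {}   # name -> how many times it has been re-queued

    while QUEUE or running:
        # Reap finished jobs; re-queue failures (runtime errors, OOM, ...).
        for gpu, (name, cmd, need, proc) in list(running.items()):
            if proc.poll() is None:
                continue
            del running[gpu]
            if proc.returncode != 0 and retries.get(name, 0) < MAX_RETRIES:
                retries[name] = retries.get(name, 0) + 1
                print(f"[retry {retries[name]}] {name} exited with {proc.returncode}")
                QUEUE.append((name, cmd, need))
            else:
                print(f"[done] {name} (exit code {proc.returncode})")

        # Start queued jobs on any idle GPU with enough free memory.
        for gpu, free in enumerate(free_mib()):
            if not QUEUE:
                break
            if gpu in running:
                continue
            name, cmd, need = QUEUE[0]
            if free >= need:
                QUEUE.popleft()
                env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
                print(f"[start] {name} on GPU {gpu}")
                running[gpu] = (name, cmd, need,
                                subprocess.Popen(cmd, shell=True, env=env))

        time.sleep(30)

Something like this covers queuing and retries, but it still leaves out real memory profiling, multi-node support, and notifications, which is why I'm looking for an existing tool rather than growing my own script.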
I know of tools like PyTorch Lightning (which only works with PyTorch and requires a significant code restructure) and Weights & Biases (which only covers experiment tracking/logging), but I have yet to find something lightweight and flexible enough to handle all of these requirements.
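(For reference, the Weights & Biases usage I'm describing is just its logging API, along the lines of this sketch; the project name and metric are made up:)

    import wandb

    # Hypothetical project/metric names; wandb.init and wandb.log are the
    # standard W&B calls for tracking a run.
    wandb.init(project="my-experiments", config={"lr": 1e-3})
    for step in range(100):
        loss = 1.0 / (step + 1)  # stand-in for a real training loss
        wandb.log({"loss": loss})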
What's the best way to manage experiments like this?
P.S.: I'm a researcher, not an engineer, so I would really prefer solutions that don't make it hard to keep focusing on the research and development side of things.