by ridgeflex on 1/10/21, 5:45 AM with 0 comments
My current workflow is to launch each deep learning experiment in its own tmux session on a machine with multiple GPUs. However, I often run into these issues:
1. Some experiments fail due to run-time errors, and tmux allows them to fail silently
2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment
3. If many GPUs are close to full, I have to fall back to running experiments sequentially, waiting until experiment_i finishes before starting experiment_i+1
4. When running different experiments, I have to manually estimate how much GPU memory each experiment will consume before I can deploy it onto a GPU (a rough memory check is sketched after this list)
5. When doing a particularly tedious task (e.g. a hyper-parameter search), there are often on the order of a hundred experiments, which becomes extremely difficult to maintain manually with tmux
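To make point 4 concrete, the kind of check I mean is roughly the following (a minimal Python sketch; it assumes nvidia-smi is on the PATH, and free memory is only a crude proxy for what an experiment will actually need):

    import subprocess

    def free_gpu_memory_mib():
        # Free memory per GPU in MiB, as reported by nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [int(line) for line in out.strip().splitlines()]

    print(free_gpu_memory_mib())  # e.g. [10240, 512, 8192]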
Ideally, a solution for this workflow would be a tool that could 1) profile the memory consumption of a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs as needed, and 4) track progress and send notifications for all experiments.
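To sketch what I mean, here is a toy, single-machine version of points 2-4 (Python; the experiment names, commands, and memory estimates are hypothetical, and it assumes one job per GPU and that CUDA_VISIBLE_DEVICES is enough to pin a job to a GPU):

    import os
    import subprocess
    import time
    from collections import deque

    # Hypothetical queue of (name, shell command, rough memory need in MiB).
    QUEUE = deque([
        ("lr1e-3", "python train.py --lr 1e-3", 6000),
        ("lr1e-4", "python train.py --lr 1e-4", 6000),
    ])
    MAX_RETRIES = 2

    def free_mib():
        # Free memory per GPU in MiB, straight from nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        return [int(x) for x in out.strip().splitlines()]

    running = {}   # gpu_id -> (name, cmd, need_mib, Popen handle)
    retries = {}   # name -> how many times it has been re-queued

    while QUEUE or running:
        # Reap finished jobs; re-queue failures (runtime errors, OOM, ...).
        for gpu, (name, cmd, need, proc) in list(running.items()):
            if proc.poll() is None:
                continue
            del running[gpu]
            if proc.returncode != 0 and retries.get(name, 0) < MAX_RETRIES:
                retries[name] = retries.get(name, 0) + 1
                print(f"[retry {retries[name]}] {name} exited with {proc.returncode}")
                QUEUE.append((name, cmd, need))
            else:
                print(f"[done] {name} (exit code {proc.returncode})")

        # Start queued jobs on any idle GPU with enough free memory.
        for gpu, free in enumerate(free_mib()):
            if not QUEUE:
                break
            if gpu in running:
                continue
            name, cmd, need = QUEUE[0]
            if free >= need:
                QUEUE.popleft()
                env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
                print(f"[start] {name} on GPU {gpu}")
                running[gpu] = (name, cmd, need,
                                subprocess.Popen(cmd, shell=True, env=env))

        time.sleep(30)

Something like this covers queuing and retries, but it still leaves out real memory profiling, multi-node support, and notifications, which is why I'm looking for an existing tool rather than growing my own script.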
I know of tools like PyTorch Lightning (which only works with PyTorch and requires a significant code restructure) and Weights & Biases (which only covers experiment tracking/logging), but I have yet to find something lightweight and flexible enough to handle all of these requirements.
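(For reference, the Weights & Biases usage I'm describing is just its logging API, along the lines of this sketch; the project name and metric are made up:)

    import wandb

    # Hypothetical project/metric names; wandb.init and wandb.log are the
    # standard W&B calls for tracking a run.
    wandb.init(project="my-experiments", config={"lr": 1e-3})
    for step in range(100):
        loss = 1.0 / (step + 1)  # stand-in for a real training loss
        wandb.log({"loss": loss})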
What's the best way to manage experiments like this?
P.S.: I'm a researcher, not an engineer, so I would really prefer solutions that don't make it hard to keep focusing on the research and development side of things.