by ridgeflex on 1/10/21, 8:23 PM with 1 comment
However, I often run into these issues:
1. Some experiments fail due to run-time errors and tmux allows them to fail silently
2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment
3. If many GPUs are close to full, I have to fall back to running experiments sequentially, waiting for experiment_i to finish before starting experiment_i+1
4. When running different experiments, I have to manually estimate how much GPU memory each experiment will consume before I can deploy it onto one of the GPUs (a quick profiling sketch follows this list)
5. For a particularly tedious task (e.g. a hyper-parameter search) there are often on the order of a hundred experiments, which becomes extremely difficult to maintain by hand in tmux
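For point 4, a lighter-weight alternative to eyeballing the number is to run a handful of training steps and read back the peak allocation from PyTorch's allocator. The toy model, batch size, and step count below are placeholders for the real training step, and the reported figure excludes the CUDA context overhead, so treat it as a lower bound:

    import torch

    # Hypothetical toy model and batch; substitute the real training step here.
    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.cuda.reset_peak_memory_stats()
    for _ in range(20):  # a few steps are usually enough to reach steady state
        x = torch.randn(256, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Peak bytes held by PyTorch's caching allocator during those steps.
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")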
Ideally, a tool for this workflow would 1) profile memory consumption for a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs if needed, and 4) send notifications and keep track of all experiment progress.
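For what it's worth, requirements 2-4 don't need much more than a loop over nvidia-smi and subprocess: keep a queue of (command, memory estimate) pairs, launch each one on a GPU with enough free memory via CUDA_VISIBLE_DEVICES, and re-queue anything that exits non-zero. The commands, memory estimates, and retry counts below are placeholders, and "notifications" are reduced to print statements; this is a rough sketch, not a real scheduler:

    import collections
    import os
    import subprocess
    import time

    # Placeholder experiments: (shell command, rough peak memory in MiB, retries left).
    queue = collections.deque([
        ("python train.py --lr 1e-3", 6000, 2),
        ("python train.py --lr 1e-4", 6000, 2),
    ])
    running = []  # (Popen handle, command, MiB estimate, retries left)

    def free_mib():
        """Per-GPU free memory in MiB, as reported by nvidia-smi."""
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        return {int(i): int(f) for i, f in
                (line.split(",") for line in out.strip().splitlines())}

    while queue or running:
        # Reap finished jobs; re-queue failures (OOM, runtime errors) a bounded number of times.
        for job in running[:]:
            proc, cmd, mib, retries = job
            if proc.poll() is not None:
                running.remove(job)
                if proc.returncode != 0 and retries > 0:
                    print(f"failed (exit {proc.returncode}), re-queueing: {cmd}")
                    queue.append((cmd, mib, retries - 1))
                else:
                    print(f"finished (exit {proc.returncode}): {cmd}")
        # Launch the next queued experiment on any GPU with enough headroom.
        if queue:
            cmd, mib, retries = queue[0]
            gpu = next((i for i, f in free_mib().items() if f >= mib), None)
            if gpu is not None:
                queue.popleft()
                env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
                running.append((subprocess.Popen(cmd, shell=True, env=env), cmd, mib, retries))
                print(f"launched on GPU {gpu}: {cmd}")
        time.sleep(30)

Pinning each job with CUDA_VISIBLE_DEVICES keeps the training scripts themselves untouched, which matters if avoiding a code restructure is a hard requirement.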
I know of tools like PyTorch Lightning (which only works with PyTorch and requires significant code restructuring) and Weights & Biases (which only covers experiment tracking/logging), but I have yet to find something lightweight and flexible enough to handle all of these requirements.
What's the best way to manage experiments like this?
by p1esk on 1/11/21, 1:31 AM
I've seen plenty of experiment management tools being advertised, but every time I looked at them they were either very limited, or required significant restructuring of my code or my workflow.
I'd like to hear about whatever solution you find because I agree, this does get tedious and painful sometimes.