by Wookai on 10/18/21, 8:01 AM with 2 comments
by Wookai on 10/18/21, 8:17 AM
In a nutshell, xmanager allows you to:
- define an experiment, which is a collection of one or more work units (think combinations of hyperparameters)
- manage the different jobs/executables required to run this experiment (TPU workers, tensorboard job, etc.)
- collect and display measurements from work units (loss, other metrics)
- keep a reproducible artifact which allows you to re-run the same experiment at any point in the future
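To make the "work unit" idea concrete, here is a minimal plain-Python sketch of an experiment as a set of hyperparameter combinations. It deliberately does not use the xmanager API: `expand_sweep` and the hyperparameter names and values are hypothetical, chosen only for illustration.

```python
import itertools


def expand_sweep(sweep):
    """Return one hyperparameter dict per work unit (hypothetical helper)."""
    names = sorted(sweep)  # fixed order so work unit numbering is stable
    value_lists = [sweep[name] for name in names]
    # The Cartesian product yields every combination of hyperparameter values.
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


# Assumed example sweep: 2 learning rates x 2 batch sizes = 4 work units.
sweep = {"learning_rate": [1e-3, 1e-4], "batch_size": [32, 64]}
work_units = expand_sweep(sweep)

for work_unit_id, params in enumerate(work_units, start=1):
    print(work_unit_id, params)
```

In a real launcher script, each of these dicts would be passed as arguments to the job that runs one work unit.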
See e.g. https://github.com/deepmind/xmanager/blob/main/examples/ for a few concrete examples of launcher scripts.
I wish they had included screenshots of the tool itself in the repo; I'll make that suggestion :).
by dekhn on 10/18/21, 2:19 PM
It's one of the few systems in ML that I've used and thought "huh, this was well designed and properly architected from the start."