by ironbound on 4/6/24, 2:29 PM with 43 comments
by johndough on 4/6/24, 8:41 PM
Validation accuracy: https://i.imgur.com/8ZtX7Rd.png
Train loss: https://i.imgur.com/o5XdQ29.png
Code: https://bpa.st/NVJQ (currently only runs on my computer; I haven't had time to clean it up)
Note that this is just a toy benchmark with very little hyperparameter tuning. You could probably get similar results with most optimizers and an appropriate schedule. Nevertheless, I appreciate every hyperparameter that I do not have to set manually.
In summary, this seems to be a promising optimizer. I'll add it to my list of optimizers to try for new deep learning projects.
by tysam_and on 4/6/24, 8:16 PM
Additionally, the optimizer does appear to have a kind of momentum, despite claims directly to the contrary, and uses it in a nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' only in the sense that the schedule is hardcoded into the algorithm itself -- 1./steps_taken, which is hardly a rare learning rate schedule. It is a decently robust but sometimes suboptimal schedule, and I find it sketchy to claim the method is 'schedule-free'. This also cripples the optimizer by tying its performance to the number of steps taken, which is potentially a problem if you are using any batch-size + learning-rate scaling strategies, as I understand it.
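For context, here is roughly what that inner loop looks like as I read the paper's pseudocode. This is a minimal sketch of my own (the x/y/z/beta names follow the paper; everything else, including grad_fn and the defaults, is made up for illustration), not the authors' implementation:

    import numpy as np

    def schedule_free_sgd(grad_fn, x0, lr=1.0, beta=0.9, steps=1000):
        # Sketch of the schedule-free SGD inner loop as I read it:
        # z takes the raw gradient steps, y is the interpolated point where
        # the gradient is evaluated, x is the running average you keep.
        x = np.asarray(x0, dtype=float).copy()
        z = x.copy()
        for t in range(1, steps + 1):
            y = (1.0 - beta) * z + beta * x   # nesterov-like interpolation (the "momentum")
            g = grad_fn(y)                    # gradient evaluated at y, not at x
            z = z - lr * g                    # plain SGD step on z
            c = 1.0 / t                       # the hardcoded 1/steps_taken weight
            x = (1.0 - c) * x + c * z         # 1/t running average of the z iterates
        return x

As a quick sanity check, grad_fn=lambda w: 2 * w (i.e. minimizing ||w||^2) with a small lr like 0.1 drives x toward zero, just more slowly than raw SGD because of the averaging.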
There is a mixture of hype and substance here, and I wish the author were more straightforward about their approach and claims. I think there is potential for a good "bolts-included" optimizer in some of the ideas presented here, but the amount of overhyping and deception makes me reluctant to trust the follow-up work to come.
Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at best deceptive and at worst untrue. I could be wrong; these are just my personal opinions from my own experience, but I do occasionally find myself distraught about what tends to catch wind in the technical news cycle.
-Fern
by shostack on 4/6/24, 8:08 PM
by danielhanchen on 4/7/24, 2:10 AM
* Aaron et al.'s past work on D-Adaptation won a best ICML paper award, with Prodigy as the follow-up work, but on transformers both did similarly to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
* Superconvergence + the LR range finder + fast.ai's Ranger21 optimizer was the go-to setup for CNNs and worked fabulously well, but on transformers the learning rate range finder said 1e-3 was best, whilst 1e-5 actually worked better. The one-cycle learning rate schedule stuck around, though (rough sketch of its shape after this list). https://github.com/huggingface/transformers/issues/16013
* A huge issue is that this still needs tuning?! And how does it compare against a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, which outperformed it using a tuned SGD.
* I'm just a little reserved for now since the authors themselves haven't provided any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-... wasn't pleasant reading.
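For reference, the one-cycle schedule mentioned above is roughly a warmup to a peak learning rate followed by a decay back down. A simplified sketch of that shape (illustrative only, not fastai's actual implementation; all names and defaults here are mine):

    import math

    def one_cycle_lr(step, total_steps, max_lr=1e-3, min_lr=1e-5, pct_warmup=0.3):
        # Rough one-cycle shape: linear warmup to max_lr, then cosine decay to min_lr.
        warmup_steps = max(1, int(total_steps * pct_warmup))
        if step < warmup_steps:
            return min_lr + (max_lr - min_lr) * step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))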
by rand0mwalk on 4/6/24, 5:43 PM
by ShamelessC on 4/6/24, 11:35 PM
by yinser on 4/6/24, 7:46 PM
by wwilim on 4/7/24, 3:24 PM