by superfx on 11/12/18, 3:17 PM with 57 comments
by fwilliams on 11/13/18, 3:52 AM
The major contribution of the work is showing that a ResNet needs a number of parameters that is polynomial in the dataset size to converge to a global optimum, in contrast to traditional neural nets, which in this analysis require a number of parameters exponential in the depth.
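Roughly, the requirements look like this (my paraphrase; the exact polynomial degrees and constants are in the paper, with n the number of training samples, H the depth, and m the width per layer):

    % Rough shape of the over-parameterization requirements (my paraphrase,
    % constants and exact polynomial degrees omitted):
    \begin{align*}
    \text{ResNet:}          \qquad & m \;\ge\; \mathrm{poly}(n, H) \\
    \text{fully-connected:} \qquad & m \;\ge\; 2^{O(H)} \cdot \mathrm{poly}(n)
    \end{align*}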
by iandanforth on 11/12/18, 9:00 PM
by ramgorur on 11/12/18, 9:39 PM
1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.
2. The only way to get any guarantee is to restart gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point (see the toy sketch after this list).
3. Also, does "achieving zero training loss" mean the network has converged to a global optimum? I was under the impression that you can get zero training loss at a local minimum as well.
Please correct me if I am wrong.
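To make point 2 concrete, here's a minimal toy sketch (my own example, nothing to do with the paper): plain gradient descent on a 1-D non-convex function ends up in different minima depending on where it starts, which is why people use random restarts.

    # Toy example (mine, not from the paper): gradient descent on a
    # non-convex 1-D function with two minima. Which one we reach
    # depends entirely on the starting point.
    def f(x):
        return x**4 - 3 * x**2 + x   # global min near x = -1.3, local min near x = 1.1

    def grad_f(x):
        return 4 * x**3 - 6 * x + 1

    def gradient_descent(x0, lr=0.01, steps=2000):
        x = x0
        for _ in range(steps):
            x -= lr * grad_f(x)
        return x

    for x0 in (-2.0, 2.0):
        x_end = gradient_descent(x0)
        print(f"start {x0:+.1f} -> x = {x_end:+.3f}, f(x) = {f(x_end):+.3f}")
    # The two runs end at different minima; only one of them is global.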
by charleshmartin on 11/13/18, 6:36 PM
https://calculatedcontent.com/2018/09/21/rank-collapse-in-de...
But they miss something: the weight matrices also display power-law behavior.
https://calculatedcontent.com/2018/09/09/power-laws-in-deep-...
This is also important because it was suggested in the early 90s that Heavy-Tailed Spin Glasses would have a single local minimum.
This fact is the basis of my earlier suggestion that DNNs would exhibit a spin funnel.
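If anyone wants to poke at this themselves, here's a rough sketch of the kind of measurement those posts talk about (my own approximation, not the actual analysis code): take a layer's weight matrix W, look at the eigenvalues of W^T W, and fit a power-law exponent to the heavy tail.

    import numpy as np

    # Rough sketch (mine, not the analysis code from the posts above):
    # estimate a power-law exponent for the tail of a weight matrix's
    # eigenvalue spectrum. A trained layer's weights would replace the
    # random stand-in below (a random Gaussian matrix gives Marchenko-Pastur,
    # not a heavy tail, so the number printed here is only illustrative).
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 1024))            # stand-in for a trained layer

    eigs = np.linalg.svd(W, compute_uv=False) ** 2  # eigenvalues of W^T W
    tail = np.sort(eigs)[-100:]                     # largest 100 eigenvalues
    x_min = tail[0]

    # Hill / maximum-likelihood estimator, assuming p(x) ~ x^(-alpha) for x >= x_min
    alpha = 1.0 + len(tail) / np.sum(np.log(tail / x_min))
    print(f"estimated tail exponent alpha = {alpha:.2f}")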
by gogut on 11/14/18, 9:52 PM
In summary, the difference between the MSR paper and this paper is the following: let H denote the number of layers and m the number of hidden nodes. The MSR paper shows that assuming only m > poly(H), SGD finds a global optimum. Du et al. have a similar result, but they have to assume m > 2^{O(H)}. Compared to the MSR paper, Du et al.'s paper is actually pretty trivial.
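Spelled out side by side (my paraphrase of the comparison; both papers also have polynomial dependence on the number of samples and other quantities):

    % Paraphrase of the comparison above (constants and dependence on n omitted):
    \begin{align*}
    \text{MSR paper:} \quad m &\ge \mathrm{poly}(H) \;\;\Rightarrow\; \text{SGD finds a global optimum} \\
    \text{Du et al.:} \quad m &\ge 2^{O(H)} \;\;\;\,\Rightarrow\; \text{GD finds a global optimum (fully-connected case)}
    \end{align*}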
by brentjanderson on 11/12/18, 8:05 PM
> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).
If this variant of gradient descent is able to reach global minima in polynomial time, and if neural networks are proven to approximate any function, then ostensibly this technique could be used to guarantee the lowest error possible in approximating any function. This seems incredibly important. Can someone correct my reading of the abstract?
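As a toy illustration of just the quoted sentence (my own numpy sketch, a wide two-layer ReLU net rather than the paper's deep ResNet, with the output layer kept fixed as in some of these analyses): with a large enough width, plain full-batch gradient descent drives the training loss on a small random dataset to essentially zero. Note this only says something about fitting the training set, not about the error away from it.

    import numpy as np

    # Toy sketch (mine, not the paper's setting): an over-parameterized
    # two-layer ReLU net f(x) = a . relu(W^T x) / sqrt(m), trained with
    # plain full-batch gradient descent on random data. With a large
    # width m the training loss goes to (numerically) near zero.
    rng = np.random.default_rng(0)
    n, d, m = 20, 10, 2000                          # samples, input dim, hidden width
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
    y = rng.standard_normal(n)                      # arbitrary labels

    W = rng.standard_normal((d, m))                 # trained first-layer weights
    a = rng.choice([-1.0, 1.0], size=m)             # fixed output weights

    lr = 1.0
    for step in range(5001):
        act = np.maximum(X @ W, 0.0)                # ReLU activations, shape (n, m)
        pred = act @ a / np.sqrt(m)
        resid = pred - y
        loss = 0.5 * np.mean(resid ** 2)
        if step % 1000 == 0:
            print(f"step {step:5d}   training loss {loss:.3e}")
        # Gradient of the loss with respect to W (output weights a stay fixed)
        grad_W = X.T @ ((resid[:, None] / n) * (X @ W > 0) * a[None, :]) / np.sqrt(m)
        W -= lr * grad_W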
by juskrey on 11/13/18, 5:05 AM
by pj_mukh on 11/12/18, 9:56 PM
by orf on 11/12/18, 11:24 PM
Can someone expand on this? I've never heard of this before, at least not in the general case.
by lbj on 11/12/18, 8:00 PM