by eachro on 10/28/23, 7:22 AM
This really brings me back to my days in college, when this was the exact sort of stuff that ML classes focused on. You could have an entire course on various interpretations and extensions of kernel-based methods. I wonder how much these insights are worth anymore in the age of LLMs and deep neural networks. I haven't kept up too much with the NTK literature, but it seems like a theoretical understanding of kernel-based methods and Gaussian processes doesn't confer any advantage in being a better modern ML engineer (specifically one working on LLMs), where the skill set is geared more heavily towards systems engineering and/or devops for babysitting all your experiments.
by cgadski on 10/28/23, 12:45 PM
It's great to see so much interest in this post on HN! In view of the attention, I feel like I need to emphasize that although my writing and illustrations are original, the "big idea" certainly is not. In fact, my goal was just to present the simplest example I could find that illustrates a connection between gradient descent and kernel methods. There's already a lot of work being done on this relationship, most of which I'm just now learning of.
(Although I don't have time to respond meaningfully today, I really appreciate the comments pointing out other relationships by mvcalder, Joschkabraun, yagyu and others.)
by uoaei on 10/28/23, 2:12 AM
Excellent article and a really clear view on statistical model training dynamics. This perspective will no doubt contribute to the development of deep learning theory.
I'm interested especially in the lessons we can learn about the success of overparametrization. As mentioned at the beginning of the article:
> To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.
It has always been my intuition that overparametrization makes this ridge an overwhelming statistical majority of the parameter space, which would explain the success in training. What is less clear, as mentioned at the end, is why it hedges against overfitting. Could it be that "simple" function combinations are also overwhelmingly statistically likely vs more complicated ones? I'm imagining a hypersphere-in-many-dimensions kind of situation, where the "corners" are just too sharp to stay in for long before descending back into the "bulk".
Interested to hear others' perspectives or pointers to research on this in the context of a kernel-based interpretation. I hope understanding overparametrization may also go some way toward explaining the unreasonable effectiveness of analog learning systems such as human brains.
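For whatever it's worth, the simplest toy version of the "ridge" I know of is overparametrized linear regression. This is just a NumPy sketch of that linear case (nothing from the article's code, and it says nothing about the overfitting question): with more parameters than data points, the zero-loss set is a whole affine subspace, and gradient descent from zero lands on one particular point of it, the minimum-norm solution, which is exactly linear-kernel regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized linear regression: more parameters (d) than data points (n),
# so the zero-loss set {w : Xw = y} is a whole affine subspace -- the "ridge".
n, d = 10, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Plain gradient descent on the squared loss, starting from w = 0.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2   # safely below 2 / lambda_max
for _ in range(10000):
    w -= lr * X.T @ (X @ w - y)

# Gradient descent never leaves the row space of X, so of all the points on
# the ridge it converges to the minimum-norm interpolant -- which is exactly
# the kernel regression prediction with the linear kernel k(x, x') = x . x'.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

print(np.max(np.abs(X @ w - y)))        # ~0: we landed on the ridge
print(np.linalg.norm(w - w_min_norm))   # ~0: and at its minimum-norm point
```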
by mvcalder on 10/28/23, 12:09 PM
Whenever this topic comes up I like to provide a citation to some work I've done:
https://towardsdatascience.com/gradient-kernel-regression-e4...
Not out of vanity (ok, a little), but because I think the idea has an importance that hasn't been fully explored. The article's Bayesian perspective may be the whole story, but somehow I don't think so. Unlike the article's author, my work left me feeling that model architecture was the most important thing (behind training data), whereas they seem to feel it is ancillary.
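For the curious: the kernel in question is just the inner product of parameter gradients taken at a fixed (even untrained) model, followed by ordinary kernel ridge regression. A minimal NumPy sketch with a made-up two-layer toy model (the model and names here are mine for illustration, not the code from my article):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_features(x, W, a):
    """Gradient of f(x) = a . tanh(W x) with respect to all parameters,
    flattened into one vector; it plays the role of a feature map."""
    h = np.tanh(W @ x)
    df_da = h                                   # d f / d a
    df_dW = np.outer(a * (1.0 - h ** 2), x)     # d f / d W
    return np.concatenate([df_da, df_dW.ravel()])

# A small, randomly initialized (never trained) two-layer model.
d_in, width = 2, 64
W = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
a = rng.normal(size=width) / np.sqrt(width)

# Toy regression data.
X_train = rng.uniform(-1.0, 1.0, size=(20, d_in))
y_train = np.sin(3.0 * X_train[:, 0]) + 0.1 * rng.normal(size=20)

# "Gradient kernel": inner products of parameter gradients.
Phi = np.stack([grad_features(x, W, a) for x in X_train])
K = Phi @ Phi.T

# Ordinary kernel ridge regression with that kernel.
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

def predict(x):
    return grad_features(x, W, a) @ Phi.T @ alpha

# Prediction at a training point vs. its label (should roughly match).
print(predict(X_train[0]), y_train[0])
```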
by chongli on 10/28/23, 2:26 PM
What is a kernel method? I know what an operating system kernel is. I know what the kernel of a homomorphism is. I have taken courses in computational statistics and neural networks. Yet I've never encountered this use of the word kernel, which the article critically depends on but unhelpfully never defines. Googling for "kernel" didn't help either, because the term is extremely overloaded.
Can someone help?
by yagyu on 10/28/23, 4:34 AM
This seems like a young talent that we'll see more of. I like your to-the-point writing style and obvious passion for mathematical clarity. Keep it up, and best wishes for your PhD studies.
by archmaster on 10/28/23, 2:58 AM
Interesting article at first glance, although I definitely have to reread it when I'm actually... awake
Is that my Water.css color scheme from so many years ago I see? :)
by Joschkabraun on 10/28/23, 10:30 AM
by globalnode on 10/28/23, 2:09 AM
I had to ask GPT about the meaning of the terms in every line :D. I know I'm not the target audience, but I find this stuff interesting at a conceptual level.
Is it basically about using stats to improve on linear models?
by zeec123 on 10/28/23, 8:44 AM
by denton-scratch on 10/28/23, 9:11 AM
Was the .ski domain chosen because it's about gradient descent?
by bt1a on 10/28/23, 4:19 AM
I can't comprehend this, but I enjoyed your main page and seeing the quine shuffle as I zoomed in and out with my scroll wheel
by ShamelessC on 10/28/23, 2:48 AM
That feeling when the article is clearly high quality (so upvotes) but few have the expertise to say why (few comments).
by superhumanuser on 10/28/23, 4:01 AM
Is scrolling incredibly jerky on mobile for anyone else?