from Hacker News

Micro-models: purposefully overfit models that are good at one specific thing

by ulrikhansen54 on 8/5/21, 1:15 PM with 30 comments

by ALittleLight on 8/6/21, 5:01 AM
This doesn't really seem like "overfitting" as I understand the concept. This is more like training a model to do a specific task rather than a more general task. Overfitting would be if your model started to memorize the training data - which doesn't seem to be what they are talking about and doesn't seem like it would be very useful.
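To make the distinction above concrete, here is a toy sketch of "memorizing the training data" versus learning a rule. Everything in it is invented for illustration (the data, the flipped labels, and both models); it says nothing about the article's actual models.

```python
# Toy data, invented for illustration: the true rule is "label 1 when x > 0.5",
# but two training labels (x=0.3 and x=0.7) are flipped to simulate noise.
train = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
         (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
held_out = [(0.12, 0), (0.22, 0), (0.31, 0), (0.42, 0),
            (0.58, 1), (0.69, 1), (0.78, 1), (0.91, 1)]


def memorizer(x):
    """1-nearest-neighbour lookup: returns the label of the closest training point."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def threshold_rule(x):
    """A simple rule that ignores the noisy labels."""
    return int(x > 0.5)


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


models = {"memorizer": memorizer, "threshold": threshold_rule}
# Each entry is (training accuracy, held-out accuracy).
scores = {name: (accuracy(m, train), accuracy(m, held_out))
          for name, m in models.items()}

for name, (train_acc, held_out_acc) in scores.items():
    print(f"{name}: train={train_acc:.2f} held_out={held_out_acc:.2f}")
```

The memorizer is perfect on the training set because it reproduces the noisy labels, and it carries those mistakes over to nearby unseen points. The threshold rule scores lower on the training set but does better on the held-out points. A gap in that direction is what "overfitting" usually refers to.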
by ruinar50 on 8/6/21, 4:09 PM
Putting aside the detailed discussions on what exactly is "overfitting" for the moment, interested to hear more about the utility of micro-models in actual value delivery pipelines.
Does it matter if it's technically overfitting or not if everyone understands what their "one specific thing" is and how to "stitch" them together to get accurate results over some real-world problem space? (Conversely, people have to recognize the limitations.) Also, for "micro-model" as a word, appreciate having neutral vocabulary to talk about a model that doesn't solve the whole problem space, but does work for some of it. As opposed to "overfit model" or "incomplete model", which seem to cast negative connotations on a concept which is potentially useful when properly applied. (Though an eventual consensus on vocabulary is likely necessary as the space matures...)
Later parts of the article introduced kick-off, iteration, and prototyping time as concrete benefits. Interested to see a follow-up addressing how micro-models fit into a general problem-solving pipeline. What's next in terms of speeding up the assembly-line process? Where do they fit into data-oriented programming on the whole?
by brainwipe on 8/6/21, 8:16 AM
I'm not sure this is overfitting so much as a very narrow training set. It's still generalising against inputs it hasn't seen. If it were really overfitted, it wouldn't work for any unseen frames and it would be learning the "noise". It's not learning noise, or else you'd get lots of false positives, such as dark areas in the frame that look a bit like Batman but aren't. The main reason you want to generalise is noise rejection (no mention of this in the article). I think the S/N ratio in a video is exceptionally high as the dataset is directly repeatable, so the source of truth is exceptionally accurate.
That being said, narrow training sets are a great idea and this application looks great.
by robojoker on 8/6/21, 6:22 AM
This is an approach that I have used when doing attribution. Given error signals of a larger system, I couldn’t get great performance when attributing the errors to a particular broken component in the system. However, when I broke down that component into its set of particular issues and built a classifier per issue, I was able to get great performance. With the lightweight models we used, it was straightforward to automate most of the training / validation of these component-issue-specific models and decommission them when the issue no longer existed (a fix was put in).
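A rough sketch of what a "classifier per issue" setup with decommissioning might look like. All of it is an assumption for illustration: the `IssueRegistry` and `IssueModel` names, the issue names, the two-feature error signals, and the nearest-centroid classifier are hypothetical and are not the commenter's actual system.

```python
from dataclasses import dataclass, field


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class IssueModel:
    """Tiny binary classifier: fires if a signal is closer to the issue's
    positive centroid than to its negative centroid."""
    pos: list
    neg: list

    def fires(self, signal):
        return sq_dist(signal, self.pos) < sq_dist(signal, self.neg)


@dataclass
class IssueRegistry:
    """One lightweight model per known issue (hypothetical structure)."""
    models: dict = field(default_factory=dict)

    def train(self, issue, positives, negatives):
        self.models[issue] = IssueModel(centroid(positives), centroid(negatives))

    def decommission(self, issue):
        # Drop the model once the underlying issue has been fixed.
        self.models.pop(issue, None)

    def attribute(self, signal):
        return sorted(name for name, m in self.models.items() if m.fires(signal))


registry = IssueRegistry()
registry.train("disk_full", positives=[[0.9, 0.1], [0.8, 0.2]],
               negatives=[[0.1, 0.1], [0.2, 0.9]])
registry.train("timeout", positives=[[0.1, 0.9], [0.2, 0.8]],
               negatives=[[0.1, 0.1], [0.9, 0.1]])

before_fix = registry.attribute([0.85, 0.15])
registry.decommission("disk_full")
after_fix = registry.attribute([0.85, 0.15])
print(before_fix, after_fix)
```

Because each model only answers one narrow question, adding or removing an issue never requires retraining the others.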
by l-lousy on 8/6/21, 10:33 AM
Interesting article; this also seems like a form of knowledge distillation. There have been a lot of examples of people distilling an ensemble into a single model. Maybe you could try that here directly by taking out the middleman (match their outputs directly instead of labeling data).
Anyway, I’ve been trying to think of how this could be used for text data, specifically NER, which generally requires a lot more semantic understanding of the input. Sadly it seems like there might not be much room for the ‘micro’ part of the micro models.
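For readers unfamiliar with the distillation idea mentioned above, here is a minimal sketch of "matching outputs directly": a single student model is fit to the averaged soft outputs of an ensemble instead of to hard labels. The three 1-D logistic "teachers", their weights, and the input grid are all assumptions made up for the example; they are not from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical ensemble: three teachers, each a 1-D logistic model (w, b).
TEACHERS = [(2.0, 0.0), (2.5, -0.2), (1.5, 0.2)]


def ensemble_prob(x):
    """Average of the teachers' predicted probabilities."""
    return sum(sigmoid(w * x + b) for w, b in TEACHERS) / len(TEACHERS)


# Unlabeled inputs; the ensemble's soft outputs are the training targets.
inputs = [i / 10 for i in range(-20, 21)]
soft_targets = [ensemble_prob(x) for x in inputs]


def distill(xs, targets, lr=0.5, steps=2000):
    """Fit one logistic student by gradient descent on cross-entropy
    against the soft targets (gradient w.r.t. the logit is p - t)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(xs, targets):
            err = sigmoid(w * x + b) - t
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


student_w, student_b = distill(inputs, soft_targets)
mean_gap = sum(abs(sigmoid(student_w * x + student_b) - t)
               for x, t in zip(inputs, soft_targets)) / len(inputs)
print(f"student w={student_w:.3f} b={student_b:.3f} mean gap={mean_gap:.4f}")
```

No hard labels are produced at any point; the student only ever sees the ensemble's probabilities.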
by jogundas on 8/6/21, 8:18 AM
A nice example of overfitting!
However, it is hard to imagine an actual application of the process. If I understand it correctly, the author suggests using a set of micro-models for annotating a dataset which is then used to train another model. The latter model can actually detect Batman in a general environment, i.e., it can generalize. However, enriching a training dataset by adding adjacent frames depicting Batman from the same movie will likely have limited usefulness when training an actual Batman detection (non-micro!) model. Or do I get the final application wrong?
by tomrod on 8/6/21, 4:52 AM
Neat concept. Suggestion to the author: show the out-of-sample fit stats and how the interpolation versus extrapolation regions are determined.
by underaxon on 8/6/21, 8:14 AM
I may be wrong, but I think this is what kernel methods (e.g. SVM) do, right? So this looks like a (deep) SVM where the kernels are small NNs.
by klysm on 8/6/21, 7:49 AM
I think the important piece missing from the headline is that these micro-models are combined in an ensemble-like fashion. Because of that, I wouldn’t really call it overfitting per se - more of a very restricted space to care about.
by abz10 on 8/6/21, 1:16 PM
Nothing new was discovered here and the key terminology is used incorrectly.
To be fair, most of the industry are amateurs, but most people don’t write Medium posts and continue to argue their ignorance on HN.