by 6502nerdface on 2/11/23, 4:34 PM with 1 comment
One point of inspiration here is my experience that many researchers and data scientists in industry, while fastidious about the in-sample/out-of-sample distinction for the predictive "core" of their model (i.e. the regression or the ANN or whatever), are often less conscious of the same distinction for all of the feature-preparation steps and prediction transformations that may precede or follow that core in their overall pipeline. For example, before fitting your regression or whatever, you might z-score some of the predictors, and when you later apply that now-fit regression to make predictions on some held-out data, you really ought to z-score the held-out feature values using the same means and standard deviations that were "learned" at fit-time (they should be considered part of the "state" of your model). But this is often inconvenient or overlooked by practitioners, who might naively z-score the held-out batch using its own means and SDs before feeding it to their trained model.
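To make the pitfall concrete, here's a minimal numpy sketch (illustrative only, not Frankenfit's API) contrasting the correct approach — reusing the fit-time statistics — with the naive one:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=100)     # fit-time data
held_out = rng.normal(loc=5.0, scale=2.0, size=10)   # apply-time data

# Fit-time: the mean and SD are learned from the training data only,
# and become part of the model's state.
mu, sigma = train.mean(), train.std()

# Apply-time, correct: z-score held-out values with the *training* statistics.
z_correct = (held_out - mu) / sigma

# Apply-time, naive: z-score the held-out batch with its own statistics.
# This leaks information across the batch and gives different answers.
z_leaky = (held_out - held_out.mean()) / held_out.std()
```

Note that `z_leaky` has mean exactly zero by construction, while `z_correct` generally does not — the two transformations genuinely disagree on the same data.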
But if you imagine receiving the held-out data not as a batch, but as a stream of observations arriving one at a time, each needing a prediction as it arrives, it's clear that the naive approach cannot be correct! (As you make each prediction, you can't know what the held-out batch's means and SDs "will be".)
So with Frankenfit, I wanted to make it easy to write end-to-end data modeling pipelines where this distinction between fit-time and apply-time is baked in all the way through the intermediate transformations, be they z-scores, winsorizations, imputations, or whatever. And it turns out that this can also make for very elegant expressions of common resampling workflows like cross-validation, hyperparameter search, sequential learning, and so on.