by Ambolia on 5/28/23, 4:35 PM with 46 comments
by radford-neal on 5/28/23, 6:51 PM
To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the probability of the value in the observation, which is heads, is now 2/3, greater than it was under the prior.
And that's OVERFITTING, according to the definition in the blog post.
Not according to any sensible definition, however.
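(A quick check of the arithmetic in this example, not from the comment itself: a sketch using the standard Beta–Bernoulli conjugacy, where a Uniform(0,1) prior is Beta(1,1) and the predictive probability of heads is the mean a/(a+b).)

```python
from fractions import Fraction

def predictive_heads(a, b):
    """P(next flip = heads) under a Beta(a, b) belief: the mean a / (a + b)."""
    return Fraction(a, a + b)

# Uniform(0,1) prior is Beta(1,1): integrating over it gives P(heads) = 1/2.
prior_pred = predictive_heads(1, 1)

# Observe one head: posterior is Beta(2,1), and P(heads) rises to 2/3.
post_pred = predictive_heads(2, 1)
```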
by CrazyStat on 5/28/23, 6:36 PM
> Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality p(y_i | y_{-i}) < p(y_i | y), for any point i and any model.
Let y_1, ... y_n be iid from a Uniform(0,theta) distribution, with some nice prior on theta (e.g. Exponential(1)). Then the posterior for theta, and hence the predictive density for a new y_i, depends only on max(y_1, ..., y_n). So for all but one of the n observations the author's strict inequality does not hold.
by syntaxing on 5/28/23, 4:53 PM
by to-mi on 5/28/23, 7:46 PM
by MontyCarloHall on 5/28/23, 6:20 PM
\int p(y_N|θ) p(θ|{y_1...y_{N-1}}) dθ = p(y_N|{y_1...y_{N-1}})
by the law of total probability. Expanding the leave-one-out posterior (via Bayes' rule), we have
p(θ|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|θ)p(θ)/\int p({y_1...y_{N-1}}|θ')p(θ') dθ'
which when plugged back into the first equation is \int p(y_N|θ) p({y_1...y_{N-1}}|θ)p(θ) dθ/(\int p({y_1...y_{N-1}}|θ')p(θ') dθ')
I don't see how this simplifies to the harmonic mean expression in the post. Regardless, the author is asserting that
p(y_N|{y_1...y_{N-1}}) ≤ p(y_N|{y_1...y_N})
which seems intuitively plausible for any trained model — given a model trained on data {y_1...y_N}, performing inference on any datapoint y_1...y_N in the training set will generally be more accurate than performing inference on a datapoint y_{N+1} not in the training set.
by alexmolas on 5/28/23, 5:12 PM
p(y_i|y_{-i}) = \int p(y_i|\theta) p(\theta|y) \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1} p(\theta^\prime|y) d\theta^\prime} d\theta
why is that? Can someone explain the rationale behind this?
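(Not part of the thread: one way to see what that expression is doing is to check the resulting harmonic-mean identity, p(y_i|y_{-i}) = 1 / E_{θ|y}[1/p(y_i|θ)], numerically. A sketch in a Beta–Bernoulli model, where the exact leave-one-out predictive is available in closed form; the toy data and the quadrature scheme are my own choices.)

```python
import math

def beta_pdf(t, a, b):
    """Density of a Beta(a, b) distribution at t."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / norm

y = [1, 1, 0, 1, 0, 1, 1]         # toy coin-flip data
n, s = len(y), sum(y)
a, b = 1 + s, 1 + n - s           # full-data posterior Beta(a, b) under a Beta(1,1) prior

def lik(yi, t):
    """Bernoulli likelihood p(y_i | theta)."""
    return t if yi == 1 else 1 - t

def loo_exact(i):
    """Exact LOO predictive: refit the posterior without y_i, then predict y_i."""
    s_i = s - y[i]
    a_i, b_i = 1 + s_i, 1 + (n - 1) - s_i
    return lik(y[i], a_i / (a_i + b_i))   # predictive = posterior mean of theta

def loo_harmonic(i, m=20000):
    """Harmonic-mean identity: 1 / E_{theta|y}[1 / p(y_i|theta)],
    estimated by midpoint quadrature over the *full-data* posterior."""
    total = sum(beta_pdf((k + 0.5) / m, a, b) / lik(y[i], (k + 0.5) / m)
                for k in range(m))
    return 1.0 / (total / m)
```

Both routes agree, even though loo_harmonic only ever touches the full-data posterior — which is what makes the identity useful for estimating leave-one-out quantities from a single model fit.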
by vervez on 5/28/23, 5:08 PM
by joshjob42 on 5/29/23, 5:02 AM
by bbminner on 5/28/23, 6:41 PM
by psyklic on 5/28/23, 8:01 PM
For example, a simple model might underfit in general, but it may still fit the training set better than the test set. If this happens yet both are poor fits, it is clearly underfitting and not overfitting. Yet by the article's definition, it would be both underfitting and overfitting simultaneously. So, I suspect this is not an ideal definition.
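(A toy illustration of this point, mine rather than the commenter's: a constant model fit to quadratic data underfits badly everywhere, yet still scores better on its own training set than on held-out points.)

```python
# Constant ("degree-0") model fit to quadratic data: a clear underfit.
train_x = list(range(10))        # x = 0..9
test_x = list(range(10, 15))     # x = 10..14, held out

def f(x):
    return x ** 2                # true signal, far beyond a constant's reach

train_y = [f(x) for x in train_x]
test_y = [f(x) for x in test_x]

# Least-squares constant model: predict the training mean everywhere.
pred = sum(train_y) / len(train_y)

def mse(ys):
    return sum((yv - pred) ** 2 for yv in ys) / len(ys)

train_mse, test_mse = mse(train_y), mse(test_y)
# Both errors are huge (the model underfits), yet train_mse < test_mse —
# the gap that the article's definition would also label "overfitting".
```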
by dmurray on 5/28/23, 4:51 PM
by chunsj on 5/29/23, 1:25 AM
by tesdinger on 5/28/23, 10:49 PM