by dedalus on 10/17/23, 11:04 PM with 53 comments
by Alligaturtle on 10/18/23, 6:25 PM
I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean), because the information from an implausible value is more likely to lie near the extreme than near the central mass of the distribution.
What to do with those extreme values?
Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then create a new sample by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the very outliers that we originally wanted to drop. Record the mean of this new sample, then repeat until we have a distribution of new means. What I am curious about is how the shape of this distribution of means is affected by the X% value selected at the beginning.
What are some well-known distributions that appear to have outliers? A log-normal distribution maybe?
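A minimal sketch of that experiment in Python -- the function name, the 5% cut, the replicate count, and the log-normal test data are my own illustrative choices, not anything from the comment:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_extremes_mean(data, x_pct, rng):
    """Drop the top and bottom x_pct% of observations, replace them with
    i.i.d. draws from the full original empirical distribution, and
    return the mean of the resulting sample."""
    data = np.sort(np.asarray(data))
    k = int(len(data) * x_pct / 100)
    if k == 0:
        return data.mean()
    middle = data[k:-k]
    # redraws come from the whole sample, so an original outlier can return
    redrawn = rng.choice(data, size=2 * k, replace=True)
    return np.concatenate([middle, redrawn]).mean()

# log-normal data: heavy right tail, so the sample shows apparent outliers
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
means = np.array([resample_extremes_mean(data, 5, rng) for _ in range(2000)])
print(means.mean(), means.std())
```

Sweeping `x_pct` from 0 upward and plotting a histogram of `means` at each setting would show how the shape of the distribution of means depends on the X% chosen.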
by jedberg on 10/18/23, 5:42 PM
Both methods give you better metrics for periodic comparisons, like day-over-day or week-over-week.
by bluenose69 on 10/18/23, 9:10 PM
The source of the data matters a lot in what methods make sense. For example, hand-entered numbers might involve transposed digits, or missing signs, or decimal points in the wrong place. Numbers deriving from some electronic measurements might have problems with numbers "pegging out" at some limit. In other cases, those numbers might "wrap around". Data that have been examined at an earlier stage might have numbers changed to something that is obviously wrong, like a temperature of -999.999 or something. The list goes on.
My point is that exploring outliers is often quite productive, and comparing means to Winsorized means can be a very quick way to see if outliers are an issue. This is not so much an issue for interactive work, for which plotting data is usually an early step, but it can come in handy during a preliminary stage of processing large datasets non-interactively. It can also be handy as part of a quality-control pipeline in a data stream.
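A minimal sketch of that quick check, using SciPy's `winsorize`; the injected -999.999 sentinels echo the example above, and the 1% limits and 5% gap threshold are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import mstats

rng = np.random.default_rng(1)
clean = rng.normal(20.0, 2.0, size=1000)
# inject a few sentinel "bad" values like the -999.999 mentioned above
dirty = np.concatenate([clean, np.full(5, -999.999)])

mean = dirty.mean()
# clip the lowest and highest 1% of values to the remaining extremes
wmean = mstats.winsorize(dirty, limits=(0.01, 0.01)).mean()

# a large relative gap flags that outliers are distorting the plain mean
if abs(mean - wmean) > 0.05 * abs(wmean):
    print(f"possible outliers: mean={mean:.2f}, winsorized mean={wmean:.2f}")
```

Cheap enough to drop into a non-interactive pipeline as a per-batch sanity check before any plotting happens.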
by some_random on 10/18/23, 6:28 PM
by phlip9 on 10/18/23, 7:03 PM
Rust's `cargo bench` "winsorizes" the benchmark samples before computing summary statistics (incl. the mean).
https://github.com/rust-lang/rust/blob/master/library/test/s...
by SubiculumCode on 10/18/23, 5:20 PM
by hornban on 10/19/23, 1:05 AM
I tend to benefit the most from seeing the entire distribution visually, and that helps me decide if I'm looking for a median, a "normal" mean, a "mean minus some weird outliers", or something different entirely.
Does anybody happen to know of a good visual guide for how different measures of central tendency apply to various distributions? Anything that emphasizes pathological cases is helpful.
by thih9 on 10/18/23, 5:56 PM
by snicker7 on 10/18/23, 6:15 PM
by laughy on 10/19/23, 7:12 AM
by croisillon on 10/18/23, 8:03 PM