by dedalus on 10/17/23, 11:04 PM with 53 comments
by Alligaturtle on 10/18/23, 6:25 PM
I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean), because the information from an implausible value is more likely to lie near the extreme than near the central mass of the distribution.
What to do with those extreme values?
Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then create a new sample by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the very outliers that we originally wanted to drop. Record the mean of this new sample, then repeat until we have a distribution of new means. What I am curious about is how the shape of this distribution of means is affected by the X% value selected at the beginning.
What are some well-known distributions that appear to have outliers? A log-normal distribution maybe?
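A minimal sketch of that experiment in Python -- the function name, the 5% cut, the replicate count, and the log-normal test data are my own illustrative choices, not anything from the comment:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_extremes_mean(data, x_pct, rng):
    """Drop the top and bottom x_pct% of observations, replace them with
    i.i.d. draws from the full original empirical distribution, and
    return the mean of the resulting sample."""
    data = np.sort(np.asarray(data))
    k = int(len(data) * x_pct / 100)
    if k == 0:
        return data.mean()
    middle = data[k:-k]
    # redraws come from the whole sample, so an original outlier can return
    redrawn = rng.choice(data, size=2 * k, replace=True)
    return np.concatenate([middle, redrawn]).mean()

# log-normal data: heavy right tail, so the sample shows apparent outliers
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
means = np.array([resample_extremes_mean(data, 5, rng) for _ in range(2000)])
print(means.mean(), means.std())
```

Sweeping `x_pct` from 0 upward and plotting a histogram of `means` at each setting would show how the shape of the distribution of means depends on the X% chosen.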
by jedberg on 10/18/23, 5:42 PM
Both methods give you better metrics for periodic comparisons, like day-over-day or week-over-week.
by bluenose69 on 10/18/23, 9:10 PM
The source of the data matters a lot in what methods make sense. For example, hand-entered numbers might involve transposed digits, or missing signs, or decimal points in the wrong place. Numbers deriving from some electronic measurements might have problems with numbers "pegging out" at some limit. In other cases, those numbers might "wrap around". Data that have been examined at an earlier stage might have numbers changed to something that is obviously wrong, like a temperature of -999.999 or something. The list goes on.
My point is that exploring outliers is often quite productive, and comparing means to Winsorized means can be a very quick way to see if outliers are an issue. This is not so much an issue for interactive work, for which plotting data is usually an early step, but it can come in handy during a preliminary stage of processing large datasets non-interactively. It can also be handy as part of a quality-control pipeline in a data stream.
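A minimal sketch of that quick check, using SciPy's `winsorize`; the injected -999.999 sentinels echo the example above, and the 1% limits and 5% gap threshold are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import mstats

rng = np.random.default_rng(1)
clean = rng.normal(20.0, 2.0, size=1000)
# inject a few sentinel "bad" values like the -999.999 mentioned above
dirty = np.concatenate([clean, np.full(5, -999.999)])

mean = dirty.mean()
# clip the lowest and highest 1% of values to the remaining extremes
wmean = mstats.winsorize(dirty, limits=(0.01, 0.01)).mean()

# a large relative gap flags that outliers are distorting the plain mean
if abs(mean - wmean) > 0.05 * abs(wmean):
    print(f"possible outliers: mean={mean:.2f}, winsorized mean={wmean:.2f}")
```

Cheap enough to drop into a non-interactive pipeline as a per-batch sanity check before any plotting happens.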
by some_random on 10/18/23, 6:28 PM
by phlip9 on 10/18/23, 7:03 PM
Rust's `cargo bench` "winsorizes" the benchmark samples before computing summary statistics (incl. the mean).
https://github.com/rust-lang/rust/blob/master/library/test/s...
by SubiculumCode on 10/18/23, 5:20 PM
by hornban on 10/19/23, 1:05 AM
I tend to benefit the most from seeing the entire distribution visually, and that helps me decide if I'm looking for a median, a "normal" mean, a "mean minus some weird outliers", or something different entirely.
Does anybody happen to know of a good visual guide for how different measures of central tendency apply to various distributions? Anything that emphasizes pathological cases is helpful.
by thih9 on 10/18/23, 5:56 PM
by snicker7 on 10/18/23, 6:15 PM
by laughy on 10/19/23, 7:12 AM
by croisillon on 10/18/23, 8:03 PM