from Hacker News

Five ways to reduce variance in A/B testing

by Maro on 9/28/24, 12:42 PM with 28 comments

  • by vijayer on 9/29/24, 6:23 PM

    This is a good list that includes a lot of things most people miss. I would also suggest:

    1. Tight targeting of your users in an AB test. This can be through proper exposure logging, or aiming at users down-funnel if you’re actually running a down-funnel experiment. If your new feature is going to be launched separately on iOS and Android, then run separate experiments.

    2. Making sure your experiment runs in 7-day increments. Averaging out weekly seasonality not only reduces variance but also ensures your results accurately predict the effect of a full rollout.

    Everything mentioned in this article, including stratified sampling and CUPED, is available out of the box on Statsig. Disclaimer: I’m the founder, and this response was shared by our DS Lead.

  • by kqr on 9/29/24, 5:35 PM

    > Winsorizing, ie. cutting or normalizing outliers.

    Note that outliers are often your most valuable data points[1]. I'd much rather stratify than cut them out.

    By cutting them out you indeed get neater data, but it no longer represents the reality you are trying to model and learn from, and you run a large risk of drawing false conclusions.

    [1]: https://entropicthoughts.com/outlier-detection

  • by sunir on 9/29/24, 5:08 PM

    One of the most frustrating results I found is that A/B split tests often resolved into a winner within the sample size range we set; however, if I left the split running over a longer period of time (eg a year) the difference would wash out.

    I had retargeting in a 24-month split by accident and found that, after accounting for all the cost, it didn’t matter in the long term. We could bend the conversion curve but not change the people who would convert.

    And yes, we did capture more revenue in the short term, but over the long term the cost of the ads netted it all out to zero or below. And yes, we turned off retargeting after conversion. The result was that customers who weren’t retargeted eventually bought anyway.

    Has anyone else experienced the same?

  • by tmoertel on 9/29/24, 8:50 PM

    Just a note that "stratification" as described in this article is not what is traditionally meant by taking a stratified sample. The article states:

    > Stratification lowers variance by making sure that each sub-population is sampled according to its split in the overall population.

    In common practice, the main way that stratification lowers variance is by computing a separate estimate for each sub-population and then computing an overall population estimate from the sub-population estimates. If the sub-populations are more uniform ("homogeneous") than is the overall population, the sub-populations will have smaller variances than the overall population, and a combination of the smaller variances will be smaller than the overall population's variance.

    In short, you not only stratify the sample, but also correspondingly stratify the calculation of your wanted estimates.

    The article does not seem to take advantage of the second part.
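
    For concreteness, a rough sketch of the stratified calculation in Python (the strata, weights, and numbers below are made up for illustration, not taken from the article):

      import numpy as np

      rng = np.random.default_rng(0)

      # Two hypothetical sub-populations with different means, e.g. new vs. returning users.
      strata = {
          "new":       rng.normal(loc=10.0, scale=2.0, size=8000),
          "returning": rng.normal(loc=30.0, scale=3.0, size=2000),
      }
      weights = {"new": 0.8, "returning": 0.2}  # known population shares

      # Stratified estimate: estimate each stratum separately, then combine by weight.
      stratified_mean = sum(w * strata[s].mean() for s, w in weights.items())
      stratified_var = sum(
          w ** 2 * strata[s].var(ddof=1) / len(strata[s]) for s, w in weights.items()
      )

      # Naive estimate: pool everything and ignore the strata.
      pooled = np.concatenate(list(strata.values()))
      naive_var = pooled.var(ddof=1) / len(pooled)

      print(f"stratified variance of the mean: {stratified_var:.5f}")
      print(f"pooled variance of the mean:     {naive_var:.5f}")  # larger: includes between-stratum spread

    If the strata really are more homogeneous than the overall population, the stratified number comes out much smaller.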

    (P.S. This idea, taken to the limit, is what leads to importance sampling, where potentially every member of the population exists in its own stratum. Art Owen has a good introduction: https://artowen.su.domains/mc/Ch-var-is.pdf.)

  • by pkoperek on 9/29/24, 5:30 PM

    Good read. Does anyone know if any of the experimentation frameworks actually use these methods to make the results more reliable (e.g. automatically apply winsorization or attempt to make the split sizes even)?
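
    The winsorization step itself is simple to run by hand before the analysis; a rough sketch (the percentile cutoff and the data below are illustrative):

      import numpy as np

      def winsorize(values, lower_pct=0.0, upper_pct=99.0):
          """Clip values to the given percentile bounds instead of dropping them."""
          lo, hi = np.percentile(values, [lower_pct, upper_pct])
          return np.clip(values, lo, hi)

      # Heavy-tailed revenue-like metric: most users spend little, a few spend a lot.
      rng = np.random.default_rng(1)
      revenue = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

      capped = winsorize(revenue, upper_pct=99.0)
      print(f"raw variance:        {revenue.var():.1f}")
      print(f"winsorized variance: {capped.var():.1f}")  # noticeably smaller
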
  • by usgroup on 9/29/24, 6:56 PM

    Adding covariates to the post analysis can reduce variance. One instance of this is CUPED, but there are lots of covariates which are easier to add (e.g. request type, response latency, day of week, user info, etc).
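
    A minimal sketch of that adjustment for the CUPED case, assuming a pre-experiment version of the metric is available per user as the covariate (the numbers below are made up):

      import numpy as np

      rng = np.random.default_rng(2)

      # Pre-experiment metric (covariate) and in-experiment metric, correlated per user.
      pre = rng.normal(loc=100.0, scale=20.0, size=50_000)
      post = 0.8 * pre + rng.normal(loc=0.0, scale=10.0, size=50_000)

      # theta = cov(pre, post) / var(pre); subtracting theta * (pre - mean(pre))
      # removes the variance in `post` that the covariate already explains.
      theta = np.cov(pre, post)[0, 1] / pre.var(ddof=1)
      post_cuped = post - theta * (pre - pre.mean())

      print(f"variance before CUPED: {post.var():.1f}")
      print(f"variance after CUPED:  {post_cuped.var():.1f}")  # smaller; the mean is unchanged
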
  • by withinboredom on 9/29/24, 4:26 PM

    Good advice! Having worked on an internal A/B testing platform, we had built-in tooling to do some of this stuff after the fact. I don't know of any off-the-shelf A/B testing tool that can do this.
  • by kqr on 9/29/24, 4:42 PM

    See also sample unit engineering: https://entropicthoughts.com/sample-unit-engineering

    Statisticians have a lot of useful tricks to get higher-quality data out of the same cost (i.e. sample size).

    Another topic I want to learn properly is running multiple experiments in parallel in a systematic way to get faster results and be able to control for confounding. Fisher advocated for this as early as 1925, and I still think we're learning that lesson today in our field: sometimes the right strategy is not to try one thing at a time and keep everything else constant.
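
    As a toy illustration of that factorial idea (everything below is made up), two features can be tested on the same traffic by assigning them independently, and each main effect is then estimated from the full sample:

      import numpy as np

      rng = np.random.default_rng(3)
      n = 40_000

      # Assign every user independently in both experiments (a 2x2 factorial design).
      feature_a = rng.integers(0, 2, size=n)  # 0 = control, 1 = treatment A
      feature_b = rng.integers(0, 2, size=n)  # 0 = control, 1 = treatment B

      # Simulated outcome with small additive effects from each feature.
      outcome = 1.0 + 0.05 * feature_a + 0.02 * feature_b + rng.normal(0.0, 1.0, size=n)

      # Each main effect uses all n users, averaging over the other factor.
      effect_a = outcome[feature_a == 1].mean() - outcome[feature_a == 0].mean()
      effect_b = outcome[feature_b == 1].mean() - outcome[feature_b == 0].mean()
      print(f"estimated effect of A: {effect_a:.3f}")
      print(f"estimated effect of B: {effect_b:.3f}")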

  • by musicale on 10/2/24, 5:42 AM

    6. Switch to A/A testing.
  • by sanchezxs on 9/29/24, 7:16 PM

    Yes.