Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
Jun 13, 2024
Slide Content
Running experiments - from design to analysis
A/B tests under the hood
Maria Copot - CbusDAW
Who am I?
Researcher at Ohio State (Dept of Linguistics)
Main interest: characterising speakers’ knowledge of the words of their language
Broader interest in quantitative methodologies and statistical analysis
…Why am I here?
We both collect data and run experiments!
The pitch
Digital analysts run experiments, software drives design and implementation
Privacy laws, Cookie Ban™ - consequences for experimental data and design
A peek under the hood of common software, and proposals for branching out
The next half hour of your life
1. The gears of A/B testing software
2. The benefits of Bayesian methods
3. Paid participants - Prolific.co, Mechanical Turk
The goal vs the tools
All statistical analysis and experimentation is in the service of a goal.
Never lose track of the bigger picture!
Is A/B testing what you need?
When to use A/B testing
Comparing pairs of options differing along one dimension
Compare multiple dimensions between pairs of options? Multivariate tests!
Compare holistically different page designs? Redirect tests!
A/B testing
Goal → Hypothesis → Choose variations → Mysterious A/B testing black box (…goblins?) → Results!
The importance of having a hypothesis
Testing random features might sometimes increase clicks or engagement, but…
The importance of having a hypothesis
Important to start from what you think is missing and why it will improve things.
“Users prefer X to Y because Z”
- Identifying areas for improvement helps come up with tests (X vs Y)
- Both positive and negative outcomes teach us about user motivations (is Z true?)
The importance of having a hypothesis
Results do not speak on their own.
The narrative behind them is crucial for decisions.
Statistical significance
A/B test - is the conversion rate (CR) higher with a blue or a red button?

Variant      Users   CR
Blue (old)   1000    15%
Red (new)    1000    18%
p = 0.006
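As a rough illustration, here is a sketch (in Python, with scipy) of the two-proportion z-test a tool might run behind a table like this one. Tools differ (pooled vs unpooled variance, one- vs two-sided), so the p-value printed will not necessarily match a given product's output.

from math import sqrt
from scipy.stats import norm

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # shared rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                 # prob. of data at least this extreme

# Blue (old): 150/1000 = 15%; Red (new): 180/1000 = 18%
print(two_proportion_pvalue(150, 1000, 180, 1000))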
Statistical significance
Assume no difference between the blue and red button.
How likely is the red button observation?
p-value = probability of an observation at least this extreme, assuming no difference
Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Can lower the threshold for more precision, depending on the application and the variance in user behaviour.
✨Nothing magical about 0.05 ✨
Statistical significance
Always look at both effect size and p-value!
The larger the expected effect size, the easier it is to get a low p-value.
Effect size: the difference between the means of the two variants
The dangers of working with means
P-values and sample size
Low p-values are facilitated by large sample sizes
Variant      Users   Conversion rate
Blue (old)   10      70%
Red (new)    10      80%
p = 0.69

Variant      Users   Conversion rate
Blue (old)   100     70%
Red (new)    100     80%
p = 0.05
P-values and sample size
Variant      Users        Conversion rate
Blue (old)   100000000    80.000%
Red (new)    100000000    80.001%
p = 0.03
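The same z-test sketch, with the 70% vs 80% gap held fixed while the sample grows, makes the pattern concrete. This uses a pooled two-sided test, so the exact values may differ slightly from the slide's figures.

from math import sqrt
from scipy.stats import norm

for n in (10, 100, 1_000, 10_000):
    conv_a, conv_b = 0.70 * n, 0.80 * n        # conversions in each group
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (conv_b / n - conv_a / n) / se
    print(f"n = {n:>6}  p = {2 * norm.sf(abs(z)):.4f}")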
Power analysis
How many data points are needed to have an 80% chance of detecting an effect of the anticipated size?
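A sketch of that calculation for the earlier 15% vs 18% scenario, using statsmodels; the 80% power and 5% alpha are the conventional choices assumed here.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.18, 0.15)   # Cohen's h for the two rates
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(round(n_per_group), "users needed in each variation")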
P-values, sample sizes and time
Running assumption: users split 50/50 between variations.
But for certain features, this is not possible.
Example: purchase totals for people using Apple Pay vs manually entering card details.
The p-value is driven by two things:
- Effect size (bigger difference = less overlap between hypotheses)
- Sample size (higher = more certainty in the difference)
A/B testing - sneaking in frequentist statistics
Frequentists ask: how likely is the observation if the null hypothesis is true?
Probability of the data given the (null) hypothesis
Bayesian A/B testing
Bayesians ask: how likely is my hypothesis, given the data?
Bayesians vs frequentists
Frequentist estimation: a single unknown number with uncertainty around it, tested against the null hypothesis
Bayesian estimation: the entire distribution of the parameters of interest
Bayesian updating
p(hypothesis | evidence) = p(evidence | hypothesis) × p(hypothesis) / p(evidence)

posterior = likelihood × prior / marginal
Out of all the times people press the button, how many were blue vs red?
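That question can be answered directly from posterior samples. A minimal sketch, assuming a Beta-Binomial model with flat Beta(1, 1) priors and the 15% vs 18% counts from the earlier table (an illustration, not any particular tool's implementation):

import numpy as np

rng = np.random.default_rng(0)

# Posterior for each rate: Beta(prior + conversions, prior + non-conversions)
blue = rng.beta(1 + 150, 1 + 850, size=100_000)   # 150 of 1000 converted
red = rng.beta(1 + 180, 1 + 820, size=100_000)    # 180 of 1000 converted

# p(hypothesis | evidence), read straight off the posterior samples
print("P(red > blue) =", (red > blue).mean())
print("expected lift =", (red - blue).mean())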
Advantages of Bayesian methods
• Intuitive results (p-values and confidence intervals are misunderstood)
• Reliable even for small sample sizes (no need to pre-define sample sizes)
• No need to estimate the effect size in advance
• Early stopping is allowed (continuous updating - see the sketch below)
• Faster pipeline to decisions
• Can incorporate domain knowledge through priors
• Estimates the entire distribution
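As a sketch of the early-stopping point: the Beta posterior can be refreshed after each day's traffic and the test stopped once the decision probability clears a chosen bar. The daily traffic, true rates, and 95% bar below are all illustrative assumptions, not the slide's numbers.

import numpy as np

rng = np.random.default_rng(1)
a_blue = b_blue = a_red = b_red = 1            # flat Beta(1, 1) priors

for day in range(1, 11):
    conv_blue = rng.binomial(300, 0.15)        # 300 simulated users/day per variation
    conv_red = rng.binomial(300, 0.18)
    a_blue += conv_blue; b_blue += 300 - conv_blue
    a_red += conv_red;   b_red += 300 - conv_red

    p_red_wins = (rng.beta(a_red, b_red, 50_000)
                  > rng.beta(a_blue, b_blue, 50_000)).mean()
    print(f"day {day}: P(red > blue) = {p_red_wins:.3f}")
    if p_red_wins > 0.95:                      # confident enough: stop early
        break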
Testing features in the age of privacy laws
Desired tests are often more complex and nuanced than “red vs blue button”
• Longitudinal tracking
• Multiple outcomes
Need to know what behaviour comes from the same user
Paid participant pools
Platforms like Prolific.co, MTurk, Qualtrics Panel (and others!)
Consent forms allow you to track participant behaviour in depth
Participants can be recontacted for follow-up qualitative assessments
Large participant pool, filterable by detailed demographic information
Cons: participants must be paid and know they are taking part in an experiment