Columbus Data & Analytics Wednesdays - June 2024


About This Presentation

Columbus Data & Analytics Wednesdays, June 2024 with Maria Copot


Slide Content

Running experiments - from design to analysis
A/B tests under the hood
Maria Copot - CbusDAW

Who am I?
Researcher at Ohio State (Dept of Linguistics)
Main interest: characterising speakers’ knowledge of the words of their language
Broader interest in quantitative methodologies and statistical analysis

…Why am I here?
We both collect data and run experiments!

The pitch
Digital analysts run experiments, software drives design and implementation
Privacy laws, Cookie Ban™ - consequences for experimental data and design
A peek under the hood of common software, and proposals for branching out

The next half hour of your life
1. The gears of A/B testing software
2. The benefits of Bayesian methods
3. Paid participants - Prolific.co, Mechanical Turk

The goal vs the tools
All statistical analysis and experimentation is in the service of a goal.
Never lose track of the bigger picture!
Is A/B testing what you need?

When to use A/B testing
Comparing pairs of options differing along one dimension
Compare multiple dimensions between pairs of options? Multivariate tests!
Compare holistically different page designs? Redirect tests!

A/B testing
Goal → Hypothesis → Choose variations → Mysterious A/B testing black box (…goblins?) → Results!

The importance of having a hypothesis
Testing for random features might sometimes increase clicks or engagement
but…

The importance of having a hypothesis
Important to start from what you think is missing and why it will improve things.
“Users prefer X to Y because Z”
- Identifying areas for improvement helps come up with tests (X vs Y)
- Both positive and negative outcomes teach us about user motivations (is Z true?)

The importance of having a hypothesis

Results do not speak for themselves
The narrative behind them is crucial for decisions

Statistical significance
A/B test - is conversion rate (CR) higher with a blue or a red button?
Variant      Users   CR
Blue (old)   1000    15%
Red (new)    1000    18%
P-value: 0.006

Statistical significance
Assume no difference between the blue and red button
How likely is the red button observation?
P-value = probability of the observation, assuming no difference
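To make the mechanics concrete, here is a minimal sketch (my own illustration, not the slides' actual tooling) of computing a p-value for the blue-vs-red example with a standard chi-squared test on the conversion counts; the exact value depends on which test and corrections a given A/B platform applies.

```python
# A/B p-value sketch: chi-squared test of independence on conversion counts.
# Numbers are the example from the slide (1000 users per variant, 15% vs 18% CR);
# the resulting p-value depends on the chosen test, so it may differ from 0.006.
from scipy.stats import chi2_contingency

observed = [
    [150, 850],  # Blue (old): converted, did not convert
    [180, 820],  # Red (new):  converted, did not convert
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# Small p = the observed gap would be unlikely if there were truly no difference.
```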

Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)

Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Can lower the threshold for more precision, depending on application and variance in user behaviour.
✨Nothing magical about 0.05 ✨

Statistical significance

Always look at both effect size and p-value!
Effect size: the difference between the means of the two variants
Larger expected effect sizes make it easier to get low p-values

The dangers of working with means

P-values and sample size
Low p-values are facilitated by large sample sizes
Variant      Users   Conversion rate
Blue (old)   10      70%
Red (new)    10      80%
P-value: 0.69

Variant      Users   Conversion rate
Blue (old)   100     70%
Red (new)    100     80%
P-value: 0.05
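A quick way to see this effect (an illustrative sketch, not from the talk): keep the 70% vs 80% conversion rates fixed and let the number of users per variant grow.

```python
# Same conversion rates, different sample sizes -> very different p-values.
# Exact numbers depend on the test used; the downward trend is the point.
from scipy.stats import chi2_contingency

for n in (10, 100, 1000):
    conv_blue, conv_red = round(0.70 * n), round(0.80 * n)
    observed = [
        [conv_blue, n - conv_blue],  # Blue (old)
        [conv_red,  n - conv_red],   # Red (new)
    ]
    _, p_value, _, _ = chi2_contingency(observed)
    print(f"{n:5d} users per variant -> p = {p_value:.3f}")
```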

P-values and sample size
Variant      Users         Conversion rate
Blue (old)   100,000,000   80.000%
Red (new)    100,000,000   80.001%
P-value: 0.03

Power analysis
How many data points are needed to have an 80% chance of discovering an effect of the anticipated size?
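As a hedged sketch of what such a power analysis looks like in practice (using statsmodels, and reusing the earlier 15% vs 18% example rather than any figures from the talk):

```python
# Power analysis: users needed per variant for an 80% chance of detecting
# a lift from 15% to 18% conversion at the conventional alpha = 0.05.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.15, 0.18)  # standardised effect (Cohen's h)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance threshold
    power=0.80,              # 80% chance of detecting the effect if it is real
    ratio=1.0,               # equally sized variants
    alternative="two-sided",
)
print(f"~{round(n_per_variant)} users per variant")
```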

P-values, sample sizes and time
Running assumption: users split 50/50 between variations.
But for certain features, this is not possible.
Example: purchase totals for people using Apple Pay vs manually entering card details.

P-value
Effect size (bigger difference = less overlap between hypotheses)
Sample size (higher = more certainty in the difference)

A/B testing - sneaking in frequentist statistics
Frequentists ask: how likely is the observation if the null hypothesis is true?
Probability of the data given the (null) hypothesis

Bayesian A/B testing
Bayesians ask: how likely is my hypothesis, given the data?

Bayesians vs frequentists
Frequentist estimation: a single unknown number with uncertainty around it, tested against the null hypothesis
Bayesian estimation: the entire distribution of the parameters of interest

Bayesian updating
p(hypothesis | evidence) = p(evidence | hypothesis) × p(hypothesis) / p(evidence)
(posterior = likelihood × prior / marginal)
Out of all the times people press the button, how many were blue vs red?
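For conversion rates, this updating has a convenient closed form under a Beta-Binomial model (a common choice, sketched below; the talk does not prescribe a specific model): a Beta prior plus the observed conversion counts gives a Beta posterior, from which the whole distribution of each variant's rate, and of the difference between them, can be sampled.

```python
# Bayesian A/B sketch: Beta(1, 1) priors updated with the blue/red counts.
# "Updating" is literally adding successes and failures to the prior parameters.
import numpy as np

rng = np.random.default_rng(42)

# Posterior samples for each variant's conversion rate (1000 users per variant).
blue = rng.beta(1 + 150, 1 + 850, size=100_000)  # Blue (old): 150 conversions
red  = rng.beta(1 + 180, 1 + 820, size=100_000)  # Red (new):  180 conversions

# Directly answer the Bayesian question: how likely is red better, given the data?
print("P(red > blue) ≈", (red > blue).mean())
print("median lift ≈", float(np.median(red - blue)))
print("95% interval for the lift:", np.percentile(red - blue, [2.5, 97.5]))
```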


Advantages of Bayesian methods
• Intuitive results (p-values and confidence intervals are misunderstood)
• Reliable even for small sample sizes (no need to pre-define sample sizes)
• No need to estimate the effect size in advance
• Early stopping is allowed (continuous updating; sketched below)
• Faster pipeline to decisions
• Can incorporate domain knowledge through priors
• Estimates the entire distribution
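As a rough illustration of the "early stopping" and "continuous updating" points (my own sketch; the batch sizes, true rates, and 95% decision threshold are arbitrary assumptions, not recommendations from the talk): recompute P(red beats blue) as data accumulates and stop once the answer is decisive.

```python
# Continuous Bayesian monitoring: update the posteriors each "day" and stop
# when one variant is very likely to be better. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(7)
blue_conv = blue_n = red_conv = red_n = 0

for day in range(1, 31):
    # Simulate 200 new users per variant per day, with true rates 15% vs 18%.
    blue_conv += rng.binomial(200, 0.15)
    red_conv  += rng.binomial(200, 0.18)
    blue_n += 200
    red_n  += 200

    # Beta(1, 1) prior + accumulated counts -> posterior samples for each rate.
    blue = rng.beta(1 + blue_conv, 1 + blue_n - blue_conv, size=50_000)
    red  = rng.beta(1 + red_conv,  1 + red_n  - red_conv,  size=50_000)
    p_red_better = (red > blue).mean()
    print(f"day {day:2d}: P(red > blue) = {p_red_better:.3f}")

    if p_red_better > 0.95 or p_red_better < 0.05:
        print("Decision is clear enough - stop the experiment early.")
        break
```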

Testing features in the age of privacy laws
Desired tests are often more complex and nuanced than “red vs blue button”
• Longitudinal tracking
• Multiple outcomes
Need to know what behaviour comes from the same user

Paid participant pools
Platforms like Prolific.co, MTurk, Qualtrics Panel (and others!)
Consent forms allow you to track participant behaviour in depth
Participants can be recontacted for follow-up qualitative assessments
Large participant pool, filterable by detailed demographic information
Cons: participants must be paid and know they are taking part in an experiment

Thank you!
Any further questions:
[email protected]