5 Myths That Stop You from Running More Experiments
About This Presentation
Imagine you're gearing up for a big experiment—maybe it's a new feature, a pricing change, or a redesign. You’re excited, but then someone from leadership says, “Wait, won’t overlapping experiments mess up the results?” A data scientist chimes in, “CUPED will solve our sample size problem, right?” Meanwhile, your team debates whether a holdout group will really help measure long-term impact. Sound familiar? These are just a few of the myths that keep teams from running more experiments, slowing down innovation and decision-making.
In this webinar, Pritul Patel, an experienced data scientist and experimentation platform product manager (Apple, Peacock TV, eBay, Yahoo), will tackle these myths head-on using real-world examples, intuitive math, and visual stats. He’ll explain why ARPU isn’t the north star metric you think it is, why copying your competitor’s CRO tactics won’t guarantee success, and why common interpretations of AA tests often lead to the wrong conclusions. If you’ve ever hesitated to run an experiment because of these concerns, this session will give you the confidence to test more, test smarter, and move faster.
Slide Content
5 Myths That Stop You from Running More Experiments
In partnership with VWO
Pritul Patel
Experimentation Platform Data Scientist and Product Manager, currently freelancing
https://www.linkedin.com/in/pritul-patel
BELIEFS THAT STUNT EXPERIMENTATION VELOCITY
01
OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS
02
ARPU (AOV) AS A NORTH STAR METRIC
03
CUPED FOR LOW SAMPLE SIZE EXPERIMENTS
04
HOLDOUT GROUPS TO MEASURE LONG-TERM IMPACTS
05
CRO IS ADOPTING SUCCESSFUL TACTICS FROM OTHERS
RECAP OF A/B TESTING PROCESS
Prerequisite: you are familiar with the basics of online A/B testing and managing experimentation programs.
● Hypothesis backed with quantitative or qualitative data
● Identify audience targeting and triggering conditions
● Identify the minimum detectable effect and calculate sample size based on the audience targeting and triggering conditions (see the sketch after this list)
● Set up, QA, and launch the experiment
● Monitor for validity threats
● Stop, analyze, deep dive, and share findings
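As a concrete reference for the sample-size step, here is a minimal sketch (not part of the deck) of the standard two-proportion calculation; the 4% baseline conversion rate and 5% relative MDE below are illustrative assumptions.

# Minimal sketch (not from the deck): per-arm sample size for a conversion-rate
# A/B test using the standard two-proportion normal approximation.
from scipy.stats import norm

def sample_size_per_arm(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Users needed per arm to detect a relative lift of `mde_relative`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)  # expected treatment rate
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Illustrative inputs: 4% baseline conversion, 5% relative MDE.
print(round(sample_size_per_arm(0.04, 0.05)))  # roughly 150k users per arm

Because the required sample size grows with the square of a shrinking MDE, the audience targeting and triggering conditions above directly drive how long a test has to run.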
OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS
01
How overlapping works
WHAT EXACTLY IS AN INTERACTION EFFECT?
[Diagram: users in both treatments (T1 + T2) vs. users in both controls (C1 + C2)]
● Experiment design emphasizes controlling all variables except one
● Interaction effects can mask true effects (a sketch of how to estimate them follows this list)
● Successful treatments will be missed due to interference (low win rates)
● “Blue button text + blue background = invisible text!” (This is an interaction, but more precisely it is a case of “conflicting” experiments)
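If you suspect an interaction between two overlapping tests, it can be estimated from the four exposure cells rather than assumed. Below is a minimal sketch, not taken from the deck, with made-up cell sizes and conversion rates purely for illustration.

# Minimal sketch (not from the deck): estimating the interaction term for two
# overlapping experiments from the four cells of a 2x2 factorial layout.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000  # simulated users per cell (illustrative)

# Made-up conversion rates: each treatment adds +0.5pp, with zero true interaction.
rates = {("C1", "C2"): 0.040, ("T1", "C2"): 0.045,
         ("C1", "T2"): 0.045, ("T1", "T2"): 0.050}
means = {cell: rng.binomial(1, p, n).mean() for cell, p in rates.items()}

# Interaction = (lift of T1 when T2 is on) minus (lift of T1 when T2 is off).
lift_t1_with_t2 = means[("T1", "T2")] - means[("C1", "T2")]
lift_t1_without_t2 = means[("T1", "C2")] - means[("C1", "C2")]
interaction = lift_t1_with_t2 - lift_t1_without_t2

print(f"T1 lift with T2 on:    {lift_t1_with_t2:+.4f}")
print(f"T1 lift with T2 off:   {lift_t1_without_t2:+.4f}")
print(f"Estimated interaction: {interaction:+.4f}")  # near zero here

The same quantity is the coefficient on the T1 x T2 product term in a regression on the two assignment indicators; the cell-difference form is used here only to keep the sketch dependency-free beyond NumPy.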
POPULAR OBJECTIONS…
So what do we do? Run sequential experiments or mutually exclusive experiments.
FALSE POSITIVE RISK
Null hypothesis: the coin is fair (50/50 chance of heads or tails)
Extreme result: getting 8 heads in 10 flips
What the p-value means: given the coin is truly fair, what is the chance of seeing a result this extreme?
What most people interpret it as: “There is a 95% chance that the coin is biased.” (Technically, that is what the alternative hypothesis would claim. However, it is still not a guarantee; the coin may not be biased, and this one run just makes it look biased. Real-life example: a fair roulette wheel in 1913 yielded 26 blacks in a row!)
Reality (false positive risk): what is the chance that we say the coin is biased when it is not?
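To make the two questions concrete, here is a minimal sketch, not from the deck, that computes the coin-flip p-value and then a program-level false positive risk; the 15% share of truly effective ideas is an assumption chosen only for illustration.

# Minimal sketch (not from the deck): p-value for the coin example vs. the
# false positive risk (FPR) of an experimentation program.
from scipy.stats import binom

# p-value: chance of a result at least this extreme (>= 8 heads in 10 flips)
# given the coin is truly fair.
p_value = binom.sf(7, 10, 0.5)   # P(X >= 8) = 1 - P(X <= 7)
print(f"p-value: {p_value:.3f}")  # about 0.055

# FPR answers a different question: of all "winners" we declare, how many are
# truly null? It depends on the (assumed) share of ideas that really work.
alpha, power = 0.05, 0.80
prior_true = 0.15  # assumption: 15% of tested ideas have a real effect
false_wins = alpha * (1 - prior_true)
true_wins = power * prior_true
print(f"FPR: {false_wins / (false_wins + true_wins):.2f}")  # about 0.26

A small p-value and a low false positive risk are different guarantees, which is why a significant result is not the same as a 95% chance that the treatment works.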
FPR, Win Rates, and Interactions
VELOCITY WITH MUTUALLY EXCLUSIVE EXPERIMENTS
FACTORS TO CONSIDER
● Do you have the traffic to run mutually exclusive experiments in 2-3 weeks? (See the sketch after this list.)
● Have you considered “affected” traffic in an overlapping setting?
● Do they all have the same start and stop dates?
● Are you mistaking interaction effects for conflicting experiments?
● For interaction effects, have you used Response Surface Methodology, which uses factorial designs?
If the answers to the above are NO, then run more overlapping experiments, with exceptions for conflicts and RSM variants.
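Here is the duration sketch referenced in the first question above: a back-of-the-envelope comparison, not from the deck, of runtimes when traffic is shared versus split into mutually exclusive layers. The traffic and sample-size figures are assumptions for illustration.

# Minimal sketch (not from the deck): how long tests take when traffic is shared
# (overlapping) vs. split into mutually exclusive layers. Numbers are illustrative.
daily_eligible_users = 40_000
required_per_arm = 150_000            # e.g. from the sample-size sketch earlier
users_needed = 2 * required_per_arm   # control + treatment
n_experiments = 4

overlapping_weeks = users_needed / daily_eligible_users / 7
exclusive_weeks = users_needed / (daily_eligible_users / n_experiments) / 7

print(f"Overlapping: each test finishes in ~{overlapping_weeks:.1f} weeks")
print(f"Mutually exclusive ({n_experiments} layers): ~{exclusive_weeks:.1f} weeks each")

Splitting traffic into k mutually exclusive layers divides eligible users by k, so runtimes stretch roughly k-fold; that is the velocity cost being weighed here.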
WOULD YOU RATHER…
…burn months and years chasing perfection, OR sprint towards breakthroughs by embracing the controlled chaos of overlapping experiments?
ARPU (AOV) AS A NORTH STAR METRIC
02
WHY IT MAKES SENSE…
● Direct connection to business goals
● Easily understood by all stakeholders
● Concrete and objective
● Works across contexts / business models
ARPU variability
Push for measuring revenue as the primary KPI despite it being a highly variable metric
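To see the variability argument concretely, here is a minimal sketch, not from the deck, comparing the relative noise (coefficient of variation) of per-user revenue against buyer conversion rate. The 5% buy rate and lognormal order values are assumptions; only the heavy tail matters.

# Minimal sketch (not from the deck): why ARPU is a noisy primary metric compared
# to buyer conversion rate. The spend distribution is made up, but the heavy tail
# is typical: most users spend nothing, a few spend a lot.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
converted = rng.binomial(1, 0.05, n)                            # 5% of users buy
spend = converted * rng.lognormal(mean=3.5, sigma=1.0, size=n)  # heavy-tailed order values

def relative_noise(x):
    """Coefficient of variation (std/mean); required sample size scales with its square."""
    return x.std() / x.mean()

cv_arpu = relative_noise(spend)
cv_conversion = relative_noise(converted)

print(f"CV of ARPU:            {cv_arpu:.1f}")
print(f"CV of conversion rate: {cv_conversion:.1f}")
print(f"Sample-size penalty for ARPU: ~{(cv_arpu / cv_conversion) ** 2:.1f}x")

For the same relative MDE, the required sample size scales with the square of the coefficient of variation, and the penalty typically grows further with heavier spending tails or repeat purchases.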
FLYWHEEL OF DECLINING REVENUE
No winners or losers: because revenue is a highly variable metric, statistics will fail to pick up any signal from the noise with high significance and power. You will be frustrated.
Extreme experiments: out of frustration, you come up with more extreme experiments that start stealing revenue from the future to today by harming user experiences. Now you see more revenue, but at the cost of product enshittification.
Users decline, and so does revenue: the pull of future revenue into today also starts diminishing because the user base is now declining. You spend even more on marketing to get more users in the door, but still continue to measure revenue.
GOODHART’S LAW
FACTORS TO CONSIDER
● Are you a subscription-only business?
● Do you have only a handful of SKUs in your store?
● Do customers purchase only once during the experiment period?
● Is your buyer conversion rate <40%?
If the answers to the above are NO, then use ARPU as a guardrail metric, NOT the primary KPI. Use surrogate KPIs like “buyer conversion rate” as the primary.
CUPED FOR LOW SAMPLE SIZE EXPERIMENTS
03
WHY IT MAKES SENSE…
● For each user in the experiment, use pre-experiment data to see if it correlates with the in-experiment data. If it correlates strongly, reduce that user's contribution to the overall variance calculation by an adjustment factor (sketched below).
● A robust and widely accepted variance reduction technique
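Here is a minimal sketch of the adjustment described above, not taken from the deck; the simulated pre-period and in-experiment metrics are assumptions chosen only to show the mechanics and the 1 - rho^2 variance reduction.

# Minimal sketch (not from the deck) of the core CUPED adjustment: subtract the
# part of each user's metric that pre-experiment data already predicts.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pre = rng.normal(10, 3, n)                    # pre-experiment metric per user (simulated)
experiment = 0.7 * pre + rng.normal(0, 2, n)  # in-experiment metric, correlated with pre

theta = np.cov(experiment, pre)[0, 1] / pre.var()  # regression-style adjustment factor
cuped = experiment - theta * (pre - pre.mean())    # adjusted metric, same mean as before

rho = np.corrcoef(experiment, pre)[0, 1]
print(f"Variance before: {experiment.var():.2f}")
print(f"Variance after:  {cuped.var():.2f}")
print(f"Expected reduction factor (1 - rho^2): {1 - rho ** 2:.2f}")

The gain depends entirely on that correlation, and users with no pre-period data get no adjustment at all, which is exactly what the next slides dig into.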
CUPED IS NOT A MAGIC PILL
FACTORS DRIVING CORRELATIONS & MISSING DATA
Factors affecting low correlations
- User behavior changes (data drift)
- Seasonal patterns
- Product updates
- Market events
- Business model
Factors affecting missing data
- Logged-out user experiments
- Involuntary cookie churn
- Voluntary churn
- User mix shift
- Long experiment durations
Variance reduction with low correlations and missing data
FACTORS TO CONSIDER
● Are you testing logged-in flows only?
● In logged-out flows, are you retargeting visitors through 3rd-party trackers? If so, what percentage of cookies are lost in first-party data? (You may not be able to use 3rd-party trackers to run experiments; first-party cookie loss affects CUPED.)
● Do you have a high percentage of repeat visitors? (e.g. Netflix, Amazon, Instagram, Apple Music)
● Do you know if you have moderate to high correlations between the pre-experiment and experiment periods?
HOLDOUT GROUPS
04
3 TYPES OF HOLDOUTS
● Experiment holdout (reverse test)
● Feature holdout (combine tests for the same feature in one holdout to measure that program's ROI, e.g. personalization)
● Universal holdout (combine ALL tests in one holdout to measure the experimentation program's ROI)
HOLDOUTS ARE BIASED FROM THE GET-GO…
[Diagram: a forward A/B test splits traffic 50% / 50% for 4 weeks; the holdout test then holds back 5% while the remaining 95% receives new forward A/B tests for 13 weeks. The new A/B tests were NOT present during the initial 4-week period.]
Would you reshuffle before the holdout test, OR take 5% of the previous control and hold it back?
HOW USERS IN FEATURE AND UNIVERSAL HOLDOUTS FEEL
CONSIDERATIONS
● Are you constantly running reverse tests? If so, is there anything stopping you from trusting forward A/B tests?
● Do you have the infrastructure to maintain feature and universal holdouts for an extended period of time?
● Have you considered the business impact of running holdouts for long periods?
ADOPTING SUCCESSFUL TACTICS OF OTHERS
05
WHY IT MAKES SENSE…
● For similar competitors, if they have invested millions in identifying what works, then why reinvent the wheel?
● Easy to test client-side web experiments with WYSIWYG editors
● Low cost to execute the test
BUT YOU HAVE NO INSIGHT INTO THEIR…
● … quantitative or qualitative studies
● … experiment lifecycles
● … MDE or sample size requirements
● … tests that were run but NEVER launched, and WHY
THE LEMMING EFFECT
A phenomenon wherein crowds of people, across various fields of life, exhibit a certain kind of behaviour for no reason other than the fact that a majority of their peers do so.
WHY IT HAPPENS
● Benchmarking obsession
● Time pressure to deliver
● Hype from thought leaders
● Ideas distributed in lead magnets
● Webinars…!!!
A NOVEL IDEA QUICKLY BECOMES MEDIOCRE OR SUBPAR
● Exit-intent popups
● “Favorite Shopify Themes”
● Countdown timers
● Chatbots galore
Every website starts looking the same. So where is the moat and USP?
WHAT CAN YOU DO
● An analytics team to provide quantitative insights
● A research team to provide qualitative insights
● A process to convert qualitative and quantitative insights into an idea pipeline
RECAP
01
OVERLAP EXPERIMENTS EVERY TIME, WITH A FEW EXCEPTIONS
02
USE SURROGATE KPIs INSTEAD OF ARPU IN MOST CASES
03
CUPED WORKS, BUT FOR SPECIFIC USE CASES
04
ABORT HOLDOUT GROUPS. USE CAUSAL INFERENCE METHODS FOR PROGRAM ROI
05
DON'T BLINDLY FOLLOW "BEST PRACTICES". CREATE YOUR OWN RIGOROUS IDEA PIPELINE