5 Myths that Stop you from running more experiments

visualwebsiteoptimizer · 89 views · 40 slides · Feb 27, 2025

About This Presentation

Imagine you're gearing up for a big experiment—maybe it's a new feature, a pricing change, or a redesign. You’re excited, but then someone from leadership says, “Wait, won’t overlapping experiments mess up the results?” A data scientist chimes in, “CUPED will solve our sample siz...


Slide Content

5 Myths that Stop You from Running More Experiments
In partnership with VWO
Pritul Patel

Experimentation Platform
Data Scientist and Product Manager
currently freelancing
https://www.linkedin.com/in/pritul-patel

BELIEFS THAT STUNT EXPERIMENTATION VELOCITY

01 OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS

02 ARPU (AOV) AS A NORTH STAR METRIC

03 CUPED FOR LOW SAMPLE SIZE EXPERIMENTS

04 HOLDOUT GROUPS TO MEASURE LONG-TERM IMPACTS

05 CRO IS ADOPTING SUCCESSFUL TACTICS FROM OTHERS

Prerequisite: You are familiar with the basics of online A/B testing and managing experimentation programs

RECAP OF A/B TESTING PROCESS

●Hypothesis backed with quantitative or qualitative data

●Identify audience targeting and triggering conditions

●Identify the minimum detectable effect and calculate sample size based on the audience targeting and triggering conditions

●Set up, QA, and launch the experiment

●Monitor for validity threats

●Stop, analyze, deep dive, and share findings
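The sample-size step above can be sketched with the standard two-proportion power formula. This is a minimal sketch assuming a two-sided z-test; the baseline rate and MDE values below are purely illustrative:

```python
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, rel_mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided, two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + rel_mde)           # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return int(n) + 1                            # round up

# A 5% baseline with a 10% relative MDE needs roughly 31,000 users per arm.
print(sample_size_per_arm(0.05, 0.10))
```

Note how the required n shrinks quadratically as the MDE grows, which is why the audience targeting and triggering conditions (which set the effective baseline rate) matter so much.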

01
OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS

WHAT EXACTLY IS AN INTERACTION EFFECT?

[Diagram: how overlapping works — users are assigned to both experiments independently, producing overlapping cells such as T1T2 vs C1C2]

POPULAR OBJECTIONS…

●Experiment design emphasizes controlling all variables except one

●Interaction effects can mask true effects

●Successful treatments will be missed due to interference (low win rates)

●“Blue button text + blue background = invisible text!”
(This is an interaction, but more precisely it is a case of “conflicting” experiments)

So what do we do?
Run sequential experiments or mutually exclusive experiments

FALSE POSITIVE RISK

Null Hypothesis: the coin is fair (50/50 chance of heads or tails)

Extreme Result: getting 8 heads in 10 flips

What the p-value means: given the coin is truly fair, what is the chance of seeing a result this extreme?

What most people interpret it as: “There is a 95% chance that the coin is biased.”
(Technically, that is what the alternative hypothesis would claim. However, it is still not a guarantee; the coin may not be biased, and this run just happens to look biased. Real-life example: a fair roulette wheel in 1913 yielded 26 blacks in a row!)

Reality (the false positive risk): the chance that we say the coin is biased when it is not.
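The coin example is easy to verify directly. A minimal sketch of the one-sided tail probability under the null hypothesis of a fair coin (doubling it gives the two-sided version):

```python
from math import comb

def p_value_heads(n_flips, n_heads):
    """One-sided p-value: probability of n_heads or more from a fair coin."""
    return sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1)) / 2 ** n_flips

p = p_value_heads(10, 8)
print(f"P(>= 8 heads in 10 flips | fair coin) = {p:.4f}")  # 0.0547
```

So even this "extreme" result clears a 0.05 threshold only marginally on one side, and not at all two-sided; it quantifies surprise under the null, not the probability that the coin is biased.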

FPR, Win-Rates, Interactions

VELOCITY WITH MUTUALLY EXCLUSIVE EXPERIMENTS

FACTORS TO CONSIDER

●Do you have enough traffic for mutually exclusive experiments in 2-3 weeks?

●Have you considered the “affected” traffic in an overlapping setting?

●Do they all have the same start and stop dates?

●Are you mistaking conflicting experiments for interaction effects?

●For interaction effects, have you used Response Surface Methodology, which uses factorial designs?

If the answers to the above are NO, then run more overlapping experiments, with exceptions for conflicts and RSM variants.
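If interaction effects are the real worry, a 2×2 factorial layout lets you estimate them directly as a difference-in-differences of cell means. A simulated sketch under an assumed additive (no-interaction) conversion model; all rates and lifts below are hypothetical:

```python
import random

random.seed(0)

def simulate_user(in_exp1, in_exp2):
    """Hypothetical conversion model: each treatment adds lift
    independently, with NO true interaction term."""
    rate = 0.10 + 0.02 * in_exp1 + 0.03 * in_exp2
    return 1 if random.random() < rate else 0

# Assign users to both experiments independently (overlapping traffic),
# which yields the four factorial cells: C1C2, C1T2, T1C2, T1T2.
cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}
for _ in range(40000):
    a = int(random.random() < 0.5)
    b = int(random.random() < 0.5)
    cells[(a, b)].append(simulate_user(a, b))

mean = {cell: sum(v) / len(v) for cell, v in cells.items()}

# Interaction estimate = difference-in-differences of the cell means.
interaction = (mean[(1, 1)] - mean[(1, 0)]) - (mean[(0, 1)] - mean[(0, 0)])
print(f"estimated interaction: {interaction:+.4f}")
```

When the true effects are additive, the estimate hovers near zero; a large, significant value here is what an actual interaction would look like, and overlapping traffic gives you this check for free.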

WOULD YOU RATHER…

…burn months and years chasing perfection

OR

sprint toward breakthroughs by embracing the controlled chaos of overlapping experiments?

02
ARPU (AOV) AS A NORTH STAR METRIC

WHY IT MAKES SENSE…

●Direct connection to business goals

●Easily understood by all stakeholders

●Concrete and objective

●Works across contexts / business models

ARPU VARIABILITY

FLYWHEEL OF DECLINING REVENUE

01 Push for measuring revenue as the primary KPI
Despite it being a highly variable metric.

02 No winners or losers
Because revenue is a highly variable metric, statistics will fail to pick up any signal from the noise with high significance and power. You will be frustrated.

03 Extreme experiments
Out of frustration, you come up with more extreme experiments that steal revenue from the future by harming user experiences. Now you see more revenue, but at the cost of product enshittification.

04 Users decline, and so does revenue
The future revenue pulled to today starts diminishing because the user base starts declining. Now you spend even more on marketing to get more users in the door, but you still continue to measure revenue.

GOODHART’S LAW

FACTORS TO CONSIDER

●Are you a subscription-only business?

●Do you have only a handful of SKUs in your store?

●Do customers purchase only once during the experiment period?

●Is your buyer conversion rate <40%?

If the answers to the above are NO, then use ARPU as a guardrail metric, NOT the primary KPI. Use surrogate KPIs like “buyer conversion rate” as primary.
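The variance penalty of ARPU can be made concrete: required sample size scales with the squared coefficient of variation, and per-user revenue (mostly zeros plus heavy-tailed order values) has a much larger CV than a binary conversion flag. A simulated sketch with hypothetical rates and order-value distributions:

```python
import random
from statistics import mean, pstdev

random.seed(1)

# Hypothetical store: 5% of visitors buy; order values are lognormal (heavy-tailed).
users = 50000
converted = [1 if random.random() < 0.05 else 0 for _ in range(users)]
revenue = [random.lognormvariate(4, 1.0) if c else 0.0 for c in converted]

# Coefficient of variation (sd / mean) for each candidate primary metric.
cv_conv = pstdev(converted) / mean(converted)
cv_rev = pstdev(revenue) / mean(revenue)

# Per-arm sample size scales with (CV / relative MDE)^2, so the squared
# CV ratio approximates how many times more traffic ARPU needs.
print(f"conversion CV: {cv_conv:.1f}, ARPU CV: {cv_rev:.1f}")
print(f"ARPU needs roughly {(cv_rev / cv_conv) ** 2:.1f}x the sample size")
```

With fatter-tailed order values than this toy distribution, the multiplier only gets worse, which is the statistical core of the flywheel above.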

03
CUPED FOR LOW SAMPLE SIZE EXPERIMENTS

WHY IT MAKES SENSE…

●For each user in the experiment, use pre-experiment data to see if it correlates with the experiment data. If it correlates strongly, reduce that user's contribution to the overall variance calculation by a factor.

●A robust and widely accepted variance-reduction technique
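The adjustment itself is one line. A minimal CUPED sketch on simulated data, assuming a strong pre/post correlation; theta is the usual cov(X, Y) / var(X), and all the numbers here are illustrative:

```python
import random
from statistics import mean, variance

random.seed(2)

# Hypothetical metric: pre-period value X strongly predicts in-experiment value Y.
n = 5000
x = [random.gauss(10, 3) for _ in range(n)]      # pre-experiment metric per user
y = [0.8 * xi + random.gauss(2, 2) for xi in x]  # in-experiment metric per user

x_bar, y_bar = mean(x), mean(y)
theta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)

# CUPED-adjusted metric: subtract the part of Y explained by pre-period data.
y_cuped = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

reduction = 1 - variance(y_cuped) / variance(y)
print(f"variance reduced by {reduction:.0%}")  # approximately rho^2
```

The reduction equals the squared pre/post correlation, which is exactly why the factors below (low correlations, missing pre-period data) determine whether CUPED helps at all.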

CUPED IS NOT A MAGIC PILL

FACTORS DRIVING LOW CORRELATIONS & MISSING DATA

Factors affecting low correlations:

- User behavior changes (data drift)

- Seasonal patterns

- Product updates

- Market events

- Business model

Factors affecting missing data:

- Logged-out user experiments

- Involuntary cookie churn

- Voluntary churn

- User mix shift

- Long experiment durations

Variance reduction with low correlations and missing data

FACTORS TO CONSIDER

●Are you testing logged-in flows only?

●In logged-out flows, are you retargeting visitors through 3rd-party trackers? If so, what percentage of cookie loss is in your first-party data?
(You may not be able to use 3rd-party trackers to run experiments. First-party cookie loss affects CUPED.)

●Do you have a high percentage of repeat visitors? (e.g. Netflix, Amazon, Instagram, Apple Music)

●Do you know if you have moderate-to-high correlations between the pre-experiment and experiment periods?

04
HOLDOUT GROUPS

3 TYPES OF HOLDOUTS

●Experiment holdout (reverse test)

●Feature holdout (combine tests for the same feature in one holdout to measure that program's ROI, e.g. personalization)

●Universal holdout (combine ALL tests in one holdout to measure the experimentation program's ROI)

HOLDOUTS ARE BIASED FROM THE GET-GO…

[Diagram: a forward A/B test runs 50% / 50% for 4 weeks; the holdout test then runs 5% held out vs 95% exposed to new forward A/B tests for 13 weeks. The new A/B tests were NOT present during the 4-week period.]

Would you reshuffle before the holdout test, OR take 5% of the previous control and hold it back?

HOW USERS IN FEATURE AND UNIVERSAL HOLDOUTS FEEL

CONSIDERATIONS

●Are you constantly running reverse tests? If so, is there anything stopping you from trusting forward A/B tests?

●Do you have the infrastructure to maintain feature and universal holdouts for an extended period of time?

●Have you considered the business impact of running holdouts for long?

05
ADOPTING SUCCESSFUL TACTICS OF OTHERS

WHY IT MAKES SENSE…

●For similar competitors, if they’ve invested millions in identifying what works, why reinvent the wheel?

●Easy to test client-side web experiments with WYSIWYG editors

●Low cost to execute the test

BUT YOU HAVE NO INSIGHT INTO THEIR…

●… quantitative or qualitative studies

●… experiment lifecycles

●… MDE or sample size requirements

●… tests that were run but NEVER launched, and WHY

THE LEMMING EFFECT

A phenomenon wherein crowds of people, across various fields of life, exhibit a certain kind of behaviour for no reason other than the fact that a majority of their peers do so.

WHY IT HAPPENS

●Benchmarking obsession

●Time pressure to deliver

●Hype from thought leaders

●Ideas distributed in lead magnets

●Webinars…!!!

A NOVEL IDEA QUICKLY BECOMES MEDIOCRE OR SUBPAR

●Exit-intent popups

●“Favorite Shopify Themes”

●Countdown timers

●Chatbots galore

Every website starts looking the same. So where is the moat and the USP?

WHAT CAN YOU DO

●Analytics team to provide quantitative insights

●Research team to provide qualitative insights

●A process to convert qualitative and quantitative insights into an idea pipeline

RECAP

01 OVERLAP EXPERIMENTS EVERY TIME, WITH A FEW EXCEPTIONS

02 USE SURROGATE KPIs INSTEAD OF ARPU IN MOST CASES

03 CUPED WORKS, BUT FOR SPECIFIC USE CASES

04 ABORT HOLDOUT GROUPS. USE CAUSAL INFERENCE METHODS FOR PROGRAM ROI

05 DON’T BLINDLY FOLLOW “BEST PRACTICES”. CREATE YOUR OWN RIGOROUS IDEA PIPELINE

Questions?