5 Myths That Stop You from Running More Experiments
About This Presentation
Imagine you're gearing up for a big experiment—maybe it's a new feature, a pricing change, or a redesign. You’re excited, but then someone from leadership says, “Wait, won’t overlapping experiments mess up the results?” A data scientist chimes in, “CUPED will solve our sample size problem, right?” Meanwhile, your team debates whether a holdout group will really help measure long-term impact. Sound familiar? These are just a few of the myths that keep teams from running more experiments, slowing down innovation and decision-making.
In this webinar, Pritul Patel, an experienced data scientist and experimentation platform product manager (Apple, Peacock TV, eBay, Yahoo), will tackle these myths head-on using real-world examples, intuitive math, and visual stats. He’ll explain why ARPU isn’t the north star metric you think it is, why copying your competitor’s CRO tactics won’t guarantee success, and why common interpretations of AA tests often lead to the wrong conclusions. If you’ve ever hesitated to run an experiment because of these concerns, this session will give you the confidence to test more, test smarter, and move faster.
Slide Content
5 Myths That Stop You from Running More Experiments
In partnership with VWO
Pritul Patel
Experimentation Platform Data Scientist and Product Manager, currently freelancing
https://www.linkedin.com/in/pritul-patel
BELIEFS THAT STUNT EXPERIMENTATION VELOCITY
01
OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS
02
ARPU (AOV) AS A NORTH STAR METRIC
03
CUPED FOR LOW SAMPLE SIZE EXPERIMENTS
04
HOLDOUT GROUPS TO MEASURE LONG-TERM IMPACTS
05
CRO IS ADOPTING SUCCESSFUL TACTICS FROM OTHERS
RECAP OF A/B TESTING PROCESS
Prerequisite: you are familiar with the basics of online A/B testing and managing experimentation programs.
● Hypothesis backed with quantitative or qualitative data
● Identify audience targeting and triggering conditions
● Identify the minimum detectable effect and calculate sample size based on the audience targeting and triggering conditions (see the sketch after this list)
● Set up, QA, and launch the experiment
● Monitor for validity threats
● Stop, analyze, deep dive, and share findings
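As a concrete reference for the sample-size step, here is a minimal sketch (not part of the deck) of the standard two-proportion calculation; the 4% baseline conversion rate and 5% relative MDE below are illustrative assumptions.

# Minimal sketch (not from the deck): per-arm sample size for a conversion-rate
# A/B test using the standard two-proportion normal approximation.
from scipy.stats import norm

def sample_size_per_arm(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Users needed per arm to detect a relative lift of `mde_relative`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)  # expected treatment rate
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Illustrative inputs: 4% baseline conversion, 5% relative MDE.
print(round(sample_size_per_arm(0.04, 0.05)))  # roughly 150k users per arm

Because the required sample size grows with the square of a shrinking MDE, the audience targeting and triggering conditions above directly drive how long a test has to run.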
OVERLAPPING EXPERIMENTS CAUSE INTERACTION EFFECTS
01
How overlapping works
WHAT EXACTLY IS AN INTERACTION EFFECT?
[Diagram: users in both treatments (T1 + T2) vs. users in both controls (C1 + C2)]
● Experiment design emphasizes controlling all variables except one
● Interaction effects can mask true effects (a sketch of how to estimate them follows this list)
● Successful treatments will be missed due to interference (low win rates)
● “Blue button text + blue background = invisible text!” (This is an interaction, but more precisely it is a case of “conflicting” experiments)
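If you suspect an interaction between two overlapping tests, it can be estimated from the four exposure cells rather than assumed. Below is a minimal sketch, not taken from the deck, with made-up cell sizes and conversion rates purely for illustration.

# Minimal sketch (not from the deck): estimating the interaction term for two
# overlapping experiments from the four cells of a 2x2 factorial layout.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000  # simulated users per cell (illustrative)

# Made-up conversion rates: each treatment adds +0.5pp, with zero true interaction.
rates = {("C1", "C2"): 0.040, ("T1", "C2"): 0.045,
         ("C1", "T2"): 0.045, ("T1", "T2"): 0.050}
means = {cell: rng.binomial(1, p, n).mean() for cell, p in rates.items()}

# Interaction = (lift of T1 when T2 is on) minus (lift of T1 when T2 is off).
lift_t1_with_t2 = means[("T1", "T2")] - means[("C1", "T2")]
lift_t1_without_t2 = means[("T1", "C2")] - means[("C1", "C2")]
interaction = lift_t1_with_t2 - lift_t1_without_t2

print(f"T1 lift with T2 on:    {lift_t1_with_t2:+.4f}")
print(f"T1 lift with T2 off:   {lift_t1_without_t2:+.4f}")
print(f"Estimated interaction: {interaction:+.4f}")  # near zero here

The same quantity is the coefficient on the T1 x T2 product term in a regression on the two assignment indicators; the cell-difference form is used here only to keep the sketch dependency-free beyond NumPy.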
POPULAR OBJECTIONS…
So what do we do? Run sequential experiments or mutually exclusive experiments.
FALSE POSITIVE RISK
Null hypothesis: the coin is fair (50/50 chance of heads or tails)
Extreme result: getting 8 heads in 10 flips
What the p-value means: given the coin is truly fair, what is the chance of seeing a result this extreme?
What most people interpret it as: “There is a 95% chance that the coin is biased.” (Technically, that is what the alternative hypothesis would claim. However, it is still not a guarantee; the coin may not be biased, and this one run just makes it look biased. Real-life example: a fair roulette wheel in 1913 yielded 26 blacks in a row!)
Reality (false positive risk): what is the chance that we say the coin is biased when it is not?
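To make the two questions concrete, here is a minimal sketch, not from the deck, that computes the coin-flip p-value and then a program-level false positive risk; the 15% share of truly effective ideas is an assumption chosen only for illustration.

# Minimal sketch (not from the deck): p-value for the coin example vs. the
# false positive risk (FPR) of an experimentation program.
from scipy.stats import binom

# p-value: chance of a result at least this extreme (>= 8 heads in 10 flips)
# given the coin is truly fair.
p_value = binom.sf(7, 10, 0.5)   # P(X >= 8) = 1 - P(X <= 7)
print(f"p-value: {p_value:.3f}")  # about 0.055

# FPR answers a different question: of all "winners" we declare, how many are
# truly null? It depends on the (assumed) share of ideas that really work.
alpha, power = 0.05, 0.80
prior_true = 0.15  # assumption: 15% of tested ideas have a real effect
false_wins = alpha * (1 - prior_true)
true_wins = power * prior_true
print(f"FPR: {false_wins / (false_wins + true_wins):.2f}")  # about 0.26

A small p-value and a low false positive risk are different guarantees, which is why a significant result is not the same as a 95% chance that the treatment works.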
FPR, Win Rates, and Interactions
VELOCITY WITH MUTUALLY EXCLUSIVE EXPERIMENTS
FACTORS TO CONSIDER
● Do you have the traffic to run mutually exclusive experiments in 2-3 weeks? (See the sketch after this list.)
● Have you considered “affected” traffic in an overlapping setting?
● Do they all have the same start and stop dates?
● Are you mistaking interaction effects for conflicting experiments?
● For interaction effects, have you used Response Surface Methodology, which uses factorial designs?
If the answers to the above are NO, then run more overlapping experiments, with exceptions for conflicts and RSM variants.
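Here is the duration sketch referenced in the first question above: a back-of-the-envelope comparison, not from the deck, of runtimes when traffic is shared versus split into mutually exclusive layers. The traffic and sample-size figures are assumptions for illustration.

# Minimal sketch (not from the deck): how long tests take when traffic is shared
# (overlapping) vs. split into mutually exclusive layers. Numbers are illustrative.
daily_eligible_users = 40_000
required_per_arm = 150_000            # e.g. from the sample-size sketch earlier
users_needed = 2 * required_per_arm   # control + treatment
n_experiments = 4

overlapping_weeks = users_needed / daily_eligible_users / 7
exclusive_weeks = users_needed / (daily_eligible_users / n_experiments) / 7

print(f"Overlapping: each test finishes in ~{overlapping_weeks:.1f} weeks")
print(f"Mutually exclusive ({n_experiments} layers): ~{exclusive_weeks:.1f} weeks each")

Splitting traffic into k mutually exclusive layers divides eligible users by k, so runtimes stretch roughly k-fold; that is the velocity cost being weighed here.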
WOULD YOU RATHER…
…burn months and years chasing perfection, OR sprint towards breakthroughs by embracing the controlled chaos of overlapping experiments?
ARPU (AOV) AS A NORTH STAR METRIC
02
WHY IT MAKES SENSE…
● Direct connection to business goals
● Easily understood by all stakeholders
● Concrete and objective
● Works across contexts / business models
ARPU variability
Push for measuring revenue as the primary KPI despite it being a highly variable metric
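To see the variability argument concretely, here is a minimal sketch, not from the deck, comparing the relative noise (coefficient of variation) of per-user revenue against buyer conversion rate. The 5% buy rate and lognormal order values are assumptions; only the heavy tail matters.

# Minimal sketch (not from the deck): why ARPU is a noisy primary metric compared
# to buyer conversion rate. The spend distribution is made up, but the heavy tail
# is typical: most users spend nothing, a few spend a lot.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
converted = rng.binomial(1, 0.05, n)                            # 5% of users buy
spend = converted * rng.lognormal(mean=3.5, sigma=1.0, size=n)  # heavy-tailed order values

def relative_noise(x):
    """Coefficient of variation (std/mean); required sample size scales with its square."""
    return x.std() / x.mean()

cv_arpu = relative_noise(spend)
cv_conversion = relative_noise(converted)

print(f"CV of ARPU:            {cv_arpu:.1f}")
print(f"CV of conversion rate: {cv_conversion:.1f}")
print(f"Sample-size penalty for ARPU: ~{(cv_arpu / cv_conversion) ** 2:.1f}x")

For the same relative MDE, the required sample size scales with the square of the coefficient of variation, and the penalty typically grows further with heavier spending tails or repeat purchases.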
FLYWHEEL OF DECLINING REVENUE
No winners or losers: because revenue is a highly variable metric, statistics will fail to pick up any signal from the noise with high significance and power. You will be frustrated.
Extreme experiments: out of frustration, you come up with more extreme experiments that start stealing revenue from the future to today by harming user experiences. Now you see more revenue, but at the cost of product enshittification.
Users decline, and so does revenue: the pull of future revenue into today also starts diminishing because the user base is now declining. You spend even more on marketing to get more users in the door, but still continue to measure revenue.
GOODHART’S LAW
FACTORS TO CONSIDER
● Are you a subscription-only business?
● Do you have only a handful of SKUs in your store?
● Do customers purchase only once during the experiment period?
● Is your buyer conversion rate <40%?
If the answers to the above are NO, then use ARPU as a guardrail metric, NOT the primary KPI. Use surrogate KPIs like “buyer conversion rate” as the primary.
CUPED FOR LOW SAMPLE SIZE EXPERIMENTS
03
WHY IT MAKES SENSE…
● For each user in the experiment, use pre-experiment data to see if it correlates with the in-experiment data. If it correlates strongly, reduce that user's contribution to the overall variance calculation by an adjustment factor (sketched below).
● A robust and widely accepted variance reduction technique
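Here is a minimal sketch of the adjustment described above, not taken from the deck; the simulated pre-period and in-experiment metrics are assumptions chosen only to show the mechanics and the 1 - rho^2 variance reduction.

# Minimal sketch (not from the deck) of the core CUPED adjustment: subtract the
# part of each user's metric that pre-experiment data already predicts.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pre = rng.normal(10, 3, n)                    # pre-experiment metric per user (simulated)
experiment = 0.7 * pre + rng.normal(0, 2, n)  # in-experiment metric, correlated with pre

theta = np.cov(experiment, pre)[0, 1] / pre.var()  # regression-style adjustment factor
cuped = experiment - theta * (pre - pre.mean())    # adjusted metric, same mean as before

rho = np.corrcoef(experiment, pre)[0, 1]
print(f"Variance before: {experiment.var():.2f}")
print(f"Variance after:  {cuped.var():.2f}")
print(f"Expected reduction factor (1 - rho^2): {1 - rho ** 2:.2f}")

The gain depends entirely on that correlation, and users with no pre-period data get no adjustment at all, which is exactly what the next slides dig into.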
CUPED IS NOT A MAGIC PILL
FACTORS DRIVING CORRELATIONS & MISSING DATA
Factors affecting low correlations
- User behavior changes (data drift)
- Seasonal patterns
- Product updates
- Market events
- Business model
Factors affecting missing data
- Logged-out user experiments
- Involuntary cookie churn
- Voluntary churn
- User mix shift
- Long experiment durations
Variance reduction with low correlations and missing data
FACTORS TO CONSIDER
● Are you testing logged-in flows only?
● In logged-out flows, are you retargeting visitors through 3rd-party trackers? If so, what percentage of cookies are lost in first-party data? (You may not be able to use 3rd-party trackers to run experiments; first-party cookie loss affects CUPED.)
● Do you have a high percentage of repeat visitors? (e.g. Netflix, Amazon, Instagram, Apple Music)
● Do you know if you have moderate to high correlations between the pre-experiment and experiment periods?
HOLDOUT GROUPS
04
3 TYPES OF HOLDOUTS
● Experiment holdout (reverse test)
● Feature holdout (combine tests for the same feature in one holdout to measure that program's ROI, e.g. personalization)
● Universal holdout (combine ALL tests in one holdout to measure the experimentation program's ROI)
HOLDOUTS ARE BIASED FROM THE GET-GO…
[Diagram: a forward A/B test splits traffic 50% / 50% for 4 weeks; the holdout test then holds back 5% while the remaining 95% receives new forward A/B tests for 13 weeks. The new A/B tests were NOT present during the initial 4-week period.]
Would you reshuffle before the holdout test, OR take 5% of the previous control and hold it back?
HOW USERS IN FEATURE AND UNIVERSAL HOLDOUTS FEEL
CONSIDERATIONS
● Are you constantly running reverse tests? If so, is there anything stopping you from trusting forward A/B tests?
● Do you have the infrastructure to maintain feature and universal holdouts for an extended period of time?
● Have you considered the business impact of running holdouts for long periods?
ADOPTING SUCCESSFUL TACTICS OF OTHERS
05
WHY IT MAKES SENSE…
● For similar competitors, if they have invested millions in identifying what works, then why reinvent the wheel?
● Easy to test client-side web experiments with WYSIWYG editors
● Low cost to execute the test
BUT YOU HAVE NO INSIGHT INTO THEIR…
● … quantitative or qualitative studies
● … experiment lifecycles
● … MDE or sample size requirements
● … tests that were run but NEVER launched, and WHY
THE LEMMING EFFECT
A phenomenon wherein crowds of people, across various fields of life, exhibit a certain kind of behaviour for no reason other than the fact that a majority of their peers do so.
WHY IT HAPPENS
● Benchmarking obsession
● Time pressure to deliver
● Hype from thought leaders
● Ideas distributed in lead magnets
● Webinars…!!!
A NOVEL IDEA QUICKLY BECOMES MEDIOCRE OR SUBPAR
● Exit-intent popups
● “Favorite Shopify Themes”
● Countdown timers
● Chatbots galore
Every website starts looking the same. So where is the moat and USP?
WHAT CAN YOU DO
● An analytics team to provide quantitative insights
● A research team to provide qualitative insights
● A process to convert qualitative and quantitative insights into an idea pipeline
RECAP
01
OVERLAP EXPERIMENTS EVERY TIME, WITH A FEW EXCEPTIONS
02
USE SURROGATE KPIs INSTEAD OF ARPU IN MOST CASES
03
CUPED WORKS, BUT FOR SPECIFIC USE CASES
04
ABORT HOLDOUT GROUPS. USE CAUSAL INFERENCE METHODS FOR PROGRAM ROI
05
DON'T BLINDLY FOLLOW "BEST PRACTICES". CREATE YOUR OWN RIGOROUS IDEA PIPELINE