2011: Daryl Bem published a paper in JPSP (Journal of Personality and Social Psychology, a top journal) claiming to have found evidence that ESP (extrasensory perception) exists
Post-publication, the significant effects were
attributed to an excessive familywise error
rate
Bem did not correct alpha for multiple comparisons (e.g., through a Bonferroni correction; see the sketch below)
His evidence for ESP is chalked up to false
positives (i.e., Type I errors)
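A minimal sketch (Python; the numbers of tests are illustrative, not Bem's actual count) of why uncorrected multiple comparisons inflate the familywise error rate, and how a Bonferroni correction compensates:

# Familywise error rate (FWER) for m independent tests at an uncorrected alpha,
# and the Bonferroni-corrected per-test alpha that keeps the FWER near .05
alpha = 0.05
for m in (1, 5, 10):
    fwer = 1 - (1 - alpha) ** m      # P(at least one false positive across m tests)
    bonferroni_alpha = alpha / m     # per-test alpha after Bonferroni correction
    print(f"m={m:2d}  FWER={fwer:.3f}  Bonferroni alpha={bonferroni_alpha:.4f}")

With 10 uncorrected tests, the chance of at least one false positive is roughly 40%, which is why a handful of significant results across many uncorrected tests is weak evidence.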
2011: Cases of scientific fraud
Diederik Stapel busted for
fabricating data
2011-present: Major findings in
social psychology not replicated
Ego depletion, embodied cognition,
power posing
Behavioural priming under fire
2015: The Open Science Collaboration failed to replicate most of the studies it attempted to reproduce
Examples of prominent findings that have not replicated:
Strack: a fake (posed) smile boosts actual happiness
Cuddy: power posing boosts testosterone and decreases cortisol
Replicability of results is essential for science
If results do not replicate, how can we be sure that
they exist at all?
Open Science Collaboration attempted to replicate 100
studies published in 3 top psychology journals in 2008
Replications used materials supplied by original
authors and were high-powered
Results: 39% of the original studies were successfully
replicated
25% of social psychology studies replicated
50% of cognitive psychology studies replicated
Effect sizes overestimated in original studies
Possible reasons why results fail to replicate:
Sampling variability: even with a direct replication, no two samples are exactly the same
Hidden moderators across studies
E.g., different cultural contexts, time in
history, demographic characteristics
Low statistical power
False positives in original study
False negatives in replication study
Traditionally, journals were biased in favour of
publishing:
1. Significant findings
Non-significant findings relegated to a
researcher’s “file drawer”
2. Novel findings
Counterintuitive, surprising findings more likely to
be published in top journals
Academics are incentivized to publish
“flashy” results in top journals
Doing so earns jobs, promotions, editorships at journals, traditional media coverage, and esteem
Academics may engage in questionable research practices or, in the worst-case scenario, outright data fraud, all to drive p-values below .05
This results in a plethora of false positives in the research literature that do not replicate
False positives are Type I errors
Claim an effect exists when it actually doesn’t
i.e., incorrectly rejecting a null hypothesis that is actually true (the correct conclusion would have been that there is no effect)
Setting alpha at .05 means you are willing to accept a 5% chance of a false positive when the null hypothesis is true (see the simulation sketch after this list)
False negatives are Type II errors
Claim an effect doesn’t exist when it actually does
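A small simulation sketch (Python with NumPy/SciPy; sample sizes are illustrative) showing that when the null hypothesis is true, about 5% of tests come out significant at alpha = .05:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 10_000, 30
false_positives = 0
for _ in range(n_sims):
    # Both groups are drawn from the same population, so the null is true by construction
    a, b = rng.normal(size=n), rng.normal(size=n)
    if stats.ttest_ind(a, b).pvalue < .05:
        false_positives += 1
print(false_positives / n_sims)  # close to .05, the nominal Type I error rate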
Simmons, Nelson, & Simonsohn (2011) demonstrated
that it’s easy to statistically support a hypothesis that is
actually false (i.e., find a false positive)
Can be dangerous
E.g., health research: claim that a treatment is effective
when actually it isn’t, or even has negative side effects
Wastes resources
Researchers waste time, effort, and money conducting
research on effects that don’t actually exist
Hard to excise false positives from the literature
Not enough incentives for researchers to conduct
replications that debunk the false positives
Erodes the credibility of psychological science
“Fake science”
Researchers may engage in questionable
research practices not because they
intend to be dishonest
Rather, they are motivated to make
decisions that support their hypotheses
They make decisions so that the p-value falls below .05 and statistical significance can be claimed
Such practices are pervasive among researchers
But with every self-serving decision made,
the chance of false positives increases
Researchers have many decisions to make
when conducting a study:
Choosing the sample size and deciding when to stop data collection
How to deal with outliers/illegitimate responses
Which conditions/groups should be compared
Creating variables
Which items? Transformations?
Which variables should be included in analyses
IVs, DVs, controls, mediators, moderators
Making each decision increases the false
positive rate
It is common for researchers to collect data, conduct analyses, and, if the result is not significant, collect more data
Data collection stops when p < .05
The more often you test for significance, the higher the likelihood of a false positive
The lower the initial sample size and the fewer participants added in each subsequent round of data collection, the higher the likelihood of a false positive (see the simulation sketch below)
Lakens: sequential analysis permits planned interim tests when the alpha level is adjusted for the number of looks
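A simulation sketch of the optional-stopping problem described above (assumed, illustrative rules: start with 20 per group, test, and add 10 per group up to a maximum of 100 whenever p >= .05), showing the false positive rate rising above 5% even though no true effect exists:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, start_n, step, max_n = 5_000, 20, 10, 100
hits = 0
for _ in range(n_sims):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))       # same population: the null is true
    while True:
        if stats.ttest_ind(a, b).pvalue < .05:
            hits += 1                         # stop and "publish" as soon as p < .05
            break
        if len(a) >= max_n:
            break                             # give up at the maximum sample size
        a.extend(rng.normal(size=step))       # otherwise, collect more data and retest
        b.extend(rng.normal(size=step))
print(hits / n_sims)  # noticeably above the nominal .05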
Researchers may increase the false positive rate even when
not engaging in questionable research practices
They have multiple choices to make when analysing data
depending on the data at hand
The choices follow from their hypotheses, and are not an
egregious “p-hacking/data fishing” expedition
However, the fact that there are so many analytic paths to take increases the researcher degrees of freedom and the likelihood of false positives (see the sketch below)
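A sketch of how analytic flexibility alone inflates false positives, assuming the simplest case of two outcome measures where whichever one "works" is reported:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n = 10_000, 40
hits = 0
for _ in range(n_sims):
    # No true effect on either DV; all groups come from the same population
    dv1_a, dv1_b = rng.normal(size=n), rng.normal(size=n)
    dv2_a, dv2_b = rng.normal(size=n), rng.normal(size=n)
    p1 = stats.ttest_ind(dv1_a, dv1_b).pvalue
    p2 = stats.ttest_ind(dv2_a, dv2_b).pvalue
    if min(p1, p2) < .05:                 # report whichever DV happens to be significant
        hits += 1
print(hits / n_sims)  # about 1 - .95**2 ≈ .10, double the nominal rate

Each additional flexible choice (more DVs, optional covariates, optional exclusions) multiplies the number of chances to cross p < .05.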
Small sample sizes and measurement error make it likely that
results will not replicate
Can we trust any published findings anymore?
What can we do to reduce false positives?
P-hacking: “Data fishing expedition”
Try different types of analyses until p-value is
driven below .05
P-curve analysis attempts to detect
presence of p-hacking and publication
bias/file drawer problem (Simonsohn, Nelson, &
Simmons, 2014)
P-curve is the distribution of significant p-
values in a body of research
P-curve for p-hacked data will be left-skewed (tail on left side)
More p-values around .04 or .05 than .01 or .02
P-curve for non-p-hacked data will be right-skewed (tail on
right side)
More p-values around .01 or .02 than .04 or .05
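A minimal sketch (entirely hypothetical p-values, not from any real literature) of tallying a p-curve by binning the significant p-values, as described above:

from collections import Counter

# Hypothetical significant p-values from two bodies of research
evidential = [.004, .008, .011, .013, .019, .024, .031, .042]  # right-skewed shape
p_hacked   = [.021, .033, .038, .041, .044, .046, .048, .049]  # left-skewed shape

def p_curve(p_values, width=.01):
    # Count significant p-values falling in each bin of size width between 0 and .05
    bins = Counter(int(p // width) for p in p_values if p < .05)
    return {f"{i * width:.2f}-{(i + 1) * width:.2f}": bins.get(i, 0) for i in range(5)}

print(p_curve(evidential))  # counts pile up near .01: evidential value
print(p_curve(p_hacked))    # counts pile up near .05: consistent with p-hacking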
HARKing: Hypothesizing After the Results are Known
Look at the data first, create a post-hoc hypothesis, and present it as if it had been developed a priori
State your hypotheses before collecting data
Make your hypotheses as clear as possible
in your dissertation proposals
Should be directional, comprehensive yet
parsimonious
Bad hypothesis: “There will be a significant
difference in collectivism between Americans
and Indians” (non-directional)
Good hypothesis: “Indians will be significantly
higher in interdependence than Americans”
Even better: build a process model in your
hypotheses (mediation and/or moderation)
Example: “Indians will be significantly higher in
interdependence than Americans and, in turn, will
demonstrate a more holistic cognitive style”
Researchers need to disclose their degrees of
freedom so reviewers/other researchers can fully
evaluate their work
Preregistration
List all variables, materials, hypotheses, and analysis plans before collecting data
Eliminates selective reporting of results, HARKing
Materials and procedure archived online (e.g., Open Science Framework: https://osf.io)
Open access to data
Public repository of findings to reduce the file drawer
problem
Simmons et al. (2011) suggest at least N =
20 per group, but this isn’t large enough
Within-subject designs rather than between-
subjects designs
More power for the same number of
participants
Cohen’s effect size guidelines:
Effect size as a correlation (r): small = .10, medium = .30, large = .50
Effect size as a mean difference (d): small = 0.2, medium = 0.5, large = 0.8
Small effect sizes can only be
accurately detected with high
statistical power
Download G*Power
http://www.gpower.hhu.de/en.html
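A sketch of an a priori power calculation in Python using statsmodels (an alternative to G*Power; assumed design: two-sided independent-samples t-test, alpha = .05, 80% power):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    # Solve for the required sample size per group
    n_per_group = analysis.solve_power(effect_size=d, alpha=.05, power=.80,
                                       alternative="two-sided")
    print(f"{label} effect (d = {d}): ~{n_per_group:.0f} participants per group")

Detecting a small effect with 80% power takes roughly 400 participants per group, far more than the N = 20 per group minimum mentioned above.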
Exploratory work is necessary and important
However, it should be stated in the paper that the
work was exploratory. Do not try to pass off
exploratory work as confirmatory.
Best to do an exact replication of the exploratory
work to see if it can be confirmed
But this is resource- and time-intensive
Two studies: first is exploratory but still based on
theory; second is confirmatory and preregistered
Behavioural (or social) priming: exposing people to incidental cues/primes
(e.g., words, pictures) influences other behaviour without awareness
E.g., expose participants to words related to the elderly and they tend to
walk slower (Bargh, Chen, & Burrows, 1996)
N = 60
Doyen et al. (2012) were able to replicate the effect only when the
experimenters expected participants to walk slower; did not replicate when
experimenters did not expect this. Demand characteristics play a role.
Study 1: N = 120
Within-subject design; supraliminal and subliminal
presentation of primes
Across 6 studies, N = 988 (high power)
Study 1: N = 153 college students
Study 2: N = 219 MTurk users
Study 3: N = 115 MTurk users
Found that primes influenced gambling decisions
So it may be premature to declare that behavioural
priming doesn’t exist. Just need powerful research
designs to detect effects.
Replication should be just as important as
innovation in science
Registered replication reports
Science is a continual process of updating what we know, self-correcting as we go along: innovation, a failure to replicate, figuring out why, more innovation, replication
Psychology has changed in the last 5-6 years. Open science is becoming the norm. Social media has been instrumental. No excuses!