Rubin (2024) - Questioning Some Metascience Assumptions - Slides


About This Presentation

Metascience uses a scientific approach to understand and improve the scientific approach! In this presentation, I question some mainstream assumptions in contemporary metascience. Examples include: (1) exploratory research is more “tentative” than confirmatory research; (2) questionable research...


Slide Content

Questioning Some Metascience Assumptions
Mark Rubin, Durham University
2nd July 2024

The History of the Replication Crisis
• 2011 – Bem publishes evidence for precognition in JPSP.
• 2011 – Simmons et al. publish their “false positive psychology” paper showing that “undisclosed flexibility in data collection and analysis allows presenting anything as significant.”
• 2012 – John et al. publish a survey showing that “questionable research practices” are prevalent among psychologists.
• 2015 – The Open Science Collaboration finds that only 39% of 100 psychology effects can be replicated.

The Causes of the Replication Crisis?
• The replication crisis has been caused by researchers’ inappropriate methodological behaviour.
• To solve the replication crisis, we need to change researchers’ methodological behaviour!

What is Metascience?
From Peterson and Panofsky (2023)

Metascience is Not a Monolith!
• Metascience is a broad, heterogeneous, interdisciplinary subject (Field, 2022).
• Only some metascientists may make some of the assumptions I discuss today.

We’re All Metascientists!
• Metascientists are scientists!
• Scientists are often “metascientists” (e.g., when they critique their peers’ research from a general methodological, statistical, or philosophical perspective).
• Hence, there’s no “us” and “them” here (Bastian, 2021) because “we” are often “them” (Derksen & Field, 2022, p. 180).

The Mainstream Metascience Explanation of the Replication Crisis
• Questionable research practices like HARKing and p-hacking have led to a higher than expected false positive rate in the literature.
• Publication bias prevents null results from being reported.
• Researcher biases such as the hindsight and confirmation biases reduce researchers’ awareness of these issues.
• Consequently, replication rates are much lower than expected (aka the replication crisis).

Some Metascience Assumptions
1. Prediction is better than postdiction
2. Exploratory research is more “tentative”
3. HARKing is bad
4. Researcher bias is bad
5. The hindsight bias is bad
6. The confirmation bias is bad
7. Overfitting is bad
8. Undisclosed multiple testing is bad
9. Selective analyses are bad
10. Questionable research practices (QRPs) are bad
11. Publication bias is bad
12. Replication failures are bad

Some Related Publications
• Rubin (2024). Inconsistent multiple testing corrections. Methods in Psychology.
• Rubin (2023). Questionable metascience practices. Journal of Trial and Error.
• Rubin & Donkin (2022). Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests. Philosophical Psychology.
• Rubin (2022). The costs of HARKing. British Journal for the Philosophy of Science.
• Rubin (2021). When to adjust alpha during multiple testing. Synthese.
• Rubin (2020). Does preregistration improve the credibility of research findings? The Quantitative Methods for Psychology.
• Rubin (2017). When does HARKing hurt? Review of General Psychology.
• Rubin (2017). An evaluation of four solutions to the forking paths problem. Review of General Psychology.
• Rubin (2017). Do p values lose their meaning in exploratory analyses? Review of General Psychology.

Philosophy of Science: Perspectival Realism
• We can only have perspectives on reality (Giere, 2006; Massimi, 2018).
• “What the searchlight makes visible will depend upon its position, upon our way of directing it, and upon its intensity, colour, etc. although it will, of course, also depend very largely upon the things illuminated by it” (Popper, 1971).
[Image from jain108academy on Instagram]

“Prediction is Better than Postdiction”
• Yes! (Naïve or strong predictivism; e.g., Giere, 1983, 1984)
• No! (e.g., Keynes, 1921; Mill, 1843; Rubin & Donkin, 2022; Worrall, 2010, 2014)
• Only in some circumstances (weak predictivism; e.g., Collins, 1994; Howson, 1988; Harker, 2006; Hitchcock & Sober, 2004; Lange, 2001; Maher, 1988; Schlesinger, 1987; Syrjänen, 2023)
• Only for some scientists (natural scientists, not social scientists; Schindler, 2024)

“Exploratory Research is More ‘Tentative’”
• Compared to confirmatory research, exploratory research is more “uncertain” and “tentative” (Errington et al., 2021, p. 19; Nelson et al., 2018, p. 519; Nosek & Lakens, 2014, p. 138; Simmons et al., 2021, p. 154).
• “Exploratory studies cannot be presented as strong evidence” (Wagenmakers et al., 2012, p. 635).
• OK! But what counts as “confirmatory” and “exploratory” research?

Confirmatory vs. Exploratory: Proposed Distinctions and Their Problems
• Confirmatory = hypothesis testing; exploratory = descriptive research. But most would agree that an unplanned hypothesis test is not “confirmatory.”
• Confirmatory = result-independent; exploratory = result-dependent. But then is a preregistered decision tree that “depends” on the current results “exploratory”?
• Confirmatory = strong theory; exploratory = weak theory. But then a preregistered test based on weak theory is “exploratory”?
• Confirmatory = planned; exploratory = unplanned. But then planning to undertake a result-dependent exploratory analysis based on weak theory makes it “confirmatory”?
• Confirmatory = prediction; exploratory = postdiction. But if a researcher’s prediction is a “mere guess,” is it “confirmatory”?
• Confirmatory = hypothesis-testing; exploratory = hypothesis-generating. But what if a researcher views their results and then retrieves a hypothesis from the literature that is confirmed by them? They didn’t generate the hypothesis. So, is their test “confirmatory”?

“HARKing is Bad Because… It Hides Circular Reasoning”
• “Hypothesizing after the results are known…is an example of circular reasoning – generating a hypothesis based on observing data, and then evaluating the validity of the hypothesis based on the same data” (Nosek et al., 2018, p. 2600).

“HARKing is Bad Because… It Hides Circular Reasoning”
• But HARKing doesn’t prevent you from identifying circular reasoning (Rubin, 2022; Rubin & Donkin, 2022).
• You don’t need to know the timing of a researcher’s reasoning to know whether that reasoning is circular!
• “The logical properties of statements…[are] timeless: If a statement is a tautology, then it is a tautology once and for all” (Popper, 2002, p. 274).

HARKing Doesn’t Hide Circular Reasoning: An Example
• A researcher secretly HARKs that “eating apples improves mood” after observing this result in their study.
• They then generate a secretly post hoc theoretical rationale for their HARKed hypothesis: “Vitamin C improves mood, and apples are rich in Vitamin C.”
• We can see there’s no circular reasoning in this case, because the result is not used in the rationale.

“HARKing is Bad Because… It Allows Motivated Reasoning”
• Research results can inspire and motivate the generation of hypotheses without being formally used in the theoretical rationale (i.e., they remain “epistemically independent”; Reichenbach, 1938; Rubin & Donkin, 2022; Worrall, 2010, 2014).
• Motivated and biased reasoning doesn’t necessarily imply incorrect, invalid, or unsound reasoning (Hahn & Harris, 2014; Kruglanski & Ajzen, 1983; McArthur & Baron, 1983).

“HARKing is Bad Because… It Prevents Falsification”
• HARKing does not prevent hypotheses from being falsified.
• Researchers can HARK disconfirmed hypotheses just as much as they can HARK confirmed hypotheses (Kepes et al., 2022; Kerr, 1998, p. 198; Rubin, 2017, 2022).
[Image: “Disconfirmed Hypothesis”; adapted from Dirk-Jan Hoek]

“HARKing is Bad Because… It Can Be Used to Predict Anything”
• Yes, HARKing can be used to “predict nearly any pattern of results in nearly any context” (Kerr, 1998, p. 210).
• But this is an advantage of post hoc theorizing!
• Also, in a process of inference to the best explanation, the question is not whether a researcher can predict a result; it’s how good their theoretical explanation for the prediction is relative to other potential explanations (Haig, 2009; Lakatos, 1978; Popper, 2002; Rubin, 2022; Szollosi & Donkin, 2021).

“HARKing is Bad Because… It’s Unethical”
• “HARKing can entail concealment. The question then becomes whether what is concealed in HARKing can be a useful part of the ‘truth’...or is instead basically uninformative (and may, therefore, be safely ignored at an author’s discretion)” (Kerr, 1998, p. 209, my emphasis).

“HARKing is Bad Because… It’s Unethical”
[Photo: Denny Borsboom, July 2023, MathPsych Conference, Amsterdam; photo from Marc Jekel on X]

“HARKing is Bad Because… It’s Unethical”
• “Any truly rigorous approach to psychological science requires that scientific hypotheses cannot be equated to personal predictions; hypotheses must instead be articulated as depersonalized products of some systematic analysis, and appraised accordingly” (Schaller, 2016, p. 110).

“HARKing is Bad Because… It’s Unethical”
• The time at which a hypothesis is deduced from a theory is not “a useful part of the ‘truth’,” because the validity of a deduction is not affected by time (Keynes, 1921; Mill, 1843; Oberauer & Lewandowsky, 2019; Popper, 1959, p. 274; Rubin & Donkin, 2022; Szollosi & Donkin, 2021).

“HARKing is Bad Because… It’s Unethical”
• If a hypothesis is deduced from Theory X after a result is known, then it’s not deceptive to say: “As predicted by Theory X,…” (Brush, 2015, p. 78).

“Researcher Bias is Bad”
• “The aim of the Registered Report format is to reduce bias by eliminating many of the avenues for undisclosed flexibility in research” (Vazire et al., 2022, p. 166).
• “Registered Reports as a vaccine against research bias” (Chambers, 2018).

“Researcher Bias is Bad”
• Researcher bias influences not only the post hoc selection of hypotheses, data, analyses, and results (i.e., selective reporting), but also the a priori selection of hypotheses, methods, analyses, evidence thresholds, and interpretations (i.e., selective questioning).
• A preregistered study may be more biased than a non-preregistered study, because the preregistered study’s selective questioning may be more problematic than the non-preregistered study’s selective reporting.

“Researcher Bias is Bad”
• It’s naïve to assume that we can reduce researcher biases (Field & Derksen, 2021; Morawski, 2019, p. 228; Popper, 1983; Wiggins & Christopherson, 2019).
• It’s more realistic to assume that open science practices can help to reveal different perspectives rather than to reduce biases (Jamieson et al., 2023; Pownall, 2022).

“Researcher Bias is Bad”
• “One of the strengths of science is that it does not require that scientists be unbiased, only that different scientists have different biases” (Hull, 1988, p. 22).

“The Hindsight Bias is Bad”
• The hindsight bias leads researchers to believe that they “knew it all along.”
• So, it facilitates science by allowing scientists to update their beliefs about hypotheses without threatening their need for self-consistency.

“Confirmation Bias is Bad”
• The confirmation bias is stronger when hypotheses are more plausible (e.g., Butzer, 2019; Hergovich et al., 2010).
• So, it facilitates science by preventing well-established (highly plausible) theories from being disconfirmed too easily (Lakatos, 1978; O'Connor & Gabriel, 2022; Popper, 1963, 1983).

“Overfitting is Bad”
• Overfitting occurs when researchers construct a hypothesis on the basis of results from a specific sample but accidentally accommodate the idiosyncratic and unrepresentative properties of that sample in their hypothesis.

But Underfitting is Also Bad!
• Overfitting is bad!
• OK! But we should also be careful about underfitting (e.g., Protzko, 2018). (See the sketch below.)
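
A minimal sketch, not from the slides, of what overfitting and underfitting look like in practice: polynomials of increasing degree are fitted to a small noisy sample drawn from an assumed quadratic data-generating curve, and the fits are checked against a fresh sample. The curve, sample size, and noise level are all illustrative assumptions.

```python
# Sketch: underfitting vs. overfitting on a small noisy sample (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n=20):
    """Draw a sample from an assumed quadratic 'truth' plus noise."""
    x = np.linspace(-1, 1, n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_sample()
x_test, y_test = make_sample()  # fresh sample from the same process

for degree in (1, 2, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typical pattern: degree 1 underfits (poor fit everywhere); degree 10 overfits
# (excellent fit to the idiosyncrasies of the training sample, worse fit to a new
# sample); degree 2 tends to generalize best.
```

The held-out sample plays the role of a replication: a hypothesis that accommodates the idiosyncrasies of one sample (overfitting) or ignores its systematic structure (underfitting) will both show up as poorer fit to new data.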

“Undisclosed Multiple Testing is Bad”
• Single tests of individual hypotheses do not inflate the Type I error rate for each statistical inference even if multiple such tests occur side-by-side in the same study!
• But see also:
1. Armstrong (2014, p. 505)
2. Cook & Farewell (1996, pp. 96–97)
3. Fisher (1971, p. 206)
4. García-Pérez (2023, p. 15)
5. Greenland (2021, p. 5)
6. Hewes (2003, p. 450)
7. Hitchcock & Sober (2004, pp. 24–25)
8. Hurlbert & Lombardi (2012, p. 30)
9. Matsunaga (2007, p. 255)
10. Molloy et al. (2022, p. 2)
11. Parker & Weir (2020, p. 564)
12. Parker & Weir (2022, p. 2)
13. Rothman (1990, p. 45)
14. Savitz & Olshan (1995, p. 906)
15. Senn (2007, pp. 150–151)
16. Sinclair et al. (2013, p. 19)
17. Tukey (1953, p. 82)
18. Turkheimer et al. (2004, p. 727)
19. Veazie (2006, p. 809)
20. Wilson (1962, p. 299)

• An alpha adjustment is necessary when testing a joint hypothesis: e.g., “Jelly beans (of one or more colours) cause acne.”
• An alpha adjustment is not necessary when testing multiple individual hypotheses: e.g., “Green jelly beans cause acne”; “Red jelly beans cause acne”; etc. (See the simulation sketch below.)
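
To make the per-test versus joint-claim distinction concrete, here is a minimal simulation sketch, not from the slides, assuming 20 independent z-tests per study, all null hypotheses true, and a two-sided alpha of .05:

```python
# Sketch: per-test vs. familywise Type I error under multiple individual tests
# (illustrative assumptions: 20 independent z-tests, all nulls true, alpha = .05).
import numpy as np

rng = np.random.default_rng(1)
alpha, n_tests, n_studies = 0.05, 20, 100_000
crit = 1.959964  # two-sided 5% critical value for a z statistic

z = rng.standard_normal((n_studies, n_tests))  # test statistics under the null
reject = np.abs(z) > crit                      # per-test decisions at unadjusted alpha

per_test_error = reject.mean()                 # error rate for each individual claim
familywise_error = reject.any(axis=1).mean()   # error rate for the joint claim
                                               # "at least one of the 20 effects is real"

print(f"Per-test Type I error:   {per_test_error:.3f}  (~ .05)")
print(f"Familywise Type I error: {familywise_error:.3f}  (~ 1 - .95**20 = {1 - 0.95**20:.3f})")
```

Each individual test keeps its nominal 5% error rate no matter how many other tests sit alongside it in the same study; it is only the joint “at least one colour of jelly bean causes acne” claim whose error rate grows and therefore needs an adjustment (e.g., Bonferroni: alpha / 20).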

“Selecting an Analysis on the Basis of Your Data is Bad”
• Will we reject a joint null hypothesis if at least one of Drugs A, B, C, D, E, …etc. is significant? (See also the jelly beans example.)
• If not, the α level will remain valid with individual testing (Rubin, 2021).
• In general, it’s OK to look at your data prior to your analysis (Devezer et al., 2021; Rubin & Donkin, 2022)!
Starmer (2022)

“Selective Analysis Inflates the Type I Error Rate”
• Conducting multiple tests on different subsets of data (e.g., different countries within Europe) and then selectively reporting one subset of that data (e.g., the data about France) may bias claims about the larger data set (e.g., claims about Europe), but it does not bias claims about the reported subset (i.e., claims about France).

“Questionable Research Practices (QRPs) are Bad”
• QRPs can be perfectly acceptable research practices (Fiedler & Schwarz, 2016; Moran et al., 2022, Table 6; Rubin, 2022, p. 551; Sacco et al., 2019).
• QRPs need to be “questioned” by other researchers and interpreted in specific research situations before they can be judged to be potentially problematic.
• A more tenable position is to assume that only some QRPs are potentially problematic in specific research situations.

“Publication Bias is Bad”
[Figure illustrating publication bias, from van Zwet & Cator (2021)]

“Publication Bias is Bad”
• Null results are not “evidence”:
1. A Bayes factor close to 1.00 is “barely worth mentioning” (Jeffreys, 1961).
2. A Fisherian null result doesn’t allow you to accept the null hypothesis; nonsignificant results can be “ignore[d] entirely” (Fisher, 1926).
3. A Neyman-Pearson null result with low power doesn’t allow you to accept the null hypothesis.
4. Even a Neyman-Pearson null result with high power doesn’t allow you to accept the null hypothesis (Greenland, 2012). (See the simulation sketch below.)
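
A minimal simulation sketch, not from the slides, of why even a “high-powered” null result does not license accepting the null: the two-sample t-test design below has roughly 90% power for an assumed effect of d = 0.5, yet nonsignificant results remain common when the true effect is merely smaller than the one the study was powered for. The sample sizes and effect sizes are illustrative assumptions.

```python
# Sketch: nonsignificant results under different true effects in a "high-powered"
# design (illustrative assumptions: two-sample t-test, n = 86 per group,
# ~90% power for d = 0.5, two-sided alpha = .05).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sims = 86, 20_000

def nonsig_rate(true_d):
    """Proportion of simulated studies with p >= .05 when the true effect is true_d."""
    nonsig = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_d, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        nonsig += p >= 0.05
    return nonsig / n_sims

print(f"P(nonsignificant | d = 0.0): {nonsig_rate(0.0):.2f}")  # ~ .95
print(f"P(nonsignificant | d = 0.2): {nonsig_rate(0.2):.2f}")  # ~ .74
print(f"P(nonsignificant | d = 0.5): {nonsig_rate(0.5):.2f}")  # ~ .10 (the powered case)
```

A null result is common both when the effect is exactly zero and when it is merely smaller than the effect the study was powered to detect, which is one way of seeing why a high-powered nonsignificant result still does not warrant accepting the null.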

Publication Bias: A Caveat!
• The effect exists → selection for significance → more large (significant) instances of the effect and fewer small (nonsignificant) instances of the effect.
• The effect = 0 → selection for significance → significant results confirming a positive effect, significant results confirming a negative effect, and fewer non-significant results.

“Replication Failures are Bad”
• Successful replications represent scientific progress by confirming current hypotheses.
• But failed replications also represent scientific progress by motivating the generation of new hypotheses that explain why the replications failed (e.g., Firestein, 2012).
• So, although low replication rates may indicate poor knowledge accumulation, they may also represent scientific progress vis-à-vis greater specified ignorance (Merton, 1987).

Critical Metascience Substack

List of Critical Metascience Articles

“The scientist, by the very nature of [their] commitment, creates more and more questions, never fewer” (Allport, 1954).
Email: [email protected]