STI 2018 jws
1. Data-dependent analytical choices relying on NHST should not be trusted!
Jesper W. Schneider
Danish Centre for Studies in Research and Research Policy,
Aarhus University, Denmark
jws@ps.au.dk
2. NHST = “null hypothesis significance test”
• A standard mode of inference in social and behavioral science is to establish
stylized facts using statistical significance in quantitative studies
• NHST is the practice of selection on statistical significance using an arbitrary
threshold (p < .05) to decide whether something is “true” or “false”
• NHST is when we try to knock down a strawman hypothesis to be able to say
something about our preferred hypothesis which we do not test
3. My claim
• Findings in Scientometrics relying on NHST are not to be trusted!
• … or we should be sceptical of claims from individual studies, especially when “decisions”
are based on NHST
• Why?
• At best, they are very susceptible to the “garden of forking paths”
• At worst they have been deliberately p-hacked
• So my arguments are primarily in the statistical realm
5. Knowledge production modes in Scientometrics
• The field is diverse when it comes to methodology and methods
• Its core and boundaries are somewhat elusive
• The quantitative methodological domain is seemingly descriptive (exploratory) and often
case based
• … at least on the surface, as many studies are tacitly confirmatory studying explicit or
implicit hypotheses relying on NHST
• … and here we should watch out!
‘‘… the data did not show a significant relationship between the proportion of female
authors and the number of citations received, controlled by the number of authors who
signed each paper and the journal impact factor (r = -0.085, p = 0.052)”
7. Check observed data → log-transform skewed data → x1 is correlated with y, include as covariate → use OLS regression → p < .05
This is usually what is presented … the straight way to the “truth”, and seemingly the only way to the “truth”
- A pre-determined path!!
9. Check observed data → log-transform skewed data → x1 is correlated with y, include as covariate → use OLS regression → p < .05
… and your recipe gave you this!
10. But what if our recipe, or the choices we made along the way, had been slightly different?
11. Path 1: check observed data → log-transform skewed data → x1 is correlated with y, include as covariate → use OLS regression → p < .05
Path 2: check observed data → data are not skewed → x1 is correlated with y, include as covariate → use OLS regression → p < .??
We chose a slightly different path?
12. Path 1: check observed data → log-transform skewed data → x1 is correlated with y, include as covariate → use OLS regression → p < .05
Path 2: check observed data → data are not skewed → x1 is correlated with y, include as covariate → use OLS regression → p < .??
Path 3: check observed data → x1 is not correlated with y and is excluded → use OLS regression → p < .??
Or yet another?
13. Same data, many potential paths leading to potentially different outcomes!
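The forking paths in the diagrams above can be simulated. A minimal sketch (all data, settings, and variable names hypothetical): fit the same simulated dataset along four defensible analysis paths and obtain a different p-value for x1 each time.

```python
# Illustrative sketch: one simulated dataset, four defensible analysis
# paths (log-transform or not, covariate or not), four p-values for x1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 60
x1 = rng.exponential(1.0, n)          # skewed predictor
x2 = 0.5 * x1 + rng.normal(0, 1, n)   # correlated candidate covariate
y = 0.3 * x1 + rng.normal(0, 2, n)    # weak "true" effect, noisy outcome

def p_for_x1(y, predictors):
    """Two-sided p-value for the first predictor in an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df                      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of beta
    t = beta[1] / np.sqrt(cov[1, 1])                 # t-statistic for x1
    return 2 * stats.t.sf(abs(t), df)

paths = {
    "raw x1, no covariate":   p_for_x1(y, [x1]),
    "raw x1, with covariate": p_for_x1(y, [x1, x2]),
    "log x1, no covariate":   p_for_x1(y, [np.log(x1)]),
    "log x1, with covariate": p_for_x1(y, [np.log(x1), x2]),
}
for path, p in paths.items():
    print(f"{path:24s} p = {p:.3f}")
```

The point is not the particular values but that each path yields its own p-value, so the reported p is conditional on the path chosen.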
15. Is this a problem?
• Yes
• Because of selection on significance (confirmatory analysis)
• The multiple-comparisons problem leads to many false-positive claims
• No
• If NHST is dropped = exploratory analysis
• If pre-registered
• If all data and choices are open and transparent
• Take away message:
• Different stories can be spun from the same data
• A p-value is specific for the path you chose and thus variable with choices
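The multiple-comparisons point can be illustrated with a small simulation (a sketch with hypothetical settings): run 20 independent tests of true null effects at α = .05 and count how often at least one comes out “significant”.

```python
# Sketch of the multiple-comparisons problem: with 20 independent tests
# of true null effects at alpha = .05, "something significant" is the norm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n_tests, alpha = 1000, 20, 0.05

false_positive_runs = 0
for _ in range(n_sim):
    # 20 two-sample t-tests where the true group difference is zero
    ps = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
          for _ in range(n_tests)]
    if min(ps) < alpha:
        false_positive_runs += 1

print(f"P(at least one 'significant' result) ≈ {false_positive_runs / n_sim:.2f}")
# Theoretical value: 1 - 0.95**20 ≈ 0.64
```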
17. P-hacking
• Looking at the data and changing paths along the way, chasing significance
• Not exactly a pre-determined
path
• Leads to many problems, not
least abundant false-positive
claims
• P-values are uninterpretable
Average power ≈ 24% in the social and behavioural sciences, yet 90% of studies find an “effect”
Everybody is a p-hacker, but some do it unintentionally
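One common p-hacking route, peeking at the data and adding observations until p < .05, can be sketched as follows (all settings hypothetical): even with no true effect at all, the realized false-positive rate ends up well above the nominal 5%.

```python
# Sketch of p-hacking via optional stopping: test, peek, add more data,
# and stop as soon as p < .05. The true group difference is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sim, start_n, step, max_n, alpha = 1000, 10, 10, 100, 0.05

hits = 0
for _ in range(n_sim):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1            # "found an effect" that does not exist
            break
        if len(a) >= max_n:
            break                # give up at the (post hoc) sample-size cap
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

print(f"False-positive rate with optional stopping ≈ {hits / n_sim:.2f}")
```

With ten chances to peek, the nominal 5% error rate roughly triples or quadruples, which is why such p-values are uninterpretable.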
19. Reproducibility
• It is no wonder that “effects” or “findings” do not “replicate”, or go in all directions when
reproducibility is judged on a statistical basis
• The basic fact is that “effects” vary across conditions – “we cannot step into the same
river twice”
• We do not have the required “controlled settings”
• We usually have indirect and very noisy measurements
• Effects are generally small, “the low hanging fruits have been grabbed”
• Researcher-degrees-of-freedom are countless
• Data analytical choices are many and arbitrary
• A main culprit is the labelling of “replications” as success or failures!
• … this is NHST with its arbitrary thresholds for success or failure, and it leads to “false positives”
20. My overall message
• No replication is truly direct, and I recommend moving away from the classification of
replications as “direct” or “conceptual” to a framework in which we accept that effects
vary across conditions
• Relatedly, we should stop labeling replications as successes or failures and instead use
continuous measures to compare different studies
• But the general importance of “replication”, or the “potential to replicate”, is quite clear and central
• For example, if effects can vary by context, this provides more reason why
“replication” is necessary for scientific progress
• Full transparency of data collection and processing procedures, of actual data, of all
analyses done and how they are done, providing code are essential
• … and stop ‘over-selling’ your selective results
21. My overall message
• A statistically significant result does not mean that we have found a “true” or “real”
effect
• Few studies provide definitive evidence … although the stories they tell are often framed
in that way
• We need many studies of the same phenomenon, call it “replication” but we should
express the difference between old and new studies in terms of the expected variation
in the effect between conditions, not whether it is a success or failure
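Expressing the old/new comparison on a continuous scale, as suggested above, can be as simple as relating the difference between the two estimates to its standard error (all numbers purely hypothetical):

```python
# Sketch: compare an original study and a replication on a continuous
# scale instead of a success/failure label (estimates are hypothetical).
import math

# (estimate, standard error) for the original study and the replication
orig_est, orig_se = 0.40, 0.15
repl_est, repl_se = 0.15, 0.10

diff = orig_est - repl_est
se_diff = math.sqrt(orig_se**2 + repl_se**2)   # SE of the difference
z = diff / se_diff

print(f"difference = {diff:.2f}, SE = {se_diff:.2f}, z = {z:.2f}")
# The two estimates differ by under 1.5 standard errors, i.e. within
# the variation one might expect between conditions; no binary verdict.
```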
Quantitative studies in Scientometrics relying on NHST are no exception to this!!