Replication of the Uzzi et al. (2013) Science study on atypical combinations, with additional analysis showing that journal and disciplinary effects are not negligible.
Atypical combinations are confounded by disciplinary effects (Boyack & Klavans)
1. SCITECH STRATEGIES
Better Maps ● Better Solutions
Physics Chemistry Engineering Biology Disease Medicine Computer Earth Brain Health Social Humanities
Atypical combinations are confounded by
disciplinary effects
STI 2014
Leiden, The Netherlands
Sept. 3-5, 2014
Kevin W. Boyack & Richard Klavans
SciTech Strategies, Inc.
www.mapofscience.com
2. BACKGROUND
We have long been interested in indicators of innovative research
Uzzi et al. (UMSJ) recently published an article correlating high-impact papers (innovation) with “atypical combinations” (novelty) of reference journals
The results were intriguing, so we decided to replicate the study and then explore this idea of novelty further
3. UZZI STUDY
Hypothesis: “The highest-impact science is primarily grounded in
exceptionally conventional combinations of prior work yet
simultaneously features an intrusion of unusual combinations”
Data: Used 17.9M articles (1950-2000) from WOS, containing 302M
references to 15,613 cited journals
Method:
» Journals are used as proxy for “areas of knowledge”
» Determine which co-cited journal combinations are “conventional” and which are
“unusual” or “novel”
» Develop indicators of “convention” and “novelty” from co-citation statistics
» Calculate “convention” and “novelty” for each paper using indicators
» Test indicators to see how they correlate with highly cited papers
Finding: Papers with high convention AND high novelty are twice as
likely to be highly cited as the average paper
4. UMSJ METHOD (1)
To determine which co-cited journal combinations are “conventional”
and which are “novel”, UMSJ calculated Z-scores for each co-cited
journal pair, where Z is defined:
Z = (Nact – Nexp) / Nvar
Nact is the actual number of journal co-citation counts
Nexp is an expected number of journal co-citation counts
Nvar is the variance of Nexp across the randomized networks (a conventional Z-score would divide by the standard deviation)
Nexp and Nvar were estimated from 10 randomized citation networks in which all citation links were switched using a Monte Carlo technique, keeping the citing and cited distributions constant at the paper level
A negative Z-score indicates that a journal pair is co-cited less often than expected and is thus an “atypical combination” of journals
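A minimal sketch of the Z-score calculation for a single journal pair, assuming the co-citation counts from the randomized networks are already available. The function name and the counts are hypothetical, and the sketch divides by the standard deviation of the randomized counts, which is an assumption about how the denominator should be read.

```python
from statistics import mean, pstdev


def journal_pair_z(observed, randomized_counts):
    """Z-score of an observed co-citation count (Nact) against randomized counts."""
    expected = mean(randomized_counts)   # plays the role of Nexp
    spread = pstdev(randomized_counts)   # assumption: population standard deviation
    if spread == 0:
        raise ValueError("randomized counts have zero spread; Z is undefined")
    return (observed - expected) / spread


# Hypothetical counts for one journal pair across 10 randomized networks.
randomized = [12, 15, 11, 14, 13, 12, 16, 13, 14, 10]

z_atypical = journal_pair_z(4, randomized)       # co-cited less often than expected
z_conventional = journal_pair_z(40, randomized)  # co-cited more often than expected
```

With these made-up numbers the first pair gets a negative score (atypical) and the second a positive one (conventional).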
5. UMSJ METHOD (2)
Using the computed Z-scores for each co-cited journal pair, the set of Z-scores for the journal pairs in each paper's reference list can then be assembled
Two summary statistics were calculated for each paper from its Z-score distribution:
» Median Z-score – to characterize central tendency or “convention”
» 10th percentile (left tail) Z-score – to characterize “novelty”
Distributions of these summary statistics were analyzed
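The two paper-level statistics can be sketched as follows, assuming each paper is represented by the list of Z-scores of its co-cited journal pairs. The linear-interpolation percentile is one common convention and is an assumption here; the function names and the example scores are hypothetical.

```python
from statistics import median


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


def paper_summary(pair_z_scores, tail_pct=10):
    """Median ("convention") and left-tail ("novelty") Z-score for one paper."""
    return {
        "median_z": median(pair_z_scores),
        "tail_z": percentile(pair_z_scores, tail_pct),
    }


# Hypothetical Z-scores for the journal pairs co-cited by one paper.
paper_z = [-3.0, -1.0, 0.5, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 15.0, 20.0]
summary = paper_summary(paper_z)
```

The `tail_pct` argument is exposed so that a different left-tail percentile can be tried without changing the rest of the code.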
6. UMSJ METHOD (3)
Distributions of these paper-level summary statistics were analyzed
Indicators based on these summary statistics were created
» Novelty
HIGH – 10th percentile Z-score < 0
LOW – 10th percentile Z-score > 0
» Conventionality
HIGH – median Z-score > average
LOW – median Z-score < average
Each paper was classified in terms of convention and novelty
[Figure: 2x2 matrix of papers by convention (low, high) and novelty (high, low)]
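The classification into the four bins can be sketched as below. Two points are assumptions: the convention cutoff is taken to be the mean of the papers' median Z-scores, and a paper sitting exactly on a threshold is placed in the LOW category, since the indicator definitions leave both details open. The paper identifiers and scores are hypothetical.

```python
from statistics import mean


def classify_paper(median_z, tail_z, convention_cutoff):
    """Return one of 'N+C+', 'N+C-', 'N-C+', 'N-C-' for a single paper."""
    novelty = "N+" if tail_z < 0 else "N-"                        # high novelty: left tail below zero
    convention = "C+" if median_z > convention_cutoff else "C-"   # high convention: above the cutoff
    return novelty + convention


# Hypothetical papers: (median Z-score, left-tail Z-score).
papers = {
    "p1": (6.0, -1.0),
    "p2": (1.0, -2.5),
    "p3": (9.0, 0.8),
    "p4": (2.0, 0.3),
}

# Assumption: the "average" cutoff is the mean of the median Z-scores.
cutoff = mean(m for m, _ in papers.values())
bins = {pid: classify_paper(m, t, cutoff) for pid, (m, t) in papers.items()}
```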
7. UMSJ RESULTS
“Hit” papers are defined as the top-5% highly cited papers
Using the indicators:
» The probability of a HIGH NOVELTY, HIGH CONVENTION (N+C+) paper being a hit paper is 0.0911
» The probability of a LOW NOVELTY, LOW CONVENTION (N-C-) paper being a hit paper is 0.0205
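As a quick arithmetic check, the two probabilities can be compared with the 5% baseline implied by the definition of a hit paper. The variable names are illustrative; the probabilities are the ones quoted above.

```python
# Hit papers are the top 5% by definition, so an average paper has a 0.05 chance.
BASELINE_HIT_RATE = 0.05

hit_probability = {"N+C+": 0.0911, "N-C-": 0.0205}

# Lift of each bin relative to the average paper.
lift = {bin_name: p / BASELINE_HIT_RATE for bin_name, p in hit_probability.items()}

# How much more likely an N+C+ paper is to be a hit than an N-C- paper.
ratio_between_bins = hit_probability["N+C+"] / hit_probability["N-C-"]
```

The N+C+ lift comes out at roughly 1.8, which is where the "twice as likely as the average paper" summary comes from, and the N+C+ bin is about 4.4 times as likely to contain a hit as the N-C- bin.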
8. UMSJ ISSUES
“Analyses in the supplementary materials (fig. S6) show that these
empirical regularities for the WOS taken as a whole are largely
replicated on a field-by-field basis and across time”
» Across time – YES
» Across fields or disciplines – NOT REALLY! – UMSJ supplemental results show that
the N+C+ bin has the highest probability (of the 4 bins) of containing a hit paper for
only 64% of the 243 subject categories
The fact that the N+C+ bin is not ranked first in 36% of subject
categories is troubling, suggesting potentially large field effects, or even
individual journal effects
The top-5% highly cited papers were not sampled by field
Journals may not be the right proxy for “areas of knowledge”
9. REPLICATION
We used a different, but parallel, methodology to replicate the UMSJ
distributions and results
Scopus data (2001-2010) – 12M articles, 226M references
Included conference papers along with articles
K50 statistics for co-cited journal pairs rather than Z-scores and Monte
Carlo simulations
» K50 has the same conceptual formulation as the Z-score:
(Nact – Nexp) / Normalization
» Expected values and normalization are based on row and column sums
UMSJ procedures for calculating distributions, etc. were all followed
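A hedged sketch of a statistic with the general form described above, (Nact – Nexp) / Normalization, built only from row sums of a journal co-citation matrix. The specific choices here, an expected value of S_i * S_j / (SS – S_i) and a normalization of sqrt(S_i * S_j), are assumptions for illustration and may differ from the exact K50 definition used in the study. The matrix is hypothetical.

```python
from math import sqrt


def k50_like(cocitations):
    """Row/column-sum-normalized relatedness for a square co-citation matrix.

    cocitations[i][j] is the co-citation count between journals i and j
    (zero diagonal). Returns a matrix of the same shape.
    """
    n = len(cocitations)
    row_sums = [sum(row) for row in cocitations]
    total = sum(row_sums)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or row_sums[i] == 0 or row_sums[j] == 0:
                continue  # undefined for the diagonal and for never-co-cited journals
            # Assumed expected values, computed from each side of the pair.
            expected_ij = row_sums[i] * row_sums[j] / (total - row_sums[i])
            expected_ji = row_sums[j] * row_sums[i] / (total - row_sums[j])
            norm = sqrt(row_sums[i] * row_sums[j])
            scores[i][j] = max(
                (cocitations[i][j] - expected_ij) / norm,
                (cocitations[j][i] - expected_ji) / norm,
            )
    return scores


# Hypothetical matrix: journals 0-1 and 2-3 form two tightly co-cited pairs.
matrix = [
    [0, 10, 1, 0],
    [10, 0, 0, 1],
    [1, 0, 0, 10],
    [0, 1, 10, 0],
]
k = k50_like(matrix)
```

As with the Z-score, a pair that is co-cited less often than its row sums would suggest gets a negative value, so the same zero threshold can be applied.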
10. REPLICATION
For the left tail, we used the 5th percentile rather than the 10th percentile to more closely match the UMSJ distributions
Indicator distributions for the median and left-tail percentile values are very similar to the UMSJ distributions
» Differences in the tail percentile curves have no effect on the indicators, since the fraction of articles at the zero point is the same for all curves
[Figure: 2x2 matrix of papers by convention (low, high) and novelty (high, low)]
11. REPLICATION
Probabilities of hit papers 2001-2005 (top-5% highly cited) as of 2011
                                        UMSJ (1990-2000)      This study (2001-2005)
Bin                                     % sample    Prob      % sample    Prob
High Novelty, High Convention (N+C+)    6.7%        0.0911    9.5%        0.0959
High Novelty, Low Convention (N+C-)     26%         0.0533    30.6%       0.0659
Low Novelty, High Convention (N-C+)     44%         0.0582    40.5%       0.0433
Low Novelty, Low Convention (N-C-)      23%         0.0205    19.4%       0.0205
Our results are similar to the UMSJ results
» A higher probability for N+C+ (0.0959 vs. 0.0911), coupled with a higher fraction of papers in that bin (9.5% vs. 6.7%), suggests that our method does slightly better at locating highly cited papers
» High novelty is accentuated overall using our method (N+C- is 0.0659 rather than 0.0533)
Replication was successful, and reproduces the major features of the
UMSJ study
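One sanity check on the table: because hit papers are the top 5% overall, the bin probabilities weighted by the bin shares should come out close to 0.05 in both studies. The sketch below does that arithmetic with the tabulated values; the dictionary layout is illustrative only.

```python
# (share of sample, probability of a hit paper) for each bin, taken from the table.
bins_by_study = {
    "UMSJ": {
        "N+C+": (0.067, 0.0911),
        "N+C-": (0.26, 0.0533),
        "N-C+": (0.44, 0.0582),
        "N-C-": (0.23, 0.0205),
    },
    "this study": {
        "N+C+": (0.095, 0.0959),
        "N+C-": (0.306, 0.0659),
        "N-C+": (0.405, 0.0433),
        "N-C-": (0.194, 0.0205),
    },
}

# Share-weighted hit probability per study; should be near the 5% baseline.
overall_hit_rate = {
    study: sum(share * prob for share, prob in bins.values())
    for study, bins in bins_by_study.items()
}
```

Both weighted sums land within a fraction of a percentage point of 0.05 (the published shares are rounded), which supports reading the two columns as directly comparable.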
12. FIELD EFFECTS?
2x2 matrix probabilities for the top-5% sampled by field were compared to the 2x2 matrix probabilities using the top-5% overall
The bins are in the same order using the top-5% by field, but the differences between bins are smaller
» N+C+ (0.0834 vs. 0.0959)
» N-C- (0.0335 vs. 0.0205)
This suggests that “atypical combinations” are influenced by field effects
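The difference between the two sampling schemes can be sketched with a toy example: one citation threshold for the whole corpus versus a separate threshold per field. The field labels, paper identifiers, and citation counts are hypothetical, and papers tied at a threshold are all counted as hits, which is a simplifying assumption.

```python
from collections import defaultdict


def top_share_threshold(citation_counts, share=0.05):
    """Citation count of the last paper inside the top `share` of the list."""
    ordered = sorted(citation_counts, reverse=True)
    n_hits = max(1, round(len(ordered) * share))
    return ordered[n_hits - 1]


def mark_hits(papers, by_field, share=0.05):
    """papers: list of (paper_id, field, citations). Returns the set of hit paper ids."""
    groups = defaultdict(list)
    for _, field, cites in papers:
        groups[field if by_field else "all"].append(cites)
    thresholds = {g: top_share_threshold(c, share) for g, c in groups.items()}
    return {
        pid
        for pid, field, cites in papers
        if cites >= thresholds[field if by_field else "all"]
    }


# Hypothetical corpus: a highly cited field and a sparsely cited field.
corpus = [("b%d" % i, "biomed", 100 + i) for i in range(20)]
corpus += [("m%d" % i, "math", 1 + i) for i in range(20)]

hits_overall = mark_hits(corpus, by_field=False)  # both hits come from biomed
hits_by_field = mark_hits(corpus, by_field=True)  # one hit from each field
```

In this toy corpus the overall top 5% is drawn entirely from the highly cited field, while sampling by field gives each field its own hit paper, which is the kind of shift that changes the bin probabilities.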
13. FIELD EFFECTS?
The 20 largest journals (by number of co-citations) are plotted in terms of convention and novelty
» These 20 journals account for 15.9% of all co-citations
Note: journals are plotted here based on how they are co-cited, not on what is published in them
[Figure axes: % co-citations above overall median; % co-citations below zero]
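The two plotted quantities can be sketched for a single journal as below, assuming each journal pair carries a score and a co-citation count and that the percentages are weighted by co-citation count. That weighting, the journal names, and the numbers are all assumptions for illustration.

```python
def journal_profile(pairs, journal, overall_median):
    """Percent of a journal's co-citations above the overall median and below zero.

    pairs: list of (journal_a, journal_b, score, cocitation_count).
    """
    total = above = below = 0
    for a, b, score, count in pairs:
        if journal not in (a, b):
            continue
        total += count
        if score > overall_median:
            above += count  # co-cited in a "conventional" pairing
        if score < 0:
            below += count  # co-cited in an "atypical" pairing
    if total == 0:
        raise ValueError("journal has no co-citations")
    return 100 * above / total, 100 * below / total


# Hypothetical journal pairs: (journal, journal, score, co-citation count).
pairs = [
    ("Journal A", "Journal B", 5.0, 60),
    ("Journal A", "Journal C", -2.0, 30),
    ("Journal A", "Journal D", 50.0, 10),
    ("Journal C", "Journal E", 80.0, 100),
]

profile_a = journal_profile(pairs, "Journal A", overall_median=20.0)
profile_c = journal_profile(pairs, "Journal C", overall_median=20.0)
```

In this toy data Journal A sits low on convention with a sizeable atypical share, while Journal C is mostly conventional, mirroring the kind of separation the plot is meant to show.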
14. FIELD EFFECTS?
Three groups appear
» PHYSICS (6 journals) – cited as conventional, but not novel
» BIOMED (9 journals) – cited as both conventional and novel
» MULTI (5 journals) – cited as novel and not conventional
Nature, Science, and PNAS account for 9.4% of ALL atypical co-citation pairs
» Multidisciplinary journals are clearly not good proxies for “areas of knowledge”
» They contribute the most to the notion of “atypical”, suggesting that journals are a poor basis for this study
[Figure axes: % co-citations above overall median; % co-citations below zero]
15. SUMMARY
We have replicated the UMSJ study and its primary finding:
» Papers with high convention AND high novelty are twice as likely to be highly cited as the average paper
This is a real finding! There seems to be something to the notion of “atypical combinations” that is meaningful and could be predictive
However …
Field and journal effects are not negligible, and given that these studies were based on journal co-citation, journals and fields may be driving “atypical combinations”
Journals are the wrong proxy for “areas of knowledge”; an alternative proxy is needed
Other potential measurements of “atypical-ness” or “novelty” that are relatively independent of field or journal effects should be proposed and tested
16. QUESTIONS
Thank you for your attention!