Atypical combinations are confounded by disciplinary effects (Boyack & Klavans)

SCITECH STRATEGIES
Better Maps ● Better Solutions
Physics Chemistry Engineering Biology Disease Medicine Computer Earth Brain Health Social Humanities
Atypical combinations are confounded by
disciplinary effects
STI 2014
Leiden, The Netherlands
Sept. 3-5, 2014
Kevin W. Boyack & Richard Klavans
SciTech Strategies, Inc.
www.mapofscience.com

Better Maps SCITECH STRATEGIES ● Better Solutions
2
BACKGROUND
 We have long been interested
in indicators of innovative
research
 Uzzi et al. (UMSJ) recently
published an article
correlating high impact papers
(innovation) with “atypical
combinations” (novelty) of
reference journals
 Intriguing results; we decided
to investigate further – to
replicate the study and then
further explore this idea of
novelty
Physics Computer Chemistry Engineering Earth Biology Disease Medicine Brain Health Social Humanities

3
UZZI STUDY
 Hypothesis: “The highest-impact science is primarily grounded in
exceptionally conventional combinations of prior work yet
simultaneously features an intrusion of unusual combinations”
 Data: Used 17.9M articles (1950-2000) from WOS, containing 302M
references to 15,613 cited journals
 Method:
» Journals are used as proxy for “areas of knowledge”
» Determine which co-cited journal combinations are “conventional” and which are
“unusual” or “novel”
» Develop indicators of “convention” and “novelty” from co-citation statistics
» Calculate “convention” and “novelty” for each paper using indicators
» Test indicators to see how they correlate with highly cited papers
 Finding: Papers with high convention AND high novelty are twice as
likely to be highly cited as the average paper

4
UMSJ METHOD (1)
 To determine which co-cited journal combinations are “conventional”
and which are “novel”, UMSJ calculated Z-scores for each co-cited
journal pair, where Z is defined:
Z = (Nact – Nexp) / Nvar
 Nact is the actual number of journal co-citation counts
 Nexp is an expected number of journal co-citation counts
 Nvar is the variance of Nexp
 Nexp and Nvar were estimated by calculating (10) randomized citation
networks where all citation links were switched using a Monte Carlo
technique, keeping citing/cited distributions constant at the paper level
 A negative Z-score indicates that a journal pair is co-cited less often
than expected; thus is an “atypical combination” of journals

5
UMSJ METHOD (2)
 Using the computed Z-scores
for each co-cited journal pair,
the set of Z-scores can then
be located for each paper
 Two summary statistics were
calculated for each paper
from its Z-score distribution:
» Median Z-score – to characterize
central tendency or “convention”
» 10th percentile (left tail) Z-score –
to characterize “novelty”
 Distributions of these
summary statistics were
analyzed

6
UMSJ METHOD (3)
 Distributions of these paper-level
analyzed
 Indicators based on these
created
» Novelty
 HIGH – 10th Pctl Z-score < 0
 LOW – 10th Pctl Z-score > 0
» Conventionality
 HIGH – median Z-score > Avg
 LOW – median Z-score < Avg
 Each paper classified in terms
of convention and novelty
Low
Convention
High
Convention
High
Novelty
Low
Novelty

7
UMSJ RESULTS
 “Hit” papers defined as the
top-5% highly cited papers
 Using indicators:
» Probability of a (N+C+)
HIGH NOVELTY,
HIGH CONVENTION
paper being a hit paper is 0.0911
» Probability of a (N-C-)
LOW NOVELTY,
LOW CONVENTION
paper being a hit paper is 0.0205

8
UMSJ ISSUES
 “Analyses in the supplementary materials (fig. S6) show that these
empirical regularities for the WOS taken as a whole are largely
replicated on a field-by-field basis and across time”
» Across time – YES
» Across fields or disciplines – NOT REALLY! – UMSJ supplemental results show that
the N+C+ bin has the highest probability (of the 4 bins) of containing a hit paper for
only 64% of the 243 subject categories
 The fact that the N+C+ bin is not ranked first in 36% of subject
categories is troubling, suggesting potentially large field effects, or even
individual journal effects
 Top-5% highly cited not sampled by field
 Journals may not be the right proxy for “areas of knowledge”

9
REPLICATION
 We used a different, but parallel, methodology to replicate the UMSJ
distributions and results
 Scopus data (2001-2010) – 12M articles, 226M references
 Included conference papers along with articles
 K50 statistics for co-cited journal pairs rather than Z-scores and Monte
Carlo simulations
» K50 has the same conceptual formulation as the Z-score:
(Nact – Nexp) / Normalization
» Expected values and normalization are based on row and column sums
 UMSJ procedures for calculating distributions, etc. were all followed

10
REPLICATION
 For the left tail, we used the 5th
percentile rather than the 10th
percentile to more closely
match UMSJ distributions
 Indicator distributions for the
median and left tail percentile
values are very similar to the
UMSJ distributions
» Differences in the tail percentile
curves have no effect on
indicators since the fractions of
articles at the zero point of all
curves are the same
Low
Convention
High
Convention
High
Novelty
Low
Novelty

11
REPLICATION
 Probabilities of hit papers 2001-2005 (top-5% highly cited) as of 2011
UMSJ (1990-2000) This study (2001-2005)
% sample Prob % sample Prob
High Novelty, High Convention (N+C+) 6.7% 0.0911 9.5% 0.0959
High Novelty, Low Convention (N+C-) 26% 0.0533 30.6% 0.0659
Low Novelty, High Convention (N-C+) 44% 0.0582 40.5% 0.0433
Low Novelty, Low Convention (N-C-) 23% 0.0205 19.4% 0.0205
 Our results are similar to the UMSJ results
» Higher probability for N+C+ (0.0959 to 0.0911) coupled with a higher fraction within
that bin (9.5% to 6.7%) suggest that our method does even a bit better at locating
highly cited papers.
» High novelty is accentuated overall using our method (N+C- is 0.0659 rather than
0.0533)
 Replication was successful, and reproduces the major features of the
UMSJ study

12
FIELD EFFECTS?
 2x2 matrix probabilities for the
top-5% sampled by field were
compared to the 2x2 matrix
probabilities using the top-5%
overall
 The bins are in the same
order using top-5% by field,
but the differences between
bins are smaller
» N+C+ (0.0834 vs 0.0959)
» N-C- (0.0335 vs. 0.0205)
 This suggests that “atypical
combinations” are influenced
by field effects

13
FIELD EFFECTS?
 Top 20 largest journals (by
numbers of co-citations) are
plotted in terms of convention
and novelty
» These 20 journals account for
15.9% of all co-citations
 Reminder note: Journal are
plotted here based on how
they are co-cited, not what is
published in them !
% co-citations above overall median
% co-citations below zero

14
FIELD EFFECTS?
 Three groups appear
» PHYSICS (6 journals) – cited as
conventional, but not novel
» BIOMED (9 journals) – cited as
both conventional and novel
» MULTI (5 journals) – cited as
novel and not conventional
 Nature, Science, and PNAS
account for 9.4% of ALL
atypical co-citation pairs
» Multidisciplinary journals are
obviously not good proxies for
“areas of knowledge”
» They contribute the most to the
notion of “atypical”, suggesting
that journals are a poor basis for
this study
% co-citations above overall median
% co-citations below zero

15
SUMMARY
 We have replicated the UMSJ study and primary finding that
» Papers with high convention AND high novelty are twice as likely to be highly cited
as the average paper
 This is a real finding! There seems to be something to the notion of
“atypical combinations” that is meaningful and could be predictive
 However …
 Field and journal effects are not insignificant, and given that these
studies were based on journal co-citation, journals and fields may be
driving “atypical combinations”
 Journals are the wrong proxy for “areas of knowledge”; we need an
alternative proxy for “areas of knowledge”
 Other potential measurements of “atypical-ness” or “novelty” that are
relatively independent of field or journal effects should be proposed and
tested

16
QUESTIONS
Thank-you for your attention !

Atypical combinations are confounded by disciplinary effects (Boyack & Klavans)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Atypical combinations are confounded by disciplinary effects (Boyack & Klavans)

Similar to Atypical combinations are confounded by disciplinary effects (Boyack & Klavans) (20)

Recently uploaded

Recently uploaded (20)

Atypical combinations are confounded by disciplinary effects (Boyack & Klavans)