1. Advanced citation matching and large-scale cited reference extraction
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
EXCITE Workshop 2017: “Challenges in Extracting and Managing References”
Cologne, Germany, March 30, 2017
2. Outline
• Citation matching
– Comparison of the accuracy of the Web of Science, CWTS, and
iFQ citation matching algorithms
• Cited reference extraction
– Assessment of the accuracy of cited references in Web of Science
based on Elsevier ScienceDirect data
3. Accuracy of the WoS, CWTS, and iFQ citation matching algorithms
5. Citation matching problem
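[Diagram: the cited references of a citing publication are matched to the corresponding records in a bibliographic database (WoS, Scopus); diagram labels: A, B, C]
Reference list of the citing publication:
…
[1] Hirsch, JE (2005) PNAS, 102, p.16569
[2] Egghe, L (2006) Scientist, 20, p.15
…
Records in the bibliographic database:
– An index to quantify an individual's scientific research output. Hirsch, JE. PNAS, 102(46), p.16569-72. UT: 000233462900010. Abstract …
– How to improve the h-index. Egghe, L. The Scientist, 20(3), p.15. UT: 000235634200013. Abstract …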
6. Why is citation matching difficult?
• ‘Big data’ problem
– No. of publications: 50 million
– No. of cited references: 1 billion
• Little data available on cited references in WoS
– First author (last name and initials)
– Source title (abbreviated)
– Publication year
– Volume number
– First page number
– (DOI)
• Errors in data
– Citation extraction errors
• OCR errors
• Interpretation errors due to different citation styles
– Typos and other human errors
• Example of cited reference data:
– /A Olensky, M
– /Y 2015
– /W J ASS INFORM SCI TEC
– /V 67
– /P 2550
7. Citation matching algorithm of WoS
• Little is known about the citation matching
algorithm used in WoS
• Larsen et al. (2007) concluded from their
investigation of missed matches in WoS that the
algorithm is quite conservative and does not allow
for any variations
8. Citation matching algorithm of CWTS
• The aim is to overcome the problem of missed
citation matches in WoS
• Iterative, rule-based algorithm:
1. Preprocessing
2. Start with the most restrictive matching rules
3. Continue with less restrictive matching rules
• Less restrictive matching rules allow for various
types of inaccuracies in the cited reference data
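The iterative structure described above can be sketched as follows. This is only a minimal illustration, not the CWTS implementation: the `Record` fields, the placeholder `preprocess` step, the two example rules, and the requirement that a rule must yield exactly one candidate are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal record shared by cited references and publications."""
    first_author: str
    year: int
    volume: str
    first_page: str


Rule = Callable[[Record, Record], bool]


def preprocess(rec: Record) -> Record:
    """Step 1: normalize case and whitespace (placeholder for real preprocessing)."""
    def clean(text: str) -> str:
        return " ".join(text.upper().split())
    return Record(clean(rec.first_author), rec.year,
                  clean(rec.volume), clean(rec.first_page))


def exact_rule(ref: Record, pub: Record) -> bool:
    """Most restrictive example rule: all fields must be identical."""
    return ref == pub


def year_tolerant_rule(ref: Record, pub: Record) -> bool:
    """Less restrictive example rule: the year may be off by one."""
    return (ref.first_author == pub.first_author
            and abs(ref.year - pub.year) <= 1
            and ref.volume == pub.volume
            and ref.first_page == pub.first_page)


# Rules ordered from most to least restrictive (steps 2 and 3).
RULES: List[Rule] = [exact_rule, year_tolerant_rule]


def match(ref: Record, pubs: List[Record],
          rules: List[Rule] = RULES) -> Optional[Tuple[int, int]]:
    """Return (rule index, publication index) for the first rule that
    yields exactly one candidate, or None if no rule does."""
    ref = preprocess(ref)
    cleaned = [preprocess(p) for p in pubs]
    for rule_index, rule in enumerate(rules):
        hits = [i for i, pub in enumerate(cleaned) if rule(ref, pub)]
        if len(hits) == 1:  # assumption: ambiguous hits are passed over
            return rule_index, hits[0]
    return None


pubs = [Record("Hirsch, JE", 2005, "102", "16569"),
        Record("Egghe, L", 2006, "20", "15")]
```

Recording which rule produced a match makes it possible to inspect the less restrictive rules separately, since those are the ones most likely to introduce false matches.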
9. Example matching rules
• Most restrictive matching rule:
– Exact match on
• first author
• publication year
• publication name
• volume number
• starting page number
• DOI
• Less restrictive matching rule:
– Match on
• Soundex encoding of the last name of the first author
• publication year plus or minus one
• volume number
• starting page number
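The two example rules can be written out as a short sketch. The field names, the "Last, Initials" author format, and the treatment of a missing DOI as `None` are assumptions for illustration; the Soundex function is the standard American Soundex, which may differ in detail from the variant used at CWTS.

```python
from dataclasses import dataclass
from typing import Optional


def soundex(name: str) -> str:
    """Standard American Soundex code (letter plus three digits)."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


@dataclass(frozen=True)
class Ref:
    """Hypothetical record used for both cited references and publications."""
    first_author: str          # assumed format: "Last, Initials"
    year: int
    source: str
    volume: str
    first_page: str
    doi: Optional[str] = None  # None when no DOI is available


def last_name(author: str) -> str:
    return author.split(",")[0].strip()


def most_restrictive(ref: Ref, pub: Ref) -> bool:
    """Exact match on all six fields (two missing DOIs compare equal here)."""
    return ref == pub


def less_restrictive(ref: Ref, pub: Ref) -> bool:
    """Soundex of the last name, year plus or minus one, volume, first page."""
    return (soundex(last_name(ref.first_author)) == soundex(last_name(pub.first_author))
            and abs(ref.year - pub.year) <= 1
            and ref.volume == pub.volume
            and ref.first_page == pub.first_page)


pub = Ref("Olensky, M", 2015, "J ASS INFORM SCI TEC", "67", "2550")
ref = Ref("Olenski, M", 2016, "J ASS INFORM SCI TEC", "67", "2550")
print(most_restrictive(ref, pub), less_restrictive(ref, pub))  # False True
```

The misspelled last name and the off-by-one year both defeat the exact rule, while the less restrictive rule tolerates them because "Olensky" and "Olenski" share the Soundex code O452.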
10. Citation matching algorithm of iFQ
• Iterative, rule-based algorithm
• Allows non-unique matches of a single cited
reference with several target articles
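A rule cascade that keeps non-unique matches might look like the sketch below. The slide gives no details of the iFQ rules, so the record keys, the two rules, and the sample data (including the author name "DOE J") are purely hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]


def same_on(*keys: str) -> Rule:
    """Build a rule that requires equality on the given keys."""
    return lambda ref, target: all(ref.get(k) == target.get(k) for k in keys)


# Hypothetical rules, ordered from most to least restrictive.
RULES: List[Rule] = [
    same_on("author", "year", "volume", "page"),
    same_on("author", "year", "page"),
]


def match_all(ref: Record, targets: List[Record],
              rules: List[Rule] = RULES) -> List[Record]:
    """Return every target accepted by the first rule that accepts any,
    instead of discarding the reference when the match is not unique."""
    for rule in rules:
        hits = [t for t in targets if rule(ref, t)]
        if hits:
            return hits
    return []


targets = [
    {"author": "DOE J", "year": "2010", "volume": "12", "page": "1"},
    {"author": "DOE J", "year": "2010", "volume": "13", "page": "1"},
]
ref = {"author": "DOE J", "year": "2010", "volume": "", "page": "1"}
print(len(match_all(ref, targets)))  # 2
```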
11. Data collection (1)
• Builds on data collected by Olensky (2015)
• Sample of 300 publications (cited pubs)
– 2 science domains
– 6 disciplines
– 2 languages
– 2 publication years
• 3975 corresponding cited references in WoS
– Times cited used to find cited references that are linked in WoS
– Cited reference search used to find cited references that are not
linked in WoS
15. Changes in CWTS citation matching algorithm
• Introduction of matching rules in which:
1. Volume and issue number are interchanged
2. Volume and first page number are interchanged
• Small change in the order in which the matching
rules are applied
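One possible reading of the two new rules is sketched below, using the Hirsch record from the earlier slide (PNAS 102(46), p. 16569) as the target. The interpretation is an assumption: since a WoS cited reference carries no issue field, "interchanged" is taken to mean that the issue number was recorded as the volume, or that the volume and first page swapped places. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedRef:
    author: str
    year: int
    volume: str
    first_page: str


@dataclass(frozen=True)
class Publication:
    author: str
    year: int
    volume: str
    issue: str
    first_page: str


def _same_author_year(ref: CitedRef, pub: Publication) -> bool:
    return ref.author == pub.author and ref.year == pub.year


def volume_issue_interchanged(ref: CitedRef, pub: Publication) -> bool:
    """The reference's volume field actually holds the issue number."""
    return (_same_author_year(ref, pub)
            and ref.volume == pub.issue
            and ref.first_page == pub.first_page)


def volume_page_interchanged(ref: CitedRef, pub: Publication) -> bool:
    """The reference's volume and first page fields are swapped."""
    return (_same_author_year(ref, pub)
            and ref.volume == pub.first_page
            and ref.first_page == pub.volume)


pub = Publication("HIRSCH JE", 2005, "102", "46", "16569")
issue_as_volume = CitedRef("HIRSCH JE", 2005, "46", "16569")
swapped = CitedRef("HIRSCH JE", 2005, "16569", "102")

print(volume_issue_interchanged(issue_as_volume, pub))  # True
print(volume_page_interchanged(swapped, pub))           # True
```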
17. Conclusions
• A significant number of citation matches are
missing in WoS
• Substantial improvement in recall is possible, but at
the cost of a small decrease in precision
• Citation matching algorithm of CWTS performs
quite well
• During the analysis, various problems were
detected in WoS cited reference extraction
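The recall and precision trade-off mentioned above follows the usual definitions: precision is the share of reported matches that are correct, and recall is the share of true matches that are reported. The sketch below computes both from sets of (cited reference, publication) pairs; the identifiers and the toy data are hypothetical and are not results from the study.

```python
from typing import Set, Tuple

Match = Tuple[int, str]  # (cited reference id, matched publication id)


def precision_recall(found: Set[Match], truth: Set[Match]) -> Tuple[float, float]:
    """Precision = correct matches / reported matches;
    recall = correct matches / true matches.
    An empty denominator yields 1.0 by convention."""
    correct = len(found & truth)
    precision = correct / len(found) if found else 1.0
    recall = correct / len(truth) if truth else 1.0
    return precision, recall


# Toy data: a conservative matcher versus a more lenient one.
truth = {(1, "A"), (2, "B"), (3, "C"), (4, "D")}
conservative = {(1, "A"), (2, "B")}
lenient = {(1, "A"), (2, "B"), (3, "C"), (4, "X")}

print(precision_recall(conservative, truth))  # (1.0, 0.5)
print(precision_recall(lenient, truth))       # (0.75, 0.75)
```

In this toy case the lenient matcher recovers one more true match but also reports one wrong match, which is the pattern of higher recall at the cost of lower precision.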
19. Introduction
• Aim: To determine the accuracy of WoS cited
references data
• Approach: Comparison of the cited references
extracted from the full text of Elsevier publications
with the cited references available in WoS
20. Data
• Elsevier full text data
– ScienceDirect API
– Subscription-based journal publications in the period 1987-2016
• WoS metadata
– Document types ‘article’ and ‘review’
• Matching of Elsevier full-text data and WoS
metadata at the level of individual publications
39. Incorrect cited references in WoS (6)
WoS cited reference: WANG J, 2006, CHINESE CHEM LETT, V17, P49
Original cited reference in publication: J. Wang, J.K. Carson, M.F. North, D.J. Cleland, Int. J. Heat Mass Transfer 49 (17) (2006) 3075–3083.

WoS cited reference: KANBER B, 2013, CEREBROVASC DIS S2, V35, P21
Original cited reference in publication: Kanber B, Hartshorne TC, Horsfield MA, Naylor AR, Robinson TG, Ramnarine KV. Dynamic variations in the ultrasound gray-scale median of carotid artery plaques. Cardiovasc Ultrasound 2013a;11:21.

WoS cited reference: EVANS P, 2010, TLS-TIMES LIT S 0326, P30
Original cited reference in publication: Evans PD, Chowdhury MJA. Photoprotection of wood using polyester-type UV absorbers derived from the reaction of 2 hydroxy-4(2,3-epoxypropoxy)-benzophenone with dicarboxylic acid anhydrides. J Wood Chem Technol 2010;30:186–204.
40. Incorrect cited references in WoS (7)
WoS cited reference: CAO X, 2010, IEEE GLOBECOMM 2010, V2010, P1
Original cited reference in publication: Cao, X., Zong, Z., Ju, X., Sun, Y., Dai, C., Liu, Q., Jiang, J., 2010. Molecular cloning, characterization and function analysis of the gene encoding HMG-CoA reductase from Euphorbia Pekinensis Rupr. Mol. Biol. Rep. 37, 1559–1567.

WoS cited reference: LI XY, 2013, NANJING NONGYE DAXUE, V36, P36
Original cited reference in publication: X. Li, S. Wang, Y. Chen, G. Liu, X. Yang, Overexpression of CD40 in sacral chordomas and its correlation with low tumor recurrence, Onkologie 36 (10) (2013) 567–571

WoS cited reference: ZHANG K, 2014, IEEE T PATTERN ANAL, V1, P1
Original cited reference in publication: K. Zhang, H. Chen, G. Wu, K. Chen, H. Yang, High expression of SPHK1 in sacral chordoma and association with patients’ poor prognosis, Med. Oncol. 31 (11) (2014) 247.
47. Conclusions
• About 0.3% of cited references are missing in WoS
• About 0.2% of cited references in WoS have minor
errors (e.g., incorrect publication year or volume
number)
• About 0.1% of cited references in WoS have major
errors (i.e., reference to completely incorrect target
document)
• WoS does a good job in handling references
pointing to multiple target documents
• These results are based on Elsevier publications
only; publications from other publishers may yield
different outcomes