1. :
,
W G, B T, S H
Centre for R&D Monitoring and Dept MSI, KU Leuven
2. Introduction
The validity of bibliometric results for research evaluation stands or
falls with the quality of the underlying data, which originate from various
sources:
• bibliographic databases,
• CVs or institutional publication lists,
• geographic information,
• funding information,
• project applications, etc.
☛ These sources are usually created for purposes other than use
within the framework of bibliometrics.
G ., Pre-bibliometric data processing, Cape Town, 2014 2/27
3. Introduction
Background
A distinct sub-discipline of bibliometrics, which can be described as
pre-bibliometric data processing, has therefore emerged.
☞ This task is not merely a kind of traditional “bibliometric technology”: it
requires specific research methodology aimed at processing and
preparing highly heterogeneous data from different sources for application to
bibliometrics and research evaluation.
Objective
The objective of the paper is to describe three typical tasks representing
this type of ‘pre-bibliometrics’.
• Author identification
• Institute assignment
• Address regionalisation
4. General issues
In principle, there are two possible ways to meet the challenge of
‘pre-bibliometrics’:
1. Using the database only (internal approach)
2. Using supplementary (external) sources such as publication lists or
CVs (extended approach)
☛ The second approach generally yields more valid results than the
first one, since it draws on additional and often more reliable data
sources than bibliographic data alone.
5. I. The challenge of author identification
Correct author identification is indispensable in micro-level studies, such
as studies of scientific careers or researchers’ mobility, or in monitoring
the composition and performance of research teams.
• In order to meet the quality requirements that are necessary at this
level, all (co-)authors of relevant publications need to be identified
and assigned to their affiliation.
◦ The latter task is supported by corresponding features of the two large
citation databases, Thomson Reuters’ Web of Science (since 2007) and
Elsevier’s Scopus.
6. Basic problems of author identification
Some basic problems of author identification in bibliographic databases
are related to their names:
Synonyms
• Spelling variants (e.g., umlauts, transliteration, name particles,
double names)
• Misprints
• Initials
• Name changes
• Database standard
7. Intensifying co-authorship
The problem of synonyms
                 Variant 1         Variant 2                    Variant 3
Umlaut           Glänzel           Glanzel                      Glaenzel
Transliteration  王悦              Wang, Y
Name particles   Van De Broek, I   Broek, I Vande / Broek, IV   Vandebroek, I
Initials         Wemans, Andre     Wemans, ADV                  Wemans, A
Name change      Petre, Camelia    Stanciu, Camelia             Camelia, Stanciu
Database         VANRAAN, AFJ      VanRaan, AFJ                 Van Raan, AFJ
Source: H ., ISSI 2013, 2013
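Several of the synonym types above (umlauts, name particles, database casing) can be collapsed by a simple name standardisation step. The following is a minimal sketch only, not the procedure used by the authors; the function names and the digraph table are assumptions made for illustration.

```python
import unicodedata

# Assumption: German-style digraph spellings of umlauts, as in Glänzel/Glaenzel.
UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def fold(text):
    """Strip diacritics: 'ä' becomes 'a', 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def surname_keys(surname):
    """Return the set of standardised keys under which a surname may appear."""
    lowered = unicodedata.normalize("NFC", surname).lower()
    plain = fold(lowered)  # Glänzel -> glanzel
    digraph = fold("".join(UMLAUT_DIGRAPHS.get(ch, ch) for ch in lowered))  # -> glaenzel
    # Drop spaces, hyphens and apostrophes so that particles are glued together.
    return {"".join(ch for ch in v if ch.isalnum()).upper() for v in (plain, digraph)}


def same_surname(a, b):
    """True if the two spellings share at least one standardised key."""
    return bool(surname_keys(a) & surname_keys(b))
```

With this sketch, "Glanzel" and "Glaenzel" do not match each other directly; they are linked only through the original form "Glänzel", which carries both keys. Such a key merges spelling variants but does nothing to separate homonyms.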
8. Basic problems of author identification
Further problems of author identification in bibliographic databases
• Homonyms
• Incomplete records
◦ Incomplete double names or first names/initials
◦ Missing links with affiliation
◦ Missing, incomplete or incorrect corporate address
◦ Ambiguous or changing email address
• Mobility
• Institutional restructuring
9. Possible solutions
The “internal” approach
Author IDs
Database providers made very early attempts to identify authors
uniquely. There are two basic approaches.
1. Identification by the database provider, for example,
◦ Mathematical Reviews Author ID (since 1940, first manually, from
1985 on as an automated process)
◦ Elsevier’s AuthorID (since 2006, automated process with author
feedback)
2. Identification by authors, for example,
◦ TR ResearcherID (authors are fully responsible for their IDs)
◦ Open Researcher & Contributor ID – ORCID (online since October 2012),
compatible with other IDs (WoS, Scopus, PubMed) and various links
☞ Both approaches have advantages and disadvantages, but ambiguity and
incorrectness cannot be completely excluded in either of them.
10. The “internal” approach
Thomson Reuters ResearcherID
Several issues are known:
• RIDs are not only used by individual authors. Some institutes and
author groups mark their publications by an RID.
• Some RIDs claim several papers while the researcher name does not
match any of the authors.
• An RID is not always unique. Some authors hold several RIDs and
claim the same papers under each of them.
11. The “internal” approach
Some basic statistics on RIDs (WoS: 2009 – 2011) according to H . (2013)
Some conclusions
- The example of China shows the necessity of name disambiguation.
- RID cannot yet be used to substitute author names.
- Email addresses, affiliation and co-authors can be used for further
identification with severe limitations (e.g., mobility, institutional
restructuring, etc.).
Figure: Relative frequency of publication activity of RID authors
(bars) vs. all authors (line)
12. The “internal” approach
Automated identification on a large scale
Most techniques are based on a combination of name standardisation
and additional components that are considered characteristic for an
author.
• Combinations of name and affiliation/location or subject, shared
co-authors/partners (T ., 2006)
• Some of the combinations are used to form identity records (B,
2011).
• Approximate Structural Equivalence (T W, 2010) uses
authors’ “bibliometric fingerprints”. In practice, a form of
bibliographic coupling and direct citations are used.
☞ None of these methods can eliminate all Type I and II errors.
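The idea of combining a standardised name with additional characteristic components can be illustrated with a small evidence score. This is a sketch under stated assumptions, not any of the cited methods: the `Record` fields, the scoring rule and the threshold are hypothetical, and the sample records used with it are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One author occurrence on a paper (hypothetical structure)."""
    name_key: str          # standardised name, e.g. "WANG, Y"
    affiliation: str       # standardised main institution
    coauthors: frozenset   # standardised names of the other authors


def evidence(a, b):
    """Count the signals suggesting that two occurrences are the same person."""
    if a.name_key != b.name_key:
        return 0  # different names are not compared in this sketch
    score = len(a.coauthors & b.coauthors)  # one point per shared co-author
    if a.affiliation == b.affiliation:
        score += 1                          # one point for the same institution
    return score


def likely_same_author(a, b, threshold=2):
    """Assumed decision rule: at least `threshold` independent signals."""
    return evidence(a, b) >= threshold
```

A low threshold produces false positives for frequent names at large institutions, while a high threshold produces false negatives for mobile authors, which is the Type I and II trade-off noted above.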
13. The “extended” approach
External sources can be used to identify authors in bibliographic
databases.
The most convenient way to use external sources is matching the
author’s publication list with database records.
• On a large scale manual identification has become extremely
difficult.
☞ Since 1985 the process of author identification of the Mathematical
Reviews database has been automated.
Automated processes can substantially facilitate the matching of
external sources with bibliographic databases, but they also produce
false positives and false negatives.
☞ The objective is to exclude false positives and to considerably reduce
the number of false negatives by final manual correction.
14. The “extended” approach
CVs are reliable sources of the authors’ publication records. However,
(author) names alone are not sufficient for a correct match because of the
issues described in the previous section.
Main issues to be tackled when matching external data with database
records:
• Different standards of the two sources
• Incomplete, erroneous or censored data in the external source
• Database errors
15. The “extended” approach
Similar errors can be found in external reference sources as well, and CVs
often do not follow their own standard consistently.
This problem can be solved by including more components such as
complete title, journal title, co-authors and by using similarity measures
and scores instead of exact matches.
• Since name sequences and texts are sensitive to errors and
misprints, similarity measures can be used to assess the
“probability” of a positive match.
• One possible solution is the use of character N-grams in conjunction
with edit (Levenshtein) distance (A T, 2013).
☞ N-gram based edit distances are sensitive to the order of
components.
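A similarity score of this kind can be sketched by computing the edit (Levenshtein) distance over sequences of character 3-grams instead of single characters. This is an illustrative variant only: the normalisation to the longer sequence and the function names are assumptions, and the sketch does not claim to reproduce the scores of the cited study.

```python
def ngrams(text, n=3):
    """Split a string into overlapping character n-grams (case-insensitive)."""
    s = " ".join(text.upper().split())  # collapse whitespace, ignore case
    if len(s) < n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def ngram_similarity(a, b, n=3):
    """Score in [0, 1]; 1.0 means identical n-gram sequences."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 1.0 - edit_distance(ga, gb) / max(len(ga), len(gb))
```

Because the distance is taken over ordered sequences, swapping two components (for instance, two author names) lowers the score even though the same 3-grams occur, which is the order sensitivity noted above.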
16. The “extended” approach
3-gram based similarity scores for the CV entry ‘Zhang, L., Thijs, B., Glänzel, W.,
The diffusion of H-related literature. JOI, 2011, 5 (4), 583-593’
Variable  Score  WoS record
1         0.67   ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
2         0.73   ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, 5, 583, 2011
3         0.34   The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
4         0.65   ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature
5         0.36   The diffusion of H-related literature
6         0.41   The diffusion of H-related literature, JOURNAL OF INFORMETRICS
Source: A T, ISSI 2013, 2013
17. The “extended” approach
The method based on 3-grams achieved 95% accuracy in a sample set
(A T, ISSI 2013, 2013).
Two problems remain
• False negatives result from unsuccessful matches. The authors concerned
are thus incompletely identified.
• False positives are a major issue since they indicate that a match
was incorrect. The authors concerned are thus incorrectly identified.
☞ Similar methods are used in spam detection (cf. K .,
International Journal on Artificial Intelligence Tools, 2006).
18. II. Institutional assignment
In practice, the quality of the data in the address fields of bibliographic
databases does not meet the requirements of institutional assessment
with strong bibliometric components.
Typical tasks are
• Studies of institutional research performance
• University ranking
• Funding formulas with bibliometric components for universities or
research institutes
• Regional performance indicators
Tasks in Flanders that require correctness of address data at the level
of main institutions
• One major additional funding source and mechanism for
Flemish universities is the “Bijzonder Onderzoeksfonds” (BOF).
• A periodical task is the evaluation of “Strategic Research
Centres” (SOCs).
19. Institutional assignment
Example of the variety of spelling variants in the WoS database
UNIV ERLANGEN NURNBERG
FRIEDRICH ALEXANDER UNIV
UNIV HOSP ERLANGEN
UNIV ERLANGEN
UNIV ERLANGEN NUREMBERG
FRIEDRICH ALEXANDER UNIV ERLANGEN NURNBERG
UNIV HOSP ERLANGEN NURNBERG
UNIVERSITAT ERLANGEN NURNBERG
KLINIKUM UNIV ERLANGEN NURNBERG
FRIEDRICH ALEXANDER UNIVERSITAT ERLANGEN NURNBERG
UNIV ERLANGEN NURNBERG POLIKLIN
POLIKLIN UNIV ERLANGEN NURNBERG
UNIV ERLANGEN NURNBERG KLINIKUM
FRIEDRICH ALEXANDER UNIV ERLANGEN
ERLANGEN UNIV HOSP
FAU UNIV ERLANGEN
KLINIKUM FRIEDRICH ALEXANDER UNIV ERLANGEN NURNBE
UNIV ERLANGEN NURNBERG KLIN
UNIV KLINIKUM ERLANGEN NURNBERG
ERLANGEN UNIV
UNIV ERLANGEN NURNBERG HOSP
FRIEDRICH ALEXANDER UNIV ERLANGEN NUREMBERG
FRIEDRICH ALEXANDER UNIV POLIKLIN
UNIV CLIN ERLANGEN NURNBERG
UNIV ERLANGEN NURNEMBERG
UNIV ERLANGER NURNBERG
DR REMEIS STERNWARTE UNIV ERLANGEN NURNBERG
ERLANGEN NUREMBERG UNIV
ERLANGEN NUREMBURG UNIV
FAU UNIV
FREDRICH ALEXANDER UNIV ERLANGEN NURNBERG
FRIEDRICH ALEXANDER UNIV ERLANGEN NUERNBERG
FRIEDRICH ALEXANDER UNIV NURNBERG
FRIEDRICH ALEXANDER UNIV NURNBERG ERLANGEN
HOSP UNIV ERLANGEN NURNBERG
KLIN FRIEDRICH ALEXANDER UNIV ERLANGEN NURNBERG
KOPFKLIN UNIV ERLANGEN NURNBERG
KUNIV ERLANGEN NURNBERG
UB ERLANGEN NURNBERG
UNIV ERLANDGEN NURNBERG
UNIV ERLANGEN NURNBERG KINDERKLIN
UNIV ERLANGEN NURNBERG KOPFKLINIKUM
UNIV ERLANGEN NURNBERG PSYCHIAT KLIN
UNIV ERLANGEN NURNBURG
UNIV ERLANTEN NURNBERG
UNIV EYE CLIN ERLANGEN NURNBERG
UNIV EYE HOSPITAL ERLANGEN NURNBERG
UNIV GRLANGEN NURNBERG
UNIV KLIN ERLANGEN NURNBERG
UNIV LIB ERLANGEN NURNBERG
UNIV NURNBERG ERLANGEN
FRIEDRICH ALEXANDER UNIV ERLANGEN NURNBERG (#69)
Data source: Thomson Reuters Web of Science
20. Institutional assignment
The BOF and SOCs cleaning procedure
1. Preparation of a complete set of spelling variants of university (or institute)
names in order to assign publications to each of the universities.
2. For further automation of the filtering and assignment process, all street
and city addresses at which research groups of any of the
universities/institutes under study are located are added.
3. If the first two filters are not sufficient to assign papers correctly, author
names can be used in addition for a final (and often manual) check of paper
assignment.
4. Finally, all publications are validated, corrected and supplemented by the
corresponding universities (institutes). This is especially important if a
corporate address is not (completely) recorded in the database.
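The first two filters of this procedure can be sketched as two lookups, with everything else passed on to the manual check. The sketch below is illustrative only: the variant set is a small excerpt of the Erlangen example, the street address is an assumed placeholder, and the function names are hypothetical.

```python
CANONICAL = "FRIEDRICH ALEXANDER UNIV ERLANGEN NURNBERG"

# Filter 1: known spelling variants (excerpt from the WoS example list).
NAME_VARIANTS = {
    "UNIV ERLANGEN NURNBERG",
    "FRIEDRICH ALEXANDER UNIV",
    "UNIV ERLANGEN NUREMBERG",
    "FAU UNIV ERLANGEN",
    "UNIV ERLANTEN NURNBERG",  # misprint found in the database
    CANONICAL,
}

# Filter 2: street and city addresses of research groups (assumed placeholder).
KNOWN_ADDRESSES = {("SCHLOSSPLATZ 4", "ERLANGEN")}


def normalise(text):
    """Upper-case and collapse whitespace, as in database address fields."""
    return " ".join(text.upper().split())


def assign_institution(org, street="", city=""):
    """Return (institution, reason); institution is None if unresolved."""
    if normalise(org) in NAME_VARIANTS:
        return CANONICAL, "name variant"
    if (normalise(street), normalise(city)) in KNOWN_ADDRESSES:
        return CANONICAL, "address"
    return None, "manual check"  # steps 3 and 4: author names, validation
```

Records that fall through both filters are not rejected; they are the ones that need the author-based check and the validation by the institutions themselves.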
21. Institutional assignment
Further possibilities of automation in the “internal” and “extended” approach:
• Matching identified spelling variants with address data in order to unify and
code address fields.
• Matching external sources using identification keys with bibliographic components.
◦ Publication lists provided by institutions usually follow standards more strictly than
authors do in their CVs.
Example of building bibliographic identification keys and matching those with WoS data
nanophysics group
Lehrstuhl für Festkörperphysik Jörg P. Kotthaus
Ludwig-Maximilians-Universität München
K. Karrai, R. J. Warburton, C. Schulhauser, A. Högele, B. Urbaszek, E.J. McGhee, A. Govorov, J. M. Garcia, B. D. Gerardot, and
P. M. Petroff
"Hybridization of electronic states in quantum dots through photon emission"
Nature 427, 135-138 (2004) ⇒ 2004-KARR-427-135-N
E. M. Weig, R. H. Blick, T. Brandes, J. Kirschbaum, W. Wegscheider, M. Bichler, and J. P. Kotthaus
"Single-Electron-Phonon Interaction in a Suspended Quantum Dot Phonon Cavity"
Phys. Rev. Lett. 92, 046804-1-4 (2004) ⇒ 2004-WEIG-92-46804-P
S. de Haan, A. Lorke, J. P. Kotthaus, W.Wegscheider, and M. Bichler
"Rectification in Mesoscopic Systems with Broken Symmetry: Quasiclassical Ballistic Versus Classical Transport"
Phys. Rev. Lett. 92, 056806-1 -4 (2004) ⇒ 2004-DE H-92-56806-P
A. Högele, B. Alen, F. Bickel, R. J. Warburton, P. M. Petroff, and K. Karrai
"Exciton fine structure splitting of single InGaAs self-assembled quantum dots"
Physica E 21, 175 – 179 (2004) ⇒ 2004-HOGE-21-175-P
Wendy U. Dittmer and Friedrich C. Simmel
"Transcriptional Control of DNA-Based Nanomachines"
Nano Letters 4, 689 - 691 (2004) ⇒ 2004-WEND-4-689-N
Dominik V. Scheible, and Robert H. Blick
"Silicon nanopillars for mechanical single-electron transport
App. Phys. Lett. 84, 4632 - 4634 (2004) ⇒ 2004-DOMI-84-4634-A
nanophysics group
Lehrstuhl für Festkörperphysik Jörg P. Kotthaus
Ludwig-Maximilians-Universität München
K. Karrai, R. J. Warburton, C. Schulhauser, A. Högele, B. Urbaszek, E.J. McGhee, A. Govorov, J. M. Garcia, B. D. Gerardot, and
P. M. Petroff
"Hybridization of electronic states in quantum dots through photon emission"
Nature 427, 135-138 (2004) ⇒ UT ISI:000187863900030
E. M. Weig, R. H. Blick, T. Brandes, J. Kirschbaum, W. Wegscheider, M. Bichler, and J. P. Kotthaus
"Single-Electron-Phonon Interaction in a Suspended Quantum Dot Phonon Cavity"
Phys. Rev. Lett. 92, 046804-1-4 (2004) ⇒ UT ISI:000188747600042
S. de Haan, A. Lorke, J. P. Kotthaus, W.Wegscheider, and M. Bichler
"Rectification in Mesoscopic Systems with Broken Symmetry: Quasiclassical Ballistic Versus Classical Transport"
Phys. Rev. Lett. 92, 056806-1 -4 (2004) ⇒ UT ISI:000188785200045
A. Högele, B. Alen, F. Bickel, R. J. Warburton, P. M. Petroff, and K. Karrai
"Exciton fine structure splitting of single InGaAs self-assembled quantum dots"
Physica E 21, 175 – 179 (2004) ⇒ UT ISI:000220873300005
Wendy U. Dittmer and Friedrich C. Simmel
"Transcriptional Control of DNA-Based Nanomachines"
Nano Letters 4, 689 - 691 (2004) ⇒ UT ISI:000220857800030
Dominik V. Scheible, and Robert H. Blick
"Silicon nanopillars for mechanical single-electron transport
App. Phys. Lett. 84, 4632 - 4634 (2004) ⇒ UT ISI:000221656900012
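The identification keys in this example appear to combine the publication year, the first four characters of the first author string (without leading initials and diacritics), the volume, the first page without leading zeros and the first letter of the journal. The sketch below builds a key under that assumption; the rule is inferred from the examples rather than stated in the source, and `build_key` is a hypothetical name.

```python
import re
import unicodedata


def ascii_fold(text):
    """Strip diacritics: 'Högele' becomes 'Hogele'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_key(first_author, year, volume, first_page, journal):
    """Build a key such as 2004-KARR-427-135-N from parsed reference fields."""
    # Drop leading initials such as "K." or "E.J."; a written-out first name stays.
    name = re.sub(r"^(?:[A-Z]\.\s*)+", "", first_author.strip())
    author_part = ascii_fold(name)[:4].upper()
    page_part = str(first_page).lstrip("0") or "0"
    journal_part = ascii_fold(journal).strip()[:1].upper()
    return f"{year}-{author_part}-{volume}-{page_part}-{journal_part}"
```

Under this rule a first name that is written out ends up in the key ("Wendy U. Dittmer" gives WEND), so the same key has to be built in the same way on the database side, or the author part has to be matched with some tolerance.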
22. III. Regional assignment
Dealing with data at the regional level raises several challenging
issues:
• The definition or demarcation of a region is not always clear as
administrative, geographical and economic regions might differ.
• Also similar names might refer to different entities.
• The validity of benchmarking studies requires the definition of
regions at the same or at least a comparable level of aggregation.
• Regional information is rarely included in the bibliographic
databases. This results in the necessity of an “extended approach”
using external sources to assign addresses to a particular region.
23. Regional assignment
First main issue:
Analogously to the case of subject classification schemes, comparability of results
might easily be endangered when using different hierarchical or aggregation
levels.
Some solutions
- Use predefined classification schemes like NUTS (OECD) or UN Geoscheme.
- Precompile (or use existing) classification tables, e.g.
http://en.wikipedia.org/wiki/List_of_postal_codes_in_South_Korea
Application at ECOOM
- “Specialization profiles and diagnostic tools” within the OECD Smart
Specialization Project
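A precompiled classification table of the kind mentioned under “Some solutions” can be as simple as a sorted list of postal code ranges. The sketch below uses a simplified table for Belgium with NUTS-2 codes; it is an illustration compiled for this example, not a table used by the authors, and it ignores special-purpose postal codes and local exceptions.

```python
from bisect import bisect_right

# (first postal code of the range, NUTS-2 code, province or region)
BE_POSTAL_RANGES = [
    (1000, "BE10", "Brussels-Capital Region"),
    (1300, "BE31", "Walloon Brabant"),
    (1500, "BE24", "Flemish Brabant"),
    (2000, "BE21", "Antwerp"),
    (3000, "BE24", "Flemish Brabant"),
    (3500, "BE22", "Limburg"),
    (4000, "BE33", "Liege"),
    (5000, "BE35", "Namur"),
    (6000, "BE32", "Hainaut"),
    (6600, "BE34", "Luxembourg"),
    (7000, "BE32", "Hainaut"),
    (8000, "BE25", "West Flanders"),
    (9000, "BE23", "East Flanders"),
]
RANGE_STARTS = [start for start, _, _ in BE_POSTAL_RANGES]


def region_for(postal_code):
    """Map a Belgian postal code to (NUTS-2 code, name), or None if invalid."""
    digits = str(postal_code).strip()
    if not digits.isdigit() or not 1000 <= int(digits) <= 9999:
        return None
    index = bisect_right(RANGE_STARTS, int(digits)) - 1
    _, nuts2, name = BE_POSTAL_RANGES[index]
    return nuts2, name
```

Fixing the table at one hierarchical level (here NUTS-2) for all addresses is what keeps the resulting regional indicators comparable.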
24. Regional assignment
Because of the lack of regional information in bibliographic data the
“extended” approach using external sources is the most promising
solution.
Novel techniques based on web services allow the retrieval of regional
information from address information (city/postal code, street and
country). Above all, the following three services are useful for this
purpose.
• Geonames API (http://www.geonames.org)
• Google Places API (https://developers.google.com/places/)
• Wikipedia Infobox (accessible through http://dbpedia.org/)
25. Regional assignment
The first and second services (Geonames and Google Places, respectively)
provide a bottom-up approach, accepting individual addresses as input
and returning a broad range of information (coordinates, postal code,
country codes, administrative levels). Both provide output in XML
format.
Google Places is, however, limited to a small number of daily requests.
Geonames is more open and has a more generous limit of 30,000 requests
per day.
The third source (Wikipedia Infobox) allows a top-down approach.
These web services can return incorrect or inadequate results. They
proved to be useful for requests that contain at least three
different address components, such as country, city, postal code or street name.
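The bottom-up lookup can be sketched as two small steps: building a request from the address components and reading the region from the response. The sketch assumes the JSON variant of the Geonames search service (`searchJSON`) rather than the XML output mentioned above, and the field names `geonames`, `adminName1` and `countryCode` reflect that service as understood here and should be checked against its documentation. The account name is a placeholder, and no request is actually sent.

```python
from urllib.parse import urlencode

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(components, username, max_rows=1):
    """Build a search URL from address components (city, postal code, ...)."""
    parts = [c.strip() for c in components if c and c.strip()]
    if len(parts) < 3:
        # Design choice of this sketch: enforce the three-component rule of thumb.
        raise ValueError("need at least three address components")
    query = urlencode({"q": " ".join(parts), "maxRows": max_rows,
                       "username": username})
    return f"{GEONAMES_SEARCH}?{query}"


def extract_region(payload):
    """Read country and first-level region from a decoded JSON response."""
    hits = payload.get("geonames", [])
    if not hits:
        return None  # no match: leave the address for manual assignment
    top = hits[0]
    return {"country": top.get("countryCode"),
            "region": top.get("adminName1")}
```

Since the services can return incorrect results, the extracted region is best treated as a candidate and cross-checked, for example against the country already recorded in the address field.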
26. Conclusions
• All automated processes produce errors.
• Computerised techniques can considerably facilitate author
identification through name disambiguation and the use of authors’
publications.
• Significant manual effort is still necessary to resolve all remaining
issues and to complete author identification.
• The same issues also apply, mutatis mutandis, to institutional
and regional cleaning, identification and assignment.
27. Thank you very much for your attention.