Senior Researcher at Centre for Science and Technology Studies
Scientometric approaches to classification
Presentation at the Colloquium Research Information Systems and Science Classifications: Revisiting the NARCIS Classification, Museum Meermanno, The Hague, The Netherlands, September 28, 2018.
1. Scientometric approaches to classification
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Colloquium Research Information Systems and Science Classifications: Revisiting the NARCIS Classification
Museum Meermanno, The Hague, The Netherlands
September 28, 2018
2. Outline
• Bibliographic databases
• Classification systems of scientific literature
• CWTS publication-level classification system of science
– Methodology
– Structure
– Applications
• Quality of classification systems
7. Classification systems of scientific literature
• Mono-disciplinary vs. multidisciplinary
• Journal-level vs. publication-level
• Manual vs. algorithmic
8. Classification systems of scientific literature
• Mono-disciplinary:
– Chemical Abstracts: 80 different sections and 5 broad headings
– EconLit: Journal of Economic Literature (JEL) classification system
– PubMed: Medical Subject Headings (MeSH)
• Multidisciplinary:
– Web of Science: 250 categories
– Scopus (ASJC): bottom level has 304 categories and top level includes 27 categories
– Science-Metrix: 176 categories
– National Science Foundation (NSF): 125 categories
– University of California, San Diego (UCSD): more than 500 categories
– Australian and New Zealand Standard Research Classification (FoR): 3 hierarchical levels
10. Algorithmic classification system of science
• First version created in 2012
• Publications (not journals) are clustered into research areas based on citation
relations
• Research areas are defined at different levels of granularity and are
organized hierarchically
• Clustering is performed using the smart local moving algorithm (improved
Louvain algorithm; Waltman & Van Eck, 2013)
11. Objectives
To create a classification system
• in a fully algorithmic manner
• covering all sciences and social sciences
• at the level of individual publications
• with a hierarchical structure
• using transparent, freely available algorithms
• without excessive computational requirements
12. Main challenges
• Dealing with huge volumes of data
• Avoiding disciplinary biases
• Reaching a high level of accuracy
• Being flexible in terms of number of hierarchical levels and size of research
areas
• Obtaining proper labels for the research areas
• Keeping the methodology reasonably simple and transparent
13. Dealing with huge volumes of data
• Linking publications based on direct citations only; no co-citations,
bibliographic coupling, or word co-occurrences
• Efficient clustering algorithm based on ideas taken from:
– Newman (2004): Modularity-based clustering
– Blondel et al. (2008): ‘Louvain method’
– Waltman et al. (2010): VOS clustering technique
– Rotta & Noack (2011): Multilevel local search algorithms
14. Avoiding disciplinary biases
• $c_{ij}$: Relatedness of publications i and j, i.e., 1 if there is a direct citation
relation between i and j, 0 otherwise
• $a_{ij}$: Normalized relatedness of publications i and j, defined as
$$a_{ij} = \frac{c_{ij}}{\sum_k c_{ik}}$$
• Similar to fractional citation counting (Small & Sweeney, 1985)
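The normalization above can be illustrated with a minimal Python sketch. It assumes citations arrive as (citing, cited) pairs and treats a citation relation as undirected, following the definition of the relatedness on this slide. The function name and the example identifiers are hypothetical and not taken from the presentation.

```python
from collections import defaultdict


def normalized_relatedness(citations):
    """Return {(i, j): a_ij} for an iterable of (citing, cited) pairs."""
    # c_ij = 1 if i cites j or j cites i, so neighbours are stored symmetrically.
    neighbours = defaultdict(set)
    for citing, cited in citations:
        if citing != cited:  # ignore self-citations of a single publication
            neighbours[citing].add(cited)
            neighbours[cited].add(citing)

    a = {}
    for i, linked in neighbours.items():
        total = len(linked)  # sum over k of c_ik
        for j in linked:
            a[(i, j)] = 1.0 / total  # c_ij / sum_k c_ik, with c_ij = 1
    return a


# Hypothetical toy data: p1 is linked to three publications, p2 to one.
example_citations = [("p1", "p2"), ("p1", "p3"), ("p4", "p1")]
a = normalized_relatedness(example_citations)
```

Note that the normalized relatedness is not symmetric: a publication with many citation relations spreads its weight over all of them, which is what dampens differences in citation density between disciplines.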
15. Reaching a high level of accuracy
• Clustering technique based on maximization of a quality function:
$$\sum_i \sum_j \delta(x_i, x_j)\,(a_{ij} - r)$$
• $x_i$ denotes the cluster (research area) to which publication i is assigned
• $\delta(x_i, x_j) = 1$ if $x_i = x_j$ and 0 otherwise
• r denotes a resolution parameter
• Quality function is maximized with respect to $x_1, \ldots, x_n$
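As a rough illustration of how the resolution parameter acts, the sketch below evaluates the quality function for a given assignment of publications to clusters. It only evaluates the function and does not implement the smart local moving algorithm. Terms with i = j are skipped, which is an assumption of this sketch that shifts the value by a constant and does not change which assignment is best. The function name and toy data are hypothetical.

```python
def quality(a, assignment, resolution):
    """Sum of (a_ij - r) over all ordered pairs i != j in the same cluster."""
    publications = list(assignment)
    total = 0.0
    for i in publications:
        for j in publications:
            if i != j and assignment[i] == assignment[j]:
                # Pairs without a citation relation contribute -r.
                total += a.get((i, j), 0.0) - resolution
    return total


# Hypothetical normalized relatedness for a chain p1 - p2 - p3.
a = {("p1", "p2"): 1.0, ("p2", "p1"): 0.5,
     ("p2", "p3"): 0.5, ("p3", "p2"): 1.0}

one_cluster = {"p1": 0, "p2": 0, "p3": 0}
two_clusters = {"p1": 0, "p2": 0, "p3": 1}

low_r = (quality(a, one_cluster, 0.1), quality(a, two_clusters, 0.1))
high_r = (quality(a, one_cluster, 0.6), quality(a, two_clusters, 0.6))
```

With the low resolution the single cluster scores higher, while with the high resolution the split scores higher, which is the sense in which a larger r leads to smaller research areas.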
16. Being flexible in terms of number of hierarchical levels
and size of research areas
• Three types of parameters:
– Number of hierarchical levels
– Each level’s resolution parameter
– Each level’s minimum number of publications per research area
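The three types of parameters could be captured in a small configuration structure, sketched below. All names and numeric values are hypothetical placeholders, not the settings used by CWTS. The consistency check encodes an assumption of this sketch: each finer level uses a higher resolution and a minimum size no larger than the level above it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelParameters:
    name: str
    resolution: float      # resolution parameter r for this level
    min_publications: int  # minimum number of publications per research area


def check_hierarchy(levels):
    """Return a list of problems found when moving from coarse to fine levels."""
    problems = []
    for upper, lower in zip(levels, levels[1:]):
        if lower.resolution <= upper.resolution:
            problems.append(f"{lower.name}: resolution should exceed {upper.name}")
        if lower.min_publications > upper.min_publications:
            problems.append(f"{lower.name}: minimum size should not exceed {upper.name}")
    return problems


# Hypothetical values for illustration only; the number of hierarchical
# levels is simply the length of the list.
levels = [
    LevelParameters("level 1", resolution=1e-6, min_publications=100_000),
    LevelParameters("level 2", resolution=1e-5, min_publications=5_000),
    LevelParameters("level 3", resolution=1e-4, min_publications=500),
]
problems = check_hierarchy(levels)
```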
17. Obtaining proper labels for the research areas
1. Identification of terms in titles and abstracts of articles using part-of-speech
tagging
2. Calculation of term relevance scores based on a combination of a term’s
absolute and relative frequency of occurrence
3. Selection of the most relevant terms based on term relevance scores
combined with a filter for removing similar terms
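The labeling steps can be sketched in simplified form as follows. The sketch assumes terms have already been extracted per publication, so part-of-speech tagging is not shown. The relevance score (occurrences in the research area multiplied by the share of the term's occurrences that fall in that area) and the substring-based similarity filter are illustrative assumptions, not the formula used in the CWTS system.

```python
from collections import Counter


def label_cluster(cluster_terms, all_terms, top_n=3):
    """Pick up to top_n label terms for one research area.

    cluster_terms: list of term lists, one per publication in the area.
    all_terms: list of term lists for all publications, including the area's.
    """
    in_cluster = Counter(t for terms in cluster_terms for t in set(terms))
    overall = Counter(t for terms in all_terms for t in set(terms))

    # Hypothetical score: absolute frequency times relative frequency.
    scores = {t: n * (n / overall[t]) for t, n in in_cluster.items()}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))

    selected = []
    for term in ranked:
        # Hypothetical similarity filter: skip terms contained in, or
        # containing, an already selected term.
        if any(term in s or s in term for s in selected):
            continue
        selected.append(term)
        if len(selected) == top_n:
            break
    return selected


# Hypothetical toy data.
cluster = [["h index", "citation"],
           ["h index", "impact factor"],
           ["citation", "h index variant"]]
other = [["citation", "graphene"], ["graphene", "citation"]]
labels = label_cluster(cluster, cluster + other)
```

In this toy example "citation" is frequent in the area but also common elsewhere, so it ranks below "h index", and "h index variant" is filtered out as too similar to a term already chosen.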
18. CWTS publication-level classification system of
science
• 21.2 million publications from the period 2000–2017 indexed in Web of
Science
• 374.1 million citation relations
• Classification system of 3 hierarchical levels:
– 22 broad disciplines
– 868 fields
– 4,047 subfields
• Computational performance: less than 2 hours
19. Breakdown of scientific literature into 22 broad disciplines
[Figure: visualization of the 22 broad disciplines. Labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
21. Breakdown of scientific literature into 868 fields
[Figure: visualization of the 868 fields. Labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
22. Breakdown of scientific literature into 4,047 subfields
[Figure: visualization of the 4,047 subfields. Labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
23. Breakdown of scientific literature into 4,047 subfields
[Figure: visualization of the 4,047 subfields with the Scientometrics subfield marked. Labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
24. Summary of scientometrics subfield
Cluster: 145
No. publications: 16,312
• Top 5 terms (No. pubs)
– bibliometric analysis: 852
– impact factor: 495
– h index: 264
– peer review: 515
– citation: 642
• Top 5 publications (No. cits)
– hirsch, je (2005). an index to quantify an individual's scientific research output. p natl acad sci usa, 102(46), 16569-16572: 2,635
– wuchty, s; et al. (2007). the increasing dominance of teams in production of knowledge. science, 316(5827), 1036-1039: 699
– egghe, l (2006). theory and practise of the g-index. scientometrics, 69(1), 131-152: 609
– king, da (2004). the scientific impact of nations. nature, 430(6997), 311-316: 496
– newman, mej (2004). coauthorship networks and patterns of scientific collaboration. p natl acad sci usa, 101, 5200-5205: 488
• Top 5 authors (No. pubs)
– bornmann, l: 221
– thelwall, m: 202
– leydesdorff, l: 175
– rousseau, r: 161
– egghe, l: 133
• Top 5 journals (No. pubs)
– scientometrics: 2,865
– journal of informetrics: 700
– journal of the american society for information science and technology: 613
– plos one: 339
– research evaluation: 324
• Top 5 institutes (No. pubs)
– univ granada: 316
– kathol univ leuven: 256
– leiden univ: 249
– indiana univ: 246
– univ wolverhampton: 216
• Top 5 departments (No. pubs)
– sch lib & informat sci (indiana univ): 106
– amsterdam sch commun res ascor (univ amsterdam): 97
– ctr sci & technol studies (leiden univ): 90
– sch publ policy (georgia inst technol - atlanta): 88
– trend res ctr (asia univ): 84
[Figure: number of publications per year, 2000–2016; vertical axis "No. publications" from 0 to 1,600]
26. Term map of scientometrics subfield
[Figure: term map of the scientometrics subfield. Labelled regions: Peer review, OA, careers, and gender; Collaboration; Scientometric indicators and networks; Medical research; Country-level analyses]
29. Time trend
[Figure: visualization with labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
31. Summary of graphene subfield
Cluster: 9
No. publications: 27,771
• Top 5 terms (No. pubs)
– bilayer graphene: 836
– epitaxial graphene: 491
– silicene: 401
– graphene nanoribbon: 1,035
– graphene field effect transistor: 207
• Top 5 publications (No. cits)
– novoselov, ks; et al. (2004). electric field effect in atomically thin carbon films. science, 306(5696), 666-669: 27,743
– geim, ak; et al. (2007). the rise of graphene. nat mater, 6(3), 183-191: 20,073
– novoselov, ks; et al. (2005). two-dimensional gas of massless dirac fermions in graphene. nature, 438(7065), 197-200: 11,359
– castro neto, ah; et al. (2009). the electronic properties of graphene. rev mod phys, 81(1), 109-162: 11,368
– zhang, yb; et al. (2005). experimental observation of the quantum hall effect and berry's phase in graphene. nature, 438(7065), 201-204: 8,110
• Top 5 authors (No. pubs)
– watanabe, k: 249
– taniguchi, t: 240
– peeters, fm: 233
– lin, mf: 178
– katsnelson, mi: 177
• Top 5 journals (No. pubs)
– physical review b: 4,013
– applied physics letters: 1,834
– carbon: 994
– nano letters: 906
– journal of applied physics: 841
• Top 5 institutes (No. pubs)
– chinese acad sci: 1,394
– russian acad sci: 778
– peking univ: 557
– natl univ singapore: 482
– tsing hua univ: 458
• Top 5 departments (No. pubs)
– dept phys (natl univ singapore): 257
– inst phys (chinese acad sci): 226
– inst mol & mat (radboud univ nijmegen): 216
– dept phys (mit): 209
– dept phys (univ calif berkeley and berkeley national lab): 206
[Figure: number of publications per year, 2000–2016; vertical axis "No. publications" from 0 to 4,000]
32. Open access
[Figure: visualization with labelled regions: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering]
34. Applications
• Field normalization
– CWTS Leiden Ranking/U-Multirank
– Dutch University Medical Centers
• Field delineation
– European research funders
• High-resolution research strengths analysis
– European universities
– European research funders
• Identification of interdisciplinary and emerging research areas
– UK Engineering and Physical Sciences Research Council
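For the field normalization application listed above, one common form can be sketched as follows: each publication's citation count is divided by the average citation count of the publications in the same field, where the field here would be a research area from the classification. This is a simplified sketch that ignores publication year and document type, and it is not presented as the exact procedure used in the CWTS Leiden Ranking. The function name and data are hypothetical.

```python
from collections import defaultdict


def normalized_citation_scores(publications):
    """publications: list of (pub_id, field, citations) tuples."""
    totals = defaultdict(lambda: [0, 0])  # field -> [citation sum, count]
    for _, field, citations in publications:
        totals[field][0] += citations
        totals[field][1] += 1

    scores = {}
    for pub_id, field, citations in publications:
        mean = totals[field][0] / totals[field][1]
        # A field without any citations yields a score of 0 by convention here.
        scores[pub_id] = citations / mean if mean > 0 else 0.0
    return scores


# Hypothetical toy data: raw counts differ by field, normalized scores do not.
scores = normalized_citation_scores([
    ("a", "graphene", 100),
    ("b", "graphene", 300),
    ("c", "scientometrics", 10),
    ("d", "scientometrics", 30),
])
```

Publications "a" and "c" end up with the same normalized score even though their raw citation counts differ by a factor of ten, which is the purpose of normalizing by field.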
35. Adopters and potential adopters
• Adopters:
– CWTS
– SciTech Strategies (e.g. SciVal)
– KTH Royal Institute of Technology, Stockholm
• Potential adopters:
– Chinese Academy of Sciences
– European Research Council
– Max Planck
37. Empirical micro study using papers on overall water
splitting
• Haunschild et al. (2018)
• Case study comparing CWTS classification to
journal-based and manually constructed
classifications
• Ability of CWTS classification to distinguish
between fields is questioned
38. Accuracy of the journal classification systems of Web
of Science and Scopus
• Wang and Waltman (2016)
• Two criteria to identify journals with questionable
classifications:
– journals that have weak connections with their assigned
categories
– journals that are not assigned to categories with which they
have strong connections
• Web of Science performs significantly better than
Scopus
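The two criteria could be operationalized roughly as in the sketch below. It assumes a precomputed connection strength between each journal and each category (for example a share of citation links), and the two thresholds are hypothetical placeholders rather than the values used by Wang and Waltman (2016).

```python
def questionable_journals(connections, assigned, weak=0.1, strong=0.3):
    """Flag journals whose category assignments look questionable.

    connections: {journal: {category: connection strength in [0, 1]}}
    assigned: {journal: set of assigned categories}
    weak, strong: hypothetical thresholds for the two criteria.
    """
    flags = {}
    for journal, strengths in connections.items():
        own = assigned.get(journal, set())
        # Criterion 1: assigned categories with only a weak connection.
        weakly_connected = sorted(c for c in own if strengths.get(c, 0.0) < weak)
        # Criterion 2: strongly connected categories that are not assigned.
        missing = sorted(c for c, s in strengths.items()
                         if c not in own and s >= strong)
        if weakly_connected or missing:
            flags[journal] = {"weakly_connected": weakly_connected,
                              "missing": missing}
    return flags


# Hypothetical toy data.
connections = {
    "journal A": {"physics": 0.05, "chemistry": 0.60},
    "journal B": {"economics": 0.70, "sociology": 0.10},
}
assigned = {"journal A": {"physics"}, "journal B": {"economics"}}
flags = questionable_journals(connections, assigned)
```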
39. Field classification of publications in Dimensions
• Bornmann (2018)
• Field classification in Dimensions:
– Based on Fields of Research (FOR) from Australian and New
Zealand Standard Research Classification (ANZSRC)
– Machine learning approach
– Each publication is assigned to at least one field
• Based on Bornmann’s own publications
• Questions reliability and validity of Dimensions
classification
40. Response from Dimensions
• Herzog and Lunn (2018)
• Implementation at launch was first step and
requires improvements:
– Improvement of training sets
– Adding new subcategories to FOR system
41. Large-scale system to organize publications into
hierarchical concept structure
• Shen et al. (2018)
• Core component in Microsoft Academic
• Iterative approach to:
– concept discovery (Wikipedia)
– concept tagging to publications (both textual data and graph
structure are considered)
– concept hierarchy construction
• Based on 2000 initial seed concepts, over 228K
concepts have been identified
• Concepts are organized in six-level hierarchy
• 1 billion publication-concept relations
43. Conclusions
• Algorithmic approaches can be used to construct large-scale classifications
• Algorithmic classifications at the level of publications are gaining popularity
• Algorithmic possibilities depend on data availability
• Algorithmic classifications may have the disadvantage of mixing up different
principles for classifying items (e.g., research topic, research method,
scientific community, theoretical tradition, basic vs. applied)