Scientometric approaches to classification

Presentation at the Colloquium Research Information Systems and Science Classifications: Revisiting the NARCIS Classification, Museum Meermanno, The Hague, The Netherlands, September 28, 2018.

Published in: Science

  1. 1. Scientometric approaches to classification • Nees Jan van Eck, Centre for Science and Technology Studies (CWTS), Leiden University • Colloquium Research Information Systems and Science Classifications: Revisiting the NARCIS Classification, Museum Meermanno, The Hague, The Netherlands, September 28, 2018
  2. 2. Outline • Bibliographic databases • Classification systems of scientific literature • CWTS publication-level classification system of science – Methodology – Structure – Applications • Quality of classification systems
  3. 3. Bibliographic databases
  4. 4. Bibliographic databases
  5. 5. Bibliographic databases • Web of Science: 20,000 journals, 55 million publications, 1.2 billion citations • Scopus: 24,000 journals, 45 million publications, 1.2 billion citations
  6. 6. Classification systems of scientific literature
  7. 7. Classification systems of scientific literature • Mono-disciplinary vs. multidisciplinary • Journal-level vs. publication-level • Manual vs. algorithmic
  8. 8. Classification systems of scientific literature • Mono-disciplinary: – Chemical Abstracts: 80 different sections and 5 broad headings – EconLit: Journal of Economic Literature (JEL) classification system – PubMed: Medical Subject Headings (MeSH) • Multidisciplinary: – Web of Science: 250 categories – Scopus (ASJC): bottom level has 304 categories and top level includes 27 categories – Science-Metrix: 176 categories – National Science Foundation (NSF): 125 categories – University of California, San Diego (UCSD): more than 500 categories – Australian and New Zealand Standard Research Classification (FoR): 3 hierarchical levels
  9. 9. CWTS publication-level classification system of science
  10. 10. Algorithmic classification system of science • First version created in 2012 • Publications (not journals) are clustered into research areas based on citation relations • Research areas are defined at different levels of granularity and are organized hierarchically • Clustering is performed using the smart local moving algorithm (improved Louvain algorithm; Waltman & Van Eck, 2013)
  11. 11. Objectives To create a classification system • in a fully algorithmic manner • covering all sciences and social sciences • at the level of individual publications • with a hierarchical structure • using transparent, freely available algorithms • without excessive computational requirements
  12. 12. Main challenges • Dealing with huge volumes of data • Avoiding disciplinary biases • Reaching a high level of accuracy • Being flexible in terms of number of hierarchical levels and size of research areas • Obtaining proper labels for the research areas • Keeping the methodology reasonably simple and transparent
  13. 13. Dealing with huge volumes of data • Linking publications based on direct citations only; no co-citations, bibliographic coupling, or word co-occurrences • Efficient clustering algorithm based on ideas taken from: – Newman (2004): Modularity-based clustering – Blondel et al. (2008): ‘Louvain method’ – Waltman et al. (2010): VOS clustering technique – Rotta & Noack (2011): Multilevel local search algorithms
  14. 14. Avoiding disciplinary biases • $c_{ij}$: Relatedness of publications $i$ and $j$, i.e., 1 if there is a direct citation relation between $i$ and $j$, 0 otherwise • $a_{ij}$: Normalized relatedness of publications $i$ and $j$, defined as $a_{ij} = c_{ij} / \sum_k c_{ik}$ • Similar to fractional citation counting (Small & Sweeney, 1985)
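The normalization on this slide can be sketched in a few lines. This is a minimal illustration, not the CWTS implementation: the function name `normalized_relatedness` and the toy citation pairs are assumptions, and a citation in either direction is treated as one undirected link, following the slide's definition of c_ij.

```python
from collections import defaultdict


def normalized_relatedness(citations):
    """Return a[i][j] = c_ij / sum_k c_ik for an iterable of (citing, cited) pairs."""
    neighbors = defaultdict(set)
    for citing, cited in citations:
        if citing == cited:
            continue  # a publication is not related to itself
        # c_ij = 1 if there is a direct citation relation in either direction
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    a = {}
    for i, linked in neighbors.items():
        total = len(linked)  # sum_k c_ik, because every c_ik is 0 or 1
        a[i] = {j: 1.0 / total for j in linked}
    return a


# Hypothetical toy data: four publications and four direct citations.
citations = [("p1", "p2"), ("p1", "p3"), ("p4", "p1"), ("p3", "p2")]
a = normalized_relatedness(citations)
```

In this toy network p1 has three citation links, so each carries weight 1/3, while the single link of p4 carries weight 1. Every publication's weights sum to 1, so a publication with many citation links does not count for more than one with few. Note that the result is not symmetric: a["p1"]["p4"] differs from a["p4"]["p1"].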
  15. 15. Reaching a high level of accuracy • Clustering technique based on maximization of a quality function: $\sum_i \sum_j \delta(x_i, x_j)(a_{ij} - r)$ • $x_i$ denotes the cluster (research area) to which publication $i$ is assigned • $\delta(x_i, x_j) = 1$ if $x_i = x_j$ and 0 otherwise • $r$ denotes a resolution parameter • Quality function is maximized with respect to $x_1, \ldots, x_n$
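To make the role of the resolution parameter concrete, the sketch below evaluates the quality function from this slide, read here as the sum over all pairs i, j of δ(x_i, x_j)(a_ij − r), for two candidate assignments. It only scores a given assignment and does not perform the smart local moving optimization. The function name `quality` and the four-publication example are assumptions. Terms with i = j are skipped, since each would add the same constant −r for every assignment and so cannot change which assignment scores best.

```python
def quality(a, clusters, r):
    """Sum of (a_ij - r) over ordered pairs i != j assigned to the same cluster."""
    total = 0.0
    for i in clusters:
        for j in clusters:
            if i != j and clusters[i] == clusters[j]:
                total += a.get(i, {}).get(j, 0.0) - r
    return total


# Hypothetical chain of citation links A - B - C - D, already normalized
# so that each publication's weights sum to 1.
a = {
    "A": {"B": 1.0},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"B": 0.5, "D": 0.5},
    "D": {"C": 1.0},
}
split = {"A": 0, "B": 0, "C": 1, "D": 1}   # two research areas
merged = {"A": 0, "B": 0, "C": 0, "D": 0}  # one research area

high_r = (quality(a, split, 0.2), quality(a, merged, 0.2))
low_r = (quality(a, split, 0.05), quality(a, merged, 0.05))
```

With r = 0.2 the split assignment scores higher than the merged one; with r = 0.05 the order reverses. In this example, then, a larger resolution parameter favors more and smaller research areas.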
  16. 16. Being flexible in terms of number of hierarchical levels and size of research areas • Three types of parameters: – Number of hierarchical levels – Each level’s resolution parameter – Each level’s minimum number of publications per research area
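One way to picture these three parameter types is as a small configuration object with one entry per hierarchical level. The sketch below is only an illustration: the class `LevelParams`, the helper `check_levels`, and all numeric values are placeholders, not the settings used by CWTS. The check assumes that a larger resolution parameter yields smaller research areas, so levels are expected to run from coarse to fine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelParams:
    name: str
    resolution: float      # resolution parameter r used at this level
    min_publications: int  # minimum number of publications per research area


def check_levels(levels):
    """Check that levels run from coarse to fine and return their number."""
    for coarse, fine in zip(levels, levels[1:]):
        if not coarse.resolution < fine.resolution:
            raise ValueError(f"{fine.name}: resolution should exceed that of {coarse.name}")
        if not coarse.min_publications >= fine.min_publications:
            raise ValueError(f"{fine.name}: minimum size should not exceed that of {coarse.name}")
    return len(levels)


# Placeholder values for illustration only; not the settings used by CWTS.
levels = [
    LevelParams("broad disciplines", resolution=1e-7, min_publications=50_000),
    LevelParams("fields", resolution=1e-6, min_publications=5_000),
    LevelParams("subfields", resolution=1e-5, min_publications=500),
]
n_levels = check_levels(levels)
```

The number of hierarchical levels is simply the length of the list, so adding or removing a level changes one entry rather than the code.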
  17. 17. Obtaining proper labels for the research areas 1. Identification of terms in titles and abstracts of articles using part-of-speech tagging 2. Calculation of term relevance scores based on a combination of a term’s absolute and relative frequency of occurrence 3. Selection of the most relevant terms based on term relevance scores combined with a filter for removing similar terms
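Steps 2 and 3 of this labeling procedure can be sketched as follows. The slide does not give the scoring formula or the similarity filter, so both are stand-ins chosen for illustration: the score multiplies a term's count in the research area by how over-represented it is there, and the filter drops any term that shares a word with an already selected term. Step 1 (part-of-speech tagging) is skipped; the input is assumed to be one list of already extracted terms per publication. The function name `label_terms` and the example data are hypothetical.

```python
from collections import Counter


def label_terms(cluster_docs, all_docs, top_n=3):
    """Pick label terms for one research area.

    cluster_docs and all_docs are lists of term lists, one per publication;
    all_docs must include the publications in cluster_docs.
    """
    in_cluster = Counter(t for doc in cluster_docs for t in set(doc))
    overall = Counter(t for doc in all_docs for t in set(doc))

    scores = {}
    for term, n in in_cluster.items():
        # Relative frequency: share of area publications vs. share of all publications.
        relative = (n / len(cluster_docs)) / (overall[term] / len(all_docs))
        scores[term] = n * relative  # stand-in combination of absolute and relative frequency

    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    selected = []
    for term in ranked:
        words = set(term.split())
        if any(words & set(s.split()) for s in selected):
            continue  # stand-in similarity filter: shares a word with a selected term
        selected.append(term)
        if len(selected) == top_n:
            break
    return selected


# Hypothetical example: three publications in the area, three outside it.
cluster_docs = [
    ["h index", "citation"],
    ["h index", "impact factor"],
    ["h index variant", "citation"],
]
other_docs = [
    ["graphene", "citation"],
    ["graphene", "bilayer graphene"],
    ["citation", "graphene"],
]
labels = label_terms(cluster_docs, cluster_docs + other_docs)
```

Here "h index" ranks first because it is both frequent in the area and absent elsewhere, and "h index variant" is filtered out as too similar to it.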
  18. 18. CWTS publication-level classification system of science • 21.2 million publications from the period 2000–2017 indexed in Web of Science • 374.1 million citation relations • Classification system of 3 hierarchical levels: – 22 broad disciplines – 868 fields – 4,047 subfields • Computational performance: less than 2 hours
  19. 19. Breakdown of scientific literature into 22 broad disciplines • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  20. 20. 22 broad disciplines
  21. 21. Breakdown of scientific literature into 868 fields • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  22. 22. Breakdown of scientific literature into 4,047 subfields • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  23. 23. Breakdown of scientific literature into 4,047 subfields • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering; Scientometrics
  24. 24. Summary of scientometrics subfield
      • Cluster: 145
      • No. publications: 16,312
      • Top 5 terms (no. pubs): bibliometric analysis: 852; impact factor: 495; h index: 264; peer review: 515; citation: 642
      • Top 5 publications (no. cits):
        – hirsch, je (2005). an index to quantify an individual's scientific research output. p natl acad sci usa, 102(46), 16569-16572. (2,635)
        – wuchty, s; et al. (2007). the increasing dominance of teams in production of knowledge. science, 316(5827), 1036-1039. (699)
        – egghe, l (2006). theory and practise of the g-index. scientometrics, 69(1), 131-152. (609)
        – king, da (2004). the scientific impact of nations. nature, 430(6997), 311-316. (496)
        – newman, mej (2004). coauthorship networks and patterns of scientific collaboration. p natl acad sci usa, 101, 5200-5205. (488)
      • Top 5 authors (no. pubs): bornmann, l: 221; thelwall, m: 202; leydesdorff, l: 175; rousseau, r: 161; egghe, l: 133
      • Top 5 journals (no. pubs): scientometrics: 2,865; journal of informetrics: 700; journal of the american society for information science and technology: 613; plos one: 339; research evaluation: 324
      • Top 5 institutes (no. pubs): univ granada: 316; kathol univ leuven: 256; leiden univ: 249; indiana univ: 246; univ wolverhampton: 216
      • Top 5 departments (no. pubs): sch lib & informat sci (indiana univ): 106; amsterdam sch commun res ascor (univ amsterdam): 97; ctr sci & technol studies (leiden univ): 90; sch publ policy (georgia inst technol - atlanta): 88; trend res ctr (asia univ): 84
      • Figure: number of publications per year, 2000–2016
  25. 25. Publications in scientometrics subfield
  26. 26. Term map of scientometrics subfield • Figure labels: Peer review, OA, careers, and gender; Collaboration; Scientometric indicators and networks; Medical research; Country-level analyses
  27. 27. Time-line map of highly cited scientometrics publications
  28. 28. Overlay visualizations • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  29. 29. Time trend • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  30. 30. Time trend • Figure labels: MicroRNA; Graphene
  31. 31. Summary of graphene subfield
      • Cluster: 9
      • No. publications: 27,771
      • Top 5 terms (no. pubs): bilayer graphene: 836; epitaxial graphene: 491; silicene: 401; graphene nanoribbon: 1,035; graphene field effect transistor: 207
      • Top 5 publications (no. cits):
        – novoselov, ks; et al. (2004). electric field effect in atomically thin carbon films. science, 306(5696), 666-669. (27,743)
        – geim, ak; et al. (2007). the rise of graphene. nat mater, 6(3), 183-191. (20,073)
        – novoselov, ks; et al. (2005). two-dimensional gas of massless dirac fermions in graphene. nature, 438(7065), 197-200. (11,359)
        – castro neto, ah; et al. (2009). the electronic properties of graphene. rev mod phys, 81(1), 109-162. (11,368)
        – zhang, yb; et al. (2005). experimental observation of the quantum hall effect and berry's phase in graphene. nature, 438(7065), 201-204. (8,110)
      • Top 5 authors (no. pubs): watanabe, k: 249; taniguchi, t: 240; peeters, fm: 233; lin, mf: 178; katsnelson, mi: 177
      • Top 5 journals (no. pubs): physical review b: 4,013; applied physics letters: 1,834; carbon: 994; nano letters: 906; journal of applied physics: 841
      • Top 5 institutes (no. pubs): chinese acad sci: 1,394; russian acad sci: 778; peking univ: 557; natl univ singapore: 482; tsing hua univ: 458
      • Top 5 departments (no. pubs): dept phys (natl univ singapore): 257; inst phys (chinese acad sci): 226; inst mol & mat (radboud univ nijmegen): 216; dept phys (mit): 209; dept phys (univ calif berkeley and berkeley national lab): 206
      • Figure: number of publications per year, 2000–2016
  32. 32. Open access • Figure labels: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering
  33. 33. University profiles • Figure labels: Delft University of Technology; Leiden University
  34. 34. Applications • Field normalization – CWTS Leiden Ranking/U-Multirank – Dutch University Medical Centers • Field delineation – European research funders • High-resolution research strengths analysis – European universities – European research funders • Identification of interdisciplinary and emerging research areas – UK Engineering and Physical Sciences Research Council
  35. 35. Adopters and potential adopters • Adopters: – CWTS – SciTech Strategies (e.g. SciVal) – KTH Royal Institute of Technology, Stockholm • Potential adopters: – Chinese Academy of Sciences – European Research Council – Max Planck
  36. 36. Quality of classification systems
  37. 37. Empirical micro study using papers on overall water splitting • Haunschild et al. (2018) • Case study comparing CWTS classification to journal-based and manually constructed classifications • Ability of CWTS classification to distinguish between fields is questioned
  38. 38. Accuracy of the journal classification systems of Web of Science and Scopus • Wang and Waltman (2016) • Two criteria to identify journals with questionable classifications: – journals that have weak connections with their assigned categories – journals that are not assigned to categories with which they have strong connections • Web of Science performs significantly better than Scopus
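The two criteria can be illustrated with a small check on a single journal. This is a sketch under stated assumptions, not the procedure of Wang and Waltman (2016): the slide does not say how connection strength is measured or where the cut-offs lie, so the sketch assumes a per-category share of the journal's citation links and uses hypothetical thresholds `weak` and `strong`. The journal data are made up.

```python
def questionable_assignments(assigned, connection, weak=0.05, strong=0.30):
    """Apply both criteria to one journal.

    assigned:   set of categories the journal is assigned to
    connection: category -> share of the journal's citation links (assumed measure)
    Returns (assigned categories with weak connections,
             unassigned categories with strong connections).
    """
    weakly_connected = sorted(c for c in assigned if connection.get(c, 0.0) < weak)
    missing = sorted(
        c for c, share in connection.items() if share >= strong and c not in assigned
    )
    return weakly_connected, missing


# Hypothetical journal with two assigned categories.
assigned = {"Information Science", "Computer Science"}
connection = {
    "Information Science": 0.55,
    "Computer Science": 0.03,
    "Education": 0.32,
    "Economics": 0.10,
}
weakly_connected, missing = questionable_assignments(assigned, connection)
```

For this made-up journal the first criterion flags the assignment to "Computer Science" and the second flags the missing assignment to "Education".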
  39. 39. Field classification of publications in Dimensions • Bornmann (2018) • Field classification in Dimensions: – Based on Fields of Research (FOR) from Australian and New Zealand Standard Research Classification (ANZSRC) – Machine learning approach – Each publication is assigned to at least one field • Based on Bornmann’s own publications • Questions reliability and validity of Dimensions classification
  40. 40. Response from Dimensions • Herzog and Lunn (2018) • Implementation at launch was first step and requires improvements: – Improvement of training sets – Adding new subcategories to FOR system
  41. 41. Large-scale system to organize publications into hierarchical concept structure • Shen et al. (2018) • Core component in Microsoft Academic • Iterative approach to: – concept discovery (Wikipedia) – concept tagging to publications (both textual data and graph structure are considered) – concept hierarchy construction • Based on 2000 initial seed concepts, over 228K concepts have been identified • Concepts are organized in six-level hierarchy • 1 billion publication-concept relations
  42. 42. Conclusions
  43. 43. Conclusions • Algorithmic approaches can be used to construct large-scale classifications • Algorithmic classifications at the level of publications are gaining popularity • Algorithmic possibilities depend on data availability • Algorithmic classifications may have the disadvantage of mixing up different principles for classifying items (e.g., research topic, research method, scientific community, theoretical tradition, basic vs. applied)
  44. 44. Thank you for your attention!