Scientific information retrieval: Challenges and opportunities
Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University
17th Dutch-Belgian Information Retrieval Workshop (DIR2018)
Leiden, The Netherlands
November 23, 2018
Centre for Science and Technology Studies (CWTS), Leiden University
• Quantitative science studies
• Bibliometrics and scientometrics
• Research management and science policy
• Lots of commissioned research for research institutions, funders, governments, companies, etc.
1
Information retrieval vs. scientometrics
2
[Diagram. Information retrieval: scientific document retrieval, web page retrieval, image retrieval, sound retrieval, ... Scientometrics: scientometric analysis. Users: individual users, researchers, research managers, policy makers.]
Outline
• Historical connections between information retrieval and scientometrics
• Scientometric perspective on information retrieval
• Scientific information retrieval
3
Historical connections between information retrieval and scientometrics
4
Author co-citation analysis of information science researchers (1980–1987)
5
White and McCain (1998)
[Figure labels: Scientometrics; Information retrieval]
Author bibliographic coupling analysis of information science researchers (2001–2005)
6
Zhao and Strotmann (2008)
[Figure labels: Scientometrics; Information retrieval]
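The two analyses named on these slides rest on simple counts over reference lists. The following is a minimal sketch, not the procedure of White and McCain (1998) or Zhao and Strotmann (2008): it assumes a hypothetical mapping from each citing item to the items it cites, and the function names and toy data are illustrative only. For an author-level analysis, the items would be authors rather than publications.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(references):
    """Count how often each pair of cited items appears in the same reference list."""
    counts = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


def bibliographic_coupling_counts(references):
    """Count the cited items shared by each pair of citing items."""
    counts = Counter()
    for (p, refs_p), (q, refs_q) in combinations(sorted(references.items()), 2):
        shared = len(set(refs_p) & set(refs_q))
        if shared:
            counts[(p, q)] = shared
    return counts


# Hypothetical toy data: citing item -> items it cites.
references = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["C", "D"],
}
co_cit = co_citation_counts(references)
coupling = bibliographic_coupling_counts(references)
```

Co-citation relates cited items (A and B are cited together by P1 and P2), whereas bibliographic coupling relates citing items (P1 and P2 share the references A and B).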
PageRank
7
Brin and Page (1998); Pinski and Narin (1976)
PageRank
8
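As an illustration of the idea behind PageRank, the following is a minimal power-iteration sketch over a toy citation network. It is not the implementation of Brin and Page (1998), nor the citation influence method of Pinski and Narin (1976); the function name, the damping value of 0.85, the fixed iteration count, and the handling of nodes without outgoing links are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a dict mapping each node to the nodes it links to (or cites)."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank on in equal shares to the nodes it links to.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy network: A and B cite C, and C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(citations)
```

In this toy network D receives the highest score: it is cited only once, but by C, which is itself cited twice.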
Scientometric perspective on information retrieval
9
VOSviewer
10
Identifying micro-level fields of science
• Based on all articles and reviews in Web of Science between 2000 and 2017
• 21.2 million publications
• 374.1 million citation links
• Clustering of publications into about 4000 micro-level fields of science using the Leiden algorithm
11
Leiden algorithm
12
Traag et al. (2018)
Structure of science based on 4000 micro-level fields
13
[Figure: map of the micro-level fields. Main areas: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering. Legend: Size of a field is proportional to the number of publications in the field.]
Temporal dynamics in micro-level structure of science
14
[Figure: map of the micro-level fields. Main areas: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering. Labeled fields: Network science; Electric vehicles; Image processing; Multi-agent systems. Legend: Average publication year of the publications in a field.]
Position of scientometrics in micro-level structure of science
15
[Figure: map of the micro-level fields. Main areas: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering. Labeled field: Scientometrics. Legend: Proportion of publications with ‘bibliometrics’ or ‘scientometrics’ in title or abstract.]
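The quantity shown in the legend can be computed per field as the share of publications whose title or abstract contains one of the search terms. The following is a minimal sketch under stated assumptions: the record layout (`field`, `title`, `abstract`), the function name, and the toy records are hypothetical, and matching is done by plain case-insensitive substring search, which may differ from the matching used for the slides.

```python
from collections import defaultdict


def term_proportion_by_field(publications, terms):
    """Return {field: share of publications with any term in title or abstract}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    lowered = [term.lower() for term in terms]
    for pub in publications:
        text = (pub["title"] + " " + pub["abstract"]).lower()
        totals[pub["field"]] += 1
        if any(term in text for term in lowered):
            matches[pub["field"]] += 1
    return {field: matches[field] / totals[field] for field in totals}


# Hypothetical toy records; 'field' stands for a micro-level field identifier.
publications = [
    {"field": 1, "title": "Scientometrics of open access", "abstract": ""},
    {"field": 1, "title": "Citation analysis", "abstract": "A bibliometrics study."},
    {"field": 1, "title": "Peer review", "abstract": "Reviewer behaviour."},
    {"field": 1, "title": "Funding", "abstract": "Grant outcomes."},
    {"field": 2, "title": "Graphene", "abstract": "Electronic properties."},
]
proportions = term_proportion_by_field(publications, ["bibliometrics", "scientometrics"])
```

Fields with a high proportion are the ones highlighted on the map; in the toy data, half of the publications in field 1 match and none in field 2.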
Position of information retrieval in micro-level structure of science
• What are the main subfields of information retrieval?
• Which broad scientific disciplines do these subfields relate to?
• Which are the main developments in information retrieval in recent years?
16
Position of information retrieval in micro-level structure of science
17
[Figure: map of the micro-level fields. Main areas: Social sciences and humanities; Biomedical and health sciences; Life and earth sciences; Mathematics and computer science; Physical sciences and engineering. Labeled fields: IR subfield 1; IR subfield 2; IR subfield 3; Scientometrics. Legend: Proportion of publications with ‘information retrieval’ in title or abstract.]
Term map of information retrieval subfield 1
18
[Legend: Average publication year of the publications in which a term occurs. Size of a term is proportional to the number of publications in which the term occurs.]
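The two term-level quantities in this legend can be derived directly from publication records. The following is a minimal sketch, assuming each publication comes with a publication year and a list of already extracted terms; the record layout, the function name, and the toy terms are hypothetical, and term extraction itself is not shown.

```python
from collections import defaultdict


def term_statistics(publications):
    """Return {term: (number of publications, average publication year)}."""
    years = defaultdict(list)
    for pub in publications:
        for term in set(pub["terms"]):  # count each term once per publication
            years[term].append(pub["year"])
    return {term: (len(ys), sum(ys) / len(ys)) for term, ys in years.items()}


# Hypothetical toy records with pre-extracted terms.
publications = [
    {"year": 2005, "terms": ["query expansion", "relevance feedback"]},
    {"year": 2015, "terms": ["recommender system", "query expansion"]},
    {"year": 2017, "terms": ["recommender system"]},
]
stats = term_statistics(publications)
```

The first value of each pair would drive the size of a term in the map and the second its color, so terms with a recent average publication year stand out as recent topics.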
Term map of information retrieval subfield 2
19
Term map of information retrieval subfield 3
20
Information retrieval subfields
• Subfield 1:
– Computer science perspective: ‘Hard’ information retrieval
– Strong recent emphasis on recommender systems, social media, and sentiment analysis
• Subfield 2:
– Information science perspective: ‘Soft’ information retrieval
– Connections between information retrieval, library science, and information behavior and information literacy research
• Subfield 3:
– Bioinformatics perspective: Information retrieval in the biomedical and health science domain
• These subfields do not exhaustively cover all information retrieval research
21
Term map of scientometrics field
22
Term map of scientometrics field
23
[Legend: Proportion of publications with ‘information retrieval’ in title or abstract.]
Workshops on bibliometric-enhanced information retrieval
24
Scientific information retrieval
25
Tools for scientific information retrieval
26
Google Scholar
27
Web of Science
28
Web of Science
29
Approaches in scientific information retrieval
• Semantic search
• Similar articles
• Advanced full-text search
• Clustering
• Highly influential citations
• Citation-based expansion
30
Semantic search (Microsoft Academic)
31
Identifying fields of study (Microsoft Academic)
32
Shen et al. (2018)
Similar articles (PubMed)
33
Similar articles (PubMed)
34
Lin and Wilbur (2007)
Advanced full-text search (Europe PMC)
35
Clustering (Open Knowledge Maps)
36
Clustering (Open Knowledge Maps)
37
Clustering (ongoing work)
38
In-text references
39
Boyack et al. (2018)
40
Highly influential citations (Semantic Scholar)
Highly influential citations (Semantic Scholar)
41
Valenzuela et al. (2015)
Citation-based expansion (CitNetExplorer)
42
Van Eck and Waltman (2014)
[Figure panels: Selection of publications (in blue); Expansion with all citing and cited publications; Expansion with citing and cited publications having 3 or more citation links.]
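The expansion step in these panels can be sketched as follows: starting from a selection of publications, add every publication that cites or is cited by the selection through at least a minimum number of citation links. This is a minimal illustration, not CitNetExplorer's implementation; the function name, the edge-list input format, and the toy data are assumptions.

```python
from collections import Counter


def expand_selection(citations, selection, min_links=1):
    """Add publications with at least min_links citation links to the selection.

    citations: iterable of (citing, cited) pairs.
    """
    selection = set(selection)
    link_counts = Counter()
    for citing, cited in set(citations):  # ignore duplicate pairs
        if citing in selection and cited not in selection:
            link_counts[cited] += 1
        elif cited in selection and citing not in selection:
            link_counts[citing] += 1
    added = {pub for pub, count in link_counts.items() if count >= min_links}
    return selection | added


# Hypothetical toy citation network and selection.
citations = [
    ("S1", "X"), ("S2", "X"), ("X", "S3"),  # X has three links to the selection
    ("S1", "Y"),                            # Y has one link
    ("Z", "S2"),                            # Z has one link
    ("S1", "S2"),                           # link within the selection
]
selection = {"S1", "S2", "S3"}
expanded_all = expand_selection(citations, selection, min_links=1)
expanded_strict = expand_selection(citations, selection, min_links=3)
```

With `min_links=1` every neighbor of the selection is added, which mirrors the panel with all citing and cited publications; with `min_links=3` only X is added, which mirrors the stricter panel.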
Citation-based expansion (CitNetExplorer)
43
Citation-based expansion (CitNetExplorer)
44
Literature reviewing using citation-based expansion
45
Open citations
46
Shotton (2018)
Open citations
47
Citation-based expansion (Citation Gecko)
48
Conclusions
• Significant inefficiencies in current information retrieval practices of researchers
• Lots of room for innovation
• Take advantage of open science developments
• Join forces between information retrieval and scientometrics
49
Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2019
50
Thank you for your attention!
51
References
Boyack, K.W., Van Eck, N.J., Colavizza, G., & Waltman, L. (2018). Characterizing in-text citations in scientific articles: A large-scale analysis. Journal of Informetrics, 12(1), 59–73.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Lin, J., & Wilbur, W.J. (2007). PubMed related articles: A probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1), 423.
Pinski, G., & Narin, F. (1976). Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management, 12(5), 297–312.
Shen, Z., Ma, H., & Wang, K. (2018). A Web-scale system for scientific knowledge exploration. arXiv:1805.12216.
Shotton, D. (2018). Funders should mandate open citations. Nature, 553, 129.
Traag, V.A., Waltman, L., & Van Eck, N.J. (2018). From Louvain to Leiden: Guaranteeing well-connected communities. arXiv:1810.08473.
Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. AAAI Workshop: Scholarly Big Data.
Van Eck, N.J., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523–538.
Van Eck, N.J., & Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. Journal of Informetrics, 8(4), 802–823.
Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378–2392.
White, H.D., & McCain, K.W. (1998). Visualizing a discipline: An author co‐citation analysis of information science, 1972–1995. JASIS, 49(4), 327–355.
Zhao, D., & Strotmann, A. (2008). Evolution of research activities and intellectual influences in information science 1996–2005: Introducing author bibliographic-coupling analysis. JASIST, 59(13), 2070–2086.
52
