Slides of the Lecture at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Applying machine learning techniques to big data in the scholarly domain
1. Applying Machine Learning Techniques to Big Data in
the Scholarly Domain
Angelo A. Salatino
Knowledge Media Institute, The Open University, UK
@angelosalatino
5th International School on Applied Probability Theory,
Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
2. Agenda
What is Scholarly Data?
Computer Science Ontology
How has it been produced?
What can we do with it?
• Topic classification
• Research trends forecast
• Metadata extraction
• Recommendation of books
• Analyse conferences
3. About me – Angelo Salatino
Research Associate and Associate Lecturer at the Open University
Research Interests: i) new technologies for classifying scientific
papers according to their relevant research topics, and ii) how the
research output of academia fosters innovation in the industry
In the SKM3 team, we produce innovative approaches leveraging
large-scale data mining, semantic technologies, machine learning,
and visual analytics to extract meaning from scholarly data and
shed light on research dynamics
angelo.salatino@open.ac.uk https://salatino.org @angelosalatino
4. Science of Science
“The science of science places the practice of science itself under the
microscope, leading to a quantitative understanding of the genesis of
scientific discovery, creativity, and practice and developing tools and
policies aimed at accelerating scientific progress.”
Fortunato, Santo, et al. "Science of science." Science 359.6379 (2018).
Picture from the cover of Science Vol 361, Issue 6408
5. The Computer Science Ontology Framework
This solution supports a variety of high-level
tasks:
i. categorising proceedings in digital
libraries
ii. enhancing semantically the metadata
of scientific publications
iii. generating recommendations
iv. producing smart analytics
v. detecting research trends …
Each layer builds on the layers underneath:
Corpus of Research Papers
Klink-2 Algorithm
Computer Science Ontology
CSO Classifier
High-level Applications
8. Scholarly Data
Improving Editorial Workflow and
Metadata Quality at Springer Nature.
Identifying the research topics that best describe the scope of a scientific publication is a
crucial task for editors, in particular because the quality of these annotations determines how
effectively users are able to discover the right content in online libraries. For this reason,
Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this
task to their most expert editors. These editors manually analyse all new books, possibly
including hundreds of chapters, and produce a list of the most relevant topics. Hence, this
process has traditionally been very expensive, time-consuming, and confined to a few senior
editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-
driven application that assists the Springer Nature editorial team in annotating the volumes of
all books covering conference proceedings in Computer Science. Since then STM has been
regularly used by editors in Germany, China, Brazil, India, and Japan, …
Angelo Salatino
Francesco Osborne
Aliaksandr Birukou
Enrico Motta
The Open University
Springer Nature
The 18th International Semantic Web Conference (ISWC 2019)
Metadata highlighted in the example: Affiliations, Authors, Citations, References, Conference/Journal, Text (Title, Abstract), Keywords
Keywords: Scholarly data, Bibliographic metadata, Topic classification, Topic detection, …
9. Big Scholarly Datasets
• Web of Science
• Scopus
• Google Scholar
• Microsoft Academic Graph
• MA-KG, ma-graph.org
• PubMed
• Dimensions
• Semantic Scholar
• DBLP
• Open Academic Graph
• ScholarlyData
• PID Graph
• Open Research Knowledge Graph
• OpenCitations
• OpenAIRE research graph
• Crossref
• Academy/Industry Dynamics KG
10. Differences between datasets
All these datasets are different
from each other:
• size
• scope
• quality
• mistakes, author disambiguation
• WoS > Scopus > MAG
• index vs. scraping
• comprehensiveness
• integration with other sources
• access to data: license
Picture from Martijn Visser, Nees Jan van Eck, and Ludo Waltman. "Large-scale comparison of bibliographic
data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic." (2020).
“The comparison considers all scientific documents from the
period 2008–2017 covered by these data sources.”
12. What is a Taxonomy?
A taxonomy is a categorization or classification according to discrete sets.
Taxonomies are typically organised in a hierarchical structure.
In practice, it’s a tree structure, with a root node at the top: a single
classification that applies to all objects.
Nodes below are more specific classifications applying to subsets of
objects.
The progress of reasoning proceeds from the general to the more specific.
Example of taxonomies:
• Taxonomy to classify organisms
• Plant taxonomy
• Phylogenetic tree
• Virus classification
• Taxonomies of Science
13. Why do we need Taxonomies of Science?
Also called Knowledge Organization Systems, they help to organise digital
libraries.
They represent the structure of disciplines by naming all their sub-disciplines and research topics.
14. Taxonomies of Research Areas
Mathematics Subject
Classification – MSC2010
Physics and Astronomy
Classification Scheme
(PACS)
JEL Classification
System
Library of Congress
Classification (LCC)
Computing
Classification System
(CCS)
15. Problem with state-of-the-art taxonomies
These taxonomies are:
• manually curated
• quick to become outdated
• coarse-grained
• low in completeness
They are unable to reflect the complex structure and the depth of a discipline
16. The Computer Science Ontology (CSO)
• Ontology of research areas*, automatically generated using Klink-2**
algorithm, on a dataset of 16 million publications mainly in Computer
Science
• Current version of CSO includes 14K topics and 159K relationships
• Main roots include Computer Science, Linguistics, Mathematics,
Geometry, Semantics, and so on.
• Download CSO from https://cso.kmi.open.ac.uk
* Angelo A Salatino, Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, Enrico Motta. "The Computer Science
Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA (USA).
** Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In
ISWC 2015, Bethlehem, PA (USA).
17. Why CSO is an Ontology and not a Taxonomy? Differences
An ontology is a formal description of
knowledge as a set of concepts within a
domain and the relationships that hold
between them
Taxonomies identify hierarchical
relationships within a category
Ontologies take taxonomies a step further
by providing richer information,
including information about the
relationships between entities.
Example knowledge graph:
• cristiano_ronaldo – has_name “Cristiano Ronaldo”, was_born “1985”, jersey_number “7”, former_team real_madrid, current_team juventus
• zinedine_zidane – has_name “Zinedine Zidane”, was_born “1972”, former_team real_madrid, current_manager_of real_madrid
• juventus – has_name “Juventus F.C.”
• real_madrid – has_name “Real Madrid C.F.”
18. Data Model of the Computer Science Ontology
The CSO data model includes eight semantic relations:
• superTopicOf, which indicates that a topic is a sub-area of another one (e.g., Linked Data, Semantic Web).
• relatedEquivalent, which indicates that two topics can be treated as equivalent for the purpose of exploring research
data (e.g., Ontology Matching, Ontology Alignment).
• contributesTo, which indicates that the research output of one topic contributes to another. For instance, research
in Ontology Engineering contributes to the Semantic Web, but arguably Ontology Engineering is not a sub-area of the
Semantic Web – that is, there is plenty of research in Ontology Engineering outside the Semantic Web area.
• owl:sameAs, this relation indicates that a research concept is identical to an external resource. We used DBpedia
Spotlight to connect research concepts to DBpedia.
• primaryLabel, this relation is used to state the main label for topics belonging to a cluster of relatedEquivalent. For
instance, the topics Ontology Matching and Ontology Alignment will both have their primaryLabel set to Ontology
Matching.
• rdf:type, this relation is used to state that a resource is an instance of a class. For example, a resource in our ontology
is an instance of topic.
• rdfs:label, this relation is used to provide a human-readable version of a resource’s name.
• schema:relatedLink, which links CSO concepts to related web pages that either describe the research topics
(Wikipedia articles) or provide additional information about the research domains (Microsoft Academic).
19. Computer Science Ontology
Very fine-grained, organised in 13 levels
Spans from general areas:
• Computer Science
• Artificial Intelligence
• Human Computer Interaction
• Software Engineering …
To specific areas:
• Deep Belief Networks
• Dynamic Bayesian Networks
• Neuro-fuzzy Controller …
21. Klink-2 Algorithm
Klink-2 is an approach for learning large-scale
ontologies of research topics from corpora of
scientific articles and knowledge sources on
the web.
Given a pair of keywords it infers their
semantic relationship:
• superTopicOf
• contributesTo
• relatedEquivalent
Picture from Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to
generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA).
Relations shown in the figure: relatedEquivalent, skos:broaderGeneric, contributesTo
22. Klink-2 Algorithm
Given two topics x and y, Klink-2 computes metrics to infer a hierarchical relationship (superTopicOf, contributesTo) or a relatedEquivalent relationship:
• IR(x, y) is the number of papers associated with both x and y
• cR(x, y) measures how similar the distributions of topics with which x and y co-occur are
• n(x, y) is the string similarity between the two topics, using the normalised Levenshtein distance
(In the formulas, super = super topics and sib = siblings.)
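The first two metrics can be sketched in Python. This is an illustrative sketch over toy paper annotations, not the actual Klink-2 implementation; here cR is approximated as the cosine similarity of the two topics' co-occurrence distributions.

```python
import math

def ir(papers, x, y):
    # IR(x, y): number of papers annotated with both topics x and y
    return sum(1 for topics in papers if x in topics and y in topics)

def cr(papers, x, y):
    # cR(x, y): similarity between the co-occurrence distributions
    # of x and y over all other topics (cosine, as an approximation)
    vocab = sorted({t for topics in papers for t in topics} - {x, y})
    vx = [ir(papers, x, t) for t in vocab]
    vy = [ir(papers, y, t) for t in vocab]
    dot = sum(a * b for a, b in zip(vx, vy))
    nx = math.sqrt(sum(a * a for a in vx))
    ny = math.sqrt(sum(b * b for b in vy))
    return dot / (nx * ny) if nx and ny else 0.0

# toy corpus: each paper is the set of topics it is annotated with
papers = [
    {"semantic web", "linked data", "ontologies"},
    {"semantic web", "linked data"},
    {"semantic web", "ontologies"},
    {"machine learning", "neural networks"},
]
```

On this corpus, "linked data" and "ontologies" co-occur with the same topics, so their cR is maximal, while unrelated topics score 0.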
24. Topic Classification
Aims at identifying the relevant subjects of a set of documents.
In the scholarly domain: identifying research topics within scientific
articles.
State of the art:
• Topic Models (e.g., LDA)
• Machine Learning
• Citation Networks
• Natural Language Processing (CSO Classifier)
25. Topic Models
Latent Dirichlet Allocation* and many of its derivatives
Represent each document as a mixture of topics; a topic is a
multinomial distribution over words, i.e. a discrete probability
distribution defining the likelihood that each word will appear
in a given topic
You need to set some hyperparameters and
pre-define the number of topics
* Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (2003). "Latent Dirichlet Allocation". Journal
of Machine Learning Research. 3 (4–5): pp. 993–1022.
Picture from Kim, Taewoo et al. (2019). Insider Threat Detection Based on User
Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences. 9.
4018.
26. Topic Models - Test
Research Paper
Topic 1 Topic 2 Topic 3 Topic 4
0.11504013 0.10714792 0.09840762 0.060187772
28. Machine Learning
CHALLENGES: labelled corpus with many
instances and balanced across classes
(subjects).
You can mitigate these challenges by working with fewer classes
at a higher level of granularity
Picture from Radha, Suja & Bandaru, Rama Krishna Rao. (2011). Taxonomy construction
techniques - issues and challenges. Civil Eng.. 2.
29. Citation Networks
Standing on the shoulders of giants:
• Science progresses by building on the work of
previous scientists and we credit previous work
through references
• Assumption: we tend to cite works from the same
field
• This information (papers + citations) is organised
in a network structure
• Topics, or scientific areas, or “hot fields” are
identified by clustering such network
30. Citation Networks in practice
Papers A–M arranged on a timeline, linked by citation edges. For example, Paper A cites:
• Paper C
• Paper D
• Paper G
Terminology: a node is also called vertex, item, object, or paper; an edge is also called link, tie, arc, relation, or cites.
31. Citation Networks - Clustering
• Clusters are cohesive groups of nodes.
• Clustering algorithms (community detection) look at the topology of the
network, identifying areas in which nodes are more connected among
themselves than to the rest of the network.
32. Citation Networks – Clustering Algorithms
• Edge Betweenness
• Fast Greedy (greedy optimization of modularity)
• Info map
• Label propagation
• Leading eigenvector (eigenvector of the community matrix)
• Louvain (multilevel optimization of modularity)
• Leiden (local optimization of modularity)
• Spinglass (statistical mechanics)
• Walktrap (short random walks)
• Clique percolation method
• … and many others
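Label propagation, one of the algorithms listed above, is simple enough to sketch without any graph library. This is a minimal dependency-free sketch (adjacency dict and tie-breaking rule are illustrative choices, not a reference implementation): each node repeatedly adopts the most frequent label among its neighbours until no label changes.

```python
import random

def label_propagation(adj, max_iter=100, seed=0):
    # adj: {node: set of neighbours}; every node starts in its own community
    rng = random.Random(seed)
    labels = {n: n for n in adj}
    for _ in range(max_iter):
        changed = False
        nodes = list(adj)
        rng.shuffle(nodes)  # asynchronous updates in random order
        for n in nodes:
            if not adj[n]:
                continue
            # adopt the label most frequent among neighbours (min breaks ties)
            counts = {}
            for nb in adj[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts.values())
            new = min(l for l, c in counts.items() if c == best)
            if new != labels[n]:
                labels[n] = new
                changed = True
        if not changed:
            break
    return labels

# two disjoint triangles: each one collapses into its own community
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
       3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(adj)
```

On citation networks the same idea applies with papers as nodes and citations as edges; production algorithms (Louvain, Leiden, Infomap) optimise an explicit quality function instead.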
33. Citation Networks - CitNetExplorer
• Papers arranged on a timeline
• Each colour represents a cluster
• Identify the characteristic terms for
each cluster
Picture from Van Eck, Nees Jan, and Ludo Waltman. "Citation-based clustering of
publications using CitNetExplorer and VOSviewer." Scientometrics 111.2 (2017): 1053-1070.
34. Citation Networks - VOSViewer
• Each colour represents a co-
citation cluster which defines a
scientific area
Picture from https://www.tudelft.nl/library/actuele-themas/research-analytics/case-12-citation-
networks-2/
35. Citation Network
• Widely adopted in literature
Problem:
• Given a cluster one needs to identify the characteristic topic
• Research papers defining new topics might need some years to gather
citations.
• This approach is ineffective for the early detection of new research topics
36. Natural Language Processing - CSO Classifier
Uses state-of-the-art technologies to parse documents and recognise
research entities. As input, it takes the metadata associated with a
research paper (title, abstract, keywords) and returns a selection of
research concepts drawn from the Computer Science Ontology
Salatino, Angelo A., et al. "The CSO classifier:
Ontology-driven detection of research topics in
scholarly articles." International Conference on Theory
and Practice of Digital Libraries. Springer, Cham, 2019.
38. Syntactic Module
• We split the text in unigrams, bigrams and trigrams
• For each n-gram we measure the Levenshtein similarity with the topics
in CSO
• We select CSO topics whose similarity with an n-gram is greater than
or equal to 0.94
• This helps handle plurals and hyphenated topics, such as:
• “knowledge based systems” and “knowledge-based systems”
• “database” and “databases”
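The steps above can be sketched as follows. This is a simplified illustration of the syntactic matching idea, with a hand-rolled normalised Levenshtein similarity; the real CSO Classifier's implementation details differ.

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a, b):
    # normalised Levenshtein similarity in [0, 1]
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def match_topics(text, cso_topics, threshold=0.94):
    # split the text into unigrams, bigrams and trigrams, then keep
    # every CSO topic whose similarity with some n-gram is >= threshold
    tokens = text.lower().split()
    grams = []
    for n in (1, 2, 3):
        grams += ngrams(tokens, n)
    return {t for g in grams for t in cso_topics
            if similarity(g, t) >= threshold}

topics = {"knowledge based systems", "digital libraries", "semantic web"}
found = match_topics(
    "A survey of knowledge-based systems for digital libraries", topics)
```

Note how the 0.94 threshold lets "knowledge-based systems" match the CSO topic "knowledge based systems" (one edit over 23 characters).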
40. Semantic Module
Word Embedding model
• We used titles and abstracts from 4.5M papers in Computer Science
• Pre-processed text:
• Topic replacement – “digital libraries” → “digital_libraries”
• Collocation analysis – “highest_accuracies”, “highly_cited_journals”
• Trained word2vec model
method: skipgram | emb. size: 128 | window size: 10 | negative: 5 | max iter.: 5 | min-count cutoff: 10
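The topic-replacement pre-processing can be sketched as below; the function name is illustrative, and the commented-out gensim call only indicates how the listed hyperparameters would map onto a word2vec training run, not the exact training code used.

```python
def replace_topics(text, cso_topics):
    # replace multi-word CSO topic mentions with single underscore-joined
    # tokens; longest topics first so "digital libraries" wins over "libraries"
    for topic in sorted(cso_topics, key=len, reverse=True):
        text = text.replace(topic, topic.replace(" ", "_"))
    return text

corpus = [replace_topics(t.lower(), {"digital libraries", "semantic web"})
          for t in ["Indexing Digital Libraries on the Semantic Web"]]

# The model would then be trained with gensim, roughly:
# Word2Vec([doc.split() for doc in corpus], sg=1, vector_size=128,
#          window=10, negative=5, epochs=5, min_count=10)
```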
41. Word Embedding model
“king” = [0.32, 0.76,…]
“queen” = [0.42, 0.76,…]
“woman” = [0.56, 0.43,…]
“man” = [0.59, 0.42,...]
king + (woman – man) = queen
It locates synonyms
(related topics) close to
each other in this vector
space: high cosine
similarity
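The analogy on this slide can be reproduced with its own toy vectors. The extra word "apple" and its vector are made up here purely as a distractor; real embeddings have far more dimensions (e.g. 128, as above).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# two-dimensional toy embeddings from the slide, plus a distractor
emb = {"king": [0.32, 0.76], "queen": [0.42, 0.76],
       "woman": [0.56, 0.43], "man": [0.59, 0.42],
       "apple": [0.90, 0.10]}

# king + (woman - man): the offset roughly encodes the gender relation
target = [k + w - m for k, w, m in
          zip(emb["king"], emb["woman"], emb["man"])]
nearest = max((w for w in emb if w not in {"king", "woman", "man"}),
              key=lambda w: cosine(target, emb[w]))
```

The nearest remaining word by cosine similarity is "queen", matching the slide's equation.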
42. Semantic Module
Entity Extraction
• POS tagger, and grammar-based chunk parser <JJ.*>*<NN.*>+
“digital libraries”
CSO concept identification
• Selects all CSO topics found in the top-10 similar words of the resulting
n-grams (with cosine similarity > 0.7)
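The chunk grammar <JJ.*>*<NN.*>+ (zero or more adjectives followed by one or more nouns) is normally run with a tool such as NLTK's RegexpParser; below is a dependency-free sketch of the same pattern over already POS-tagged tokens. The tags in the example are standard Penn Treebank tags; the encoding trick is illustrative.

```python
import re

def chunk_noun_phrases(tagged):
    # encode each tag as J (adjective), N (noun) or O (other),
    # then match the grammar <JJ.*>*<NN.*>+ as the regex J*N+
    code = "".join("J" if tag.startswith("JJ") else
                   "N" if tag.startswith("NN") else "O"
                   for _, tag in tagged)
    return [" ".join(tok for tok, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"J*N+", code)]

tagged = [("We", "PRP"), ("index", "VBP"), ("large", "JJ"),
          ("digital", "JJ"), ("libraries", "NNS")]
chunks = chunk_noun_phrases(tagged)
```

The extracted chunk "large digital libraries" is the kind of candidate n-gram that is then looked up in the embedding space.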
43. Semantic Module
Concept ranking
• We assign a score to each identified topic:
• Frequency – number of times it was inferred
• Diversity – number of unique text chunks from which it was inferred
Concept Selection
• Elbow method
CSO Topic score
domain ontologies 40
semantic web 40
ontology learning 40
data mining 40
heterogeneous resources 24
semantics 24
world wide web 10
network architecture 6
scholarly communication 6
ontology matching 6
… …
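One simple variant of the elbow method, applied to the score table above, cuts the ranked list at the largest drop between consecutive scores. This is a sketch of the idea, not necessarily the exact criterion used by the CSO Classifier.

```python
def elbow_select(scored_topics):
    # scored_topics: (topic, score) pairs sorted by score, descending;
    # cut the list just before the largest drop between consecutive scores
    scores = [s for _, s in scored_topics]
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = drops.index(max(drops)) + 1
    return [t for t, _ in scored_topics[:cut]]

scored = [("domain ontologies", 40), ("semantic web", 40),
          ("ontology learning", 40), ("data mining", 40),
          ("heterogeneous resources", 24), ("semantics", 24),
          ("world wide web", 10), ("network architecture", 6),
          ("scholarly communication", 6), ("ontology matching", 6)]
selected = elbow_select(scored)
```

On the table above, the biggest drop (40 → 24) keeps exactly the four top-scoring topics.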
45. Post Processing
Combination of output
Semantic enhancement
• We use the superTopicOf relation to enhance the output set
• E.g., if “machine learning” is returned, we also add “artificial intelligence”
• Provides wider context for the analysed paper
• Enables analytics on high-level abstract topics (e.g., digital libraries)
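The enhancement step can be sketched as a walk up the superTopicOf relation. The dictionary here is a hypothetical slice of CSO, and this sketch walks all the way to the roots, whereas the classifier may only add direct super-topics.

```python
def enhance(topics, super_topic_of):
    # super_topic_of: child topic -> list of direct super-topics
    # (a hypothetical slice of CSO's superTopicOf relation)
    enhanced = set(topics)
    frontier = list(topics)
    while frontier:
        for parent in super_topic_of.get(frontier.pop(), []):
            if parent not in enhanced:
                enhanced.add(parent)
                frontier.append(parent)
    return enhanced

super_topic_of = {"machine learning": ["artificial intelligence"],
                  "artificial intelligence": ["computer science"]}
result = enhance({"machine learning"}, super_topic_of)
```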
47. Metadata Extraction – Springer Nature Use Case
• Traditionally, editors choose a list of related keywords and categories in
relevant taxonomies according to:
• their own experience of similar conferences
• a visual exploration of titles and abstracts
• a list of terms given by the curators or derived by calls for papers
Salatino, A. A., Osborne, F., Birukou, A., & Motta, E. (2019, October). Improving editorial workflow and metadata quality at Springer Nature. In International Semantic Web
Conference (pp. 507-525). Springer, Cham.
48. Classification of Proceedings – A Complex Problem
Classifying publications manually presents a number of issues for a large
publisher such as Springer Nature.
• It is a complex process that requires expert editors
• It is a time-consuming process that can hardly scale
• It is easy to miss the emergence of new topics
• It is easy to assume that some traditional topics are still popular when
this is no longer the case
• The keywords used in the call for papers are often a reflection of what a
venue aspires to be, rather than the real contents of the proceedings.
49. Smart Topic Miner Architecture
Demo of STM: http://stm-demo.kmi.open.ac.uk
SN editors interact with an HTML GUI, backed by a parser and a visualisation generator on top of the STM engine; the engine draws on CSO, SNCs, historical data, and the word2vec model.
The STM engine comprises: i) CSO Classifier, ii) Topic Explanation, iii) Taxonomy Generation, iv) SN Tags Inference, v) Previous Classification.
50. Business Value
• STM halves the time needed for classifying proceedings from 30 to 15
minutes
• It allows also junior editors to work on the classification of proceedings,
distributing the load and reducing costs
• It achieved an overall 75% cost reduction
• The adoption of a controlled vocabulary makes the process more robust
and facilitates the identification of related editorial products
52. Recommendation of Books
Identifying SN Books to be marketed at specific events, such as academic conferences
• Manual book selection has some limitations:
• Requires years of experience and domain-specific knowledge
• Requires browsing through a large catalogue of information – a purely syntactic process
• Prone to biases
• Aim:
• Provide a more effective way to support the book selection process
• Help drive the operating costs down
• Semi-automated selection of the most appropriate books, journals, and proceedings to
market at a scientific event
• We developed the Smart Book Recommender: http://rexplore.kmi.open.ac.uk/SBR-demo
Thanapalasingam, T., Osborne, F., Birukou, A., & Motta, E. (2018, October). Ontology-based recommendation of editorial products. In International Semantic Web
Conference (pp. 341-358). Springer, Cham.
53. Workflow (1) – Classifying Conferences & Editorial Products
• We characterised books and conference proceedings through their
research topics:
• The metadata of chapters/papers (i.e. keywords, title and
abstract) are mapped to research topics in CSO
• This returns a set of research topics
Workflow: Classifying Conferences & Editorial Products → Computing Pairwise Similarity → Querying & Visualizing results
54. Workflow (2) – Computing Pairwise Similarity
• Computing the cosine similarity of two editorial products using vectors of
research topics and their weights
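The pairwise-similarity step can be sketched as follows; the topic vocabulary and weights are made-up toy values, not real SBR data.

```python
import math

def topic_vector(weights, vocab):
    # project a {topic: weight} dict onto a fixed topic vocabulary
    return [weights.get(t, 0.0) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical topic weights for one book and one conference
vocab = ["semantic web", "data mining", "robotics"]
book = {"semantic web": 3.0, "data mining": 1.0}
conference = {"semantic web": 2.0, "data mining": 2.0}
score = cosine(topic_vector(book, vocab), topic_vector(conference, vocab))
```

Ranking all books by this score against a target conference gives the recommendation list.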
58. Research Trends Forecast
• We created a new approach for predicting the impact of a topic on
industry.
• It uses four time series: i) publications from academia, ii) publications
from industry, iii) patents from academia, and iv) patents from industry.
• We tested it on the task of predicting if an emergent research topic will
have a significant impact on industry (> 50 patents) in the following 10
years.
• This evaluation substantiates the hypothesis that considering the four
time series separately is conducive to higher-quality predictions, and
suggests that RI and RA are good indicators for PI.
59. Data modelling pipeline
Research papers and patents are filtered, classified with the CSO Classifier against the Computer Science Ontology, and enriched with affiliation types, yielding a fine-grained representation of research topics.
Salatino, A., Osborne, F., & Motta, E. (2020, September).
Researchflow: Understanding the knowledge flow between
academia and industry. In International Conference on
Knowledge Engineering and Knowledge Management (pp.
219-236). Springer, Cham.
60. Scholarly Data++
Improving Editorial Workflow and Metadata Quality at Springer Nature – the same example paper shown in the Scholarly Data slide, now enriched with two derived fields:
• Topics (from Text: Title, Abstract, Keywords): scholarly data, semantic web, data mining, ontology, digital libraries, …
• Affiliation Types: Academia, Industry
Original metadata retained: Affiliations, Authors, Citations, References, Conference/Journal, Keywords (Scholarly data, Bibliographic metadata, Topic classification, …)
61. Research Topic
Each research topic is represented through 4 signals:
Papers from Academia (RA)
Papers from Industry (RI)
Patents from Academia (PA)
Patents from Industry (PI)
62. Machine Learning approach
We used:
• Logistic Regression (LR)
• Random Forest (RF)
• AdaBoost (AB)
• Convolutional Neural Network (CNN)
• Long Short-term Memory Neural Network (LSTM)
On several combinations of time-series: RA, RI, PA and PI
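Assembling features from the four time series can be sketched as below. The data and thresholding rule are illustrative (the slides state the > 50 patents over 10 years criterion); keeping the signals separate versus merging them is exactly the design choice the evaluation compares.

```python
def make_features(series, signals=("RA", "RI", "PA", "PI")):
    # series: {"RA": [...], "RI": [...], ...} yearly counts for one topic;
    # restricting `signals` yields the different time-series combinations
    feats = []
    for name in signals:
        feats.extend(series[name])
    return feats

def label(future_pi_counts, threshold=50):
    # positive class: the topic gathers more than `threshold` patents
    # from industry (PI) over the following 10 years
    return sum(future_pi_counts[:10]) > threshold

# hypothetical yearly counts for one emerging topic
topic = {"RA": [5, 9, 14], "RI": [1, 2, 4],
         "PA": [0, 1, 1], "PI": [0, 2, 3]}
```

The resulting vectors feed the classifiers listed above (LR, RF, AB, CNN, LSTM).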
69. Conclusions
• We have seen how effective the Computer Science Ontology Framework is
• It enables us to combine machine learning algorithms and semantic
technologies to produce high-level applications for gaining insights
in the field of Science of Science
• Future work: applying this framework to other domains of Science
Corpus of Research Papers
Klink-2 Algorithm
Computer Science Ontology
CSO Classifier
High-level Applications