NLP and Knowledge Graphs

Carlos Badenes-Olmedo
Ontology Engineering Group (OEG)
Universidad Politécnica de Madrid (UPM)
2023/06/15
carlos.badenes@upm.es
@carbadol

“NLP and Knowledge Graphs“- carlos.badenes@upm.es
Outline
2
▪ Context (~10min)
▪ Ontology Engineering Group
▪ Personal Background
▪ Relevant Outcomes (~30min)
▪ Cross-lingual Document Similarity (librAIry)
▪ Multiple and Heterogeneous QA (MuHeQA)
▪ Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ Ongoing Research (~5min)
▪ Multi-Dimensional Fake News Identification
▪ Inconsistency Detection
▪ Conversational-assisted Topic Labelling
▪ Multi-hop KGQA

Ontology Engineering Group
3
▪ Directors: Asunción Gómez-Pérez, Oscar Corcho
▪ Position: ranked 8th at UPM (out of 200 groups)
▪ Research group (~30 people)
▪ 170+ Collaborations
▪ 50+ Visitors
https://oeg.fi.upm.es
https://github.com/oeg-upm
@oeg-upm

Ontologies LOT: Industrial Methodology

Ontologies
Knowledge Graphs
▪ Synchronous or asynchronous integration of heterogeneous data sources
▪ Data quality, cleaning and linking functions
▪ Published as a Linked Data service or used as a Maven dependency

Ontologies
Knowledge Graphs
▪ Linguistic Linked Open Data
▪ Word Sense Disambiguation
▪ Named Entity Recognition
▪ Question Answering
NLP
Information Extraction
Knowledge-driven Exploration
Entity Linking
▪ Probabilistic Topic Models
▪ Taxonomies from corpora
▪ Large-scale Searching
▪ Classification of Out-of-Knowledge-Base Entities
KeyQ

Ontologies
Knowledge Graphs
▪ Creating KGs of Research Software Metadata
▪ Tracking FAIR principles in Research Software
NLP
Open Science

Personal Background
▪ Timeline milestones: 2016, 2020, 2021, 2022
▪ Career path: Industry, then Academy
▪ Research lines: Probabilistic Topic Models (http://librairy.eu), Clinical Knowledge Graphs, Hybrid QA, Fake News Detection

Call for Papers
9
▪ https://kgsum.github.io
▪ Topics
▪ Methods to summarize KGs
▪ KGs features related to summaries
▪ Scope and Impact of KG summaries
▪ Call for Papers:
▪ Paper Submission: Jul 7 (23:59 AoE), 2023
▪ Notification to Authors: Jul 24, 2023
▪ Workshop Dates: Nov 6-7, 2023 in Athens, Greece.

Call for Workshops and Tutorials
10
▪ https://www.k-cap.org/2023
▪ Topics
▪ Knowledge representation
▪ Knowledge acquisition
▪ Problem-solving and reasoning …
▪ Call for:
▪ Papers: Aug 13 (23:59 AoE), 2023
▪ Workshop/Tutorials: Jul 9, 2023
▪ Conference Dates: Dec 5-7, 2023 in Florida, USA.
▪ Steering Committee:
▪ Jose Manuel Gómez-Pérez
▪ Anna Lisa Gentile
▪ Ilaria Tiddi
▪ Krzysztof Janowicz
▪ Raphael Troncy
▪ Daniel Garijo
▪ Valentina Tamma
▪ Marieke Van Erp
▪ Rafael Goncalves
▪ Oscar Corcho
▪ Organising Committee:
▪ K. Brent Venable
▪ Daniel Garijo
▪ Brian Jalaian
▪ Blerina Spahiu
▪ Niranjan Suri
▪ Marieke van Erp
▪ Carlos Badenes-Olmedo
▪ Alan Ordway

Outline
11
▪ Personal Background
▪ Multi-hop KGQA

Cross-lingual Document Similarity
12

13
▪ Three challenges for large-scale document retrieval in multilingual corpora:
▪ C1: Content representation
▪ C2: High-dimensional correlation matrix
▪ C3: Multi-lingual Comparison
[Figure: how to compare documents such as patents and PhD theses written in EN, ES and FR?]

14
▪ Probabilistic Topic Models [Blei et al., 2003]
▪ Each topic is a distribution over words
▪ Each word is drawn from one of those topics
▪ Each document is a mixture of corpus-wide topics
▪ Each document is represented by a fixed-length vector of topic proportions
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4–5), 993–1022.
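The generative story behind these bullets can be sketched in a few lines. This is a toy illustration only, not a trained model: the two topics, their word probabilities and the function name generate_document are hypothetical values chosen for the example.

```python
import random


def generate_document(doc_topic_mix, topic_word_dists, length, seed=0):
    """Sample a toy document following the LDA generative story."""
    rng = random.Random(seed)
    topics = list(doc_topic_mix)
    weights = [doc_topic_mix[t] for t in topics]
    words = []
    for _ in range(length):
        # 1) draw a topic from the document's mixture of topics
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # 2) draw a word from that topic's distribution over words
        vocab = list(topic_word_dists[topic])
        probs = [topic_word_dists[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words


# Hypothetical toy model: two topics over a tiny vocabulary.
TOPICS = {
    "telecom": {"radio": 0.5, "network": 0.3, "equipment": 0.2},
    "health": {"drug": 0.6, "disease": 0.4},
}
doc = generate_document({"telecom": 0.8, "health": 0.2}, TOPICS, length=12)
print(doc)
```

Inference in LDA runs this story backwards: given only the words, it estimates the topics and the per-document topic proportions.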

15
Distance Metrics
▪ Similar documents do not necessarily share the most relevant topic of each.
Examples: a) simJS = 0.74, b) simJS = 0.71
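A minimal sketch of a Jensen-Shannon based similarity between two topic distributions. It assumes simJS is defined as one minus the JS divergence computed with base-2 logarithms; the slides do not give the exact formula, and the two distributions are made-up values.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """1 - Jensen-Shannon divergence (assumed definition of simJS)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Two hypothetical documents over three topics.
doc_a = [0.40, 0.35, 0.25]
doc_b = [0.35, 0.40, 0.25]

top_a = max(range(len(doc_a)), key=doc_a.__getitem__)
top_b = max(range(len(doc_b)), key=doc_b.__getitem__)
sim = js_similarity(doc_a, doc_b)
print(top_a, top_b, sim)
```

The two documents have different top topics yet are almost identical as distributions, which is why comparing only the single most relevant topic can miss similar documents.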

16
▪ Hashing Topic Distributions [Badenes-Olmedo et al., 2019]
▪ Hierarchical set of topics based on their relevance
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
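One way to picture a hierarchical set of topics is to rank the topics of a document by weight and assign them to levels. The rank-based grouping below is an assumption made for illustration; the cited paper describes its own grouping strategies, and hash_topics, levels and per_level are hypothetical names.

```python
def hash_topics(distribution, levels=2, per_level=1):
    """Turn a topic distribution into a list of topic sets, most relevant first.

    Level 0 holds the `per_level` highest-weighted topics, level 1 the next
    ones, and so on. Topics beyond the last level are dropped.
    """
    ranked = sorted(
        range(len(distribution)), key=lambda t: distribution[t], reverse=True
    )
    return [
        set(ranked[i * per_level:(i + 1) * per_level]) for i in range(levels)
    ]


# Hypothetical document over four topics.
code = hash_topics([0.05, 0.60, 0.10, 0.25])
print(code)
```

Documents can then be compared through these small sets instead of their full, dense topic vectors.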

17
▪ Similarity computation can be framed as an approximate nearest neighbour (ANN) search problem [Mao et al., 2017] based on topic clusters.
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.

18
▪ Multi-Lingual Topic Models [Vulić et al., 2015]:
▪ language-specific features of topics
▪ requires:
▪ a parallel corpus (sentence-aligned documents)
▪ or a comparable corpus (theme-aligned documents)
Example: topic A aligned across languages
EN ('communication system'): radio, equipment, network, communication, regulatory
ES ('sistema de comunicación'): equipo, red, comunicación, espectro, electromagnético
FR ('systeme de communication'): communications, reseaux, electroniques, acces, telecommunications

19
▪ Multi-Lingual Dictionaries [Hao and Paul, 2018]
▪ more widely available than parallel corpora (e.g. PanLex or Wiktionary)
▪ models are built from words in a target language
▪ dictionaries act as a supervised method to align topics
▪ topics are conditioned by pre-established language relations

20
WordNet:
▪ It is a semantic network.
▪ Synonymous words are grouped into synsets.
▪ Synsets are linked to other synsets via semantic relations (e.g. hypernymy or hyponymy).
Bond, Francis, P. Vossen, John P. McCrae and Christiane D. Fellbaum. “CILI: the Collaborative Interlingual Index.” Global WordNet Conference (2016).

21
We propose an unsupervised algorithm that:
▪ is based on the Open Multilingual Wordnet (OMW) knowledge base (no translations required)
▪ creates language-specific concept hierarchies (no parallel or comparable corpora required)
▪ uses only the most relevant topics (no density-based distance metrics)
Example topics represented as synsets:
EN (topic A): radio.n.01, equipment.n.01, network.n.02, net.n.06, communication.n.02
ES (topic G): kit.n.02, equipment.n.01, net.n.02, web.n.06, communication.n.02
FR (topic K): access.n.02, approach.n.07, entree.n.02, communication.n.02, bout.n.02
Badenes-Olmedo, Carlos, José Luis Redondo García and Óscar Corcho. “Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies.” Proceedings of the 10th International Conference on Knowledge Capture (2019)
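The idea of replacing language-specific topic words with shared concepts can be sketched with a toy lexicon. The word-to-synset entries below are a hand-written, hypothetical excerpt for illustration; a real system would look them up in the Open Multilingual Wordnet, and topic_to_concepts is a made-up helper name.

```python
# Hypothetical excerpt of a per-language word -> synset lexicon.
LEXICON = {
    "en": {
        "radio": "radio.n.01",
        "equipment": "equipment.n.01",
        "communication": "communication.n.02",
    },
    "es": {
        "equipo": "equipment.n.01",
        "comunicación": "communication.n.02",
    },
    "fr": {
        "communications": "communication.n.02",
    },
}


def topic_to_concepts(top_words, lang):
    """Map the top words of a topic to language-independent synset ids."""
    lexicon = LEXICON[lang]
    return {lexicon[w] for w in top_words if w in lexicon}


en_concepts = topic_to_concepts(["radio", "equipment", "network"], "en")
es_concepts = topic_to_concepts(["equipo", "red", "comunicación"], "es")
shared = en_concepts & es_concepts
print(en_concepts, es_concepts, shared)
```

Once topics from each language are expressed as synsets, they become comparable without translating the documents.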

22
▪ hierarchical set of topics based on relevance
▪ nearest neighbour searches (k-d tree)
▪ Boolean similarity (Jaccard index)
[Figure: concept hierarchy queried level by level (1st and 2nd hierarchy levels) with example concepts radio.n.01, buy.v.01, bout.n.02 and net.n.01]
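A minimal sketch of the Jaccard comparison over hierarchical sets. Each document is assumed to be a list of concept sets, one per hierarchy level; the example sets are hypothetical and the k-d tree search step is not shown.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def level_similarity(code_a, code_b):
    """Compare two hierarchical codes level by level."""
    return [jaccard(x, y) for x, y in zip(code_a, code_b)]


# Hypothetical documents: first and second hierarchy levels.
doc_en = [{"radio.n.01", "net.n.01"}, {"communication.n.02"}]
doc_es = [{"radio.n.01", "buy.v.01"}, {"communication.n.02"}]

scores = level_similarity(doc_en, doc_es)
print(scores)
```

Because the comparison is set-based, it only needs membership tests rather than arithmetic over dense vectors.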

23
https://github.com/librairy/demo

24
▪ Document Classification Task
▪ Metrics: precision, recall and F-measure
▪ Data: ~1k docs (monolingual, bilingual or multilingual documents)
▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
▪ supervised = labeledLDA
▪ unsupervised = LDA
[Figure: bar charts of Precision and Recall (0-100) for the supervised and unsupervised models on the en, es, fr, en-es, en-fr, es-fr and en-es-fr document sets]
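For reference, the three classification metrics can be computed per document from the set of predicted categories and the set of expected categories. The category names below are made up; they are not results from the evaluation.

```python
def precision_recall_f1(predicted, expected):
    """Set-based precision, recall and F1 for one document."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for a single document.
predicted = {"telecommunications", "energy"}
expected = {"telecommunications", "industry", "trade"}
p, r, f1 = precision_recall_f1(predicted, expected)
print(p, r, f1)
```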

25
▪ Document Retrieval Task
▪ Metrics: precision@3, precision@5 and precision@10
▪ Data: ~1k docs (monolingual, bilingual or multilingual documents)
▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
▪ supervised = labeledLDA
▪ unsupervised = LDA
[Figure: bar charts of Precision@3 and Precision@10 (0-100) on the en, es, fr, en-es, en-fr, es-fr and en-es-fr document sets]
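Precision@k counts how many of the first k retrieved documents are relevant. A small sketch with hypothetical document ids:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking returned for one query document.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}

p_at_3 = precision_at_k(ranking, relevant, 3)
p_at_5 = precision_at_k(ranking, relevant, 5)
print(p_at_3, p_at_5)
```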

Outline
26
▪ Background
▪ Multi-hop KGQA

Multiple and Heterogeneous QA (MuHeQA)
27
▪ Objective: facilitate access to information
from KGs and unstructured data.

28
Badenes-Olmedo, C., & Corcho, O. (2023). MuHeQA: Zero-shot Question Answering over Multiple and Heterogeneous Knowledge Bases. Semantic Web Journal.

29
https://github.com/librairy/muheqa

▪ SPARQL queries used to extract the properties of a KG resource
Wikidata: DBpedia:
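The queries themselves did not survive in this transcript. As a rough sketch of what such property-listing queries can look like, and not the queries used by MuHeQA, the helpers below build one query per endpoint. The function names and the LIMIT value are assumptions; the Wikidata version relies on the prefixes predefined by the Wikidata Query Service.

```python
def wikidata_properties_query(entity_id, limit=100):
    """SPARQL listing the direct properties and values of a Wikidata entity."""
    return f"""
SELECT ?property ?propertyLabel ?value ?valueLabel WHERE {{
  wd:{entity_id} ?claim ?value .
  ?property wikibase:directClaim ?claim .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def dbpedia_properties_query(resource_uri, limit=100):
    """SPARQL listing every property and value of a DBpedia resource."""
    return f"""
SELECT ?property ?value WHERE {{
  <{resource_uri}> ?property ?value .
}}
LIMIT {limit}
""".strip()


wikidata_query = wikidata_properties_query("Q42")
dbpedia_query = dbpedia_properties_query("http://dbpedia.org/resource/Madrid")
print(wikidata_query)
print(dbpedia_query)
```

Each string can be sent to the corresponding public SPARQL endpoint with any HTTP or SPARQL client.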

▪ Performance identifying keywords in a question:
• Our method identifies the entities mentioned in a question along with the relevant terms discovered using PoS annotations.

▪ Performance when discovering Wikidata or DBpedia resources:
Wikidata: DBpedia:
• We do not build vector spaces where each resource is represented by its labels [7], because one of our assumptions is to avoid supervised models that require prior training for specific classification tasks over the KG.
• Our proposal does not require training datasets: it performs textual searches, based on the terms identified in the query, over an inverted index of the labels associated with the resources.
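A minimal sketch of label search over an inverted index, assuming whitespace tokenisation and ranking by the number of matched terms. The resource identifiers and labels are hypothetical, and the actual MuHeQA index may be built differently.

```python
from collections import defaultdict


def build_index(labels):
    """Map each lower-cased label token to the resources that contain it."""
    index = defaultdict(set)
    for resource, label in labels.items():
        for token in label.lower().split():
            index[token].add(resource)
    return index


def search(index, terms):
    """Rank resources by how many query terms appear in their label."""
    scores = defaultdict(int)
    for term in terms:
        for resource in index.get(term.lower(), ()):
            scores[resource] += 1
    return sorted(scores, key=lambda r: (-scores[r], r))


# Hypothetical resources and labels.
LABELS = {
    "ex:Madrid": "Madrid",
    "ex:RealMadrid": "Real Madrid CF",
    "ex:Spain": "Kingdom of Spain",
}
index = build_index(LABELS)
candidates = search(index, ["Real", "Madrid"])
print(candidates)
```

No training data is involved: the index is built directly from the labels already present in the knowledge graph.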

▪ Performance based on Knowledge Graph-oriented QA:
• The results show that our approach performs close to the best system, STaF-QA, and better than other KGQA-specific approaches.
• However, recall is a weak point, which means our approach needs to improve how it elaborates responses. The answers are perhaps too straightforward, and we should work on constructing more complex responses.

▪ Performance based on Document-oriented QA:
• The answers created by our algorithm are not as elaborate as those in the evaluation dataset, which were created manually, and this penalises the performance of our system.
• For example, given the question "How many children were infected by HIV-1 in 2008-2009, worldwide?", the answer inferred by our system is "more than 400,000", while the correct answer is "more than 400,000 children were infected worldwide, mostly through MTCT and 90% of them lived in sub-Saharan Africa"
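The penalty on short answers can be made concrete with a token-overlap score. The sketch below assumes a token-level precision, recall and F1 in the style of extractive QA benchmarks; the slides do not state which metric was used, and the tokeniser is a simple illustrative choice.

```python
from collections import Counter


def tokens(text):
    """Lower-case whitespace tokens with surrounding punctuation removed."""
    return [t.strip('.,;:"\'') for t in text.lower().split()]


def token_scores(predicted, gold):
    """Token-level precision, recall and F1 between two answers."""
    pred, ref = Counter(tokens(predicted)), Counter(tokens(gold))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


system_answer = "more than 400,000"
gold_answer = (
    "more than 400,000 children were infected worldwide, mostly through "
    "MTCT and 90% of them lived in sub-Saharan Africa"
)
precision, recall, f1 = token_scores(system_answer, gold_answer)
print(precision, recall, f1)
```

Every token of the short answer is correct, so precision is perfect, but most of the reference tokens are missing, so recall and F1 stay low.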

Outline
35
▪ Academic and Industrial Background
▪ Personal Motivation
▪ Experiments in progress (~5min)
▪ Multi-hop KGQA

Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
36
▪ Release of scientific documents on coronaviruses (useful in document retrieval, information extraction and knowledge management tasks)
▪ First goal: make the scientific literature around coronaviruses useful for some of the immediate needs of hospital pharmacies (e.g. drug shortages, or interactions between chemical substances)
▪ Afterwards: provide an up-to-date knowledge base on coronaviruses extracted from scientific publications
Badenes-Olmedo, Carlos and Óscar Corcho. “Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature.” Journal of Biomedical Informatics 142 (2023): 104382.

37
▪ RQ1: How to systematize the processing of scientific corpora to build knowledge graphs?
▪ RQ2: How to identify and standardize drugs, diseases and genes/proteins mentioned in scientific texts?
▪ RQ3: How to formally describe evidence based on the association of drugs, diseases, genes and proteins mentioned in the same paragraph of a scientific article?
▪ RQ4: How to relate drugs and diseases from the paragraphs where they are mentioned?
▪ RQ5: How to provide access to a knowledge graph, together with the content of a collection of scientific publications, through natural language queries?

38
▪ There is no common methodology for the construction of knowledge graphs from the biomedical literature, but rather a series of steps or stages that coincide among existing works.
▪ We propose a workflow that is also valid when information update cycles are short
o E.g. one-week update cycles
▪ This workflow addresses the research question RQ1


40
• Fine-tuned the BioBERT [Lee et al., 2020] model to identify the following biomedical classes:
- Diseases: BC5CDR-Diseases and NCBI-Diseases
- Chemicals: BC4CHEMD and BC5CDR-Drugs
- Genetics: JNLPBA and BC2GM
• Unique representation of each concept from a set of related terms composed using multiple information sources
- 7.4 million annotations in JSON format
• Multiple sources were taken into account to create a database for each of the biomedical entities
• With the creation of these models we address the research question RQ2


42
• First, identify the requirements to express biomedical concepts and the associations between them.
- Unified Medical Language System (UMLS)
- Several efforts have been made to integrate biomedical knowledge into a single shared representation space (e.g. DISNET platform)
• Then, create an ontology that describes:
- (i) biomedical concepts and associations between them and
- (ii) the evidence supporting these associations.
• Challenges:
- Biomedical taxonomies and vocabularies with reduced semantics (e.g. SNOMED, ICD-10, UMLS...)
- How are the concepts related?

43
• We created the Evidences for BiOmedical Concepts Association (EBOCA) ontology to describe:
- biomedical concepts
- associations between them
- evidence supporting these associations.
• It defines the conceptual model on which the Drugs4Covid knowledge graph is built
• It is composed of two modules: EBOCA SEM-DISNET, oriented toward describing biomedical concepts and associations, and EBOCA Evidences, focused on representing evidence of these associations with metadata and provenance information
https://w3id.org/eboca/portal
Perez, Andrea Alvarez, Ana Iglesias-Molina, Lucia Prieto Santamaria, Maria Poveda-Villalon, Carlos Badenes-Olmedo and Alejandro Rodriguez-Gonzalez. “EBOCA: Evidences for BiOmedical Concepts Association Ontology.” International Conference Knowledge Engineering and Knowledge Management (2022).

44
EBOCA SEM-DISNET module:
•Designed to represent associations of common
biomedical concepts, such as: diseases,
phenotypes, genes, genetic variants, biological
pathways, drugs, proteins, and targets.
• Associations link pairs of concepts, for
example, the gene-disease or drug-disease
association
• Adds semantics to the DISNET structure
-phenotypic layer
-biological layer
-pharmacological layer

45
EBOCA Evidences module:
• Extends the associations between biomedical concepts of the SEM-DISNET module with metadata and provenance information
• This evidence may come from known curated sources, or may be drawn or inferred directly from the texts.
• Describes in more detail the type of evidence that supports the association and the agents involved in its extraction and publication

46
• The EBOCA ontology addresses the research question RQ3.
• The evaluation of an ontology usually seeks to identify inconsistencies or formal errors that invalidate its definition. However, since this work is oriented to the creation of a knowledge graph from a collection of scientific articles, it is more useful to measure how well the ontology covers a set of competency questions:
» 15 questions associated with the EBOCA SEM-DISNET module
» and 10 questions associated with the EBOCA Evidences module


48
▪ Objective: annotate the entities, their relationships and the evidence according to the EBOCA ontology
▪ Methodology:
- The mapping rules between the annotations and the ontology resources were created with Mapeathor and executed with the Morph-KGC library.
- The articles were then described according to the ontology in an RDF file.
- Finally, GraphDB was chosen to store the RDF content and Helio to provide a SPARQL access interface
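To give a feel for the RDF output of this step, the sketch below serialises one drug-disease association and its supporting paragraph as N-Triples. It is a hand-rolled illustration, not the Mapeathor or Morph-KGC pipeline, and every IRI (the example.org vocabulary, class and property names, and resource identifiers) is hypothetical rather than an actual EBOCA term.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def association_to_ntriples(paragraph_iri, drug_iri, disease_iri,
                            vocab="http://example.org/vocab#"):
    """Serialise one association and its evidence as N-Triples lines."""
    association_iri = f"{paragraph_iri}/association"
    triples = [
        (association_iri, RDF_TYPE, f"{vocab}DrugDiseaseAssociation"),
        (association_iri, f"{vocab}hasDrug", drug_iri),
        (association_iri, f"{vocab}hasDisease", disease_iri),
        (association_iri, f"{vocab}hasEvidence", paragraph_iri),
    ]
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


rdf_lines = association_to_ntriples(
    "http://example.org/paper/1/paragraph/3",
    "http://example.org/drug/remdesivir",
    "http://example.org/disease/covid-19",
)
print(rdf_lines)
```

In the described workflow, declarative mapping rules play the role of this function, so the same rules can be re-run on every update cycle.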

49

50


▪ Objective: facilitate access to the resources created in this work, especially the knowledge graph.
▪ Methodology:
o There are multiple ways to exploit a KG: predefined SPARQL queries, guided document searches, etc.
o Our proposal is a question-answering (QA) system based on natural language, MuHeQA
» combines Extractive QA and NLP with KGs
» Summarization, Evidence Extraction and Answer Generation
▪ https://drugs4covid.oeg.fi.upm.es/services/bio-qa

Outline
53
▪ Academic and Industrial Background
▪ Personal Motivation
▪ Multi-hop KGQA

“...In this context the specialist said that what we eat is the main cause of cancer risk, with a *NUMBER* percent; in front of tobacco consumption, responsible in a *NUMBER* percent; and infections, with *NUMBER* percent...”
Multi-Dimensional Fake News Identification
54
▪ Multi-Dimensional Fake News Identification:
Knowledge Graph, 5Ws evidences, Dictionary, Social Network impact
Inconsistency Detection
55
▪ Inconsistency Detection:
social media, political parties, specific issues, over time
Guillen-Pancho, Ibai, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)

Conversational-assisted Topic Labeling
56
▪ Conversational-assisted Topic Labeling:
Words Selection, Question Composition, Question Answering, Label Retrieval
Ramón-Ferrer, Virginia, Badenes-Olmedo, Carlos and Óscar Corcho. “Automatic Topic Label Generation Using Conversational Models.” (2023)

Multi-hop KGQA
57
▪ Multi-hop KGQA:
Liu-Chen, Teng, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)
Min, Sewon, Victor Zhong, Luke Zettlemoyer and Hannaneh Hajishirzi. “Multi-hop Reading Comprehension through Question Decomposition and Rescoring.” ArXiv abs/1906.02916 (2019)


Additional Content
59
▪ Ontology Engineering Framework
▪ Knowledge Graph Tools
▪ NLP Resources
▪ Open Science
Ontologies
Knowledge Graphs
NLP
Open Science

OEG Ontology Engineering Framework
LOT industrial methodology
60
http://lot.linkeddata.es
+20 projects
More details at: Poveda-Villalón, M., Fernández-Izquierdo, A., Fernández-López, M., & García-Castro,
R. (2022). LOT: An industrial oriented ontology engineering framework. Engineering Applications of
Artificial Intelligence, 111, 104755. https://doi.org/10.1016/j.engappai.2022.104755

LOT adoption
61
▪ + 20 projects (internal and external)
▪ https://lot.linkeddata.es/#stories (selected examples)

Technology landscape
62

63
• openly available corpus https://coralcorpus.linkeddata.es/
• 834 ontological requirements annotated
• 29 lexico-syntactic patterns
• HTML, CSV and RDF
• "Creative Commons Attribution 4.0 International" license

64
• https://chowlk.linkeddata.es
• Notation and converter
• Hosted by OEG
• Developed by OEG

65
• https://lov.linkeddata.es
• Vocabulary registry and index
• Hosted by OEG
• Not developed by OEG

66
• http://oops.linkeddata.es/
• Check for pitfalls (41 defined, 31 automated)
• Online app
• Web service
• http://themis.linkeddata.es
• Ontology unit tests based validation
• Online app
• Web service

67
• https://github.com/dgarijo/Widoco/
• Ontology documentation
• Desktop app (maintained by Daniel Garijo, ISI, California)
• Web service (OEG)

68
• https://github.com/oeg-upm/vocab.linkeddata.es
• Ontology portal generation
• jar distribution

69

Handle versions and distributed environments
70
Evaluation reports
HTML documentation
Diagrams
Permanent Ids
Content negotiation
Bundle
Pre-view
http://ontoology.linkeddata.es

Additional Content
71
▪ NLP Resources
▪ Open Science
Ontologies
Knowledge Graphs
NLP
Open Science

72
• https://morph.oeg-upm.net/
• OBDA (Ontology-Based Data Access)
• Morph-RDB
• Morph-GraphQL
• Morph-CSV

73
• https://helio.linkeddata.es/
• RDF from heterogeneous data sources
• Synchronous or asynchronous integration of heterogeneous data sources
• Data quality, cleaning and linking functions
• Published as a Linked Data service or used as a Maven dependency

74
• https://astrea.linkeddata.es
• Generation of SHACL shapes from ontologies
• Validation of data using SHACL shapes
• Online service

Additional Content
75
▪ NLP Resources
▪ Open Science
Ontologies
Knowledge Graphs
NLP
Open Science

librAIry
76
http://librairy.eu
@librairy_eu
https://github.com/librairy

Drugs4Covid
77
https://drugs4covid.oeg.fi.upm.es/
https://github.com/drugs4covid

Añotador
78
https://annotador.oeg.fi.upm.es/
https://github.com/mnavasloro/Annotador

TermitUp
79
https://termitup.oeg.fi.upm.es/

KeyQ
80
https://aiproc.linkeddata.es/

Methods for Knowledge-Based Systems and Deep Learning Integration
Semantic-based Initialization for OOKB entities
[Figure: three-phase pipeline. Phase 1: KG dumps, endpoint, KG ontology, entity embedding, ontology embedding, entity ontological information. Phase 2: embedding composition, ontological information embeddings. Phase 3: initial embedding. Class hierarchy examples: Thing / Place / Country and Thing / Person]
Amador-Domínguez, E., Serrano, E., Manrique, D., Hohenecker, P., and Lukasiewicz, T. (2021). An ontology-based deep learning approach for triple classification
with out-of-knowledge-base entities. Information Sciences, 564, 85–102.

Additional Content
82
▪ NLP Resources
▪ Open Science
Ontologies
Knowledge Graphs
NLP
Open Science

Open Science: Creating KGs of Research Software Metadata
83
▪ Readme Analysis
o Supervised classification
o Regular expressions
o Header analysis
▪ File exploration
o Notebooks
o Dockerfiles
o Documentation
▪ GitHub API
Repository Extraction Results (Metadata)
https://github.com/KnowledgeCaptureAndDiscovery/somef/
Kelley, A., & Garijo, D. (2021). A framework for creating knowledge graphs of scientific software metadata. Quantitative Science Studies, 1-37.
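As an illustration of the header-analysis idea, the sketch below scans the Markdown headers of a README and assigns them to metadata categories by keyword. It is not the SOMEF implementation: the regular expression, the category names and the keyword lists are assumptions made for this example.

```python
import re

# Markdown ATX headers such as "## Installation".
HEADER = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$", re.MULTILINE)

# Hypothetical categories and trigger keywords.
CATEGORY_KEYWORDS = {
    "installation": ("install", "setup"),
    "usage": ("usage", "example"),
    "citation": ("cite", "citation"),
}


def classify_headers(readme_text):
    """Group README header titles under the categories they mention."""
    found = {}
    for match in HEADER.finditer(readme_text):
        title = match.group(2)
        lowered = title.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                found.setdefault(category, []).append(title)
    return found


README = """# MyTool

## Installation
pip install mytool

## Usage examples
Run the tool on a repository.

## How to cite
"""
metadata = classify_headers(README)
print(metadata)
```

Keyword rules like these are cheap and transparent, while supervised classification, also listed on the slide, can cover sections whose headers are less predictable.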

Open Science: Tracking FAIR principles in Research Software
84
- Continuous updates to Research Software catalogs of tools
- Automated feedback on compliance with best practices, based on software metadata extraction
- Linking research articles with their corresponding software tools
https://software.oeg.fi.upm.es/
