NLP
and Knowledge Graphs
Carlos Badenes-Olmedo
Ontology Engineering Group (OEG)
Universidad Politécnica de Madrid (UPM)
2023/06/15
carlos.badenes@upm.es
@carbadol
“NLP and Knowledge Graphs“- carlos.badenes@upm.es
Outline
▪ Context (~10min)
  ▪ Ontology Engineering Group
  ▪ Personal Background
▪ Relevant Outcomes (~30min)
  ▪ Cross-lingual Document Similarity (librAIry)
  ▪ Multiple and Heterogeneous QA (MuHeQA)
  ▪ Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ Ongoing Research (~5min)
  ▪ Multi-Dimensional Fake News Identification
  ▪ Inconsistency Detection
  ▪ Conversational-assisted Topic Labelling
  ▪ Multi-hop KGQA
Ontology Engineering Group
▪ Directors: Asunción Gómez-Pérez, Oscar Corcho
▪ Position: ranked 8th at UPM (200 groups)
▪ Research group (~30 people)
▪ 170+ Collaborations
▪ 50+ Visitors
https://oeg.fi.upm.es
https://github.com/oeg-upm
@oeg-upm
Ontology Engineering Group
Ontologies LOT: Industrial Methodology
Ontology Engineering Group
Ontologies
Knowledge Graphs
▪ Synchronous or asynchronous integration of heterogeneous data sources
▪ Data quality, cleaning and linking functions
▪ Linked Data service for publishing data, or Maven dependency
Ontology Engineering Group
Ontologies
Knowledge Graphs
NLP
▪ Linguistic Linked Open Data
▪ Word Sense Disambiguation
▪ Named Entity Recognition
▪ Question Answering
Information Extraction
Knowledge-driven Exploration
Entity Linking
▪ Probabilistic Topic Models
▪ Taxonomies from corpora
▪ Large-scale Searching
▪ Classification of Out-of-Knowledge-Base (OOKB) Entities
KeyQ
Ontology Engineering Group
Ontologies
Knowledge Graphs
NLP
Open Science
▪ Creating KGs of Research Software Metadata
▪ Tracking FAIR principles in Research Software
Personal Background
[Figure: timeline with milestones in 2016, 2020, 2021 and 2022]
Stages: Industry, then Academy
Topics: Probabilistic Topic Models, Clinical Knowledge Graphs, Hybrid QA, Fake News Detection
http://librairy.eu
Call for Papers
▪ https://kgsum.github.io
▪ Topics
  ▪ Methods to summarize KGs
  ▪ KG features related to summaries
  ▪ Scope and impact of KG summaries
▪ Call for Papers:
  ▪ Paper Submission: Jul 7 (23:59 AoE), 2023
  ▪ Notification to Authors: Jul 24, 2023
  ▪ Workshop Dates: Nov 6-7, 2023 in Athens, Greece.
Call for Workshops and Tutorials
▪ https://www.k-cap.org/2023
▪ Topics
  ▪ Knowledge representation
  ▪ Knowledge acquisition
  ▪ Problem-solving and reasoning
▪ Call for:
  ▪ Papers: Aug 13 (23:59 AoE), 2023
  ▪ Workshops/Tutorials: Jul 9, 2023
  ▪ Conference Dates: Dec 5-7, 2023 in Florida, USA.
▪ Steering Committee:
  ▪ Jose Manuel Gómez-Pérez
  ▪ Anna Lisa Gentile
  ▪ Ilaria Tiddi
  ▪ Krzysztof Janowicz
  ▪ Raphael Troncy
  ▪ Daniel Garijo
  ▪ Valentina Tamma
  ▪ Marieke van Erp
  ▪ Rafael Goncalves
  ▪ Oscar Corcho
▪ Organising Committee:
  ▪ K. Brent Venable
  ▪ Daniel Garijo
  ▪ Brian Jalaian
  ▪ Blerina Spahiu
  ▪ Niranjan Suri
  ▪ Marieke van Erp
  ▪ Carlos Badenes-Olmedo
  ▪ Alan Ordway
Cross-lingual Document Similarity
▪ Three challenges to perform large-scale retrieval of documents in multi-lingual corpora:
  ▪ C1: Content representation
  ▪ C2: High-dimensional correlation matrix
  ▪ C3: Multi-lingual comparison
[Figure: patents and PhD theses written in EN, ES and FR]
Cross-lingual Document Similarity
▪ Probabilistic Topic Models [Blei et al., 2003]
  ▪ Each topic is a distribution over words
  ▪ Each word is drawn from one of those topics
  ▪ Each document is a mixture of corpus-wide topics
  ▪ Fixed-length vector of topic distributions
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4–5), 993–1022.
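The generative story behind these bullets can be illustrated with a toy sampler. This is only a sketch: the topic names, vocabularies and probabilities below are invented for illustration, and no inference (the hard part of LDA) is performed.

```python
import random


def generate_document(topic_word, doc_topic, n_words, rng):
    """Sample a toy document following the LDA generative story."""
    topics = list(doc_topic)
    words = []
    for _ in range(n_words):
        # 1) pick a topic according to the document's topic mixture
        topic = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # 2) pick a word according to that topic's distribution over words
        vocab = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return words


# Hypothetical topics: each one is a distribution over words.
TOPIC_WORD = {
    "telecom": {"radio": 0.5, "network": 0.3, "spectrum": 0.2},
    "health": {"drug": 0.6, "disease": 0.4},
}
# Hypothetical document: a mixture of the corpus-wide topics.
DOC_TOPIC = {"telecom": 0.8, "health": 0.2}

doc = generate_document(TOPIC_WORD, DOC_TOPIC, 10, random.Random(0))
print(doc)
```

The vector of values in `DOC_TOPIC` is the fixed-length representation of the document that the later slides compare.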
Cross-lingual Document Similarity
Distance Metrics
▪ Similar documents do not necessarily share the most relevant topic for each of them.
[Figure: two pairs of topic distributions, a) simJS = 0.74, b) simJS = 0.71]
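A similarity based on the Jensen-Shannon divergence can be sketched as follows. The slide does not define simJS, so this assumes simJS = 1 - JSD with base-2 logarithms; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """Assumed definition: 1 minus the Jensen-Shannon divergence (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Two topic distributions whose most relevant topic differs (topic 0 vs topic 1)
# can still be fairly similar overall.
doc_a = [0.45, 0.40, 0.15]
doc_b = [0.40, 0.45, 0.15]
print(js_similarity(doc_a, doc_b))
```

Because the metric looks at the whole distribution, it needs all topic weights of both documents, which is what makes pairwise comparison expensive on large corpora.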
Cross-lingual Document Similarity
▪ Hashing Topic Distributions [Badenes-Olmedo et al., 2019]
  ▪ hierarchical set of topics based on their relevance
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
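One way to picture a hierarchical set of topics is to rank the topics of a document by weight and cut the ranking into levels. The sketch below is an assumption made for illustration (fixed-size levels, hypothetical `topic_hash` name); the cited paper may group topics differently.

```python
def topic_hash(distribution, levels=2, per_level=1):
    """Turn a topic distribution into a hierarchy of topic-id sets.

    Level 0 holds the most relevant topic(s), level 1 the next ones, and so on.
    Fixed-size levels are an illustrative simplification.
    """
    ranked = sorted(range(len(distribution)),
                    key=lambda t: distribution[t], reverse=True)
    return [frozenset(ranked[i * per_level:(i + 1) * per_level])
            for i in range(levels)]


print(topic_hash([0.10, 0.60, 0.30]))                # levels: {1}, then {2}
print(topic_hash([0.05, 0.50, 0.10, 0.35], 2, 2))    # levels: {1, 3}, then {0, 2}
```

Documents with equal sets at the first level land in the same bucket, so they can be compared without computing a density-based metric over every pair.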
Cross-lingual Document Similarity
▪ Computing similarities can be framed as an approximate nearest neighbour (ANN) search problem [Mao et al., 2017] based on topic clusters.
Cross-lingual Document Similarity
▪ Multi-Lingual Topic Models [Vulić et al., 2015]:
  ▪ language-specific features of topics
  ▪ requires:
    ▪ parallel corpus (sentence-aligned documents)
    ▪ or comparable corpus (theme-aligned documents)
Topic A in each language:
EN ‘communication system’: radio, equipment, network, communication, regulatory
ES ‘sistema de comunicación’: equipo, red, comunicación, espectro, electromagnético
FR ‘systeme de communication’: communications, reseaux, electroniques, acces, telecommunications
[Figure: topics A to E aligned across EN, ES and FR]
Cross-lingual Document Similarity
▪ Multi-Lingual Dictionaries [Hao and Paul, 2018]
  ▪ more widely available than parallel corpora (e.g. PANLEX or Wiktionary)
  ▪ models are built from words in a target language
  ▪ dictionaries as a supervised method to align topics
  ▪ topics conditioned by pre-established language relations
[Figure: topics A to E]
Cross-lingual Document Similarity
WordNet:
▪ It is a semantic network.
▪ Synonymous words are grouped into synsets.
▪ These synsets are then linked to other synsets via semantic relations
  ▪ e.g. hypernym or hyponym.
Bond, Francis, P. Vossen, John P. McCrae and Christiane D. Fellbaum. “CILI: the Collaborative Interlingual Index.” Global WordNet Conference (2016).
Cross-lingual Document Similarity
We propose an unsupervised algorithm that:
▪ is based on the Open Multilingual Wordnet (OMW) knowledge base (no translations required)
▪ creates language-specific concept hierarchies (no parallel or comparable corpora required)
▪ uses only the most relevant topics (no density-based distance metrics)
Example topics described by synsets:
EN, topic A: radio.n.01, equipment.n.01, network.n.02, net.n.06, communication.n.02
ES, topic G: kit.n.02, equipment.n.01, net.n.02, web.n.06, communication.n.02
FR, topic K: access.n.02, approach.n.07, entree.n.02, communication.n.02, bout.n.02
[Figure: EN topics A to E, ES topics G to J, FR topics K to M]
Badenes-Olmedo, Carlos, José Luis Redondo García and Óscar Corcho. “Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies.” Proceedings of the 10th International Conference on Knowledge Capture (2019)
Cross-lingual Document Similarity
▪ hierarchical set of topics from relevance
▪ nearest neighbour searches (k-d tree)
▪ Boolean similarity (Jaccard index)
[Figure: search over the 1st and 2nd hierarchy levels, with tests such as radio.n.01?, buy.v.01?, bout.n.02?, net.n.01?]
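The Boolean similarity can be sketched with a plain Jaccard index over sets of synsets. The three sets reuse the synset identifiers shown for topics A (EN), G (ES) and K (FR) in this talk; treating a topic as a flat set of synsets is a simplification made for this example.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


topic_en = {"radio.n.01", "equipment.n.01", "network.n.02",
            "net.n.06", "communication.n.02"}
topic_es = {"kit.n.02", "equipment.n.01", "net.n.02",
            "web.n.06", "communication.n.02"}
topic_fr = {"access.n.02", "approach.n.07", "entree.n.02",
            "communication.n.02", "bout.n.02"}

print(jaccard(topic_en, topic_es))  # 2 shared synsets out of 8 -> 0.25
print(jaccard(topic_en, topic_fr))  # 1 shared synset out of 9
```

Because synsets are language-independent identifiers, the comparison works across EN, ES and FR without translating any text.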
Cross-lingual Document Similarity
https://github.com/librairy/demo
Cross-lingual Document Similarity
▪ Document Classification Task
  ▪ Metrics: precision, recall and F-measure
  ▪ Data: ~1k docs (monolingual, bi-lingual or multilingual documents)
  ▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
    ▪ supervised = labeledLDA
    ▪ unsupervised = LDA
[Figure: bar charts of Precision and Recall (0 to 100) for en, es, fr, en-es, en-fr, es-fr and en-es-fr; supervised vs. unsupervised]
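For reference, the three metrics can be computed per category as in the sketch below. The document identifiers are invented, and comparing one model cluster against one EUROVOC category is an assumption about how the evaluation was set up.

```python
def precision_recall_f1(predicted, relevant):
    """Precision, recall and F-measure of a predicted set against a gold set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cluster produced by the model vs. documents in a gold category.
model_cluster = {"d1", "d2", "d3", "d4"}
gold_category = {"d1", "d2", "d5"}
print(precision_recall_f1(model_cluster, gold_category))
```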
Cross-lingual Document Similarity
▪ Document Retrieval Task
  ▪ Metrics: precision@3, precision@5 and precision@10
  ▪ Data: ~1k docs (monolingual, bi-lingual or multilingual documents)
  ▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
    ▪ supervised = labeledLDA
    ▪ unsupervised = LDA
[Figure: bar charts of Precision@3 and Precision@10 (0 to 100) for en, es, fr, en-es, en-fr, es-fr and en-es-fr; supervised vs. unsupervised]
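Precision@k measures how many of the first k retrieved documents are relevant. A minimal sketch, with invented document identifiers and the common convention of dividing by k even when fewer results are returned:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


# Hypothetical ranking returned for a query document, and its gold neighbours.
ranking = ["d3", "d1", "d7", "d2", "d9"]
gold = {"d1", "d2", "d3"}
print(precision_at_k(ranking, gold, 3))  # 2 of the top 3 are relevant
print(precision_at_k(ranking, gold, 5))  # 3 of the top 5 are relevant
```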
Multiple and Heterogeneous QA (MuHeQA)
▪ Objective: facilitate access to information from KGs and unstructured data.
Multiple and Heterogeneous QA (MuHeQA)
Badenes-Olmedo, C., & Corcho, O. (2023). MuHeQA: Zero-shot Question Answering over Multiple and Heterogeneous Knowledge Bases. Semantic Web Journal.
Multiple and Heterogeneous QA (MuHeQA)
https://github.com/librairy/muheqa
Multiple and Heterogeneous QA (MuHeQA)
▪ SPARQL queries used to extract the properties of a KG resource
[Figure: example queries for Wikidata and DBpedia]
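The queries on the slide are not recoverable from the transcript. As an illustration only, a generic query that lists the properties and values of a resource could be built as below; the function name and the example IRIs are assumptions, and the real MuHeQA queries may select labels or filter properties differently.

```python
def properties_query(resource_iri, limit=100):
    """Build a SPARQL query listing the properties and values of one resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_iri}> ?property ?value . "
        f"}} LIMIT {limit}"
    )


# The same pattern works for both KGs; only the resource IRI changes.
wikidata_query = properties_query("http://www.wikidata.org/entity/Q42")
dbpedia_query = properties_query("http://dbpedia.org/resource/Madrid", limit=50)
print(wikidata_query)
print(dbpedia_query)
```

The resulting strings would then be sent to the public endpoint of each KG with any HTTP or SPARQL client.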
Multiple and Heterogeneous QA (MuHeQA)
▪ Performance identifying keywords in a question:
• Our method identifies the entities mentioned in a question along with the relevant terms discovered using PoS annotations:
Multiple and Heterogeneous QA (MuHeQA)
▪ Performance when discovering Wikidata or DBpedia resources:
[Results shown for Wikidata and DBpedia]
• We discard the creation of vector spaces where each resource is represented by its labels [7], since one of our assumptions is to avoid the creation of supervised models that perform specific classification tasks over the KG (i.e. prior training).
• Our proposal does not require training datasets, since it performs textual searches based on the terms identified in the query using an inverted index of the labels associated with the resources.
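The idea of searching resources through an inverted index of their labels, with no training step, can be sketched as follows. The resource identifiers, labels and the simple term-overlap ranking are hypothetical choices for this example, not the system's actual scoring.

```python
from collections import defaultdict


def build_inverted_index(labels):
    """Map each lower-cased label token to the resources whose labels contain it."""
    index = defaultdict(set)
    for resource, texts in labels.items():
        for text in texts:
            for token in text.lower().split():
                index[token].add(resource)
    return index


def search(index, terms):
    """Rank resources by how many query terms appear in their labels."""
    scores = defaultdict(int)
    for term in terms:
        for resource in index.get(term.lower(), ()):
            scores[resource] += 1
    return sorted(scores, key=lambda r: (-scores[r], r))


# Hypothetical resources and labels.
LABELS = {
    "ex:madrid": ["Madrid", "City of Madrid"],
    "ex:spain": ["Spain", "Kingdom of Spain"],
}
index = build_inverted_index(LABELS)
print(search(index, ["capital", "Spain"]))    # only ex:spain matches
print(search(index, ["city", "of", "Madrid"]))
```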
Multiple and Heterogeneous QA (MuHeQA)
▪ Performance based on Knowledge Graph-oriented QA:
• The results show that our approach offers a performance close to the best system, STaF-QA, and better than other approaches specific to KGQA.
• However, one of the weak points is recall, which means that our approach has to improve in response elaboration. The answers are perhaps too straightforward, and we should work on constructing more complex responses.
Multiple and Heterogeneous QA (MuHeQA)
▪ Performance based on Document-oriented QA:
• The answers created by our algorithm are not as elaborate as those in the evaluation dataset, which were created manually, and this penalises the performance of our system.
• For example, given the question "How many children were infected by HIV-1 in 2008-2009, worldwide?", the answer inferred by our system is "more than 400,000", while the correct answer is "more than 400,000 children were infected worldwide, mostly through MTCT and 90% of them lived in sub-Saharan Africa"
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ Release of scientific documents on coronaviruses (useful in document retrieval, IE, and knowledge management tasks)
▪ First goal: make the scientific literature around coronaviruses useful for some of the immediate needs of hospital pharmacies (e.g. drug shortages, or interactions between chemical substances)
▪ Afterwards: provide an up-to-date knowledge base on coronaviruses extracted from scientific publications
Badenes-Olmedo, Carlos and Óscar Corcho. “Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature.” Journal of Biomedical Informatics 142 (2023): 104382.
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ RQ1: How to systematize the processing of scientific corpora to build knowledge graphs?
▪ RQ2: How to identify and standardize drugs, diseases and genes/proteins mentioned in scientific texts?
▪ RQ3: How to formally describe evidence based on the association of drugs, diseases, genes and proteins mentioned in the same paragraph of a scientific article?
▪ RQ4: How to relate drugs and diseases from the paragraphs where they are mentioned?
▪ RQ5: How to provide access to a knowledge graph, together with the content of a collection of scientific publications, through natural language queries?
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ There is no common methodology for the construction of knowledge graphs from the biomedical literature, but rather a series of steps or stages that coincide among existing works.
▪ We propose a workflow that is also valid when information update cycles are short
  o E.g. one-week update cycles
▪ This workflow addresses the research question RQ1
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
• Fine-tuned the BioBERT [Lee et al., 2020] model to identify the following biomedical classes:
  - Diseases: BC5CDR-Diseases and NCBI-Diseases
  - Chemicals: BC4CHEMD and BC5CDR-Drugs
  - Genetics: JNLPBA and BC2GM
• Unique representation of each concept from a set of related terms composed using multiple information sources
  - 7.4 million annotations in JSON format
• Multiple sources were taken into account to create a database for each of the biomedical entities
• With the creation of these models we address the research question RQ2
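The step that maps a set of related terms to a unique concept representation can be pictured as a dictionary lookup. This is a minimal sketch: the concept identifiers and the term lists are invented, and the actual Drugs4Covid databases are built from multiple external sources.

```python
def build_lookup(concepts):
    """Index every known term (case-insensitively) by its concept identifier."""
    lookup = {}
    for concept_id, terms in concepts.items():
        for term in terms:
            lookup[term.casefold()] = concept_id
    return lookup


def normalize(mentions, lookup):
    """Map each recognised mention to its concept id, or None if unknown."""
    return [lookup.get(mention.casefold()) for mention in mentions]


# Hypothetical concept database: one identifier per set of related terms.
CONCEPTS = {
    "drug:hydroxychloroquine": ["hydroxychloroquine", "HCQ", "Plaquenil"],
    "disease:covid-19": ["COVID-19", "coronavirus disease 2019"],
}
lookup = build_lookup(CONCEPTS)
print(normalize(["hcq", "Covid-19", "aspirin"], lookup))
```

Mentions found by the NER models would be passed through such a lookup so that different surface forms end up as the same node in the graph.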
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
• First, identify the requirements to express biomedical concepts and the associations between them.
  - Unified Medical Language System (UMLS)
  - Several efforts have been made to integrate biomedical knowledge into a single shared representation space (e.g. DISNET platform)
• Then, create an ontology that describes:
  - (i) biomedical concepts and associations between them and
  - (ii) the evidence supporting these associations.
• Challenges:
  - Biomedical taxonomies and vocabularies with reduced semantics (e.g. SNOMED, ICD-10, UMLS...)
  - How are the concepts related?
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
• We created the Evidences for BiOmedical Concepts Association (EBOCA) ontology to describe:
  - biomedical concepts
  - associations between them
  - evidence supporting these associations.
• It defines the conceptual model on which the Drugs4Covid knowledge graph is built
• It is composed of two modules: EBOCA SEM-DISNET, oriented toward describing biomedical concepts and associations, and EBOCA Evidences, focused on representing evidence of these associations with metadata and provenance information
https://w3id.org/eboca/portal
Perez, Andrea Alvarez, Ana Iglesias-Molina, Lucia Prieto Santamaria, Maria Poveda-Villalon, Carlos Badenes-Olmedo and Alejandro Rodriguez-Gonzalez. “EBOCA: Evidences for BiOmedical Concepts Association Ontology.” International Conference Knowledge Engineering and Knowledge Management (2022).
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
EBOCA SEM-DISNET module:
• Designed to represent associations of common biomedical concepts, such as diseases, phenotypes, genes, genetic variants, biological pathways, drugs, proteins, and targets.
• Associations link pairs of concepts, for example the gene-disease or drug-disease association
• Adds semantics to the DISNET structure
  - phenotypic layer
  - biological layer
  - pharmacological layer
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
EBOCA EVIDENCES module:
• Extends the associations between biomedical concepts of the SEM-DISNET module with metadata and provenance information
• These evidences of associations may come from known curated sources, or may be drawn or inferred directly from the texts.
• Describes in more detail the type of evidence supported by the association, and the agents involved in its extraction and publication
• The EBOCA ontology addresses the research question RQ3.
• The evaluation of an ontology usually seeks to identify inconsistencies or formal errors that invalidate its definition. However, since this work is oriented to the creation of a knowledge graph from a collection of scientific articles, it is more interesting to focus on an evaluation that measures the coverage of the ontology against a set of competency questions
  » 15 questions associated with the EBOCA SEM-DISNET module
  » and 10 questions associated with the EBOCA Evidences module
Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
▪ Objective: annotate the entities, their relationships and the evidence according to the EBOCA ontology
▪ Methodology:
  - The mapping rules between the annotations and the ontology resources were created with Mapeathor and executed with the Morph-KGC library.
  - The articles were then described according to the ontology in an RDF file.
  - Finally, GraphDB was chosen to store the RDF content and Helio to provide a SPARQL access interface
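To make the annotation-to-RDF step concrete, the sketch below serialises one paragraph-level annotation as N-Triples using only the standard library. It does not use Mapeathor or Morph-KGC, and the `http://example.org/` namespace and property names are placeholders, not EBOCA terms.

```python
EX = "http://example.org/"  # hypothetical namespace, NOT the EBOCA ontology


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'"{escaped}"'


def annotation_to_ntriples(ann):
    """Describe a drug-disease association and its supporting paragraph."""
    assoc = f"<{EX}association/{ann['id']}>"
    triples = [
        (assoc, f"<{EX}drug>", f"<{EX}drug/{ann['drug']}>"),
        (assoc, f"<{EX}disease>", f"<{EX}disease/{ann['disease']}>"),
        (assoc, f"<{EX}evidenceText>", literal(ann["paragraph"])),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Hypothetical annotation extracted from one paragraph of an article.
annotation = {
    "id": "a1",
    "drug": "remdesivir",
    "disease": "covid-19",
    "paragraph": 'Patients received "remdesivir" for ten days.',
}
lines = annotation_to_ntriples(annotation)
print("\n".join(lines))
```

In the workflow described above, declarative mapping rules play the role of `annotation_to_ntriples`, so the same rules can be re-run on each update cycle.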
▪ Objective: facilitate access to the resources created in this work, especially the knowledge graph.
▪ Methodology:
  o There are multiple ways to exploit a KG: predefined SPARQL queries, guided document searches, etc.
  o Our proposal is a question-answering (QA) system based on natural language, MuHeQA
    » combines Extractive QA and NLP with KGs
    » Summarization, Evidence Extraction and Answer Generation
▪ https://drugs4covid.oeg.fi.upm.es/services/bio-qa
Multi-Dimensional Fake News Identification
“...In this context the specialist said that what we eat is the main cause of cancer risk, with a *NUMBER* percent; in front of tobacco consumption, responsible in a *NUMBER* percent; and infections, with *NUMBER* percent”
▪ Multi-Dimensional Fake News Identification:
[Figure: dimensions: Knowledge Graph, 5Ws evidences, Dictionary, Social Network impact]
Inconsistency Detection
▪ Inconsistency Detection:
[Figure: social media, political parties, specific issues, over time]
Guillen-Pancho, Ibai, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)
Conversational-assisted Topic Labeling
▪ Conversational-assisted Topic Labeling:
[Figure: pipeline stages: Words Selection, Question Composition, Question Answering, Label Retrieval]
Ramón-Ferrer, Virginia, Badenes-Olmedo, Carlos and Óscar Corcho. “Automatic Topic Label Generation Using Conversational Models.” (2023)
Multi-hop KGQA
▪ Multi-hop KGQA:
Liu-Chen, Teng, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)
Min, Sewon, Victor Zhong, Luke Zettlemoyer and Hannaneh Hajishirzi. “Multi-hop Reading Comprehension through Question Decomposition and Rescoring.” ArXiv abs/1906.02916 (2019)
Additional Content
▪ Ontology Engineering Framework
▪ Knowledge Graph Tools
▪ NLP Resources
▪ Open Science
[Figure: Ontologies, Knowledge Graphs, NLP, Open Science]
OEG Ontology Engineering Framework
LOT industrial methodology
http://lot.linkeddata.es
20+ projects
More details at: Poveda-Villalón, M., Fernández-Izquierdo, A., Fernández-López, M., & García-Castro, R. (2022). LOT: An industrial oriented ontology engineering framework. Engineering Applications of Artificial Intelligence, 111, 104755. https://doi.org/10.1016/j.engappai.2022.104755
OEG Ontology Engineering Framework
LOT adoption
▪ 20+ projects (internal and external)
▪ https://lot.linkeddata.es/#stories (selected examples)
OEG Ontology Engineering Framework
Technology landscape
• Openly available corpus: https://coralcorpus.linkeddata.es/
• 834 ontological requirements annotated
• 29 lexico-syntactic patterns
• HTML, CSV and RDF
• "Creative Commons Attribution 4.0 International" license
OEG Ontology Engineering Framework
Technology landscape
• https://chowlk.linkeddata.es
• Notation and converter
• Hosted by OEG
• Developed by OEG
OEG Ontology Engineering Framework
Technology landscape
• https://lov.linkeddata.es
• Vocabulary registry and index
• Hosted by OEG
• Not developed by OEG
OEG Ontology Engineering Framework
Technology landscape
• http://oops.linkeddata.es/
• Check for pitfalls (41 defined, 31 automated)
• Online app
• Web service
• http://themis.linkeddata.es
• Validation based on ontology unit tests
• Online app
• Web service
OEG Ontology Engineering Framework
Technology landscape
• https://github.com/dgarijo/Widoco/
• Ontology documentation
• Desktop app (maintained by Daniel Garijo, ISI, California)
• Web service (OEG)
OEG Ontology Engineering Framework
Technology landscape
• https://github.com/oeg-upm/vocab.linkeddata.es
• Ontology portal generation
• jar distribution
OEG Ontology Engineering Framework
Handle versions and distributed environments
Evaluation reports
HTML documentation
Diagrams
Permanent IDs
Content negotiation
Bundle
Preview
http://ontoology.linkeddata.es
OEG Ontology Engineering Framework
Technology landscape
• https://morph.oeg-upm.net/
• OBDA (Ontology-Based Data Access)
• Morph-RDB
• Morph-GraphQL
• Morph-CSV
OEG Ontology Engineering Framework
Technology landscape
• https://helio.linkeddata.es/
• RDF from heterogeneous data sources
• Java? Web service? API?
• Synchronous or asynchronous integration of heterogeneous data sources
• Data quality, cleaning and linking functions
• Linked Data service for publishing data, or Maven dependency
OEG Ontology Engineering Framework
Technology landscape
• https://astrea.linkeddata.es
• Generation of SHACL shapes from ontologies
• Validation of data using SHACL shapes
• Online service
librAIry
http://librairy.eu
@librairy_eu
https://github.com/librairy
Drugs4Covid
https://drugs4covid.oeg.fi.upm.es/
https://github.com/drugs4covid
Añotador
https://annotador.oeg.fi.upm.es/
https://github.com/mnavasloro/Annotador
TermitUp
https://termitup.oeg.fi.upm.es/
KeyQ
https://aiproc.linkeddata.es/
“NLP and Knowledge Graphs“- carlos.badenes@upm.es 81
Methods for Knowledge-Based Systems and Deep Learning Integration
Semantic-based Initialization for OOKB entities
[Figure: three-phase pipeline. Phase 1: entity embedding and ontology embedding (inputs: KG dumps, endpoint, KG ontology, entity ontological information). Phase 2: embedding composition over the ontological information embeddings. Phase 3: initial embedding. Example class hierarchies: Thing, Place, Country; Thing, Person.]
Amador-Domínguez, E., Serrano, E., Manrique, D., Hohenecker, P., and Lukasiewicz, T. (2021). An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities. Information Sciences, 564, 85–102.
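The slide only names the phases, so the following is a loose illustration of the "embedding composition" idea rather than the method of Amador-Domínguez et al. (2021). It builds an initial vector for an out-of-knowledge-base (OOKB) entity by averaging the embeddings of the classes on its hierarchy path. The averaging operator, the toy vectors, and the names compose, class_embeddings and initial_embedding are all assumptions.

```python
def compose(vectors):
    """Element-wise mean of equally sized vectors (assumed composition operator)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


# Hypothetical ontology (class) embeddings, e.g. learned in an earlier phase.
class_embeddings = {
    "Thing": [0.0, 0.0, 1.0],
    "Place": [1.0, 0.0, 1.0],
    "Country": [1.0, 1.0, 1.0],
    "Person": [0.0, 1.0, 0.0],
}


def initial_embedding(class_path):
    """Initial vector for an unseen entity, given the classes it belongs to."""
    return compose([class_embeddings[name] for name in class_path])


# An unseen entity typed as Country inherits information from Place and Thing.
new_country = initial_embedding(["Thing", "Place", "Country"])
new_person = initial_embedding(["Thing", "Person"])
print(new_country, new_person)
```

The point of such an initialisation is that an entity never seen during training starts close to entities of the same type instead of at a random position.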
Additional Content
▪ Ontology Engineering Framework
▪ Knowledge Graph Tools
▪ NLP Resources
▪ Open Science
Open Science: Creating KGs of Research Software Metadata
▪ Readme Analysis
o Supervised classification
o Regular expressions
o Header analysis
▪ File exploration
o Notebooks
o Dockerfiles
o Documentation
▪ GitHub API
Repository Extraction Results (Metadata)
https://github.com/KnowledgeCaptureAndDiscovery/somef/
Kelley, A., & Garijo, D. (2021). A framework for creating knowledge graphs of scientific software metadata. Quantitative Science Studies, 1-37.
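Two of the README analysis techniques named on this slide, header analysis and regular expressions, can be sketched in a few lines. This is not SOMEF's implementation, and it leaves out supervised classification, file exploration and the GitHub API. The keyword table, the function name analyse_readme, the sample README and its DOI are made up for illustration.

```python
import re

# Hypothetical mapping from header keywords to metadata categories.
HEADER_CATEGORIES = {
    "install": "installation",
    "usage": "usage",
    "cite": "citation",
    "citation": "citation",
    "license": "license",
}

# Markdown ATX headers ("# Title" to "###### Title") and DOI-like strings.
HEADER_RE = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$", re.MULTILINE)
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>)]+')


def analyse_readme(text: str) -> dict:
    """Return metadata found through header analysis and regular expressions."""
    metadata = {}
    headers = list(HEADER_RE.finditer(text))
    for i, match in enumerate(headers):
        title = match.group(2).lower()
        # A section body runs until the next header or the end of the file.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end].strip()
        for keyword, category in HEADER_CATEGORIES.items():
            if keyword in title and body:
                metadata.setdefault(category, body)
    dois = DOI_RE.findall(text)
    if dois:
        metadata["doi"] = dois
    return metadata


readme = """# demo-tool
A small example project.

## Installation
pip install demo-tool

## How to cite
Please cite https://doi.org/10.1234/demo.5678

## License
Apache-2.0
"""

metadata = analyse_readme(readme)
for category, value in metadata.items():
    print(category, "->", value)
```

Header analysis only works when authors use recognisable section titles, which is one reason to combine it with classifiers trained on README excerpts, as the slide lists.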
Open Science: Tracking FAIR principles in Research Software
- Continuous updates to Research Software catalogs of tools
- Automated feedback on compliance with best practices, based on software metadata extraction
- Linking research articles with their corresponding software tools
https://software.oeg.fi.upm.es/

NLP and Knowledge Graphs

  • 1. NLP and Knowledge Graphs Carlos Badenes-Olmedo Ontology Engineering Group (OEG) Universidad PolitĂ©cnica de Madrid (UPM) 2023/06/15 carlos.badenes@upm.es @carbadol
  • 2. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Outline 2 â–Ș Context (~10min) â–Ș Ontology Engineering Group â–Ș Personal Background â–Ș Relevant Outcomes (~30min) â–Ș Cross-lingual Document Similarity (librAIry) â–Ș Multiple and Heterogenous QA (MuHeQA) â–Ș Knowledge graph-driven Clinical Document Exploration (Drugs4Covid) â–Ș Ongoing Research (~5min) â–Ș Multi-Dimensional Fake News Identification** â–Ș Inconsistency Detection â–Ș Conversational-assisted Topic Labelling â–Ș Multi-hop KGQA
  • 3. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Ontology Engineering Group 3 â–Ș Directors: AsunciĂłn GĂłmez-PĂ©rez, Oscar Corcho â–Ș Position: 8Âș ranking UPM (200 groups) â–Ș Research group (~30 people) â–Ș 170+ Collaborations â–Ș 50+ Visitors https://oeg.fi.upm.es https://github.com/oeg-upm @oeg-upm
  • 4. “NLP and Knowledge Graphs“- carlos.badenes@upm.es 4 Ontology Engineering Group Ontologies LOT: Industrial Methodology
  • 5. “NLP and Knowledge Graphs“- carlos.badenes@upm.es 5 Ontology Engineering Group Ontologies Knowledge Graphs â–Ș Sync. or Async. integration of heterogeneous data sources â–Ș Data quality, cleaning and linking functions â–Ș Linked Data Service publishing data or maven dependency
  • 6. “NLP and Knowledge Graphs“- carlos.badenes@upm.es 6 Ontologies Knowledge Graphs â–Ș Linguistic Linked Open Data â–Ș Word Sense Disambiguation â–Ș Named Entity Recognition â–Ș Question - Answering NLP Ontology Engineering Group Information Extraction Knowledge-driven Exploration Entity Linking â–Ș Probabilistic Topic Models â–Ș Taxonomies from corpora â–Ș Large-scale Searching â–Ș Classification of Out-of-Knowledge-based Entities KeyQ
  • 7. “NLP and Knowledge Graphs“- carlos.badenes@upm.es 7 Ontologies Knowledge Graphs â–Ș Creating KGs of Research Software Metadata â–Ș Tracking FAIR principles in Research Software NLP Ontology Engineering Group Open Science
  • 8. “NLP and Knowledge Graphs“- carlos.badenes@upm.es 8 2016 2021 2020 Probabilistic Topic Models Clinical Knowledge Graphs Hybrid QA 2022 Fake News Detection Industry Academy Academy Academy Academy Personal Background http://librairy.eu
  • 9. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Call for Papers 9 â–Ș https://kgsum.github.io â–Ș Topics â–Ș Methods to summarize KGs â–Ș KGs features related to summaries â–Ș Scope and Impact of KG summaries â–Ș Call for Papers: â–Ș Paper Submission: Jul 7 (23:59 AoE), 2023 â–Ș Notification to Authors: Jul 24, 2023 â–Ș Workshop Dates: Nov 6-7, 2023 at Athens, Greece.
  • 10. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Call for Workshop and Tutorials 10 â–Ș https://www.k-cap.org/2023 â–Ș Topics â–Ș Knowledge representation â–Ș Knowledge acquisition â–Ș Problem-solving and reasoning 
 â–Ș Call for: â–Ș Papers: Aug 13 (23:59 AoE), 2023 â–Ș Workshop/Tutorials: Jul 9, 2023 â–Ș Conference Dates: Dec 5-7, 2023 at Florida, USA. â–Ș Steering Commitee: â–Ș Jose Manuel GĂłmez-PĂ©rez â–Ș Anna Lisa Gentile â–Ș Ilaria Tiddi â–Ș Krzysztof Janowicz â–Ș Raphael Troncy â–Ș Daniel Garijo â–Ș Valentia Tamma â–Ș Marieke Van Erp â–Ș Rafael Goncalves â–Ș Oscar Corcho â–Ș Organising Commitee: â–Ș K. Brent Venable â–Ș Daniel Garijo â–Ș Brian Jalaian â–Ș Blerina Spahiu â–Ș Niranjan Suri â–Ș Marieke van Erp â–Ș Carlos Badenes-Olmedo â–Ș Alan Ordway
  • 11. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Outline 11 â–Ș Context (~10min) â–Ș Ontology Engineering Group â–Ș Personal Background â–Ș Relevant Outcomes (~30min) â–Ș Cross-lingual Document Similarity (librAIry) â–Ș Multiple and Heterogenous QA (MuHeQA) â–Ș Knowledge graph-driven Clinical Document Exploration (Drugs4Covid) â–Ș Ongoing Research (~5min) â–Ș Multi-Dimensional Fake News Identification** â–Ș Inconsistency Detection â–Ș Conversational-assisted Topic Labelling â–Ș Multi-hop KGQA
  • 12. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 12
  • 13. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 13 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison ? EN Patents PhD Thesis ES FR
  • 14. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 14 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4–5), 993–1022. â–Ș Probabilistic Topic Models [Blei et al, 2003] â–Ș Each topic is a distribution over words â–Ș Each word is drawn from one of those topics â–Ș Each document is a mixture of corpus- wide topics â–Ș Fixed Vector of topic distributions
  • 15. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 15 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison â–Ș similar documents do not necessarily share the most relevant topic for each of them. a) simJS = 0.74 b) simJS = 0.71 Distance Metrics
  • 16. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 16 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison â–Ș Hashing Topic Distributions [Badenes-Olmedo et al, 2019] â–Ș hierarchical set of topics based on their relevance Badenes-Omedo, C., Redondo-GarcĂ­a, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
  • 17. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 17 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison â–Ș Computation can be an approximate nearest neighbour (ANN) search problem [Mao et al, 2017] based on topic clusters. Badenes-Omedo, C., Redondo-GarcĂ­a, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
  • 18. “NLP and Knowledge Graphs“- carlos.badenes@upm.es â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison Cross-lingual Document Similarity 18 â–Ș Multi-Lingual Topic Models [Viulic et al. 2015]: â–Ș language-specific features of topics â–Ș requires: â–Ș parallel corpus (sentence-aligned documents) â–Ș or comparable corpus (theme-aligned documents) A ‘communication system’ A ‘sistema de comunicaciĂłn’ A ‘systeme de communication’ radio equipo communications equipment red reseaux network comunicaciĂłn electroniques communication espectro acces regulatory electromagnĂ©tico telecommunications EN ES FR B C D E A B C D E A B C D E A
  • 19. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 19 â–Ș Multi-Lingual Dictionaries [Hao and Paul, 2018] â–Ș more widely available than parallel corpora (e.g PANLEX or Wiktionary) â–Ș models are built from words in a target language â–Ș dictionaries as supervised method to align topics â–Ș topics conditioned by pre-established language relations A B C D E â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison
  • 20. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 20 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison Wordnet: â–Ș It is a semantic network. â–Ș Synonymous words are grouped into synsets. â–Ș These synsets are then linked to other synsets via semantic relations â–Ș e.g. hypernym or hyponym. Bond, Francis, P. Vossen, John P. McCrae and Christiane D. Fellbaum. “CILI: the Collaborative Interlingual Index.” Global WordNet Conference (2016).
  • 21. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 21 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison We propose an unsupervised algorithm: â–Ș based on the Open Multilingual Wordnet (OMW) Knowledge Base (no translations required) â–Ș that creates language-specific concept hierarchies (no parallel or comparable corpora required) â–Ș uses only the most relevant topics (no density-based distance metrics) A G K radio.n.01 kit.n.02 access.n.02 equipment.n.01 equipment.n.01 approach.n.07 network.n.02 net.n.02 entree.n.02 net.n.06 web.n.06 communication.n.02 communication.n.02 communication.n.02 bout.n.02 EN ES FR A B C D E G H I J K L M Badenes-Olmedo, Carlos, JosĂ© Luis Redondo GarcĂ­a and Óscar Corcho. “Scalable Cross-lingual Document Similarity through Language-speci fi c Concept Hierarchies.” Proceedings of the 10th International Conference on Knowledge Capture (2019)
  • 22. “NLP and Knowledge Graphs“- carlos.badenes@upm.es Cross-lingual Document Similarity 22 â–ȘThree challenges to perform large-scale retrieval of documents in multi-lingual corpora: â–Ș C1: Content representation â–Ș C2: High-dimensional correlation matrix â–Ș C3: Multi-lingual Comparison â–Ș hierarchical-set of topics from relevance â–Ș nearest neighbour searches (k-d tree) â–Ș Boolean Similarity (Jaccard Index) radio.n.01? buy.v.01? bout.n.02? net.n.01? 1st hierarchy level 2nd hierarchy level
  • 23. Cross-lingual Document Similarity
    https://github.com/librairy/demo
  • 24. Cross-lingual Document Similarity
    ▪ Document Classification Task
      ▪ Metrics: precision, recall and F-measure
      ▪ Data: ~1k docs (monolingual, bilingual or multilingual documents)
      ▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
        ▪ supervised = labeledLDA
        ▪ unsupervised = LDA
    [Figure: bar charts of precision and recall (0 to 100) for en, es, fr, en-es, en-fr, es-fr and en-es-fr, supervised vs. unsupervised]
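    As a reminder of how the three metrics relate, here is a small sketch that compares a model-produced cluster with a reference cluster. The document identifiers are made up, and treating each cluster as a plain set is an assumption about the evaluation setup, not a description of the actual experiment.

```python
def precision_recall_f1(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall and F-measure (harmonic mean of the two)."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical clusters: documents grouped by the model vs. by a shared EUROVOC category.
model_cluster = {"d1", "d2", "d3", "d4"}
eurovoc_cluster = {"d2", "d3", "d5"}

p, r, f = precision_recall_f1(model_cluster, eurovoc_cluster)
print(p, r, f)
```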
  • 25. Cross-lingual Document Similarity
    ▪ Document Retrieval Task
      ▪ Metrics: precision@3, precision@5 and precision@10
      ▪ Data: ~1k docs (monolingual, bilingual or multilingual documents)
      ▪ Methodology: comparison of clusters based on EUROVOC categories and annotations created by the model:
        ▪ supervised = labeledLDA
        ▪ unsupervised = LDA
    [Figure: bar charts of precision@3 and precision@10 (0 to 100) for en, es, fr, en-es, en-fr, es-fr and en-es-fr, supervised vs. unsupervised]
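    Precision@k, the metric used in the retrieval task, only looks at the first k results of a ranking. The sketch below uses a made-up ranking and relevance set; the function name is illustrative.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


# Hypothetical ranking returned for one query document, and its relevant documents.
ranking = ["d7", "d2", "d9", "d4", "d1"]
relevant_docs = {"d2", "d4", "d8"}

p_at_3 = precision_at_k(ranking, relevant_docs, 3)
p_at_5 = precision_at_k(ranking, relevant_docs, 5)
print(p_at_3, p_at_5)
```

    Dividing by k even when fewer than k documents are returned is one common convention; other variants divide by the number of documents actually retrieved.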
  • 26. Outline
    ▪ Context (~10min)
      ▪ Ontology Engineering Group
      ▪ Background
    ▪ Relevant Outcomes (~30min)
      ▪ Cross-lingual Document Similarity (librAIry)
      ▪ Multiple and Heterogeneous QA (MuHeQA)
      ▪ Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ Ongoing Research (~5min)
      ▪ Multi-Dimensional Fake News Identification**
      ▪ Inconsistency Detection
      ▪ Conversational-assisted Topic Labelling
      ▪ Multi-hop KGQA
  • 27. Multiple and Heterogeneous QA (MuHeQA)
    ▪ Objective: facilitate access to information from KGs and unstructured data.
  • 28. Multiple and Heterogeneous QA (MuHeQA)
    Badenes-Olmedo, C. and Corcho, O. (2023). MuHeQA: Zero-shot Question Answering over Multiple and Heterogeneous Knowledge Bases. Semantic Web Journal.
  • 29. Multiple and Heterogeneous QA (MuHeQA)
    https://github.com/librairy/muheqa
  • 30. Multiple and Heterogeneous QA (MuHeQA)
    ▪ SPARQL queries used to extract the properties of a KG resource (one for Wikidata, one for DBpedia)
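    The queries themselves are not part of this transcript. As a rough illustration only, the sketch below builds a generic SPARQL query that lists every property and value of one resource. The helper name `properties_query`, the `LIMIT` default and the two example IRIs are assumptions; the queries used by MuHeQA may filter by language, property type or label.

```python
def properties_query(resource_iri: str, limit: int = 100) -> str:
    """Build a SPARQL query listing the property/value pairs of a single resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_iri}> ?property ?value . "
        "} "
        f"LIMIT {limit}"
    )


# Example resources chosen for illustration (the same entity in both graphs).
wikidata_query = properties_query("http://www.wikidata.org/entity/Q42")
dbpedia_query = properties_query("http://dbpedia.org/resource/Douglas_Adams")

print(wikidata_query)
print(dbpedia_query)
```

    The same query shape works for both knowledge graphs because it relies only on the triple pattern, not on graph-specific vocabularies.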
  • 31. Multiple and Heterogeneous QA (MuHeQA)
    ▪ Performance identifying keywords in a question:
      • Our method identifies the entities mentioned in a question along with the relevant terms discovered using PoS annotations.
  • 32. Multiple and Heterogeneous QA (MuHeQA)
    ▪ Performance when discovering Wikidata or DBpedia resources:
      • We discard the creation of vector spaces where each resource is represented by its labels [7], since one of our assumptions is to avoid supervised models that perform specific classification tasks over the KG (i.e. prior training).
      • Our proposal does not require training datasets, since it performs textual searches based on the terms identified in the query, using an inverted index of the labels associated with the resources.
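    The idea of searching resources through an inverted index of their labels can be sketched in a few lines. This is a toy, in-memory version under simplifying assumptions (whitespace tokenisation, ranking by number of matched tokens); the resource identifiers and labels are made up, and the real system may rely on a full-text search engine instead.

```python
from collections import Counter, defaultdict


def build_index(labels: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each lower-cased label token to the resources whose labels contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for resource, names in labels.items():
        for name in names:
            for token in name.lower().split():
                index[token].add(resource)
    return index


def search(index: dict[str, set[str]], terms: list[str]) -> list[tuple[str, int]]:
    """Rank resources by how many query tokens appear in their labels."""
    hits: Counter[str] = Counter()
    for term in terms:
        for token in term.lower().split():
            for resource in index.get(token, ()):
                hits[resource] += 1
    return hits.most_common()


# Hypothetical resources and labels, for illustration only.
toy_labels = {
    "ex:Madrid": ["Madrid", "City of Madrid"],
    "ex:RealMadrid": ["Real Madrid", "Real Madrid CF"],
    "ex:Paris": ["Paris"],
}
label_index = build_index(toy_labels)
results = search(label_index, ["Real", "Madrid"])
print(results)
```

    No training data is involved: adding a new resource only requires indexing its labels.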
  • 33. Multiple and Heterogeneous QA (MuHeQA)
    ▪ Performance based on Knowledge Graph-oriented QA:
      • The results show that our approach performs close to the best system, STaF-QA, and better than other approaches specific to KGQA.
      • However, one of the weak points is recall, which means that our approach has to improve in response elaboration. The answers are perhaps too straightforward, and we should work on constructing more complex responses.
  • 34. Multiple and Heterogeneous QA (MuHeQA)
    ▪ Performance based on Document-oriented QA:
      • The answers created by our algorithm are not as elaborate as those in the evaluation dataset, which were created manually, and this penalises the performance of our system.
      • For example, given the question "How many children were infected by HIV-1 in 2008-2009, worldwide?", the answer inferred by our system is "more than 400,000", while the correct answer is "more than 400,000 children were infected worldwide, mostly through MTCT and 90% of them lived in sub-Saharan Africa".
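    To see how a short but correct answer gets penalised, the example above can be scored with a token-overlap F1. This is an assumption about the kind of metric involved (it is a common choice for extractive QA), not necessarily the one used in the evaluation, and the whitespace tokenisation is deliberately naive.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


system_answer = "more than 400,000"
gold_answer = (
    "more than 400,000 children were infected worldwide, mostly through MTCT "
    "and 90% of them lived in sub-Saharan Africa"
)

# All 3 predicted tokens are correct (precision 1.0), but they cover only
# 3 of the 18 reference tokens (recall 3/18), so the F1 is 2/7.
score = token_f1(system_answer, gold_answer)
print(round(score, 3))
```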
  • 35. Outline
    ▪ Context (~10min)
      ▪ Ontology Engineering Group
      ▪ Academic and Industrial Background
      ▪ Personal Motivation
    ▪ Relevant Outcomes (~30min)
      ▪ Cross-lingual Document Similarity (librAIry)
      ▪ Multiple and Heterogeneous QA (MuHeQA)
      ▪ Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ Experiments in progress (~5min)
      ▪ Multi-Dimensional Fake News Identification**
      ▪ Inconsistency Detection
      ▪ Conversational-assisted Topic Labelling
      ▪ Multi-hop KGQA
  • 36. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ Release of scientific documents on coronaviruses (useful in document retrieval, IE and knowledge management tasks)
    ▪ First goal: make the scientific literature around coronaviruses useful for some of the immediate needs of hospital pharmacies (e.g. drug shortages, or interactions between chemical substances)
    ▪ After: provide an up-to-date knowledge base on coronaviruses extracted from scientific publications
    Badenes-Olmedo, Carlos and Óscar Corcho. “Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature.” Journal of Biomedical Informatics 142 (2023): 104382.
  • 37. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ RQ1: How to systematize the processing of scientific corpora to build knowledge graphs?
    ▪ RQ2: How to identify and standardize drugs, diseases and genes/proteins mentioned in scientific texts?
    ▪ RQ3: How to formally describe evidence based on the association of drugs, diseases, genes and proteins mentioned in the same paragraph of a scientific article?
    ▪ RQ4: How to relate drugs and diseases from the paragraphs where they are mentioned?
    ▪ RQ5: How to provide access to a knowledge graph, together with the content of a collection of scientific publications, through natural language queries?
  • 38. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ There is no common methodology for the construction of knowledge graphs from the biomedical literature, but rather a series of steps or stages that coincide among existing works.
    ▪ We propose a workflow that is also valid when information update cycles are short
      o E.g. one-week update cycles
    ▪ This workflow addresses research question RQ1
  • 40. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    • Fine-tuned the BioBERT [Lee et al., 2020] model to identify the following biomedical classes:
      - Diseases: BC5CDR-Diseases and NCBI-Diseases
      - Chemicals: BC4CHEMD and BC5CDR-Drugs
      - Genetics: JNLPBA and BC2GM
    • Unique representation of each concept from a set of related terms composed using multiple information sources
      - 7.4 million annotations in JSON format
    • Multiple sources were taken into account to create a database for each of the biomedical entities
    • With the creation of these models we address research question RQ2
  • 42. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    • First, identify the requirements to express biomedical concepts and the associations between them.
      - Unified Medical Language System (UMLS)
      - Several efforts have been made to integrate biomedical knowledge into a single shared representation space (e.g. DISNET platform)
    • Then, create an ontology that describes:
      - (i) biomedical concepts and associations between them, and
      - (ii) the evidence supporting these associations.
    • Challenges:
      - Biomedical taxonomies and vocabularies with reduced semantics (e.g. SNOMED, ICD-10, UMLS...)
      - How are the concepts related?
  • 43. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    • We created the Evidences for BiOmedical Concepts Association (EBOCA) ontology to describe:
      - biomedical concepts
      - associations between them
      - evidence supporting these associations.
    • It defines the conceptual model on which the Drugs4Covid knowledge graph is built
    • It is composed of two modules: EBOCA SEM-DISNET, oriented toward describing biomedical concepts and associations, and EBOCA Evidences, focused on representing evidence of these associations with metadata and provenance information
    https://w3id.org/eboca/portal
    Perez, Andrea Alvarez, Ana Iglesias-Molina, Lucia Prieto Santamaria, Maria Poveda-Villalon, Carlos Badenes-Olmedo and Alejandro Rodriguez-Gonzalez. “EBOCA: Evidences for BiOmedical Concepts Association Ontology.” International Conference Knowledge Engineering and Knowledge Management (2022).
  • 44. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    EBOCA SEM-DISNET module:
    • Designed to represent associations of common biomedical concepts, such as diseases, phenotypes, genes, genetic variants, biological pathways, drugs, proteins and targets.
    • Associations link pairs of concepts, for example the gene-disease or drug-disease association
    • Adds semantics to the DISNET structure
      - phenotypic layer
      - biological layer
      - pharmacological layer
  • 45. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    EBOCA Evidences module:
    • Extends the associations between biomedical concepts of the SEM-DISNET module with metadata and provenance information
    • These evidences of associations may come from known curated sources, or may be drawn or inferred directly from the texts.
    • Describes in more detail the type of evidence supporting the association, and the agents involved in its extraction and publication
  • 46.
    • The EBOCA ontology addresses research question RQ3.
    • The evaluation of an ontology usually seeks to identify inconsistencies or formal errors that invalidate its definition. However, since this work is oriented to the creation of a knowledge graph from a collection of scientific articles, it is more interesting to focus on an evaluation that measures how well the ontology covers a set of competency questions:
      » 15 questions associated with the EBOCA SEM-DISNET module
      » 10 questions associated with the EBOCA Evidences module
  • 48. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ Objective: annotate the entities, their relationships and the evidence according to the EBOCA ontology
    ▪ Methodology:
      - The mapping rules between the annotations and the ontology resources were created with Mapeathor and executed with the Morph-KGC library.
      - The articles were then described according to the ontology in an RDF file.
      - Finally, GraphDB was chosen to store the RDF content and Helio to provide a SPARQL access interface
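    To make the annotation-to-RDF step more concrete, the sketch below turns one JSON-like annotation into N-Triples by hand. It deliberately replaces the Mapeathor and Morph-KGC mapping rules with plain Python. The `http://example.org/` namespace, the `mentions` property, the annotation layout and the identifiers are all hypothetical and are not the EBOCA vocabulary.

```python
EX = "http://example.org/"  # hypothetical namespace, NOT the EBOCA IRIs
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def annotation_to_ntriples(annotation: dict) -> list[str]:
    """Serialise one paragraph annotation as N-Triples lines."""
    paragraph = f"<{EX}paragraph/{annotation['paragraph_id']}>"
    triples = []
    for entity in annotation["entities"]:
        node = f"<{EX}{entity['type']}/{entity['id']}>"
        # Link the paragraph to the entity it mentions (hypothetical property).
        triples.append(f"{paragraph} <{EX}mentions> {node} .")
        # Escape backslashes and quotes so the literal stays valid N-Triples.
        label = entity["label"].replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{node} <{RDFS_LABEL}> "{label}" .')
    return triples


# Made-up annotation of one paragraph mentioning a drug and a disease.
sample = {
    "paragraph_id": "p1",
    "entities": [
        {"type": "drug", "id": "D001", "label": "remdesivir"},
        {"type": "disease", "id": "X01", "label": "COVID-19"},
    ],
}
triples = annotation_to_ntriples(sample)
print("\n".join(triples))
```

    Declarative mappings, as used in the actual workflow, keep these rules outside the code, which makes them easier to maintain when the ontology or the annotation format changes.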
  • 49. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
  • 50. Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
  • 52.
    ▪ Objective: facilitate access to the resources created in this work, especially the knowledge graph.
    ▪ Methodology:
      o There are multiple ways to exploit a KG: predefined SPARQL queries, guided document searches, etc.
      o Our proposal is a question-answering (QA) system based on natural language, MuHeQA
        » combines ExtractiveQA and NLP with KGs
        » Summarization, Evidence Extraction and Answer Generation
    ▪ https://drugs4covid.oeg.fi.upm.es/services/bio-qa
  • 53. Outline
    ▪ Context (~10min)
      ▪ Ontology Engineering Group
      ▪ Academic and Industrial Background
      ▪ Personal Motivation
    ▪ Relevant Outcomes (~30min)
      ▪ Cross-lingual Document Similarity (librAIry)
      ▪ Multiple and Heterogeneous QA (MuHeQA)
      ▪ Knowledge graph-driven Clinical Document Exploration (Drugs4Covid)
    ▪ Ongoing Research (~5min)
      ▪ Multi-Dimensional Fake News Identification**
      ▪ Inconsistency Detection
      ▪ Conversational-assisted Topic Labelling
      ▪ Multi-hop KGQA
  • 54. Multi-Dimensional Fake News Identification
    “..In this context the specialist said that what we eat is the main cause of cancer risk, with a *NUMBER* percent; in front of tobacco consumption, responsible in a *NUMBER* percent; and infections, with *NUMBER* percent”
    ▪ Multi-Dimensional Fake News Identification: Knowledge Graph; 5Ws; evidences; Dictionary; Social Network impact
  • 55. Inconsistency Detection
    ▪ Inconsistency Detection: social media, political parties, specific issues, over time
    Guillen-Pancho, Ibai, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)
  • 56. Conversational-assisted Topic Labeling
    ▪ Conversational-assisted Topic Labeling: Words Selection, Question Composition, Question Answering, Label Retrieval
    Ramón-Ferrer, Virginia, Badenes-Olmedo, Carlos and Óscar Corcho. “Automatic Topic Label Generation Using Conversational Models.” (2023)
  • 57. Multi-hop KGQA
    ▪ Multi-hop KGQA:
    Liu-Chen, Teng, Badenes-Olmedo, Carlos and Óscar Corcho. “Enabling complex question support in hybrid knowledge bases.” (2023)
    Min, Sewon, Victor Zhong, Luke Zettlemoyer and Hannaneh Hajishirzi. “Multi-hop Reading Comprehension through Question Decomposition and Rescoring.” ArXiv abs/1906.02916 (2019)
  • 58. NLP and Knowledge Graphs. Carlos Badenes-Olmedo, Ontology Engineering Group (OEG), Universidad Politécnica de Madrid (UPM). 2023/06/15. carlos.badenes@upm.es, @carbadol
  • 59. Additional Content
    ▪ Ontology Engineering Framework
    ▪ Knowledge Graph Tools
    ▪ NLP Resources
    ▪ Open Science
    [Diagram labels: Ontologies, Knowledge Graphs, NLP, Open Science]
  • 60. OEG Ontology Engineering Framework: LOT industrial methodology
    http://lot.linkeddata.es
    +20 projects
    More details at: Poveda-Villalón, M., Fernández-Izquierdo, A., Fernández-López, M., & García-Castro, R. (2022). LOT: An industrial oriented ontology engineering framework. Engineering Applications of Artificial Intelligence, 111, 104755. https://doi.org/10.1016/j.engappai.2022.104755
  • 61. OEG Ontology Engineering Framework: LOT adoption
    ▪ +20 projects (internal and external)
    ▪ https://lot.linkeddata.es/#stories (selected examples)
  • 62. OEG Ontology Engineering Framework: Technology landscape
  • 63. OEG Ontology Engineering Framework: Technology landscape
    • Openly available corpus: https://coralcorpus.linkeddata.es/
    • 834 ontological requirements annotated
    • 29 lexico-syntactic patterns
    • HTML, CSV and RDF
    • "Creative Commons Attribution 4.0 International" license
  • 64. OEG Ontology Engineering Framework: Technology landscape
    • https://chowlk.linkeddata.es
    • Notation and converter
    • Hosted by OEG
    • Developed by OEG
  • 65. OEG Ontology Engineering Framework: Technology landscape
    • https://lov.linkeddata.es
    • Vocabulary registry and index
    • Hosted by OEG
    • Not developed by OEG
  • 66. OEG Ontology Engineering Framework: Technology landscape
    • http://oops.linkeddata.es/
      • Check for pitfalls (41 defined, 31 automated)
      • Online app
      • Web service
    • http://themis.linkeddata.es
      • Ontology unit tests based validation
      • Online app
      • Web service
  • 67. OEG Ontology Engineering Framework: Technology landscape
    • https://github.com/dgarijo/Widoco/
    • Ontology documentation
    • Desktop app (maintained by Daniel Garijo, ISI, California)
    • Web service (OEG)
  • 68. OEG Ontology Engineering Framework: Technology landscape
    • https://github.com/oeg-upm/vocab.linkeddata.es
    • Ontology portal generation
    • jar distribution
  • 69. OEG Ontology Engineering Framework: Technology landscape
  • 70. OEG Ontology Engineering Framework: Handle versions and distributed environments
    [Diagram labels: Evaluation reports, HTML documentation, Diagrams, Permanent Ids, Content negotiation, Bundle, Pre-view]
    http://ontoology.linkeddata.es
  • 72. OEG Ontology Engineering Framework: Technology landscape
    • https://morph.oeg-upm.net/
    • OBDA (Ontology-Based Data Access)
    • Morph-RDB
    • Morph-GraphQL
    • Morph-CSV
  • 73. OEG Ontology Engineering Framework: Technology landscape
    • https://helio.linkeddata.es/
    • RDF from heterogeneous data sources
    • Java? Web service? API?
    • Sync. or Async. integration of heterogeneous data sources
    • Data quality, cleaning and linking functions
    • Linked Data Service publishing data or maven dependency
  • 74. OEG Ontology Engineering Framework: Technology landscape
    • https://astrea.linkeddata.es
    • Generation of SHACL shapes from ontologies
    • Validation of data using SHACL shapes
    • Online service
  • 76. librAIry: http://librairy.eu, @librairy_eu, https://github.com/librairy
  • 77. Drugs4Covid: https://drugs4covid.oeg.fi.upm.es/, https://github.com/drugs4covid
  • 78. Añotador: https://annotador.oeg.fi.upm.es/, https://github.com/mnavasloro/Annotador
  • 79. TermitUp: https://termitup.oeg.fi.upm.es/
  • 80. KeyQ: https://aiproc.linkeddata.es/
  • 81. Methods for Knowledge-Based Systems and Deep Learning Integration
    ▪ Semantic-based Initialization for OOKB entities
    [Figure: three-phase pipeline. Labels: KG dumps, Endpoint; PHASE 1: entity embedding, ontology embedding; PHASE 2: entity ontological information, KG ontology, embedding composition, ontological information embeddings; PHASE 3: initial embedding. Class hierarchy examples: Thing, Place, Country; Thing, Person]
    Amador-Domínguez, E., Serrano, E., Manrique, D., Hohenecker, P., and Lukasiewicz, T. (2021). An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities. Information Sciences, 564, 85–102.
  • 83. Open Science: Creating KGs of Research Software Metadata
    ▪ Readme analysis
      o Supervised classification
      o Regular expressions
      o Header analysis
    ▪ File exploration
      o Notebooks
      o Dockerfiles
      o Documentation
    ▪ GitHub API
    [Diagram labels: Repository, Extraction, Results (Metadata)]
    https://github.com/KnowledgeCaptureAndDiscovery/somef/
    Kelley, A., & Garijo, D. (2021). A framework for creating knowledge graphs of scientific software metadata. Quantitative Science Studies, 1-37.
  • 84. Open Science: Tracking FAIR principles in Research Software
    - Continuous updates to Research Software catalogs of tools
    - Automated feedback on compliance with best practices, derived from software metadata extraction
    - Linking research articles with their corresponding software tools
    https://software.oeg.fi.upm.es/