Type Vector Representations
from Text: An Empirical Analysis
Federico Bianchi, Mauricio Soto, Matteo Palmonari and Vincenzo Cutrona
Department of Informatics, Systems and Communications
University of Milano-Bicocca
federico.bianchi@disco.unimib.it
Workshop on Deep Learning for Knowledge Graphs
and Semantic Technologies (DL4KGS)
Co-located with ESWC 18, June 2018, Crete, Greece
ESWC, Crete, 4th June 2018
Outline
● Knowledge Graphs
● Scope of this Paper and State-of-the-art
● T2V: Type to Vector
● Experiments
Knowledge Graphs
● Structured representations of knowledge
● Entities are classified using types (i.e., concepts)
● Types are organized in sub-type graphs
[Figure: example knowledge graph with the entities A.S. Roma, Kostas Manolas, Garry Kasparov and Real Madrid, a "team" relation, and the types Soccer Player, Soccer Club, Chess Player, Athlete, Person, Sports Club, Organisa. and Thing]
Scope of this Paper
● Propose an approach to learn representations of types by
considering text as a different source of information
○ Distributional semantics
○ Embeddings of types in a vector space
○ Mapping to a word2vec learning problem
● Main intuition: building a type similarity measure that encodes
relatedness between types (beyond ontological similarity)
● Empirical evaluation of the properties of text-based type
representations
○ Focus on similarity (relatedness vs ontological similarity)
Vector Representations of Types
Types represented in a vector space:
● Easy and fast evaluation of similarity
[Figure: example vectors for the types Soccer Club and Person]
Embeddings for Representing Ontologies
● [Jayawardana+, 2017]
○ Instance-based approach that builds word embedding vectors for the instances in a custom ontology (legal domain)
○ Embeddings are used to predict the best representative vector for each ontology type (cluster-based approach)
○ Conclusions: type vectors are aggregations of entity embeddings
● [Smaili+, 2018]
○ Distributional hypothesis based embeddings for ontological representation
○ Textual document generated by considering axioms in an ontology as sentences of a text
○ Conclusions: uses the structure of the ontology
State-of-the-Art on Ontological Similarity
● [Rada+, 1989] (path)
○ Shortest path length between concepts
○ Equal path problem: two concepts with the same path length share the same semantic similarity
● [Wu&Palmer, 1994] (wup)
○ Considers the depth of the concepts (based on the Least Common Subsumer, i.e., the first common ancestor)
○ Equal depth problem: concepts at the same hierarchical level share the same similarity
● [Zhu&Iglesias, 2017] (wpath)
○ Weighted path length to evaluate the similarity between concepts
○ Exploitation of the statistical Information Content (IC) along with the topology
○ IC computed on text corpora and used to give more weight to more specific concepts
● Topologically distant concepts may be highly related (e.g., SoccerPlayer and SoccerClub)
● Not all sibling pairs are similar in the same way (e.g., is a SoccerPlayer equally similar to a Wrestler and a BasketballPlayer?)
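The two purely topological measures above (path and wup) can be sketched on a toy hierarchy. This is a minimal illustration, not the implementation used in the cited papers: the hierarchy, the function names, the normalization 1 / (1 + path length) and the convention that the root has depth 1 are all assumptions made for the example.

```python
# Toy sub-type hierarchy (child -> parent); labels are illustrative only.
PARENT = {
    "Person": "Thing",
    "Organisation": "Thing",
    "Athlete": "Person",
    "SoccerPlayer": "Athlete",
    "Wrestler": "Athlete",
    "SportsClub": "Organisation",
    "SoccerClub": "SportsClub",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """Depth of a concept, counting the root as depth 1 (assumed convention)."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least Common Subsumer: the first ancestor of `a` that also subsumes `b`."""
    above_b = set(ancestors(b))
    return next(c for c in ancestors(a) if c in above_b)


def path_length(a, b):
    """Number of edges on the shortest path between `a` and `b` in the tree."""
    common = depth(lcs(a, b))
    return (depth(a) - common) + (depth(b) - common)


def path_sim(a, b):
    """Path-based similarity, normalized as 1 / (1 + path length) (assumption)."""
    return 1.0 / (1.0 + path_length(a, b))


def wup_sim(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


for pair in [("SoccerPlayer", "Wrestler"), ("SoccerPlayer", "SoccerClub")]:
    print(pair, round(path_sim(*pair), 2), round(wup_sim(*pair), 2))
```

On this toy tree the sibling pair (SoccerPlayer, Wrestler) scores higher than the related pair (SoccerPlayer, SoccerClub) under both measures, which is the limitation the last two bullets point at.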
Similarity vs. Relatedness
Semantic Similarity
● Resemblance in general conceptual terms
● Ex. Settlement and Town
● Limitations: equal path problem, equal depth problem
● Measures based on the ontology topology:
○ path
○ wup (Least Common Subsumer)
○ wpath (Information Content)
Relatedness
● Existence of connections
● Ex. SoccerPlayer and SoccerClub
● Oblivious to the ontology structure
● Measures based on corpus co-occurrence:
○ Word embeddings (distributional hypothesis), e.g., word2vec
Word2Vec [Mikolov+, 2013]
Well-known algorithm for learning word
representations from an input corpus
Distributional hypothesis: similar words appear in
similar contexts (word-word co-occurrence)
Type to Vector (T2V): generate distributed
representations of types based on type-type
co-occurrence.
[Figure: example vectors for the words cat, black, eats and dog; similar words correspond to similar vectors]
Example corpus:
The big black cat eats its food.
My little black cat sleeps all day.
Sometimes my cat eats too much!
Two hyperparameters:
● Desired embedding size
● Length of the context window
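The effect of the context window can be illustrated by listing the (target, context) pairs a skip-gram style model would see for one of the example sentences. This is only a sketch of the co-occurrence extraction step, not of word2vec training itself; the function name `skipgram_pairs` and the fixed symmetric window are assumptions (actual implementations may also shrink the window at random).

```python
def skipgram_pairs(tokens, window):
    """Return every (target, context) pair within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "my little black cat sleeps all day".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words observed for "cat" with a window of 2
cat_contexts = [context for target, context in pairs if target == "cat"]
print(cat_contexts)
```

A larger window pairs each token with more distant neighbours; in T2V the tokens are types rather than words, so the same step yields type-type co-occurrences.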
T2V: Word2Vec on Annotated Text
Part of our approach to learn representations of typed entities:
- Bianchi & Palmonari. Joint Learning of Entity and Type Embeddings for Analogical Reasoning with Entities. NL4AI 2017
- Bianchi et al. Towards Encoding Time in Text-Based Entity Embeddings. ISWC 2018 (to appear).
Input text:
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
Find entities in text:
Rome, Italy, Rome, Lazio
Replace entities with minimal types:
City, Country, City, Administrative_Region
● Entities are found with a Named Entity Linking service
● Words are removed
● Entities are replaced with their minimal (most specific) type
● The document containing sequences of types is fed to word2vec
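The preprocessing steps above can be sketched as follows. The token/entity pairs stand in for the output of an entity linking service and the `MINIMAL_TYPE` lookup stands in for a query against the ontology; both structures, and the function name, are hypothetical and only serve the illustration.

```python
# Hypothetical entity-linker output: each token is paired with the entity it
# was linked to, or None when the token is a plain word.
annotated_tokens = [
    ("Rome", "Rome"), ("is", None), ("the", None), ("capital", None),
    ("of", None), ("Italy", "Italy"), ("Rome", "Rome"), ("also", None),
    ("serves", None), ("as", None), ("the", None), ("capital", None),
    ("of", None), ("the", None), ("Lazio", "Lazio"), ("region", None),
]

# Hypothetical lookup from an entity to its minimal (most specific) type.
MINIMAL_TYPE = {
    "Rome": "City",
    "Italy": "Country",
    "Lazio": "Administrative_Region",
}


def to_type_sequence(tokens, minimal_type):
    """Drop plain words and replace each linked entity with its minimal type."""
    entities = [entity for _, entity in tokens if entity is not None]
    # Entities without a known type are skipped (a choice made for this sketch).
    return [minimal_type[e] for e in entities if e in minimal_type]


type_sequence = to_type_sequence(annotated_tokens, MINIMAL_TYPE)
print(type_sequence)
```

Each abstract processed this way becomes one sequence of types, and the collection of such sequences is the corpus given to word2vec.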
Generate Type Vectors
● word2vec is trained on the type sequences (e.g., City, Country, City, Administrative_Region)
[Figure: example type vectors for City, Country and Administrative Region]
Similarity can be computed with cosine similarity
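Cosine similarity between two type vectors can be computed as below. The three-dimensional vectors are made-up values chosen for the example, not embeddings produced by T2V.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
type_vectors = {
    "City": [0.9, 0.8, 0.1],
    "Country": [0.8, 0.9, 0.2],
    "Person": [0.1, 0.2, 0.9],
}

city_country = cosine_similarity(type_vectors["City"], type_vectors["Country"])
city_person = cosine_similarity(type_vectors["City"], type_vectors["Person"])
print(round(city_country, 3), round(city_person, 3))
```

Because the measure depends only on the angle between vectors, it ignores differences in vector length.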
Empirical Evaluation of T2V Representations
Objective: analyzing the properties of the T2V representations, focus on similarity
Corpus for T2V training: DBpedia 2016-04 abstracts annotated with DBpedia Spotlight
Experiments:
1) Analogical reasoning (standard method of evaluation for word embeddings)
2) Correlation with topological measures
3) Similarity and depth (depth problem)
4) Similarity and siblings (siblings similarity problem)
5) Type matching (similarity between different categorization systems)
1) Analogical Reasoning
Hypothesis
T2V can support analogical reasoning as word2vec does
Dataset
Dataset of 868 reasonably objective analogies on sports.
(e.g., sportPlayer - sportTeam)
Methodology
● Tested two T2V models with 100 and 200 embedding dimensions and a context window of 5
● Word2vec answers with the list of points closest to the result of the analogical operation
● We check whether the correct answer is found in the top-k of the ranked list (k = 1, 5, 10)
Example
“Who is the equivalent of a RugbyPlayer that plays in a
RugbyTeam in a BasketballTeam?”
RugbyPlayer : RugbyTeam :: ? : BasketballTeam
Analogical operation: v(dbo:RugbyPlayer) -
v(dbo:RugbyTeam) + v(dbo:BasketballTeam) ≈
v(dbo:BasketballPlayer)
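The analogical operation and the top-k check can be sketched with hand-made vectors. The vectors below are constructed so that the analogy holds exactly and are not T2V embeddings; excluding the three query types from the candidate list is a common convention in word-analogy evaluation and is assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors, k=5):
    """Rank types by similarity to v(a) - v(b) + v(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [t for t in vectors if t not in (a, b, c)]
    ranked = sorted(candidates, key=lambda t: cosine(target, vectors[t]), reverse=True)
    return ranked[:k]


# Toy vectors: (rugby, basketball, player-vs-team); illustration only.
TOY_VECTORS = {
    "dbo:RugbyPlayer": [1.0, 0.0, 1.0],
    "dbo:RugbyTeam": [1.0, 0.0, -1.0],
    "dbo:BasketballTeam": [0.0, 1.0, -1.0],
    "dbo:BasketballPlayer": [0.0, 1.0, 1.0],
    "dbo:City": [0.2, 0.1, 0.0],
}

top = analogy("dbo:RugbyPlayer", "dbo:RugbyTeam", "dbo:BasketballTeam", TOY_VECTORS, k=2)
print(top)

# An analogy counts as correct at k if the expected type is among the top-k answers.
hit_at_1 = top[:1] == ["dbo:BasketballPlayer"]
```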
1) Analogical Reasoning
Model (dimensions, window)   P@1    P@2    P@5
T2V (200, 5)                 0.50   0.85   0.98
T2V (100, 5)                 0.47   0.76   0.93
[Annotation on the table: model used for the next experiments]
Outcome
● Correct answer is often found in the first 5 positions
● Linguistic properties are preserved also in T2V
2) T2V vs Topological Measures: Correlation
         path   wup    wpath  T2V
path     1.00   0.87   0.94   0.30
wup             1.00   0.93   0.33
wpath                  1.00   0.36
T2V                           1.00
Hypothesis
T2V similarity is orthogonal to topological similarity
Dataset
~15000 pairs of types in DBpedia
Methodology
Pearson Correlation coefficient between T2V similarity and
well-known topological measures
Outcome
T2V similarity and topological similarity are not strongly
correlated
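The Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The two score lists are invented for the example and are not the similarity values from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented similarity scores for the same list of type pairs (illustration only).
topological_scores = [0.20, 0.50, 0.40, 0.70, 0.10]
t2v_scores = [0.70, 0.20, 0.80, 0.80, 0.75]

print(round(pearson(topological_scores, t2v_scores), 2))
```

A coefficient near 1 would mean the two measures rank type pairs in nearly the same way; values closer to 0, as in the table, indicate that they capture different information.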
2) T2V vs Topological Measures: Insights
State of the Art
Based on the topology of the ontology
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (low similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (high similarity, siblings)
T2V
Captures the co-occurrences of types in text
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (high similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (low similarity, siblings)
2) T2V vs Topological Measures: Examples
Type 1             Type 2                    Sim - wpath   Sim - T2V
dbo:SoccerPlayer   dbo:SoccerClub            0.17          0.72
dbo:SoccerPlayer   dbo:Wrestler              0.47          0.24
dbo:RailwayLine    dbo:Station               0.44          0.81
dbo:Vein           dbo:Artery                0.70          0.84
dbo:RailwayLine    dbo:PublicTransitSystem   0.11          0.79
dbo:Company        dbo:Airline               0.72          0.30
3) Similarity vs. Depth
Hypothesis
Sibling types are pairwise more similar when types are more specific (as noticed in topological similarity)
sim(dbo:BasketballPlayer, dbo:SoccerPlayer) > sim(dbo:Person, dbo:Organization)
Dataset
DBpedia ontology
Methodology
● Children Information Distribution (CID)
○ Average pairwise similarity between the children of a type p
● CID vs. relative depth (relative to the type path)
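CID, as defined above, is the mean similarity over all pairs of children of a type. A minimal sketch follows, assuming cosine similarity as the pairwise measure; the children map and the two-dimensional vectors are invented for the example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cid(parent, children, vectors):
    """Average pairwise similarity between the children of `parent`.

    Returns None when the type has fewer than two children.
    """
    pairs = list(combinations(children.get(parent, []), 2))
    if not pairs:
        return None
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Invented hierarchy fragment and vectors (illustration only).
CHILDREN = {"Athlete": ["SoccerPlayer", "BasketballPlayer", "ChessPlayer"]}
VECTORS = {
    "SoccerPlayer": [1.0, 0.0],
    "BasketballPlayer": [1.0, 1.0],
    "ChessPlayer": [0.0, 1.0],
}

print(round(cid("Athlete", CHILDREN, VECTORS), 3))
```

Computing this value for every type with at least two children, and grouping the values by depth, gives the kind of CID-versus-depth comparison described in the methodology.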
3) T2V CID vs. Relative Depth
[Figure: plot of T2V CID against relative depth, annotated at the point where CID drops: CID(dbo:Thing) > CID(dbo:Agent)]
Outcome
● On average, CID increases with depth
4) Siblings’ Similarity
Hypothesis
The pairwise similarity for a set of siblings changes from pair to pair
Dataset
31 sibling types from the DBpedia ontology
For each type we selected its most similar sibling and its least similar sibling according to T2V similarity
(e.g., SoccerPlayer => most similar RugbyPlayer, least similar ChessPlayer)
Methodology
We asked 5 users (knowledgeable about semantic web) to answer questions like the
following:
“Do you think a SoccerPlayer is more similar to a RugbyPlayer or a ChessPlayer?”
Potential Biases
● Low number of participants
● Questions were selected using T2V
4) Siblings’ Similarity
Outcome
● Agreement among the users, measured with Gwet's AC1 [Gwet, 2008], is 0.9 (high agreement)
● Given an input type, users choose as answer the type that is also returned as most similar by T2V
Examples
Is a Writer more similar to a dbo:Philosopher or a dbo:BusinessPerson?
Is a President more similar to a dbo:PrimeMinister or a dbo:Mayor?
Most challenging question for users
“Is a dbo:Skyscraper more similar to a dbo:Hospital or a dbo:Museum?”
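The agreement coefficient mentioned in the outcome can be sketched as follows. The slides do not give the formula, so the multi-rater form of AC1 used here is an assumption to be checked against [Gwet, 2008]; the answer labels and ratings are invented.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for several raters (assumed multi-rater form).

    `ratings` holds one list per question, with one label per rater.
    Every question needs at least 2 raters and there must be at least 2 categories.
    """
    n_items = len(ratings)
    n_categories = len(categories)
    agreement_sum = 0.0
    prevalence_sum = {c: 0.0 for c in categories}
    for item in ratings:
        raters = len(item)
        counts = Counter(item)
        # Share of rater pairs that agree on this question
        agreement_sum += sum(k * (k - 1) for k in counts.values()) / (raters * (raters - 1))
        for c in categories:
            prevalence_sum[c] += counts[c] / raters
    observed = agreement_sum / n_items
    # Chance agreement: sum of pi_q * (1 - pi_q) over categories, divided by (Q - 1)
    chance = sum(
        (p / n_items) * (1.0 - p / n_items) for p in prevalence_sum.values()
    ) / (n_categories - 1)
    return (observed - chance) / (1.0 - chance)


# Invented answers of 5 raters to 2 questions (illustration only).
answers = [
    ["most", "most", "most", "most", "most"],
    ["most", "most", "most", "most", "least"],
]
print(round(gwet_ac1(answers, ["most", "least"]), 3))
```

With five raters and two possible answers per question, a value close to 1 means the raters almost always picked the same sibling.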
5) Type Matching
Hypothesis
T2V can be used for ontology matching provided that two different ontologies are used to classify a common set of instances
Methodology
● Learn representations of types from different ontologies in a shared vector space (100 dimensions, window of 5)
● Replace each entity with its type from one of the two ontologies (chosen at random)
Dataset
● DBpedia 2016-04 and Wikidata 2016-06 (instance of)
[Figure: in the example text about Rome, the entities are replaced with City (Ontology 1), Country (Ontology 2), City (Ontology 1), Region (Ontology 2); types of different ontologies co-exist in the same vector space]
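The random replacement step can be sketched as below. The type identifiers come from the results table in these slides, while the entities, the `wd:` prefix and the function name are assumptions made for the illustration.

```python
import random

# Illustrative entity -> (DBpedia type, Wikidata type) pairs.
ENTITY_TYPES = {
    "Tiber": ("dbo:River", "wd:Q4022"),
    "Alitalia": ("dbo:Airline", "wd:Q46970"),
    "Europe": ("dbo:Continent", "wd:Q5107"),
}


def mixed_type_sequence(entities, entity_types, rng):
    """Replace each entity with its type from one of the two ontologies, picked at random."""
    return [rng.choice(entity_types[entity]) for entity in entities]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
entities = ["Tiber", "Europe", "Alitalia", "Tiber"]
sequence = mixed_type_sequence(entities, ENTITY_TYPES, rng)
print(sequence)
```

Since the same entity is sometimes written as its DBpedia type and sometimes as its Wikidata type, equivalent types end up in similar contexts, which is why they are expected to lie close together in the shared space.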
5) Type Matching
Wikidata (label)                               DBpedia                  Sim
Q4498974 (ice hockey team)                     HockeyTeam               0.99
Q5107 (continent)                              Continent                0.99
Q17374546* (Australian rules football club)    AustralianFootballTeam   0.99
Q3001412* (horse race)                         HorseRace                0.98
Q4022 (river)                                  River                    0.98
Q46970 (airline)                               Airline                  0.98
Q18127 (record label)                          RecordLabel              0.98
Q13027888* (baseball team)                     BaseballTeam             0.98
Q11424 (film)                                  Film                     0.98
Q1075* (color)                                 Colour                   0.98
Q17156793* (American football team)            AmericanFootballTeam     0.95
Q3146899* (diocese of the Catholic Church)     Diocese                  0.93
Q7944* (earthquake)                            Earthquake               0.91
* not declared equivalent in DBpedia
Outcome
● Types with highest similarity are equivalent classes in the
two ontologies (due to the use in text)
● Found equivalent types not declared as equivalent in
DBpedia
Conclusions and Future Work
Conclusions:
● Similarity with T2V injects relatedness in type similarity measures (from handwritten text corpora)
● T2V exhibits some desired properties (depth, sibling discrimination)
● T2V supports analogical reasoning
● T2V can support ontology matching
Future Work:
● Combine T2V similarity and topological similarities in one measure
● Study the relationship between the sub-type relation and the vector representations
● Support ontology matching tasks
● Compare with other methods for vector-based type representations
Thank You
Workshop on Deep Learning for Knowledge Graphs
and Semantic Technologies (DL4KGS)
Co-located with ESWC 18, June 2018, Crete, Greece
Code and models are publicly available (see the paper for details)
Mail to: federico.bianchi@disco.unimib.it
References
Bianchi, F., Palmonari, M., & Nozza, D. (2018). Towards Encoding Time in Text-Based Entity Embeddings. In International Semantic Web Conference (to appear).
Bianchi, F., & Palmonari, M. (2017). Joint Learning of Entity and Type Embeddings for Analogical Reasoning with Entities. In Proceedings of the NL4AI Workshop, co-located with the International Conference of the Italian Association for Artificial Intelligence (AI*IA).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48.
Zhu, G., & Iglesias, C. A. (2017). Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), 72–85.
Jayawardana, V., Lakmal, D., de Silva, N., Perera, A. S., Sugathadasa, K., & Ayesha, B. (2017). Deriving a representative vector for ontology classes with instance word vector embeddings. In INTECH (pp. 79–84).
Smaili, F. Z., Gao, X., & Hoehndorf, R. (2018). Onto2vec: joint vector-based representation of biological entities and their ontology-based annotations. arXiv preprint arXiv:1802.00864.
