Type Vector Representations from Text: An Empirical Analysis. For the Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies (DL4KGS).
Held in conjunction with ESWC 18 in June 2018 in Crete, Greece.
Type Embeddings, Ontology Matching, Type Similarity
Type Vector Representations from Text. DL4KGS@ESWC 2018
1. Type Vector Representations
from Text: An Empirical Analysis
Federico Bianchi, Mauricio Soto, Matteo Palmonari and Vincenzo Cutrona
Department of Informatics, Systems and Communications
University of Milano-Bicocca
federico.bianchi@disco.unimib.it
Workshop on Deep Learning for Knowledge Graphs
and Semantic Technologies (DL4KGS)
Co-located with ESWC 18, June 2018, Crete, Greece
2. ESWC, Crete, 4th June 2018
Outline
● Knowledge Graphs
● Scope of this Paper and State-of-the-art
● T2V: Type to Vector
● Experiments
4. ESWC, Crete, 4th June 2018
Knowledge Graphs
● Structured representations of knowledge
● Entities are classified using types (i.e., concepts)
● Types are organized in sub-types graphs
[Figure: example knowledge graph. Node and edge labels: A.S. Roma, Kostas Manolas, team, Soccer Player, Soccer Club, Athlete, Thing, Person, Sports Club, Garry Kasparov, Chess Player, Real Madrid, Organisa.]
6. ESWC, Crete, 4th June 2018
Scope of this Paper
● Propose an approach to learn representations of types by
considering text as a different source of information
○ Distributional semantics
○ Embeddings of types in a vector space
○ Mapping to a word2vec learning problem
● Main intuition: building a type similarity measure that encodes
relatedness between types (beyond ontological similarity)
● Empirical evaluation of the properties of text-based type
representations
○ Focus on similarity (relatedness vs ontological similarity)
7. ESWC, Crete, 4th June 2018
Vector Representations of Types
Types represented in a vector space:
● Easy and fast evaluation of similarity
[Figure: example numeric vectors for the types Soccer Club and Person]
8. ESWC, Crete, 4th June 2018
Embeddings for Representing Ontologies
● [Jayawardana+, 2017]
○ Instance-based approach for building word embedding vectors of the instances in a custom ontology (legal domain)
○ Embeddings used to predict the best representative vector for each ontology type (cluster-based approach)
○ Conclusions: type vectors are aggregations of entity embeddings
● [Smaili+, 2018]
○ Distributional hypothesis based embeddings for ontological representation
○ Textual document generated by considering axioms in an ontology as sentences of a text
○ Conclusions: uses the structure of the ontology
9. ESWC, Crete, 4th June 2018
State-of-the-Art on Ontological Similarity
● [Rada+, 1989] (path)
○ Shortest path length between concepts
○ Equal path problem: two concepts with the same path length share the same semantic similarity
● [Wu&Palmer, 1994] (wup)
○ Considers the depth of the concepts (based on the Least Common Subsumer, i.e., the first common ancestor)
○ Equal depth problem: concepts at the same hierarchical level share the same similarity
● [Zhu&Iglesias, 2017] (wpath)
○ Weighted path length to evaluate the similarity between concepts
○ Exploitation of the statistical Information Content (IC) along with the topology
○ IC computed on text corpora and used to assign higher level to more specific entities
● Topologically distant concepts may be highly related (e.g., SoccerPlayer and SoccerClub)
● Not all sibling pairs are equally similar (e.g., is a SoccerPlayer equally similar to a Wrestler and to a BasketballPlayer?)
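A minimal sketch of the two simplest topological measures on a toy hierarchy may help here. The hierarchy below is hypothetical (loosely modeled on the example types in these slides), depth is counted in nodes so that the root has depth 1 (an assumed convention), and the formulas are the commonly used forms: path = 1 / (1 + shortest path length) and wup = 2 * depth(LCS) / (depth(a) + depth(b)).

```python
# Toy sub-type hierarchy (child -> parent). Hypothetical, for illustration only.
PARENT = {
    "Person": "Thing",
    "Organisation": "Thing",
    "Athlete": "Person",
    "SoccerPlayer": "Athlete",
    "ChessPlayer": "Athlete",
    "SportsClub": "Organisation",
    "SoccerClub": "SportsClub",
}


def ancestors(t):
    """Return the chain [t, parent(t), ..., root]."""
    chain = [t]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(t):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(ancestors(t))


def lcs(a, b):
    """Least Common Subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    return next(t for t in ancestors(b) if t in seen)


def path_sim(a, b):
    """1 / (1 + number of edges on the shortest path, which passes through the LCS in a tree)."""
    c = lcs(a, b)
    edges = (depth(a) - depth(c)) + (depth(b) - depth(c))
    return 1.0 / (1.0 + edges)


def wup_sim(a, b):
    """Wu and Palmer style similarity based on the depth of the LCS."""
    c = lcs(a, b)
    return 2.0 * depth(c) / (depth(a) + depth(b))


siblings = ("SoccerPlayer", "ChessPlayer")
related = ("SoccerPlayer", "SoccerClub")
print(path_sim(*siblings), wup_sim(*siblings))
print(path_sim(*related), wup_sim(*related))
```

In this toy tree both measures rank the sibling pair above SoccerPlayer and SoccerClub, which only meet at the root: the topology alone cannot express that the two are strongly related.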
10. ESWC, Crete, 4th June 2018
Similarity vs. Relatedness
Semantic Similarity
Resemblance in general conceptual terms
Ex. Settlement and Town
Equal Path problem, Depth problem
Measures based on the ontology topology:
● path
● wup (Least Common Subsumer)
● wpath (Information Content)
Relatedness
Existence of connections
Ex. SoccerPlayer and SoccerClub
Ontology structure obliviousness
Measures based on co-occurrence in corpora
Word Embedding (Distributional Hypothesis)
● word2vec
12. ESWC, Crete, 4th June 2018
Word2Vec [Mikolov+, 2013]
Well-known algorithm for learning word
representations from an input corpus
Distributional hypothesis: similar words appear in
similar contexts (word-word co-occurrence)
Type to Vector (T2V): generate distributed
representations of types based on type-type
co-occurrence.
Example corpus:
The big black cat eats its food.
My little black cat sleeps all day.
Sometimes my cat eats too much!
[Figure: word vectors for cat, black, eats, dog; similar words correspond to similar vectors]
Two hyperparameters:
● Desired embedding size
● Length of the context window
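To make the context window hyperparameter concrete, the sketch below lists the (target, context) co-occurrence pairs that a window of a given length produces for one sentence. It only illustrates pair extraction; it is not the word2vec training procedure itself, and the function name `context_pairs` is a hypothetical helper.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for each token and its neighbours within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "my little black cat sleeps all day".split()
pairs = context_pairs(sentence, window=2)

# Contexts observed for "cat" with a window of 2 tokens on each side.
cat_contexts = [context for target, context in pairs if target == "cat"]
print(cat_contexts)
```

A larger window makes more distant tokens count as co-occurring, which matters for T2V because the input sequences contain only types.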
13. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
Part of our approach to learn representations of typed entities:
- Bianchi & Palmonari. Joint Learning of Entity and Type Embeddings for Analogical Reasoning with Entities. NL4AI 2017
- Bianchi et al. Towards Encoding Time in Text-Based Entity Embeddings. ISWC 2018 (to appear).
17. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
Find entities in text: Rome Italy Rome Lazio
Replace entities with minimal types: City, Country, City, Administrative_Region
● Entities are found with a Named Entity Linking Service
● Words are removed
● Entities are replaced with their minimal (most specific) type
● The document containing sequences of types is fed to word2vec
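The steps on this slide can be sketched as follows. Everything in the block is a stand-in: `LINKED_ENTITIES` plays the role of the entity linker output for the Rome example, and the type assignments and hierarchy are simplified, hypothetical data. The minimal type is taken to be the deepest type of an entity in the hierarchy, which is one plausible reading of "most specific".

```python
# Hypothetical entity linker output for the example sentence, in text order
# (all non-entity words have already been dropped).
LINKED_ENTITIES = ["Rome", "Italy", "Rome", "Lazio"]

# Hypothetical, simplified type hierarchy (child -> parent).
PARENT = {
    "Place": "Thing",
    "PopulatedPlace": "Place",
    "Settlement": "PopulatedPlace",
    "City": "Settlement",
    "Country": "PopulatedPlace",
    "Region": "PopulatedPlace",
    "Administrative_Region": "Region",
}

# Hypothetical type assignments, including super-types.
ENTITY_TYPES = {
    "Rome": {"Place", "PopulatedPlace", "Settlement", "City"},
    "Italy": {"Place", "PopulatedPlace", "Country"},
    "Lazio": {"Place", "PopulatedPlace", "Region", "Administrative_Region"},
}


def depth(t):
    """Number of edges between a type and the root of the hierarchy."""
    d = 0
    while t in PARENT:
        t = PARENT[t]
        d += 1
    return d


def minimal_type(types):
    """Pick the most specific type, here the deepest one in the hierarchy."""
    return max(types, key=depth)


def to_type_sequence(entities):
    """Replace each linked entity with its minimal type."""
    return [minimal_type(ENTITY_TYPES[e]) for e in entities]


type_sequence = to_type_sequence(LINKED_ENTITIES)
print(" ".join(type_sequence))
```

Sequences like this one, produced for every document of the corpus, would then form the training input for word2vec.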
18. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Pipeline: find entities in text, replace entities with minimal types, generate type vectors with word2vec
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
Rome Italy Rome Lazio
City, Country, City, Administrative_Region
[Figure: example numeric vectors for the types City, Country, and Adminis. Region]
Similarity can be computed with cosine similarity
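As a small illustration of the similarity computation, the sketch below implements cosine similarity in plain Python. The three vectors are hypothetical toy values, not the output of a trained T2V model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real type vectors would have 100 or 200 dimensions.
TYPE_VECTORS = {
    "City": [2.0, 5.0, 6.0],
    "Country": [4.0, 9.0, 11.0],
    "ChessPlayer": [6.0, -1.0, -2.0],
}

print(cosine(TYPE_VECTORS["City"], TYPE_VECTORS["Country"]))
print(cosine(TYPE_VECTORS["City"], TYPE_VECTORS["ChessPlayer"]))
```

Cosine similarity depends only on the direction of the vectors, not on their length, so two types whose vectors point the same way score close to 1 regardless of magnitude.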
20. ESWC, Crete, 4th June 2018
Empirical Evaluation of T2V Representations
Objective: analyzing the properties of the T2V representations, focus on similarity
Corpus for T2V training: DBpedia 2016-04 abstracts annotated with DBpedia Spotlight
Experiments:
1) Analogical reasoning (standard method of evaluation for word embeddings)
2) Correlation with topological measures
3) Similarity and depth (depth problem)
4) Similarity and siblings (siblings similarity problem)
5) Type matching (similarity between different categorization systems)
21. ESWC, Crete, 4th June 2018
1) Analogical Reasoning
Hypothesis
T2V can support analogical reasoning as word2vec does
Dataset
Dataset of 868 reasonably objective analogies on sports.
(e.g., sportPlayer - sportTeam)
Methodology
● Tested two different T2V analogical reasoning
models with 100 and 200 dimensions for the
embeddings and a window of 5
● Word2vec answers with the list of closest points
to the analogical operation
● We check if the correct answer is found in the list
of top-k (1, 5, 10)
● In the top-k setting, an answer counts as correct if it appears among the first k items of the ranked list
Example
“Who is the equivalent of a RugbyPlayer that plays in a
RugbyTeam in a BasketballTeam?”
RugbyPlayer : RugbyTeam :: ? : BasketballTeam
Analogical operation: v(dbo:RugbyPlayer) -
v(dbo:RugbyTeam) + v(dbo:BasketballTeam) ≈
v(dbo:BasketballPlayer)
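The analogical operation and the top-k check can be sketched as below. The vectors are hand-made, hypothetical values chosen so that the analogy holds exactly; a trained model would only approximate it. Excluding the three input types from the candidates is a common convention that is assumed here, not something stated on the slide.

```python
import math

# Hypothetical toy vectors; dimensions loosely mean [player, team, rugby, basketball, soccer].
VECTORS = {
    "RugbyPlayer": [1.0, 0.0, 1.0, 0.0, 0.0],
    "RugbyTeam": [0.0, 1.0, 1.0, 0.0, 0.0],
    "BasketballTeam": [0.0, 1.0, 0.0, 1.0, 0.0],
    "BasketballPlayer": [1.0, 0.0, 0.0, 1.0, 0.0],
    "SoccerPlayer": [1.0, 0.0, 0.0, 0.0, 1.0],
    "SoccerClub": [0.0, 1.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, k):
    """Rank answers to 'a : b :: ? : c' by cosine similarity to v(a) - v(b) + v(c)."""
    query = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Assumed convention: the three input types are not valid answers.
    candidates = [t for t in VECTORS if t not in (a, b, c)]
    ranked = sorted(candidates, key=lambda t: cosine(query, VECTORS[t]), reverse=True)
    return ranked[:k]


top = analogy("RugbyPlayer", "RugbyTeam", "BasketballTeam", k=2)
hit_at_1 = top[0] == "BasketballPlayer"
hit_at_2 = "BasketballPlayer" in top
print(top, hit_at_1, hit_at_2)
```

Averaging the hit indicators over all analogies in a dataset gives the fraction of questions answered correctly within the first k positions.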
22. ESWC, Crete, 4th June 2018
1) Analogical Reasoning
Model (dimensions, window) P@1 P@2 P@5
T2V (200, 5) 0.50 0.85 0.98
T2V (100, 5) 0.47 0.76 0.93
Outcome
● Correct answer is often found in the first 5 positions
● Linguistic properties are preserved also in T2V
Model used for the
next experiments
23. ESWC, Crete, 4th June 2018
2) T2V vs Topological Measures: Correlation
      path wup  wpath T2V
path  1.00 0.87 0.94  0.30
wup   -    1.00 0.93  0.33
wpath -    -    1.00  0.36
T2V   -    -    -     1.00
Hypothesis
T2V similarity is orthogonal to topological similarity
Dataset
~15000 pairs of types in DBpedia
Methodology
Pearson Correlation coefficient between T2V similarity and
well-known topological measures
Outcome
T2V similarity and topological similarity are not strongly
correlated
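For reference, the Pearson correlation coefficient between two lists of similarity scores can be computed as in this sketch. The two score lists are hypothetical values for four type pairs, not data from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical similarity scores for the same four type pairs under two measures.
topological_scores = [0.10, 0.20, 0.30, 0.40]
t2v_scores = [0.90, 0.20, 0.80, 0.30]

r = pearson(topological_scores, t2v_scores)
print(r)
```

Values near 1 or -1 indicate that one measure is close to a linear function of the other, while values near 0 indicate that the measures rank type pairs differently.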
24. ESWC, Crete, 4th June 2018
2) T2V vs Topological Measures: Insights
State of the Art
Based on the topology of the ontology
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (low similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (high similarity, siblings)
T2V
Captures the co-occurrences of types in text
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (high similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (low similarity, siblings)
26. ESWC, Crete, 4th June 2018
3) Similarity vs. Depth
Hypothesis
Sibling types are pairwise more similar when types are more specific (as noticed in topological similarity)
sim(dbo:BasketballPlayer, dbo:SoccerPlayer) > sim(dbo:Person, dbo:Organization)
Dataset
DBpedia ontology
Methodology
● Children Information Distribution (CID)
○ Average pairwise similarity between the children of a type p
● CID vs relative depth (depth relative to the type's path)
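The CID definition above (average pairwise similarity between the children of a type) can be sketched as follows. The children list and the similarity scores are hypothetical placeholders; in the experiment the similarity would be the T2V cosine similarity.

```python
from itertools import combinations

# Hypothetical children of one parent type.
CHILDREN = {"Athlete": ["SoccerPlayer", "BasketballPlayer", "ChessPlayer"]}

# Hypothetical symmetric similarity scores between sibling types.
SIMILARITY = {
    frozenset({"SoccerPlayer", "BasketballPlayer"}): 0.8,
    frozenset({"SoccerPlayer", "ChessPlayer"}): 0.3,
    frozenset({"BasketballPlayer", "ChessPlayer"}): 0.4,
}


def sim(a, b):
    """Look up the similarity of an unordered pair of types."""
    return SIMILARITY[frozenset({a, b})]


def cid(parent):
    """Average pairwise similarity between the children of `parent` (None if fewer than two children)."""
    pairs = list(combinations(CHILDREN.get(parent, []), 2))
    if not pairs:
        return None
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


athlete_cid = cid("Athlete")
print(athlete_cid)
```

Computing this value for every type with at least two children, and grouping the values by depth, gives the kind of CID versus depth comparison described on the slide.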
27. ESWC, Crete, 4th June 2018
3) T2V CID vs. Relative Depth
Outcome
● On average, CID increases with depth
[Figure: T2V CID plotted against relative depth; an annotation marks a drop where CID(dbo:Thing) > CID(dbo:Agent)]
28. ESWC, Crete, 4th June 2018
4) Siblings’ Similarity
Hypothesis
The pairwise similarity for a set of siblings changes from pair to pair
Dataset
31 sibling types from the DBpedia ontology
For each type we selected its most similar sibling and its least similar sibling according to T2V similarity
(e.g., SoccerPlayer => most similar RugbyPlayer, least similar ChessPlayer)
Methodology
We asked 5 users (knowledgeable about semantic web) to answer questions like the
following:
“Do you think a SoccerPlayer is more similar to a RugbyPlayer or a ChessPlayer?”
Potential Biases
● Low number of participants
● Questions were selected using T2V
29. ESWC, Crete, 4th June 2018
4) Siblings’ Similarity
Outcome
● Agreement between the users, measured with Gwet's AC1 [Gwet, 2008], is 0.9 (high agreement)
● Given an input type, users choose as answer the type that is also returned as most similar by T2V
Examples
Is a Writer more similar to a dbo:Philosopher or a dbo:BusinessPerson?
Is a President more similar to a dbo:PrimeMinister or a dbo:Mayor?
Most challenging question for users
“Is a dbo:Skyscraper more similar to a dbo:Hospital or a dbo:Museum?”
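A rough sketch of how an AC1 coefficient for several raters could be computed is given below. It follows the multi-rater formulation as commonly presented (observed agreement averaged over items, chance agreement derived from the average category shares) and should be checked against [Gwet, 2008] before being relied on. The answer counts are hypothetical, not the data collected in this experiment.

```python
def gwet_ac1(counts):
    """AC1 agreement coefficient.

    counts[i][q] is the number of raters who chose category q for item i.
    Assumes at least two categories and at least two raters per item.
    """
    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: chance that two raters of the same item agree, averaged over items.
    p_a = 0.0
    for row in counts:
        raters = sum(row)
        p_a += sum(c * (c - 1) for c in row) / (raters * (raters - 1))
    p_a /= n_items

    # Average share of ratings that fall into each category.
    shares = [
        sum(row[q] / sum(row) for row in counts) / n_items
        for q in range(n_categories)
    ]

    # Chance agreement as defined for AC1.
    p_e = sum(s * (1 - s) for s in shares) / (n_categories - 1)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical answers of 5 users to 4 two-option questions:
# each row is [votes for the first option, votes for the second option].
answers = [[5, 0], [5, 0], [4, 1], [0, 5]]
ac1 = gwet_ac1(answers)
print(ac1)
```

The coefficient equals 1 when all raters give the same answer to every question.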
30. ESWC, Crete, 4th June 2018
5) Type Matching
Hypothesis
T2V can be used for ontology matching provided that two different ontologies are used to classify a common set of instances
Methodology
● Learn representations of types from different ontologies in a shared vector space (100 dimensions, window of 5)
● Replace entities with a type of one of the two ontologies (randomly)
Dataset
● DBpedia 2016-04 and Wikidata 2016-06 (instance of)
[Figure: types of different ontologies co-exist in the same vector space]
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
City (Ontology 1), Country (Ontology 2), City (Ontology 1), Region (Ontology 2)
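The random replacement step can be sketched as follows. The two lookup tables are hypothetical stand-ins for the minimal types of each entity in the two ontologies (the `dbo:` and `wd:` identifiers are illustrative labels, not real Wikidata IDs), and the prefixes keep the two vocabularies apart so that both sets of types end up in one vector space.

```python
import random

# Hypothetical minimal types of each entity in two ontologies.
ONTOLOGY_1 = {"Rome": "dbo:City", "Italy": "dbo:Country", "Lazio": "dbo:AdministrativeRegion"}
ONTOLOGY_2 = {"Rome": "wd:city", "Italy": "wd:country", "Lazio": "wd:region"}


def mixed_type_sequence(entities, rng):
    """Replace each entity with its type from one of the two ontologies, chosen at random."""
    sequence = []
    for entity in entities:
        ontology = rng.choice([ONTOLOGY_1, ONTOLOGY_2])
        sequence.append(ontology[entity])
    return sequence


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
entities = ["Rome", "Italy", "Rome", "Lazio"]
sequence = mixed_type_sequence(entities, rng)
print(sequence)
```

Because the same entity is sometimes replaced by its type from one ontology and sometimes by its type from the other, equivalent types appear in similar contexts, which is what lets their vectors end up close to each other.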
31. ESWC, Crete, 4th June 2018
5) Type Matching
Wikidata (label) DBpedia Sim
Q4498974 (ice hockey team) HockeyTeam 0.99
Q5107 (continent) Continent 0.99
Q17374546* (Australian rules football club) AustralianFootballTeam 0.99
Q3001412* (horse race) HorseRace 0.98
Q4022 (river) River 0.98
Q46970 (airline) Airline 0.98
Q18127 (record label) RecordLabel 0.98
Q13027888* (baseball team) BaseballTeam 0.98
Q11424 (film) Film 0.98
Q1075* (color) Colour 0.98
Q17156793* (American football team) AmericanFootballTeam 0.95
Q3146899* (diocese of the Catholic Church) Diocese 0.93
Q7944* (earthquake) Earthquake 0.91
* not declared equivalent in DBpedia
Outcome
● Types with highest similarity are equivalent classes in the
two ontologies (due to the use in text)
● Found equivalent types not declared as equivalent in
DBpedia
32. Conclusions and Future Work
Conclusions:
● Similarity with T2V injects relatedness in type similarity measures (from handwritten text corpora)
● T2V exhibits some desired properties (depth, sibling discrimination)
● T2V supports analogical reasoning
● T2V can support ontology matching
Future Work:
● Combine T2V similarity and topological similarities in one measure
● Study relation between sub-type relation and the vector representation
● Support ontology matching tasks
● Compare with other methods for vector-based type representations
33. Thank You
Workshop on Deep Learning for Knowledge Graphs
and Semantic Technologies (DL4KGS)
Co-located with ESWC 18, June 2018, Crete, Greece
Code and models are publicly available (see the paper for details)
Mail to: federico.bianchi@disco.unimib.it
34. ESWC, Crete, 4th June 2018
References
Bianchi, F., Palmonari, M., & Nozza, D. (2018). Towards Encoding Time in Text-Based Entity Embeddings. In International Semantic Web Conference (to appear).
Bianchi, F., & Palmonari, M. (2017). Joint Learning of Entity and Type Embeddings for Analogical Reasoning with Entities. In Proceedings of the NL4AI Workshop, co-located with the International Conference of the Italian Association for Artificial Intelligence (AI*IA).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48.
Zhu, G., & Iglesias, C. A. (2017). Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), 72–85.
Jayawardana, V., Lakmal, D., de Silva, N., Perera, A. S., Sugathadasa, K., & Ayesha, B. (2017). Deriving a representative vector for ontology classes with instance word vector embeddings. In INTECH (pp. 79–84).
Smaili, F. Z., Gao, X., & Hoehndorf, R. (2018). Onto2vec: joint vector-based representation of biological entities and their ontology-based annotations. arXiv preprint arXiv:1802.00864.