Type-aware Entity Retrieval
Darío Garigliotti
University of Stavanger
Motivation
∎ One of the unique characteristics of entity retrieval is that entities are typed.
∎ Typically, types are organized hierarchically in a type categorization system.
∎ We explore three dimensions of how to use entity type information:
⋆ RQ1: How do the retrieval approaches perform across different type taxonomies?
⋆ RQ2: How to represent the type information provided by the type hierarchy?
⋆ RQ3: How to combine type-based and text-based information in retrieval?
Type Taxonomies
We normalize four type systems to a uniform taxonomy structure:
DBpedia Ontology
∎ A well-designed hierarchy.
∎ Created manually by considering the
most frequently used infoboxes in
Wikipedia.
∎ Clean and consistent, but with limited
coverage.
|Level 1| = 58 types
|Level 2| = 114 types
|Level 3| = 142 types
|Level 4| = 213 types
|Level 5| = 45 types
|Level 6| = 17 types
|Level 7| = 1 type
Freebase Types
∎ A two-layer categorization system:
types and domains.
∎ Entities are assigned only to types;
most of them have “same as”
links to DBpedia entities.
|Level 1| = 92 types
|Level 2| = 1,626 types
Wikipedia Categories
∎ It consists of textual labels known
as categories.
∎ It is not a well-defined “is-a” hierarchy
but a graph, so it requires a
major normalization strategy.
∎ Category assignments are neither
consistent nor complete.
|Level 1| = 27 types
|Level 2 ∪ ... ∪ Level 10| = 121,657 types
|Level 11 ∪ ... ∪ Level 24| = 410,697 types
|Level 25 ∪ ... ∪ Level 34| = 14,564 types
YAGO Types
∎ A deep subsumption hierarchy.
∎ Constructed by taking leaf categories
from Wikipedia categories and then
using WordNet synsets to establish
the hierarchy.
|Level 1| = 61 types
|Level 2 ∪ ... ∪ Level 5| = 80,384 types
|Level 6 ∪ ... ∪ Level 10| = 461,843 types
|Level 11 ∪ ... ∪ Level 19| = 26,383 types
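As a rough illustration of what a uniform taxonomy structure makes possible, the sketch below stores a hierarchy as a child-to-parent map and counts the types at each level, which is the statistic reported for each type system above. The toy type names, the `PARENT` map, and the helper names are hypothetical, and treating a single root as level 0 is an assumption rather than a description of the actual normalization.

```python
from collections import Counter

# Hypothetical toy taxonomy: child type -> parent type; the root has parent None.
PARENT = {
    "Thing": None,
    "Agent": "Thing",
    "Place": "Thing",
    "Person": "Agent",
    "Organisation": "Agent",
    "City": "Place",
    "Athlete": "Person",
}


def level(type_id, parent=PARENT):
    """Depth of a type: number of edges from the type up to the root."""
    depth = 0
    while parent[type_id] is not None:
        type_id = parent[type_id]
        depth += 1
    return depth


def level_sizes(parent=PARENT):
    """Number of types found at each level of the hierarchy."""
    return Counter(level(t, parent) for t in parent)


sizes = level_sizes()
for lvl in sorted(sizes):
    print(f"|Level {lvl}| = {sizes[lvl]} types")
```

A graph-shaped system such as Wikipedia categories would first need to be reduced to a structure like this one; that step is not sketched here.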
Type Representations
We propose three representations of
hierarchical type information:
∎ Types along path to the top
∎ Top-level types
∎ Most specific types
[Figure: each representation is illustrated on the same example hierarchy, with types t0 to t12 and an entity e.]
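The three representations can be sketched on a toy hierarchy as follows. This is a minimal illustration under stated assumptions, not the exact procedure behind the poster: the `PARENT` map, the type assignment of the entity, and all function names are hypothetical, the root is excluded from every representation, and "most specific" is read here as "assigned types that are not an ancestor of another assigned type".

```python
# Hypothetical toy taxonomy: child type -> parent type; "t0" is the root.
PARENT = {
    "t0": None,
    "t1": "t0", "t2": "t0",
    "t3": "t1", "t4": "t1", "t5": "t2",
    "t6": "t3",
}


def ancestors(t):
    """Types strictly above t, ordered bottom-up, excluding the root."""
    out = []
    t = PARENT[t]
    while t is not None and PARENT[t] is not None:
        out.append(t)
        t = PARENT[t]
    return out


def path_to_top(assigned):
    """Assigned types together with all of their non-root ancestors."""
    result = set()
    for t in assigned:
        result.add(t)
        result.update(ancestors(t))
    return result


def top_level(assigned):
    """Only the types that sit directly below the root on each path."""
    return {
        t for t in path_to_top(assigned)
        if PARENT[t] is not None and PARENT[PARENT[t]] is None
    }


def most_specific(assigned):
    """Assigned types that are not an ancestor of another assigned type."""
    covered = set()
    for t in assigned:
        covered.update(ancestors(t))
    return set(assigned) - covered


# Hypothetical entity e with three assigned types.
entity_types = {"t3", "t6", "t5"}
print(sorted(path_to_top(entity_types)))
print(sorted(top_level(entity_types)))
print(sorted(most_specific(entity_types)))
```

The path-based variant yields the largest type set per entity and the other two are subsets of it, so the choice mainly trades coverage against specificity.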
Type Information in Retrieval
We define the retrieval task in a generative probabilistic framework.
Both query and entity
are considered in the
term space as well as in
the type space.
An oracle process can
provide the target types
for the query from its
relevant results.
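[Figure: the query (example: “Olympic games”) and the entity (example: “Rio de Janeiro”) are compared through term-based similarity and, between the query's target types and the entity's types, through type-based similarity.]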
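(Strict) Filtering
$$P(q \mid e) = P(\theta^{T'}_q \mid \theta^{T'}_e) \cdot \chi[\mathit{types}(q) \cap \mathit{types}(e) \neq \emptyset]$$
(Soft) Filtering
$$P(q \mid e) = P(\theta^{T'}_q \mid \theta^{T'}_e) \cdot P(\theta^{T}_q \mid \theta^{T}_e)$$
Interpolation
$$P(q \mid e) = (1 - \lambda) \cdot P(\theta^{T'}_q \mid \theta^{T'}_e) + \lambda \cdot P(\theta^{T}_q \mid \theta^{T}_e)$$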
The type weight λ takes values in [0, 1] in steps of 0.05. We use the
best-performing setting when comparing against other approaches.
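The three ways of combining the two components, together with the λ grid, can be sketched as below. The code assumes that the term-based and type-based probabilities have already been computed and are passed in as plain floats (`p_term`, `p_type`). Reading θ with superscript T′ as the term-based model and θ with superscript T as the type-based model is an inference from λ being called the type weight. All function names and the `evaluate` callback are hypothetical.

```python
def strict_filtering(p_term, query_types, entity_types):
    """Term-based score, kept only if the query and the entity share a type."""
    return p_term if set(query_types) & set(entity_types) else 0.0


def soft_filtering(p_term, p_type):
    """Product of the term-based and type-based components."""
    return p_term * p_type


def interpolation(p_term, p_type, lam):
    """Linear mixture; lam is the weight given to the type-based component."""
    return (1.0 - lam) * p_term + lam * p_type


# Grid 0, 0.05, ..., 1.0 (21 values); rounding avoids float drift.
LAMBDAS = [round(0.05 * i, 2) for i in range(21)]


def best_lambda(evaluate):
    """Return the grid value that maximizes a caller-supplied quality measure.

    `evaluate` is a hypothetical callback mapping lam to a score such as MAP.
    """
    return max(LAMBDAS, key=evaluate)


print(strict_filtering(0.4, {"City"}, {"City", "Place"}))
print(soft_filtering(0.4, 0.5))
print(interpolation(0.4, 0.8, 0.25))
```

With λ = 0 interpolation falls back to the term-based score, whereas both filtering variants drive the score to zero whenever the type component is zero. This is one way to see why missing type information affects the filtering models more directly.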
Results
[Figure: MAP (axis from 0 to 0.4) of strict filtering, soft filtering, and interpolation on DBpedia, Freebase, Wikipedia, and YAGO, in three panels: (a) types along path to top, (b) top-level types, (c) most specific types.]
Fig. 1: Retrieval performance considering only entities that have types from all four type systems.
The term-based baseline (shown with the red line) and the ground truth are restricted to the same set of entities.
[Figure: MAP (axis from 0 to 0.4) of strict filtering, soft filtering, and interpolation on DBpedia, Freebase, Wikipedia, and YAGO, in three panels: (a) types along path to top, (b) top-level types, (c) most specific types.]
Fig. 2: Retrieval performance considering all entities, and using the full set of relevance judgments.
The red line represents the term-based baseline.
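Both figures report MAP. For reference, a minimal sketch of the standard definition of mean average precision under binary relevance is given below. The function names and the toy rankings are hypothetical, and this is not the evaluation tooling used to produce the figures.

```python
def average_precision(ranking, relevant):
    """Average of precision values at the ranks where relevant entities occur.

    Normalized by the total number of relevant entities, so relevant
    entities missing from the ranking lower the score.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, entity in enumerate(ranking, start=1):
        if entity in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, judgments):
    """Mean of average precision over all queries that have judgments."""
    scores = [
        average_precision(rankings.get(q, []), judgments[q]) for q in judgments
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries with ranked entity lists and judgments.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
judgments = {"q1": {"a", "c"}, "q2": {"y"}}
print(mean_average_precision(rankings, judgments))
```

Restricting the ground truth to a subset of entities, as in Fig. 1, corresponds to shrinking the `relevant` sets before calling these functions.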
Conclusions
∎ Type information proves most useful when larger,
deeper type taxonomies provide very specific types.
⋆ RQ1 (Type taxonomy): for a given type representation and retrieval model, Wikipedia performs best in most cases.
⋆ RQ2 (Type representation): using the most specific types is the most effective way to represent type information.
⋆ RQ3 (Retrieval model): all models suffer from missing type information, but interpolation appears to be the most robust.
