Improving Entity Retrieval on
Structured Data
Besnik Fetahu, Ujwal Gadiraju and Stefan Dietze
Outline
• Introduction
• Entity Retrieval: keyword vs. structured queries
• Motivation
• Approach
• Pre-processing: Clustering
• Entity Retrieval
• Experimental Setup
• Evaluation and Results
• Conclusions
Introduction
• Large number of available structured and semi-structured
datasets (LOD, Web Data Commons)
• Entity-centric nature of data
• Ad-hoc entity-centric user queries
• Retrieval based on natural language queries
• Structured queries to harness explicit links between entities
(e.g. rdfs:seeAlso, owl:sameAs etc.)
• Multiple representations of entities from various sources
Entity Retrieval: “keywords”
• BM25F: standard IR model for entity retrieval (Blanco et al., ISWC 2011); see the sketch below
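To make the field-based scoring concrete, here is a minimal BM25F sketch over a toy two-entity index. The field weights, k1, and b values are illustrative assumptions, not the tuned parameters of Blanco et al., and all data is made up.

```python
import math
from collections import Counter

# Toy in-memory "index": each entity has a weighted title and body field.
ENTITIES = {
    "dbp:Bethlehem,_Pennsylvania": {
        "title": "bethlehem pennsylvania",
        "body": "city in the lehigh valley of eastern pennsylvania",
    },
    "dbp:The_Lehigh_Valley": {
        "title": "the lehigh valley",
        "body": "metropolitan region in eastern pennsylvania",
    },
}
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}  # illustrative, not tuned values
K1, B = 1.2, 0.75                            # standard BM25 defaults

def bm25f(query):
    docs = {e: {f: txt.split() for f, txt in fs.items()} for e, fs in ENTITIES.items()}
    n = len(docs)
    avg_len = {f: sum(len(d[f]) for d in docs.values()) / n for f in FIELD_WEIGHTS}
    scores = Counter()
    for term in query.lower().split():
        df = sum(any(term in d[f] for f in FIELD_WEIGHTS) for d in docs.values())
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for e, d in docs.items():
            # Field-weighted, length-normalized pseudo term frequency.
            tf = sum(FIELD_WEIGHTS[f] * d[f].count(term)
                     / (1 - B + B * len(d[f]) / avg_len[f]) for f in FIELD_WEIGHTS)
            scores[e] += idf * tf / (K1 + tf)
    return scores.most_common()

print(bm25f("lehigh valley"))
```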
Entity Retrieval: “structured queries”
• BM25 + SPARQL (Tonon et al., SIGIR 2012)
• Exploit explicit entity linking statements for retrieval
• Linear weighting between the BM25 score and the string distance to the query (e.g. Jaro-Winkler distance)
• Query expansion through implicit relevance feedback
Motivation
[Figure: log-log scatter plot of the frequency of object properties against the frequency of explicit similarity statements]
• Explicit entity linking statements improve retrieval [1]
[1] Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval, SIGIR 2012
• equivalence: dbp:Bethlehem,_Pennsylvania owl:sameAs fb:Bethlehem (Pennsylvania)
• relatedness: dbp:Bethlehem,_Pennsylvania rdfs:seeAlso dbp:The_Lehigh_Valley
• redirects: dbp:Bethlehem,_Pennsylvania dbo:WikiPageRedirects dbr:Bethlehem,_PA
• Sparsity of explicit linking statements
• Majority of links have no properly defined semantics
Motivation (I)
• Most queries are entity-centric [2]
• Relevant entities in the result set are usually from related entity types
• Few entity types (e.g. ‘Person’) are affiliated with many entity types
[Figure: query type affinity heatmap (scale 0 to 1) between query entity types such as Person, Place, City, Organization, University, Broadcaster, CreativeWork, Weapon and related result entity types such as Artist, Film, Musical Work, Stadium, Computer Software, Educational Organization]
Query type affinity: given an entity-centric query, entities of a specific type are more likely to be relevant than others, e.g. q = {’Barack Obama’} hasType Person. A toy estimation sketch follows.
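As a concrete illustration, a small sketch that estimates p(t_e | t_q) from hypothetical (query type, relevant entity type) judgment pairs; both the pairs and the simple empirical estimator are assumptions for illustration, not the paper's data or method.

```python
from collections import Counter

# Hypothetical relevance judgments: (query entity type, type of a relevant entity).
judgments = [
    ("Person", "Artist"), ("Person", "Artist"), ("Person", "Organization"),
    ("City", "ArchitecturalStructure"), ("City", "Organization"),
]

pair_counts = Counter(judgments)
query_type_counts = Counter(tq for tq, _ in judgments)

def p_te_given_tq(te, tq):
    """Empirical estimate of p(t_e | t_q): how often entities of type te
    are relevant for queries of type tq."""
    total = query_type_counts[tq]
    return pair_counts[(tq, te)] / total if total else 0.0

print(p_te_given_tq("Artist", "Person"))  # 2/3: Artist dominates Person queries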
Approach
[Figure: approach overview. From the BTC dataset, entities are grouped by type via ⟨e rdf:type t⟩ statements. Pre-processing: (I) entity feature vectors F(e) = {W1, W2, φ}; (II) LSH entity bucketing ({e1, ..., ek}, {ei, ..., ei+m}) followed by entity clustering (x-means, spectral clustering), producing 1. an index and 2. clusters. Online: (III) query analysis of the user query (e.g. ‘Barack Obama’ isA Person); (IV & V) retrieval and ranking: 1. BM25F, 2. BM25F + clustering, 3. entity reranking]
Pre-processing
1. Entity Feature Vectors
2. Entity Bucketing and Clustering
Online retrieval
1. Query Analysis
2. Entity Retrieval
3. Entity Ranking
Pre-Processing: Feature Vectors
Wn: n-gram dictionary scored through tf-idf
φ: {0,1} indicator of whether a property of type t is present in entity e

$F(e) = \{W_1(e), W_2(e), \phi(e)\}$
$W_1 = [\langle u_1; \mathrm{tfidf}(u_1)\rangle, \ldots, \langle u_n; \mathrm{tfidf}(u_n)\rangle]$
$W_2 = [\langle b_1; \mathrm{tfidf}(b_1)\rangle, \ldots, \langle b_n; \mathrm{tfidf}(b_n)\rangle]$
$\phi = [\phi(o_1, e), \ldots, \phi(o_n, e)], \quad \phi(o_i, e) \in \{0, 1\}, \; i \in \{1, \ldots, n\}$
rdfs:label Barack Obama
rdfs:comment Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn ɵˈbɑːmə/; born August 4,
1961) is the 44th and current President of the United States, and the first African American to
hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and
Harvard Law School, where he served as president of the Harvard Law Review. He was a
community organizer in Chicago before earning his law degree.
foaf:name Barack Obama
dc:description American politician, 44th President of the United States
foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Barack_Obama
dcterms:subject http://dbpedia.org/resource/Category:Nobel_Peace_Prize_laureates
dcterms:subject http://dbpedia.org/resource/Category:Presidents_of_the_United_States
dcterms:subject http://dbpedia.org/resource/Category:Obama_family
dcterms:subject http://dbpedia.org/resource/Category:American_civil_rights_lawyers
rdfs:seeAlso http://dbpedia.org/resource/United_States_Senate
• n-grams from literals
• object properties
• entity type level statistics
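A sketch of assembling F(e) for a handful of entities, assuming scikit-learn and NumPy are available; the literals, the property list, and the (1, 2) n-gram range are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Concatenated literals per entity (rdfs:label, rdfs:comment, dc:description, ...).
literals = {
    "dbp:Barack_Obama": "Barack Obama 44th President of the United States",
    "dbp:Joe_Biden": "Joe Biden American politician Vice President",
}
# Object properties observed at the entity type level (toy subset).
type_properties = ["Category:Presidents_of_the_United_States", "Category:Obama_family"]
entity_properties = {
    "dbp:Barack_Obama": {"Category:Presidents_of_the_United_States",
                         "Category:Obama_family"},
    "dbp:Joe_Biden": {"Category:Presidents_of_the_United_States"},
}

entities = sorted(literals)
# W1 and W2: unigram and bigram dictionaries scored with tf-idf.
W = TfidfVectorizer(ngram_range=(1, 2)).fit_transform([literals[e] for e in entities])
# phi: binary indicator per object property of the type.
phi = np.array([[1.0 if p in entity_properties[e] else 0.0 for p in type_properties]
                for e in entities])
F = np.hstack([W.toarray(), phi])  # F(e) = {W1(e), W2(e), phi(e)}
print(F.shape)
```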
Pre-processing: Clustering
• Remedy sparsity of explicit entity linking statements by
clustering entities at type level
• Compute entity buckets through Locality-Sensitive Hashing
• Min-Hash signatures for every entity instance
• Entities that are likely to be similar are grouped into the same
hash bucket
• Entity clustering at the entity bucket level
• x-means
• spectral clustering
• Distance between entities and the cluster centroids is measured through the Euclidean distance; sketches of the bucketing and clustering steps follow below
$d(e, e') = \sqrt{\sum \left(F(e) - F(e')\right)^2}$
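A minimal min-hash/LSH bucketing sketch over entity token sets. The number of hash functions and bands, and the salted MD5 stand-in for random permutations, are illustrative choices rather than the paper's configuration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES, BANDS = 20, 10  # illustrative; 2 signature rows per band
ROWS = NUM_HASHES // BANDS

def h(salt, token):
    """Deterministic salted hash standing in for a min-hash permutation."""
    digest = hashlib.md5(f"{salt}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens):
    return [min(h(salt, t) for t in tokens) for salt in range(NUM_HASHES)]

def lsh_buckets(entity_tokens):
    """Entities whose signatures agree on all rows of some band share a bucket."""
    buckets = defaultdict(set)
    for entity, tokens in entity_tokens.items():
        sig = minhash_signature(tokens)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].add(entity)
    return [es for es in buckets.values() if len(es) > 1]

entity_tokens = {
    "e1": {"city", "pennsylvania", "lehigh", "valley", "usa"},
    "e2": {"city", "pennsylvania", "lehigh", "valley", "east"},
    "e3": {"bird", "species", "migration", "wingspan", "habitat"},
}
print(lsh_buckets(entity_tokens))  # e1 and e2 very likely land in a shared bucket
```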
[Figure: clustering pipeline for entity instances e1, ..., en of type t: min-hash signatures are computed per entity, LSH entity bucketing groups similar entities into buckets, and entity clustering partitions each bucket into clusters (cluster 1, cluster 2, cluster 3)]
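Within each bucket, entities are then clustered on their feature vectors. Below is a sketch using scikit-learn's SpectralClustering on a made-up bucket (x-means is not in scikit-learn; the pyclustering library offers an implementation), together with the Euclidean distance defined above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Feature vectors F(e) for the entities of one LSH bucket (toy values).
bucket = np.array([
    [0.9, 0.1, 1.0],
    [0.8, 0.2, 1.0],
    [0.1, 0.9, 0.0],
    [0.2, 0.8, 0.0],
])
labels = SpectralClustering(n_clusters=2, random_state=0).fit_predict(bucket)
print(labels)  # two groups of two entities

def euclidean(f1, f2):
    """d(e, e') = sqrt(sum (F(e) - F(e'))^2), as used for centroid distances."""
    return float(np.sqrt(((f1 - f2) ** 2).sum()))
```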
Entity Retrieval: Result-set Expansion
• For each user query q, retrieve an initial set of top-k entities Eb through BM25F
• Expand the result set with additional entities that are co-clustered with the entities in Eb
Scoring of expanded entities, given the top-k result set $E_b = \{e_{n-2}, e_{n-1}, \ldots\}$ and the co-clustered candidates $E_c = \{e_{i+3}, e_3, e_{i+2}\}$:

$sim(q, e_c) = \lambda \frac{\varphi(q, e_c)}{\varphi(q, e_b)} + (1 - \lambda)\, d(e_b, e_c)$

where $\varphi$ is the string similarity to the query q and $d(e_b, e_c)$ is the distance of $e_c$ to the initially retrieved entity $e_b$. A scoring sketch follows after the figure.
[Figure: the clusters from the running example; expanded entities (e.g. ei+2, ei+3, e3) are drawn from the clusters that contain the initially retrieved entities]
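A sketch of the expansion score, with λ = 0.5 as an assumed interpolation weight and difflib's sequence ratio standing in for the Jaro-Winkler similarity φ; the entity labels and the cluster distance are toy values.

```python
from difflib import SequenceMatcher

LAMBDA = 0.5  # interpolation weight, illustrative

def phi(query, label):
    """String similarity of an entity label to the query; difflib's ratio
    stands in here for the Jaro-Winkler similarity used in the paper."""
    return SequenceMatcher(None, query.lower(), label.lower()).ratio()

def sim(query, e_b_label, e_c_label, d_eb_ec):
    """sim(q, e_c) = lambda * phi(q, e_c) / phi(q, e_b) + (1 - lambda) * d(e_b, e_c)."""
    return LAMBDA * phi(query, e_c_label) / phi(query, e_b_label) \
        + (1 - LAMBDA) * d_eb_ec

# e_b was retrieved by BM25F; e_c is co-clustered with it (toy distance value).
print(sim("bethlehem pennsylvania",
          "Bethlehem, Pennsylvania", "The Lehigh Valley", d_eb_ec=0.3))
```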
Entity Retrieval: Result set re-ranking
• For a given entity-centric query, rank the entities based on the query type affinity (certain entity types are more likely to be relevant)
• In case of contextual query terms (e.g. ‘Harry Potter movie’), consider how well a given entity instance covers those terms
Query type affinity ranking:

$\gamma(t_e, t_q) = \frac{p(t_e \mid t_q)}{\sum_{t'_q \neq t_q} \left(1 - p(t_e \mid t'_q)\right)}$

Query context overlap:

$\mathrm{context}(q, e) = \frac{1}{|C_x|} \sum_{c_x \in C_x} \mathbb{1}\{e \text{ has } c_x\}$

Entity rank score (a computation sketch follows):

$\alpha(e, t_q) = \lambda \left(\mathrm{rank\_score}(e) \cdot \gamma(t_e, t_q)\right) + (1 - \lambda) \cdot \mathrm{context}(q, e)$
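A sketch wiring the three quantities together; λ, the probabilities, and the context terms below are all toy values for illustration.

```python
LAMBDA = 0.5  # interpolation weight, illustrative

def gamma(p_te_given_tq, p_te_given_other_tq):
    """gamma(t_e, t_q) = p(t_e|t_q) / sum over t'_q != t_q of (1 - p(t_e|t'_q))."""
    denom = sum(1.0 - p for p in p_te_given_other_tq)
    return p_te_given_tq / denom if denom else 0.0

def context(query_context_terms, entity_attributes):
    """Fraction of contextual query terms (e.g. 'movie') the entity covers."""
    cx = set(query_context_terms)
    return len(cx & set(entity_attributes)) / len(cx) if cx else 0.0

def alpha(rank_score, g, ctx):
    """alpha(e, t_q) = lambda * (rank_score(e) * gamma) + (1 - lambda) * context."""
    return LAMBDA * rank_score * g + (1 - LAMBDA) * ctx

g = gamma(0.6, [0.1, 0.2, 0.4])                       # affinity of t_e to t_q
ctx = context(["movie"], ["movie", "fantasy", "uk"])  # full contextual coverage
print(alpha(rank_score=0.8, g=g, ctx=ctx))
```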
Experimental Setup
Dataset: BTC’12
• 1.4 billion triples
• 107,967 data graphs
• 3,321 entity types
• 454 million entity instances
Entity Bucketing and Clustering
• ~77,485 entities fed into LSH
bucketing algorithm
• ~400 entities on average per bucket fed into the clustering approaches
• ~13–38 clusters
• ~10–20 entities per cluster
Queries: SemSearch[1]
• 92 queries
[1] http://km.aifb.kit.edu/ws/semsearch10/
[2] T. Neumann and G. Weikum. Rdf-3x: A risc-style engine for rdf. Proc. VLDB Endow.,1(1):647–659, Aug. 2008. 

Data Indexes
• RDF3X[2] and Lucene Index
• title + body fields
• body (consists of all literals of an entity)
Experimental Setup (I)
• B: baseline BM25F approach
• S1: state-of-the-art approach (Tonon et al., SIGIR’12) with one-hop entities
• Our approach:
• SP: entities are expanded from clusters generated through spectral clustering
• XM: entities are expanded from clusters generated through x-means clustering
Evaluation Results: Clustering Accuracy
[Figures: (left) worker agreement on cluster accuracy across the random sample of clusters, x-means vs. spectral; (right) number of clusters at accuracy levels 100%, 80%, 70%, 60%, 50%, x-means vs. spectral]
• Crowdsourcing evaluation of clustering accuracy on 100 randomly selected clusters with a total of 1,000 entities
• Crowd workers: “Pick the odd entity out!”
• High agreement rate between workers: 0.75 and 0.6 for the spectral and x-means clustering approaches, respectively
Evaluation Results: Entity Retrieval
• Evaluate the retrieval task through crowdsourcing (Blanco et al., SIGIR’11)
• Entities are assessed by 3 crowd workers on a 5-point Likert scale
Evaluation Results: Entity Retrieval
• Our approaches retrieve significantly more relevant entities at relevance grades 3-5, with no difference at relevance grade 2
• Impact of cluster size and of the number of expanded entities per cluster on retrieval quality (second figure below)
[Figure: number of entities per relevance grade (2-5) for Bt, S1t, SPt, XMt, Bb, S1b, SPb, XMb]
[Figure: Avg. NDCG for result set expansion configurations (cluster size and number of expanded entities per cluster: 5-1, 10-1, 10-5, 20-1, 20-5, 50-1, 50-5, 100-1, 100-5, 1000-1, 1000-5) for XMt, SPt, XMb, SPb]
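For reference, a minimal NDCG@k computation over graded relevance judgments; the (2^rel - 1) gain is one common formulation and is an assumption here, not necessarily the exact variant used in this evaluation.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with (2^rel - 1) gains."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0

# Crowd relevance grades of the returned entities, in rank order (toy values).
print(ndcg_at_k([4, 2, 3, 0, 1], k=5))
```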
Conclusions
• Explicit entity linking statements improve the process of entity
retrieval on structured data
• Explicit linking statements are sparse in collections like the
BTC’12
• Clustering approaches can be used to remedy the sparsity of
such links
• Given the scale of structured data, bucketing approaches like LSH drastically improve scalability
• For a given entity-centric query, certain entity types are more
likely to be relevant
• Similarity of the entity to the query is highly important when
expanding the result set
Thank you!
Questions?
