Improving Entity Retrieval on
Structured Data
Besnik Fetahu, Ujwal Gadiraju and Stefan Dietze
Outline
• Introduction
• Entity Retrieval: keyword vs. structured queries
• Motivation
• Approach
• Pre-processing: Clustering
• Entity Retrieval
• Experimental Setup
• Evaluation and Results
• Conclusions
Introduction
• Large number of available structured and semi-structured
datasets (LOD, Web Data Commons)
• Entity-centric nature of data
• Ad-hoc entity-centric user queries
• Retrieval based on natural language queries
• Structured queries to harness explicit links between entities
(e.g. rdfs:seeAlso, owl:sameAs etc.)
• Multiple representations of entities from various sources
Entity Retrieval: “keywords”
• BM25F: standard IR model for entity retrieval (Blanco et al.,
ISWC 2011)
Entity Retrieval: “structured queries”
• BM25 + SPARQL (Tonon et al., SIGIR 2012)
• Exploit explicit entity linking statements for retrieval
• Linear weighting between BM25 score and string distance to the query (e.g. Jaro-Winkler distance)
• Query expansion through implicit relevance feedback
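The linear weighting mentioned above can be illustrated with a minimal sketch. It is not the implementation of Tonon et al.: the function names, the weight of 0.5, and the assumption that the BM25 score is already normalised to [0, 1] are all hypothetical. The code uses Jaro-Winkler similarity (1 minus the distance), so higher values mean a closer match.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def combined_score(bm25: float, query: str, label: str, weight: float = 0.5) -> float:
    """Linear mix of a pre-normalised BM25 score and string similarity.

    `weight` is an assumed parameter, not a value taken from the slides.
    """
    return weight * bm25 + (1.0 - weight) * jaro_winkler(query.lower(), label.lower())
```

For example, `combined_score(0.8, "Barack Obama", "Barack Obama")` weights a BM25 score of 0.8 equally against an exact label match.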
Motivation
[Figure: log-log plot; axes: frequency of object properties vs. frequency of explicit similarity statements]
• Explicit entity linking statements
improve retrieval[1]
[1] Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval, SIGIR 2012
equivalence: dbp:Bethlehem,_Pennsylvania owl:sameAs fb:Bethlehem (Pennsylvania)
relatedness: dbp:Bethlehem,_Pennsylvania rdfs:seeAlso dbp:The_Lehigh_Valley
redirects: dbp:Bethlehem,_Pennsylvania dbo:WikiPageRedirects dbr:Bethlehem,_PA
• Sparsity of explicit linking
statements
• Majority of links have no
properly defined semantics
Motivation (I)
• Most queries are entity-
centric[2]
• Relevant entities in the
result set are usually from
related entity types
• Few entity types (e.g.
‘Person’) are affiliated with
many entity types
[Figure: query type affinity between entity types (e.g. Artist, Organization, Film, City, Person, Place, University, Weapon), on a scale from 0 to 1]
Query type affinity: given an entity-centric query, entities of a specific type are more likely to be relevant than the others.
q = {‘Barack Obama’} hasType Person
Approach
[Figure: approach overview. Entities from the BTC dataset are grouped by type (⟨e rdf:type t_1⟩, ..., ⟨e rdf:type t_m⟩). Pre-processing: (I) entity feature vectors F(e) = {W_1, W_2, φ}; (II) LSH entity bucketing and entity clustering (x-means, spectral clustering), producing (1) an index and (2) clusters. Online: (III) query analysis of the user query (e.g. ‘Barack Obama’ isA Person); (IV & V) retrieval and ranking with (1) BM25F, (2) BM25F + clustering, (3) entity reranking.]
Pre-processing
1. Entity Feature Vectors
2. Entity Bucketing and Clustering
Online retrieval
1. Query Analysis
2. Entity Retrieval
3. Entity Ranking
Pre-Processing: Feature Vectors
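W_n: n-gram dictionary scored through tf-idf
φ: {0,1}, indicating whether a property for type t is present in entity e
$F(e) = \{W_1(e), W_2(e), \phi\}$
$W_1 = [\langle u_1, \mathrm{tfidf}(u_1) \rangle, \ldots, \langle u_n, \mathrm{tfidf}(u_n) \rangle]$
$W_2 = [\langle b_1, \mathrm{tfidf}(b_1) \rangle, \ldots, \langle b_n, \mathrm{tfidf}(b_n) \rangle]$
$\phi = [\phi(o_1, e), \ldots, \phi(o_n, e)]$
$\phi(o_i, e) \to [0, 1], \quad i \in \{1, \ldots, n\}$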
rdfs:label Barack Obama
rdfs:comment Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn ɵˈbɑːmə/; born August 4,
1961) is the 44th and current President of the United States, and the first African American to
hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and
Harvard Law School, where he served as president of the Harvard Law Review. He was a
community organizer in Chicago before earning his law degree.
foaf:name Barack Obama
dc:description American politician, 44th President of the United States
foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Barack_Obama
dcterms:subject http://dbpedia.org/resource/Category:Nobel_Peace_Prize_laureates
dcterms:subject http://dbpedia.org/resource/Category:Presidents_of_the_United_States
dcterms:subject http://dbpedia.org/resource/Category:Obama_family
dcterms:subject http://dbpedia.org/resource/Category:American_civil_rights_lawyers
rdfs:seeAlso http://dbpedia.org/resource/United_States_Senate
• n-grams from literals
• object properties
• entity type level statistics
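As a rough illustration of the three feature groups listed above, the sketch below builds tf-idf weighted unigram and bigram dictionaries from an entity's literals, plus a binary indicator for each object property observed for the type. The input layout (`literals`, `properties`), the function names, and the plain `log(N / df)` idf are assumptions made for this example, not details from the slides.

```python
import math
from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    """Whitespace-tokenised, lower-cased word n-grams of one literal."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """tf-idf per document, with idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def feature_vectors(entities: list[dict], type_properties: list[str]) -> list[dict]:
    """Build F(e) = {W1, W2, phi} for every entity of one type."""
    grams = lambda e, n: [g for lit in e["literals"] for g in ngrams(lit, n)]
    w1 = tfidf_vectors([grams(e, 1) for e in entities])  # unigrams
    w2 = tfidf_vectors([grams(e, 2) for e in entities])  # bigrams
    features = []
    for e, u, b in zip(entities, w1, w2):
        # phi: 1 if the entity has the property, 0 otherwise.
        phi = [1 if prop in e["properties"] else 0 for prop in type_properties]
        features.append({"W1": u, "W2": b, "phi": phi})
    return features


# Hypothetical toy input modelled on the example entity above.
entities = [
    {"literals": ["Barack Obama", "American politician"],
     "properties": {"rdfs:seeAlso", "dcterms:subject"}},
    {"literals": ["Michelle Obama", "American lawyer"],
     "properties": {"dcterms:subject"}},
]
F = feature_vectors(entities, ["dcterms:subject", "rdfs:seeAlso"])
```

Terms shared by every entity (here "obama") receive a weight of zero under this idf, so only distinguishing n-grams contribute to later distance computations.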
Pre-processing: Clustering
• Remedy sparsity of explicit entity linking statements by
clustering entities at type level
• Compute entity buckets through Locality-Sensitive Hashing
• Min-Hash signatures for every entity instance
• Entities that are likely to be similar are grouped into the same
hash bucket
• Entity clustering at the entity bucket level
• x-means
• spectral clustering
• The distance between entities and the cluster centroids is measured through the Euclidean distance:
$d(e, e') = \sqrt{\sum \left( F(e) - F(e') \right)^2}$
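The bucketing step can be sketched as follows, assuming each entity is represented by a non-empty set of tokens. The number of hash functions (16), the banding scheme (4 bands), and the use of MD5 as a seeded hash are illustrative choices for this example; the slides do not state the actual parameters. A small Euclidean distance helper for dense feature vectors is included as well.

```python
import hashlib
import math
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash; MD5 is an arbitrary choice made for this sketch."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: set[str], num_hashes: int = 16) -> tuple[int, ...]:
    """Min-hash signature: the minimum hash value per seeded hash function.

    `tokens` must be non-empty.
    """
    return tuple(min(_hash(seed, tok) for tok in tokens) for seed in range(num_hashes))


def lsh_buckets(entities: dict[str, set[str]], num_hashes: int = 16, bands: int = 4):
    """Group entities whose signatures agree on at least one band of rows."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, tokens in entities.items():
        sig = minhash_signature(tokens, num_hashes)
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(name)
    return buckets


def euclidean(x: list[float], y: list[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical toy entities: "a" and "b" share all tokens, "c" shares none.
entities = {
    "a": {"barack", "obama"},
    "b": {"barack", "obama"},
    "c": {"lehigh", "valley"},
}
buckets = lsh_buckets(entities)
```

Only entities that land in the same bucket need to be compared during clustering, which is what keeps the later x-means or spectral clustering step small.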
[Figure: entity instances of type t (e_1, ..., e_n) are mapped to min-hash signatures, grouped into buckets by LSH entity bucketing, and clustered within the buckets (cluster 1, cluster 2, cluster 3)]
Entity Retrieval: Result-set Expansion
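• For each user query q, retrieve an initial set of top-k entities E_b through BM25F
• Expand the result set with additional entities E_c that are co-clustered with the entities in E_b
Scoring of expanded entities:
$sim(q, e_c) = \lambda \frac{\varphi(q, e_c)}{\varphi(q, e_b)} + (1 - \lambda)\, d(e_b, e_c)$
where $\varphi$ is the string similarity to query q and $d(e_b, e_c)$ is the distance to the initially retrieved entity
Example: E_b = {e_{n-2}, e_{n-1}, ...} (top-k result set), E_c = {e_{i+3}, e_3, e_{i+2}}
[Figure: entities grouped into cluster 1, cluster 2, and cluster 3; the top-k entities e_{n-2} and e_{n-1} are expanded with the co-clustered entities e_{i+3}, e_3, and e_{i+2}]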
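A minimal sketch of the expansion step, under several assumptions: the score combines the ratio of string similarities (expanded entity vs. initially retrieved entity) with the distance between the two entities, mixed by a weight called `lam` here; `difflib.SequenceMatcher` stands in for whatever string similarity the authors used; and the cluster assignment, labels, and distance function are supplied by the caller. All names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not the authors' measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def expand(query, top_k, cluster_of, labels, distance, lam=0.5):
    """Score entities co-clustered with the initially retrieved entities.

    query      -- keyword query string
    top_k      -- entity ids returned by the baseline (e.g. BM25F)
    cluster_of -- mapping entity id -> cluster id
    labels     -- mapping entity id -> label used for string similarity
    distance   -- callable (e_b, e_c) -> float
    lam        -- assumed mixing weight between the two terms
    """
    members = defaultdict(list)
    for entity, cluster in cluster_of.items():
        members[cluster].append(entity)

    scored = {}
    for e_b in top_k:
        base = string_sim(query, labels[e_b])
        if base == 0.0:
            continue  # the ratio is undefined for this base entity
        for e_c in members[cluster_of[e_b]]:
            if e_c in top_k:
                continue  # already part of the result set
            score = lam * string_sim(query, labels[e_c]) / base \
                + (1.0 - lam) * distance(e_b, e_c)
            # Keep the best score if e_c is reachable from several e_b.
            scored[e_c] = max(scored.get(e_c, float("-inf")), score)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: e2 shares a cluster with the retrieved entity e1.
labels = {"e1": "Barack Obama", "e2": "Barack Obama Sr.", "e3": "Lehigh Valley"}
cluster_of = {"e1": 0, "e2": 0, "e3": 1}
result = expand("barack obama", ["e1"], cluster_of, labels, lambda a, b: 0.0)
```

In this toy run only `e2` is returned, because `e3` sits in a different cluster and is never considered for expansion.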
Entity Retrieval: Result set re-ranking
• For a given entity-centric query, rank the entities based on the
query type affinity (certain entity types are more likely to be
relevant)
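• In case of contextual query terms (e.g. ‘Harry Potter movie’), consider the coverage from a given entity instance
Query type affinity ranking:
$\gamma(t_e, t_q) = \frac{p(t_e \mid t_q)}{\sum_{t'_q \neq t_q} \left( 1 - p(t_e \mid t'_q) \right)}$
Query context overlap:
$context(q, e) = \frac{1}{|Cx|} \sum_{cx \in Cx} \mathbb{1}_{e \text{ has } cx}$
Entity rank score:
$\alpha(e, t_q) = \lambda \cdot \left( rank\_score(e) \cdot \gamma(t_e, t_q) \right) + (1 - \lambda) \cdot context(q, e)$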
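The re-ranking step can be sketched as three small functions: a type affinity computed from conditional probabilities p(t_e | t_q), a context overlap that measures which share of the contextual query terms an entity covers, and a weighted combination of both with the original rank score. The probability table layout, the substring test for "e has cx", and the weight `lam` are assumptions made for illustration.

```python
def type_affinity(p: dict, t_e: str, t_q: str) -> float:
    """Affinity of entity type t_e to query type t_q.

    p maps query type -> {entity type -> p(t_e | t_q)}.
    Computes p(t_e | t_q) / sum over other query types of (1 - p(t_e | t'_q)).
    """
    denom = sum(1.0 - probs.get(t_e, 0.0) for t, probs in p.items() if t != t_q)
    return p[t_q].get(t_e, 0.0) / denom if denom else 0.0


def context_overlap(context_terms: list[str], entity_text: str) -> float:
    """Fraction of contextual query terms that occur in the entity's text."""
    if not context_terms:
        return 0.0
    text = entity_text.lower()
    return sum(term.lower() in text for term in context_terms) / len(context_terms)


def rerank_score(rank_score: float, affinity: float, context: float,
                 lam: float = 0.5) -> float:
    """Weighted mix of (rank score * type affinity) and context overlap."""
    return lam * rank_score * affinity + (1.0 - lam) * context


# Hypothetical probabilities for two query types.
p = {
    "Person": {"Person": 0.8, "Film": 0.2},
    "Film": {"Person": 0.3, "Film": 0.7},
}
affinity = type_affinity(p, "Person", "Person")
overlap = context_overlap(["movie", "book"], "Harry Potter movie")
score = rerank_score(1.0, 1.0, overlap)
```

With these toy numbers, a Person entity retrieved for a Person query is boosted, while the context overlap of 0.5 reflects that only "movie" is covered by the entity text.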
Experimental Setup
Dataset: BTC’12
• 1.4 billion triples
• 107,967 data graphs
• 3,321 entity types
• 454 million entity instances
Entity Bucketing and Clustering
• ~77,485 entities fed into LSH
bucketing algorithm
• ~400 entities on average for the
clustering approaches
• ~13–38 clusters
• ~10–20 entities per cluster
Queries: SemSearch[1]
• 92 queries
Data Indexes
• RDF3X[2] and Lucene Index
• title + body fields
• body (consists of all literals of an entity)
[1] http://km.aifb.kit.edu/ws/semsearch10/
[2] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow., 1(1):647–659, Aug. 2008.
Experimental Setup (I)
• B: baseline BM25F approach
• S1: state-of-the-art approach (Tonon et al., SIGIR’12) with one-hop entities
• Our approach:
• SP: entities are expanded from clusters generated through spectral clustering
• XM: entities are expanded from clusters generated through x-means clustering
Evaluation Results: Clustering Accuracy
[Figure: worker agreement on cluster accuracy (y-axis, 0.2 to 1) over random clusters (x-axis, 0 to 45), for x-means and spectral clustering]
[Figure: number of clusters (y-axis, 0 to 40) per accuracy level (100%, 80%, 70%, 60%, 50%), for x-means and spectral clustering]
• Crowdsourcing evaluation of clustering accuracy on 100 randomly selected clusters with a total of 1,000 entities
• Crowd workers: “Pick the odd entity out!”
• High agreement rate between workers, 0.75 and 0.6 for spectral
and x-means clustering approaches respectively
Evaluation Results: Entity Retrieval
• Evaluate the retrieval task through crowdsourcing (Blanco et al., SIGIR’11)
• Entities are assessed by 3 crowd workers, on a 5-point Likert
scale
Evaluation Results: Entity Retrieval
• Significantly more relevant entities on the scale 3-5, no
difference for the relevance score 2
• Cluster size and number of expanded entities per cluster
[Figure: number of entities (y-axis, 0 to 160) per entity relevance score (x-axis, 2 to 5), for Bt, S1t, SPt, XMt, Bb, S1b, SPb, XMb]
[Figure: average NDCG (y-axis, 0 to 0.6) per result set expansion configuration, given as cluster size and number of expanded entities per cluster (x-axis: 5-1, 10-1, 10-5, 20-1, 20-5, 50-1, 50-5, 100-1, 100-5, 1000-1, 1000-5), for XMt, SPt, XMb, SPb]
Conclusions
• Explicit entity linking statements improve the process of entity
retrieval on structured data
• Explicit linking statements are sparse in collections like the
BTC’12
• Clustering approaches can be used to remedy the sparsity of
such links
• Given the scale of structured data, bucketing approaches like LSH drastically improve scalability
• For a given entity-centric query, certain entity types are more
likely to be relevant
• Similarity of the entity to the query is highly important when
expanding the result set
Thank you!
Questions?
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 

Improving Entity Retrieval on Structured Data through Clustering and Result Set Expansion

  • 1. Improving Entity Retrieval on Structured Data Besnik Fetahu, Ujwal Gadiraju and Stefan Dietze
  • 2. Outline • Introduction • Entity Retrieval: keyword vs. structured queries • Motivation • Approach • Pre-processing: Clustering • Entity Retrieval • Experimental Setup • Evaluation and Results • Conclusions
  • 3. Introduction • Large number of available structured and semi-structured datasets (LOD, Web Data Commons) • Entity-centric nature of data • Ad-hoc entity-centric user queries • Retrieval based on natural language queries • Structured queries to harness explicit links between entities (e.g. rdfs:seeAlso, owl:sameAs etc.) • Multiple representations of entities from various sources
  • 4. Entity Retrieval: “keywords” • BM25F: standard IR model for entity retrieval (Blanco et al., ISWC 2011)
  • 5. Entity Retrieval: “structured queries” • BM25 + SPARQL (Tonon et al., SIGIR 2012) • Exploit explicit entity linking statements for retrieval • Linear weighting between BM25 score and string distance to the query (e.g. Jaro-Winkler distance) • Query expansion through implicit relevance feedback
  • 6. Motivation • Explicit entity linking statements improve retrieval[1] • Examples of explicit links: equivalence (dbp:Bethlehem,_Pennsylvania owl:sameAs fb:Bethlehem (Pennsylvania)); relatedness (dbp:Bethlehem,_Pennsylvania rdfs:seeAlso dbp:The_Lehigh_Valley); redirects (dbp:Bethlehem,_Pennsylvania dbo:WikiPageRedirects dbr:Bethlehem,_PA) • Sparsity of explicit linking statements • Majority of links have no properly defined semantics [Figure: log-scale plot of the frequency of object properties against the frequency of explicit similarity statements] [1] Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval, SIGIR 2012
  • 7. Motivation (I) • Most queries are entity-centric[2] • Relevant entities in the result set are usually from related entity types • Few entity types (e.g. ‘Person’) are affiliated with many entity types • Query type affinity: given an entity-centric query, entities of a specific type are more likely to be relevant than the others, e.g. q = {‘Barack Obama’} hasType Person [Figure: query type affinity between entity types on a 0 to 1 scale; types shown include Artist, Organization, Famous People, Film, Bird, People, Product, City, Activists, Computer Software, Musical Artist, ArchitecturalStructure, Work, University, Person, Place and Weapon, among others]
  • 8. Approach • Pre-processing: 1. Entity Feature Vectors 2. Entity Bucketing and Clustering • Online retrieval: 1. Query Analysis 2. Entity Retrieval 3. Entity Ranking • Pipeline: BTC entities grouped by type (⟨e rdf:type t_1⟩, ..., ⟨e rdf:type t_m⟩) → (I) Entity Feature Vectors $F(e) = \{W_1, W_2, \phi\}$ → (II) LSH Entity Bucketing → (II) Entity Clustering (x-means, spectral clustering), yielding 1. index 2. clusters → (III) Query Analysis of the user query (e.g. ‘Barack Obama’ isA Person) → (IV & V) Retrieval & Ranking: 1. BM25F 2. BM25F + Clustering 3. Entity reranking
  • 9. Pre-Processing: Feature Vectors • n-grams from literals • object properties • entity type level statistics • $F(e) = \{W_1(e), W_2(e), \phi\}$ • $W_1 = [\langle u_1; \mathrm{tfidf}(u_1) \rangle, \ldots, \langle u_n; \mathrm{tfidf}(u_n) \rangle]$ • $W_2 = [\langle b_1; \mathrm{tfidf}(b_1) \rangle, \ldots, \langle b_n; \mathrm{tfidf}(b_n) \rangle]$ • $\phi = [\phi(o_1, e), \ldots, \phi(o_n, e)]$, with $\phi(o_i, e) \to [0, 1]$, $i \in \{1, \ldots, n\}$ • $W_n$: n-gram dictionary scored through tf-idf • $\phi$: {0,1}, whether a property for type t is present in entity e • Example entity: rdfs:label Barack Obama; rdfs:comment Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn ɵˈbɑːmə/; born August 4, 1961) is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree.; foaf:name Barack Obama; dc:description American politician, 44th President of the United States; foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Barack_Obama; dcterms:subject http://dbpedia.org/resource/Category:Nobel_Peace_Prize_laureates; dcterms:subject http://dbpedia.org/resource/Category:Presidents_of_the_United_States; dcterms:subject http://dbpedia.org/resource/Category:Obama_family; dcterms:subject http://dbpedia.org/resource/Category:American_civil_rights_lawyers; rdfs:seeAlso http://dbpedia.org/resource/United_States_Senate
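The feature vector described on this slide can be illustrated with a minimal sketch. It assumes word-level n-grams, raw term counts, and a plain log(N/df) idf; the slide does not specify the tokenization or the tf-idf variant, and the names `word_ngrams`, `tfidf_vectors`, `property_indicator` and the toy `literals` input are hypothetical.

```python
import math
from collections import Counter

def word_ngrams(text, n):
    """Return the word n-grams of a literal, lowercased (assumed tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(literals, n):
    """Map each entity to a sparse {n-gram: tf-idf} dictionary.

    literals: dict of entity id -> concatenated literal text (hypothetical input).
    Uses raw counts as tf and log(N / df) as idf, which is one common variant.
    """
    grams = {e: Counter(word_ngrams(text, n)) for e, text in literals.items()}
    df = Counter(g for counts in grams.values() for g in counts)
    n_entities = len(literals)
    return {
        e: {g: tf * math.log(n_entities / df[g]) for g, tf in counts.items()}
        for e, counts in grams.items()
    }

def property_indicator(type_properties, entity_properties):
    """Binary vector: 1 if the entity has the property observed for its type, else 0."""
    return [1 if p in entity_properties else 0 for p in type_properties]

literals = {
    "dbp:Barack_Obama": "Barack Obama American politician",
    "dbp:Michelle_Obama": "Michelle Obama American lawyer",
}
W1 = tfidf_vectors(literals, 1)  # unigram dictionary
W2 = tfidf_vectors(literals, 2)  # bigram dictionary
phi = property_indicator(
    ["rdfs:seeAlso", "owl:sameAs", "dcterms:subject"],
    {"rdfs:seeAlso", "dcterms:subject"},
)
```

Terms shared by every entity (here "obama") get weight zero under this idf, which is why rarer n-grams dominate the vector.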
  • 10. Pre-processing: Clustering • Remedy sparsity of explicit entity linking statements by clustering entities at type level • Compute entity buckets through Locality-Sensitive Hashing • Min-Hash signatures for every entity instance • Entities that are likely to be similar are grouped into the same hash bucket • Entity clustering at the entity bucket level • x-means • spectral clustering • Distance between entities and the cluster centroids is measured through the Euclidean distance $d(e, e') = \sqrt{\sum (F(e) - F(e'))^2}$
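The bucketing step and the distance measure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the hash family (seeded MD5), the number of hash functions, and the banding scheme are all choices made here, and `minhash_signature`, `lsh_buckets` and `euclidean` are hypothetical names.

```python
import hashlib
import math
from collections import defaultdict

def minhash_signature(tokens, num_hashes=8):
    """For each seeded hash function, keep the minimum hash value over the token set.

    tokens must be non-empty. Seeded MD5 is an assumed stand-in hash family.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return tuple(signature)

def lsh_buckets(entity_tokens, num_hashes=8, band_size=2):
    """Place entities whose signatures agree on a whole band into the same bucket."""
    buckets = defaultdict(set)
    for entity, tokens in entity_tokens.items():
        sig = minhash_signature(tokens, num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(entity)
    return buckets

def euclidean(u, v):
    """Euclidean distance between two sparse feature vectors given as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

entity_tokens = {
    "e1": {"barack", "obama", "president"},
    "e2": {"barack", "obama", "president"},
    "e3": {"lehigh", "valley"},
}
buckets = lsh_buckets(entity_tokens)
```

Entities with identical token sets always share every band, so `e1` and `e2` land in the same buckets; the clustering step (x-means or spectral clustering) would then run only within a bucket, using a distance such as `euclidean`.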
  • 14. Pre-processing: Clustering [Figure: entity instances of type t ($e_1, \ldots, e_n$) are assigned min-hash signatures and grouped by LSH entity bucketing; entity clustering then partitions the bucketed entities into cluster 1, cluster 2 and cluster 3]
  • 15. Entity Retrieval: Result-set Expansion • For each user query q, retrieve an initial set of top-k entities $E_b$ through BM25F • Expand the top-k result set with additional entities that are co-clustered with the entities in $E_b$ • Scoring of expanded entities: $sim(q, e_c) = \lambda \frac{\varphi(q, e_c)}{\varphi(q, e_b)} + (1 - \lambda)\, d(e_b, e_c)$, where the first term is the string similarity to query q and the second term is the distance to the initially retrieved entity • Example: $E_b = \{e_{n-2}, e_{n-1}, \ldots\}$, $E_c = \{e_{i+3}, e_3, e_{i+2}\}$ [Figure: entities $e_1, \ldots, e_n$ grouped into cluster 1, cluster 2 and cluster 3, with the expanded entities drawn from the clusters of the initially retrieved ones]
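The expansion score on this slide, read as a weighted sum of a string-similarity ratio and a feature-space distance, can be sketched as below. Several things are assumptions: the weight is called `lam`, the string measure is `difflib`'s ratio used as a stand-in (it is not Jaro-Winkler), feature vectors are dense lists, and all function names and the toy data are hypothetical. How the resulting score is thresholded or sorted is left open, since the slide does not say.

```python
import math
from difflib import SequenceMatcher

def string_sim(a, b):
    """Stand-in string similarity in [0, 1] (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def euclidean(u, v):
    """Euclidean distance between two dense feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def expansion_score(query, label_c, label_b, vec_c, vec_b, lam=0.5):
    """lam * sim(q, e_c) / sim(q, e_b) + (1 - lam) * d(e_b, e_c)."""
    base = string_sim(query, label_b)
    ratio = string_sim(query, label_c) / base if base > 0 else 0.0
    return lam * ratio + (1 - lam) * euclidean(vec_b, vec_c)

def expand(query, base_entities, clusters, labels, vectors, lam=0.5):
    """Score each candidate e_c that shares a cluster with a base entity e_b."""
    scores = {}
    for cluster in clusters:
        for e_b in base_entities & cluster:
            for e_c in cluster - base_entities:
                scores[(e_b, e_c)] = expansion_score(
                    query, labels[e_c], labels[e_b], vectors[e_c], vectors[e_b], lam
                )
    return scores

labels = {"b": "Barack Obama", "c": "Michelle Obama", "x": "Berlin"}
vectors = {"b": [0.0, 0.0], "c": [1.0, 0.0], "x": [5.0, 5.0]}
scores = expand("obama", {"b"}, [{"b", "c"}, {"x"}], labels, vectors)
```

Only entities that share a cluster with an initially retrieved entity are scored, so `x` is never considered for this query.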
  • 16. Entity Retrieval: Result set re-ranking • For a given entity-centric query, rank the entities based on the query type affinity (certain entity types are more likely to be relevant) • In case of contextual query terms (e.g. ‘Harry Potter movie’), consider the coverage from a given entity instance • Query type affinity ranking: $\gamma(t_e, t_q) = \frac{p(t_e \mid t_q)}{\sum_{t'_q \neq t_q} \left(1 - p(t_e \mid t'_q)\right)}$ • Query context overlap: $context(q, e) = \frac{1}{|Cx|} \sum_{cx \in Cx} \mathbb{1}[e \text{ has } cx]$ • Entity rank score: $\alpha(e, t_q) = \lambda \left( \text{rank score}(e) \ast \gamma(t_e, t_q) \right) + (1 - \lambda) \ast context(q, e)$
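A small sketch of the re-ranking score, assuming the type probabilities p(t_e | t_q) are already available as a nested dictionary and that "e has cx" is approximated by a case-insensitive substring match on the entity's text. Both are simplifications made here; the function names, the weight `lam`, and the toy probabilities are hypothetical.

```python
def type_affinity(p, t_e, t_q):
    """p(t_e | t_q) divided by the sum over other query types t' of (1 - p(t_e | t')).

    p: dict mapping query type -> {entity type: probability} (hypothetical input).
    """
    denom = sum(1.0 - dist.get(t_e, 0.0) for t, dist in p.items() if t != t_q)
    return p[t_q].get(t_e, 0.0) / denom if denom > 0 else 0.0

def context_overlap(context_terms, entity_text):
    """Fraction of contextual query terms found in the entity's text (substring match)."""
    if not context_terms:
        return 0.0
    text = entity_text.lower()
    return sum(1 for c in context_terms if c.lower() in text) / len(context_terms)

def rerank_score(rank_score, affinity, context, lam=0.5):
    """lam * (rank_score * affinity) + (1 - lam) * context."""
    return lam * (rank_score * affinity) + (1 - lam) * context

# Toy conditional probabilities of entity types given a query type.
p = {
    "Person": {"Person": 0.8, "Organization": 0.2},
    "Place": {"Person": 0.5, "City": 0.5},
}
affinity = type_affinity(p, "Person", "Person")
context = context_overlap(["movie", "book"], "Harry Potter movie")
score = rerank_score(2.0, affinity, context)
```

With these toy numbers the affinity is 0.8 / (1 - 0.5) and one of the two context terms matches, so the type signal and the context signal each contribute to the final score.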
  • 17. Experimental Setup • Dataset: BTC’12 • 1.4 billion triples • 107,967 data graphs • 3,321 entity types • 454 million entity instances • Entity Bucketing and Clustering • ~77,485 entities fed into LSH bucketing algorithm • ~400 entities on average for the clustering approaches • ~13–38 clusters • ~10–20 entities per cluster • Queries: SemSearch[1] • 92 queries • Data Indexes • RDF3X[2] and Lucene Index • title + body fields • body (consists of all literals of an entity) [1] http://km.aifb.kit.edu/ws/semsearch10/ [2] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow., 1(1):647–659, Aug. 2008.
  • 18. Experimental Setup (I) • B: baseline BM25F approach • S1: state-of-the-art approach (Tonon et al., SIGIR’12) with one-hop entities • Our approach: • SP: entities are expanded from clusters generated through spectral clustering • XM: entities are expanded from clusters generated through x-means clustering
  • 19. Evaluation Results: Clustering Accuracy • Crowdsourcing evaluation of clustering accuracy on 100 randomly selected clusters with a total of 1,000 entities • Crowd workers: “Pick the odd entity out!” • High agreement rate between workers: 0.75 and 0.6 for the spectral and x-means clustering approaches, respectively [Figures: worker agreement on cluster accuracy for random clusters, x-means and spectral; number of clusters per accuracy level (100%, 80%, 70%, 60%, 50%) for x-means and spectral]
  • 20. Evaluation Results: Entity Retrieval • Evaluate the retrieval task through crowdsourcing (Blanco et al., SIGIR’11) • Entities are assessed by 3 crowd workers, on a 5-point Likert scale
  • 21. Evaluation Results: Entity Retrieval • Significantly more relevant entities for relevance scores 3-5, no difference for relevance score 2 • Cluster size and number of expanded entities per cluster [Figures: number of entities per entity relevance score (2 to 5) for Bt, S1t, SPt, XMt, Bb, S1b, SPb and XMb; average NDCG per result set expansion configuration (5-1, 10-1, 10-5, 20-1, 20-5, 50-1, 50-5, 100-1, 100-5, 1000-1, 1000-5; cluster size and number of expanded entities per cluster) for XMt, SPt, XMb and SPb]
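The average NDCG reported on this slide is computed from graded relevance judgments. The sketch below shows one common NDCG formulation (linear gain, log2 rank discount); the slides do not state which variant the authors used, so this is an assumption, and the grades in the example are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking, at cutoff k."""
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical Likert grades (1-5) for a ranked result list.
perfect = ndcg([5, 3, 1])
swapped = ndcg([1, 5])
```

A ranking that already lists entities in decreasing relevance scores 1.0, and any misordering lowers the value; averaging `ndcg` over all queries gives the kind of figure plotted per configuration.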
  • 22. Conclusions • Explicit entity linking statements improve the process of entity retrieval on structured data • Explicit linking statements are sparse in collections like the BTC’12 • Clustering approaches can be used to remedy the sparsity of such links • Given the scale of structured data, bucketing approaches like LSH drastically improve scalability • For a given entity-centric query, certain entity types are more likely to be relevant • Similarity of the entity to the query is highly important when expanding the result set