
Improving Entity Retrieval on Structured Data

Presented at ISWC 2015


  1. Improving Entity Retrieval on Structured Data
     Besnik Fetahu, Ujwal Gadiraju and Stefan Dietze
  2. Outline
     • Introduction
     • Entity Retrieval: keyword vs. structured queries
     • Motivation
     • Approach
     • Pre-processing: Clustering
     • Entity Retrieval
     • Experimental Setup
     • Evaluation and Results
     • Conclusions
  3. Introduction
     • Large number of available structured and semi-structured datasets (LOD, Web Data Commons)
     • Entity-centric nature of the data
     • Ad-hoc entity-centric user queries
     • Retrieval based on natural language queries
     • Structured queries to harness explicit links between entities (e.g. rdfs:seeAlso, owl:sameAs)
     • Multiple representations of entities from various sources
  4. Entity Retrieval: “keywords”
     • BM25F: standard IR model for entity retrieval (Blanco et al., ISWC 2011)
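For readers unfamiliar with BM25F, the sketch below illustrates the core idea: per-field term frequencies are length-normalized, combined with field weights, then passed through BM25 saturation. This is a minimal illustration, not the tuned model behind the slides; the field weights, b parameters, and toy corpus statistics are assumptions.

```python
import math

def bm25f_score(query_terms, doc_fields, stats, field_weights, field_b, k1=1.2):
    """Simplified BM25F: combine length-normalized, field-weighted term
    frequencies into one pseudo-frequency, then apply BM25 saturation."""
    score = 0.0
    for term in query_terms:
        tf_combined = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Field-level length normalization (b_f) and field weight (w_f)
            norm = 1.0 + field_b[field] * (len(tokens) / stats["avg_len"][field] - 1.0)
            tf_combined += field_weights[field] * tf / norm
        if tf_combined > 0.0:
            df = stats["df"].get(term, 0)
            idf = math.log((stats["num_docs"] - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf_combined / (k1 + tf_combined)
    return score

# Toy two-field entity document (title + body), as in the experimental setup
doc = {"title": "barack obama".split(),
       "body": "barack hussein obama is the 44th president".split()}
stats = {"num_docs": 1000, "avg_len": {"title": 3, "body": 50},
         "df": {"barack": 40, "obama": 60}}
print(bm25f_score(["barack", "obama"], doc, stats,
                  field_weights={"title": 2.0, "body": 1.0},
                  field_b={"title": 0.5, "body": 0.75}))
```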
  5. Entity Retrieval: “structured queries”
     • BM25 + SPARQL (Tonon et al., SIGIR 2012)
     • Exploit explicit entity linking statements for retrieval
     • Linear weighting between the BM25 score and the string distance to the query (e.g. Jaro-Winkler distance)
     • Query expansion through implicit relevance feedback
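The linear weighting is straightforward to sketch. Below is a minimal illustration assuming the third-party jellyfish library for Jaro-Winkler similarity and a pre-normalized BM25 score; the weight of 0.7 is an arbitrary illustrative choice, not a value from the cited work.

```python
import jellyfish  # third-party: pip install jellyfish

def hybrid_score(bm25_score, query, entity_label, weight=0.7):
    """Linearly interpolate a normalized BM25 score with the
    Jaro-Winkler string similarity between query and entity label."""
    string_sim = jellyfish.jaro_winkler_similarity(query.lower(),
                                                   entity_label.lower())
    return weight * bm25_score + (1.0 - weight) * string_sim

print(hybrid_score(0.62, "barack obama", "Barack Obama"))
```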
  6. Motivation
     • Explicit entity linking statements improve retrieval [1]
     • Sparsity of explicit linking statements
     • Majority of links have no properly defined semantics
     Examples of linking statements:
     • equivalence: dbp:Bethlehem,_Pennsylvania owl:sameAs fb:Bethlehem (Pennsylvania)
     • relatedness: dbp:Bethlehem,_Pennsylvania rdfs:seeAlso dbp:The_Lehigh_Valley
     • redirects: dbp:Bethlehem,_Pennsylvania dbo:WikiPageRedirects dbr:Bethlehem,_PA
     [Log-log scatter plot: frequency of object properties vs. frequency of explicit similarity statements]
     [1] Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval. SIGIR 2012.
  7. Motivation (I)
     • Most queries are entity-centric [2]
     • Relevant entities in the result set are usually from related entity types
     • Few entity types (e.g. ‘Person’) are affiliated with many entity types
     • Query type affinity: given an entity-centric query, entities of a specific type are more likely to be relevant than others, e.g. q = {’Barack Obama’} hasType Person
     [Heat map: query type affinity (scale 0–1) between query entity types such as Person, City, Organization, Work and result entity types such as Artist, Musical Artist, Film, Broadcaster, University]
  8. Approach
     Pre-processing (over BTC entity instances ⟨e rdf:type t⟩, grouped by type):
     1. Entity feature vectors: F(e) = {W1, W2, φ}
     2. Entity bucketing (LSH min-hash) and entity clustering (x-means, spectral clustering)
     Online retrieval:
     1. Query analysis (e.g. ‘Barack Obama’ isA Person)
     2. Entity retrieval: BM25F, BM25F + clustering
     3. Entity re-ranking
  9. Pre-Processing: Feature Vectors
     F(e) = {W1(e), W2(e), φ}, built from n-grams in literals, object properties, and entity-type-level statistics:
     • Wn: n-gram dictionary scored through tf-idf
       W1 = [⟨u1, tfidf(u1)⟩, …, ⟨un, tfidf(un)⟩] (unigrams)
       W2 = [⟨b1, tfidf(b1)⟩, …, ⟨bn, tfidf(bn)⟩] (bigrams)
     • φ = [φ(o1, e), …, φ(on, e)], where φ(oi, e) ∈ {0, 1}, i ∈ {1, …, n}, indicates whether property oi of type t is present in entity e
     Example (dbp:Barack_Obama):
     • rdfs:label: Barack Obama
     • rdfs:comment: Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn ɵˈbɑːmə/; born August 4, 1961) is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree.
     • foaf:name: Barack Obama
     • dc:description: American politician, 44th President of the United States
     • foaf:isPrimaryTopicOf: http://en.wikipedia.org/wiki/Barack_Obama
     • dcterms:subject: http://dbpedia.org/resource/Category:Nobel_Peace_Prize_laureates
     • dcterms:subject: http://dbpedia.org/resource/Category:Presidents_of_the_United_States
     • dcterms:subject: http://dbpedia.org/resource/Category:Obama_family
     • dcterms:subject: http://dbpedia.org/resource/Category:American_civil_rights_lawyers
     • rdfs:seeAlso: http://dbpedia.org/resource/United_States_Senate
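A minimal sketch of how such a feature vector could be assembled: tf-idf weighted unigrams and bigrams from the entity's literals, plus binary indicators over the object properties observed for the entity's type. The function name and toy inputs are hypothetical; a real implementation would compute idf over all entities of the type.

```python
from collections import Counter

def entity_feature_vector(literals, entity_props, type_props, idf):
    """F(e) = {W1, W2, phi}: tf-idf weighted unigrams/bigrams from the
    entity's literals, plus {0,1} indicators for each object property
    observed on the entity's type."""
    tokens = " ".join(literals).lower().split()
    w1 = {g: tf * idf.get(g, 1.0) for g, tf in Counter(tokens).items()}
    w2 = {g: tf * idf.get(g, 1.0)
          for g, tf in Counter(zip(tokens, tokens[1:])).items()}
    phi = [1.0 if p in entity_props else 0.0 for p in type_props]
    return w1, w2, phi

# Toy input resembling the dbp:Barack_Obama example above
literals = ["Barack Obama", "American politician, 44th President"]
props = {"dcterms:subject", "rdfs:seeAlso"}
type_props = ["dcterms:subject", "rdfs:seeAlso", "dbo:birthPlace"]
w1, w2, phi = entity_feature_vector(literals, props, type_props, idf={})
print(phi)  # -> [1.0, 1.0, 0.0]
```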
  10. Pre-processing: Clustering
      • Remedy the sparsity of explicit entity linking statements by clustering entities at type level
      • Compute entity buckets through Locality-Sensitive Hashing (LSH):
        • min-hash signatures for every entity instance
        • entities that are likely to be similar are grouped into the same hash bucket
      • Entity clustering at the entity bucket level: x-means, spectral clustering
      • Distance between entities and the cluster centroids measured through the Euclidean distance:
        d(e, e′) = √( Σ (F(e) − F(e′))² )
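A minimal sketch of the min-hash/LSH bucketing step, assuming entities are represented as sets of features (e.g. n-grams); the number of hash functions and bands are illustrative choices, not the deck's settings.

```python
import random

def minhash_signature(features, hash_funcs):
    """Min-hash signature: per hash function, the minimum hash value
    over the entity's feature set (approximates Jaccard similarity)."""
    return tuple(min(h(f) for f in features) for h in hash_funcs)

def lsh_buckets(entities, hash_funcs, bands=4):
    """Split signatures into bands; entities agreeing on any band share
    a bucket, grouping likely-similar entities without pairwise checks."""
    rows = len(hash_funcs) // bands
    buckets = {}
    for eid, features in entities.items():
        sig = minhash_signature(features, hash_funcs)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(eid)
    return buckets

random.seed(42)
P = 2**31 - 1  # prime modulus for a simple universal hash family
hash_funcs = [lambda f, a=random.randrange(1, P), b=random.randrange(P):
              (a * hash(f) + b) % P for _ in range(16)]

entities = {"e1": {"barack", "obama", "president"},
            "e2": {"barack", "obama", "politician"},
            "e3": {"lehigh", "valley", "pennsylvania"}}
for members in lsh_buckets(entities, hash_funcs).values():
    if len(members) > 1:
        print(members)  # e1 and e2 are likely to share a bucket
```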
  11.–14. Pre-processing: Clustering (build-up over four slides)
      Entity instances of type t (e1, …, en) → min-hash signatures → LSH entity bucketing → entity clustering (buckets split into clusters 1–3)
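Once buckets are formed, each bucket is clustered independently. Below is a minimal sketch of the spectral-clustering variant using scikit-learn (an assumed dependency; x-means is not in scikit-learn, but e.g. the pyclustering package offers it as an alternative):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_bucket(feature_matrix, n_clusters=3):
    """Cluster one LSH bucket's entity feature vectors with spectral
    clustering over a nearest-neighbor affinity graph."""
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="nearest_neighbors",
                               n_neighbors=5, random_state=0)
    return model.fit_predict(np.asarray(feature_matrix))

# Toy bucket: two well-separated groups in a 4-d feature space
rng = np.random.default_rng(0)
bucket = np.vstack([rng.normal(0.0, 0.1, (10, 4)),
                    rng.normal(1.0, 0.1, (10, 4))])
print(cluster_bucket(bucket, n_clusters=2))
```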
  15. Entity Retrieval: Result-set Expansion
      • For each user query q, retrieve an initial set Eb of top-k entities through BM25F
      • Expand the result set with additional entities that are co-clustered with the entities in Eb, e.g. Eb = {en−2, en−1, …} (top-k result set) expanded with Ec = {ei+3, e3, ei+2}
      • Scoring of expanded entities combines string similarity to the query q and distance to the initially retrieved entity:
        sim(q, ec) = λ · (φ(q, ec) − φ(q, eb)) + (1 − λ) · d(eb, ec)
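A minimal sketch of this scoring as reconstructed above: φ is a string-similarity function of query and entity label, d the Euclidean distance from the clustering step. The value λ = 0.5 and the toy inputs are illustrative assumptions.

```python
import numpy as np

def expansion_score(phi_cand, phi_seed, cand_vec, seed_vec, lam=0.5):
    """sim(q, e_c): gain in query string similarity of the candidate
    over the seed entity e_b it was co-clustered with, interpolated
    with their feature-space Euclidean distance."""
    d = float(np.linalg.norm(np.asarray(cand_vec) - np.asarray(seed_vec)))
    return lam * (phi_cand - phi_seed) + (1.0 - lam) * d

# Candidate slightly less query-similar than the seed, but close to it
print(expansion_score(phi_cand=0.8, phi_seed=0.9,
                      cand_vec=[0.1, 0.3], seed_vec=[0.2, 0.3]))
```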
  16. Entity Retrieval: Result-set Re-ranking
      • For a given entity-centric query, rank the entities based on the query type affinity (certain entity types are more likely to be relevant):
        γ(te, tq) = p(te | tq) · Σ_{t′q ≠ tq} (1 − p(te | t′q))
      • In case of contextual query terms (e.g. ‘Harry Potter movie’), consider the coverage of the query context Cx by a given entity instance:
        context(q, e) = (1 / |Cx|) · Σ_{cx ∈ Cx} 1[e has cx]
      • Final entity rank score:
        α(e, tq) = λ · (rank_score(e) · γ(te, tq)) + (1 − λ) · context(q, e)
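A minimal sketch combining the two signals, following the reconstruction above; type_probs (the conditional type probabilities p(te | t′q)), lam, and the toy query are illustrative assumptions.

```python
def rerank_score(rank_score, type_probs, query_type,
                 entity_terms, context_terms, lam=0.5):
    """alpha(e, t_q): retrieval score weighted by query type affinity,
    interpolated with the entity's coverage of contextual query terms."""
    # gamma(t_e, t_q): likely under t_q, unlikely under the other query types
    affinity = type_probs[query_type] * sum(
        1.0 - p for t, p in type_probs.items() if t != query_type)
    coverage = (sum(1 for c in context_terms if c in entity_terms)
                / len(context_terms)) if context_terms else 0.0
    return lam * rank_score * affinity + (1.0 - lam) * coverage

# Query 'Harry Potter movie' with contextual term 'movie' (toy values)
probs = {"Person": 0.2, "Place": 0.1, "Work": 0.6}  # p(t_e | t'_q)
print(rerank_score(0.7, probs, "Work",
                   entity_terms={"harry", "potter", "movie", "film"},
                   context_terms=["movie"]))
```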
  17. Experimental Setup
      Dataset: BTC’12
      • 1.4 billion triples
      • 107,967 data graphs
      • 3,321 entity types
      • 454 million entity instances
      Entity Bucketing and Clustering
      • ~77,485 entities fed into the LSH bucketing algorithm
      • ~400 entities on average for the clustering approaches
      • ~13–38 clusters
      • ~10–20 entities per cluster
      Queries: SemSearch [1]
      • 92 queries
      Data Indexes
      • RDF-3X [2] and a Lucene index with title + body fields (the body consists of all literals of an entity)
      [1] http://km.aifb.kit.edu/ws/semsearch10/
      [2] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow., 1(1):647–659, Aug. 2008.
  18. Experimental Setup (I)
      • B: baseline BM25F approach
      • S1: state-of-the-art approach (Tonon et al., SIGIR’12) with one-hop entities
      • Our approaches:
        • SP: entities are expanded from clusters generated through spectral clustering
        • XM: entities are expanded from clusters generated through x-means clustering
  19. Evaluation Results: Clustering Accuracy
      • Crowdsourced evaluation of clustering accuracy on 100 randomly selected clusters with a total of 1,000 entities
      • Crowd workers were asked to “Pick the odd entity out!”
      • High agreement rate between workers: 0.75 and 0.6 for the spectral and x-means clustering approaches, respectively
      [Plots: worker agreement on cluster accuracy vs. random clusters, and number of clusters at each accuracy level (100%–50%), for x-means and spectral clustering]
  20. Evaluation Results: Entity Retrieval
      • Evaluate the retrieval task through crowdsourcing (Blanco et al., SIGIR’11)
      • Entities are assessed by 3 crowd workers on a 5-point Likert scale
  21. Evaluation Results: Entity Retrieval
      • Significantly more relevant entities at relevance grades 3–5; no difference at relevance grade 2
      • Performance depends on cluster size and the number of expanded entities per cluster
      [Bar chart: number of entities per relevance grade (2–5) for Bt, S1t, SPt, XMt, Bb, S1b, SPb, XMb; line chart: avg. NDCG over result-set expansion configurations (cluster size and expanded entities per cluster, from 5-1 to 1000-5) for XMt, SPt, XMb, SPb]
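The avg. NDCG in the second plot measures ranking quality over the graded crowd judgments. A minimal sketch of one common NDCG variant (linear gain, log2 discount); the example labels are illustrative:

```python
import math

def ndcg(relevances, k=None):
    """NDCG@k: DCG with a log2 position discount, normalized by the
    DCG of the ideal (descending-sorted) ranking of the same labels."""
    cut = len(relevances) if k is None else k
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:cut]))
    ideal = sorted(relevances, reverse=True)[:cut]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# 5-point Likert relevance labels of a ranked result list
print(round(ndcg([5, 3, 4, 2, 1], k=5), 3))
```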
  22. Conclusions
      • Explicit entity linking statements improve entity retrieval on structured data
      • Explicit linking statements are sparse in collections like BTC’12
      • Clustering approaches can be used to remedy the sparsity of such links
      • Given the scale of structured data, bucketing approaches like LSH drastically improve scalability
      • For a given entity-centric query, certain entity types are more likely to be relevant
      • Similarity of the entity to the query is highly important when expanding the result set
  23. Thank you! Questions?
