
A Survey of Entity Ranking over RDF Graphs



An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").



  1. A Survey of Entity Ranking over RDF Graphs. Nikita Zhiltsov, Kazan Federal University, Russia. November 29, 2013
  2. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  3. Motivation. An increasing amount of valuable semi-structured data has become available online, e.g. RDF graphs: the Linking Open Data (LOD) cloud; Web pages enhanced with microformats, RDFa, etc. (CommonCrawl, Web Data Commons); Google's Freebase Annotations of the ClueWeb Corpora. More than half of the queries in real query logs have an entity-centric user intent. Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands
  4. Google Knowledge Graph
  5. Facebook Graph Search
  6. Yandex Islands
  7. Overview of Semantic Search Approaches. T. Tran, P. Mika. Semantic Search: Systems, Concepts, Methods and Communities behind It
  8. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  9. In this talk, we focus on entity ranking over RDF graphs given a keyword search query
  10. Key Issues in Entity Ranking: ambiguity in names; related entities from heterogeneous data sources; complex queries with clarifying terms
  11. Key Issues in Entity Ranking: Ambiguity in names. Given the query university of michigan: University of Michigan, Ann Arbor; Central Michigan University; Michigan Technological University; Michigan State University
  12. Key Issues in Entity Ranking: Related entities from heterogeneous data sources. Given the query harry potter movie, semantic link information can effectively enhance the term context
  13. Key Issues in Entity Ranking: Complex queries with clarifying terms. Given the query shobana masala, the user intent is likely Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre
  14. Ad-hoc Object Retrieval in the Web of Data. Jeffrey Pound, Peter Mika, Hugo Zaragoza. WWW 2010
  15. Query Categories: Entity query (∼40%∗), e.g. 1978 cj5 jeep; Type query† (∼12%), e.g. doctors in barcelona; Attribute query (∼5%), e.g. zip code atlanta; Other query (∼36%), though ∼14% of these contain a context entity or type. (∗ estimated on real query logs from Yahoo!; † a.k.a. list search query)
  16. Repeatable and Reliable Search System Evaluation using Crowdsourcing. Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh D. Tran. SIGIR 2011
  17. Data Collection. The Billion Triples Challenge 2009 RDF data set: 247 GB uncompressed; 1.4B triples describing 114 million objects. It was composed by combining the crawls of multiple RDF search engines
  18. Data Collection: Classes
  19. Data Collection: Properties
  20. Data Collection: Sources
  21. Query Set Preparation. (1) Emulate top queries: from the Microsoft Live Search log (queries repeated by at least 10 different users), sample 50 queries pre-filtered with a NER and a gazetteer. (2) Emulate long-tail queries: from the Yahoo! Search Query Log Tiny Sample v1.0 (4,500 queries), sample and manually filter out ambiguous queries ⇒ 42 queries. (3) ⇒ a list of 92 queries
  22. Crowdsourcing Judgements. A purpose-built rendering tool presents the search results. The evaluation (MT1) was conducted and then repeated (MT2) after 6 months, using Amazon Mechanical Turk HITs. Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts. 64 workers participated in MT1 and 69 in MT2
  23. Rendering Tool
  24. Analysis of Results: Repeatability. The level of agreement is the same for the two pools; the rank order of the systems is unchanged
  25. Targeted Evaluation Measures I. All the measures are usually computed on the top-10 search results (k = 10). (1) P@k (precision at k): P@k(π, l) = (1/k) Σ_{t≤k} I{l_π(t) = 1}. (2) MAP (mean average precision): AP(π, l) = (1/m₁) Σ_{k=1}^{m} P@k · I{l_π(k) = 1}, where m₁ is the number of relevant results; MAP is the mean of AP over all queries
  26. Targeted Evaluation Measures II. (3) NDCG (normalized discounted cumulative gain): DCG@k(π, l) = Σ_{j=1}^{k} G(l_π(j)) · η(j), where G(·), the rating of a document, is usually G(z) = 2^z − 1, η(j) = 1/log(j + 1), and l_π(j) ∈ {0, 1, 2}; NDCG@k(π, l) = (1/Z_k) · DCG@k(π, l)
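As a concrete reference, the three measures above can be sketched in Python (an illustrative implementation; function names are mine, and the normalizer Z_k is computed from the ideal descending-grade ordering):

```python
import math

def precision_at_k(labels, k):
    """P@k: fraction of the top-k ranked results that are relevant.
    `labels` is the ranked list of binary relevance labels l_pi(1..n)."""
    return sum(labels[:k]) / k

def average_precision(labels):
    """AP: mean of P@k over the ranks k that hold relevant results."""
    m1 = sum(labels)  # number of relevant results in the ranking
    if m1 == 0:
        return 0.0
    return sum(precision_at_k(labels, k + 1)
               for k, l in enumerate(labels) if l == 1) / m1

def ndcg_at_k(grades, k):
    """NDCG@k with G(z) = 2^z - 1 and discount 1 / log2(j + 1).
    `grades` is the ranked list of graded labels in {0, 1, 2}."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(j + 2)
                   for j, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))  # Z_k
    return dcg(grades) / ideal if ideal > 0 else 0.0
```

MAP over a query set is then just the mean of `average_precision` over each query's ranked labels.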
  27. Analysis of Results: Reliability. Differences between the pools: MAP 1.8%; NDCG 3.5%; P@10 12.8%. In this setting, experts rate more results as negative than workers do; P@10 is more fragile than MAP and NDCG
  28. Yahoo! SemSearch Challenge (YSC) 2010 & 2011: http://semsearch.yahoo.com
  29. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  30. Entity Search Track Submission by Yahoo! Research Barcelona. Roi Blanco, Peter Mika, Hugo Zaragoza. SSW at WWW 2010
  31. YSC 2010 Winner Approach. Only RDF S-P-O triples with literal objects are considered. Triples are filtered by predicates from a predefined list of 300 predicates. Triples about the same subject are grouped into a pseudo document with multiple fields. The BM25F ranking formula is applied, with a handcrafted field weighting scheme w_c: BM25F = Σ_{t∈q∩d} [tf(t, d) / (k₁ + b · tf(t, d))] · idf(t), where tf(t, d) = Σ_{c∈d} w_c · tf_c(t, d)
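The field-weighted saturation idea can be sketched as follows (a simplified illustration, not the authors' code: per-field length normalization is omitted, and all names and default parameters are my assumptions):

```python
def bm25f_score(query_terms, doc_fields, field_weights, idf, k1=1.2, b=0.75):
    """Sketch of fielded BM25F over an entity pseudo document.

    doc_fields: {field: {term: raw term frequency}} for one pseudo
    document built from the grouped triples of an entity.
    field_weights: handcrafted per-field weights w_c.
    idf: {term: inverse document frequency}.
    """
    score = 0.0
    for t in query_terms:
        # combined term frequency: weighted sum over fields
        tf = sum(w * doc_fields.get(c, {}).get(t, 0)
                 for c, w in field_weights.items())
        if tf > 0:
            # saturating tf, scaled by the term's rarity
            score += tf / (k1 + b * tf) * idf.get(t, 0.0)
    return score
```

The handcrafted weights give, e.g., name-like fields a larger w_c than generic attribute fields.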
  32. Sindice BM25MF at SemSearch 2011. Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello. SSW at WWW 2011
  33. YSC 2011 Winner Approach I. URI resolution for triple objects. An extended BM25F approach with additional normalization of term frequencies per predicate type; the weighting scheme is handcrafted. The proportion of query terms occurring in entity literals is also taken into account
  34. YSC 2011 Winner Approach II. RDF graph example
  35. YSC 2011 Winner Approach III. Star-shaped query matching the entity
  36. YSC 2011 Winner Approach IV. Empirical weights
  37. On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data. Robert Neumayer, Krisztian Balog, Kjetil Nørvåg. ECIR 2012
  38. Approach to Entity Representation I. RDF graph example
  39. Approach to Entity Representation II. a) Unstructured Entity Model; b) Structured Entity Model
  40. Main Findings. Two generative language models (LMs) for the task: the Unstructured Entity Model and the Structured Entity Model. The evaluation on the YSC data shows that representing relations as a mixture of predicate type LMs can contribute significantly to overall performance
  41. LM Retrieval Framework. P(e|q) = P(q|e)P(e) / P(q), which is rank-equivalent to P(q|e)P(e), where P(e|q) is the probability of entity e being relevant given query q. Further assumptions: (i) P(e) is uniform; (ii) query terms are i.i.d. Let θ_e be the entity model that predicts how likely the entity would produce a given term t; then the query likelihood is P(q|θ_e) = Π_{t∈q} P(t|θ_e)^{tf(t, q)}
  42. Unstructured Entity Model. Idea: collapse all text values of properties associated with the entity into a single document and apply standard IR techniques. The entity model is a Dirichlet-smoothed multinomial distribution: P(t|θ_e) = (tf(t, e) + µ · P(t|θ_c)) / (|e| + µ)
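The smoothed query likelihood can be sketched in Python (an illustration, not the authors' implementation; the log of the likelihood is computed for numerical stability, and terms unseen in the entire collection are skipped):

```python
import math
from collections import Counter

def query_log_likelihood(query, entity_doc, collection, mu=2000):
    """Unstructured Entity Model: collapse all property values of an
    entity into one token list and score it with a Dirichlet-smoothed
    query likelihood.

    P(t|theta_e) = (tf(t, e) + mu * P(t|theta_c)) / (|e| + mu)
    log P(q|theta_e) = sum_t tf(t, q) * log P(t|theta_e)
    """
    tf_e = Counter(entity_doc)
    tf_c = Counter(collection)
    c_len = sum(tf_c.values())
    e_len = sum(tf_e.values())
    score = 0.0
    for t, tf_q in Counter(query).items():
        p_c = tf_c[t] / c_len          # collection (background) model
        p_t = (tf_e[t] + mu * p_c) / (e_len + mu)
        if p_t > 0:                    # skip zero-probability terms
            score += tf_q * math.log(p_t)
    return score
```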
  43. Structured Entity Model: Folding Predicates. Group RDF triples by the following predicate types p_t: Name, e.g. literal values of foaf:name, rdfs:label; Attributes, i.e. the remaining datatype properties; OutRelations, resolving the "object" (O) URIs in S-P-O triples to their names; InRelations, resolving the "subject" (S) URIs in S-P-O triples to their names
  44. Structured Entity Model: Mixture of Language Models. Each group has its own LM P(t|θ_e^{p_t}): P(t|θ_e^{p_t}) = (tf(t, p_t, e) + µ_{p_t} · P(t|θ_c^{p_t})) / (|p_t, e| + µ_{p_t}). Then the entity model is a linear mixture of the predicate type LMs: P(t|θ_e) = Σ_{p_t} P(t|θ_e^{p_t}) · P(p_t)
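A minimal sketch of the mixture, reusing Dirichlet smoothing per predicate type (field names, the priors P(p_t), and the µ value are illustrative assumptions, not the paper's tuned settings):

```python
from collections import Counter

def mixture_prob(term, field_docs, field_collections, field_priors, mu=100):
    """Structured Entity Model: P(t|theta_e) as a linear mixture of
    per-predicate-type Dirichlet-smoothed LMs (e.g. Name, Attributes,
    OutRelations, InRelations).

    field_docs: {predicate_type: tokens of the entity in that field}
    field_collections: {predicate_type: tokens of the whole collection}
    field_priors: {predicate_type: P(p_t)}, summing to 1
    """
    p = 0.0
    for pt, prior in field_priors.items():
        tf_e = Counter(field_docs.get(pt, []))
        tf_c = Counter(field_collections[pt])
        c_len = sum(tf_c.values())
        e_len = sum(tf_e.values())
        p_c = tf_c[term] / c_len if c_len else 0.0
        # per-field Dirichlet-smoothed probability P(t | theta_e^{p_t})
        p_field = (tf_e[term] + mu * p_c) / (e_len + mu)
        p += prior * p_field
    return p
```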
  45. Comparative Evaluation. YSC 2010, UEM: MAP 0.207, P@10 0.314, NDCG 0.383; SEM: MAP 0.282 (+36.2%), P@10 0.400 (+27.4%), NDCG 0.494 (+29.0%). YSC 2011, UEM: MAP 0.207, P@10 0.188, NDCG 0.295; SEM: MAP 0.261 (+26.1%), P@10 0.242 (+28.7%), NDCG 0.400 (+35.6%). The multi-fielded document approach improves the targeted measures by 26-36%
  46. Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs. He Hu, Xiaoyang Du. FSKD 2012
  47. Approach I. 2- to 5-grams are considered while indexing entity URIs as well as literals, and URIs are treated as hierarchical names. The entity-query similarity scores are computed as: sim_URI(Q) = ngram_hit_count / [(||Q| − |URI.path|| + 1) · (URI.depth + 1)]; sim_LITERAL(Q) = ngram_hit_count / (||Q| − |LITERAL.length|| + 1)
  48. Approach II. Ranking score: Score_URI(Q) = 1 − e^{−sim(Q)}. The method takes advantage of an iterative, PageRank-like weight propagation: W_URI_hit(i + 1) = α · W_URI_hit(i); W_URI_unhit(i + 1) = (1 − α) · W_URI_hit_neighbors(i) / N_URI_hit_neighbors. Improvement of up to 80% w.r.t. the plain n-gram ranker
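The propagation step can be sketched as a simple fixed-point iteration (an illustration under my own data-structure assumptions; the original operates on massive RDF graphs with appropriate indexing):

```python
def propagate(weights, hits, neighbors, alpha=0.85, iterations=10):
    """PageRank-like weight propagation sketch: URIs hit by the n-gram
    matcher keep a damped (alpha) share of their weight, while unhit
    URIs receive a (1 - alpha) share of the average weight of their
    hit neighbors.

    weights: {uri: initial weight}; hits: set of matched URIs;
    neighbors: {uri: list of adjacent URIs in the RDF graph}.
    """
    w = dict(weights)
    for _ in range(iterations):
        nxt = {}
        for uri in w:
            if uri in hits:
                nxt[uri] = alpha * w[uri]
            else:
                hit_nb = [n for n in neighbors.get(uri, []) if n in hits]
                nxt[uri] = ((1 - alpha) * sum(w[n] for n in hit_nb) / len(hit_nb)
                            if hit_nb else 0.0)
        w = nxt
    return w
```

This lets entities that match no query n-grams themselves still accumulate weight from matching neighbors.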
  49. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. SIGIR 2012
  50. Hybrid Search System
  51. Structured Inverted Index. The following property values are considered as fields: URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama', etc.; Labels: values of a list of manually selected datatype properties; Attributes: other properties. BM25F is used as the ranking function
  52. Graph-based Entity Search. (1) Given a query q, obtain a list of entities Retr = {e₁, e₂, ..., e_n} ranked by their BM25F scores. (2) Use the top-N elements as seeds for graph traversal. (3) To get StructRetr = {e′₁, ..., e′_m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e′) > τ. (4) Combine the two rankings: finalScore(q, e′) = λ · BM25(q, e′) + (1 − λ) · JW(q, e′). (‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect)
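Step 4 above can be sketched as a straightforward linear fusion (an illustration; the score maps and the reranking over the candidate union are my assumptions, not the paper's exact pipeline):

```python
def fuse_rankings(retr, struct_retr, bm25, jw, lam=0.5):
    """Merge the keyword results (Retr) with the graph-expanded results
    (StructRetr) and rerank every candidate by the linear combination
    finalScore = lam * BM25 + (1 - lam) * JW.

    bm25, jw: {entity: precomputed score}; entities missing from a map
    contribute 0 for that component.
    """
    candidates = set(retr) | set(struct_retr)
    scored = {e: lam * bm25.get(e, 0.0) + (1 - lam) * jw.get(e, 0.0)
              for e in candidates}
    return sorted(candidates, key=lambda e: scored[e], reverse=True)
```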
  53. Evaluation. The graph-based approach (S1_1) outperforms BM25 scoring with a 25% improvement in MAP on the 2010 data set. There is no significant improvement over the baseline on the 2011 data set; this may be explained by the scarcity of the exploited predicates (the owl:sameAs volume is < 0.7%)
  54. Improving Entity Search over Linked Data by Modeling Latent Semantics. Nikita Zhiltsov, Eugene Agichtein. CIKM 2013
  55. Key Contributions. A tensor factorization based approach to incorporate semantic link information into the ranking model; it outperforms the state-of-the-art baseline in NDCG/MAP/P@10. A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set. ⇒ more details in the next talk
  56. Negative Results. The ideas that do not work out
  57. Negative Results. Ideas from standard IR that do not work out: WordNet-based query expansion [Tonon et al., SIGIR 2012]; pseudo-relevance feedback [Tonon et al., SIGIR 2012]; query suggestions of a commercial search engine [Tonon et al., SIGIR 2012]; direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2010; Dali et al., 2012]
  58. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  59. Wrap-up. Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR. There is a robust and consistent evaluation methodology for it. State-of-the-art approaches revolve around applications of well-known IR methods; approaches for leveraging semantic links are still lacking. Lots of data: scalability really matters
  60. Thanks for your attention!
