A Survey of Entity Ranking over
RDF Graphs
Nikita Zhiltsov
Kazan Federal University
Russia
November 29, 2013

Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion

Motivation
An increasing amount of valuable semi-structured data has become available online, e.g.:
RDF graphs: the Linking Open Data (LOD) cloud
Web pages enhanced with microformats, RDFa etc.: CommonCrawl, Web Data Commons
Google: Freebase Annotations of the ClueWeb Corpora
More than half of the queries in real query logs have an entity-centric user intent
Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands

Google Knowledge Graph (screenshot)

Facebook Graph Search (screenshot)

Yandex Islands (screenshot)

Overview of Semantic Search Approaches
T. Tran, P. Mika. Semantic Search - Systems, Concepts, Methods and Communities behind It

Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion

In this talk, we focus on entity ranking over RDF graphs given a keyword search query

Key Issues in Entity Ranking
Ambiguity in names
Related entities from heterogeneous data sources
Complex queries with clarifying terms

Key Issues in Entity Ranking
Ambiguity in names
Given the query university of michigan, candidates include University of Michigan, Ann Arbor; Central Michigan University; Michigan Technological University; Michigan State University

Key Issues in Entity Ranking
Related entities from heterogeneous data sources
Given the query harry potter movie, semantic link information can effectively enhance the term context

Key Issues in Entity Ranking
Complex queries with clarifying terms
Given the query shobana masala, the user intent is likely Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre

Ad-hoc Object Retrieval in the Web of Data
Jeffrey Pound, Peter Mika, Hugo Zaragoza
WWW 2010

Query Categories
Entity query (∼40%∗), e.g. 1978 cj5 jeep
Type query† (∼12%), e.g. doctors in barcelona
Attribute query (∼5%), e.g. zip code atlanta
Other query (∼36%); however, ∼14% of them contain a context entity or type
∗ estimated on real query logs from Yahoo!
† a.k.a. list search query

Repeatable and Reliable Search System Evaluation using Crowdsourcing
Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh D. Tran
SIGIR 2011

Data Collection
Billion Triples Challenge 2009 RDF data set
The uncompressed data is 247 GB: 1.4B triples describing 114 million objects
It was composed by combining the crawls of multiple RDF search engines

Data Collection: Classes (chart)

Data Collection: Properties (chart)

Data Collection: Sources (chart)

Query Set Preparation
1 Emulate top queries: from a Microsoft Live Search log containing queries repeated by at least 10 different users, sample 50 queries prefiltered with a NER and a gazetteer
2 Emulate long-tailed queries: from the Yahoo! Search Query Log Tiny Sample v1.0 (4,500 queries), sample and manually filter out ambiguous queries ⇒ 42 queries
3 ⇒ a list of 92 queries

Crowdsourcing Judgements
A purpose-built rendering tool presents the search results
The evaluation (MT1) was conducted and then repeated (MT2) after 6 months
Using Amazon Mechanical Turk HITs
Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts
64 workers for MT1 and 69 workers for MT2

Rendering Tool (screenshot)

Analysis of Results: Repeatability
The level of agreement is the same for the two pools
The rank order of the systems is unchanged

Targeting Evaluation Measures I
All the measures are usually computed on the top-10 search results (k = 10):
1 P@k (precision at k):
  P@k(\pi, l) = \frac{1}{k} \sum_{t \le k} I_{\{l_{\pi(t)} = 1\}}
2 MAP (mean average precision):
  AP(\pi, l) = \frac{1}{m_1} \sum_{k=1}^{m} P@k \cdot I_{\{l_{\pi(k)} = 1\}}
  MAP = mean of AP over all queries

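These two measures can be sketched in a few lines, assuming binary relevance labels (rels[i] = 1 if the i-th ranked result is relevant, 0 otherwise); the example ranking is made up:

```python
def precision_at_k(rels, k):
    """P@k: fraction of relevant results among the top k."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """AP: sum of P@k at the ranks k where a relevant result appears,
    normalized by the total number of relevant results (m_1)."""
    if num_relevant == 0:
        return 0.0
    score = 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            score += precision_at_k(rels, k)
    return score / num_relevant

# Toy ranking: relevant results at ranks 1, 3, and 5 out of 5 returned.
rels = [1, 0, 1, 0, 1]
print(precision_at_k(rels, 5))        # 0.6
print(average_precision(rels, 3))     # AP = (1 + 2/3 + 3/5) / 3 ≈ 0.756
```

MAP is then simply the mean of average_precision over all queries in the benchmark.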
Targeting Evaluation Measures II
3 NDCG (normalized discounted cumulative gain):
  DCG@k(\pi, l) = \sum_{j=1}^{k} G(l_{\pi(j)}) \cdot \eta(j),
  where G(·), the rating of a document, is usually G(z) = 2^z - 1, \eta(j) = \frac{1}{\log(j+1)}, and l_{\pi(j)} \in \{0, 1, 2\}
  NDCG@k(\pi, l) = \frac{1}{Z_k} DCG@k(\pi, l)

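The graded measure can be sketched the same way; the choice of log base scales DCG uniformly and cancels in the NDCG ratio, so natural log is used below. The label list is a toy example:

```python
import math

def dcg_at_k(labels, k):
    # G(z) = 2^z - 1, eta(j) = 1 / log(j + 1), graded labels in {0, 1, 2}.
    return sum((2 ** z - 1) / math.log(j + 1)
               for j, z in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    # Z_k is the DCG of the ideal (label-sorted) ranking.
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

labels = [2, 0, 1, 2, 0]      # graded judgments in rank order
print(ndcg_at_k(labels, 5))   # ≈ 0.889
```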
Analysis of Results: Reliability

Metric  Difference
MAP     1.8%
NDCG    3.5%
P@10    12.8%

In this setting, experts rate more results as negative than workers do
P@10 is more fragile than MAP and NDCG

Yahoo! SemSearch Challenge (YSC) 2010 & 2011
http://semsearch.yahoo.com

Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion

Entity Search Track Submission by Yahoo! Research Barcelona
Roi Blanco, Peter Mika, Hugo Zaragoza
SSW at WWW 2010

YSC 2010 Winner Approach
Only RDF S-P-O triples with literal objects are considered
Triples are filtered by predicates from a predefined list of 300 predicates
Triples about the same subject are grouped into a pseudo-document with multiple fields
The BM25F ranking formula is applied (the weighting scheme w_c is handcrafted):
  BM25F = \sum_{t \in q \cap d} \frac{tf(t, d)}{k_1 + b \cdot tf(t, d)} \cdot idf(t),
  tf(t, d) = \sum_{c \in d} w_c \cdot tf_c(t, d)

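A minimal sketch of this fielded scoring scheme, not the tuned submission: k1, b, the field weights, the idf values, and the toy pseudo-document are all illustrative assumptions, not the values used by the system:

```python
def bm25f_score(query_terms, doc_fields, field_weights, idf, k1=1.2, b=0.75):
    """Fielded BM25 as in the formula above: per-field term frequencies are
    combined with handcrafted weights w_c, then saturated per query term."""
    score = 0.0
    for t in query_terms:
        # tf(t, d) = sum over fields c of w_c * tf_c(t, d)
        tf = sum(field_weights[c] * doc_fields[c].count(t) for c in doc_fields)
        score += (tf / (k1 + b * tf)) * idf.get(t, 0.0)
    return score

# Hypothetical pseudo-document built from triples sharing one subject.
doc = {"name": ["harry", "potter"], "attributes": ["fantasy", "film", "2001"]}
weights = {"name": 2.0, "attributes": 1.0}
idf = {"harry": 2.3, "potter": 2.5, "movie": 1.1}
print(bm25f_score(["harry", "potter", "movie"], doc, weights, idf))
```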
Sindice BM25MF at SemSearch 2011
Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello
SSW at WWW 2011

YSC 2011 Winner Approach I
URI resolution for triple objects
An extended BM25F approach with additional normalization of term frequencies per predicate type:
  the weighting scheme is handcrafted
  the proportion of query terms in entity literals is taken into account

YSC 2011 Winner Approach II
RDF graph example (figure)

YSC 2011 Winner Approach III
Star-shaped query matching the entity (figure)

YSC 2011 Winner Approach IV
Empirical weights (figure)

On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data
Robert Neumayer, Krisztian Balog, Kjetil Nørvåg
ECIR 2012

Approach to entity representation I
RDF graph example (figure)

Approach to entity representation II
a) Unstructured Entity Model; b) Structured Entity Model (figure)

Main Findings
Two generative language models (LMs) for the task:
Unstructured Entity Model
Structured Entity Model
The evaluation on the YSC data shows that representing relations as a mixture of predicate type LMs can contribute significantly to overall performance

LM Retrieval Framework
P(e|q) = \frac{P(q|e) P(e)}{P(q)} \stackrel{rank}{=} P(q|e) P(e),
where P(e|q) is the probability of the entity being relevant given query q
Further assumptions: (i) P(e) is uniform; (ii) query terms are i.i.d.
Let \theta_e be the entity model that predicts how likely the entity would produce a given term t; then the query likelihood is
P(q|\theta_e) = \prod_{t \in q} P(t|\theta_e)^{tf(t, q)}

Unstructured Entity Model
Idea: collapse all text values of properties associated with the entity into a single document and apply standard IR techniques
The entity model is a Dirichlet-smoothed multinomial distribution:
P(t|\theta_e) = \frac{tf(t, e) + \mu P(t|\theta_c)}{|e| + \mu}

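The smoothed estimate and the query likelihood above can be sketched as follows; the value of mu, the toy entity text, and the flat background model are assumptions for illustration:

```python
from collections import Counter

def dirichlet_lm(term, entity_terms, collection_prob, mu=2000):
    """P(t|theta_e) = (tf(t, e) + mu * P(t|theta_c)) / (|e| + mu)."""
    tf = Counter(entity_terms)
    return (tf[term] + mu * collection_prob(term)) / (len(entity_terms) + mu)

def query_likelihood(query_terms, entity_terms, collection_prob, mu=2000):
    # P(q|theta_e) = product over query terms of P(t|theta_e)^tf(t, q)
    p = 1.0
    for t in query_terms:
        p *= dirichlet_lm(t, entity_terms, collection_prob, mu)
    return p

# Toy "collapsed document" for one entity; flat stand-in for P(t|theta_c).
entity = ["university", "of", "michigan", "ann", "arbor", "public", "university"]
background = lambda t: 1e-4
print(query_likelihood(["university", "michigan"], entity, background))
```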
Structured Entity Model: Folding Predicates
Group RDF triples by the following predicate types p_t:
Name, e.g. literal values of foaf:name, rdfs:label
Attributes, i.e. remaining datatype properties
OutRelations: resolving "object" (O) URIs in S-P-O triples to get their names
InRelations: resolving "subject" (S) URIs in S-P-O triples to get their names

Structured Entity Model: Mixture of Language Models
Each group has its own LM P(t|\theta_e^{p_t}):
P(t|\theta_e^{p_t}) = \frac{tf(t, p_t, e) + \mu_{p_t} P(t|\theta_c^{p_t})}{|p_t, e| + \mu_{p_t}}
Then the entity model is a linear mixture of the predicate type LMs:
P(t|\theta_e) = \sum_{p_t} P(t|\theta_e^{p_t}) P(p_t)

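The mixture can be sketched by smoothing one LM per predicate type and combining them; the priors P(pt), the mu value, the flat background model, and the toy fields are all assumptions:

```python
from collections import Counter

def field_lm(term, field_terms, collection_prob, mu=100):
    """Dirichlet-smoothed LM for one predicate type p_t."""
    tf = Counter(field_terms)
    return (tf[term] + mu * collection_prob(term)) / (len(field_terms) + mu)

def mixture_prob(term, fields, field_priors, collection_prob):
    # P(t|theta_e) = sum over predicate types pt of P(t|theta_e^pt) * P(pt)
    return sum(field_priors[pt] * field_lm(term, fields[pt], collection_prob)
               for pt in fields)

# Toy entity with the four predicate-type groups from the previous slide.
fields = {
    "name": ["harry", "potter"],
    "attributes": ["fantasy", "novel"],
    "out_relations": ["warner", "bros"],
    "in_relations": ["daniel", "radcliffe"],
}
priors = {"name": 0.4, "attributes": 0.3, "out_relations": 0.2, "in_relations": 0.1}
background = lambda t: 1e-4
print(mixture_prob("potter", fields, priors, background))
```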
Comparative Evaluation

           MAP             P@10            NDCG
YSC 2010
  UEM      0.207           0.314           0.383
  SEM      0.282 (+36.2%)  0.400 (+27.4%)  0.494 (+29.0%)
YSC 2011
  UEM      0.207           0.188           0.295
  SEM      0.261 (+26.1%)  0.242 (+28.7%)  0.400 (+35.6%)

The multi-fielded document approach improves the targeted measures by 26-36%

Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs
He Hu, Xiaoyang Du
FSKD 2012

Approach I
Considering 2- to 5-grams while indexing entity URIs as well as literals
Thinking of URIs as hierarchical names
Computing the entity-query similarity scores:
sim_{URI}(Q) = \frac{ngram\_hit\_count}{(||Q| - |URI.path|| + 1) \cdot (URI.depth + 1)}
sim_{LITERAL}(Q) = \frac{ngram\_hit\_count}{||Q| - |LITERAL.length|| + 1}

Approach II
Ranking score:
Score_{URI}(Q) = 1 - e^{-sim(Q)}
Taking advantage of iterative PageRank-like weight propagation:
W_{URI\_hit}(i + 1) = \alpha \cdot W_{URI\_hit}(i)
W_{URI\_unhit}(i + 1) = (1 - \alpha) \cdot \frac{W_{URI\_hit\_neighbors}(i)}{N_{URI\_hit\_neighbors}}
Improvement of up to 80% w.r.t. the plain n-gram ranker

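The propagation step can be sketched as below: entities hit by the n-gram matcher keep a damped share of their weight, while unhit entities receive the averaged weight of their hit neighbours. The toy graph, alpha, and the iteration count are assumptions, not the paper's settings:

```python
def propagate(weights, hits, neighbors, alpha=0.85, iterations=3):
    """One pass applies the two update rules above to every node."""
    for _ in range(iterations):
        new = {}
        for node in weights:
            if node in hits:
                # W_hit(i + 1) = alpha * W_hit(i)
                new[node] = alpha * weights[node]
            else:
                # W_unhit(i + 1) = (1 - alpha) * mean weight of hit neighbours
                hit_nb = [n for n in neighbors.get(node, []) if n in hits]
                if hit_nb:
                    new[node] = (1 - alpha) * sum(weights[n] for n in hit_nb) / len(hit_nb)
                else:
                    new[node] = 0.0
        weights = new
    return weights

# Toy graph: 'a' and 'b' are n-gram hits; 'c' is linked to both.
weights = {"a": 1.0, "b": 0.5, "c": 0.0}
print(propagate(weights, {"a", "b"}, {"c": ["a", "b"]}, iterations=1))
```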
Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
SIGIR 2012

Hybrid Search System (architecture figure)

Structured Inverted Index
Consider the following property values as fields:
URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama' etc.
Labels: values of a list of manually selected datatype properties
Attributes: other properties
BM25F is used as the ranking function

Graph-based Entity Search
1 Given a query q, obtain a list of entities Retr = {e_1, e_2, ..., e_n} ranked by their BM25F scores
2 Use the top-N elements as seeds for graph traversal
3 To get StructRetr = {e'_1, ..., e'_m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e') > \tau
4 Combine the two rankings:
finalScore(q, e') = \lambda \cdot BM25(q, e) + (1 - \lambda) \cdot JW(q, e')
‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect

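The score combination can be sketched as below. This is not the paper's implementation: difflib's SequenceMatcher ratio stands in for Jaro-Winkler, the tau-threshold handling is one plausible reading of the pruning step, and lambda, tau, and the example data are illustrative:

```python
import difflib

def string_sim(query, label):
    # Stand-in for Jaro-Winkler: difflib's similarity ratio in [0, 1].
    return difflib.SequenceMatcher(None, query, label).ratio()

def final_score(query, entity_label, bm25_score, lam=0.7, tau=0.5):
    """finalScore = lambda * BM25 + (1 - lambda) * sim, with weak string
    matches pruned, mirroring the JW(q, e') > tau condition above."""
    sim = string_sim(query, entity_label)
    if sim <= tau:
        return lam * bm25_score
    return lam * bm25_score + (1 - lam) * sim

# Hypothetical candidate with an exact label match.
print(final_score("barack obama", "barack obama", bm25_score=3.2))  # ≈ 2.54
```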
Evaluation
The graph-based approach (S1_1) outperforms BM25 scoring with a 25% improvement in MAP on the 2010 data set
No significant improvement over the baseline on the 2011 data set
This may be explained by the scarcity of the exploited predicates (owl:sameAs volume < 0.7%)

Improving Entity Search over Linked Data by Modeling Latent Semantics
Nikita Zhiltsov, Eugene Agichtein
CIKM 2013

Key Contributions
A tensor factorization based approach to incorporate semantic link information into the ranking model
Outperforms the state-of-the-art baseline in NDCG/MAP/P@10
A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set
⇒ more details in the next talk

Negative Results
The ideas that do not work out

Negative Results
The ideas from standard IR that do not work out:
WordNet-based query expansion [Tonon et al., SIGIR 2012]
Pseudo-relevance feedback [Tonon et al., SIGIR 2012]
Query suggestions of a commercial search engine [Tonon et al., SIGIR 2012]
Direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2010; Dali et al., 2012]

Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion

Wrap up
Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR
There is a robust and consistent evaluation methodology for it
State-of-the-art approaches revolve around applications of well-known IR methods
There is a lack of approaches for leveraging semantic links
Lots of data: scalability really matters

Thanks for your attention!
