
A Survey of Entity Ranking over RDF Graphs



An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").



  1. A Survey of Entity Ranking over RDF Graphs. Nikita Zhiltsov, Kazan Federal University, Russia. November 29, 2013
  2. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  3. Motivation. An increasing amount of valuable semi-structured data has become available online, e.g. RDF graphs: the Linking Open Data (LOD) cloud; Web pages enhanced with microformats, RDFa, etc. (CommonCrawl, Web Data Commons); Google's Freebase Annotations of the ClueWeb Corpora. More than half of the queries in real query logs have an entity-centric user intent. Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands
  4. Google Knowledge Graph
  5. Facebook Graph Search
  6. Yandex Islands
  7. Overview of Semantic Search Approaches. T. Tran, P. Mika. Semantic Search: Systems, Concepts, Methods and Communities behind It
  8. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  9. In this talk, we focus on entity ranking over RDF graphs given a keyword search query
  10. Key Issues in Entity Ranking: ambiguity in names; related entities from heterogeneous data sources; complex queries with clarifying terms
  11. Key Issues in Entity Ranking: Ambiguity in names. Given the query university of michigan: University of Michigan, Ann Arbor; Central Michigan University; Michigan Technological University; Michigan State University
  12. Key Issues in Entity Ranking: Related entities from heterogeneous data sources. Given the query harry potter movie, semantic link information can effectively enhance the term context
  13. Key Issues in Entity Ranking: Complex queries with clarifying terms. Given the query shobana masala, the user intent is likely Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre
  14. Ad-hoc Object Retrieval in the Web of Data. Jeffrey Pound, Peter Mika, Hugo Zaragoza. WWW 2010
  15. Query Categories: Entity query (∼40%∗), e.g. 1978 cj5 jeep; Type query† (∼12%), e.g. doctors in barcelona; Attribute query (∼5%), e.g. zip code atlanta; Other query (∼36%), though ∼14% of these contain a context entity or type. (∗ estimated on real query logs from Yahoo!; † a.k.a. list search query)
  16. Repeatable and Reliable Search System Evaluation using Crowdsourcing. Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh D. Tran. SIGIR 2011
  17. Data Collection. The Billion Triples Challenge 2009 RDF data set: 247 GB uncompressed; 1.4B triples describing 114 million objects. It was composed by combining the crawls of multiple RDF search engines
  18. Data Collection: Classes
  19. Data Collection: Properties
  20. Data Collection: Sources
  21. Query Set Preparation. (1) Emulate top queries: from the Microsoft Live Search log (queries repeated by at least 10 different users), sample 50 queries pre-filtered with a NER and a gazetteer. (2) Emulate long-tail queries: from the Yahoo! Search Query Log Tiny Sample v1.0 (4,500 queries), sample and manually filter out ambiguous queries ⇒ 42 queries. (3) ⇒ a list of 92 queries
  22. Crowdsourcing Judgements. A purpose-built rendering tool presents the search results. The evaluation (MT1) was conducted and then repeated (MT2) after 6 months, using Amazon Mechanical Turk HITs. Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts. 64 workers participated in MT1 and 69 in MT2
  23. Rendering Tool
  24. Analysis of Results: Repeatability. The level of agreement is the same for the two pools; the rank order of the systems is unchanged
  25. Targeted Evaluation Measures I. All the measures are usually computed on the top-10 search results (k = 10). (1) P@k (precision at k): P@k(π, l) = (1/k) Σ_{t≤k} I{l_π(t) = 1}. (2) MAP (mean average precision): AP(π, l) = (1/m₁) Σ_{k=1}^{m} P@k · I{l_π(k) = 1}, where m₁ is the number of relevant results; MAP is the mean of AP over all queries
  26. Targeted Evaluation Measures II. (3) NDCG (normalized discounted cumulative gain): DCG@k(π, l) = Σ_{j=1}^{k} G(l_π(j)) · η(j), where G(·), the rating of a document, is usually G(z) = 2^z − 1, η(j) = 1/log(j + 1), and l_π(j) ∈ {0, 1, 2}; NDCG@k(π, l) = (1/Z_k) · DCG@k(π, l)
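As a concrete reference, the three measures above can be sketched in Python (an illustrative implementation; function names are mine, and the normalizer Z_k is computed from the ideal descending-grade ordering):

```python
import math

def precision_at_k(labels, k):
    """P@k: fraction of the top-k ranked results that are relevant.
    `labels` is the ranked list of binary relevance labels l_pi(1..n)."""
    return sum(labels[:k]) / k

def average_precision(labels):
    """AP: mean of P@k over the ranks k that hold relevant results."""
    m1 = sum(labels)  # number of relevant results in the ranking
    if m1 == 0:
        return 0.0
    return sum(precision_at_k(labels, k + 1)
               for k, l in enumerate(labels) if l == 1) / m1

def ndcg_at_k(grades, k):
    """NDCG@k with G(z) = 2^z - 1 and discount 1 / log2(j + 1).
    `grades` is the ranked list of graded labels in {0, 1, 2}."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(j + 2)
                   for j, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))  # Z_k
    return dcg(grades) / ideal if ideal > 0 else 0.0
```

MAP over a query set is then just the mean of `average_precision` over each query's ranked labels.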
  27. Analysis of Results: Reliability. Differences between the pools: MAP 1.8%; NDCG 3.5%; P@10 12.8%. In this setting, experts rate more results as negative than workers do; P@10 is more fragile than MAP and NDCG
  28. Yahoo! SemSearch Challenge (YSC) 2010 & 2011: http://semsearch.yahoo.com
  29. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  30. Entity Search Track Submission by Yahoo! Research Barcelona. Roi Blanco, Peter Mika, Hugo Zaragoza. SSW at WWW 2010
  31. YSC 2010 Winner Approach. Only RDF S-P-O triples with literal objects are considered. Triples are filtered by predicates from a predefined list of 300 predicates. Triples about the same subject are grouped into a pseudo document with multiple fields. The BM25F ranking formula is applied, with a handcrafted field weighting scheme w_c: BM25F = Σ_{t∈q∩d} [tf(t, d) / (k₁ + b · tf(t, d))] · idf(t), where tf(t, d) = Σ_{c∈d} w_c · tf_c(t, d)
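The field-weighted saturation idea can be sketched as follows (a simplified illustration, not the authors' code: per-field length normalization is omitted, and all names and default parameters are my assumptions):

```python
def bm25f_score(query_terms, doc_fields, field_weights, idf, k1=1.2, b=0.75):
    """Sketch of fielded BM25F over an entity pseudo document.

    doc_fields: {field: {term: raw term frequency}} for one pseudo
    document built from the grouped triples of an entity.
    field_weights: handcrafted per-field weights w_c.
    idf: {term: inverse document frequency}.
    """
    score = 0.0
    for t in query_terms:
        # combined term frequency: weighted sum over fields
        tf = sum(w * doc_fields.get(c, {}).get(t, 0)
                 for c, w in field_weights.items())
        if tf > 0:
            # saturating tf, scaled by the term's rarity
            score += tf / (k1 + b * tf) * idf.get(t, 0.0)
    return score
```

The handcrafted weights give, e.g., name-like fields a larger w_c than generic attribute fields.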
  32. Sindice BM25MF at SemSearch 2011. Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello. SSW at WWW 2011
  33. YSC 2011 Winner Approach I. URI resolution for triple objects. An extended BM25F approach with additional normalization of term frequencies per predicate type; the weighting scheme is handcrafted. The proportion of query terms occurring in entity literals is also taken into account
  34. YSC 2011 Winner Approach II. RDF graph example
  35. YSC 2011 Winner Approach III. Star-shaped query matching the entity
  36. YSC 2011 Winner Approach IV. Empirical weights
  37. On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data. Robert Neumayer, Krisztian Balog, Kjetil Nørvåg. ECIR 2012
  38. Approach to Entity Representation I. RDF graph example
  39. Approach to Entity Representation II. a) Unstructured Entity Model; b) Structured Entity Model
  40. Main Findings. Two generative language models (LMs) for the task: the Unstructured Entity Model and the Structured Entity Model. The evaluation on the YSC data shows that representing relations as a mixture of predicate type LMs can contribute significantly to overall performance
  41. LM Retrieval Framework. P(e|q) = P(q|e)P(e) / P(q), which is rank-equivalent to P(q|e)P(e), where P(e|q) is the probability of entity e being relevant given query q. Further assumptions: (i) P(e) is uniform; (ii) query terms are i.i.d. Let θ_e be the entity model that predicts how likely the entity would produce a given term t; then the query likelihood is P(q|θ_e) = Π_{t∈q} P(t|θ_e)^{tf(t, q)}
  42. Unstructured Entity Model. Idea: collapse all text values of properties associated with the entity into a single document and apply standard IR techniques. The entity model is a Dirichlet-smoothed multinomial distribution: P(t|θ_e) = (tf(t, e) + µ · P(t|θ_c)) / (|e| + µ)
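The smoothed query likelihood can be sketched in Python (an illustration, not the authors' implementation; the log of the likelihood is computed for numerical stability, and terms unseen in the entire collection are skipped):

```python
import math
from collections import Counter

def query_log_likelihood(query, entity_doc, collection, mu=2000):
    """Unstructured Entity Model: collapse all property values of an
    entity into one token list and score it with a Dirichlet-smoothed
    query likelihood.

    P(t|theta_e) = (tf(t, e) + mu * P(t|theta_c)) / (|e| + mu)
    log P(q|theta_e) = sum_t tf(t, q) * log P(t|theta_e)
    """
    tf_e = Counter(entity_doc)
    tf_c = Counter(collection)
    c_len = sum(tf_c.values())
    e_len = sum(tf_e.values())
    score = 0.0
    for t, tf_q in Counter(query).items():
        p_c = tf_c[t] / c_len          # collection (background) model
        p_t = (tf_e[t] + mu * p_c) / (e_len + mu)
        if p_t > 0:                    # skip zero-probability terms
            score += tf_q * math.log(p_t)
    return score
```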
  43. Structured Entity Model: Folding Predicates. Group RDF triples by the following predicate types p_t: Name, e.g. literal values of foaf:name, rdfs:label; Attributes, i.e. the remaining datatype properties; OutRelations, resolving the "object" (O) URIs in S-P-O triples to their names; InRelations, resolving the "subject" (S) URIs in S-P-O triples to their names
  44. Structured Entity Model: Mixture of Language Models. Each group has its own LM P(t|θ_e^{p_t}): P(t|θ_e^{p_t}) = (tf(t, p_t, e) + µ_{p_t} · P(t|θ_c^{p_t})) / (|p_t, e| + µ_{p_t}). Then the entity model is a linear mixture of the predicate type LMs: P(t|θ_e) = Σ_{p_t} P(t|θ_e^{p_t}) · P(p_t)
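A minimal sketch of the mixture, reusing Dirichlet smoothing per predicate type (field names, the priors P(p_t), and the µ value are illustrative assumptions, not the paper's tuned settings):

```python
from collections import Counter

def mixture_prob(term, field_docs, field_collections, field_priors, mu=100):
    """Structured Entity Model: P(t|theta_e) as a linear mixture of
    per-predicate-type Dirichlet-smoothed LMs (e.g. Name, Attributes,
    OutRelations, InRelations).

    field_docs: {predicate_type: tokens of the entity in that field}
    field_collections: {predicate_type: tokens of the whole collection}
    field_priors: {predicate_type: P(p_t)}, summing to 1
    """
    p = 0.0
    for pt, prior in field_priors.items():
        tf_e = Counter(field_docs.get(pt, []))
        tf_c = Counter(field_collections[pt])
        c_len = sum(tf_c.values())
        e_len = sum(tf_e.values())
        p_c = tf_c[term] / c_len if c_len else 0.0
        # per-field Dirichlet-smoothed probability P(t | theta_e^{p_t})
        p_field = (tf_e[term] + mu * p_c) / (e_len + mu)
        p += prior * p_field
    return p
```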
  45. Comparative Evaluation. YSC 2010, UEM: MAP 0.207, P@10 0.314, NDCG 0.383; SEM: MAP 0.282 (+36.2%), P@10 0.400 (+27.4%), NDCG 0.494 (+29.0%). YSC 2011, UEM: MAP 0.207, P@10 0.188, NDCG 0.295; SEM: MAP 0.261 (+26.1%), P@10 0.242 (+28.7%), NDCG 0.400 (+35.6%). The multi-fielded document approach improves the targeted measures by 26-36%
  46. Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs. He Hu, Xiaoyang Du. FSKD 2012
  47. Approach I. 2- to 5-grams are considered while indexing entity URIs as well as literals, and URIs are treated as hierarchical names. The entity-query similarity scores are computed as: sim_URI(Q) = ngram_hit_count / [(||Q| − |URI.path|| + 1) · (URI.depth + 1)]; sim_LITERAL(Q) = ngram_hit_count / (||Q| − |LITERAL.length|| + 1)
  48. Approach II. Ranking score: Score_URI(Q) = 1 − e^{−sim(Q)}. The method takes advantage of an iterative, PageRank-like weight propagation: W_URI_hit(i + 1) = α · W_URI_hit(i); W_URI_unhit(i + 1) = (1 − α) · W_URI_hit_neighbors(i) / N_URI_hit_neighbors. Improvement of up to 80% w.r.t. the plain n-gram ranker
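The propagation step can be sketched as a simple fixed-point iteration (an illustration under my own data-structure assumptions; the original operates on massive RDF graphs with appropriate indexing):

```python
def propagate(weights, hits, neighbors, alpha=0.85, iterations=10):
    """PageRank-like weight propagation sketch: URIs hit by the n-gram
    matcher keep a damped (alpha) share of their weight, while unhit
    URIs receive a (1 - alpha) share of the average weight of their
    hit neighbors.

    weights: {uri: initial weight}; hits: set of matched URIs;
    neighbors: {uri: list of adjacent URIs in the RDF graph}.
    """
    w = dict(weights)
    for _ in range(iterations):
        nxt = {}
        for uri in w:
            if uri in hits:
                nxt[uri] = alpha * w[uri]
            else:
                hit_nb = [n for n in neighbors.get(uri, []) if n in hits]
                nxt[uri] = ((1 - alpha) * sum(w[n] for n in hit_nb) / len(hit_nb)
                            if hit_nb else 0.0)
        w = nxt
    return w
```

This lets entities that match no query n-grams themselves still accumulate weight from matching neighbors.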
  49. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. SIGIR 2012
  50. Hybrid Search System
  51. Structured Inverted Index. The following property values are considered as fields: URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama', etc.; Labels: values of a list of manually selected datatype properties; Attributes: other properties. BM25F is used as the ranking function
  52. Graph-based Entity Search. (1) Given a query q, obtain a list of entities Retr = {e₁, e₂, ..., e_n} ranked by their BM25F scores. (2) Use the top-N elements as seeds for graph traversal. (3) To get StructRetr = {e′₁, ..., e′_m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e′) > τ. (4) Combine the two rankings: finalScore(q, e′) = λ · BM25(q, e′) + (1 − λ) · JW(q, e′). (‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect)
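Step 4 above can be sketched as a straightforward linear fusion (an illustration; the score maps and the reranking over the candidate union are my assumptions, not the paper's exact pipeline):

```python
def fuse_rankings(retr, struct_retr, bm25, jw, lam=0.5):
    """Merge the keyword results (Retr) with the graph-expanded results
    (StructRetr) and rerank every candidate by the linear combination
    finalScore = lam * BM25 + (1 - lam) * JW.

    bm25, jw: {entity: precomputed score}; entities missing from a map
    contribute 0 for that component.
    """
    candidates = set(retr) | set(struct_retr)
    scored = {e: lam * bm25.get(e, 0.0) + (1 - lam) * jw.get(e, 0.0)
              for e in candidates}
    return sorted(candidates, key=lambda e: scored[e], reverse=True)
```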
  53. Evaluation. The graph-based approach (S1_1) outperforms BM25 scoring with a 25% improvement in MAP on the 2010 data set. There is no significant improvement over the baseline on the 2011 data set; this may be explained by the scarcity of the exploited predicates (the owl:sameAs volume is < 0.7%)
  54. Improving Entity Search over Linked Data by Modeling Latent Semantics. Nikita Zhiltsov, Eugene Agichtein. CIKM 2013
  55. Key Contributions. A tensor factorization based approach to incorporate semantic link information into the ranking model; it outperforms the state-of-the-art baseline in NDCG/MAP/P@10. A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set. ⇒ more details in the next talk
  56. Negative Results. The ideas that do not work out
  57. Negative Results. Ideas from standard IR that do not work out: WordNet-based query expansion [Tonon et al., SIGIR 2012]; pseudo-relevance feedback [Tonon et al., SIGIR 2012]; query suggestions of a commercial search engine [Tonon et al., SIGIR 2012]; direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2010; Dali et al., 2012]
  58. Outline: (1) Introduction; (2) Task Statement and Evaluation Methodology; (3) Approaches; (4) Conclusion
  59. Wrap-up. Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR. There is a robust and consistent evaluation methodology for it. State-of-the-art approaches revolve around applications of well-known IR methods; approaches for leveraging semantic links are still lacking. Lots of data: scalability really matters
  60. Thanks for your attention!
