SlideShare a Scribd company logo
1 of 101
Download to read offline
Part II
Entity Retrieval
Tutorial organized by Radialpoint | Montreal, Canada, 2014
Krisztian Balog 

University of Stavanger
Entity retrieval
Addressing information needs that are better
answered by returning speciļ¬c objects
(entities) instead of just any type of documents.
6%
36%
1%5% 12%
41%
Entity (ā€œ1978 cj5 jeepā€)
Type (ā€œdoctors in barcelonaā€)
Attribute (ā€œzip code waterville Maineā€)
Relation (ā€œtom cruise katie holmesā€)
Other (ā€œnightlife in Barcelonaā€)
Uninterpretable
Distribution of web search
queries [Pound et al. 2010]
28%
15%
10% 4%
14%
29% Entity
Entity+reļ¬ner
Category
Category+reļ¬ner
Other
Website
Distribution of web search
queries [Lin et al. 2011]
Whatā€™s so special here?
- Entities are not always directly represented

- Recognize and disambiguate entities in text ā€Ø
(that is, entity linking)
- Collect and aggregate information about a given
entity from multiple documents and even multiple
data collections
- More structure than in document-based IR

- Types (from some taxonomy)
- Attributes (from some ontology)
- Relationships to other entities (ā€œtyped linksā€)
Semantics in our context
- working deļ¬nition:ā€Ø
references to meaningful structures
- How to capture, represent, and use structure?

- It concerns all components of the retrieval process!
Text-only representation Text+structure representation
info
need
entity
matching
Abc Abc
Abc
info
need
entity
matching
Abc
Abc
Abc
Overview of core tasks
Queries Data set Results
(adhoc) entity
retrieval
keyword
unstructured/ā€Ø
semi-structured
ranked list
keyword++ā€Ø
(target type(s))
semi-structured ranked list
list completion
keyword++ā€Ø
(examples)
semi-structured ranked list
related entity
ļ¬nding
keyword++ā€Ø
(target type, relation)
unstructured ā€Ø
& semi-structured
ranked list
In this part
- Input: keyword(++) query
- Output: a ranked list of entities

- Data collection: unstructured and
(semi)structured data sources (and their
combinations)

- Main RQ: How to incorporate structure into
text-based retrieval models?
Outline
1.Ranking based on
entity descriptions

2.Incorporating entity
types

3.Entity relationships
Attributes ā€Ø
(/Descriptions)
Type(s)
Relationships
Probabilistic models (mostly)
- Estimating conditional probabilities

!
!
- Conditional independence

!
!
- Conditional dependence
P(A|B)
P(A, B|C) = P(A|C) Ā· P(B|C)
P(A|B) = P(A)
P(A, B|C)
P(A, B|C) = P(A|B, C)P(B|C)
P(A|B) = P(B|A)P(A)/P(B)
Ranking entity
descriptions
Attributes ā€Ø
(/Descriptions)
Type(s)
Relationships
Task: ad-hoc entity retrieval
- Input: unconstrained natural language query

- ā€œtelegraphicā€ queries (neither well-formed nor
grammatically correct sentences or questions)
- Output: ranked list of entities

- Collection: unstructured and/or semi-
structured documents
Example information needs
meg ryan war
american embassy nairobi
ben franklin
Chernobyl
Worst actor century
Sweden Iceland currency
Two settings
1.With ready-made entity descriptionsā€Ø
ā€Ø
ā€Ø
ā€Ø
2.Without explicit entity representations
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
e
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
e
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
e
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx xxxxxx
xx x xxx xx x xxxx xx
xxx x xxxxxx xx x xxx xx
xxxx xx xxx xx x xxxxx
xxx xx x
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
Ranking with ready-made
entity descriptions
This is not unrealistic...
Document-based entity
representations
- Most entities have a ā€œhome pageā€

- I.e., each entity is described by a document

- In this scenario, ranking entities is much like
ranking documents

- unstructured
- semi-structured
Crash course into
Language modeling
Example
In the town where I was born,
Lived a man who sailed to sea,
And he told us of his life,ā€Ø
In the land of submarines,
So we sailed on to the sun,ā€Ø
Till we found the sea green,ā€Ø
And we lived beneath the
waves, In our yellow
submarine,
We all live in yellow
submarine, yellow submarine,
yellow submarine, We all live
in yellow submarine, yellow
submarine, yellow submarine.
Empirical document LM
0,00
0,03
0,06
0,08
0,11
0,14
submarine
yellow
we
all
live
lived
sailed
sea
beneath
born
found
green
he
his
i
land
life
man
our
so
submarines
sun
till
told
town
us
waves
where
who
P(t|d) =
n(t, d)
|d|
Alternatively...
Scoring a query
q = {sea, submarine}
P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d)
Language Modeling
Estimate a multinomial 

probability distribution 

from the text
Smooth the distribution

with one estimated from 

the entire collection
P(t|āœ“d) = (1 )P(t|d) + P(t|C)
Standard Language Modeling
approach
- Rank documents d according to their likelihood
of being relevant given a query q: P(d|q)
P(d|q) =
P(q|d)P(d)
P(q)
/ P(q|d)P(d)
Document priorā€Ø
Probability of the document ā€Ø
being relevant to any query
Query likelihoodā€Ø
Probability that query q ā€Ø
was ā€œproducedā€ by document d
P(q|d) =
Y
t2q
P(t|āœ“d)n(t,q)
Standard Language Modeling
approach (2)
Number of times t appears in q
Empirical ā€Ø
document modelā€Ø
Collection
model ā€Ø
Smoothing parameterā€Ø
Maximumā€Ø
likelihood ā€Ø
estimates
P(q|d) =
Y
t2q
P(t|āœ“d)n(t,q)
Document language modelā€Ø
Multinomial probability distribution
over the vocabulary of terms
P(t|āœ“d) = (1 )P(t|d) + P(t|C)
n(t, d)
|d|
P
d n(t, d)
P
d |d|
Scoring a query
q = {sea, submarine}
P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d)
0.04 0.00020.10.9
0.03602
t P(t|d)
submarine 0,14
sea 0,04
...
t P(t|C)
submarine 0,0001
sea 0,0002
...
(1 )P(ā€œseaā€|d) + P(ā€œseaā€|C)
Scoring a query
q = {sea, submarine}
P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d)
0.14 0.00010.10.9
0.03602
t P(t|d)
submarine 0,14
sea 0,04
...
t P(t|C)
submarine 0,0001
sea 0,0002
...
(1 )P(ā€œsubmarineā€|d) + P(ā€œsubmarineā€|C)
0.126010.04538
Here, documents==entities, so
P(e|q) / P(e)P(q|āœ“e) = P(e)
Y
t2q
P(t|āœ“e)n(t,q)
Entity priorā€Ø
Probability of the entity ā€Ø
being relevant to any query
Entity language modelā€Ø
Multinomial probability distribution
over the vocabulary of terms
Semi-structured entity
representation
- Entity description documents are rarely
unstructured

- Representing entities as 

- Fielded documents ā€“ the IR approach
- Graphs ā€“ the DB/SW approach
foaf:name Audi A4
rdfs:label Audi A4
rdfs:comment The Audi A4 is a compact executive car
produced since late 1994 by the German car
manufacturer Audi, a subsidiary of the
Volkswagen Group. The A4 has been built [...]
dbpprop:production 1994
2001
2005
2008
rdf:type dbpedia-owl:MeanOfTransportation
dbpedia-owl:Automobile
dbpedia-owl:manufacturer dbpedia:Audi
dbpedia-owl:class dbpedia:Compact_executive_car
owl:sameAs freebase:Audi A4
is dbpedia-owl:predecessor of dbpedia:Audi_A5
is dbpprop:similar of dbpedia:Cadillac_BLS
dbpedia:Audi_A4
Mixture of Language Models
[Ogilvie & Callan 2003]
- Build a separate language model for each ļ¬eld

- Take a linear combination of them
mX
j=1
Āµj = 1
Field language modelā€Ø
Smoothed with a collection model builtā€Ø
from all document representations of theā€Ø
same type in the collectionField weights
P(t|āœ“d) =
mX
j=1
ĀµjP(t|āœ“dj )
Setting ļ¬eld weights
- Heuristically 

- Proportional to the length of text content in that ļ¬eld,
to the ļ¬eldā€™s individual performance, etc.
- Empirically (using training queries)

- Problems

- Number of possible ļ¬elds is huge
- It is not possible to optimise their weights directly
- Entities are sparse w.r.t. diļ¬€erent ļ¬elds

- Most entities have only a handful of predicates
Predicate folding
- Idea: reduce the number of ļ¬elds by grouping
them together

- Grouping based on (BM25F and)

- type [PĆ©rez-AgĆ¼era et al. 2010]
- manually determined importance [Blanco et al. 2011]
Hierarchical Entity Model
[Neumayer et al. 2012]
- Organize ļ¬elds into a 2-level hierarchy

- Field types (4) on the top level
- Individual ļ¬elds of that type on the bottom level
- Estimate ļ¬eld weights

- Using training data for ļ¬eld types
- Using heuristics for bottom-level types
Two-level hierarchy
[Neumayer et al. 2012]
foaf:name Audi A4
rdfs:label Audi A4
rdfs:comment The Audi A4 is a compact executive car
produced since late 1994 by the German car
manufacturer Audi, a subsidiary of the
Volkswagen Group. The A4 has been built [...]
dbpprop:production 1994
2001
2005
2008
rdf:type dbpedia-owl:MeanOfTransportation
dbpedia-owl:Automobile
dbpedia-owl:manufacturer dbpedia:Audi
dbpedia-owl:class dbpedia:Compact_executive_car
owl:sameAs freebase:Audi A4
is dbpedia-owl:predecessor of dbpedia:Audi_A5
is dbpprop:similar of dbpedia:Cadillac_BLS
!
Nameā€Ø
Attributesā€Ø
Out-relationsā€Ø
In-relationsā€Ø
Comparison of models
d
dfF
...
t
dfF t
... ...d
tdf
...
tdf
...d
t
...
t
Unstructuredā€Ø
document model
Fieldedā€Ø
document model
Hierarchicalā€Ø
document model
Probabilistic Retrieval Model
for Semistructured data
[Kim et al. 2009]
- Extension to the Mixture of Language Models

- Find which document ļ¬eld each query term
may be associated with
Mapping probabilityā€Ø
Estimated for each query term
P(t|āœ“d) =
mX
j=1
ĀµjP(t|āœ“dj )
P(t|āœ“d) =
mX
j=1
P(dj|t)P(t|āœ“dj )
Estimating the mapping
probability
Term likelihood

Probability of a query term
occurring in a given ļ¬eld typeā€Ø
Prior ļ¬eld probabilityā€Ø
Probability of mapping the query term ā€Ø
to this ļ¬eld before observing collection
statistics
P(dj|t) =
P(t|dj)P(dj)
P(t)
X
dk
P(t|dk)P(dk)
P(t|Cj) =
P
d n(t, dj)
P
d |dj|
Example
cast 0,407
team 0,382
title 0,187
genre 0,927
title 0,07
location 0,002
cast 0,601
team 0,381
title 0,017
dj dj djP(t|dj) P(t|dj) P(t|dj)
meg ryan war
Evaluation initiatives
- INEX Entity Ranking track (2007-09)

- Collection is the (English) Wikipedia
- Entities are represented by Wikipedia articles
- Semantic Search Challenge (2010-11)

- Collection is a Semantic Web crawl (BTC2009)
- ~1 billion RDF triples
- Entities are represented by URIs
- INEX Linked Data track (2012-13)

- Wikipedia enriched with RDF properties from
DBpedia and YAGO
Ranking without explicit
entity representations
Scenario
- Entity descriptions are not readily available

- Entity occurrences are annotated

- manually
- automatically (~entity linking)
The basic idea
Use documents to go from queries to entities
Query-document
association
the documentā€™s relevance
Document-entity
association
how well the document
characterises the entity
eq
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
Two principal approaches
- Proļ¬le-based methods

- Create a textual proļ¬le for entities, then rank them
(by adapting document retrieval techniques)
- Document-based methods

- Indirect representation based on mentions identiļ¬ed
in documents
- First ranking documents (or snippets) and then
aggregating evidence for associated entities
Proļ¬le-based methods
q
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx xxxxxx
xx x xxx xx x xxxx xx
xxx x xxxxxx xx x xxx xx
xxxx xx xxx xx x xxxxx
xxx xx x
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
e
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
e
e
Document-based methods
q
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx xxxxxx
xx x xxx xx x xxxx xx
xxx x xxxxxx xx x xxx xx
xxxx xx xxx xx x xxxxx
xxx xx x
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
X
e
X
X
e
e
Many possibilities in terms of
modeling
- Generative (probabilistic) models

- Discriminative (probabilistic) models

- Voting models

- Graph-based models
Generative probabilistic
models
- Candidate generation models (P(e|q))

- Two-stage language model
- Topic generation models (P(q|e))

- Candidate model, a.k.a. Model 1
- Document model, a.k.a. Model 2
- Proximity-based variations
- Both families of models can be derived from the
Probability Ranking Principle [Fang & Zhai 2007]
Candidate models (ā€œModel 1ā€)
[Balog et al. 2006]
P(q|āœ“e) =
Y
t2q
P(t|āœ“e)n(t,q)
Smoothingā€Ø
With collection-wide background model
(1 )P(t|e) + P(t)
X
d
P(t|d, e)P(d|e)
Document-entity
association
Term-candidate ā€Ø
co-occurrence
In a particular document. 

In the simplest case:P(t|d)
Document models (ā€œModel 2ā€)
[Balog et al. 2006]
P(q|e) =
X
d
P(q|d, e)P(d|e)
Document-entity
association
Document relevance
How well document d
supports the claim that e
is relevant to q
Y
t2q
P(t|d, e)n(t,q)
Simplifying assumption ā€Ø
(t and e are conditionally
independent given d)
P(t|āœ“d)
Document-entity associations
- Boolean (or set-based) approach

- Weighted by the conļ¬dence in entity linking

- Consider other entities mentioned in the
document
eq
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
Proximity-based variations
- So far, conditional independence assumption
between candidates and terms when
computing the probability P(t|d,e)

- Relationship between terms and entities that in
the same document is ignored

- Entity is equally strongly associated with everything
discussed in that document
- Letā€™s capture the dependence between entities
and terms

- Use their distance in the document
Using proximity kernels
[Petkova & Croft 2007]
P(t|d, e) =
1
Z
NX
i=1
d(i, t)k(t, e)
Indicator function
1 if the term at position i is t,
0 otherwise
Normalizing
contant
Proximity-based kernel
- constant function

- triangle kernel

- Gaussian kernel

- step function
Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity
retrieval. CIKM'07.
Many possibilities in terms of
modeling
- Generative probabilistic models

- Discriminative probabilistic models

- Voting models

- Graph-based models
Discriminative models
- Vs. generative models:

- Fewer assumptions (e.g., term independence)
- ā€œLet the data speakā€
- Sufļ¬cient amounts of training data required
- Incorporating more document features, multiple
signals for document-entity associations
- Estimating P(r=1|e,q) directly (instead of P(e,q|r=1))
- Optimization can get trapped in a local maximum/
minimum
Arithmetic Mean
Discriminative (AMD) model
[Yang et al. 2010]
Pāœ“(r = 1|e, q) =
X
d
P(r1 = 1|q, d)P(r2 = 1|e, d)P(d)
Document
prior
Query-document
relevance
Document-entity
relevance
logistic function ā€Ø
over a linear
combination of features
ā‡£ Ng
X
j=1
jgj(e, dt)
āŒ˜ā‡£ Nf
X
i=1
ā†µifi(q, dt)
āŒ˜
standard logistic
function
weight ā€Ø
parametersā€Ø
(learned)
features
Learning to rank && entity
retrieval
- Pointwise

- AMD, GMD [Yang et al. 2010]
- Multilayer perceptrons, logistic regression [Sorg &
Cimiano 2011]
- Additive Groves [Moreira et al. 2011]
- Pairwise

- Ranking SVM [Yang et al. 2009]
- RankBoost, RankNet [Moreira et al. 2011]
- Listwise

- AdaRank, Coordinate Ascent [Moreira et al. 2011]
Voting models
[Macdonald & Ounis 2006]
- Inspired by techniques from data fusion

- Combining evidence from different sources
- Documents ranked w.r.t. the query are seen as
ā€œvotesā€ for the entity
Voting models
Many different variants, including...
- Votes

- Number of documents mentioning the entity
!
!
- Reciprocal Rank

- Sum of inverse ranks of documents
!
!
- CombSUM

- Sum of scores of documents
Score(e, q) = |{M(e)  R(q)}|
X
{M(e)R(q)}
s(d, q)
Score(e, q) =
X
{M(e)R(q)}
1
rank(d, q)
Score(e, q) = |M(e)  R(q)|
Graph-based models
[Serdyukov et al. 2008]
- One particular way of constructing graphs

- Vertices are documents and entities
- Only document-entity edges
- Search can be approached as a random walk
on this graph

- Pick a random document or entity
- Follow links to entities or other documents
- Repeat it a number of times
Inļ¬nite random walk
[Serdyukov et al. 2008]
Pi(d) = PJ (d) + (1 )
X
e!d
P(d|e)Pi 1(e),
Pi(e) =
X
d!e
P(e|d)Pi 1(d),
PJ (d) = P(d|q),
ee e
d d
e
d d
Evaluation
- Expert ļ¬nding task @TREC Enterprise track

- Enterprise setting (intranet of a large organization)
- Given a query, return people who are experts on the
query topic
- List of potential experts is provided
- We assume that the collection has been
annotated with <person>...</person> tokens
Incorporating
entity types
Attributes ā€Ø
(/Descriptions)
Type(s)
Relationships
people
products
organizations
locations
Interacting with typesā€Ø
grouping results
people
companies
jobs
(more) people
Interacting with typesā€Ø
ļ¬ltering results
Interacting with typesā€Ø
ļ¬ltering results
Type-aware ranking
- Typically, a two-component model:
P(q|e) = P(qT , qt|e) = P(qT |e)P(qt|e)
Type-basedā€Ø
similarity
Term-basedā€Ø
similarity
#1 Where to get the target type from?
#2 How to estimate type-based similarity?
Target type
- Provided by the user

- keyword++ query
- Need to be automatically identiļ¬ed

- keyword query
Target type(s) are providedā€Ø
faceted search, form ļ¬ll-in, etc.
But what about very many types?ā€Ø
which are typically hierarchically organized
Challenges
- Users are not familiar with the type system
In general, categorizing things
can be hard
- What is King Arthur?

- Person / Royalty / British royalty
- Person / Military person
- Person / Fictional character
Which King Arthur?!
Upshot for type-aware ranking
- Need to be able to handle the imperfections of
the type system

- Inconsistencies
- Missing assignments
- Granularity issues
- Entities labeled with too general or too speciļ¬c types
- User input is to be treated as a hint, not as a
strict ļ¬lter
Two settings
- Target type(s) are provided by the user

- keyword++ query
- Target types need to be automatically identiļ¬ed

- keyword query
Identifying target types for
queries
- Types can be ranked much like entitiesā€Ø
[Balog & Neumayer 2012]

- Direct term-based representations (ā€œModel 1ā€)
- Types of top ranked entities (ā€œModel 2ā€)ā€Ø
[Vallet & Zaragoza 2008]
Type-centric vs. entity-centric
type ranking
q
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx xxxxxx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
t
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
t
t
Type-centric
q
X
t
X
X
t
t
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx xxxxxx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xx x xxx xx xxxx
xx xxx xx x xxxxx xxx xx
x xxxx x xxx xx
xxxx x xxx xx xxxxxx xx
x xxx xx x xxxx xx xxx x
xxxxxx xxxxxx xx x xxx
xx x xxxx xx xxx x xxxxx
xx x xxx xx xxxx xx xxx
xx x xxxxx xxx
Entity-centric
Hierarchical target type
identiļ¬cation
- Finding the single most speciļ¬c type [from an
ontology] that is general enough to cover all
entities that are relevant to the query.
- Finding the right granularity is diļ¬ƒcultā€¦

- Models are good at ļ¬nding either general (top-level)
or speciļ¬c (leaf-level) types
Type-based similarity
- Measuring similarity

- Set-based
- Content-based (based on type labels)
- Need ā€œsoftā€ matching to deal with the
imperfections of the category system

- Lexical similarity of type labels
- Distance based on the hierarchy
- Query expansion
P(q|e) = P(qT , qt|e) = P(qT |e)P(qt|e)
Modeling types as probability
distributions [Balog et al. 2011]
- Analogously to term-based representations

!
!
!
!
- Advantages

- Sound modeling of uncertainty associated with
category information
- Category-based feedback is possible
p(c|āœ“C
q ) p(c|āœ“C
e )
Query Entity
KL(āœ“C
q ||āœ“C
e )
Joint type detection and entity
ranking [Sawant & Chakrabarti 2013]
- Assumes ā€œtelegraphicā€ queries with target type

- woodrow wilson president university
- dolly clone institute
- lead singer led zeppelin band
- Type detection is integrated into the ranking

- Multiple query interpretations are considered
- Both generative and discriminative
formulations
Approach
- Each query term is either a ā€œtype hintā€ ( )
or a ā€œword matcherā€ ( )

- Number of possible partitions is manageable ( )2|q|
h(~q, ~z)
s(~q, ~z)
By comparison, the Padres have been to two
World Series, losing in 1984 and 1998.
losing baseball team world series 1998
Major league baseball teams
mentionOf
instanceOf
San Diego PadresEntity
Evidence ā€Ø
snippets
Type
Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response
Ranking. In WWW ā€™13. (see presentation)
San Diego Padres!
Major league !
baseball team!
type context
E
T Padres have been to two World
Series, losing in 1984 and 1998!
Īø
Type hint :
baseball , team
losing team baseball world series 1998
Z
Ļ•
Context matchers : !
lost , 1998, world seriesswitch!
model! model!
q losing team baseball world series 1998
Generative approach
Generate query from entity
Generative formulation
P(e|q) / P(e)
X
t,~z
P(t|e)P(~z)P(h(~q, ~z)|t)P(s(~q, ~z)|e)
Type priorā€Ø
Estimated from ā€Ø
answer types ā€Ø
in the past
Type modelā€Ø
Probability of observing
t in the type model
Entity modelā€Ø
Probability of observing
t in the entity model
Query switchā€Ø
Probability of the
interpretation
Entity prior
Discriminative approach
Separate correct and incorrect entities
Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response
Ranking. In WWW ā€™13. (see presentation)
San_Diego_Padres!
losing team baseball
world series 1998
(baseball team)!
losing team baseball
world series 1998
(baseball team)!
losing team baseball
world series 1998
(t = baseball team)!
1998_World_Series!
losing team baseball
world series 1998
(series)!
losing team baseball
world series 1998
(series)!
losing team baseball
world series 1998
(t = series)!
: losing team baseball world series 1998!q!
Discriminative formulation
(q, e, t, ~z) = h 1(q, e), 2(t, e), 3(q, ~z, t), 4(q, ~z, e)i
Comparability
between hint words
and type
Comparability
between matchers
and snippets that
mention e
Models the type
prior P(t|e)
Models the entity
prior P(e)
Evaluation
- INEX Entity Ranking track

- Entities are represented by Wikipedia articles
- Topic deļ¬nition includes target categories
Movies with eight or more Academy Awardsā€Ø
best picture oscar british ļ¬lms american ļ¬lms
Entity relationships
Attributes ā€Ø
(/Descriptions)
Type(s)
Relationships
Related entities
Searching for arbitrary relations*
airlines that currently use Boeing 747 planesā€Ø
ORG Boeing 747
Members of The Beaux Arts Trioā€Ø
PER The Beaux Arts Trio
What countries does Eurail operate in?ā€Ø
LOC Eurail
*given an input entity and target type
Modeling related entity ļ¬nding
[Bron et al. 2010]
- Ranking entities of a given type (T) that stand
in a required relation (R) with an input entity (E)

- Three-component model
p(e|E, T, R) / p(e|E) Ā· p(T|e) Ā· p(R|E, e)
Context model
Type ļ¬ltering
Co-occurrence
model
xxxx x xxx xx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx
x xxxxx xxx
xxxx x xxx xx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx
x xxxxx xxx
xxxx x xxx xx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx
xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx
x xxxxx xxx
Evaluation
- TREC Entity track

- Given

- Input entity (deļ¬ned by name and homepage)
- Type of the target entity (PER/ORG/LOC)
- Narrative (describing the nature of the relation in free
text)
- Return (homepages of) related entities
Wrapping up
- Entity retrieval in diļ¬€erent ļ¬‚avors using
generative approaches based on language
modeling techniques

- Increasingly more discriminative approaches
over generative ones

- Increasing amount of components (and parameters)
- Easier to incrementally add informative but
correlated features
- But, (massive amounts of ) training data is required!
Future challenges
- Itā€™s ā€œeasyā€ when the ā€œquery intentā€ is known

- Desired results: single entity, ranked list, set, ā€¦
- Query type: ad-hoc, list search, related entity ļ¬nding, ā€¦
- Methods speciļ¬cally tailored to speciļ¬c types ā€Ø
of requests

- Understanding query intent still has a long ā€Ø
way to go
Entity Retrieval (tutorial organized by Radialpoint in Montreal)

More Related Content

What's hot

Entity Linking in Queries: Efficiency vs. Effectiveness
Entity Linking in Queries: Efficiency vs. EffectivenessEntity Linking in Queries: Efficiency vs. Effectiveness
Entity Linking in Queries: Efficiency vs. EffectivenessFaegheh Hasibi
Ā 
Everything you wanted to know about Dublin Core metadata
Everything you wanted to know about Dublin Core metadataEverything you wanted to know about Dublin Core metadata
Everything you wanted to know about Dublin Core metadataEduserv Foundation
Ā 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDFNarni Rajesh
Ā 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationVladimir Alexiev, PhD, PMP
Ā 
Debunking some ā€œRDF vs. Property Graphā€ Alternative Facts
Debunking some ā€œRDF vs. Property Graphā€ Alternative FactsDebunking some ā€œRDF vs. Property Graphā€ Alternative Facts
Debunking some ā€œRDF vs. Property Graphā€ Alternative FactsNeo4j
Ā 
Representing financial reports on the semantic web a faithful translation f...
Representing financial reports on the semantic web   a faithful translation f...Representing financial reports on the semantic web   a faithful translation f...
Representing financial reports on the semantic web a faithful translation f...Jie Bao
Ā 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluationkrisztianbalog
Ā 
Understanding RDF: the Resource Description Framework in Context (1999)
Understanding RDF: the Resource Description Framework in Context  (1999)Understanding RDF: the Resource Description Framework in Context  (1999)
Understanding RDF: the Resource Description Framework in Context (1999)Dan Brickley
Ā 
Dublin Core Basic Syntax Tutorial
Dublin Core Basic Syntax TutorialDublin Core Basic Syntax Tutorial
Dublin Core Basic Syntax TutorialEduserv Foundation
Ā 
Introduction to Ontology Engineering with Fluent Editor 2014
Introduction to Ontology Engineering with Fluent Editor 2014Introduction to Ontology Engineering with Fluent Editor 2014
Introduction to Ontology Engineering with Fluent Editor 2014Cognitum
Ā 
Ontology Engineering: Introduction
Ontology Engineering: IntroductionOntology Engineering: Introduction
Ontology Engineering: IntroductionGuus Schreiber
Ā 
Xml Overview
Xml OverviewXml Overview
Xml Overviewkevinreiss
Ā 
Jpl presentation
Jpl presentationJpl presentation
Jpl presentationRama Bastola
Ā 

What's hot (18)

Entity Linking in Queries: Efficiency vs. Effectiveness
Entity Linking in Queries: Efficiency vs. EffectivenessEntity Linking in Queries: Efficiency vs. Effectiveness
Entity Linking in Queries: Efficiency vs. Effectiveness
Ā 
Everything you wanted to know about Dublin Core metadata
Everything you wanted to know about Dublin Core metadataEverything you wanted to know about Dublin Core metadata
Everything you wanted to know about Dublin Core metadata
Ā 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDF
Ā 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Ā 
Debunking some ā€œRDF vs. Property Graphā€ Alternative Facts
Debunking some ā€œRDF vs. Property Graphā€ Alternative FactsDebunking some ā€œRDF vs. Property Graphā€ Alternative Facts
Debunking some ā€œRDF vs. Property Graphā€ Alternative Facts
Ā 
RDF and OWL
RDF and OWLRDF and OWL
RDF and OWL
Ā 
Representing financial reports on the semantic web a faithful translation f...
Representing financial reports on the semantic web   a faithful translation f...Representing financial reports on the semantic web   a faithful translation f...
Representing financial reports on the semantic web a faithful translation f...
Ā 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
Ā 
SWT Lecture Session 8 - Rules
SWT Lecture Session 8 - RulesSWT Lecture Session 8 - Rules
SWT Lecture Session 8 - Rules
Ā 
5 rdfs
5 rdfs5 rdfs
5 rdfs
Ā 
RDF briefing
RDF briefingRDF briefing
RDF briefing
Ā 
Understanding RDF: the Resource Description Framework in Context (1999)
Understanding RDF: the Resource Description Framework in Context  (1999)Understanding RDF: the Resource Description Framework in Context  (1999)
Understanding RDF: the Resource Description Framework in Context (1999)
Ā 
Dublin Core Basic Syntax Tutorial
Dublin Core Basic Syntax TutorialDublin Core Basic Syntax Tutorial
Dublin Core Basic Syntax Tutorial
Ā 
Introduction to Ontology Engineering with Fluent Editor 2014
Introduction to Ontology Engineering with Fluent Editor 2014Introduction to Ontology Engineering with Fluent Editor 2014
Introduction to Ontology Engineering with Fluent Editor 2014
Ā 
Ontology Engineering: Introduction
Ontology Engineering: IntroductionOntology Engineering: Introduction
Ontology Engineering: Introduction
Ā 
Xml Overview
Xml OverviewXml Overview
Xml Overview
Ā 
Jpl presentation
Jpl presentationJpl presentation
Jpl presentation
Ā 
Ontology engineering
Ontology engineering Ontology engineering
Ontology engineering
Ā 

Similar to Entity Retrieval (tutorial organized by Radialpoint in Montreal)

1 Project 1 Introduction - the SeaPort Project seri.docx
1  Project 1 Introduction - the SeaPort Project seri.docx1  Project 1 Introduction - the SeaPort Project seri.docx
1 Project 1 Introduction - the SeaPort Project seri.docxoswald1horne84988
Ā 
Dynamic Factual Summaries for Entity Cards
Dynamic Factual Summaries for Entity CardsDynamic Factual Summaries for Entity Cards
Dynamic Factual Summaries for Entity CardsFaegheh Hasibi
Ā 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligencekrisztianbalog
Ā 
Tailoring the conceptual reference model to archaeological requirements
Tailoring the conceptual reference model to archaeological requirementsTailoring the conceptual reference model to archaeological requirements
Tailoring the conceptual reference model to archaeological requirementsariadnenetwork
Ā 
Introduction to neo4j - a hands-on crash course
Introduction to neo4j - a hands-on crash courseIntroduction to neo4j - a hands-on crash course
Introduction to neo4j - a hands-on crash courseNeo4j
Ā 
From text to entities: Information Extraction in the Era of Knowledge Graphs
From text to entities: Information Extraction in the Era of Knowledge GraphsFrom text to entities: Information Extraction in the Era of Knowledge Graphs
From text to entities: Information Extraction in the Era of Knowledge GraphsGraphRM
Ā 
Collective entity linking with WSRM DocEng'19
Collective entity linking with WSRM DocEng'19Collective entity linking with WSRM DocEng'19
Collective entity linking with WSRM DocEng'19ngamou
Ā 
On Growth and Software
On Growth and SoftwareOn Growth and Software
On Growth and SoftwareCarlo Pescio
Ā 
Functional Scala 2020
Functional Scala 2020Functional Scala 2020
Functional Scala 2020Alexander Ioffe
Ā 
Web Data Extraction Como2010
Web Data Extraction Como2010Web Data Extraction Como2010
Web Data Extraction Como2010Giorgio Orsi
Ā 
Wikidata for libraries and archives
Wikidata for libraries and archivesWikidata for libraries and archives
Wikidata for libraries and archives_Emw
Ā 
Kotlin for Android Developers
Kotlin for Android DevelopersKotlin for Android Developers
Kotlin for Android DevelopersHassan Abid
Ā 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalrchbeir
Ā 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural SimulationClarkTony
Ā 
Slides
SlidesSlides
Slidesbutest
Ā 
Virtual enterprise synthesys
 Virtual enterprise synthesys Virtual enterprise synthesys
Virtual enterprise synthesysVictor Romanov
Ā 
Differential Privacy Analysis of Data Processing Workflows
Differential Privacy Analysis of Data Processing WorkflowsDifferential Privacy Analysis of Data Processing Workflows
Differential Privacy Analysis of Data Processing WorkflowsMarlon Dumas
Ā 

Similar to Entity Retrieval (tutorial organized by Radialpoint in Montreal) (17)

1 Project 1 Introduction - the SeaPort Project seri.docx
1  Project 1 Introduction - the SeaPort Project seri.docx1  Project 1 Introduction - the SeaPort Project seri.docx
1 Project 1 Introduction - the SeaPort Project seri.docx
Ā 
Dynamic Factual Summaries for Entity Cards
Dynamic Factual Summaries for Entity CardsDynamic Factual Summaries for Entity Cards
Dynamic Factual Summaries for Entity Cards
Ā 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
Ā 
Tailoring the conceptual reference model to archaeological requirements
Tailoring the conceptual reference model to archaeological requirementsTailoring the conceptual reference model to archaeological requirements
Tailoring the conceptual reference model to archaeological requirements
Ā 
Introduction to neo4j - a hands-on crash course
Introduction to neo4j - a hands-on crash courseIntroduction to neo4j - a hands-on crash course
Introduction to neo4j - a hands-on crash course
Ā 
From text to entities: Information Extraction in the Era of Knowledge Graphs
From text to entities: Information Extraction in the Era of Knowledge GraphsFrom text to entities: Information Extraction in the Era of Knowledge Graphs
From text to entities: Information Extraction in the Era of Knowledge Graphs
Ā 
Collective entity linking with WSRM DocEng'19
Collective entity linking with WSRM DocEng'19Collective entity linking with WSRM DocEng'19
Collective entity linking with WSRM DocEng'19
Ā 
On Growth and Software
On Growth and SoftwareOn Growth and Software
On Growth and Software
Ā 
Functional Scala 2020
Functional Scala 2020Functional Scala 2020
Functional Scala 2020
Ā 
Web Data Extraction Como2010
Web Data Extraction Como2010Web Data Extraction Como2010
Web Data Extraction Como2010
Ā 
Wikidata for libraries and archives
Wikidata for libraries and archivesWikidata for libraries and archives
Wikidata for libraries and archives
Ā 
Kotlin for Android Developers
Kotlin for Android DevelopersKotlin for Android Developers
Kotlin for Android Developers
Ā 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
Ā 
Actors for Behavioural Simulation
Actors for Behavioural SimulationActors for Behavioural Simulation
Actors for Behavioural Simulation
Ā 
Slides
SlidesSlides
Slides
Ā 
Virtual enterprise synthesys
 Virtual enterprise synthesys Virtual enterprise synthesys
Virtual enterprise synthesys
Ā 
Differential Privacy Analysis of Data Processing Workflows
Differential Privacy Analysis of Data Processing WorkflowsDifferential Privacy Analysis of Data Processing Workflows
Differential Privacy Analysis of Data Processing Workflows
Ā 

More from krisztianbalog

Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...krisztianbalog
Ā 
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...krisztianbalog
Ā 
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?krisztianbalog
Ā 
Personal Knowledge Graphs
Personal Knowledge GraphsPersonal Knowledge Graphs
Personal Knowledge Graphskrisztianbalog
Ā 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Editionkrisztianbalog
Ā 
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF LabOverview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Labkrisztianbalog
Ā 
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation SystemsTime-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation Systemskrisztianbalog
Ā 
Multi-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation RecommendationMulti-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation Recommendationkrisztianbalog
Ā 
Semistructured Data Seach
Semistructured Data SeachSemistructured Data Seach
Semistructured Data Seachkrisztianbalog
Ā 
Collection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity SearchCollection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity Searchkrisztianbalog
Ā 

More from krisztianbalog (10)

Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Ā 
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Ā 
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
Ā 
Personal Knowledge Graphs
Personal Knowledge GraphsPersonal Knowledge Graphs
Personal Knowledge Graphs
Ā 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Edition
Ā 
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF LabOverview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Ā 
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation SystemsTime-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Ā 
Multi-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation RecommendationMulti-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation Recommendation
Ā 
Semistructured Data Seach
Semistructured Data SeachSemistructured Data Seach
Semistructured Data Seach
Ā 
Collection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity SearchCollection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity Search
Ā 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
Ā 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
Ā 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
Ā 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
Ā 
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Patryk Bandurski
Ā 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
Ā 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
Ā 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
Ā 
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Wonjun Hwang
Ā 
Nellā€™iperspazio con Rocket: il Framework Web di Rust!
Nellā€™iperspazio con Rocket: il Framework Web di Rust!Nellā€™iperspazio con Rocket: il Framework Web di Rust!
Nellā€™iperspazio con Rocket: il Framework Web di Rust!Commit University
Ā 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
Ā 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
Ā 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
Ā 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
Ā 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
Ā 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
Ā 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
Ā 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
Ā 
Hot Sexy call girls in Panjabi Bagh šŸ” 9953056974 šŸ” Delhi escort Service
Hot Sexy call girls in Panjabi Bagh šŸ” 9953056974 šŸ” Delhi escort ServiceHot Sexy call girls in Panjabi Bagh šŸ” 9953056974 šŸ” Delhi escort Service
Hot Sexy call girls in Panjabi Bagh šŸ” 9953056974 šŸ” Delhi escort Service
Ā 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
Ā 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Ā 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Ā 
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Ā 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Ā 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
Ā 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
Ā 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
Ā 
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Ā 
Nellā€™iperspazio con Rocket: il Framework Web di Rust!
Nellā€™iperspazio con Rocket: il Framework Web di Rust!Nellā€™iperspazio con Rocket: il Framework Web di Rust!
Nellā€™iperspazio con Rocket: il Framework Web di Rust!
Ā 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Ā 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
Ā 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
Ā 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
Ā 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Ā 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
Ā 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Ā 

Entity Retrieval (tutorial organized by Radialpoint in Montreal)

  • 1. Part II Entity Retrieval Tutorial organized by Radialpoint | Montreal, Canada, 2014 Krisztian Balog University of Stavanger
  • 2. Entity retrieval Addressing information needs that are better answered by returning speciļ¬c objects (entities) instead of just any type of documents.
  • 3. 6% 36% 1%5% 12% 41% Entity (ā€œ1978 cj5 jeepā€) Type (ā€œdoctors in barcelonaā€) Attribute (ā€œzip code waterville Maineā€) Relation (ā€œtom cruise katie holmesā€) Other (ā€œnightlife in Barcelonaā€) Uninterpretable Distribution of web search queries [Pound et al. 2010]
  • 5. Whatā€™s so special here? - Entities are not always directly represented - Recognize and disambiguate entities in text ā€Ø (that is, entity linking) - Collect and aggregate information about a given entity from multiple documents and even multiple data collections - More structure than in document-based IR - Types (from some taxonomy) - Attributes (from some ontology) - Relationships to other entities (ā€œtyped linksā€)
  • 6. Semantics in our context - working deļ¬nition:ā€Ø references to meaningful structures - How to capture, represent, and use structure? - It concerns all components of the retrieval process! Text-only representation Text+structure representation info need entity matching Abc Abc Abc info need entity matching Abc Abc Abc
  • 7. Overview of core tasks Queries Data set Results (adhoc) entity retrieval keyword unstructured/ā€Ø semi-structured ranked list keyword++ā€Ø (target type(s)) semi-structured ranked list list completion keyword++ā€Ø (examples) semi-structured ranked list related entity ļ¬nding keyword++ā€Ø (target type, relation) unstructured ā€Ø & semi-structured ranked list
  • 8. In this part - Input: keyword(++) query - Output: a ranked list of entities - Data collection: unstructured and (semi)structured data sources (and their combinations) - Main RQ: How to incorporate structure into text-based retrieval models?
  • 9. Outline 1.Ranking based on entity descriptions 2.Incorporating entity types 3.Entity relationships Attributes ā€Ø (/Descriptions) Type(s) Relationships
  • 10. Probabilistic models (mostly) - Estimating conditional probabilities ! ! - Conditional independence ! ! - Conditional dependence P(A|B) P(A, B|C) = P(A|C) Ā· P(B|C) P(A|B) = P(A) P(A, B|C) P(A, B|C) = P(A|B, C)P(B|C) P(A|B) = P(B|A)P(A)/P(B)
  • 12. Task: ad-hoc entity retrieval - Input: unconstrained natural language query - ā€œtelegraphicā€ queries (neither well-formed nor grammatically correct sentences or questions) - Output: ranked list of entities - Collection: unstructured and/or semi- structured documents
  • 13. Example information needs meg ryan war american embassy nairobi ben franklin Chernobyl Worst actor century Sweden Iceland currency
  • 14. Two settings 1.With ready-made entity descriptionsā€Ø ā€Ø ā€Ø ā€Ø 2.Without explicit entity representations xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx
  • 16. This is not unrealistic...
  • 17. Document-based entity representations - Most entities have a ā€œhome pageā€ - I.e., each entity is described by a document - In this scenario, ranking entities is much like ranking documents - unstructured - semi-structured
  • 19. Example In the town where I was born, Lived a man who sailed to sea, And he told us of his life,ā€Ø In the land of submarines, So we sailed on to the sun,ā€Ø Till we found the sea green,ā€Ø And we lived beneath the waves, In our yellow submarine, We all live in yellow submarine, yellow submarine, yellow submarine, We all live in yellow submarine, yellow submarine, yellow submarine.
  • 22. Scoring a query q = {sea, submarine} P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d)
  • 23. Language Modeling Estimate a multinomial probability distribution from the text Smooth the distribution with one estimated from the entire collection P(t|āœ“d) = (1 )P(t|d) + P(t|C)
  • 24. Standard Language Modeling approach - Rank documents d according to their likelihood of being relevant given a query q: P(d|q) P(d|q) = P(q|d)P(d) P(q) / P(q|d)P(d) Document priorā€Ø Probability of the document ā€Ø being relevant to any query Query likelihoodā€Ø Probability that query q ā€Ø was ā€œproducedā€ by document d P(q|d) = Y t2q P(t|āœ“d)n(t,q)
  • 25. Standard Language Modeling approach (2) Number of times t appears in q Empirical ā€Ø document modelā€Ø Collection model ā€Ø Smoothing parameterā€Ø Maximumā€Ø likelihood ā€Ø estimates P(q|d) = Y t2q P(t|āœ“d)n(t,q) Document language modelā€Ø Multinomial probability distribution over the vocabulary of terms P(t|āœ“d) = (1 )P(t|d) + P(t|C) n(t, d) |d| P d n(t, d) P d |d|
  • 26. Scoring a query q = {sea, submarine} P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d) 0.04 0.00020.10.9 0.03602 t P(t|d) submarine 0,14 sea 0,04 ... t P(t|C) submarine 0,0001 sea 0,0002 ... (1 )P(ā€œseaā€|d) + P(ā€œseaā€|C)
  • 27. Scoring a query q = {sea, submarine} P(q|d) = P(ā€œseaā€|āœ“d) Ā· P(ā€œsubmarineā€|āœ“d) 0.14 0.00010.10.9 0.03602 t P(t|d) submarine 0,14 sea 0,04 ... t P(t|C) submarine 0,0001 sea 0,0002 ... (1 )P(ā€œsubmarineā€|d) + P(ā€œsubmarineā€|C) 0.126010.04538
  • 28. Here, documents==entities, so P(e|q) / P(e)P(q|āœ“e) = P(e) Y t2q P(t|āœ“e)n(t,q) Entity priorā€Ø Probability of the entity ā€Ø being relevant to any query Entity language modelā€Ø Multinomial probability distribution over the vocabulary of terms
  • 29. Semi-structured entity representation - Entity description documents are rarely unstructured - Representing entities as - Fielded documents ā€“ the IR approach - Graphs ā€“ the DB/SW approach
  • 30.
  • 31. foaf:name Audi A4 rdfs:label Audi A4 rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...] dbpprop:production 1994 2001 2005 2008 rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car owl:sameAs freebase:Audi A4 is dbpedia-owl:predecessor of dbpedia:Audi_A5 is dbpprop:similar of dbpedia:Cadillac_BLS dbpedia:Audi_A4
  • 32. Mixture of Language Models [Ogilvie & Callan 2003] - Build a separate language model for each ļ¬eld - Take a linear combination of them mX j=1 Āµj = 1 Field language modelā€Ø Smoothed with a collection model builtā€Ø from all document representations of theā€Ø same type in the collectionField weights P(t|āœ“d) = mX j=1 ĀµjP(t|āœ“dj )
  • 33. Setting ļ¬eld weights - Heuristically - Proportional to the length of text content in that ļ¬eld, to the ļ¬eldā€™s individual performance, etc. - Empirically (using training queries) - Problems - Number of possible ļ¬elds is huge - It is not possible to optimise their weights directly - Entities are sparse w.r.t. diļ¬€erent ļ¬elds - Most entities have only a handful of predicates
  • 34. Predicate folding - Idea: reduce the number of ļ¬elds by grouping them together - Grouping based on (BM25F and) - type [PĆ©rez-AgĆ¼era et al. 2010] - manually determined importance [Blanco et al. 2011]
  • 35. Hierarchical Entity Model [Neumayer et al. 2012] - Organize ļ¬elds into a 2-level hierarchy - Field types (4) on the top level - Individual ļ¬elds of that type on the bottom level - Estimate ļ¬eld weights - Using training data for ļ¬eld types - Using heuristics for bottom-level types
  • 36. Two-level hierarchy [Neumayer et al. 2012] foaf:name Audi A4 rdfs:label Audi A4 rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...] dbpprop:production 1994 2001 2005 2008 rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car owl:sameAs freebase:Audi A4 is dbpedia-owl:predecessor of dbpedia:Audi_A5 is dbpprop:similar of dbpedia:Cadillac_BLS ! Nameā€Ø Attributesā€Ø Out-relationsā€Ø In-relationsā€Ø
  • 37. Comparison of models d dfF ... t dfF t ... ...d tdf ... tdf ...d t ... t Unstructuredā€Ø document model Fieldedā€Ø document model Hierarchicalā€Ø document model
  • 38. Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009] - Extension to the Mixture of Language Models - Find which document ļ¬eld each query term may be associated with Mapping probabilityā€Ø Estimated for each query term P(t|āœ“d) = mX j=1 ĀµjP(t|āœ“dj ) P(t|āœ“d) = mX j=1 P(dj|t)P(t|āœ“dj )
  • 39. Estimating the mapping probability Term likelihood Probability of a query term occurring in a given ļ¬eld typeā€Ø Prior ļ¬eld probabilityā€Ø Probability of mapping the query term ā€Ø to this ļ¬eld before observing collection statistics P(dj|t) = P(t|dj)P(dj) P(t) X dk P(t|dk)P(dk) P(t|Cj) = P d n(t, dj) P d |dj|
  • 40. Example cast 0,407 team 0,382 title 0,187 genre 0,927 title 0,07 location 0,002 cast 0,601 team 0,381 title 0,017 dj dj djP(t|dj) P(t|dj) P(t|dj) meg ryan war
  • 41. Evaluation initiatives - INEX Entity Ranking track (2007-09) - Collection is the (English) Wikipedia - Entities are represented by Wikipedia articles - Semantic Search Challenge (2010-11) - Collection is a Semantic Web crawl (BTC2009) - ~1 billion RDF triples - Entities are represented by URIs - INEX Linked Data track (2012-13) - Wikipedia enriched with RDF properties from DBpedia and YAGO
  • 43. Scenario - Entity descriptions are not readily available - Entity occurrences are annotated - manually - automatically (~entity linking)
  • 44. The basic idea Use documents to go from queries to entities Query-document association the documentā€™s relevance Document-entity association how well the document characterises the entity eq xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx
  • 45. Two principal approaches - Proļ¬le-based methods - Create a textual proļ¬le for entities, then rank them (by adapting document retrieval techniques) - Document-based methods - Indirect representation based on mentions identiļ¬ed in documents - First ranking documents (or snippets) and then aggregating evidence for associated entities
  • 46. Proļ¬le-based methods q xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e e
  • 47. Document-based methods q xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx X e X X e e
  • 48. Many possibilities in terms of modeling - Generative (probabilistic) models - Discriminative (probabilistic) models - Voting models - Graph-based models
  • 49. Generative probabilistic models - Candidate generation models (P(e|q)) - Two-stage language model - Topic generation models (P(q|e)) - Candidate model, a.k.a. Model 1 - Document model, a.k.a. Model 2 - Proximity-based variations - Both families of models can be derived from the Probability Ranking Principle [Fang & Zhai 2007]
  • 50. Candidate models (ā€œModel 1ā€) [Balog et al. 2006] P(q|āœ“e) = Y t2q P(t|āœ“e)n(t,q) Smoothingā€Ø With collection-wide background model (1 )P(t|e) + P(t) X d P(t|d, e)P(d|e) Document-entity association Term-candidate ā€Ø co-occurrence In a particular document. In the simplest case:P(t|d)
  • 51. Document models (ā€œModel 2ā€) [Balog et al. 2006] P(q|e) = X d P(q|d, e)P(d|e) Document-entity association Document relevance How well document d supports the claim that e is relevant to q Y t2q P(t|d, e)n(t,q) Simplifying assumption ā€Ø (t and e are conditionally independent given d) P(t|āœ“d)
  • 52. Document-entity associations - Boolean (or set-based) approach - Weighted by the conļ¬dence in entity linking - Consider other entities mentioned in the document eq xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx
  • 53. Proximity-based variations - So far, conditional independence assumption between candidates and terms when computing the probability P(t|d,e) - Relationship between terms and entities that in the same document is ignored - Entity is equally strongly associated with everything discussed in that document - Letā€™s capture the dependence between entities and terms - Use their distance in the document
  • 54. Using proximity kernels [Petkova & Croft 2007] P(t|d, e) = 1 Z NX i=1 d(i, t)k(t, e) Indicator function 1 if the term at position i is t, 0 otherwise Normalizing contant Proximity-based kernel - constant function - triangle kernel - Gaussian kernel - step function
  • 55. Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
  • 56. Many possibilities in terms of modeling - Generative probabilistic models - Discriminative probabilistic models - Voting models - Graph-based models
  • 57. Discriminative models - Vs. generative models: - Fewer assumptions (e.g., term independence) - ā€œLet the data speakā€ - Sufļ¬cient amounts of training data required - Incorporating more document features, multiple signals for document-entity associations - Estimating P(r=1|e,q) directly (instead of P(e,q|r=1)) - Optimization can get trapped in a local maximum/ minimum
  • 58. Arithmetic Mean Discriminative (AMD) model [Yang et al. 2010] Pāœ“(r = 1|e, q) = X d P(r1 = 1|q, d)P(r2 = 1|e, d)P(d) Document prior Query-document relevance Document-entity relevance logistic function ā€Ø over a linear combination of features ā‡£ Ng X j=1 jgj(e, dt) āŒ˜ā‡£ Nf X i=1 ā†µifi(q, dt) āŒ˜ standard logistic function weight ā€Ø parametersā€Ø (learned) features
  • 59. Learning to rank && entity retrieval - Pointwise - AMD, GMD [Yang et al. 2010] - Multilayer perceptrons, logistic regression [Sorg & Cimiano 2011] - Additive Groves [Moreira et al. 2011] - Pairwise - Ranking SVM [Yang et al. 2009] - RankBoost, RankNet [Moreira et al. 2011] - Listwise - AdaRank, Coordinate Ascent [Moreira et al. 2011]
  • 60. Voting models [Macdonald & Ounis 2006] - Inspired by techniques from data fusion - Combining evidence from different sources - Documents ranked w.r.t. the query are seen as ā€œvotesā€ for the entity
  • 61. Voting models Many different variants, including... - Votes - Number of documents mentioning the entity ! ! - Reciprocal Rank - Sum of inverse ranks of documents ! ! - CombSUM - Sum of scores of documents Score(e, q) = |{M(e) R(q)}| X {M(e)R(q)} s(d, q) Score(e, q) = X {M(e)R(q)} 1 rank(d, q) Score(e, q) = |M(e) R(q)|
  • 62. Graph-based models [Serdyukov et al. 2008] - One particular way of constructing graphs - Vertices are documents and entities - Only document-entity edges - Search can be approached as a random walk on this graph - Pick a random document or entity - Follow links to entities or other documents - Repeat it a number of times
  • 63. Inļ¬nite random walk [Serdyukov et al. 2008] Pi(d) = PJ (d) + (1 ) X e!d P(d|e)Pi 1(e), Pi(e) = X d!e P(e|d)Pi 1(d), PJ (d) = P(d|q), ee e d d e d d
  • 64. Evaluation - Expert ļ¬nding task @TREC Enterprise track - Enterprise setting (intranet of a large organization) - Given a query, return people who are experts on the query topic - List of potential experts is provided - We assume that the collection has been annotated with <person>...</person> tokens
  • 67. Interacting with typesā€Ø grouping results people companies jobs (more) people
  • 70. Type-aware ranking - Typically, a two-component model: P(q|e) = P(qT , qt|e) = P(qT |e)P(qt|e) Type-basedā€Ø similarity Term-basedā€Ø similarity #1 Where to get the target type from? #2 How to estimate type-based similarity?
  • 71. Target type - Provided by the user - keyword++ query - Need to be automatically identiļ¬ed - keyword query
  • 72. Target type(s) are providedā€Ø faceted search, form ļ¬ll-in, etc.
  • 73. But what about very many types?ā€Ø which are typically hierarchically organized
  • 74. Challenges - Users are not familiar with the type system
  • 75. In general, categorizing things can be hard - What is King Arthur? - Person / Royalty / British royalty - Person / Military person - Person / Fictional character
  • 77. Upshot for type-aware ranking - Need to be able to handle the imperfections of the type system - Inconsistencies - Missing assignments - Granularity issues - Entities labeled with too general or too speciļ¬c types - User input is to be treated as a hint, not as a strict ļ¬lter
  • 78. Two settings - Target type(s) are provided by the user - keyword++ query - Target types need to be automatically identiļ¬ed - keyword query
  • 79. Identifying target types for queries - Types can be ranked much like entitiesā€Ø [Balog & Neumayer 2012] - Direct term-based representations (ā€œModel 1ā€) - Types of top ranked entities (ā€œModel 2ā€)ā€Ø [Vallet & Zaragoza 2008]
  • 80. Type-centric vs. entity-centric type ranking q xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx t xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx t t Type-centric q X t X X t t xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx Entity-centric
  • 81. Hierarchical target type identiļ¬cation - Finding the single most speciļ¬c type [from an ontology] that is general enough to cover all entities that are relevant to the query. - Finding the right granularity is diļ¬ƒcultā€¦ - Models are good at ļ¬nding either general (top-level) or speciļ¬c (leaf-level) types
  • 82. Type-based similarity - Measuring similarity - Set-based - Content-based (based on type labels) - Need ā€œsoftā€ matching to deal with the imperfections of the category system - Lexical similarity of type labels - Distance based on the hierarchy - Query expansion P(q|e) = P(qT , qt|e) = P(qT |e)P(qt|e)
  • 83. Modeling types as probability distributions [Balog et al. 2011] - Analogously to term-based representations ! ! ! ! - Advantages - Sound modeling of uncertainty associated with category information - Category-based feedback is possible p(c|āœ“C q ) p(c|āœ“C e ) Query Entity KL(āœ“C q ||āœ“C e )
  • 84. Joint type detection and entity ranking [Sawant & Chakrabarti 2013] - Assumes ā€œtelegraphicā€ queries with target type - woodrow wilson president university - dolly clone institute - lead singer led zeppelin band - Type detection is integrated into the ranking - Multiple query interpretations are considered - Both generative and discriminative formulations
  • 85. Approach - Each query term is either a ā€œtype hintā€ ( ) or a ā€œword matcherā€ ( ) - Number of possible partitions is manageable ( )2|q| h(~q, ~z) s(~q, ~z) By comparison, the Padres have been to two World Series, losing in 1984 and 1998. losing baseball team world series 1998 Major league baseball teams mentionOf instanceOf San Diego PadresEntity Evidence ā€Ø snippets Type
  • 86. Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ā€™13. (see presentation) San Diego Padres! Major league ! baseball team! type context E T Padres have been to two World Series, losing in 1984 and 1998! Īø Type hint : baseball , team losing team baseball world series 1998 Z Ļ• Context matchers : ! lost , 1998, world seriesswitch! model! model! q losing team baseball world series 1998 Generative approach Generate query from entity
  • 87. Generative formulation P(e|q) / P(e) X t,~z P(t|e)P(~z)P(h(~q, ~z)|t)P(s(~q, ~z)|e) Type priorā€Ø Estimated from ā€Ø answer types ā€Ø in the past Type modelā€Ø Probability of observing t in the type model Entity modelā€Ø Probability of observing t in the entity model Query switchā€Ø Probability of the interpretation Entity prior
  • 88. Discriminative approach Separate correct and incorrect entities Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ā€™13. (see presentation) San_Diego_Padres! losing team baseball world series 1998 (baseball team)! losing team baseball world series 1998 (baseball team)! losing team baseball world series 1998 (t = baseball team)! 1998_World_Series! losing team baseball world series 1998 (series)! losing team baseball world series 1998 (series)! losing team baseball world series 1998 (t = series)! : losing team baseball world series 1998!q!
  • 89. Discriminative formulation (q, e, t, ~z) = h 1(q, e), 2(t, e), 3(q, ~z, t), 4(q, ~z, e)i Comparability between hint words and type Comparability between matchers and snippets that mention e Models the type prior P(t|e) Models the entity prior P(e)
  • 90. Evaluation - INEX Entity Ranking track - Entities are represented by Wikipedia articles - Topic deļ¬nition includes target categories Movies with eight or more Academy Awardsā€Ø best picture oscar british ļ¬lms american ļ¬lms
  • 91.
  • 94.
  • 95.
  • 96. Searching for arbitrary relations* airlines that currently use Boeing 747 planesā€Ø ORG Boeing 747 Members of The Beaux Arts Trioā€Ø PER The Beaux Arts Trio What countries does Eurail operate in?ā€Ø LOC Eurail *given an input entity and target type
  • 97. Modeling related entity ļ¬nding [Bron et al. 2010] - Ranking entities of a given type (T) that stand in a required relation (R) with an input entity (E) - Three-component model p(e|E, T, R) / p(e|E) Ā· p(T|e) Ā· p(R|E, e) Context model Type ļ¬ltering Co-occurrence model xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx
  • 98. Evaluation - TREC Entity track - Given - Input entity (deļ¬ned by name and homepage) - Type of the target entity (PER/ORG/LOC) - Narrative (describing the nature of the relation in free text) - Return (homepages of) related entities
  • 99. Wrapping up - Entity retrieval in diļ¬€erent ļ¬‚avors using generative approaches based on language modeling techniques - Increasingly more discriminative approaches over generative ones - Increasing amount of components (and parameters) - Easier to incrementally add informative but correlated features - But, (massive amounts of ) training data is required!
  • 100. Future challenges - Itā€™s ā€œeasyā€ when the ā€œquery intentā€ is known - Desired results: single entity, ranked list, set, ā€¦ - Query type: ad-hoc, list search, related entity ļ¬nding, ā€¦ - Methods speciļ¬cally tailored to speciļ¬c types ā€Ø of requests - Understanding query intent still has a long ā€Ø way to go