Keynote talk given at the 10th Russian Summer School in Information Retrieval (RuSSIR ’16), Saratov, Russia, August 2016.
Note: part of the work is still under review; those slides are not yet included.
Cost-based query optimization in Apache Hive 0.14 – Julian Hyde
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive introduces cost-based optimization for the first time, based on the Optiq framework. Optiq's lead developer Julian Hyde shows the improvements that CBO is bringing in Apache Hive 0.14.
For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive with the Stinger.next initiative.
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat... – Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang – Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2, to show its generality and flexibility.
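The two requirements above, generality and flexibility, can be sketched in plain Python as a storage-agnostic reader contract with an optional pushdown hook. This is a conceptual analogy only, not the actual Spark Data Source V2 interfaces (which are defined in Java/Scala); the class and method names here are invented for illustration:

```python
from abc import ABC, abstractmethod

class DataSourceReader(ABC):
    """Generality: every storage system implements the same read contract."""

    @abstractmethod
    def read(self):
        """Return an iterable of row dicts."""

    def prune_columns(self, columns):
        """Flexibility: a source can override this to push column pruning
        down to the storage layer; the default prunes after reading."""
        return [{c: row[c] for c in columns} for row in self.read()]

class InMemorySource(DataSourceReader):
    """Toy source backed by a Python list, standing in for HDFS/Hive/etc."""
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)

src = InMemorySource([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
print(src.prune_columns(["id"]))  # [{'id': 1}, {'id': 2}]
```

A capable source (e.g. a columnar file format) would override `prune_columns` to avoid reading the dropped columns at all, which is the kind of per-system optimization the V2 API is designed to allow.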
How Adobe Does 2 Million Records Per Second Using Apache Spark! – Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Simplifying Big Data Integration with Syncsort DMX and DMX-h – Precisely
Modern data strategies have to manage more than growing data volumes. They must also address the added complexity of integrating diverse data sources and types, adhere to security and governance mandates, and ensure the right tools and skills are in place to deliver business value from the data.
Learn how the latest enhancements to Syncsort DMX and DMX-h can help you achieve your modern data strategy goals with a single interface for accessing and integrating all your enterprise data sources – batch and streaming – across Hadoop, Spark, Linux, Windows or Unix – on premises or in the cloud.
Watch this on-demand customer education webcast to learn the latest product features introduced this year, including:
• Best in class data ingestion capabilities with enhanced support for mainframes, RDBMSs, MPP, Avro/Parquet, Kafka, NoSQL and more.
• Single interface for streaming and batch processes – now with support for Kafka and MapR Streams
• Secure data access, data governance and lineage with seamless integration with Kerberos, Apache Ranger, Apache Ambari, Cloudera Manager, Cloudera Navigator and Sentry.
• Evolution of our design once, deploy anywhere architecture – now with support for Spark!
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica... – Edureka!
This Edureka "Apache Spark Training" tutorial will talk about how Apache Spark works practically. We have demonstrated a Movie Recommendation Project using Apache Spark in this tutorial. Below are the topics covered in this tutorial:
1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
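Spark MLlib's collaborative filtering is built on alternating least squares (ALS); as a language-neutral illustration of the underlying idea, here is a tiny user-based collaborative filter in plain Python. The ratings data and function names are invented for the example:

```python
import math

# user -> {movie: rating}; toy data invented for the example
ratings = {
    "alice": {"Matrix": 5, "Titanic": 1, "Inception": 4},
    "bob":   {"Matrix": 4, "Titanic": 2, "Inception": 5},
    "carol": {"Matrix": 1, "Titanic": 5},
}

def cosine(u, v):
    """Cosine similarity over the movies two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(u[m] ** 2 for m in common))
    nv = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (nu * nv)

def recommend(user):
    """Suggest unseen movies rated highly by the most similar other user."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    return [m for m, r in ratings[nearest].items() if m not in seen and r >= 4]

print(recommend("carol"))  # ['Inception']
```

Carol's ratings correlate most strongly with Bob's, so Bob's highly rated, unseen-by-Carol movie is suggested. ALS instead learns low-rank latent factors for users and items, which scales far better, but the "similar users like similar movies" intuition is the same.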
Web 3.0 explained with a stamp (pt I: the basics) – Freek Bijl
What does Web 3.0, or the semantic web, really mean? This presentation explains the meaning of Web 3.0 using the example of a stamp collection. It is a translation of a Dutch version made earlier; for more detailed information in Dutch, see BijlBrand.nl.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... – Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 – StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Exactly-Once Financial Data Processing at Scale with Flink and Pinot – Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset of trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma, and Xiaoman Dong
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... – Databricks
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practice; 2) its performance and optimization benefits; and 3) scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk is a vocalization of the blog post, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab – CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
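Two of the topics above, lazy evaluation and whether reduce() alone can compute an average, translate directly into plain Python. The sketch below uses generators to mimic lazy RDD transformations and functools.reduce for the average question; it is a conceptual illustration, not PySpark code:

```python
from functools import reduce

nums = [4, 8, 15, 16, 23, 42]

# Transformations are lazy: building the generators does no work yet...
doubled = (x * 2 for x in nums)          # like rdd.map(lambda x: x * 2)
big = (x for x in doubled if x > 20)     # like .filter(lambda x: x > 20)

# ...until an action forces evaluation, like collect() or take():
print(list(big))  # [30, 32, 46, 84]

# reduce() alone cannot compute an average: it folds pairs of values and
# loses the count. The standard trick is to reduce over (sum, count) pairs.
total, count = reduce(
    lambda acc, x: (acc[0] + x, acc[1] + 1), nums, (0, 0))
print(total / count)  # 18.0
```

The (sum, count) trick matters in Spark because the combine function must be associative so partial results from different partitions can be merged; a naive running average is not.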
Debugging PySpark: Spark Summit East talk by Holden Karau – Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
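The accumulator pattern the talk describes can be sketched in plain Python: an add-only shared counter that tasks bump as a side effect, here used to count malformed records without a separate pass over the data. This is a conceptual stand-in, not Spark's actual LongAccumulator API:

```python
class Accumulator:
    """Minimal stand-in for a Spark accumulator: an add-only counter."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

bad_records = Accumulator()

def parse(line):
    """Parse 'key:value' lines, counting malformed ones as a side effect."""
    try:
        key, value = line.split(":")
        return (key, int(value))
    except ValueError:
        bad_records.add()
        return None

lines = ["a:1", "b:2", "oops", "c:not_a_number"]
parsed = [r for r in (parse(l) for l in lines) if r is not None]
print(parsed)             # [('a', 1), ('b', 2)]
print(bad_records.value)  # 2
```

The caveat from the paragraph above applies to the real thing: if Spark re-executes a task (cache miss, speculative execution), an accumulator updated inside a transformation can over-count, which is why accumulators are most trustworthy inside actions or as debugging aids rather than as exact metrics.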
NPM is a package manager for the JavaScript programming language. It is the default package manager for the JavaScript runtime environment Node.js. It consists of a command line client, also called npm, and an online database of public and paid-for private packages, called the npm registry.
Real time Analytics with Apache Kafka and Apache Spark – Rahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily; it supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa... – StreamNative
Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data.
Milvus relies on Pulsar as the log pub/sub system. Pulsar helps Milvus reduce system complexity by loosely decoupling each microservice, and makes the system stateless by disaggregating log storage from computation, which also makes the system more extendable. We will introduce the overall design, the implementation details of Milvus, and its roadmap in this talk.
Takeaways:
1) Get a general idea of what a vector database is and its real-world use cases.
2) Understand the major design principles of Milvus 2.0.
3) Learn how to build a complex system with the help of a modern log system like Pulsar.
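The core operation a vector database serves, nearest-neighbour search over embedding vectors, can be shown in a few lines of plain Python. This is a brute-force sketch with invented toy data; Milvus itself uses approximate indexes (e.g. IVF, HNSW) to make this fast at billion-vector scale:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "collection": id -> embedding vector (invented data)
collection = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def search(query, top_k=2):
    """Brute-force top-k by cosine similarity; real systems use ANN indexes."""
    scored = sorted(collection.items(),
                    key=lambda kv: cosine_sim(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

print(search([1.0, 0.0, 0.1]))  # ['cat', 'dog']
```

Everything else in a system like Milvus (the log-as-data design, Pulsar-backed ingestion, disaggregated storage) exists to keep this query fast and consistent while vectors stream in concurrently.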
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Visually Exploring Patent Collections for Events and Patterns – Xiaoyu Wang
My talk on Patent Visualization at The 3rd IEEE Workshop on Interactive Visual Text Analytics. Primary focus is to introduce the Scalable Visual Analytics research that my team is working on. Workshop paper can be found at: http://vialab.science.uoit.ca/textvis2013/papers/Ankam-TextVis2013.pdf
Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge... – Marieke van Erp
Giuseppe Rizzo, Biana Pereira, Andra Varga, Marieke van Erp, Amparo Elizabeth Cano Basave
Presented on Wednesday 10 October at the 17th International Semantic Web Conference (ISWC 2018)
Paper: http://www.semantic-web-journal.net/content/lessons-learnt-named-entity-recognition-and-linking-neel-challenge-series
Conference: http://iswc2018.semanticweb.org/
This is Part II of the tutorial "Entity Linking and Retrieval" given at SIGIR 2013 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Test Trend Analysis: Towards robust, reliable and timely tests – Hugh McCamphill
Slides from my talk at Selenium Conference 2016.
In this talk you will get ideas about how you can instrument test result information to provide actionable data, paving the way for more robust, reliable and timely test results.
By capturing this information over time, and when combined with visualization tools, we can answer different questions than with existing solutions (Allure / CI tool build history). Some examples of these are:
• Which tests are consistently flaky
• What are the common causes of failure across tests
• Which tests consistently take a long time to run
Using this information, we can move away from the ‘re-run’ culture and better support the continuous integration goals of having quick, reliable, deterministic tests.
Video of the talk is here: https://youtu.be/29fPYx7OJnE?list=PL_7kBU2XBlbKuRNVHeqjXUygXtToqMHsn
Information School, University of Washington, 2014-05-21: INFX 598 - Introducing Linked Data: concepts, methods and tools. Guest lecture (Module 9) "Doing Business with Semantic Technologies": Introduction to Ontotext and some of its products, clients and projects.
Also see video: https://voicethread.com/myvoice/#thread/5784646/29625471/31274564
Dynamic Search Using Semantics & Statistics – Paul Hofmann
This presentation shows three applications that successfully combine semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence, and the retrieved documents, together with the positive or negative feedback provided by the user. Demo shows recognizing user’s intent for health care search.
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Librarian use of authority files dates back to Callimachus and the Great Library of Alexandria around 300 BC. With the evolution of powerful computerized searching and retrieval systems, authority data appears to some to have outlived its usefulness. However, the Semantic Web provides an opportunity to use authority data to enable computers to search, aggregate, and combine information on the Web. Join this webinar to learn about the amazing services that can result when the rich data included in name authority files, and other standardized vocabularies are linked via the Semantic Web.
Similar to Entity Search: The Last Decade and the Next (20)
What Does Conversational Information Access Exactly Mean and How to Evaluate It?krisztianbalog
This talk discusses a set of specific tasks and scenarios related to information access within the vast space that is casually referred to as conversational AI. While most of these problems have been identified in the literature for quite some time now, progress has been limited. Apart from the inherently challenging nature of these problems, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This talk presents some recent work towards filling this gap.
In one line of research, we investigate the presentation of tabular search results in a conversational setting. Instead of generating a static summary of a result table, we complement brief summaries with clues that invite further exploration, thereby taking advantage of the conversational paradigm. One of the main contributions of this study is the development of a test collection using crowdsourcing.
Another line of work focuses on large-scale evaluation of conversational recommender systems via simulated users. Building on the well-established agenda-based simulation framework from dialogue systems research, we develop interaction and preference models specific to the item recommendation scenario. For evaluation, we compare three existing conversational movie recommender systems with both real and simulated users, and observe high correlation between the two means of evaluation.
This talk has been given at the CIIR talk series at the University of Massachusetts Amherst in Jan 2021 as well as at the IR seminar series at the University of Glasgow in March 2021.
Entity Retrieval (tutorial organized by Radialpoint in Montreal)krisztianbalog
This is Part II of the tutorial "Entity Linking and Retrieval for Semantic Search" given at a tutorial organized by Radialpoint (together with E. Meij and D. Odijk).
Previous versions of the tutorial were given at WWW'13, SIGIR'13, and WSDM'14. The current version contains an overhaul of the type-aware ranking part.
For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
This is Part II of the tutorial "Entity Linking and Retrieval for Semantic Search" given at WSDM 2014 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
This is Part II of the tutorial "Entity Linking and Retrieval" given at WWW 2013 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Entity Search: The Last Decade and the Next
1. Entity Search
The Last Decade and the Next
Krisztian Balog
University of Stavanger
@krisztianbalog
10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016
2. WHAT IS AN ENTITY?
• An entity is an "object" or "thing" in the real world that can be distinctly identified and is characterized by the following properties:
• unique identifier(s)
• name(s)
• type(s)
• attributes (or description)
• (typed) relationships to other entities
Examples: people, products, organizations, locations
11. TREC ENTERPRISE EXPERT FINDING
• How to rank entities that have no direct representations?
• Idea: Look at co-occurrences of entities and query terms in documents
[Figure: documents with highlighted query terms and entity mentions]
12. PROFILE-BASED METHODS
• Build a direct term-based entity representation based on associated language usage
• "You shall know a word by the company it keeps." [Firth, 1957]
• Use document retrieval techniques for ranking entity profile documents
[Figure: query q is matched against profile documents assembled for each entity e]
13. DOCUMENT-BASED METHODS
• First rank documents (or document snippets)
• Then aggregate evidence for the associated entities
[Figure: query q retrieves a ranked list of documents; entity mentions (e) in the retrieved documents are aggregated into an entity ranking]
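The document-based aggregation described above can be sketched as follows. This is a minimal illustration with uniform entity-document association weights and hypothetical document/entity IDs, not any particular TREC system's model:

```python
from collections import defaultdict

def rank_entities(doc_scores, doc_entities):
    """Aggregate document retrieval scores per mentioned entity.

    doc_scores:   {doc_id: retrieval score for the query}
    doc_entities: {doc_id: set of entities mentioned in the document}
    Returns entities sorted by aggregated score (descending).
    """
    entity_scores = defaultdict(float)
    for doc_id, score in doc_scores.items():
        mentioned = doc_entities.get(doc_id, set())
        for entity in mentioned:
            # Uniform association: each mentioned entity receives
            # an equal share of the document's retrieval score.
            entity_scores[entity] += score / len(mentioned)
    return sorted(entity_scores.items(), key=lambda kv: -kv[1])

# Toy example: d1 mentions e1 and e2, d2 mentions only e1.
ranking = rank_entities(
    {"d1": 2.0, "d2": 1.0},
    {"d1": {"e1", "e2"}, "d2": {"e1"}},
)
```

Real systems replace the uniform share with learned or heuristic entity-document association strengths.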
14. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: entity ranking in Wikipedia
Input: keyword++ query (target types/examples)
Data collection: Wikipedia
Entity ID: Wikipedia article ID
Example query: Movies with eight or more Academy Awards
+category: best picture oscar
+category: british films
+category: american films
15. INEX ENTITY RANKING
Term-based representation: Movies with eight or more Academy Awards
Category-based representation: +category: best picture oscar; +category: british films; +category: american films
16. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: related entity finding
Input: keyword++ query (input entity, target type)
Data collection: Web
Entity ID: entity homepage URL
Example query: airlines that currently use Boeing-747 planes
+entity: Boeing-747 (clueweb09-..292)
+target type: organization
17. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: entity search in the Web of Data
Input: keyword query
Data collection: RDF triples
Entity ID: URI
Example queries: nokia e73; boroughs of New York City; disney orlando
18. FIELDED DOCUMENT REPRESENTATION FROM RDF TRIPLES
[Diagram: subject–predicate–object and subject–predicate–literal triples centered on the entity]
Example fields for dbpedia:Audi_A4:
foaf:name: Audi A4
rdfs:label: Audi A4
rdfs:comment: The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]
dbpprop:production: 1994; 2001; 2005; 2008
rdf:type: dbpedia-owl:MeanOfTransportation; dbpedia-owl:Automobile
dbpedia-owl:manufacturer: dbpedia:Audi
dbpedia-owl:class: dbpedia:Compact_executive_car
owl:sameAs: freebase:Audi A4
is dbpedia-owl:predecessor of: dbpedia:Audi_A5
is dbpprop:similar of: dbpedia:Cadillac_BLS
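A minimal sketch of how such a fielded document might be assembled from triples. The helper and its field-naming scheme ("is &lt;predicate&gt; of" for incoming links) are illustrative assumptions rather than a specific system's implementation:

```python
from collections import defaultdict

def fielded_document(entity, triples):
    """Build a fielded document for `entity` from (s, p, o) triples.

    Outgoing triples (entity as subject) are indexed under the
    predicate name; incoming triples (entity as object) under an
    "is <predicate> of" field, mirroring the DBpedia-style layout.
    """
    doc = defaultdict(list)
    for s, p, o in triples:
        if s == entity:
            doc[p].append(o)
        elif o == entity:
            doc[f"is {p} of"].append(s)
    return dict(doc)

doc = fielded_document("dbpedia:Audi_A4", [
    ("dbpedia:Audi_A4", "foaf:name", "Audi A4"),
    ("dbpedia:Audi_A4", "dbo:manufacturer", "dbpedia:Audi"),
    ("dbpedia:Audi_A5", "dbo:predecessor", "dbpedia:Audi_A4"),
])
```

In practice URIs in object position would also be resolved to their labels so that keyword queries can match them.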
19. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: question answering over RDF data
Input: natural language query
Data collection: RDF triples
Entity ID: URI
Example queries: Which German cities have more than 250000 inhabitants? Who is the youngest Pulitzer Prize winner?
20. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: ad-hoc entity retrieval
Input: keyword query
Data collection: Wikipedia + RDF triples
Entity ID: Wikipedia article ID
Example query: NASA country German
22. DATA EVOLUTION
[Timeline 2005–2015: campaigns ordered from unstructured (TREC Enterprise) through semistructured (INEX Entity Ranking, INEX) to structured data (SemSearch, Question Answering over Linked Data)]
• Clear trend moving towards structured data
• No meaningful/successful attempt at combining unstructured and structured data
23. QUERY EVOLUTION
[Timeline 2005–2015: campaigns ordered from keyword queries through keyword++ to natural language queries]
• Keyword queries are still the most common way to search
• From providing explicit semantic annotations to natural language questions
24. WHAT HAVE WE BEEN DOING?
• Core focus has been on retrieval models, and more specifically on entity representations
• In terms of associated language usage, description, types, attributes
• Richer query representations (i.e., query annotations) were taken for granted
28. DATA
J. Benetka, K. Balog, and K. Nørvåg. Towards Building a Knowledge Base of Monetary Transactions from a News Collection. JCDL'17.
29. KNOWLEDGE BASES
• Modern entity-oriented search features are fueled by knowledge bases—need continuous updating
• Critical to be able to verify the validity of data
• Supply provenance information for each statement
• Validity check (still) needs to be performed by a human
• Can we help human editors to maintain and expand knowledge bases?
30. BUILDING A KNOWLEDGE BASE OF MONETARY TRANSACTIONS
[Screenshot: a "Find events" interface. Subject entity: Oracle; predicate filter: financial event "acquisition". Each retrieved object entity (counterpart: PeopleSoft, Siebel Systems, Hyperion Solutions, Retek) is listed with extracted event attributes (year, value), a confidence score, a supporting snippet, and a link to the source (NYT). For example, the PeopleSoft row: year 2004, value USD 10 300 000 000, confidence 82.8%, snippet "...Oracle finally acquired PeopleSoft for..."]
[Source excerpt, "A Boom in Merger Activity": In December 2004, after a battle for control that grew nasty, Oracle finally acquired PeopleSoft for about $10.3 billion, becoming the second-largest maker of business-management software.]
31. APPROACH
Semantic annotation of sentences: monetary value recognition, economic event recognition, entity recognition, date extraction, semantic role labeling
Event representation: generate all possible event interpretations (quintuples)
Clustering events: grouping sentences that discuss the same economic event
Supervised learning: assigning a confidence score to each interpretation
[Figure: sentences s1–s5 with pairwise similarity scores (e.g., 0.85, 0.91) are clustered into events, e.g., e1 = [C] &lt;rel&gt; [D] from s3 and s4, and e2 = [A] &lt;rel&gt; [B] from s1, s2, and s5]
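The event-clustering step could, for instance, be realized with single-link clustering over pairwise sentence similarities. The threshold and the union-find helper below are illustrative assumptions (the paper's actual clustering method may differ), and the similarity values echo the figure:

```python
def cluster_sentences(similarities, threshold=0.6):
    """Single-link clustering of sentences into candidate events.

    similarities: {(sent_a, sent_b): score in [0, 1]}
    Sentences connected by a similarity >= `threshold` end up in
    the same cluster (candidate event).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in similarities.items():
        find(a), find(b)  # register both sentences as singletons
        if score >= threshold:
            union(a, b)

    clusters = {}
    for s in parent:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())

events = cluster_sentences({
    ("s1", "s2"): 0.85, ("s2", "s5"): 0.77,
    ("s3", "s4"): 0.91, ("s1", "s3"): 0.43,
})
```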
33. SUMMARY
• Building a domain-specific knowledge base
• NLP pipeline for information extraction
• ML for establishing confidence for human processing
• Open research problems
• Long-tail entities
• Entities "not worthy" of a Wikipedia page
• What are the attributes that matter?
35. ANNOTATING QUERIES WITH ENTITIES
• Semantic annotations of queries were taken for granted so far
• How can automatic entity annotations of queries be leveraged to improve entity retrieval?
Example query: barack obama parents
36. APPROACH
[Figure: the query "barack obama parents" is processed two ways. Entity linking produces the (automatic) entity annotation &lt;Barack_Obama&gt;, which is matched against the entity-based representation D̂ of each candidate entity; the query terms are matched against the term-based representation D.]
Term-based representation (example entity: Ann Dunham):
&lt;rdfs:label&gt;: Ann Dunham
&lt;dbo:abstract&gt;: Stanley Ann Dunham, the mother of Barack Obama, was an American anthropologist who …
&lt;dbo:birthPlace&gt;: Honolulu Hawaii …
&lt;dbo:child&gt;: Barack Obama
&lt;dbo:wikiPageWikiLink&gt;: United States Family Barack Obama
Entity-based representation:
&lt;dbo:birthPlace&gt;: [&lt;Honolulu&gt;, &lt;Hawaii&gt;]
&lt;dbo:child&gt;: &lt;Barack_Obama&gt;
&lt;dbo:wikiPageWikiLink&gt;: [&lt;United_States&gt;, &lt;Family_of_Barack_Obama&gt;, …]
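One simple way to combine the two matching components is late fusion. The overlap-count scoring below is a deliberately crude stand-in for a proper retrieval model, and the interpolation weight `lam` is an assumed parameter:

```python
def combined_score(query_terms, query_entities, term_rep, entity_rep,
                   lam=0.5):
    """Late fusion of term-based and entity-based matching.

    term_rep / entity_rep: {field: list of terms / entity IDs}.
    Overlap counts stand in for a real retrieval model (e.g. a
    language model over fields); `lam` interpolates the components.
    """
    term_bag = [t for values in term_rep.values() for t in values]
    term_match = sum(term_bag.count(t) for t in query_terms)
    entity_bag = [e for values in entity_rep.values() for e in values]
    entity_match = sum(entity_bag.count(e) for e in query_entities)
    return lam * term_match + (1 - lam) * entity_match

# Toy representations for one candidate entity.
score = combined_score(
    ["barack", "obama", "parents"],
    ["<Barack_Obama>"],
    {"dbo:child": ["barack", "obama"]},
    {"dbo:child": ["<Barack_Obama>"]},
)
```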
39. SUMMARY
• Automatically annotating queries with entities can significantly improve retrieval performance
• Open research problem:
• How should a query be answered (list, fact, table, etc.)?
40. ENTITY SUMMARIES
F. Hasibi, K. Balog, and S. E. Bratsberg. Dynamic Factual Summaries for Entity Cards. SIGIR'17.
41. ENTITY SUMMARIES
• Summaries serve a dual purpose
• Synopsis of the entity
• Provide evidence why the entity is a good answer for the given query
• How to generate dynamic entity summaries that can directly address users' information needs?
• Two subtasks
• Fact ranking — What should be in the summary?
• Summary generation — How should it be presented?
42. EXAMPLE
Query: einstein awards
Static (query-independent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Influenced by: Isaac Newton, Mahatma Gandhi, more
Spouse: Elsa Einstein, Mileva Marić
Children: Eduard Einstein, Lieserl Einstein, Hans A. Einstein
Dynamic (query-dependent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Awards: Barnard Medal, Nobel Prize in Physics, more
Education: Swiss Federal Polytechnic, University of Zurich
Influenced by: Isaac Newton, Mahatma Gandhi, more
43. FACT RANKING
• Ranking entity facts according to various "goodness" criteria
• Importance: how well it describes the entity
• Relevance: how well it supports/explains why the entity is a relevant result for the given query (information need)
• Utility: combines importance and relevance
• Learning-to-rank approach with specific features designed for capturing importance and relevance
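As a toy illustration of the utility criterion, importance and relevance can be interpolated linearly. The actual approach is learning-to-rank over richer features, so the `alpha` weight and the scoring functions below are placeholders:

```python
def utility(importance, relevance, alpha=0.5):
    """Interpolate importance and relevance into a utility score."""
    return alpha * importance + (1 - alpha) * relevance

def rank_facts(facts, query, importance_fn, relevance_fn, alpha=0.5):
    """Rank entity facts by utility for the given query."""
    scored = [(f, utility(importance_fn(f), relevance_fn(f, query), alpha))
              for f in facts]
    return sorted(scored, key=lambda fs: -fs[1])

# Toy example: fact identifiers and scores are made up.
ranked = rank_facts(
    ["awards", "spouse"],
    "einstein awards",
    importance_fn={"awards": 0.6, "spouse": 0.8}.get,
    relevance_fn=lambda f, q: 1.0 if f in q else 0.0,
)
```

Note how the query-dependent relevance term is what lets a less "important" fact (awards) outrank a more important one for this particular query.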
44. SUMMARY GENERATION
• A summary is more than a ranked list of facts
• Semantically identical predicates (e.g., &lt;dbo:leader&gt; and &lt;dbp:leaderName&gt;)
• Multi-valued predicates (e.g., the three &lt;dbo:language&gt; values)
• Presentation (human-readable labels, size constraints)
Raw facts:
&lt;dbo:capital&gt; &lt;dbpedia:Oslo&gt;
&lt;dbo:currency&gt; &lt;dbpedia:Norwegian_krone&gt;
&lt;dbo:leader&gt; &lt;dbpedia:Harald_V_of_Norway&gt;
&lt;dbp:establishedDate&gt; 1814-05-17
&lt;dbp:leaderName&gt; &lt;dbpedia:Harald_V_of_Norway&gt;
&lt;foaf:homepage&gt; &lt;http://www.norway.no/&gt;
&lt;dbo:language&gt; &lt;dbpedia:Norwegian_language&gt;
&lt;dbo:language&gt; &lt;dbpedia:Romani_language&gt;
&lt;dbo:language&gt; &lt;dbpedia:Scandoromani_language&gt;
&lt;dbp:website&gt; &lt;http://www.norway.no/&gt;
&lt;dbo:leaderTitle&gt; President of the Storting
&lt;dbp:areaKm&gt; 385178
vs. rendered summary:
Capital: Oslo
Currency: Norwegian krone
Leader: Harald V of Norway
Homepage: http://www.norway.no/
Language: Norwegian, Romani, more
45. SUMMARY GENERATION ALGORITHM
[Figure: a summary of at most height τ_h and width τ_w, where each line i consists of heading_i and value_i]
1. Selecting line headings
• Recognizing semantically identical predicates
• Mapping predicates to human-readable labels
2. Collecting line values
• Grouping values for multi-valued predicates
• Adhering to size constraints
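The two steps above can be sketched as follows. The alias map (for semantically identical predicates), the label map, and the size limits are illustrative stand-ins for the paper's actual components:

```python
def generate_summary(facts, labels, aliases, max_lines=5, max_values=2):
    """Turn ranked (predicate, value) facts into summary lines.

    facts:   ranked list of (predicate, value) pairs
    labels:  {canonical predicate: human-readable heading}
    aliases: {predicate: canonical predicate} for semantically
             identical predicates
    Values of one canonical predicate are grouped and deduplicated;
    at most `max_values` are shown, with "more" appended if truncated.
    """
    grouped, order = {}, []
    for pred, value in facts:
        canon = aliases.get(pred, pred)
        if canon not in grouped:
            grouped[canon] = []
            order.append(canon)
        if value not in grouped[canon]:
            grouped[canon].append(value)
    lines = []
    for canon in order[:max_lines]:
        values = grouped[canon][:max_values]
        if len(grouped[canon]) > max_values:
            values.append("more")
        lines.append(f"{labels.get(canon, canon)}: {', '.join(values)}")
    return lines

lines = generate_summary(
    [("dbo:capital", "Oslo"),
     ("dbo:leader", "Harald V of Norway"),
     ("dbp:leaderName", "Harald V of Norway"),
     ("dbo:language", "Norwegian"),
     ("dbo:language", "Romani"),
     ("dbo:language", "Scandoromani")],
    {"dbo:capital": "Capital", "dbo:leader": "Leader",
     "dbo:language": "Language"},
    {"dbp:leaderName": "dbo:leader"},
)
```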
47. END-TO-END (SUMMARY) EVALUATION
• How do static and dynamic summaries compare against each other?
[Chart: user preference percentages (scale 0–100). Oracle (perfect) fact ranking: dynamic summary wins 47, static summary wins 16. Automatic fact ranking: dynamic summary wins 46, static summary wins 23. The remainders (37 and 31) are cases with no winner.]
48. SUMMARY
• Addressed the problem of generating dynamic (query-dependent) entity summaries
• Open research problems
• What should be on the entity card?
• Other forms of result presentation (tables, lists, graphs, etc.)
50. ZERO-QUERY SEARCH
• Proactive instead of reactive search
• "Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query" — (Allan et al., SIGIR Forum 2012)
• Using a person's check-in activity as context, can we anticipate her information needs, and respond with a set of information cards that directly address those needs?
[Figure: example information cards: Terminal, Weather (21ºC), Traffic]
51. INFORMATION NEEDS FOR ACTIVITIES
• What are relevant information needs in the context of a given activity?
• Use POI categories (Foursquare) to represent activities
• Mine information needs from search suggestions
52. ANTICIPATING INFORMATION NEEDS
• Maximize the likelihood of satisfying the user's information needs by considering each possible activity that might follow next
• Transition probabilities are estimated based on historical check-in data
[Figure: from the current activity (Activity A), transitions to Activity B (45%), Activity C (34%), and Activity D (21%)]
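The expected-relevance idea above might be sketched like this; the activity names, information-need labels, and weights are hypothetical:

```python
def anticipate_needs(current_activity, transitions, needs):
    """Score information needs by expected relevance to what comes next.

    transitions: {current activity: {next activity: probability}}
    needs:       {activity: {info need: relevance weight}}
    Each need is weighted by the probability of the activity it
    belongs to, aggregated over all possible next activities.
    """
    scores = {}
    for nxt, prob in transitions.get(current_activity, {}).items():
        for need, weight in needs.get(nxt, {}).items():
            scores[need] = scores.get(need, 0.0) + prob * weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical transition model mirroring the figure's 45/34/21 split.
ranked = anticipate_needs(
    "airport",
    {"airport": {"hotel": 0.45, "restaurant": 0.34, "museum": 0.21}},
    {"hotel": {"check-in time": 1.0},
     "restaurant": {"menu": 1.0},
     "museum": {"opening hours": 1.0}},
)
```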
53. EVALUATION METHODOLOGY
[Figure: the check-in dataset is split per user into a training portion (80%) and a test portion; for held-out check-ins the system predicts information cards such as Terminal, Weather (21ºC), Traffic]
54. RESULTS
[Chart: NDCG@5 (0.00–0.90) at the top level and second level of the POI category hierarchy, for methods M0–M3]
M0: Most frequent information needs, regardless of the last activity
M1: Consider information needs for all possible upcoming activities
M2: In addition, consider the information needs relevant to the past activity (fixed weight for all info needs)
M3: Consider the temporal sensitivity of each information need individually
55. SUMMARY
• Identifying information needs that are relevant in the context of a given activity and proactively presenting information cards addressing those needs
• Open research problems
• Other contexts
• (Access to data, privacy...)
58. [Envisioned dialogue with a personal search assistant]
Agent: I see you're wasting time away on Facebook. Do you have time now to talk about your holiday plans?
User: Sure. I want an active holiday with the family in beautiful nature.
Agent: It sounds like you would definitely love Norway. A cabin in the mountains maybe?
User: Could be. But I want to go kayaking and also catch some fish. And not too much rain, please.
Agent: And something fun for the kids nearby, I suppose?
User: Of course.
Agent: How does Oltedal sound? People have been quite successful with catching lake trout based on what I found on Instagram. There is also a theme park and horse riding, both within 50 kms.
59.
User: And what about the weather?
Agent: You know we're talking about Norway, right…? Anyway, based on statistics from the past 30 years, this is one of the areas with the least amount of rain if you go in August.
User: I see. What about accommodation?
Agent: Here is a list of places that I think you might like.
User: Any opinions on this one?
Agent: According to the reviews that I can find on the web, the cabins are well equipped, the staff is nice and they even allow guests to borrow their kayaks.
60.
User: OK. Let's find a date that works for everyone.
Agent: According to your wife's calendar, her parents will be visiting you in the first week of August. School starts for the kids on the week of Aug 22. So there is a two-week window between Aug 8 and 21, assuming that I can cancel the regular weekly meetings with your PhD students.
User: That's fine. The students won't mind. Write them an email to upload their holiday plans to the group wiki, and add summer planning to the next group meeting's agenda.
[Screen: email draft. To: XXX, YYY, ZZZ. "Guys, What are your plans for the summer? Please upload your away times to the group wiki. -Kr" Send]
[Screen: agenda item "Summer planning" added]
61. In the mean'me, I called the cabin to
check availability. Their online
booking system is down at the
moment. They s'll have some cabins
available. Do you want to see them?
No, I had enough of this for today.
Mail the pictures to my wife with
some kind words.
Anything else I can do for you?
Order a water filter for my espresso
machine. I just found out that it'll
need to be replaced soon.
Darling,
You will love the place I found for us for a
vacation in August. It is by the water; at
night we will hear the waves. We will be
able to take our morning breakfasts on
the balcony, which ...
To: Wife
Send
63. UNDERSTANDING
INFORMATION NEEDS
• Natural language
conversational interface
• Anticipating information needs
• Proactive recommendations
It sounds like you would definitely
love Norway. A cabin in the
mountains maybe?
And something fun for the kids
nearby, I suppose?
I see you're wasting time away on
Facebook. Do you have time now to
talk about your holiday plans?
64. DATA
• Long-tail entities
• On-the-fly information extraction
• "Personal" knowledge base
• "Wife", "my students", "my group", "my
espresso machine", ... entities I care about
Here is a list of places that I think you
might like.
According to the reviews that I can
find on the web, ...
Order a water filter for my espresso
machine. I just found out that it'll
need to be replaced soon.
Breville BES860XL Barista
Express Espresso Machine
65. RESULT PRESENTATION
& USER INTERACTION
• Providing evidence
• "Actionable" entities
• Make booking, order item, write email, ...
• Helping the user to get things
done
• Support for task completion
... based on statistics from the past
30 years, ...
According to your wife's calendar, ...
Agenda item Summer planning added
Write them an email to upload their
holiday plans to the group wiki, and
add summer planning to the next
group meeting's agenda.
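"Actionable" entities can be pictured as entity cards that expose actions the assistant may execute on the user's behalf (book, order, email), returning evidence of what was done. The action names and dispatch mechanism below are illustrative assumptions, not an API from the talk.

```python
from typing import Callable

# Hypothetical action handlers; real ones would call booking/ordering
# services and report back to the user.
def order_item(entity: str) -> str:
    return f"Ordered: {entity}"

def make_booking(entity: str) -> str:
    return f"Booking requested: {entity}"

ACTIONS: dict[str, Callable[[str], str]] = {
    "order": order_item,
    "book": make_booking,
}

def execute(action: str, entity: str) -> str:
    """Dispatch a user-requested action on an entity, returning a
    confirmation message as evidence of task completion."""
    if action not in ACTIONS:
        return f"Sorry, I cannot '{action}' yet."
    return ACTIONS[action](entity)

print(execute("order", "water filter"))  # "Ordered: water filter"
```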
66. SUMMARY
Understanding
information needs
Data source(s)
Result presentation
& user interaction
Retrieval method
• Semantic annotations
• Anticipating info needs
• Natural language
conversational interfaces
• Long-tail entities
• Personal knowledge base
• On-the-fly information extraction
• Hybrid approaches
• Entity cards
• Actionable entities
• Support for task completion
67. ACKNOWLEDGMENTS
• Joint work with
• Faegheh Hasibi
• Jan Benetka
• Darío Garigliotti
• Kjetil Nørvåg
• Svein Erik Bratsberg