Keynote talk given at the 10th Russian Summer School in Information Retrieval (RuSSIR ’16), Saratov, Russia, August 2016.
Note: part of the work is still under review; those slides are not yet included.
1. Entity Search
The Last Decade and the Next
Krisztian Balog
University of Stavanger
@krisztianbalog
10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016
2. WHAT IS AN ENTITY?
• An entity is an "object" or
"thing" in the real world that
can be distinctly identified and
is characterized by the following
properties:
• unique identifier(s)
• name(s)
• type(s)
• attributes (or description)
• (typed) relationships to other entities
people
products
organizations
locations
11. TREC ENTERPRISE EXPERT FINDING
• How to rank entities that have no direct
representations?
• Idea: Look at co-occurrences of entities and query
terms in documents
[Figure: documents in which query terms co-occur with entity mentions]
12. PROFILE-BASED METHODS
• Build a direct term-based entity
representation based on
associated language usage
• "You shall know a word by the
company it keeps." [Firth, 1957]
• Use document retrieval
techniques for ranking entity
profile documents
[Figure: the query q is matched against entity profile documents, one per entity e]
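The profile-based idea can be illustrated with a minimal sketch. It assumes a toy collection in which each document is already associated with the entities it mentions. The data, the function names `build_profiles` and `rank_profiles`, and the length-normalized term-frequency scoring are illustrative assumptions, not the retrieval models used in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection: (document text, entities associated with it).
DOCS = [
    ("language models for expert finding", ["alice"]),
    ("entity retrieval with language models", ["alice", "bob"]),
    ("database query optimization", ["bob"]),
]

def build_profiles(docs):
    """Build one term-based profile per entity from its associated documents."""
    profiles = defaultdict(Counter)
    for text, entities in docs:
        terms = text.lower().split()
        for entity in entities:
            profiles[entity].update(terms)
    return profiles

def rank_profiles(query, profiles):
    """Rank entity profiles by length-normalized query term frequency."""
    q_terms = query.lower().split()
    scores = {}
    for entity, tf in profiles.items():
        length = sum(tf.values())
        scores[entity] = sum(tf[t] for t in q_terms) / length
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_profiles("language models", build_profiles(DOCS))
```

Any document retrieval model could replace the simple scoring function here, since each profile is treated as an ordinary document.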
13. DOCUMENT-BASED METHODS
• First rank documents
(or document snippets)
• Then aggregate evidence for
the associated entities
[Figure: the query q retrieves documents; evidence from the retrieved documents is aggregated for the associated entities e]
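A minimal sketch of the document-based strategy follows, under the same toy assumptions: documents come with known entity associations, document relevance is a simple term-overlap ratio, and each document spreads its score uniformly over its entities. All names and weights are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection: (document text, entities associated with it).
DOCS = [
    ("language models for expert finding", ["alice"]),
    ("entity retrieval with language models", ["alice", "bob"]),
    ("database query optimization", ["bob"]),
]

def score_document(query, text):
    """Step 1: document relevance as the fraction of terms matching the query."""
    q_terms = set(query.lower().split())
    terms = text.lower().split()
    return sum(1 for t in terms if t in q_terms) / len(terms)

def rank_entities(query, docs):
    """Step 2: aggregate document scores for the associated entities."""
    scores = defaultdict(float)
    for text, entities in docs:
        doc_score = score_document(query, text)
        for entity in entities:
            # Uniform document-entity association weight (an assumption).
            scores[entity] += doc_score / len(entities)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_entities("language models", DOCS)
```

Unlike the profile-based sketch, no per-entity representation is built; entities are scored only through the documents that mention them.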
14. EVALUATION CAMPAIGNS
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
INEX
Question Answering over Linked Data
Task: entity ranking in Wikipedia
Input: keyword++ query (target types/examples)
Data collection: Wikipedia
Entity ID: Wikipedia article ID
Movies with eight or more Academy Awards
+category: best picture oscar
+category: british films
+category: american films
15. INEX ENTITY RANKING
Movies with eight or more Academy Awards
+category: best picture oscar
+category: british films
+category: american films
Term-based representation
Category-based representation
16. EVALUATION CAMPAIGNS
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
INEX
Question Answering over Linked Data
Task: related entity finding
Input: keyword++ query (input entity, target type)
Data collection: Web
Entity ID: entity homepage URL
airlines that currently use Boeing-747 planes
+entity: Boeing-747 (clueweb09-..292)
+target type: organization
17. EVALUATION CAMPAIGNS
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
INEX
Question Answering over Linked Data
Task: entity search in the Web of Data
Input: keyword query
Data collection: RDF triples
Entity ID: URI
nokia e73
boroughs of New York City
disney orlando
18. FIELDED DOCUMENT REPRESENTATION FROM RDF TRIPLES
Entity: dbpedia:Audi_A4
[Figure legend: subject-predicate-object and subject-predicate-literal triples]
foaf:name Audi A4
rdfs:label Audi A4
rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]
dbpprop:production 1994, 2001, 2005, 2008
rdf:type dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
dbpedia-owl:manufacturer dbpedia:Audi
dbpedia-owl:class dbpedia:Compact_executive_car
owl:sameAs freebase:Audi A4
is dbpedia-owl:predecessor of dbpedia:Audi_A5
is dbpprop:similar of dbpedia:Cadillac_BLS
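The mapping from triples to a fielded document can be sketched as follows. The triples are an abridged copy of the slide's example. The function names and the crude rule for telling URIs from literals (a colon and no spaces) are assumptions made only for this illustration.

```python
from collections import defaultdict

# Abridged triples for the example entity on the slide.
TRIPLES = [
    ("dbpedia:Audi_A4", "foaf:name", "Audi A4"),
    ("dbpedia:Audi_A4", "rdfs:label", "Audi A4"),
    ("dbpedia:Audi_A4", "dbpprop:production", "1994"),
    ("dbpedia:Audi_A4", "dbpprop:production", "2001"),
    ("dbpedia:Audi_A4", "dbpedia-owl:manufacturer", "dbpedia:Audi"),
    ("dbpedia:Audi_A5", "dbpedia-owl:predecessor", "dbpedia:Audi_A4"),
]

def to_text(value):
    """Turn a prefixed URI into plain terms; literals pass through unchanged.

    Heuristic (an assumption): a value with a colon and no spaces is a URI.
    """
    if ":" in value and " " not in value:
        value = value.split(":", 1)[1]
    return value.replace("_", " ")

def fielded_document(entity, triples):
    """One field per predicate; incoming triples go to an 'is ... of' field."""
    fields = defaultdict(list)
    for s, p, o in triples:
        if s == entity:
            fields[p].append(to_text(o))
        elif o == entity:
            fields[f"is {p} of"].append(to_text(s))
    return dict(fields)

doc = fielded_document("dbpedia:Audi_A4", TRIPLES)
```

Each field can then be indexed separately, which allows a fielded retrieval model to weight predicates differently.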
19. EVALUATION CAMPAIGNS
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
INEX
Question Answering over Linked Data
Task: question answering over RDF data
Input: natural language query
Data collection: RDF triples
Entity ID: URI
Which German cities have more than 250000 inhabitants?
Who is the youngest Pulitzer Prize winner?
20. EVALUATION CAMPAIGNS
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
INEX
Question Answering over Linked Data
Task: ad-hoc entity retrieval
Input: keyword query
Data collection: Wikipedia + RDF triples
Entity ID: Wikipedia article ID
NASA country German
22. DATA EVOLUTION
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise TREC Entity
INEX Entity Ranking
SemSearch
Question Answering over Linked Data
unstructured
structured
semistructured
INEX
• Clear trend towards structured data
• No meaningful or successful attempt at combining unstructured and
structured data
23. QUERY EVOLUTION
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
TREC Enterprise
TREC Entity
INEX Entity Ranking
SemSearch
Question Answering over Linked Data
keyword
natural language
keyword++
INEX
• Keyword queries are still the most common way to search
• Shift from providing explicit semantic annotations to natural language
questions
24. WHAT HAVE WE BEEN DOING?
• Core focus has been on retrieval models, and more
specifically on entity representations
• In terms of associated language usage, description, types,
attributes
• Richer query representations (i.e., query
annotations) were taken for granted
28. DATA
J. Benetka, K. Balog, and K. Nørvåg.
Towards Building a Knowledge Base of Monetary
Transactions from a News Collection.
JCDL’17.
29. KNOWLEDGE BASES
• Modern entity-oriented search features are fueled
by knowledge bases, which need continuous updating
• Critical to be able to verify the validity of data
• Supply provenance information for each statement
• Validity check (still) needs to be performed by a human
• Can we help human editors to maintain and
expand knowledge bases?
30. BUILDING A KNOWLEDGE BASE OF MONETARY TRANSACTIONS
[Figure: search interface. Subject entity: Oracle. Predicate filter: financial event "acquisition". Object entities (counterparts): Hyperion Solutions, Siebel Systems, Retek, PeopleSoft. Extracted information (event attributes) per row: value, year, confidence, snippet, and link; the source shown is NYT. Example snippets: "…Oracle finally acquired PeopleSoft for…", "… which acquired PeopleSoft last year …", "… from the PeopleSoft purchase …"]
Source passage ("A Boom in Merger Activity"): In December 2004, after a battle for control that grew nasty, Oracle finally acquired PeopleSoft for about $10.3 billion, becoming the second-largest maker of business-management software.
31. APPROACH
Event representation
• Generate all possible event interpretations (quintuples)
Semantic annotation of sentences
• Monetary value recognition
• Economic event recognition
• Entity recognition
• Date extraction
• Semantic role labeling
Clustering events
• Grouping sentences that discuss the same economic event
Supervised learning
• Assigning a confidence score to each interpretation
[Figure: four numbered steps (1-4) applied to sentences s#1-s#5. Sentences s#1, s#2, s#5 (entities A, B) and s#3, s#4 (entities C, D) are grouped into events e#1: [C] <rel> [D] and e#2: [A] <rel> [B]; interpretations receive confidence scores (0.85, 0.65, 0.91, 0.43, 0.45, 0.77)]
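Generating all possible event interpretations can be sketched as a Cartesian product over the candidate values found for each slot in a cluster of sentences. The five slot names below (subject, event, counterpart, value, year) are an assumption based on the interface labels shown earlier, and the candidate values are hypothetical; the slides do not spell out the exact quintuple definition.

```python
from itertools import product
from typing import NamedTuple

class Interpretation(NamedTuple):
    """One candidate reading of an event (slot names are assumptions)."""
    subject: str
    event: str
    counterpart: str
    value: str
    year: str

def generate_interpretations(candidates):
    """Enumerate every combination of the candidate values per slot."""
    slots = Interpretation._fields
    combos = product(*(candidates[slot] for slot in slots))
    return [Interpretation(*combo) for combo in combos]

# Hypothetical candidates extracted from one cluster of sentences.
candidates = {
    "subject": ["Oracle"],
    "event": ["acquisition"],
    "counterpart": ["PeopleSoft"],
    "value": ["USD 10 300 000 000"],
    "year": ["2004", "2005"],
}

interpretations = generate_interpretations(candidates)
```

A supervised model would then assign a confidence score to each interpretation, so that a human editor only needs to review the most likely ones.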
33. SUMMARY
• Building a domain-specific knowledge base
• NLP pipeline for information extraction
• ML for establishing confidence for human processing
• Open research problems
• Long-tail entities
• Entities "not worthy" of a Wikipedia page
• What are the attributes that matter?
35. ANNOTATING QUERIES WITH ENTITIES
• Semantic annotations of queries
were taken for granted so far
• How can automatic entity
annotations of queries be
leveraged to improve entity
retrieval?
barack obama parents
36. APPROACH
Query: barack obama parents
Query terms: barack obama parents
Annotations (automatic entity annotation via entity linking): <Barack_Obama>
Query terms are compared with the term-based representation D (term-based matching); annotations are compared with the entity-based representation D̂ (entity-based matching).
Term-based representation D
<rdfs:label>: Ann Dunham
<dbo:abstract>: Stanley Ann Dunham the mother Barack Obama, was an American anthropologist who …
<dbo:birthPlace>: Honolulu Hawaii …
<dbo:child>: Barack Obama
<dbo:wikiPageWikiLink>: United States Family Barack Obama
Entity-based representation D̂
<dbo:birthPlace>: [<Honolulu>, <Hawaii>]
<dbo:child>: <Barack_Obama>
<dbo:wikiPageWikiLink>: [<United_States>, <Family_of_Barack_Obama>, …]
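How the two kinds of matching might be combined is sketched below. This is a simplified stand-in, not the retrieval model from the talk: it assumes plain overlap ratios for both components and a hypothetical interpolation parameter `weight`.

```python
# Abridged representations of the example entity from the slide.
TERM_REPR = {
    "rdfs:label": "Ann Dunham",
    "dbo:birthPlace": "Honolulu Hawaii",
    "dbo:child": "Barack Obama",
}
ENTITY_REPR = {
    "dbo:birthPlace": ["<Honolulu>", "<Hawaii>"],
    "dbo:child": ["<Barack_Obama>"],
}

def term_score(query_terms, term_repr):
    """Fraction of query terms found anywhere in the term-based fields."""
    if not query_terms:
        return 0.0
    bag = {t.lower() for text in term_repr.values() for t in text.split()}
    return sum(t.lower() in bag for t in query_terms) / len(query_terms)

def entity_score(annotations, entity_repr):
    """Fraction of query annotations found in the entity-based fields."""
    if not annotations:
        return 0.0
    linked = {e for entities in entity_repr.values() for e in entities}
    return sum(e in linked for e in annotations) / len(annotations)

def combined_score(query_terms, annotations, term_repr, entity_repr, weight=0.5):
    """Linear interpolation of the two scores (weight is an assumption)."""
    return ((1 - weight) * term_score(query_terms, term_repr)
            + weight * entity_score(annotations, entity_repr))

result = combined_score(
    ["barack", "obama", "parents"], ["<Barack_Obama>"], TERM_REPR, ENTITY_REPR
)
```

In this example the term "parents" matches nothing in the term-based fields, while the annotation <Barack_Obama> matches the `dbo:child` field, which is the kind of evidence entity-based matching is meant to add.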
39. SUMMARY
• Automatically annotating queries with entities can
significantly improve retrieval performance
• Open research problem:
• How should a query be answered (list, fact, table, etc.)?
40. ENTITY SUMMARIES
F. Hasibi, K. Balog, and S. E. Bratsberg.
Dynamic Factual Summaries for Entity Cards.
SIGIR’17.
41. ENTITY SUMMARIES
• Summaries serve a dual purpose
• Synopsis of the entity
• Provide evidence why the entity is a good answer
for the given query
• How to generate dynamic entity
summaries that can directly address
users’ information needs?
• Two subtasks
• Fact ranking: What should be in the summary?
• Summary generation: How should it be presented?
42. EXAMPLE
einstein awards
Static (query-independent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Influenced by: Isaac Newton, Mahatma Gandhi, more
Spouse: Elsa Einstein, Mileva Marić
Children: Eduard Einstein, Lieserl Einstein, Hans A. Einstein
Dynamic (query-dependent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Awards: Barnard Medal, Nobel Prize in Physics, more
Education: Swiss Federal Polytechnic, University of Zurich
Influenced by: Isaac Newton, Mahatma Gandhi, more
43. FACT RANKING
• Ranking entity facts according to various
"goodness" criteria
• Importance: how well it describes the entity
• Relevance: how well it supports/explains why the entity is a
relevant result for the given query (information need)
• Utility: combines importance and relevance
• Learning-to-rank approach with specific features
designed for capturing importance and relevance
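The relation between the three criteria can be illustrated with a small sketch. It uses a fixed weighted sum in place of the learning-to-rank model mentioned above; the weight `alpha` and the per-fact scores are hypothetical values chosen for the query "einstein awards".

```python
def utility(importance, relevance, alpha=0.5):
    """Combine importance and relevance (weighted sum is an assumption)."""
    return alpha * importance + (1 - alpha) * relevance

# Hypothetical (predicate, importance, relevance) scores for "einstein awards".
FACTS = [
    ("Born", 0.9, 0.2),
    ("Awards", 0.6, 1.0),
    ("Spouse", 0.5, 0.0),
]

# Rank facts by decreasing utility.
ranked = sorted(FACTS, key=lambda f: utility(f[1], f[2]), reverse=True)
ranked_predicates = [predicate for predicate, _, _ in ranked]
```

With `alpha=1.0` the ranking depends on importance alone, which corresponds to a static, query-independent ordering of facts.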
44. SUMMARY GENERATION
• A summary is more than a ranked list of facts
Semantically identical predicates
Multi-valued predicates
Presentation (human-readable labels, size constraints)
<dbo:capital> <dbpedia:Oslo>
<dbo:currency> <dbpedia:Norwegian_krone>
<dbo:leader> <dbpedia:Harald_V_of_Norway>
<dbp:establishedDate> 1814-05-17
<dbp:leaderName> <dbpedia:Harald_V_of_Norway>
<foaf:homepage> <http://www.norway.no/>
<dbo:language> <dbpedia:Norwegian_language>
<dbo:language> <dbpedia:Romani_language>
<dbo:language> <dbpedia:Scandoromani_language>
<dbp:website> <http://www.norway.no/>
<dbo:leaderTitle> President of the Storting
<dbp:areaKm> 385178
vs.
Capital: Oslo
Currency: Norwegian krone
Leader: Harald V of Norway
Homepage: http://www.norway.no/
Language: Norwegian, Romani, more
45. SUMMARY GENERATION ALGORITHM
[Figure: entity card layout; line i consists of heading_i and value_i, and the card is bounded by height τ_h and width τ_w]
1. Selecting line headings
• Recognizing semantically identical predicates
• Mapping predicates to human-readable labels
2. Collecting line values
• Grouping values for multi-valued predicates
• Adhering to size constraints
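The two steps can be sketched as follows, assuming the facts arrive already ranked. The `LABELS` mapping, the interpretation of the size constraints as a maximum number of lines and a maximum number of characters per line, and the ", more" truncation marker are assumptions made for this illustration.

```python
# Hypothetical predicate-to-label mapping; semantically identical
# predicates (dbo:leader, dbp:leaderName) share one heading.
LABELS = {
    "dbo:capital": "Capital",
    "dbo:currency": "Currency",
    "dbo:leader": "Leader",
    "dbp:leaderName": "Leader",
    "dbo:language": "Language",
}

def generate_summary(ranked_facts, max_lines, max_width):
    """Build card lines from ranked (predicate, value) facts."""
    lines = {}  # heading -> values, in order of first appearance
    for predicate, value in ranked_facts:
        heading = LABELS.get(predicate, predicate)
        values = lines.setdefault(heading, [])
        if value not in values:  # drop duplicates from identical predicates
            values.append(value)
    summary = []
    for heading, values in list(lines.items())[:max_lines]:
        line = f"{heading}: {values[0]}"  # the first value is always kept
        shown = 1
        for value in values[1:]:
            candidate = f"{line}, {value}"
            # Reserve room for ", more" if further values would remain.
            suffix = ", more" if len(values) - shown - 1 else ""
            if len(candidate) + len(suffix) > max_width:
                break
            line, shown = candidate, shown + 1
        if shown < len(values):
            line += ", more"
        summary.append(line)
    return summary

FACTS = [
    ("dbo:capital", "Oslo"),
    ("dbo:currency", "Norwegian krone"),
    ("dbo:leader", "Harald V of Norway"),
    ("dbp:leaderName", "Harald V of Norway"),
    ("dbo:language", "Norwegian"),
    ("dbo:language", "Romani"),
    ("dbo:language", "Scandoromani"),
]
summary = generate_summary(FACTS, max_lines=4, max_width=35)
```

With these settings the three language values are grouped on one line and truncated to fit the width, which mirrors the "Language: Norwegian, Romani, more" line in the example card.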
47. END-TO-END (SUMMARY) EVALUATION
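• How do static and dynamic summaries compare
against each other?
[Figure: stacked bar chart on a 0-100 scale, for oracle (perfect) fact ranking and automatic fact ranking, showing how often the dynamic summary wins and how often the static summary wins; extracted values: 31, 37, 23, 16, 46, 47]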
48. SUMMARY
• Addressed the problem of generating dynamic
(query-dependent) entity summaries
• Open research problems
• What should be on the entity card?
• Other forms of result presentation (tables, lists, graphs, etc.)
50. ZERO-QUERY SEARCH
• Proactive instead of reactive search
• "Anticipate user needs and respond with
information appropriate to the current
context without the user having to enter a
query" (Allan et al., SIGIR Forum 2012)
• Using a person's check-in activity
as context, can we anticipate her
information needs, and respond
with a set of information cards
that directly address those needs?
[Figure: example information cards: Terminal, Weather (21ºC), Traffic]
51. INFORMATION NEEDS FOR ACTIVITIES
• What are relevant information needs in the context of
a given activity?
• Use POI categories (Foursquare) to represent activities
• Mine information needs from search suggestions
52. ANTICIPATING INFORMATION NEEDS
• Maximize the likelihood of satisfying the user's
information needs by considering each possible activity
that might follow next
• Transition probabilities are estimated based on historical
check-in data
[Figure: Activities A-D with transition probabilities of 45%, 34%, and 21% to the possible next activities; the next activity is unknown (?)]
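A minimal sketch of this idea: transition probabilities are estimated from consecutive check-ins, and each information need is scored by its expected relevance over the possible next activities. The check-in sequences, activity names, and relevance values are hypothetical, and the maximum-likelihood estimate is a simplifying assumption.

```python
from collections import Counter, defaultdict

def transition_probabilities(checkins):
    """Estimate P(next activity | current activity) from check-in sequences."""
    counts = defaultdict(Counter)
    for sequence in checkins:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }

def anticipate(last_activity, transitions, needs):
    """Score each information need by its expected relevance next."""
    scores = defaultdict(float)
    for activity, prob in transitions.get(last_activity, {}).items():
        for need, relevance in needs.get(activity, {}).items():
            scores[need] += prob * relevance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical historical check-ins (one activity sequence per user session).
CHECKINS = [
    ["airport", "hotel", "restaurant"],
    ["airport", "hotel"],
    ["airport", "restaurant"],
    ["airport", "hotel"],
]
# Hypothetical relevance of information needs per activity.
NEEDS = {
    "hotel": {"check-in time": 0.8, "wifi": 0.4},
    "restaurant": {"menu": 0.9, "wifi": 0.2},
}

transitions = transition_probabilities(CHECKINS)
ranking = anticipate("airport", transitions, NEEDS)
```

The top-ranked needs would then be the ones addressed by the information cards shown to the user.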
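53. EVALUATION METHODOLOGY
[Figure: check-in dataset with the check-in sequences of User 1, User 2, and User 3 split into train and test portions, with an 80% mark; example information cards: Terminal, Weather (21ºC), Traffic]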
54. RESULTS
[Figure: bar chart of NDCG@5 (axis 0.00-0.90) for methods M0-M3, at the top level and the second level]
M0: Most frequent information needs, regardless of the last activity
M1: Consider information needs for all possible upcoming activities
M2: In addition, consider the information needs relevant to the past activity (fixed weight for all info needs)
M3: Consider the temporal sensitivity of each information need individually
55. SUMMARY
• Identifying information needs that are relevant in the
context of a given activity and proactively presenting
information cards addressing those needs
• Open research problems
• Other contexts
• (Access to data, privacy...)
58. Assistant: I see you're wasting time away on Facebook. Do you have time now to talk about your holiday plans?
User: Sure. I want an active holiday with the family in beautiful nature.
Assistant: It sounds like you would definitely love Norway. A cabin in the mountains maybe?
User: Could be. But I want to go kayaking and also catch some fish. And not too much rain, please.
Assistant: And something fun for the kids nearby, I suppose?
User: Of course.
Assistant: How does Oltedal sound? People have been quite successful with catching lake trout based on what I found on Instagram. There is also a theme park and horse riding, both within 50 km.
59. User: And what about the weather?
Assistant: You know we’re talking about Norway, right…? Anyway, based on statistics from the past 30 years, this is one of the areas with the least amount of rain if you go in August.
User: I see. What about accommodation?
Assistant: Here is a list of places that I think you might like.
User: Any opinions on this one?
Assistant: According to the reviews that I can find on the web, the cabins are well equipped, the staff is nice and they even allow guests to borrow their kayaks.
60. User: OK. Let’s find a date that works for everyone.
Assistant: According to your wife's calendar, her parents will be visiting you in the first week of August. School starts for the kids on the week of Aug 22. So there is a two week window between Aug 8 and 21, assuming that I can cancel the regular weekly meetings with your PhD students.
User: That's fine. The students won't mind. Write them an email to upload their holiday plans to the group wiki, and add summer planning to the next group meeting's agenda.
[Email draft, with Send button]
To: XXX, YYY, ZZZ
Guys,
What are your plans for the summer?
Please upload your away times to the
group wiki.
-Kr
[Notification] Agenda item "Summer planning" added
61. Assistant: In the meantime, I called the cabin to check availability. Their online booking system is down at the moment. They still have some cabins available. Do you want to see them?
User: No, I had enough of this for today. Mail the pictures to my wife with some kind words.
Assistant: Anything else I can do for you?
User: Order a water filter for my espresso machine. I just found out that it'll need to be replaced soon.
[Email draft, with Send button]
To: Wife
Darling,
You will love the place I found for us for a
vacation in August. It is by the water; at
night we will hear the waves. We will be
able to take our morning breakfasts on
the balcony, which ...
63. UNDERSTANDING INFORMATION NEEDS
• Natural language conversational interface
• Anticipating information needs
• Proactive recommendations
It sounds like you would definitely love Norway. A cabin in the mountains maybe?
And something fun for the kids nearby, I suppose?
I see you're wasting time away on Facebook. Do you have time now to talk about your holiday plans?
64. DATA
• Long-tail entities
• On-the-fly information extraction
• "Personal" knowledge base
• "Wife", "My students", "my group", "my espresso machine", ... entities I care about
Here is a list of places that I think you might like.
According to the reviews that I can find on the web, ...
Order a water filter for my espresso machine. I just found out that it'll need to be replaced soon.
Breville BES860XL Barista Express Espresso Machine
65. RESULT PRESENTATION & USER INTERACTION
• Providing evidence
• "Actionable" entities
• Make booking, order item, write email, ...
• Helping the user to get things done
• Support for task completion
... based on statistics from the past 30 years, ...
According to your wife's calendar, ...
Agenda item Summer planning added
Write them an email to upload their holiday plans to the group wiki, and add summer planning to the next group meeting's agenda.
66. SUMMARY
Understanding information needs
• Semantic annotations
• Anticipating info needs
• Natural language conversational interfaces
Data source(s)
• Long tail entities
• Personal knowledge base
• On-the-fly information extraction
Retrieval method
• Hybrid approaches
Result presentation & user interaction
• Entity cards
• Actionable entities
• Support for task completion
67. ACKNOWLEDGMENTS
• Joint work with
• Faegheh Hasibi
• Jan Benetka
• Darío Garigliotti
• Kjetil Nørvåg
• Svein Erik Bratsberg