1. ON ENTITIES AND EVALUATION
Krisztian Balog
University of Stavanger
@krisztianbalog
Keynote given at the 41st European Conference on Information Retrieval (ECIR '19) | Cologne, Germany, April 2019
2. SPECIAL THANKS TO
• My former PhD advisor:
• Maarten de Rijke
• My former and current PhD students:
• Jan R. Benetka, Richard Berendsen, Marc Bron, Heng Ding,
Darío Garigliotti, Faegheh Hasibi, Trond Linjordet, Robert
Neumayer, Shuo Zhang
• Collaborators on material presented in this talk:
• Po-Yu Chuang, Peter Dekker, Maarten de Rijke, Kristian
Gingstad, Rolf Jagerman, Øyvind Jekteberg, Liadh Kelly, Tom
Kenter, Phillip Schaer, Anne Schuth, Narges Tavakolpoursaleh
4. OUTLINE FOR PART I
• What is an entity?
• Why care about entities?
• What research has been done on entities in IR?
• What’s next?
5. WHAT IS AN ENTITY?
An entity is an object or thing that can be uniquely identified.
• Entity catalog: entity ID*, name(s)*
6. AN ENTITY
Example: the DBpedia entity <dbr:Karen_Spark_Jones>
• <rdf:type>: <dbo:Scientist>, <dbo:Person>, <dbo:Agent>, <owl:Thing>
• <foaf:name>: "Karen Spärck Jones"
• <dbo:birthDate>: "1935-08-26"
• <dbo:abstract>: "Karen Spärck Jones FBA (26 August 1935 – 4 April 2007) was a British computer scientist."
• <dbo:spouse>: <dbr:Roger_Needham>
• <dbp:almaMater>: <University_of_Cambridge>
• <dbo:knownFor>: <dbr:Natural_language_processing>
• <dct:subject>: <dbc:Information_retrieval_researchers>, <dbc:British_women_computer_scientists>, <dbc:British_computer_scientists>, <dbc:British_women_scientists>
7. WHAT IS AN ENTITY?
An entity is a uniquely identifiable object or thing,
characterized by its name(s), type(s), attributes, and
relationships to other entities.
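The definition above maps naturally onto a small data structure; a minimal sketch, with field values taken from the Karen Spärck Jones example (the layout itself is illustrative, not from the talk):

```python
# One entity, characterized by its ID, name(s), type(s), attributes,
# and relationships to other entities (values from the DBpedia example).
entity = {
    "id": "dbr:Karen_Spark_Jones",
    "names": ["Karen Spärck Jones"],
    "types": ["dbo:Scientist", "dbo:Person", "dbo:Agent", "owl:Thing"],
    "attributes": {"dbo:birthDate": "1935-08-26"},
    "relationships": {
        "dbo:spouse": ["dbr:Roger_Needham"],
        "dbo:knownFor": ["dbr:Natural_language_processing"],
    },
}

def is_a(entity, type_id):
    """An entity is uniquely identifiable and can be checked against its types."""
    return type_id in entity["types"]
```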
8. REPRESENTING ENTITIES AND THEIR PROPERTIES
• Entity catalog: entity ID*, name(s)*
• Knowledge repository: adds type(s)*, descriptions, relationships (non-typed links)
9. REPRESENTING ENTITIES AND THEIR PROPERTIES
• Entity catalog: entity ID*, name(s)*
• Knowledge repository: adds type(s)*, descriptions, relationships (non-typed links)
• Knowledge base (KB) / knowledge graph (KG): adds attributes, relationships (typed links)
10. WHY CARE ABOUT ENTITIES?
• From a user perspective, entities ...
• are natural units for organizing
information
• enable a richer and more effective
user experience
11. WHY CARE ABOUT ENTITIES?
• From a machine perspective, entities ...
• allow for a better understanding of queries,
document content, and of users
• enable search engines to be more
intelligent
Michael Schumacher (born 3 January 1969) is a German retired racing driver. He is a seven-time Formula One World Champion and is widely regarded as one of the greatest Formula One drivers of all time. He won two titles with Benetton in 1994 and 1995 before moving to Ferrari where he drove for eleven years. His time with Ferrari yielded five consecutive titles between 2000 and 2004.
Linked entities: Michael Schumacher (Racing driver), Scuderia Ferrari (Formula One constructor), Benetton Formula (Formula One constructor), Formula One (Auto racing series)
13. TRENDS IN THE IR LITERATURE
[Chart: number of matching paper titles per year, 2000-2016, for the queries "entity OR entities", "Wikipedia", "knowledge base", and "knowledge graph"; y-axis from 0 to 40]
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW.
14. TRENDS IN THE IR LITERATURE
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW.
[Chart: number of matching paper titles per year, 2000-2016, for "entity OR entities" vs. 'Wikipedia OR "knowledge base" OR "knowledge graph"'; y-axis from 0 to 40]
15. #1 ENTITIES AS THE UNIT OF RETRIEVAL
• A significant portion of queries mention or target entities
• Those queries are better answered with a ranked list of
entities (as opposed to a list of documents)
• Term-based entity representations can be effectively
ranked using document-based retrieval models
• Semantically informed retrieval models utilize entity-specific properties (attributes, types, and relationships)
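The claim that term-based entity representations can be ranked with document retrieval models can be made concrete with a minimal sketch: entity descriptions are treated as pseudo-documents and scored with the standard BM25 formula (the toy descriptions below are illustrative, not from the talk):

```python
import math
from collections import Counter

# Toy term-based entity representations built from names/descriptions.
entities = {
    "Karen Sparck Jones": "british computer scientist information retrieval idf",
    "Michael Schumacher": "german racing driver formula one world champion",
    "University of Cambridge": "university cambridge england research",
}
docs = {e: text.split() for e, text in entities.items()}
df = Counter(t for terms in docs.values() for t in set(terms))  # document frequencies
avgdl = sum(len(t) for t in docs.values()) / len(docs)          # average doc length

def bm25_score(query, doc_terms, k1=1.2, b=0.75):
    """Score one entity pseudo-document with BM25."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

def rank(query):
    """Rank entities for a query exactly as documents would be ranked."""
    return sorted(docs, key=lambda e: bm25_score(query, docs[e]), reverse=True)
```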
16. #2 ENTITIES FOR KNOWLEDGE
REPRESENTATION
• Entities help to bridge the gap between unstructured and
structured data
[Figure: unstructured text with <entity> mentions connected to a knowledge base via entity linking and knowledge base population]
17. #3 ENTITIES FOR AN ENHANCED
SEARCH EXPERIENCE
• Improve the search experience through the entire search
process
• Understanding search queries
• Improving document retrieval performance
• Query assistance services (auto-completion, suggestions)
• Entity recommendations
19. OUTLINE FOR PART I
• What is an entity?
• Why care about entities?
• What research has been done on entities in IR?
• What’s next?
20. SCENARIO #1
User: I would like to get some new strings for my guitar.
AI: OK, would that be your electric guitar or the acoustic one?
User: The electric one.
AI: Alright. I can repeat your Amazon order of 3 months ago, or you can stop by a music store on Elm Street on the way to your dentist appointment this afternoon.
21. TRULY PERSONAL AI
IS NOT POSSIBLE WITHOUT A
PERSONAL KNOWLEDGE GRAPH
22. PERSONAL KNOWLEDGE GRAPHS
A personal knowledge graph (PKG) is a source of structured knowledge about entities and the relations between them, where those entities and relations are of personal, rather than general, importance.
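A PKG of this kind can be sketched as a set of subject-predicate-object triples anchored at a distinguished user node (the relation names below are illustrative; the entities come from the scenarios in this talk):

```python
# A minimal personal knowledge graph as (subject, predicate, object) triples;
# "user" is the distinguished node everything must connect to.
pkg = {
    ("user", "hasMother", "Mom"),
    ("user", "owns", "Electric guitar"),
    ("user", "owns", "Acoustic guitar"),
    ("Mom", "hasDentist", "Dr. John Pullman"),
}

def neighbors(graph, node):
    """All (predicate, object) pairs directly attached to a node."""
    return {(p, o) for s, p, o in graph if s == node}

def reachable(graph, start):
    """Entities (in)directly connected to `start`; in a well-formed PKG
    every entity should be reachable from the user node."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for _, obj in neighbors(graph, node):
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen
```

Reachability from "user" captures the requirement, stated under RQ1 below, that entities be directly or indirectly connected to the user.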
23. PERSONAL KNOWLEDGE GRAPHS
[Figure: a personal knowledge graph centered on the User, with nodes such as Hometown, Mom, Jamie, High school, Mom's dentist, Electric guitar, and Acoustic guitar; it overlaps with a general-purpose KG, a domain-specific KG, an e-commerce catalog, and a social network]
24. Part I: Entities
A RESEARCH AGENDA FOR PERSONAL KNOWLEDGE GRAPHS
25. #1 KNOWLEDGE REPRESENTATION
• Task: representing entities and their properties
• KGs are organized according to a knowledge model (schema)
• Peculiarities/challenges:
• Entities need to be (directly/indirectly) connected to the user
• Not duplicating attributes, focusing on what is personal
• Information about entities can be very sparse
• Some entities may not have any digital presence
• Strong temporality (relations can be ephemeral)
26. #1 KNOWLEDGE REPRESENTATION
RQ1: What is the best way of representing entities and their properties and relations, considering the vast but sparse set of possible predicates?
27. #2 SEMANTIC ANNOTATION OF TEXT
• Task: annotating text with respect to a knowledge
repository (commonly known as entity linking)
• Usually involves mention detection, entity disambiguation,
and NIL-detection steps
• Challenges
• Entities might have little to no digital presence
• Entities are not necessarily proper nouns
• Linking, NIL-detection, and KG population are intertwined
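The three steps above can be sketched end-to-end; the surface-form dictionary and commonness scores below are toy stand-ins (a real linker would use far richer statistics and context):

```python
# Toy entity linking pipeline: mention detection -> disambiguation -> NIL detection.
# The surface forms and scores are illustrative only.
SURFACE_FORMS = {
    "ferrari": {"Scuderia Ferrari": 0.7, "Ferrari (car maker)": 0.3},
    "benetton": {"Benetton Formula": 0.6, "Benetton (clothing)": 0.4},
    "schumi": {},  # a known mention string with no KB candidate -> NIL
}

def detect_mentions(text):
    """Mention detection: find known surface forms in the text."""
    tokens = text.lower().replace(",", " ").split()
    return [t for t in tokens if t in SURFACE_FORMS]

def link(text):
    """Disambiguate each mention to its most common entity; NIL if no candidate."""
    annotations = []
    for mention in detect_mentions(text):
        candidates = SURFACE_FORMS[mention]
        entity = max(candidates, key=candidates.get) if candidates else "NIL"
        annotations.append((mention, entity))
    return annotations
```

In the PKG setting, a NIL result would trigger population of the graph rather than being discarded, which is exactly why linking, NIL-detection, and population are intertwined.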
28. #2 SEMANTIC ANNOTATION OF TEXT
RQ2a: How can entity linking be performed against a personal knowledge graph, where structured entity information to rely on is potentially absent?
RQ2b: When should entity linking be performed against a personal knowledge graph as opposed to a general-purpose KG?
29. SCENARIO #2
User: I need to see a dentist. Mom recommended hers at dinner yesterday.
AI: I can try to help you find this person. Do you have any more information?
User: I reckon that he and Mom graduated from the same high school the same year.
AI: OK, that's enough to narrow it down. It must be Dr. John Pullman.
User: That must be him. I remember he had a fitting name. Can you try to make an appointment for Thursday afternoon?
30. #3 POPULATION AND MAINTENANCE
• Task: extending a KG from external sources (KB
acceleration/population) or via internal inferencing
• Verification of facts in the KG
• Challenges:
• Single curator; more automation is desired than for general-purpose KGs, but the user should still be in control
• The first mention of an entity should trigger population
• Properties may be inferred from the context
31. #3 POPULATION AND MAINTENANCE
RQ3: How can personal knowledge graphs be automatically populated and reliably maintained?
32. SCENARIO #3
AI: Since you're running a half marathon at Hackney in May, may I suggest you undertake a 10k run this weekend?
User: Yes, that sounds like a good idea. Any suggestions for a not too popular route that I haven't done before?
AI: Sure thing. I'll upload some routes to the running app on your phone.
User: Cheers mate!
33. #4 QUERYING
• Task: Retrieving information (entities, types, relations, etc.)
from the PKG or from KGs with the help of the PKG
• Challenges:
• Sparsity of data
• Soft, subjective constraints
34. #4 QUERYING
RQ4: How to leverage the semantically rich but sparse information in personal knowledge graphs for answering natural language queries?
35. #5 INTEGRATION WITH EXTERNAL
SOURCES
• Task: recognizing the same entity across multiple data
sources (a.k.a. object resolution, record linkage, ...)
• Challenges:
• One-to-many, as opposed to one-to-one linkage
• Continuous process, not a one-off effort
• Two-way synchronization would be desired
• Conflicting facts or relations need resolving by the user
36. #5 INTEGRATION WITH EXTERNAL
SOURCES
RQ5: How to provide continuous two-way integration with external knowledge sources with the user in the loop?
37. RESEARCH QUESTIONS
FOR PERSONAL KNOWLEDGE GRAPHS
• What is the best way of representing entities and their properties
and relations, considering the vast but sparse set of possible
predicates?
• How can entity linking be performed against a personal knowledge
graph, where structured entity information to rely on is potentially
absent?
• How can personal knowledge graphs be automatically populated and
reliably maintained?
• How to leverage the semantically rich but sparse information in
personal knowledge graphs for answering natural language queries?
• How to provide continuous two-way integration with external
knowledge sources with the user in the loop?
38. THERE IS MORE...
• Implementation
• Where is it stored (on the device, cloud, etc.)?
• How can security and privacy be ensured?
• How to interact with a range of services with proper access
control?
• Evaluation
• How to build reusable test resources?
39. SUMMARY OF PART I
• Progress on entity-oriented search was enabled by large
open knowledge repositories
• Personal AI is not possible without the concept of a
personal knowledge graph
• Many interesting research opportunities are available
43. ONLINE EVALUATION 101
• See how regular users interact with a retrieval system during normal use
• Observe implicit behavior
• Clicks, skips, saves, forwards, bookmarks, likes, etc.
• Try to infer differences in behavior from different flavors of
the live system
• A/B testing, interleaving
• Run statistical tests to confirm the difference is not due to
chance
44. CHALLENGES IN ONLINE EVALUATION
• It's a live service
• Complexity of modern SERPs
• Data is noisy
• There’s no “ground truth”
45. OFFLINE VS. ONLINE EVALUATION
• Basic assumption. Offline: assessors tell you what is relevant. Online: observable user behavior can tell you what is relevant.
• Quality. Offline: data is only as good as the guidelines. Online: real user data, real and representative information needs.
• Realism. Offline: simplified scenario, cannot go beyond a certain level of complexity. Online: perfectly realistic setting (users are not aware that they are guinea pigs).
• Assessment cost. Offline: expensive. Online: cheap.
• Scalability. Offline: doesn't scale. Online: scales very well.
• Repeatability. Offline: repeatable. Online: not repeatable.
• Throughput. Offline: high. Online: low.
• Risk. Offline: none. Online: high.
47. LIVING LABS
Living labs is a new evaluation paradigm for IR,
where the experimentation platform is an existing
search engine. Researchers have the opportunity to
replace components of this search engine and
evaluate these components using interactions with
real, "unsuspecting" users of this search engine.
49. ALL WE NEED IS A SITE:
LET'S TAKE AN EXISTING ONE
50. KEY IDEAS FOR OPERATIONALIZATION
• An API orchestrates all the data exchange between sites
(live search engines) and participants
• Focus on frequent (head) queries
• Enough traffic on them for experimentation
• Participants generate rankings offline and upload these
to the API
• Eliminates real-time requirement
• Freedom in choice of tools and environment
K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM '14.
54. METHODOLOGY (3)
• When any of the test queries is fired on the live site, it requests an experimental ranking from the API and interleaves it with that of the production system
[Diagram: the live site forwards the query to the API and on to the experimental system; the experimental ranking is returned via the API and shown to the user as an interleaved ranking]
55. METHODOLOGY (3)
Example (interleaving two rankings):
System A: doc 1, doc 2, doc 3, doc 4, doc 5
System B: doc 2, doc 4, doc 7, doc 1, doc 3
Interleaved list: doc 1, doc 2, doc 4, doc 3, doc 7
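The interleaving step can be sketched with the team-draft method; this is one common algorithm, chosen here for illustration (the talk does not specify which variant the sites used), and real deployments randomize which team picks first in each round:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Team-draft interleaving: in each round, teams A and B contribute their
    highest-ranked document not yet selected; which team picks first per
    round is chosen at random."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    credit = {"A": [], "B": []}  # which team contributed each document
    interleaved = []
    length = min(length, len(set(ranking_a) | set(ranking_b)))
    while len(interleaved) < length:
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        for team in order:
            for doc in rankings[team]:
                if doc not in interleaved:
                    interleaved.append(doc)
                    credit[team].append(doc)
                    break
            if len(interleaved) >= length:
                break
    return interleaved, credit
```

A click on a document credited to the experimental team then counts as a win for the experimental system, which is how the wins and losses in the Outcome measure below are collected.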
56. METHODOLOGY (4)
• Participants get detailed feedback on user interactions (clicks)
[Diagram: click feedback flows from users of the live site through the API back to the experimental system]
57. METHODOLOGY (5)
• Evaluation measure: Outcome = #Wins / (#Wins + #Losses)
• The number of "wins" and "losses" is against the production system, aggregated over a period of time
• An Outcome of > 0.5 means beating the production system
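The measure is a one-liner; a minimal sketch (ties are simply ignored, as in the definition above):

```python
def outcome(wins, losses):
    """Fraction of non-tied interleaving impressions won against production;
    values above 0.5 mean the experimental system beats production."""
    return wins / (wins + losses)
```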
58. LIMITATIONS
• Head queries only: Considerable portion of traffic, but
only popular info needs
• Lack of context: No knowledge of the searcher’s location,
previous searches, etc.
• No real-time feedback: API provides detailed feedback,
but it’s not immediate
• Limited control: Experimentation is limited to single
searches, where results are interleaved with those of the
production system; no control over the entire result list
60. EVALUATION CAMPAIGNS
[Figure: timeline of evaluation campaigns, including product search (Hungarian toy store, twice), academic search (CiteSeerX, SSOAR, Microsoft Academic; later CiteSeerX and SSOAR), and web search (Czech web search engine), with TREC OpenSearch editions OS'16 and OS'17 marked]
61. TREC OPENSEARCH
• Sites: academic search engines
• Task: ad hoc scientific literature search
• Multiple evaluation rounds (6 weeks each)
• Train/test queries
• Training queries: feedback on individual impressions
• Test queries: only aggregated feedback at the end of the
evaluation period
63. EVALUATION RESULTS: CITESEERX, TREC-OS 2016, ROUND #3
          Wins  Ties  Losses  Outcome  p-value
System 1    48    15      39   0.5517   0.3912
System 2    27    11      22   0.5510   0.5682
System 3    35    14      32   0.5224   0.8072
...
We would need to gather data for about six months for p < 0.05 and for about a year for p < 0.01 (assuming a similar win/loss ratio).
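The p-values above are consistent with a two-sided sign test on wins vs. losses (ties discarded), which can be computed exactly from the binomial distribution; a sketch:

```python
from math import comb

def sign_test_pvalue(wins, losses):
    """Exact two-sided sign test: under the null hypothesis each non-tied
    impression is a fair coin flip, so #wins ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    k = max(wins, losses)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# System 1 from the table above: 48 wins vs. 39 losses.
p_value = sign_test_pvalue(48, 39)
```

With the win/loss counts from the table this reproduces p-values close to those reported, and makes concrete why months of similar traffic would be needed to reach p < 0.05.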
64. LESSONS LEARNED
• Head first idea is feasible
• Running multiple campaigns without major technical hurdles
• Low traffic/click volume is an issue
• No statistically significant differences observed
• Possible remedy is to use more queries (tap into the long tail)
• Main challenges are more of an organizational than of a
technical nature
• Nontrivial infrastructure development on the service providers’ side
• Convincing large industrial partners as sites
• Attracting a large and active set of participants
R. Jagerman, K. Balog, and M. de Rijke. OpenSearch: Lessons Learned from an Online Evaluation Campaign. Journal of Data and Information Quality, 2018.
72. BUILDING A SERVICE FOR SCIENTIFIC LITERATURE RECOMMENDATION
Part II: Evaluation
73. ARXIVDIGEST: THE SERVICE
• Recommendation service to help keep up with scientific
literature published on arxiv.org
• Users sign up and indicate their interests by providing
keywords, Google Scholar/DBLP profile, etc.
• Users receive recommendations regularly in a digest email
• Articles can be liked
• Users agree that their profile, the articles recommended to them, and their feedback are made available to experimental systems
75. ARXIVDIGEST: THE EVALUATION
PLATFORM
• Broker-based architecture
• RESTful API for accessing article and user data and for
uploading recommendations
• Participating teams are given a window each day to
download new content and to generate
recommendations for all users
• Users receive interleaved rankings
• Performance is monitored continuously over time
76. CURRENT STATUS AND
OPPORTUNITIES
• All components of the broker in place
• https://github.com/iai-group/ArXivDigest
• Ensuring GDPR-compliance is in progress
• Opportunities for studying
• Personalized recommender algorithms
• Explainable recommendations
• Interleaving
• ...
77. SUMMARY OF PART II
• The community needs open online evaluation platforms
• Lessons learned from previous evaluation benchmarks
• Proposal: develop a service that we'd use ourselves
78. TAKE-HOME MESSAGES
• A truly personal AI is not possible without a personal
knowledge graph
• The community needs open research platforms for online
evaluation