1. ON ENTITIES AND EVALUATION
Krisztian Balog
University of Stavanger
@krisztianbalog
Keynote given at the 41st European Conference on Information Retrieval (ECIR '19) | Cologne, Germany, April 2019
2. SPECIAL THANKS TO
• My former PhD advisor:
• Maarten de Rijke
• My former and current PhD students:
• Jan R. Benetka, Richard Berendsen, Marc Bron, Heng Ding,
Darío Garigliotti, Faegheh Hasibi, Trond Linjordet, Robert
Neumayer, Shuo Zhang
• Collaborators on material presented in this talk:
• Po-Yu Chuang, Peter Dekker, Maarten de Rijke, Kristian
Gingstad, Rolf Jagerman, Øyvind Jekteberg, Liadh Kelly, Tom
Kenter, Phillip Schaer, Anne Schuth, Narges Tavakolpoursaleh
4. OUTLINE FOR PART I
• What is an entity?
• Why care about entities?
• What research has been done on entities in IR?
• What’s next?
5. WHAT IS AN ENTITY?
An entity is an object or thing that can be uniquely identified.
• Entity catalog: entity ID*, name(s)*
6. AN ENTITY
Example: the DBpedia entity <dbr:Karen_Spark_Jones>
• <rdf:type>: <dbo:Scientist>, <dbo:Person>, <dbo:Agent>, <owl:Thing>
• <foaf:name>: "Karen Spärck Jones"
• <dbo:birthDate>: "1935-08-26"
• <dbo:abstract>: "Karen Spärck Jones FBA (26 August 1935 – 4 April 2007) was a British computer scientist."
• <dbo:spouse>: <dbr:Roger_Needham>
• <dbp:almaMater>: <University_of_Cambridge>
• <dbo:knownFor>: <dbr:Natural_language_processing>
• <dct:subject>: <dbc:Information_retrieval_researchers>, <dbc:British_women_computer_scientists>, <dbc:British_computer_scientists>, <dbc:British_women_scientists>
7. WHAT IS AN ENTITY?
An entity is a uniquely identifiable object or thing,
characterized by its name(s), type(s), attributes, and
relationships to other entities.
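The definition above maps naturally onto a small data structure; a minimal sketch, with field values taken from the Karen Spärck Jones example (the layout itself is illustrative, not from the talk):

```python
# One entity, characterized by its ID, name(s), type(s), attributes,
# and relationships to other entities (values from the DBpedia example).
entity = {
    "id": "dbr:Karen_Spark_Jones",
    "names": ["Karen Spärck Jones"],
    "types": ["dbo:Scientist", "dbo:Person", "dbo:Agent", "owl:Thing"],
    "attributes": {"dbo:birthDate": "1935-08-26"},
    "relationships": {
        "dbo:spouse": ["dbr:Roger_Needham"],
        "dbo:knownFor": ["dbr:Natural_language_processing"],
    },
}

def is_a(entity, type_id):
    """An entity is uniquely identifiable and can be checked against its types."""
    return type_id in entity["types"]
```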
8. REPRESENTING ENTITIES AND THEIR PROPERTIES
• Entity catalog: entity ID*, name(s)*
• Knowledge repository: adds type(s)*, descriptions, relationships (non-typed links)
9. REPRESENTING ENTITIES AND THEIR PROPERTIES
• Entity catalog: entity ID*, name(s)*
• Knowledge repository: adds type(s)*, descriptions, relationships (non-typed links)
• Knowledge base (KB) / knowledge graph (KG): adds attributes, relationships (typed links)
10. WHY CARE ABOUT ENTITIES?
• From a user perspective, entities ...
• are natural units for organizing
information
• enable a richer and more effective
user experience
11. WHY CARE ABOUT ENTITIES?
• From a machine perspective, entities ...
• allow for a better understanding of queries,
document content, and of users
• enable search engines to be more
intelligent
Michael Schumacher (born 3 January 1969) is a German retired racing driver. He is a seven-time Formula One World Champion and is widely regarded as one of the greatest Formula One drivers of all time. He won two titles with Benetton in 1994 and 1995 before moving to Ferrari where he drove for eleven years. His time with Ferrari yielded five consecutive titles between 2000 and 2004.
Linked entities: Michael Schumacher (Racing driver), Scuderia Ferrari (Formula One constructor), Benetton Formula (Formula One constructor), Formula One (Auto racing series)
13. TRENDS IN THE IR LITERATURE
[Chart: number of matching paper titles per year, 2000-2016, for the queries "entity OR entities", "Wikipedia", "knowledge base", and "knowledge graph"; y-axis from 0 to 40]
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW.
14. TRENDS IN THE IR LITERATURE
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW.
[Chart: number of matching paper titles per year, 2000-2016, for "entity OR entities" vs. 'Wikipedia OR "knowledge base" OR "knowledge graph"'; y-axis from 0 to 40]
15. #1 ENTITIES AS THE UNIT OF RETRIEVAL
• A significant portion of queries mention or target entities
• Those queries are better answered with a ranked list of
entities (as opposed to a list of documents)
• Term-based entity representations can be effectively
ranked using document-based retrieval models
• Semantically informed retrieval models utilize entity-specific properties (attributes, types, and relationships)
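The claim that term-based entity representations can be ranked with document retrieval models can be made concrete with a minimal sketch: entity descriptions are treated as pseudo-documents and scored with the standard BM25 formula (the toy descriptions below are illustrative, not from the talk):

```python
import math
from collections import Counter

# Toy term-based entity representations built from names/descriptions.
entities = {
    "Karen Sparck Jones": "british computer scientist information retrieval idf",
    "Michael Schumacher": "german racing driver formula one world champion",
    "University of Cambridge": "university cambridge england research",
}
docs = {e: text.split() for e, text in entities.items()}
df = Counter(t for terms in docs.values() for t in set(terms))  # document frequencies
avgdl = sum(len(t) for t in docs.values()) / len(docs)          # average doc length

def bm25_score(query, doc_terms, k1=1.2, b=0.75):
    """Score one entity pseudo-document with BM25."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

def rank(query):
    """Rank entities for a query exactly as documents would be ranked."""
    return sorted(docs, key=lambda e: bm25_score(query, docs[e]), reverse=True)
```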
16. #2 ENTITIES FOR KNOWLEDGE
REPRESENTATION
• Entities help to bridge the gap between unstructured and
structured data
[Figure: unstructured text with <entity> mentions connected to a knowledge base via entity linking and knowledge base population]
17. #3 ENTITIES FOR AN ENHANCED
SEARCH EXPERIENCE
• Improve the search experience through the entire search
process
• Understanding search queries
• Improving document retrieval performance
• Query assistance services (auto-completion, suggestions)
• Entity recommendations
19. OUTLINE FOR PART I
• What is an entity?
• Why care about entities?
• What research has been done on entities in IR?
• What’s next?
20. SCENARIO #1
User: I would like to get some new strings for my guitar.
AI: OK, would that be your electric guitar or the acoustic one?
User: The electric one.
AI: Alright. I can repeat your Amazon order of 3 months ago, or you can stop by a music store on Elm Street on the way to your dentist appointment this afternoon.
21. TRULY PERSONAL AI
IS NOT POSSIBLE WITHOUT A
PERSONAL KNOWLEDGE GRAPH
22. PERSONAL KNOWLEDGE GRAPHS
A personal knowledge graph (PKG) is a source of structured knowledge about entities and the relations between them, where those entities and relations are of personal, rather than general, importance.
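A PKG of this kind can be sketched as a set of subject-predicate-object triples anchored at a distinguished user node (the relation names below are illustrative; the entities come from the scenarios in this talk):

```python
# A minimal personal knowledge graph as (subject, predicate, object) triples;
# "user" is the distinguished node everything must connect to.
pkg = {
    ("user", "hasMother", "Mom"),
    ("user", "owns", "Electric guitar"),
    ("user", "owns", "Acoustic guitar"),
    ("Mom", "hasDentist", "Dr. John Pullman"),
}

def neighbors(graph, node):
    """All (predicate, object) pairs directly attached to a node."""
    return {(p, o) for s, p, o in graph if s == node}

def reachable(graph, start):
    """Entities (in)directly connected to `start`; in a well-formed PKG
    every entity should be reachable from the user node."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for _, obj in neighbors(graph, node):
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen
```

Reachability from "user" captures the requirement, stated under RQ1 below, that entities be directly or indirectly connected to the user.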
23. PERSONAL KNOWLEDGE GRAPHS
[Figure: a personal knowledge graph centered on the User, with nodes such as Hometown, Mom, Jamie, High school, Mom's dentist, Electric guitar, and Acoustic guitar; it overlaps with a general-purpose KG, a domain-specific KG, an e-commerce catalog, and a social network]
24. Part I: Entities
A RESEARCH AGENDA FOR PERSONAL KNOWLEDGE GRAPHS
25. #1 KNOWLEDGE REPRESENTATION
• Task: representing entities and their properties
• KGs are organized according to a knowledge model (schema)
• Peculiarities/challenges:
• Entities need to be (directly/indirectly) connected to the user
• Not duplicating attributes, focusing on what is personal
• Information about entities can be very sparse
• Some entities may not have any digital presence
• Strong temporality (relations can be ephemeral)
26. #1 KNOWLEDGE REPRESENTATION
RQ1: What is the best way of representing entities and their properties and relations, considering the vast but sparse set of possible predicates?
27. #2 SEMANTIC ANNOTATION OF TEXT
• Task: annotating text with respect to a knowledge
repository (commonly known as entity linking)
• Usually involves mention detection, entity disambiguation,
and NIL-detection steps
• Challenges
• Entities might have little to no digital presence
• Entities are not necessarily proper nouns
• Linking, NIL-detection, and KG population are intertwined
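The three steps above can be sketched end-to-end; the surface-form dictionary and commonness scores below are toy stand-ins (a real linker would use far richer statistics and context):

```python
# Toy entity linking pipeline: mention detection -> disambiguation -> NIL detection.
# The surface forms and scores are illustrative only.
SURFACE_FORMS = {
    "ferrari": {"Scuderia Ferrari": 0.7, "Ferrari (car maker)": 0.3},
    "benetton": {"Benetton Formula": 0.6, "Benetton (clothing)": 0.4},
    "schumi": {},  # a known mention string with no KB candidate -> NIL
}

def detect_mentions(text):
    """Mention detection: find known surface forms in the text."""
    tokens = text.lower().replace(",", " ").split()
    return [t for t in tokens if t in SURFACE_FORMS]

def link(text):
    """Disambiguate each mention to its most common entity; NIL if no candidate."""
    annotations = []
    for mention in detect_mentions(text):
        candidates = SURFACE_FORMS[mention]
        entity = max(candidates, key=candidates.get) if candidates else "NIL"
        annotations.append((mention, entity))
    return annotations
```

In the PKG setting, a NIL result would trigger population of the graph rather than being discarded, which is exactly why linking, NIL-detection, and population are intertwined.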
28. #2 SEMANTIC ANNOTATION OF TEXT
RQ2a: How can entity linking be performed against a personal knowledge graph, where structured entity information to rely on is potentially absent?
RQ2b: When should entity linking be performed against a personal knowledge graph as opposed to a general-purpose KG?
29. SCENARIO #2
User: I need to see a dentist. Mom recommended hers at dinner yesterday.
AI: I can try to help you find this person. Do you have any more information?
User: I reckon that he and Mom graduated from the same high school the same year.
AI: OK, that's enough to narrow it down. It must be Dr. John Pullman.
User: That must be him. I remember he had a fitting name. Can you try to make an appointment for Thursday afternoon?
30. #3 POPULATION AND MAINTENANCE
• Task: extending a KG from external sources (KB
acceleration/population) or via internal inferencing
• Verification of facts in the KG
• Challenges:
• Single curator; more automation is desired than for general-purpose KGs, but the user should still be in control
• The first mention of an entity should trigger population
• Properties may be inferred from the context
31. #3 POPULATION AND MAINTENANCE
RQ3: How can personal knowledge graphs be automatically populated and reliably maintained?
32. SCENARIO #3
AI: Since you're running a half marathon at Hackney in May, may I suggest you undertake a 10k run this weekend?
User: Yes, that sounds like a good idea. Any suggestions for a not too popular route that I haven't done before?
AI: Sure thing. I'll upload some routes to the running app on your phone.
User: Cheers mate!
33. #4 QUERYING
• Task: Retrieving information (entities, types, relations, etc.)
from the PKG or from KGs with the help of the PKG
• Challenges:
• Sparsity of data
• Soft, subjective constraints
34. #4 QUERYING
RQ4: How to leverage the semantically rich but sparse information in personal knowledge graphs for answering natural language queries?
35. #5 INTEGRATION WITH EXTERNAL
SOURCES
• Task: recognizing the same entity across multiple data
sources (a.k.a. object resolution, record linkage, ...)
• Challenges:
• One-to-many, as opposed to one-to-one linkage
• Continuous process, not a one-off effort
• Two-way synchronization would be desired
• Conflicting facts or relations need resolving by the user
36. #5 INTEGRATION WITH EXTERNAL
SOURCES
RQ5: How to provide continuous two-way integration with external knowledge sources with the user in the loop?
37. RESEARCH QUESTIONS
FOR PERSONAL KNOWLEDGE GRAPHS
• What is the best way of representing entities and their properties
and relations, considering the vast but sparse set of possible
predicates?
• How can entity linking be performed against a personal knowledge
graph, where structured entity information to rely on is potentially
absent?
• How can personal knowledge graphs be automatically populated and
reliably maintained?
• How to leverage the semantically rich but sparse information in
personal knowledge graphs for answering natural language queries?
• How to provide continuous two-way integration with external
knowledge sources with the user in the loop?
38. THERE IS MORE...
• Implementation
• Where is it stored (on the device, cloud, etc.)?
• How can security and privacy be ensured?
• How to interact with a range of services with proper access
control?
• Evaluation
• How to build reusable test resources?
39. SUMMARY OF PART I
• Progress on entity-oriented search was enabled by large
open knowledge repositories
• Personal AI is not possible without the concept of a
personal knowledge graph
• Many interesting research opportunities are available
43. ONLINE EVALUATION 101
• See how regular users interact with a retrieval system during normal use
• Observe implicit behavior
• Clicks, skips, saves, forwards, bookmarks, likes, etc.
• Try to infer differences in behavior from different flavors of
the live system
• A/B testing, interleaving
• Run statistical tests to confirm the difference is not due to
chance
44. CHALLENGES IN ONLINE EVALUATION
• It's a live service
• Complexity of modern SERPs
• Data is noisy
• There’s no “ground truth”
45. OFFLINE VS. ONLINE EVALUATION
• Basic assumption. Offline: assessors tell you what is relevant. Online: observable user behavior can tell you what is relevant.
• Quality. Offline: data is only as good as the guidelines. Online: real user data, real and representative information needs.
• Realism. Offline: simplified scenario, cannot go beyond a certain level of complexity. Online: perfectly realistic setting (users are not aware that they are guinea pigs).
• Assessment cost. Offline: expensive. Online: cheap.
• Scalability. Offline: doesn't scale. Online: scales very well.
• Repeatability. Offline: repeatable. Online: not repeatable.
• Throughput. Offline: high. Online: low.
• Risk. Offline: none. Online: high.
47. LIVING LABS
Living labs is a new evaluation paradigm for IR,
where the experimentation platform is an existing
search engine. Researchers have the opportunity to
replace components of this search engine and
evaluate these components using interactions with
real, "unsuspecting" users of this search engine.
49. ALL WE NEED IS A SITE:
LET'S TAKE AN EXISTING ONE
50. KEY IDEAS FOR OPERATIONALIZATION
• An API orchestrates all the data exchange between sites
(live search engines) and participants
• Focus on frequent (head) queries
• Enough traffic on them for experimentation
• Participants generate rankings offline and upload these
to the API
• Eliminates real-time requirement
• Freedom in choice of tools and environment
K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM '14.
54. METHODOLOGY (3)
• When any of the test queries is fired on the live site, it requests an experimental ranking from the API and interleaves it with that of the production system
[Diagram: the live site forwards the query to the API and on to the experimental system; the experimental ranking is returned via the API and shown to the user as an interleaved ranking]
55. METHODOLOGY (3)
Example (interleaving two rankings):
System A: doc 1, doc 2, doc 3, doc 4, doc 5
System B: doc 2, doc 4, doc 7, doc 1, doc 3
Interleaved list: doc 1, doc 2, doc 4, doc 3, doc 7
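The interleaving step can be sketched with the team-draft method; this is one common algorithm, chosen here for illustration (the talk does not specify which variant the sites used), and real deployments randomize which team picks first in each round:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Team-draft interleaving: in each round, teams A and B contribute their
    highest-ranked document not yet selected; which team picks first per
    round is chosen at random."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    credit = {"A": [], "B": []}  # which team contributed each document
    interleaved = []
    length = min(length, len(set(ranking_a) | set(ranking_b)))
    while len(interleaved) < length:
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        for team in order:
            for doc in rankings[team]:
                if doc not in interleaved:
                    interleaved.append(doc)
                    credit[team].append(doc)
                    break
            if len(interleaved) >= length:
                break
    return interleaved, credit
```

A click on a document credited to the experimental team then counts as a win for the experimental system, which is how the wins and losses in the Outcome measure below are collected.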
56. METHODOLOGY (4)
• Participants get detailed feedback on user interactions (clicks)
[Diagram: click feedback flows from users of the live site through the API back to the experimental system]
57. METHODOLOGY (5)
• Evaluation measure: Outcome = #Wins / (#Wins + #Losses)
• The number of "wins" and "losses" is against the production system, aggregated over a period of time
• An Outcome of > 0.5 means beating the production system
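The measure is a one-liner; a minimal sketch (ties are simply ignored, as in the definition above):

```python
def outcome(wins, losses):
    """Fraction of non-tied interleaving impressions won against production;
    values above 0.5 mean the experimental system beats production."""
    return wins / (wins + losses)
```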
58. LIMITATIONS
• Head queries only: Considerable portion of traffic, but
only popular info needs
• Lack of context: No knowledge of the searcher’s location,
previous searches, etc.
• No real-time feedback: API provides detailed feedback,
but it’s not immediate
• Limited control: Experimentation is limited to single
searches, where results are interleaved with those of the
production system; no control over the entire result list
60. EVALUATION CAMPAIGNS
[Figure: timeline of evaluation campaigns, including product search (Hungarian toy store, twice), academic search (CiteSeerX, SSOAR, Microsoft Academic; later CiteSeerX and SSOAR), and web search (Czech web search engine), with TREC OpenSearch editions OS'16 and OS'17 marked]
61. TREC OPENSEARCH
• Sites: academic search engines
• Task: ad hoc scientific literature search
• Multiple evaluation rounds (6 weeks each)
• Train/test queries
• Training queries: feedback on individual impressions
• Test queries: only aggregated feedback at the end of the
evaluation period
63. EVALUATION RESULTS: CITESEERX, TREC-OS 2016, ROUND #3
          Wins  Ties  Losses  Outcome  p-value
System 1    48    15      39   0.5517   0.3912
System 2    27    11      22   0.5510   0.5682
System 3    35    14      32   0.5224   0.8072
...
We would need to gather data for about six months for p < 0.05 and for about a year for p < 0.01 (assuming a similar win/loss ratio).
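The p-values above are consistent with a two-sided sign test on wins vs. losses (ties discarded), which can be computed exactly from the binomial distribution; a sketch:

```python
from math import comb

def sign_test_pvalue(wins, losses):
    """Exact two-sided sign test: under the null hypothesis each non-tied
    impression is a fair coin flip, so #wins ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    k = max(wins, losses)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# System 1 from the table above: 48 wins vs. 39 losses.
p_value = sign_test_pvalue(48, 39)
```

With the win/loss counts from the table this reproduces p-values close to those reported, and makes concrete why months of similar traffic would be needed to reach p < 0.05.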
64. LESSONS LEARNED
• Head first idea is feasible
• Running multiple campaigns without major technical hurdles
• Low traffic/click volume is an issue
• No statistically significant differences observed
• Possible remedy is to use more queries (tap into the long tail)
• Main challenges are more of an organizational than of a
technical nature
• Nontrivial infrastructure development on the service providers’ side
• Convincing large industrial partners as sites
• Attracting a large and active set of participants
R. Jagerman, K. Balog, and M. de Rijke. OpenSearch: Lessons Learned from an Online Evaluation Campaign. Journal of Data and Information Quality, 2018.
72. BUILDING A SERVICE FOR SCIENTIFIC LITERATURE RECOMMENDATION
Part II: Evaluation
73. ARXIVDIGEST: THE SERVICE
• Recommendation service to help keep up with scientific
literature published on arxiv.org
• Users sign up and indicate their interests by providing
keywords, Google Scholar/DBLP profile, etc.
• Users receive recommendations regularly in a digest email
• Articles can be liked
• Users agree that their profile, the articles recommended to them, and their feedback are made available to experimental systems
75. ARXIVDIGEST: THE EVALUATION
PLATFORM
• Broker-based architecture
• RESTful API for accessing article and user data and for
uploading recommendations
• Participating teams are given a window each day to
download new content and to generate
recommendations for all users
• Users receive interleaved rankings
• Performance is monitored continuously over time
76. CURRENT STATUS AND
OPPORTUNITIES
• All components of the broker in place
• https://github.com/iai-group/ArXivDigest
• Ensuring GDPR-compliance is in progress
• Opportunities for studying
• Personalized recommender algorithms
• Explainable recommendations
• Interleaving
• ...
77. SUMMARY OF PART II
• The community needs open online evaluation platforms
• Lessons learned from previous evaluation benchmarks
• Proposal: develop a service that we'd use ourselves
78. TAKE-HOME MESSAGES
• A truly personal AI is not possible without a personal
knowledge graph
• The community needs open research platforms for online
evaluation