Measuring System Performance in
Cultural Heritage Information Systems
Toine Bogers
Aalborg University Copenhagen, Denmark
‘Evaluating Cultural Heritage Information Systems’ workshop
iConference 2015, Newport Beach
March 24, 2015
Outline
• Types of cultural heritage (CH) information systems
- Definition
- Common evaluation practice
• Challenges
• Case study: Social Book Search track
Types of cultural heritage information systems
• Large variety in the types of cultural heritage collections → many
different ways of unlocking this material
• Four main types of cultural heritage information systems
- Search
- Browsing
- Recommendation
- Enrichment
Search (Definition)
• Search engines provide direct access to the collection
- Search engine indexes representations of the collection objects
(occasionally full-text)
- User interacts by actively submitting queries describing their information
need(s)
- Search engine ranks the collection documents by (topical) relevance to the
query (a minimal indexing & ranking sketch follows this slide)
• Examples
- Making museum collection metadata accessible (Koolen et al., 2009)
- Searching through war-time radio broadcasts (Heeren et al., 2007)
- Unlocking television broadcast archives (Hollink et al., 2009)
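To make the indexing and ranking step concrete, here is a minimal sketch using TF-IDF and cosine similarity. It assumes scikit-learn is available; the collection records and object identifiers are invented, and a production system would typically rely on a dedicated search engine (e.g., Lucene/Solr or Elasticsearch) rather than this toy setup.

```python
# Minimal sketch: index textual representations of collection objects and
# rank them by cosine similarity to a query (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = {
    "obj1": "oil portrait of a 17th-century merchant, Rembrandt school",
    "obj2": "wartime radio broadcast about the liberation, 1944",
    "obj3": "television news bulletin on the royal visit, 1965",
}

ids = list(records)
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(list(records.values()))  # index the representations

def search(query, k=3):
    """Rank collection objects by (topical) similarity to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, index).ravel()
    ranked = sorted(zip(scores, ids), reverse=True)[:k]
    return [(obj_id, round(float(score), 3)) for score, obj_id in ranked]

print(search("radio broadcasts from the war"))
```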
Example: Searching broadcast video archives (screenshot of the system GUI, with the high-level concept search at the top left)
Search (Evaluation practice)
• What do we need?
- Realistic collection of objects with textual representations
- Representative set of real-world information needs → for reliable
evaluation we typically need ≥50 topics
- Relevance judgments → (semi-)complete list of correct answers for each of
these topics, preferably from the original users
- Together, these components make up a ‘test collection’
• How do we evaluate?
- Unranked → Precision (what did we get right?) & Recall (what did we miss?)
- Ranked → MRR (where is the first relevant result?), MAP (are the relevant
results all near the top?), and nDCG (are the most relevant results returned
before the less relevant ones?); a sketch of these metrics follows this slide
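As a rough illustration of the ranked metrics listed above, here is a minimal, dependency-free Python sketch; the document identifiers and judgments are invented, and evaluation campaigns normally rely on standard tooling such as trec_eval. MRR and MAP are simply these per-topic values averaged over all topics in the test collection.

```python
import math

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of relevant results."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ndcg(ranking, grades, k=10):
    """Normalized discounted cumulative gain over graded judgments."""
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))
    gains = [grades.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Toy example: one topic with two relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
grades = {"d1": 3, "d2": 1}                  # graded judgments for nDCG
print(reciprocal_rank(ranking, relevant))    # 0.5: first relevant result at rank 2
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(round(ndcg(ranking, grades, k=4), 3))
```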
Browsing (Definition)
• Browsing supports free-form to semi-guided exploration of collections
- Object metadata allows for links between objects → clicking on a link
shows all other objects that share that property
- Exploration can also take place along other dimensions (e.g., temporal or
geographical)
- Taxonomies & ontologies can be used to link objects in different ways
- Users can explore with or without a direct information need
• Examples
- Exploring digital cultural heritage spaces in PATHS (Hall et al., 2012)
- Semantic portals for cultural heritage (Hyvönen, 2009)
Example: Providing multiple paths through a collection using PATHS
Browsing (Evaluation practice)
• What do we need?
- System-based evaluation of performance is hard to do → browsing is the
most user-focused of the four system types
- If historical interaction logs are available, then these could be used to
identify potential browsing ‘shortcuts’
• How do we evaluate?
- Known-item evaluation → shortest path lengths to randomly selected items can
give an indication of the best possible outcome (a shortest-path sketch follows this slide)
‣ Needs to be complemented with user-based studies of actual browsing behavior!
- ‘Novel’ information need → User-based evaluation is required to draw any
meaningful conclusions (about satisfaction, effectiveness, and efficiency)
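One way to approximate the known-item ‘shortest path’ bound offline is sketched below. It assumes objects are linked whenever they share a metadata value, which is a simplification of how real browsing interfaces link objects; the toy collection is invented.

```python
# Minimal sketch: breadth-first search over a metadata link graph to estimate
# the minimum number of clicks between two objects (a best-case bound only).
from collections import deque, defaultdict

def build_link_graph(objects):
    """Connect objects that share any metadata value (creator, subject, ...)."""
    by_value = defaultdict(set)
    for obj_id, values in objects.items():
        for value in values:
            by_value[value].add(obj_id)
    graph = defaultdict(set)
    for ids in by_value.values():
        for obj_id in ids:
            graph[obj_id] |= ids - {obj_id}
    return graph

def shortest_path_length(graph, start, target):
    """BFS path length in clicks; None if the target is unreachable."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

objects = {
    "obj1": {"Rembrandt", "portrait"},
    "obj2": {"portrait", "17th century"},
    "obj3": {"17th century", "landscape"},
}
graph = build_link_graph(objects)
print(shortest_path_length(graph, "obj1", "obj3"))  # 2 clicks, via obj2
```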
Recommendation (Definition)
• Recommender systems provide suggestions for new content
- Non-personalized → “More like this” functionality (sketched after this slide)
- Personalized → Suggestions for new content based on past interactions
‣ System records implicit (or explicit) evidence of user interest (e.g., views,
bookmarks, prints, ...)
‣ The system finds interesting, related content based on content-based and/or social
similarity & generates a personalized ranking of that content by training a model of
the user and item space
‣ The user’s role is passive: interactions are recorded & suggestions are pushed to the user
• Examples
- Personalized museum tours (Ardissono et al., 2012; Bohnert et al., 2008;
De Gemmis et al., 2008; Wang et al., 2009)
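A minimal sketch of the non-personalized ‘more like this’ case: other objects are ranked by Jaccard overlap of their metadata values. The objects and metadata are invented, and real systems would typically use richer content-based and/or social similarity.

```python
# Illustrative "more like this" ranking based on shared metadata values.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def more_like_this(target_id, objects, k=3):
    """Rank all other objects by metadata overlap with the target object."""
    target = objects[target_id]
    scored = [(jaccard(target, values), obj_id)
              for obj_id, values in objects.items() if obj_id != target_id]
    return sorted(scored, reverse=True)[:k]

objects = {
    "painting1": {"Rembrandt", "portrait", "17th century"},
    "painting2": {"Vermeer", "portrait", "17th century"},
    "print1": {"Rembrandt", "etching"},
}
print(more_like_this("painting1", objects))
```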
Example: Personalized museum tours using CHIP (Fig. 4: screenshot of the CHIP recommender)
Recommendation (Evaluation practice)
• What do we need?
- A profile for each user, containing a sufficiently large number (≥20) of
user preferences (views, plays, bookmarks, prints, ratings, etc.)
- Problematic in the start-up phase of a system, leading to the cold-start
problem
‣ Possible solution → combining multiple algorithms to provide recommendations until we
have collected enough information
• How do we evaluate?
- Backtesting (a combination of information retrieval & machine learning evaluation;
see the sketch after this slide)
‣ We hide a small number (e.g., 10) of a user’s preferences, train our algorithm, and check
whether we can successfully predict interest in the ‘missing’ items
- Evaluation metrics are similar to those used for search engine evaluation
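A minimal backtesting sketch under simplifying assumptions: invented user profiles, a trivial popularity-based recommender standing in for a real algorithm, and hit rate @ k as the metric (any of the ranked metrics from the search slides could be substituted).

```python
import random
from collections import Counter

def popularity_recommender(train_profiles, k=10):
    """Recommend the k most popular items the user has not interacted with."""
    popularity = Counter(item for items in train_profiles.values() for item in items)
    def recommend(user):
        seen = train_profiles.get(user, set())
        ranked = [item for item, _ in popularity.most_common() if item not in seen]
        return ranked[:k]
    return recommend

def backtest(profiles, n_hidden=2, k=10, seed=42):
    """Hide n_hidden preferences per user, then measure hit rate @ k."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, items in profiles.items():
        items = list(items)
        rng.shuffle(items)
        hidden[user] = set(items[:n_hidden])
        train[user] = set(items[n_hidden:])
    recommend = popularity_recommender(train, k=k)
    hits = sum(len(set(recommend(user)) & hidden[user]) for user in profiles)
    return hits / sum(len(h) for h in hidden.values())

profiles = {
    "user1": {"mona_lisa", "night_watch", "milkmaid", "starry_night"},
    "user2": {"night_watch", "milkmaid", "sunflowers", "guernica"},
    "user3": {"milkmaid", "starry_night", "guernica", "mona_lisa"},
}
print(backtest(profiles, n_hidden=2, k=3))
```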
Enrichment (Definition)
• Enrichment covers all approaches that add extra layers of
information to collection objects
- Many different types of ‘added information’: entities, events, errors/
corrections, geo-tagging, clustering, etc.
- Typically use machine learning to predict additional information relevant for
an object
‣ Supervised learning uses labeled examples to learn patterns
‣ Unsupervised learning attempts to find patterns without examples
• Examples
- Automatically correcting database entry errors (Van den Bosch et al., 2009)
- Historical event detection in text (Cybulska & Vossen, 2011)
Example: Automatic error correction in databases (screenshot: details of an animal specimen database entry)
Example: Historical event extraction from text (Figure 1: screenshot of an object page in the Agora Event Browsing Demonstrator, showing a 1948 children's painting of the capture of Jogjakarta during the second 'police action', together with its associated events, metadata facets, associated objects, and navigation path)
Enrichment (Evaluation practice)
• What do we need?
- Most enrichment approaches use machine learning algorithms to predict
which annotations to add to an object
- A data set with a large number (>1000) of labeled examples, each of which
contains the features describing an object and the desired output label
- Including humans in a feedback loop can reduce the number of examples
needed for good performance, but results in a longer training phase
• How do we evaluate?
- Metrics from machine learning are commonly used (see the sketch after this slide)
‣ Precision (what did we get right?) & Recall (what did we miss?)
‣ F-score (harmonic mean of Precision & Recall)
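A minimal sketch of the supervised case, assuming scikit-learn is available and using a toy, invented set of labeled object descriptions: it trains a classifier to predict a subject label and reports the precision, recall, and F-score listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

# Invented labeled examples: object description -> subject label.
train_texts = [
    "oil portrait of a merchant, 17th century",
    "etching of a biblical scene by Rembrandt",
    "wartime radio broadcast transcript, 1944",
    "television news bulletin about the liberation",
]
train_labels = ["art", "art", "broadcast", "broadcast"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

test_texts = ["watercolor portrait of a soldier", "radio speech recording, 1945"]
test_labels = ["art", "broadcast"]
predicted = model.predict(test_texts)

precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, predicted, average="macro", zero_division=0)
print(list(predicted), precision, recall, f1)
```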
Challenges
• Propagation of errors
- Unlocking cultural heritage is inherently a multi-stage process
‣ Digitization → correction → enrichment → access
- Errors will propagate and influence all subsequent stages → difficult to
tease apart what caused the errors observed at later stages
‣ Only possible with additional manual labor!
• Language
- Historical spelling variants need to be detected and incorporated
- Multilinguality → many collections contain content in multiple languages,
which presents problems for both algorithms and evaluation
Challenges
• Measuring system performance still requires user input!
- Queries, relevance judgments, user preferences, pre-classified examples, ...
• Different user groups provide different kinds of input, which affects performance →
how do we reach them, and how do we strike a balance?
- Experts
‣ Interviews, observation
- Amateurs & enthusiasts
‣ Dedicated websites & online communities
- General public
‣ Search logs
Challenges
• Scaling up from cases to databases
- Can we scale up small-scale user-based evaluation to large-scale system-
based evaluation?
- Which evaluation aspects can we measure reliably?
- How much should the human be in the loop?
• No two cultural heritage systems are the same!
- Means evaluation needs to be tailored to each situation (in collaboration
with end users)
Case study: Social Book Search
• The Social Book Search track (2011-2015) is a search challenge
focused on book search & discovery
- Originally at INEX (2011-2014), now at CLEF (2015- )
• What do we need to investigate book search & discovery?
- Collection of book records
‣ Amazon/LibraryThing collection containing 2.8 million book metadata records
‣ Mix of metadata from Amazon and LibraryThing
‣ Controlled metadata from Library of Congress (LoC) and British Library (BL)
- Representative set of book requests & information needs
- Relevance judgments (preferably graded)
Challenge: Information needs & relevance judgments
• Getting a large, varied & representative set of book information needs and
relevance judgments is far from trivial!
- Each method has its own pros and cons in terms of realism and size
                     Information needs   Relevance judgments   Size
Interview            ✓                   ✓                     ✗
Surveys              ✓                   ✗                     ✓
Search engine logs   ✗                   ✗                     ✓
Web mining           ✓                   ✓                     ✓
Solution: Mining the LibraryThing fora
• Book discussion fora contain discussions on many different topics
- Analyses of single or related books
- Author discussions & comparisons
- Reading behavior discussions
- Requests for new books to read & discover
- Re-finding known but forgotten books
• Example: LibraryThing fora
Annotated LT topic (screenshots of an annotated LibraryThing discussion thread, labeling the group name, topic title, narrative, and recommended books)
Solution: Mining the LibraryThing fora
• LibraryThing fora provided us with 10,000+ rich, realistic,
representative information needs captured in discussion threads
- Annotated 1000+ threads with additional aspects of the information needs
- Graded relevance judgments based on the following signals (an illustrative grading
sketch follows this slide)
‣ Number of mentions by other LibraryThing users
‣ Interest shown by the original requester
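Purely as an illustration (this is not the actual Social Book Search grading scheme), one could map such forum signals to relevance grades along the lines below; the signal names and thresholds are assumptions.

```python
# Hypothetical mapping from forum evidence to a graded relevance judgment.
def relevance_grade(n_mentions, thanked_by_requester, added_to_catalog):
    """Map forum signals for one suggested book to a relevance grade 0-4."""
    grade = 0
    if n_mentions >= 1:
        grade = 1                    # suggested at least once
    if n_mentions >= 3:
        grade = 2                    # suggested by several members
    if thanked_by_requester:
        grade = max(grade, 3)        # requester showed explicit interest
    if added_to_catalog:
        grade = 4                    # requester later added the book to their catalog
    return grade

print(relevance_grade(n_mentions=2, thanked_by_requester=False, added_to_catalog=True))  # 4
```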
Catalog additions
Forum suggestions added after the topic was posted
Not just true for the book domain!
Relevance for designing CH information systems
• Benefits
- Better understanding of the needs of amateurs, enthusiasts, and the
general public
- Easy & cheap way of collecting many examples of information needs
- Should not be seen as a substitute for user-based evaluation, but as a complement to it
• Caveat
- Example needs might not be available on the Web for every domain...
Conclusions
• Different types of systems require different evaluation approaches
• Many challenges exist that can influence performance
• Some of these challenges can be addressed by leveraging the power
and the breadth of the Web
Want to hear more about what we can learn from the Social
Book Search track? Come to our Tagging vs. Controlled
Vocabulary: Which is More Helpful for Book Search? talk
in the “Extracting, Comparing and Creating Book and Journal
Data” session (Wednesday, 10:30-12:00, Salon D)
Questions? Comments? Suggestions?
