Measuring System Performance in
Cultural Heritage Information Systems
Toine Bogers
Aalborg University Copenhagen, Denmark
‘Evaluating Cultural Heritage Information Systems’ workshop
iConference 2015, Newport Beach
March 24, 2015
Outline
• Types of cultural heritage (CH) information systems
- Definition
- Common evaluation practice
• Challenges
• Case study: Social Book Search track
Types of cultural heritage information systems
• Large variety in the types of cultural heritage collections → many
different ways of unlocking this material
• Four main types of cultural heritage information systems
- Search
- Browsing
- Recommendation
- Enrichment
Search (Definition)
• Search engines provide direct access to the collection
- Search engine indexes representations of the collection objects
(occasionally full-text)
- User interacts by actively submitting queries describing their information
need(s)
- Search engine ranks the collection documents by (topical) relevance to the
query (a minimal indexing & ranking sketch follows this slide)
• Examples
- Making museum collection metadata accessible (Koolen et al., 2009)
- Searching through war-time radio broadcasts (Heeren et al., 2007)
- Unlocking television broadcast archives (Hollink et al., 2009)
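To make the indexing and ranking step concrete, here is a minimal sketch using TF-IDF and cosine similarity. It assumes scikit-learn is available; the collection records and object identifiers are invented, and a production system would typically rely on a dedicated search engine (e.g., Lucene/Solr or Elasticsearch) rather than this toy setup.

```python
# Minimal sketch: index textual representations of collection objects and
# rank them by cosine similarity to a query (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = {
    "obj1": "oil portrait of a 17th-century merchant, Rembrandt school",
    "obj2": "wartime radio broadcast about the liberation, 1944",
    "obj3": "television news bulletin on the royal visit, 1965",
}

ids = list(records)
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(list(records.values()))  # index the representations

def search(query, k=3):
    """Rank collection objects by (topical) similarity to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, index).ravel()
    ranked = sorted(zip(scores, ids), reverse=True)[:k]
    return [(obj_id, round(float(score), 3)) for score, obj_id in ranked]

print(search("radio broadcasts from the war"))
```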
Example: Searching broadcast video archives (screenshot of the system GUI, with the high-level concept search at the top left)
Search (Evaluation practice)
• What do we need?
- Realistic collection of objects with textual representations
- Representative set of real-world information needs → for reliable
evaluation we typically need ≥50 topics
- Relevance judgments → (semi-)complete list of correct answers for each of
these topics, preferably from the original users
- Together, these components make up a ‘test collection’
• How do we evaluate?
- Unranked → Precision (what did we get right?) & Recall (what did we miss?)
- Ranked → MRR (where is the first relevant result?), MAP (are the relevant
results all near the top?), and nDCG (are the most relevant results returned
before the less relevant ones?); a sketch of these metrics follows this slide
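As a rough illustration of the ranked metrics listed above, here is a minimal, dependency-free Python sketch; the document identifiers and judgments are invented, and evaluation campaigns normally rely on standard tooling such as trec_eval. MRR and MAP are simply these per-topic values averaged over all topics in the test collection.

```python
import math

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of relevant results."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ndcg(ranking, grades, k=10):
    """Normalized discounted cumulative gain over graded judgments."""
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))
    gains = [grades.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Toy example: one topic with two relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
grades = {"d1": 3, "d2": 1}                  # graded judgments for nDCG
print(reciprocal_rank(ranking, relevant))    # 0.5: first relevant result at rank 2
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(round(ndcg(ranking, grades, k=4), 3))
```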
Browsing (Definition)
• Browsing supports free-form to semi-guided exploration of collections
- Object metadata allows for links between objects → clicking on a link
shows all other objects that share that property
- Exploration can also take place along other dimensions (e.g., temporal or
geographical)
- Taxonomies & ontologies can be used to link objects in different ways
- Users can explore with or without a direct information need
• Examples
- Exploring digital cultural heritage spaces in PATHS (Hall et al., 2012)
- Semantic portals for cultural heritage (Hyvönen, 2009)
Example: Providing multiple paths through a collection using PATHS
Browsing (Evaluation practice)
• What do we need?
- System-based evaluation of performance is hard to do → browsing is the
most user-focused of the four system types
- If historical interaction logs are available, then these could be used to
identify potential browsing ‘shortcuts’
• How do we evaluate?
- Known-item evaluation → shortest path lengths to randomly selected items can
give an indication of the best possible outcome (a shortest-path sketch follows this slide)
‣ Needs to be complemented with user-based studies of actual browsing behavior!
- ‘Novel’ information need → User-based evaluation is required to draw any
meaningful conclusions (about satisfaction, effectiveness, and efficiency)
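One way to approximate the known-item ‘shortest path’ bound offline is sketched below. It assumes objects are linked whenever they share a metadata value, which is a simplification of how real browsing interfaces link objects; the toy collection is invented.

```python
# Minimal sketch: breadth-first search over a metadata link graph to estimate
# the minimum number of clicks between two objects (a best-case bound only).
from collections import deque, defaultdict

def build_link_graph(objects):
    """Connect objects that share any metadata value (creator, subject, ...)."""
    by_value = defaultdict(set)
    for obj_id, values in objects.items():
        for value in values:
            by_value[value].add(obj_id)
    graph = defaultdict(set)
    for ids in by_value.values():
        for obj_id in ids:
            graph[obj_id] |= ids - {obj_id}
    return graph

def shortest_path_length(graph, start, target):
    """BFS path length in clicks; None if the target is unreachable."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

objects = {
    "obj1": {"Rembrandt", "portrait"},
    "obj2": {"portrait", "17th century"},
    "obj3": {"17th century", "landscape"},
}
graph = build_link_graph(objects)
print(shortest_path_length(graph, "obj1", "obj3"))  # 2 clicks, via obj2
```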
Recommendation (Definition)
• Recommender systems provide suggestions for new content
- Non-personalized → “More like this” functionality (sketched after this slide)
- Personalized → Suggestions for new content based on past interactions
‣ System records implicit (or explicit) evidence of user interest (e.g., views,
bookmarks, prints, ...)
‣ The system finds interesting, related content based on content-based and/or social
similarity & generates a personalized ranking of that content by training a model of
the user and item space
‣ The user’s role is passive: interactions are recorded & suggestions are pushed to the user
• Examples
- Personalized museum tours (Ardissono et al., 2012; Bohnert et al., 2008;
De Gemmis et al., 2008; Wang et al., 2009)
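A minimal sketch of the non-personalized ‘more like this’ case: other objects are ranked by Jaccard overlap of their metadata values. The objects and metadata are invented, and real systems would typically use richer content-based and/or social similarity.

```python
# Illustrative "more like this" ranking based on shared metadata values.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def more_like_this(target_id, objects, k=3):
    """Rank all other objects by metadata overlap with the target object."""
    target = objects[target_id]
    scored = [(jaccard(target, values), obj_id)
              for obj_id, values in objects.items() if obj_id != target_id]
    return sorted(scored, reverse=True)[:k]

objects = {
    "painting1": {"Rembrandt", "portrait", "17th century"},
    "painting2": {"Vermeer", "portrait", "17th century"},
    "print1": {"Rembrandt", "etching"},
}
print(more_like_this("painting1", objects))
```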
Example: Personalized museum tours using CHIP (Fig. 4: screenshot of the CHIP recommender)
Recommendation (Evaluation practice)
• What do we need?
- A profile for each user, containing a sufficiently large number (≥20) of
user preferences (views, plays, bookmarks, prints, ratings, etc.)
- Problematic in the start-up phase of a system, leading to the cold-start
problem
‣ Possible solution → combining multiple algorithms to provide recommendations until we
have collected enough information
• How do we evaluate?
- Backtesting (a combination of information retrieval & machine learning evaluation;
see the sketch after this slide)
‣ We hide a small number (e.g., 10) of a user’s preferences, train our algorithm, and check
whether we can successfully predict interest in the ‘missing’ items
- Evaluation metrics are similar to those used for search engine evaluation
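A minimal backtesting sketch under simplifying assumptions: invented user profiles, a trivial popularity-based recommender standing in for a real algorithm, and hit rate @ k as the metric (any of the ranked metrics from the search slides could be substituted).

```python
import random
from collections import Counter

def popularity_recommender(train_profiles, k=10):
    """Recommend the k most popular items the user has not interacted with."""
    popularity = Counter(item for items in train_profiles.values() for item in items)
    def recommend(user):
        seen = train_profiles.get(user, set())
        ranked = [item for item, _ in popularity.most_common() if item not in seen]
        return ranked[:k]
    return recommend

def backtest(profiles, n_hidden=2, k=10, seed=42):
    """Hide n_hidden preferences per user, then measure hit rate @ k."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, items in profiles.items():
        items = list(items)
        rng.shuffle(items)
        hidden[user] = set(items[:n_hidden])
        train[user] = set(items[n_hidden:])
    recommend = popularity_recommender(train, k=k)
    hits = sum(len(set(recommend(user)) & hidden[user]) for user in profiles)
    return hits / sum(len(h) for h in hidden.values())

profiles = {
    "user1": {"mona_lisa", "night_watch", "milkmaid", "starry_night"},
    "user2": {"night_watch", "milkmaid", "sunflowers", "guernica"},
    "user3": {"milkmaid", "starry_night", "guernica", "mona_lisa"},
}
print(backtest(profiles, n_hidden=2, k=3))
```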
Enrichment (Definition)
• Enrichment covers all approaches that add extra layers of
information to collection objects
- Many different types of ‘added information’: entities, events, errors/
corrections, geo-tagging, clustering, etc.
- Typically use machine learning to predict additional information relevant for
an object
‣ Supervised learning uses labeled examples to learn patterns
‣ Unsupervised learning attempts to find patterns without examples
• Examples
- Automatically correcting database entry errors (Van den Bosch et al., 2009)
- Historical event detection in text (Cybulska & Vossen, 2011)
Example: Automatic error correction in databases (screenshot: details of an animal specimen database entry)
Example: Historical event extraction from text (Figure 1: screenshot of an object page in the Agora Event Browsing Demonstrator, showing a 1948 children's painting of the capture of Jogjakarta during the second 'police action', together with its associated events, metadata facets, associated objects, and navigation path)
Enrichment (Evaluation practice)
• What do we need?
- Most enrichment approaches use machine learning algorithms to predict
which annotations to add to an object
- A data set with a large number (>1000) of labeled examples, each of which
contains the features describing an object and the desired output label
- Including humans in a feedback loop can reduce the number of examples
needed for good performance, but results in a longer training phase
• How do we evaluate?
- Metrics from machine learning are commonly used (see the sketch after this slide)
‣ Precision (what did we get right?) & Recall (what did we miss?)
‣ F-score (harmonic mean of Precision & Recall)
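A minimal sketch of the supervised case, assuming scikit-learn is available and using a toy, invented set of labeled object descriptions: it trains a classifier to predict a subject label and reports the precision, recall, and F-score listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

# Invented labeled examples: object description -> subject label.
train_texts = [
    "oil portrait of a merchant, 17th century",
    "etching of a biblical scene by Rembrandt",
    "wartime radio broadcast transcript, 1944",
    "television news bulletin about the liberation",
]
train_labels = ["art", "art", "broadcast", "broadcast"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

test_texts = ["watercolor portrait of a soldier", "radio speech recording, 1945"]
test_labels = ["art", "broadcast"]
predicted = model.predict(test_texts)

precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, predicted, average="macro", zero_division=0)
print(list(predicted), precision, recall, f1)
```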
Challenges
• Propagation of errors
- Unlocking cultural heritage is inherently a multi-stage process
‣ Digitization → correction → enrichment → access
- Errors will propagate and influence all subsequent stages → difficult to
tease apart what caused the errors observed at later stages
‣ Only possible with additional manual labor!
• Language
- Historical spelling variants need to be detected and incorporated
- Multilinguality → many collections contain content in multiple languages,
which presents problems for both algorithms and evaluation
Challenges
• Measuring system performance still requires user input!
- Queries, relevance judgments, user preferences, pre-classified examples, ...
• Different user groups provide different kinds of input, which affects performance →
how do we reach them, and how do we strike a balance?
- Experts
‣ Interviews, observation
- Amateurs & enthusiasts
‣ Dedicated websites & online communities
- General public
‣ Search logs
Challenges
• Scaling up from cases to databases
- Can we scale up small-scale user-based evaluation to large-scale system-
based evaluation?
- Which evaluation aspects can we measure reliably?
- How much should the human be in the loop?
• No two cultural heritage systems are the same!
- Means evaluation needs to be tailored to each situation (in collaboration
with end users)
Case study: Social Book Search
• The Social Book Search track (2011-2015) is a search challenge
focused on book search & discovery
- Originally at INEX (2011-2014), now at CLEF (2015- )
• What do we need to investigate book search & discovery?
- Collection of book records
‣ Amazon/LibraryThing collection containing 2.8 million book metadata records
‣ Mix of metadata from Amazon and LibraryThing
‣ Controlled metadata from Library of Congress (LoC) and British Library (BL)
- Representative set of book requests & information needs
- Relevance judgments (preferably graded)
Challenge: Information needs & relevance judgments
• Getting a large, varied & representative set of book information needs and
relevance judgments is far from trivial!
- Each method has its own pros and cons in terms of realism and size
                     Information needs   Relevance judgments   Size
Interview            ✓                   ✓                     ✗
Surveys              ✓                   ✗                     ✓
Search engine logs   ✗                   ✗                     ✓
Web mining           ✓                   ✓                     ✓
Solution: Mining the LibraryThing fora
• Book discussion fora contain discussions on many different topics
- Analyses of single or related books
- Author discussions & comparisons
- Reading behavior discussions
- Requests for new books to read & discover
- Re-finding known but forgotten books
• Example: LibraryThing fora
Annotated LT topic (screenshots of an annotated LibraryThing discussion thread, labeling the group name, topic title, narrative, and recommended books)
Solution: Mining the LibraryThing fora
• LibraryThing fora provided us with 10,000+ rich, realistic,
representative information needs captured in discussion threads
- Annotated 1000+ threads with additional aspects of the information needs
- Graded relevance judgments based on the following signals (an illustrative grading
sketch follows this slide)
‣ Number of mentions by other LibraryThing users
‣ Interest shown by the original requester
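Purely as an illustration (this is not the actual Social Book Search grading scheme), one could map such forum signals to relevance grades along the lines below; the signal names and thresholds are assumptions.

```python
# Hypothetical mapping from forum evidence to a graded relevance judgment.
def relevance_grade(n_mentions, thanked_by_requester, added_to_catalog):
    """Map forum signals for one suggested book to a relevance grade 0-4."""
    grade = 0
    if n_mentions >= 1:
        grade = 1                    # suggested at least once
    if n_mentions >= 3:
        grade = 2                    # suggested by several members
    if thanked_by_requester:
        grade = max(grade, 3)        # requester showed explicit interest
    if added_to_catalog:
        grade = 4                    # requester later added the book to their catalog
    return grade

print(relevance_grade(n_mentions=2, thanked_by_requester=False, added_to_catalog=True))  # 4
```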
Catalog additions
Forum suggestions added after the topic was posted
Not just true for the book domain!
Relevance for designing CH information systems
• Benefits
- Better understanding of the needs of amateurs, enthusiasts, and the
general public
- Easy & cheap way of collecting many examples of information needs
- Should not be seen as a substitute for user-based evaluation, but as a complement to it
• Caveat
- Example needs might not be available on the Web for every domain...
Conclusions
• Different types of systems require different evaluation approaches
• Many challenges exist that can influence performance
• Some of these challenges can be addressed by leveraging the power
and the breadth of the Web
Want to hear more about what we can learn from the Social
Book Search track? Come to our Tagging vs. Controlled
Vocabulary: Which is More Helpful for Book Search? talk
in the “Extracting, Comparing and Creating Book and Journal
Data” session (Wednesday, 10:30-12:00, Salon D)
Questions? Comments? Suggestions?
