
Measuring System Performance in Cultural Heritage Systems



This talk presents a high-level overview of the different components of cultural heritage information systems (search, browsing, recommendation, and enrichment), how each is evaluated, and the challenges they have in common.

(Invited talk at the "Evaluating Cultural Heritage Information Systems" workshop at the iConference 2015 in Newport Beach, CA)

Published in: Science


  1. Measuring System Performance in Cultural Heritage Information Systems. Toine Bogers, Aalborg University Copenhagen, Denmark. ‘Evaluating Cultural Heritage Information Systems’ workshop, iConference 2015, Newport Beach, March 24, 2015
  2. Outline • Types of cultural heritage (CH) information systems - Definition - Common evaluation practice • Challenges • Case study: Social Book Search track
  3. Types of cultural heritage information systems • Large variety in the types of cultural heritage collections → many different ways of unlocking this material • Four main types of cultural heritage information systems - Search - Browsing - Recommendation - Enrichment
  4. Search (Definition) • Search engines provide direct access to the collection - Search engine indexes representations of the collection objects (occasionally full-text) - User interacts by actively submitting queries describing their information need(s) - Search engine ranks the collection documents by (topical) relevance for the query • Examples - Making museum collection metadata accessible (Koolen et al., 2009) - Searching through war-time radio broadcasts (Heeren et al., 2007) - Unlocking television broadcast archives (Hollink et al., 2009)
  5. Example: Searching broadcast video archives (screenshot of the GUI used for the system)
  6. Search (Evaluation practice) • What do we need? (a ‘test collection’) - Realistic collection of objects with textual representations - Representative set of real-world information needs → for reliable evaluation we typically need ≥50 topics - Relevance judgments → (semi-)complete list of correct answers for each of these topics, preferably from the original users • How do we evaluate? - Unranked → Precision (what did we get right?) & Recall (what did we miss?) - Ranked → MRR (where is the first relevant result?), MAP (are the relevant results all near the top?), and nDCG (are the most relevant results returned before the less relevant ones?)
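The metrics named on this slide can be sketched in a few lines of Python. This is a minimal illustration and not code from the talk: the function names, the toy document IDs, and the choice of a log2 discount for nDCG are assumptions made for the example. MRR and MAP are the means of the per-topic reciprocal rank and average precision over all topics in the test collection.

```python
import math

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for an unranked result list."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result (0.0 if none is returned)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg(ranking, gains, k=None):
    """nDCG with graded gains (dict: doc -> grade; unjudged docs count as 0)."""
    if k is None:
        k = len(ranking)
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values):
    """Average per-topic scores, e.g. reciprocal ranks -> MRR, APs -> MAP."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical single topic: a system ranking and its relevance judgments.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 2, "d2": 1}
print(precision_recall(ranking, relevant))
print(reciprocal_rank(ranking, relevant), average_precision(ranking, relevant))
print(ndcg(ranking, gains))
```

With 50 or more topics, as recommended above, the per-topic scores would be passed through `mean` to obtain MRR and MAP for the system as a whole.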
  7. Browsing (Definition) • Browsing supports free to semi-guided exploration of collections - Object metadata allows for links between objects → clicking on a link shows all other objects that share that property - Exploration can also take place along other dimensions (e.g., temporal or geographical) - Taxonomies & ontologies can be used to link objects in different ways - Users can explore with or without a direct information need • Examples - Exploring digital cultural heritage spaces in PATHS (Hall et al., 2012) - Semantic portals for cultural heritage (Hyvönen, 2009)
  8. Example: Providing multiple paths through collection using PATHS
  9. Browsing (Evaluation practice) • What do we need? - System-based evaluation of performance is hard to do → browsing is the most user-focused of the four system types - If historical interaction logs are available, then these could be used to identify potential browsing ‘shortcuts’ • How do we evaluate? - Known-item evaluation → Shortest path lengths to randomly selected items can provide a hint about best possible outcome ‣ Needs to be complemented with user-based studies of actual browsing behavior! - ‘Novel’ information need → User-based evaluation is required to draw any meaningful conclusions (about satisfaction, effectiveness, and efficiency)
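The known-item evaluation mentioned on this slide can be approximated with a breadth-first search over the link graph of the browsing interface. The sketch below assumes the interface can be represented as a dictionary mapping each page to the pages it links to; the graph, the page names, and both function names are hypothetical. It yields only the best possible number of clicks, so it would still need to be complemented with studies of actual browsing behavior.

```python
import random
from collections import deque

def shortest_path_length(links, start, target):
    """Minimum number of clicks from start to target, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def mean_clicks_to_random_items(links, start, n_samples=100, seed=0):
    """Average shortest-path length from start to randomly sampled target pages."""
    rng = random.Random(seed)
    items = sorted(links)  # assumes every page appears as a key in links
    lengths = []
    for _ in range(n_samples):
        dist = shortest_path_length(links, start, rng.choice(items))
        if dist is not None:  # unreachable targets are skipped
            lengths.append(dist)
    return sum(lengths) / len(lengths) if lengths else None

# Hypothetical browsing structure: a home page, two facets, three objects.
links = {
    "home": ["paintings", "maps"],
    "paintings": ["p1", "p2"],
    "maps": ["m1"],
    "p1": ["p2"],
    "p2": [],
    "m1": [],
}
print(shortest_path_length(links, "home", "p2"))
print(mean_clicks_to_random_items(links, "home", n_samples=20))
```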
  10. Recommendation (Definition) • Recommender systems provide suggestions for new content - Non-personalized → “More like this” functionality - Personalized → Suggestions for new content based on past interactions ‣ System records implicit (or explicit) evidence of user interest (e.g., views, bookmarks, prints, ...) ‣ Find interesting, related content based on content-based and/or social similarity & generate a personalized ranking of the related content by training a model of the user and item space ‣ User’s role is passive: interactions are recorded & suggestions are pushed to the user • Examples - Personalized museum tours (Ardissono et al., 2012; Bohnert et al., 2008; De Gemmis et al., 2008; Wang et al., 2009)
  11. Example: Personalized museum tours using CHIP (Fig. 4: Screenshot of the CHIP Recommender)
  12. Recommendation (Evaluation practice) • What do we need? - User profiles for each user, containing a sufficiently large number (≥20) of user preferences (views, plays, bookmarks, prints, ratings, etc.) - Problematic in the start-up phase of a system, leading to the cold-start problem ‣ Possible solution → combining multiple algorithms to provide recommendations until we have collected enough information • How do we evaluate? - Backtesting (combination of information retrieval & machine learning evaluation) ‣ We hide a small number (e.g., 10) of a user’s preferences, train our algorithm and check whether we can successfully predict interest in the ‘missing’ items - Evaluation metrics are similar to search engine evaluation
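The backtesting procedure described on this slide can be sketched as follows. The recommender here is a deliberately simple co-occurrence scorer used as a stand-in, not the algorithm of any system mentioned in the talk; the profiles, the function names, and the use of recall over the hidden items as the score are all assumptions for illustration.

```python
import random
from collections import Counter

def split_profile(items, n_hidden, rng):
    """Randomly hide n_hidden items; return (kept, hidden) as sets."""
    items = list(items)
    rng.shuffle(items)
    return set(items[n_hidden:]), set(items[:n_hidden])

def recommend(train_profiles, user, n):
    """Stand-in recommender: score unseen items by overlap with similar users."""
    seen = train_profiles[user]
    scores = Counter()
    for other, profile in train_profiles.items():
        if other == user:
            continue
        overlap = len(seen & profile)
        if overlap:
            for item in sorted(profile - seen):
                scores[item] += overlap
    return [item for item, _ in scores.most_common(n)]

def backtest(profiles, n_hidden=1, n_recs=5, seed=0):
    """Hide items per user, recommend from the rest, return mean recall of hidden items."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, items in profiles.items():
        if len(items) > n_hidden:
            train[user], hidden[user] = split_profile(sorted(items), n_hidden, rng)
        else:
            train[user] = set(items)  # too small to evaluate; kept for training only
    recalls = []
    for user, held_out in hidden.items():
        recs = set(recommend(train, user, n_recs))
        recalls.append(len(recs & held_out) / len(held_out))
    return sum(recalls) / len(recalls) if recalls else 0.0

# Hypothetical interaction profiles (user -> set of viewed objects).
profiles = {
    "u1": {"a", "b", "c", "d"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c", "d", "e"},
}
print(backtest(profiles, n_hidden=1, n_recs=3))
```

Real profiles would be far larger (the slide suggests at least 20 preferences per user, with around 10 hidden), and the ranked metrics used for search evaluation could replace the plain recall used here.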
  13. Enrichment (Definition) • Enrichment covers all approaches that add extra layers of information to collection objects - Many different types of ‘added information’: entities, events, errors/corrections, geo-tagging, clustering, etc. - Typically use machine learning to predict additional information relevant for an object ‣ Supervised learning uses labeled examples to learn patterns ‣ Unsupervised learning attempts to find patterns without examples • Examples - Automatically correcting database entry errors (Van den Bosch et al., 2009) - Historical event detection in text (Cybulska & Vossen, 2011)
  14. Example: Automatic error correction in databases (Figure 3: Details of an animal specimen database entry)
  15. Example: Historical event extraction from text (Figure 1: Screenshot of object page in the Agora Event Browsing Demonstrator)
  16. Enrichment (Evaluation practice) • What do we need? - Most enrichment approaches use machine learning algorithms to predict which annotations to add to an object - Data set with a large number (>1000) of labeled examples, each of which contains features describing the object and the actual output label - Including humans in a feedback loop can reduce the number of examples needed for good performance, but results in a longer training phase • How do we evaluate? - Metrics from machine learning are commonly used ‣ Precision (what did we get right?) & Recall (what did we miss?) ‣ F-score (harmonic mean of Precision & Recall)
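For a predicted annotation, the three metrics on this slide can be computed by comparing predicted labels against the gold labels of a held-out set. The following is a small sketch under assumed names and toy labels; it computes the balanced F-score (F1), the harmonic mean of precision and recall, for a single label of interest.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one label, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical event-detection output for five sentences.
gold = ["event", "other", "event", "event", "other"]
predicted = ["event", "event", "other", "event", "other"]
print(precision_recall_f1(gold, predicted, positive="event"))
```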
  17. Challenges • Propagation of errors - Unlocking cultural heritage is inherently a multi-stage process ‣ Digitization → correction → enrichment → access - Errors will propagate and influence all subsequent stages → difficult to tease apart what caused errors at the later stage ‣ Only possible with additional manual labor! • Language - Historical spelling variants need to be detected and incorporated - Multilinguality → many collections contain content in multiple languages, which present problems for both algorithms and evaluation
  18. Challenges • Measuring system performance still requires user input! - Queries, relevance judgments, user preferences, pre-classified examples, ... • Different groups provide different input affecting the performance → how do we reach them and how do we strike a balance? - Experts ‣ Interviews, observation - Amateurs & enthusiasts ‣ Dedicated websites & online communities - General public ‣ Search logs
  19. Challenges • Scaling up from cases to databases - Can we scale up small-scale user-based evaluation to large-scale system-based evaluation? - Which evaluation aspects can we measure reliably? - How much should the human be in the loop? • No two cultural heritage systems are the same! - Means evaluation needs to be tailored to each situation (in collaboration with end users)
  20. Case study: Social Book Search • The Social Book Search track (2011-2015) is a search challenge focused on book search & discovery - Originally at INEX (2011-2014), now at CLEF (2015- ) • What do we need to investigate book search & discovery? - Collection of book records ‣ Amazon/LibraryThing collection containing 2.8 million book metadata records ‣ Mix of metadata from Amazon and LibraryThing ‣ Controlled metadata from Library of Congress (LoC) and British Library (BL) - Representative set of book requests & information needs - Relevance judgments (preferably graded)
  21. Challenge: Information needs & relevance judgments • Getting a large, varied & representative set of book information needs and relevance judgments is far from trivial! - Each method has its own pros and cons in terms of realism and size
      Method             | Information needs | Relevance judgments | Size
      Interview          | ✓                 | ✓                   | ✗
      Surveys            | ✓                 | ✗                   | ✓
      Search engine logs | ✗                 | ✗                   | ✓
      Web mining         | ✓                 | ✓                   | ✓
  22. Solution: Mining the LibraryThing fora • Book discussion fora contain discussions on many different topics - Analyses of single or related books - Author discussions & comparisons - Reading behavior discussions - Requests for new books to read & discover - Re-finding known but forgotten books • Example: LibraryThing fora
  23. Annotated LT topic
  24. Annotated LT topic: Group name, Topic title, Narrative, Recommended books
  25. Solution: Mining the LibraryThing fora • LibraryThing fora provided us with 10,000+ rich, realistic, representative information needs captured in discussion threads - Annotated 1000+ threads with additional aspects of the information needs - Graded relevance judgments based on ‣ Number of mentions by other LibraryThing users ‣ Interest by original requester
  26. Catalog additions: forum suggestions added after the topic was posted
  27. Not just true for the book domain!
  28. Relevance for designing CH information systems • Benefits - Better understanding of the needs of amateurs, enthusiasts, and the general public - Easy & cheap way of collecting many examples of information needs - Should not be seen as a substitute, but as an addition • Caveat - Example needs might not be available on the Web for every domain...
  29. Conclusions • Different types of systems require different evaluation approaches • Many challenges exist that can influence performance • Some of these challenges can be addressed by leveraging the power and the breadth of the Web • Want to hear more about what we can learn from the Social Book Search track? Come to our talk “Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?” in the “Extracting, Comparing and Creating Book and Journal Data” session (Wednesday, 10:30-12:00, Salon D)
  30. Questions? Comments? Suggestions?