Improving Collection Understanding in Web Archives

We propose using visualization of representative mementos to aid collection understanding of web archive collections, inspired by AlNoamany's work.

Published in: Internet

  1. 1. @shawnmjones @WebSciDL Improving Collection Understanding in Web Archives Shawn M. Jones Web Science and Digital Libraries Research Group Advisors: Michael L. Nelson and Michele C. Weigle Thanks to:
  2. 2. @shawnmjones @WebSciDL Researchers Create Their Own Web Archive Collections. Archived web pages, or mementos, are used by journalists, sociologists, and historians. Example collections: Tucson Shootings, 2008 Olympics, University of Utah
  3. 3. @shawnmjones @WebSciDL Web Archive Collections Have Many Versions of the Same Page. Examples: University of Utah Office of Admissions from the University of Utah Web Archive Collection (2013, 2015, 2018); Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection (2/12/2015, 3/5/2015, 4/1/2015)
  4. 4. @shawnmjones @WebSciDL Different Versions Allow Us to See an Unfolding News Story. Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown. Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
  5. 5. @shawnmjones @WebSciDL Different Versions Allow Us To See Changes In An Organization’s Web Presence 5 The White House: 2016 The White House: 2018
  6. 6. @shawnmjones @WebSciDL Archive-It Provides For Easy Collection Creation Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 6
  7. 7. @shawnmjones @WebSciDL The Problem of Collection Understanding What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use? 7
  8. 8. @shawnmjones @WebSciDL 31 Archive-It collections match the search query “human rights”. How are they different from each other? Which one is best for my needs?
  9. 9. @shawnmjones @WebSciDL Archive-It provides fields for metadata 9 Collection wide Metadata Metadata on Individual Seeds Dublin Core + Custom Fields
  10. 10. @shawnmjones @WebSciDL But, alas, the metadata does not help. Because metadata is optional, it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation. Examples: 9 seeds with metadata; 132,599 seeds with no metadata
  11. 11. @shawnmjones @WebSciDL But, alas, the metadata does not help. Because metadata is optional, it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation • it is inconsistently applied. This means that a user cannot reliably compare metadata fields to understand the differences between collections. Examples: 132,599 seeds with no metadata; 9 seeds with metadata. Paradox of metadata: more seeds = more effort
  12. 12. @shawnmjones @WebSciDL Reviewing mementos manually is costly. This collection has 132,599 seeds, many with multiple mementos. Some collections have 1000s of seeds. Each seed can have many mementos. In some cases, this can require reviewing 100,000+ documents to understand the collection.
  13. 13. @shawnmjones @WebSciDL More Archive-It collections are added every year. More than 8000 collections exist as of the end of 2016.
  14. 14. @shawnmjones @WebSciDL The problem, summarized • There are multiple collections about the same concept. • The metadata for each collection is non-existent, or inconsistently applied. • Many collections have 1000s of seeds with multiple mementos. • There are more than 8000 collections.
  15. 15. @shawnmjones @WebSciDL The problem, summarized • There are multiple collections about the same concept. • The metadata for each collection is non-existent, or inconsistently applied. • Many collections have 1000s of seeds with multiple mementos. • There are more than 8000 collections. • Human review of these mementos for collection understanding is an expensive proposition.
  16. 16. @shawnmjones @WebSciDL The proposal: a visualization made of representative mementos • Our visualization is a summary that will act like an abstract • Pirolli and Card’s Information Foraging Theory: maximize the value of the information gained from our summaries; minimize the cost of interacting with the collection; ensure that our representative mementos have good information scent, meaning they contain cues that the memento will address a user’s needs. From this: 318 seeds with 2421 mementos. To something like this: a visualization of ~28 social cards. Reference: Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  17. 17. @shawnmjones @WebSciDL Background and Related Work 17
  18. 18. @shawnmjones @WebSciDL Looking at Archive-It collections from the outside • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds • Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento-datetimes 18 Archive–It Collections
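The relationship between a seed, its TimeMap, and its mementos can be sketched in a few lines of Python. This is a minimal sketch, assuming a TimeMap serialized in the link format used by the Memento protocol, with the `rel` attribute appearing before `datetime` in each entry. The sample URIs and the helper name `parse_timemap` are hypothetical, and a regular expression stands in for a full link-format parser.

```python
import re
from datetime import datetime, timezone

# Hypothetical sample TimeMap in link format; the URIs are illustrative only.
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example.org/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example.org/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT",
<http://archive.example.org/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT"'''

# Matches one link entry that carries a rel attribute followed by a datetime attribute.
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*)"\s*;\s*datetime="([^"]+)"')


def parse_timemap(text):
    """Return (memento-datetime, memento URI) pairs sorted by capture time."""
    mementos = []
    for uri, rel, dt in ENTRY.findall(text):
        # Keep only entries whose relation list includes "memento".
        if "memento" in rel.split():
            when = datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when.replace(tzinfo=timezone.utc), uri))
    return sorted(mementos)


for when, uri in parse_timemap(SAMPLE_TIMEMAP):
    print(when.isoformat(), uri)
```

Sorting by memento-datetime gives the ordered list of captures for a seed, which is the starting point for any later selection of representative mementos.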
  19. 19. @shawnmjones @WebSciDL Document collections have aspects • Metadata on a publication: used as a surrogate for understanding; answers anticipated questions • Aspects: the central concepts of the corpus. For example, aspects about a disaster: time, place, cause, countermeasures • Aspects correspond to the questions that a user might have about a collection. Reference: Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.
  20. 20. @shawnmjones @WebSciDL How can we surface aspects? • Named Entity Recognition can answer questions of who or where • Natural Language Processing can answer questions of what time period • Topic modeling can surface general concepts from the corpus • And we have to be cognizant of these concepts over time. Examples: Archive-It Collection 8121: “The Obama White House”; Archive-It Collection 8513: “Donald J Trump White House”
  21. 21. @shawnmjones @WebSciDL Visualizing web resources (surrogates): Thumbnail (example from UK Web Archive); Text snippet (example from Bing); Social Card (example from Facebook); Text + Thumbnail (example from Internet Archive)
  22. 22. @shawnmjones @WebSciDL Which surrogate is best for web resources? Li (2008): social cards > text snippets in performance. Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance. Woodruff (2001): thumbnails > text snippets in performance. Teevan (2009): text snippets > thumbnails in performance. Aula (2010): text snippets ~= thumbnails in performance. Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference. Capra (2013): social cards > text snippets in performance (barely statistically significant). Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance. https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  23. 23. @shawnmjones @WebSciDL Which surrogate is best for web resources? Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding. Li (2008): social cards > text snippets in performance. Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance. Woodruff (2001): thumbnails > text snippets in performance. Teevan (2009): text snippets > thumbnails in performance. Aula (2010): text snippets ~= thumbnails in performance. Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference. Capra (2013): social cards > text snippets in performance (barely statistically significant). Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance. https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  24. 24. @shawnmjones @WebSciDL Visualizing Archive-It Collections. Other attempts at visualizing Archive-It collections tried to visualize everything. Reference: Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at Archive-It. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821 http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  25. 25. @shawnmjones @WebSciDL Prior work by AlNoamany • Visualized summaries via the storytelling platform Storify • Proved that test participants could not detect the difference between her automated summaries and human-generated summaries. Process, combining characteristics of human-generated stories and characteristics of Archive-It collections: exclude duplicates; exclude off-topic pages; exclude non-English language pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize. Reference: Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508 http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
  26. 26. @shawnmjones @WebSciDL Prior work by AlNoamany • Visualized summaries via the storytelling platform Storify, which is no longer in service • Proved that test participants could not detect the difference between her automated summaries and human-generated summaries. Process, combining characteristics of human-generated stories and characteristics of Archive-It collections: exclude duplicates; exclude off-topic pages; exclude non-English language pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize. http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
  27. 27. @shawnmjones @WebSciDL Prior work by AlNoamany • Visualized summaries via the storytelling platform Storify, which is no longer in service • Proved that test participants could not detect the difference between her automated summaries and human-generated summaries • Did not evaluate if the resulting summaries were effective tools for collection understanding • Focused on summarizing collections about events • There are other types of Archive-It collections. Process, combining characteristics of human-generated stories and characteristics of Archive-It collections: exclude duplicates; exclude off-topic pages; exclude non-English language pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize.
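The order of the steps listed on this slide can be illustrated with a short Python sketch. It is not AlNoamany's implementation: every field on the hypothetical `Memento` class is a precomputed stand-in for a real detector or scorer, and the slicing here simply cuts the time-ordered collection into roughly equal parts, which is an assumption made for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass
class Memento:
    uri: str
    captured: datetime
    content_hash: str  # stand-in for duplicate detection
    on_topic: bool     # stand-in for an off-topic detector's verdict
    language: str      # stand-in for language detection
    cluster: str       # stand-in for a clustering algorithm's label
    quality: float     # stand-in for a page-quality score


def summarize(mementos, slice_count=2):
    """Apply the listed steps in order and return representative mementos."""
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m.captured):
        # Exclude duplicates, off-topic pages, and non-English pages.
        if m.content_hash in seen or not m.on_topic or m.language != "en":
            continue
        seen.add(m.content_hash)
        kept.append(m)
    if not kept:
        return []
    # Slice the time-ordered collection into roughly equal parts.
    size = -(-len(kept) // slice_count)  # ceiling division
    slices = [kept[i:i + size] for i in range(0, len(kept), size)]
    chosen = []
    for part in slices:
        # Group the pages in each slice by cluster, then keep the best page per cluster.
        by_cluster = sorted(part, key=lambda m: m.cluster)
        for _, group in groupby(by_cluster, key=lambda m: m.cluster):
            chosen.append(max(group, key=lambda m: m.quality))
    # Order the selected pages by time, ready for visualization.
    return sorted(chosen, key=lambda m: m.captured)
```

Keeping each step behind a simple field makes it easy to swap in a different detector or scoring method for a given type of collection without changing the overall flow.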
  28. 28. @shawnmjones @WebSciDL Preliminary Work 28
  29. 29. @shawnmjones @WebSciDL Growth curves for understanding collection creation behavior • Skew of the collection’s holdings • Indicates temporality of collection • Skew of the curatorial involvement with the collection • When seeds were added • When interest was lost or regained. [Figure: example growth curves labeled with positive and negative skew] Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
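One way to turn a growth curve into a single skew number is to compare the area under the curve (AUC) with the area under the diagonal. The sketch below is an illustration under stated assumptions, not the paper's code: the curve plots the fraction of the collection's lifespan elapsed against the fraction of events (seeds added or mementos captured) seen so far, points are joined by straight lines, and the function names are hypothetical. It also assumes `end` is later than `start`.

```python
from datetime import datetime


def growth_curve(event_times, start, end):
    """Return (x, y) points: fraction of lifespan elapsed vs. fraction of events seen."""
    span = (end - start).total_seconds()
    times = sorted(event_times)
    points = [(0.0, 0.0)]
    for i, t in enumerate(times, start=1):
        points.append(((t - start).total_seconds() / span, i / len(times)))
    points.append((1.0, 1.0))
    return points


def auc_minus_diagonal(points):
    """Trapezoidal area under the curve minus 0.5, the area under the diagonal."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area - 0.5


start, end = datetime(2016, 1, 1), datetime(2016, 1, 11)
early = [datetime(2016, 1, 2)] * 3   # all seeds added near the start
late = [datetime(2016, 1, 10)] * 3   # all seeds added near the end
print(auc_minus_diagonal(growth_curve(early, start, end)))
print(auc_minus_diagonal(growth_curve(late, start, end)))
```

In this sketch a curve that rises early sits above the diagonal and yields a positive value, while a curve that rises late yields a negative one.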
  30. 30. @shawnmjones @WebSciDL Structural features of Archive-It collections • difference between seed curve AUC and diagonal • difference between seed memento curve AUC and diagonal • difference between seed memento curve AUC and seed curve AUC • number of seeds • number of mementos • seed URI domain diversity • seed URI path depth diversity • most frequent seed URI path depth • % query string usage in seed URIs • lifespan of collection. Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
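Several of the URI-based features in this list can be computed directly from the seed list. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and "diversity" is assumed here to mean the ratio of unique values to the number of seeds, which may differ from the definition used in the cited paper.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Count the non-empty path segments of a URI."""
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


def seed_features(seed_uris):
    """Compute a few structural features for a non-empty list of seed URIs."""
    n = len(seed_uris)
    domains = {urlsplit(u).netloc.lower() for u in seed_uris}
    depths = [path_depth(u) for u in seed_uris]
    with_query = sum(1 for u in seed_uris if urlsplit(u).query)
    return {
        "number_of_seeds": n,
        # Assumption: diversity is the ratio of unique values to seeds.
        "domain_diversity": len(domains) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "percent_query_string_usage": 100.0 * with_query / n,
    }


seeds = [
    "http://example.com/",
    "http://example.com/a/b",
    "http://example.org/a/b?x=1",
    "http://example.org/c/d",
]
print(seed_features(seeds))
```

Features like these need no page content at all, only the seed URIs, which keeps them cheap to compute across thousands of collections.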
  31. 31. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous 31 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  32. 32. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based Time Bounded – Expected Time Bounded – Spontaneous 32 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  33. 33. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected Time Bounded – Spontaneous 33 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  34. 34. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 34 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  35. 35. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections 35 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  36. 36. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany 36 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  37. 37. @shawnmjones @WebSciDL Semantic categories of Archive-It collections: Self-Archiving (54.1% of collections); Subject-based (27.6% of collections); Time Bounded – Expected (14.1% of collections); Time Bounded – Spontaneous (4.2% of collections). Some evaluated by AlNoamany. Using the structural features listed earlier, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720. Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  38. 38. @shawnmjones @WebSciDL Research Plan 38
  39. 39. @shawnmjones @WebSciDL Developing a Flexible Framework: Dark and Stormy Archives (DSA) 2.0, a framework based on AlNoamany’s work. Diagram labels: Web Archive Collection, Off-Topic Memento Toolkit, Representative Memento Selection Utilities, Archive-It Utilities, MementoEmbed, DSA Visualization Interface, Visualized Summary. Two concepts are embodied in this framework: 1. Selecting representative mementos 2. Visualizing those mementos. Reference: Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
  40. 40. @shawnmjones @WebSciDL Not just Archive-It. Our methods will be applicable to any web archive collection, like those developed by Rhizome’s Webrecorder.
  41. 41. @shawnmjones @WebSciDL Evaluation 41
  42. 42. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 42
  43. 43. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 2. Develop user tasks for each collection 43 Who is X? Where is Y? When does Z take place?
  44. 44. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 2. Develop user tasks for each collection 3. How well do users complete the tasks? 44 Who is X? Where is Y? When does Z take place?
  45. 45. @shawnmjones @WebSciDL RQ1: How do we select representative mementos for the different semantic types of collections? • Summarizing a collection involves: 1. Grouping the mementos by their commonalities 2. Selecting the highest quality mementos from each group • Different semantic categories may require different algorithms • We want to reuse existing tools where possible: Stanford NLP, Archives Unleashed Toolkit, gensim, SpaCy
  46. 46. @shawnmjones @WebSciDL RQ1 Evaluation 1. How many user tasks were addressed by the mementos chosen? How many user tasks failed? 2. How many mementos produced are not useful for any user task? 3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type? 4. What is the mean minimum number of mementos necessary to address the most user tasks? 46 Archive–It Collections Summarize with Aspects
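The first two evaluation questions reduce to simple set arithmetic once each selected memento has been judged against the user tasks. The sketch below assumes such judgments already exist as a mapping from memento URI to the set of task identifiers it addresses. The function name, the task identifiers, and the data layout are all hypothetical.

```python
def evaluate_summary(tasks, addressed_by):
    """Score a summary.

    tasks: iterable of task ids developed for the collection.
    addressed_by: dict mapping memento URI -> set of task ids it addresses.
    """
    tasks = set(tasks)
    covered = set()
    for answered in addressed_by.values():
        covered |= answered & tasks
    # Mementos in the summary that address none of the tasks.
    unused = [uri for uri, answered in addressed_by.items() if not (answered & tasks)]
    return {
        "tasks_addressed": len(covered),
        "tasks_failed": len(tasks - covered),
        "mementos_not_useful": len(unused),
    }


judgments = {
    "memento-1": {"who"},
    "memento-2": {"who", "where"},
    "memento-3": set(),
}
print(evaluate_summary(["who", "where", "when"], judgments))
```

Averaging these counts over several collections of the same semantic category would give per-algorithm means that can be compared across selection algorithms.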
  47. 47. @shawnmjones @WebSciDL RQ2: What visualizations (surrogates) work best for understanding individual mementos? • There are many different possibilities for surrogates • Does the choice of surrogate change depending on the collection’s semantic category?
  48. 48. @shawnmjones @WebSciDL RQ2 Evaluation 1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better? 2. Do different surrogates work better for different semantic categories? 3. For social cards, which elements of the social card need to be present to understand the underlying memento? 4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding? 48 Visualize MementosArchive–It Collections Summarize with Aspects Evaluated via:
  49. 49. @shawnmjones @WebSciDL RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding? • Once we have candidate summarization algorithms and evaluated surrogates for individual mementos, we can then evaluate the combination of summarization and visualization. • There are many options: arranging surrogates, headings, metadata. Progression: RQ1: Summarization Algorithms; RQ2: Visualization Elements; RQ3: Visualization of Summary
  50. 50. @shawnmjones @WebSciDL RQ3 Evaluation 1. How many user tasks are addressed by the visualization chosen? How many fail? 2. How many visualized mementos were not needed for any given user task? 3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization? 4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task? 5. Which visualizations work better for certain semantic types? 50 Visualize MementosArchive–It Collections Summarize with Aspects Visualize Summary Evaluated via:
  51. 51. @shawnmjones @WebSciDL Research Plan. Timeline: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020. Tasks: Preliminary work; Implement a flexible framework; Addressing RQ1: Develop new algorithms for selecting representative mementos; Addressing RQ2: Evaluation of individual memento visualizations; Dissertation Candidacy Exam; Addressing RQ1: Evaluation of algorithms for selecting representative mementos; Addressing RQ3: Develop candidate visualizations of groups of mementos; Addressing RQ3: Evaluation of visualization of groups of mementos; Dissertation Composition; Dissertation Defense. Venues: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021
  52. 52. @shawnmjones @WebSciDL Conclusion 52
  53. 53. @shawnmjones @WebSciDL Summary • Collection understanding is a problem with web archive collections: inconsistent metadata; 1000s of mementos; 1000s of collections; costly for human review • We intend to produce a visualization that serves as an abstract to assist in collection understanding • Prior work in this area: did not evaluate how well this method works for collection understanding; only focused on collections about events
  54. 54. @shawnmjones @WebSciDL Contributions • Existing work: semantic categories of web archive collections in Archive-It; categories can be predicted by using structural features; most collections are not about events • Future work: investigate new ways of surfacing representative mementos; contribute knowledge of collection understanding in web archive collections; which visualization methods work best for understanding mementos in a collection; new algorithms for use in collection understanding
  55. 55. @shawnmjones @WebSciDL Contributions • Existing work: semantic categories of web archive collections in Archive-It; categories can be predicted by using structural features; most collections are not about events • Future work: investigate new ways of surfacing representative mementos; contribute knowledge of collection understanding in web archive collections; which visualization methods work best for understanding mementos in a collection; new algorithms for use in collection understanding. Thanks:
