Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Social Cards Probably Provide For Better Understanding Of Web Archive Collections


Published on

Presented at ACM CIKM 2019. Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page. Search engines and social media have a different focus, and hence produce different surrogates than web archives. Search engine surrogates help a user answer the question "Will this link meet my information need?" Social media surrogates help a user decide "Should I click on this?" Our use case is subtly different. We hypothesize that groups of surrogates together are useful for summarizing a collection. We want to help users answer the question of "What does the underlying collection contain?" But which surrogate should we use? With Mechanical Turk participants, we evaluate six different surrogate types against each other. We find that the type of surrogate does not influence the time to complete the task we presented the participants. Of particular interest are social cards, surrogates typically found on social media, and browser thumbnails, screen captures of web pages rendered in a browser. At p=0.0569, and p=0.0770, respectively, we find that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform. We measure user interactions with each surrogate and find that users interact with social cards less than other types. The results of this study have implications for our web archive summarization work, live web curation platforms, social media, and more.

Published in: Science
  • Login to see the comments

  • Be the first to like this

Social Cards Probably Provide For Better Understanding Of Web Archive Collections

  1. 1. Shawn M. Jones Michele C. Weigle Michael L. Nelson @shawnmjones @weiglemc @phonedude_mln Social Cards Probably Provide For Better Understanding Of Web Archive Collections Old Dominion University Web Science and Digital Libraries Research Group @WebSciDL Thanks to:
  2. 2. @shawnmjones @WebSciDL Curators Build Web Archive Collections 2 Archived web pages, or mementos, are used by journalists, sociologists, and historians. Tucson Shootings2008 OlympicsUniversity of Utah
  3. 3. @shawnmjones @WebSciDL Web archive collections consist of mementos – different versions of the same page over time 3 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015
  4. 4. @shawnmjones @WebSciDL Archive-It allows curators to easily create web archive collections Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 4
  5. 5. @shawnmjones @WebSciDL … and these collections are used by other researchers 5 The collection curator is not the only user of the collection! These collections live a life after their curator has stopped adding to them.
  6. 6. @shawnmjones @WebSciDL The problem…  There are multiple collections about the same concept.  The metadata for each collection is non-existent, or inconsistently applied.  Many collections have 1000s of seeds with multiple mementos.  There are more than 8000 collections.  Human review of these mementos for collection understanding is an expensive proposition. 6
  7. 7. @shawnmjones @WebSciDL Collections provide web pages based on a theme – we can summarize those collections by using the best mementos supporting that theme… 7 Web sites may group some content, but curators theme some of this content into collections which we can reduce to stories. Our stories consist of ~28 representative mementos, making them much smaller than the collections from which they come.
  8. 8. @shawnmjones @WebSciDL Surrogates provide a visual summary of the content behind a URI… 8,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35 .3644614,- 109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1 s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36 .8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1 d-106.287162!2d35.8440582 Long URI: The same URI represented by a browser thumbnail surrogate: The same URI represented by a social card surrogate:
  9. 9. @shawnmjones @WebSciDL Social media storytelling uses surrogates to provide a “summary of summaries” 9 2 resources are shown from this Wakelet story6 resources are shown from this Storify story Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
  10. 10. @shawnmjones @WebSciDL We want to help the user answer the question of “What does the underlying collection contain?” 10 There are many types of surrogates. Which one best conveys the concepts of the underlying collection? If we already have the mementos describing a collection, we need to visualize them with some type of surrogate.
  11. 11. @shawnmjones @WebSciDL Storytelling implemented 11 From this: 31,863 seed mementos of 955 seeds, potentially 31K+ documents to review To this: a story built from a representative sample of ~30 mementos visualized as surrogates
  12. 12. @shawnmjones @WebSciDL Archive-It provides fields for metadata 12 Collection-wide metadata Metadata on individual seeds Dublin Core + Custom Fields
  13. 13. @shawnmjones @WebSciDL But, alas the metadata does not help… 13 A variety of metadata fields can be used for surrogates, with the most popular field being Title. The metadata on seeds and collections is optional. As the number of seeds increases, the less metadata is present per seed.Even though this is the case, 54.60% of Archive-It surrogates consist of only the seed URL and capture dates.
  14. 14. @shawnmjones @WebSciDL Understanding Archive-It Collections… Manually 14 Step 1: Decide upon search terms and find a collection via the search engine Step 0: Decide on your information need Step 2: Select a collection from the many results available
  15. 15. @shawnmjones @WebSciDL Understanding Archive-It Collections… Manually 15 Step 3: View the seed-centric collection page Step 4: Choose a seed from this list that may meet the information need. This collection contains 1,149 seeds. How do you choose one without metadata to guide you?
  16. 16. @shawnmjones @WebSciDL Understanding Archive-It Collections… Manually 16 Step 5: View the mementos associated with that seed. This seed has 923 seed mementos. There are more mementos linked from these mementos Step 6: Read the text of the memento to learn about its contents.
  17. 17. @shawnmjones @WebSciDL Understanding Archive-It Collections… Manually 17 Step 7: Follow links and review more content until you reach a page that was not archived. Step 8: Repeat steps 4-7 until enough information about has been amassed to determine if the collection meets your information need. This collection contained 80,484 seed mementos.
  18. 18. @shawnmjones @WebSciDL In spite of the lack of information in Archive-It surrogates, some information can be gleaned from the URI… 18 Thus, surrogates like these may still yield enough information for collection understanding.
  19. 19. @shawnmjones @WebSciDL Existing surrogate services create a confusing experience for mementos 19 Who published these resources? Archive-It? CNN? Is the story author sharing fake news? S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-, 2018. social card social card
  20. 20. @shawnmjones @WebSciDL Neither social media services nor card services were reliable for storytelling, so we created MementoEmbed… 20 Information in the MementoEmbed social card is separated to avoid issues of confusion about attribution. MementoEmbed is archive-aware. It can locate information about the memento that is not available in other cards. S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-, 2018.
  21. 21. @shawnmjones @WebSciDL We reused 4 stories generated by human Archive-It curators from their own collections… 21
  22. 22. @shawnmjones @WebSciDL We shared these stories, rendered using 6 different surrogate types, with MT participants… 22 Archive-It like Social Card Browser thumbnails Social Card With Thumbnail as Image (sc/t) Social Card With Thumbnail to Right (sc+t) Social Card with Thumbnail on Hover (sc^t) • 4 stories of 15-17 mementos selected by human Archive-It curators from their collections • 6 different surrogate types • Social cards and thumbnails were produced by MementoEmbed • 24 different story-surrogate combinations • 120 MT participants • They were given 30 seconds to view each story
  23. 23. @shawnmjones @WebSciDL And then we asked them which of 2 of 6 mementos come from the same collection… 23 • Each participant was shown a list of 6 surrogates of the same type as the story they just viewed. • They were asked to choose the 2 that they thought came from the same collection. • They were given as much time as they wished to answer the question. • This is similar to the Sentence Verification Task from reading comprehension studies.
  24. 24. @shawnmjones @WebSciDL Response times per surrogate had interesting means, but p-values were not statistically significant at p < 0.05 24 p = 0.190 p = 0.202
  25. 25. @shawnmjones @WebSciDL Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate 25 0 0.5 1 1.5 2 2.5 Archive-It Facsimile Browser Thumbnails Social Cards sc+t sc/t sc^t Correct Answers Per Surrogate Median Mean p = 0.0569 p = 0.0770
  26. 26. @shawnmjones @WebSciDL Whenever thumbnails are present, more users interact with them 26 We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted with a thumbnail, regardless of surrogate. For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate.
  27. 27. @shawnmjones @WebSciDL Conclusions  54.60% of Archive-It surrogates have only a URL and capture dates  in spite of this, some information could still be gleaned from the URL  Results from our MT study with 120 participants:  response times were not statistically significant at p < 0.05  for correct answers: social cards probably outperform the existing Archive-It surrogate at p = 0.0569  when thumbnails are present, more participants interacted with them  when thumbnails are present, more participants click through to view the page underneath rather than relying upon the surrogate  Conclusions:  thumbnails encourage more interaction, specifically clicking through to the underlying page, than social cards  social cards outperform the existing status quo Archive-It surrogate in terms of correct answers at p = 0.0569  social cards probably provide for better understanding of web archive collections  For more information, see 27