Summarizing archival collections using storytelling techniques

Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson

Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14



  1. Summarizing archival collections using storytelling techniques
  2. Archive-It, a subscription-based service, allows the creation of collections: > 3,000 collections, > 340 institutions, > 10B archived pages
  3. Annotated Archive-It collection page: collection title; collection categorization based on the curator; seed URI; metadata about the collection; text search box; the group that the resource belongs to; list of the seed URIs; timespan of the resource and the number of times it has been captured
  4. Collection understanding and collection summarization are not currently supported. It is not easy to answer “what’s in that collection?” or “how is this collection different from others?”
  5. There is more than one collection about the “Egyptian Revolution”:
     • “2010-2011 Arab Spring” https://archive-it.org/collections/3101
     • “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
     • “Egypt Revolution and Politics” https://archive-it.org/collections/2358
  6. One of at least seven Human Rights collections…
  7. (no transcript text)
  8. (no transcript text)
  9. Our early attempts at collection understanding tried to include everything… “Visualizing digital collections at Archive-It”, JCDL 2012. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  10. 1000s of seeds × 1000s of archived pages == conventional visualization methods are not applicable
  11. Idea: Storytelling
  12. Stories in literature. Story elements: setting, characters, sequence, exposition, conflict, climax, resolution. “Once upon a time” http://www.learner.org/interactives/story/
  13. Stories in social media: “It's hard to define a story, but I know it when I see it” (Alexander, 2008). Basically, just arranging web pages in time.
  14. “Storytelling” is becoming a popular technique in social media
  15. What are the limitations of storytelling services?
  16. The Egyptian Revolution on Storify
  17. Bookmarking, not preserving!
  18. Despite these limitations, how do we combine storytelling & archives?
  19. Use an interface people already know how to use to summarize collections: archived collections + storytelling services → archived enriched stories
  20. We sample k mementos from N pages of the collection (k << N) to create a summary story. [Figure: stories S1, S2, S3, … sampled from Collections X, Y, and Z]
  21. Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof:
     • https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
     • https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
  22. How do we generate this automatically?
  23. Collections have two dimensions: {Fixed, Sliding} × {Page, Time}. [Figure: axes URI and Time, with captures at t1, t2, …, tk]
  24. Fixed Page, Fixed Time: the same URI requested with a desktop Chrome user-agent and with an Android Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
     • Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
     • Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
  25. Fixed Page, Sliding Time: Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, Feb 11
  26. Sliding Page, Fixed Time: Feb. 11, 2011, Mubarak resigns
  27. Sliding Page, Sliding Time: Jan 27, Jan 31, Feb 7, Feb 4, Feb 11, Feb 11, Feb 2, Jan 25, Feb 10
  28. The Dark and Stormy Archives (DSA) framework. https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
     • Stages: establish a baseline; reduce the candidate pool of archived pages; select good representative pages
     • Steps, in order: characteristics of human-generated stories; characteristics of Archive-It collections; exclude duplicates; exclude off-topic pages; exclude non-English pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize
  29. Establish a baseline of social media stories. “Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
  30. What is the length of a story (the number of resources per story)? This story has 31 resources.
  31. What are the types of resources that compose a story? This story has 19 quotes, 8 images, and 4 videos.
  32. What are the most frequently used domains? This story has 90% twitter.com, 7% instagram.com, and 3% facebook.com.
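The per-story domain breakdown on slide 32 can be sketched as a simple count over the hosts of a story's resources. This is a minimal illustration, not the authors' code: the function name `domain_shares`, the sample URLs, and the choice to fold a leading `www.` into the bare domain are all assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_shares(resource_urls):
    """Return {domain: fraction of the story's resources hosted there}."""
    domains = []
    for url in resource_urls:
        host = urlparse(url).netloc.lower()
        # Assumption: treat "www.example.com" and "example.com" as one domain.
        if host.startswith("www."):
            host = host[4:]
        domains.append(host)
    if not domains:
        return {}
    counts = Counter(domains)
    return {domain: count / len(domains) for domain, count in counts.items()}


# Hypothetical story with four resources (URLs are illustrative only).
story = [
    "https://twitter.com/user_a/status/1",
    "https://twitter.com/user_b/status/2",
    "https://www.twitter.com/user_c/status/3",
    "https://instagram.com/p/example",
]
shares = domain_shares(story)
print(shares)
```

Aggregating the same counts over many stories would give a ranking of domains like the top-25 list discussed on the following slides.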
  33. The top 25 domains represent 92% of all domains
  34. What differentiates a popular story? (popular = stories in the top 25% by views) Examples: 19,795 views vs. 64 views
  35. The distributions for the features of the stories
     • Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories differ in most of the features
     • Popular stories tend to have more web elements (medians of 28 vs. 21) and a longer timespan (5 hours vs. 2 hours) than the unpopular stories
  36. Do popular stories have a lower decay rate? The 75th percentile of the decay rate per popular story is 10% of the resources, while it is 15% for the unpopular stories
  37. We found that 28 mementos, the median size of the popular stories, is a good number for the resources in the stories.
  38. Establish a baseline of current Archive-It collections. “Characteristics of Social Media Stories. What makes a good story?”, International Journal on Digital Libraries 2016.
  39. The mean and median number of URIs in a collection. This collection has 435 seed URIs.
  40. The mean and median number of mementos per URI. This seed URI has 16 mementos.
  41. The most frequently used domains. This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com. (Figure labels: abcnews.go.com, blogspot.com)
  42. The Archive-It top 25 is fundamentally different from the Storify top 25
  43. The Archive-It top 25 is fundamentally different from the Storify top 25: Twitter is #10, not #1
  44. What we archive and what we share on social media are different subsets of the web (seeds != shares). See also: Brunelle et al., “The impact of JavaScript on archivability”, IJDL 2015
  45. Detecting off-topic pages. “Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
  46. Archive-It provides their partners with tools that allow them to build themed collections
  47. Archive-It tools are about HTTP events / mechanics, not “content”
  48. These tools won’t detect that > 60% of mementos of hamdeensabahy.com are off-topic. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
     • May 13, 2012: The page started as on-topic.
     • May 24, 2012: Off-topic due to a database error.
     • Mar. 21, 2013: Not working because of financial problems.
     • May 21, 2013: On-topic again.
     • June 5, 2014: The site has been hacked.
     • Oct. 10, 2014: The domain has expired.
  49. How do we automatically detect off-topic pages?
  50. Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity
     Method            Similarity
     cosine            0.7
     TF-Intersection   0.6
     Jaccard           0.5
  51. Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity
     Method            Similarity (example 1)   Similarity (example 2)
     cosine            0.7                      0.0
     TF-Intersection   0.6                      0.0
     Jaccard           0.5                      0.0
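The three textual measures named on slides 50 and 51 can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the regex tokenizer, the top-k cutoff `k=20` for TF-Intersection, its normalization by the number of reference terms, and the idea of comparing each capture against an earlier reference capture are choices made here, not details given on the slides.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(reference, candidate):
    """Cosine similarity between raw term-frequency vectors."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    dot = sum(ref[t] * cand[t] for t in ref.keys() & cand.keys())
    norm = math.sqrt(sum(v * v for v in ref.values())) * math.sqrt(
        sum(v * v for v in cand.values())
    )
    return dot / norm if norm else 0.0


def jaccard(reference, candidate):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    ref, cand = set(tokens(reference)), set(tokens(candidate))
    union = ref | cand
    return len(ref & cand) / len(union) if union else 0.0


def tf_intersection(reference, candidate, k=20):
    """Fraction of the reference's k most frequent terms also in the candidate's top k."""
    top_ref = {t for t, _ in Counter(tokens(reference)).most_common(k)}
    top_cand = {t for t, _ in Counter(tokens(candidate)).most_common(k)}
    return len(top_ref & top_cand) / len(top_ref) if top_ref else 0.0


# Hypothetical page texts, for illustration only.
reference = "Egypt revolution protest in Tahrir Square"
error_page = "Database connection error"
print(cosine(reference, error_page))   # 0.0, no shared terms
print(jaccard(reference, error_page))  # 0.0, no shared terms
```

A capture that shares no vocabulary with the reference scores 0.0 on all three measures, which matches the all-zero column in the second example.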
  52. Semantics of the text: Web-based kernel function using the search engine (SE). Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006
  53. Semantics of the text: Web-based kernel function using the search engine (SE). Example: SE-Kernel similarity = 0.7. Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006
  54. Structural methods: number of words, content-length. Example: word count goes from 100 to 109, so the WordCount change is 0.09 (expressed as a fraction, i.e., +9%)
  55. Structural methods: number of words, content-length
     Word count (before → after)   WordCount change (fraction)
     100 → 109                     0.09
     100 → 5                       -0.95
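The WordCount figures on slides 54 and 55 follow from a relative-change calculation: (109 − 100) / 100 = 0.09 and (5 − 100) / 100 = −0.95. A minimal sketch, assuming whitespace tokenization and an earlier capture as the reference (the function name and test strings are illustrative):

```python
def word_count_change(reference_text, later_text):
    """Relative change in word count from the reference capture to a later one."""
    reference_words = len(reference_text.split())
    later_words = len(later_text.split())
    if reference_words == 0:
        raise ValueError("reference capture has no words")
    return (later_words - reference_words) / reference_words


# Synthetic texts reproducing the word counts shown on the slides.
reference = " ".join(["word"] * 100)
grown = " ".join(["word"] * 109)
shrunk = " ".join(["word"] * 5)
print(word_count_change(reference, grown))   # 0.09
print(word_count_change(reference, shrunk))  # -0.95
```

A sharp negative value, as in the second case, signals a page that has lost most of its text, for example an error page or an expired domain.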
  56. We built a gold standard data set to evaluate the methods
  57. We manually labeled 15,760 mementos
     Collection                               URI-Rs   URI-Ms   Off-topic URI-Ms
     Egypt Revolution and Politics            136      6,886    384
     Occupy Movement                          255      6,570    458
     Columbia Univ. Human Rights collection   198      2,304    94
  58. Evaluated 6 methods + combos at 21 thresholds. The results at each threshold were averaged over the three gold standard collections.
     Similarity measure     Threshold       FP   FN    FP+FN   ACC     F1      AUC
     (Cosine,WordCount)     (0.10,-0.85)    24   10    34      0.987   0.906   0.968
     (Cosine,SEKernel)      (0.10,0.00)     6    35    40      0.990   0.901   0.934
     Cosine                 0.15            31   22    53      0.983   0.881   0.961
     (WordCount,SEKernel)   (-0.80,0.00)    14   27    42      0.985   0.818   0.885
     WordCount              -0.85           6    44    50      0.982   0.806   0.870
     SEKernel               0.05            64   83    147     0.965   0.683   0.865
     Bytes                  -0.65           28   133   161     0.962   0.584   0.746
     Jaccard                0.05            74   86    159     0.962   0.538   0.809
     TF-Intersection        0.00            49   104   153     0.967   0.537   0.740
  59. (Cosine,WordCount) with (0.10,-0.85) thresholds: average precision of 0.89 on 18 different Archive-It collections
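Applying the best-performing pair from slide 58 can be sketched as a threshold check. The slides give the thresholds (0.10, −0.85) but do not spell out how the two measures are combined; the sketch below assumes a capture is flagged when either score falls below its threshold. The memento identifiers and scores are made up for illustration.

```python
COSINE_THRESHOLD = 0.10       # from slide 58
WORD_COUNT_THRESHOLD = -0.85  # from slide 58


def is_off_topic(cosine_score, word_count_change):
    """Assumed rule: off-topic if EITHER measure is below its threshold."""
    return (
        cosine_score < COSINE_THRESHOLD
        or word_count_change < WORD_COUNT_THRESHOLD
    )


def keep_on_topic(scored_mementos):
    """Filter (identifier, cosine, word-count change) triples to on-topic captures."""
    return [
        identifier
        for identifier, cosine_score, change in scored_mementos
        if not is_off_topic(cosine_score, change)
    ]


# Hypothetical scores for three captures of one page.
scored = [
    ("memento-a", 0.70, 0.09),   # similar text, similar length
    ("memento-b", 0.00, -0.95),  # fails both checks
    ("memento-c", 0.40, -0.90),  # similar vocabulary but most text is gone
]
kept = keep_on_topic(scored)
print(kept)
```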
  60. How do we dynamically divide the collections into appropriate slices? (in other words, how do we pick just 28?)
  61. We expected most collections to look like this… The Global Food Crisis collection at Archive-It
  62. This is what we found. [Figure panels: Egypt Revolution and Politics; Human Rights; April 16 Archive; Virginia Tech Shooting; Jasmine Revolution 2011; Wikileaks Document Release]
  63. Selecting representative pages for generating stories (skipping clustering details, but the goal is k=28)
  64. Quality metrics for selecting mementos
     • In the DSA, memento quality Mq is calculated as follows: Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
     • Dm is the memento damage (Brunelle, JCDL 2014)
     • Sql is the snippet quality based on the URI level
     • Sqc is the snippet quality based on the URI category
     • wm, wql, wqc are the weights of memento damage, level, and category
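The quality formula on slide 64 translates directly into a small function. The formula itself is from the slide; the weight values, the score ranges, and the two candidate mementos below are hypothetical, since the slides do not give them.

```python
def memento_quality(damage, level_quality, category_quality,
                    w_damage=1.0, w_level=1.0, w_category=1.0):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc, as written on slide 64.

    The default weights of 1.0 are placeholders, not values from the slides.
    """
    return (
        (1 - w_damage * damage)
        + w_level * level_quality
        + w_category * category_quality
    )


# Hypothetical candidates from one cluster: (Dm, Sql, Sqc).
candidates = {
    "deep-article": (0.1, 1.0, 1.0),
    "damaged-homepage": (0.6, 0.0, 1.0),
}
best = max(candidates, key=lambda name: memento_quality(*candidates[name]))
print(best)  # deep-article
```

Because damage enters with a negative sign, a lower Dm raises Mq, while higher snippet scores raise it directly.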
  65. We prefer a higher-quality memento (lower damage, Dm). Brunelle et al., “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
     • http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
     • http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
  66. We prefer pages with attractive snippets
     • https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
     • https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
  67. We prefer deep links over high-level domains (Sql)
     • Feb. 11, 2011: the homepage of BBC on Storify: https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
     • Feb. 11, 2011: the homepage of the BBC Middle East section on Storify: https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
     • Feb. 11, 2011: the BBC article on Storify: https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
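One way to tell a deep link from a top-level homepage, as slide 67 contrasts them, is to recover the original URI from the Archive-It URI-M and check whether it has any path segments. This is only a sketch of that distinction: the regular expression assumes the 14-digit timestamp layout seen in the URI-Ms above, and the helper names are hypothetical. It does not reproduce the Sql score itself, whose levels the slides do not define.

```python
import re
from urllib.parse import urlparse

# Assumption: URI-Ms look like .../<collection>/<14-digit timestamp>/<original URI>.
URIM_PATTERN = re.compile(r"/\d{14}/(.+)$")


def original_uri(urim):
    """Extract the original (live-web) URI embedded in a URI-M."""
    match = URIM_PATTERN.search(urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {urim!r}")
    return match.group(1)


def is_deep_link(urim):
    """True when the original URI has at least one non-empty path segment."""
    path = urlparse(original_uri(urim)).path
    return any(segment for segment in path.split("/"))


homepage = "https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/"
article = (
    "https://wayback.archive-it.org/2358/20110211192204/"
    "http://www.bbc.co.uk/news/world-middle-east-12433045"
)
print(is_deep_link(homepage))  # False
print(is_deep_link(article))   # True
```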
  68. Social media pages may not produce good snippets (Sqc)
     • http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
     • http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
  69. Visualizing stories in Storify
  70. Remember Yasmin’s hand-crafted stories?
  71. Remember Yasmin’s hand-crafted stories?
  72. We extract the metadata of the pages and order them chronologically. Using the Storify API, we override the default metadata to generate more attractive snippets.
     {
       "elements": [
         {
           "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
           "type": "link",
           "source": {"href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007"}
         },
         {
           "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
           "type": "link",
           "source": {"href": "http://www.time.com", "name": "www.time.com @ 30, May 2007"}
         },
         {
           "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
           "type": "link",
           "source": {"href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007"}
         },
         {
           "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
           "type": "link",
           "source": {"href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007"}
         },
         …
         {
           "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
           "type": "link",
           "source": {"href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007"}
         }
       ],
       "description": "This is an automatically generated story from Archive-It collection.",
       "title": "April 16 Archive"
     }
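The JSON on slide 72 can be assembled from a list of URI-Ms by parsing the 14-digit capture timestamp, sorting on it, and filling the `permalink`, `type`, and `source` fields in the same shape. The sketch below only builds the object; it makes no call to the Storify API, whose request format is not shown on the slides. The function name `build_story` and the regular expression are assumptions.

```python
import json
import re
from datetime import datetime
from urllib.parse import urlparse

# Assumption: URI-Ms embed a 14-digit timestamp followed by the original URI.
URIM_PATTERN = re.compile(r"/(\d{14})/(.+)$")


def build_story(urims, title, description):
    """Build a story dict shaped like slide 72, ordered by capture time."""
    dated = []
    for urim in urims:
        match = URIM_PATTERN.search(urim)
        if match is None:
            continue  # skip anything that is not a recognizable URI-M
        captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        original = urlparse(match.group(2))
        element = {
            "permalink": urim,
            "type": "link",
            "source": {
                "href": f"{original.scheme}://{original.netloc}",
                # Matches the slide's naming, e.g. "www.usatoday.com @ 23, May 2007".
                "name": f"{original.netloc} @ {captured.strftime('%d, %b %Y')}",
            },
        }
        dated.append((captured, element))
    dated.sort(key=lambda pair: pair[0])
    return {
        "elements": [element for _, element in dated],
        "description": description,
        "title": title,
    }


story = build_story(
    [
        "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
        "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
    ],
    title="April 16 Archive",
    description="This is an automatically generated story from Archive-It collection.",
)
print(json.dumps(story, indent=2))
```

The month abbreviation produced by `strftime("%b")` depends on the process locale; the default C locale yields English names as on the slide.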
  73. Example of an automatically generated story. Notice the good metadata: images, titles with dates, favicons
  74. Evaluating the Dark and Stormy Archive framework (how good are the automatically generated stories?)
  75. Evaluation is tricky! (two perfectly good stories could have non-overlapping k=28 elements!)
     • We use human evaluators (via Amazon's Mechanical Turk) to compare:
       • Human-generated stories
       • DSA (automatically) generated stories
       • Randomly generated stories
     • Successful evaluation means:
       • Human and DSA stories are indistinguishable
       • Human and DSA stories are better than Random
  76. Our guidelines for expert archivists at Archive-It for generating stories from the collections
  77. We received 23 stories for 10 Archive-It collections. SPST is “Sliding Page, Sliding Time”; SPFT is “Sliding Page, Fixed Time”; FPST is “Fixed Page, Sliding Time”
  78. https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b
     • Generated by domain experts
     • Sliding Page, Sliding Time
     • The Boston Marathon Bombing collection
  79. Automatically generated stories from archived collections
     1. Obtain the seed list and the TimeMap of URIs from the front-end interface of Archive-It
     2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download the collections that we do not have in the ODU mirror from Archive-It
     3. Extract the text of the page using the Boilerpipe library
     4. Eliminate the off-topic pages based on the best-performing method ((Cosine, WordCount) with the suggested thresholds (0.1, −0.85))
     5. Exclude duplicates in each TimeMap
     6. Eliminate the non-English language pages
     7. Slice the collection dynamically and then cluster the mementos of each slice using the DBSCAN algorithm
     8. Apply the quality metrics to select the best representative pages
     9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
  80. https://storify.com/mturk_exp/3649b0s
     • Automatically generated story
     • Sliding Page, Sliding Time
     • The Boston Marathon Bombing collection
  81. Random stories: 28 mementos were randomly selected from each collection before excluding off-topic and duplicate pages
  82. https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc
     • Randomly generated story
     • Sliding Page, Sliding Time
     • The Boston Marathon Bombing collection
  83. https://storify.com/mturk_exp/3649bads (if someone prefers this story, we exclude their results)
     • Poorly generated story
     • The same memento, 28 times
     • The Boston Marathon Bombing collection
  84. MT experiment setup
     • Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT:
       • HIT1: human vs. automatic, human vs. poor
       • HIT2: human vs. random, human vs. poor
       • HIT3: random vs. automatic, automatic vs. poor
     • 15 distinct turkers with the master qualification (i.e., a high acceptance rate) for each HIT
     • We rejected the submissions that preferred the poorly generated stories and the HITs that were completed in less than 10 seconds (mean time per HIT = 7 minutes)
     • 989 out of 1,035 (69*15) HITs were valid
     • We awarded the turker $0.50 per HIT
     https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
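The rejection rules and the counts on slide 84 can be checked with a few lines. The two rules (preferring the poorly generated story, finishing in under 10 seconds) and the totals are from the slide; the `Submission` record and its field names are hypothetical, since the slides do not show how responses were stored.

```python
from dataclasses import dataclass

MIN_SECONDS = 10  # HITs completed faster than this were rejected


@dataclass
class Submission:
    worker_id: str
    seconds_spent: float
    preferred_poor_story: bool


def is_valid(submission):
    """Apply both rejection rules from slide 84."""
    return (
        submission.seconds_spent >= MIN_SECONDS
        and not submission.preferred_poor_story
    )


# Hypothetical submissions for one HIT.
submissions = [
    Submission("worker-1", 420.0, False),
    Submission("worker-2", 6.0, False),   # too fast
    Submission("worker-3", 300.0, True),  # picked the poor story
]
valid = [s.worker_id for s in submissions if is_valid(s)]

# Totals from the slide: 23 stories x 3 HITs x 15 turkers, 989 valid.
total_hits = 23 * 3 * 15
valid_share = 989 / total_hits
print(valid, total_hits, round(valid_share * 100, 1))
```

With the slide's numbers, about 95.6% of the 1,035 submissions survived the filters.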
  85. A sample HIT
  86. DSA == Human; (Human, DSA) > Random
  87. Automatic versus Human. [Figure panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time]
  88. Human versus Random. [Figure panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time]
  89. Automatic versus Random. [Figure panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time]
  90. Success! DSA-generated stories are just as good as stories generated by human experts
  91. Use an interface people already know how to use to summarize collections: archived collections + storytelling services → archived enriched stories. All the code, datasets, papers, slides, etc.: http://bit.ly/YasminPhD
