Dr. Michael Nelson is a professor of computer science at Old Dominion University. Prior to joining ODU, he worked at NASA Langley Research Center from 1991 to 2002. He is a co-editor of the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), OAI-ORE (Open Archives Initiative Object Reuse and Exchange), Memento and ResourceSync specifications. His research interests include repository-object interaction and alternative approaches to digital preservation.
1. Summarizing archival collections
using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
2. Archive-It, a subscription-based service, allows creation of collections
• > 3,000 collections
• ~340 institutions
• > 10B archived pages
4. Collection understanding and collection summarization are not currently supported
Not easy to answer “what’s in that collection?” or “how is this collection different from others?”
5. There is more than one collection about the “Egyptian Revolution”
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
9. Our early attempts at collection understanding
tried to include everything…
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
10. 1000s of seeds × 1000s of archived pages == conventional visualization methods not applicable
12. Stories in literature
Story elements: setting, characters, sequence, exposition, conflict, climax, resolution
Once upon a time
http://www.learner.org/interactives/story/
13. Stories in social media
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
Basically, just arranging web pages in time
19. Use an interface people already know how to use to summarize collections
Archived collections + Storytelling services → Archive-enriched stories
20. We sample k mementos from the N pages of the collection (k << N) to create a summary story
[Diagram: the web containing Collections X, Y, and Z; sampled pages S1, S2, S3, … drawn from each collection form the story]
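As a minimal illustration only (not the DSA selection algorithm described later), sampling k of N mementos in Python might look like the sketch below; the memento list and its ordering are assumptions:

import random

def sample_mementos(mementos, k):
    # Naively pick k of the N mementos (k << N); the DSA framework
    # replaces this uniform sample with slicing, clustering, and
    # quality-based selection.
    return sorted(random.sample(mementos, k))  # assumes items sort by datetime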
21. Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
23. Collections have two dimensions: {Fixed, Sliding} × {Page, Time}
[Diagram: a collection as a grid of mementos, with URIs R1 … Rk on one axis and crawl times t1 … tk on the other; each row Ri holds that URI's mementos Ri,1 … Ri,n]
24. Fixed Page, Fixed Time
A desktop Chrome user-agent:
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
An Android Chrome user-agent:
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
25. Fixed Page, Sliding Time
[Screenshots of the same page on Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, and Feb 11]
27. Sliding Page, Sliding Time
[Screenshots of different pages on Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, and Feb 11]
28. The Dark and Stormy Archives (DSA) framework
• Establish a baseline: characteristics of human-generated stories; characteristics of Archive-It collections
• Reduce the candidate pool of archived pages: exclude duplicates, off-topic pages, and non-English pages
• Select good representative pages: dynamically slice the collection, cluster the pages in each slice, select high-quality pages from each cluster, order pages by time, and visualize
(image: https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg)
29. Establish a baseline of
social media stories
"Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
29
30. What is the length of a story (the number of resources per story)?
This story has 31 resources
31. What are the types of resources that compose a story?
This story has:
• 19 quotes
• 8 images
• 4 videos
32. What are the most frequently used domains?
This story has:
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
34. What differentiates a popular story?
(popular = stories in the top 25% by views)
[Example stories: 19,795 views vs. 64 views]
35. The distributions of the stories’ features
• Based on the Kruskal-Wallis test at the p ≤ 0.05 significance level, popular and unpopular stories differ on most of the features
• Popular stories tend to have:
  • more web elements (medians of 28 vs. 21)
  • a longer timespan (5 hours vs. 2 hours) than the unpopular stories
36. Do popular stories have a lower decay rate?
The 75th percentile of the decay rate is 10% of resources per popular story, versus 15% for unpopular stories
37. We found that 28 mementos is a good target for the number of resources in a story.
38. Establish a baseline of current Archive-It collections
“Characteristics of Social Media Stories. What makes a good story?”, International Journal on Digital Libraries 2016.
39. The mean and median number of
URIs in a collection
This collection has 435 seed URIs
40. The mean and median number of
mementos per URI
This seed URI has 16 mementos
41. The most frequently used domains
This collection has 30% abcnews.go.com, 10% blogspot.com, 3% facebook.com
48. Over 60% of archived versions of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
49. How do we automatically detect
off-topic pages?
50. We investigated 6 similarity metrics (the textual ones are sketched below)
• Textual content
  • cosine similarity of TF-IDF vectors
  • intersection of the 20 most frequent terms
  • Jaccard similarity coefficient
• Semantics
  • web-based kernel function using a search engine (SE)
• Structural
  • the change in the number of words
  • the change in content length
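As a rough sketch (not the exact implementation from the paper), the three textual measures could be computed in Python as below; the function names and the whitespace tokenizer are assumptions:

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_tfidf(text_a, text_b):
    # cosine similarity of the two pages' TF-IDF vectors
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def top_terms_overlap(text_a, text_b, n=20):
    # normalized intersection of the n most frequent terms
    top_a = {t for t, _ in Counter(text_a.split()).most_common(n)}
    top_b = {t for t, _ in Counter(text_b.split()).most_common(n)}
    return len(top_a & top_b) / n

def jaccard(text_a, text_b):
    # Jaccard coefficient over the pages' term sets
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0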
51. Textual content
cosine similarity, intersection of the most frequent terms, Jaccard similarity
Two similar pages:
Method           Similarity
cosine           0.7
TF-Intersection  0.6
Jaccard          0.5
52. Textual content
cosine similarity, intersection of the most frequent terms, Jaccard similarity
Two similar pages:
Method           Similarity
cosine           0.7
TF-Intersection  0.6
Jaccard          0.5
Two pages where one is off-topic:
Method           Similarity
cosine           0.0
TF-Intersection  0.0
Jaccard          0.0
53. Semantics of the text
Web-based kernel function using a search engine (SE)
Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006.
54. Semantics of the text
Web-based kernel function using a search engine (SE)
Method      Similarity
SE-Kernel   0.7
Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006.
57. We built a gold-standard dataset to evaluate the methods
58. We manually labeled 15,760 mementos
Egypt Revolution and Politics: 136 URI-Rs, 6,886 URI-Ms, 384 off-topic URI-Ms
Occupy Movement: 255 URI-Rs, 6,570 URI-Ms, 458 off-topic URI-Ms
Columbia Univ. Human Rights collection: 198 URI-Rs, 2,304 URI-Ms, 94 off-topic URI-Ms
59. Evaluated 6 methods at 21 thresholds
• Assumed the first memento was on-topic
• Combined two methods (‘OR’) to find the best combination method
  • 15 combinations
  • 6,615 tests (15 combinations × 21 thresholds × 21 thresholds)
• Averaged the results at each threshold over the three collections
(the best-performing combination is sketched after this list)
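The winning combination (slide 83: cosine at 0.1 OR an 85% drop in word count) can be expressed as a simple rule; cosine_tfidf() is the illustrative helper above, and the shape of this function is an assumption, not the paper's code:

def is_off_topic(candidate_text, first_text,
                 cosine_threshold=0.10, wordcount_change_threshold=-0.85):
    # A memento is flagged off-topic if its similarity to the first
    # (assumed on-topic) memento of the TimeMap falls below the cosine
    # threshold OR its word count has shrunk by more than 85%.
    cos = cosine_tfidf(candidate_text, first_text)
    base = max(len(first_text.split()), 1)
    change = (len(candidate_text.split()) - base) / base
    return cos < cosine_threshold or change < wordcount_change_threshold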
64. How do we dynamically divide the
collections into appropriate slices?
65. We expected to see more like this…
The Global Food Crisis collection at Archive-It
66. This is what we found
[Crawl-time plots for: Egypt Revolution and Politics; Human Rights; April 16 Archive; Virginia Tech Shooting; Jasmine Revolution 2011; Wikileaks Document Release]
68. Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as follows (transcribed in the sketch below):
  Mq = (1 − wm·Dm) + wql·Sql + wqc·Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI’s depth level
• Sqc is the snippet quality based on the URI’s category
• wm, wql, wqc are the weights for memento damage, level, and category
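A direct transcription of the formula into Python; the default weight values here are placeholders, not the tuned weights from the paper:

def memento_quality(d_m, s_ql, s_qc, w_m=1.0, w_ql=1.0, w_qc=1.0):
    # Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc
    # d_m:  memento damage (Brunelle et al., JCDL 2014); lower is better
    # s_ql: snippet quality from the URI's depth level
    # s_qc: snippet quality from the URI's category
    return (1 - w_m * d_m) + w_ql * s_ql + w_qc * s_qc

Higher Mq means a less damaged memento with a better-looking snippet, so within each cluster the memento with the highest Mq is preferred.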
69. We prefer a higher-quality (less damaged) memento (Dm)
http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
70. We prefer the page that gives an attractive snippet
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
71. We prefer deep links over high-level domains (Sql)
Feb. 11, 2011: the BBC homepage on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
Feb. 11, 2011: the homepage of the BBC Middle East section on Storify
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
Feb. 11, 2011: a BBC article on Storify
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
72. Social media pages may not produce good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
76. We extract the metadata of the pages and order them chronologically
{ "elements": [
    { "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
      "type": "link",
      "source": { "href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
      "type": "link",
      "source": { "href": "http://www.time.com", "name": "www.time.com @ 30, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
      "type": "link",
      "source": { "href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
      "type": "link",
      "source": { "href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007" } },
    …
    { "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
      "type": "link",
      "source": { "href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007" } }
  ],
  "description": "This is an automatically generated story from an Archive-It collection.",
  "title": "April 16 Archive"
}
We override the default metadata to generate more attractive snippets (a generation sketch follows below)
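A hedged sketch of how such an element could be built; story_element() is a hypothetical helper, and the "domain @ date" source name mirrors the overridden metadata above:

import json
from urllib.parse import urlparse

def story_element(memento_uri, original_uri, datetime_str):
    # One Storify-style story element; the source name is overridden
    # to "domain @ archival date" so the snippet shows when the page
    # was captured.
    return {
        "permalink": memento_uri,
        "type": "link",
        "source": {
            "href": original_uri,
            "name": "%s @ %s" % (urlparse(original_uri).netloc, datetime_str),
        },
    }

story = {
    "elements": [story_element(
        "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
        "http://www.collegiatetimes.com", "30, May 2007")],
    "description": "This is an automatically generated story from an Archive-It collection.",
    "title": "April 16 Archive",
}
print(json.dumps(story, indent=1))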
77. Example of an automatically generated story
Notice the good metadata: images, titles with dates, favicons
79. What a successful evaluation looks like!
• We use Amazon's Mechanical Turk to compare the
following stories:
• Human-generated stories
• DSA (automatically) generated stories
• Randomly generated stories
• Successful evaluation should result in:
• Human and DSA stories are indistinguishable
• Human and DSA stories are better than random stories
80. Our guidelines for expert archivists at Archive-It
for generating stories from the collections
81. We received 23 stories for 10 Archive-It collections
SPST is “Sliding Page, Sliding Time”
SPFT is “Sliding Page, Fixed Time”
FPST is “Fixed Page, Sliding Time”
83. Automatically generated stories from archived collections
1. Obtain the seed list and the TimeMap of each URI from the front-end interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download from Archive-It the collections that we do not have in the ODU mirror
3. Extract the text of each page using the Boilerpipe library
4. Eliminate the off-topic pages using the best-performing method (Cosine, Word-Count) with the suggested thresholds (0.1, −0.85)
5. Exclude the duplicates within each TimeMap
6. Eliminate the non-English pages
7. Slice the collection dynamically, then cluster the mementos of each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
(a skeleton sketch of these steps follows this list)
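A skeleton of the nine steps; every helper named here is a placeholder for the component described above, not a real API:

def generate_story(collection_id):
    seeds, timemaps = fetch_seeds_and_timemaps(collection_id)      # step 1
    mementos = extract_html_from_warcs(timemaps)                   # step 2
    for m in mementos:
        m.text = boilerpipe_extract(m.html)                        # step 3
    mementos = [m for m in mementos if not is_off_topic(m)]        # step 4
    mementos = deduplicate_per_timemap(mementos)                   # step 5
    mementos = [m for m in mementos if is_english(m.text)]         # step 6
    slices = dynamic_slices(mementos)                              # step 7
    clusters = [c for s in slices for c in dbscan_clusters(s)]
    selected = [best_by_quality(c) for c in clusters]              # step 8
    selected.sort(key=lambda m: m.memento_datetime)                # step 9
    return to_story_json(selected)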
88. MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT:
  • HIT1: human vs. automatic, human vs. poor
  • HIT2: human vs. random, human vs. poor
  • HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with the Master qualification (high acceptance rate) for each HIT
• We rejected submissions that contained poorly generated stories and HITs completed in less than 10 seconds (mean time per HIT = 7 minutes)
• 989 of 1,035 (69 × 15) HITs were valid
• We awarded turkers $0.50 per HIT
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
95. Use an interface people already know how to use to summarize collections
Archived collections + Storytelling services → Archive-enriched stories
All the code, datasets,
papers, slides, etc.:
http://bit.ly/YasminPhD
Editor's Notes
First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content.
Lori created the collections and entered metadata about them: description, title, etc.
There is collection-level metadata, but it doesn’t help a lot.
Archive-It provides faceted browsing and search services on the resulting collection
There are about 3 or 4 collections about the Egyptian Revolution in Archive-It.
If I want to know about the Egyptian Revolution, which collection should I browse?
A collection has two dimensions: URIs, and copies of those URIs over time.
A historian with more than one collection will not know where to start.
Even with these visualizations, we still do not know what the content of these collections is. Users have to go through the mementos manually to understand a collection, so they have to inspect a lot of things by hand.
We concluded that the conventional visualization methods, which try to visualize everything in the collection, are not applicable.
So how about using storytelling?
Every story is made up of a set of events.
Stories in literature have elements, such as setting, characters, sequence, etc.
We use ``story'' in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc
What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
The definition of a story in social media is much looser and more relaxed.
In social media, it is more about arranging resources through time.
Storytelling may be seen as the set of cultural practices for representing events chronologically.
Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc., and arranging them in chronological order to support a particular narrative or “story”.
Storytelling looks promising, but what are its problems?
This is an example story about the Egyptian Revolution on Storify.
Storify is a storytelling service that lets the user create stories or narratives using social media and web pages. Storify was launched in September 2010 and has been open to the public since April 2011.
“Storytelling” is best typified by the company Storify.
http://storify.com/nzherald/mu
http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705546
The problem is that Storify operates like bookmarking; it doesn’t preserve the linked pages.
You have no clue what the person is saying about the link.
http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523
http://storify.com/nzherald/mu
So what we want to do is create persistent stories and then visualize them using a storytelling tool that users already know, such as Storify. We will integrate the storytelling services and the archived collections to generate archive-enriched stories.
So if this is the web, the archived collections are subsets of the web; we will sample from these collections to create a story.
Then place those generated samples in a social media interface that people already know: Storify.
I went through these collections and sampled what I thought were interesting pages, ordered them by time, and put them on Storify so Yousof and his generation can see them later.
It took hours to select the resources in these handcrafted stories.
Although I know the Egyptian Revolution very well, it wasn’t easy to select these pages from all the pages in the collection to represent the story.
For example, here is the same page; it differs between desktop and mobile.
The archives typically don’t have those versions, so currently we can’t generate this story.
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
http://america.aljazeera.com/
Personalized Web resources offer different representations based on the user-agent string and other values in the HTTP request headers, GeoIP, and other environmental factors.
Currently web archives don’t support browsing different representation.
This means Web crawlers capturing content for archives may receive representations based on the crawl environment which will differ from the representations returned to the interactive users.
For example, here is the CNN blog as it evolved over time.
You can get a sense of how the story evolved through time by looking at the images here.
http://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/*/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
For example, this is Feb. 11. We can see how CNN reported it, and we can see how the BBC covered the news.
Here is Feb. 11 from different news sites.
This story is very important for humanities researchers.
https://wayback.archive-it.org/2358/20110211074248/http://www.globalpost.com/dispatch/egypt/110210/mubarak-resign-obama-egypt
https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
https://wayback.archive-it.org/2358/20110211192142/http://www.modernegypt.info/
https://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
https://wayback.archive-it.org/2358/20110211191423/http://www.arabist.net/
https://wayback.archive-it.org/2358/20110211194239/http://www.globalpost.com/dispatch/egypt/110211/mubarak-quits-resigns-egypt-cairo
And here I want to get the broadest coverage possible for the Egyptian Revolution, sampling from the entire collection.
For example, we can see here the news from CNN about shutting down the Internet.
Also the news from the BBC about Mubarak resigning on Feb. 11.
To generate these stories, we introduced the Dark and Stormy Archives framework.
The framework has three main components.
First, establishing a baseline of human-generated stories and Archive-It collections: we examine human-generated stories from Storify to build a descriptive model of what good stories look like.
So we quantified the number of resources in the stories.
Twitter is the most popular domain in Storify stories, and you can notice here that twitter.com dominates the top of the list with a large percentage.
We looked at what makes a good story.
We looked at several features, and we found the following:
Popular stories tend to have
more web elements (medians of 28 vs. 21),
a longer timespan (5 hours vs. 2 hours) than the unpopular stories, and
longer editing time intervals than the unpopular stories.
It shows that the resources of the popular stories tend to stay longer than the resources of the unpopular ones.
This is the most important thing that you need to remember.
This will be used as a template for our automatically generated stories.
What we archive and what we put in our stories are different subsets of the web
Archive-It provides their partners with tools that allow them to build themed collections of archived Web pages hosted on Archive-It's machines. This is done by the user manually specifying a set of seeds, Uniform Resource Identifiers (URIs) that should be crawled periodically (the frequency is tunable by the user), and to what depth (e.g., follow the pages linked to from the seeds two levels out).
Archive-It provides curators with information about HTTP events, like how many HTML files, PDFs, and HTTP responses there are, file types, and so on.
However, the tools are currently focused on issues such as the mechanics of HTTP (e.g., how many HTML files vs. PDFs, how many HTTP 404 responses) and domain information (e.g., how many .uk sites vs. .com sites). Currently, there are no content-based tools that allow curators to detect when seed URIs go off-topic.
Here is a candidate running for office.
The page starts on-topic, goes off-topic because of a database error, then goes off again because of financial problems.
Then it goes on-topic again, then it is hacked, then the domain expires and is put up for sale.
We don’t necessarily want to get rid of this, because it documents what happened.
But if we’re going to choose pages for a story, a database error page won’t be a good candidate.
These are the scores of two similar pages
The textual content:
cosine similarity
intersection of the most frequent terms
Jaccard coefficient
The semantics of the text:
Web based kernel function using the search engine (SE)
Structural methods:
the change in number of words
the change in content length
And these are the scores for two mementos where one is off-topic because the domain is lost.
These are the scores of two similar pages
The textual content:
cosine similarity
intersection of the most frequent terms
Jaccard coefficient
The semantics of the text:
Web based kernel function using the search engine (SE)
Structural methods:
the change in number of words
the change in content length
These two mementos are both about Egypt, but term-wise they don’t overlap.
For three collections, we went through and manually determined whether the mementos were on-topic or off-topic, like the page I showed before.
I would never want to do this again.
Cosine similarity at threshold = 0.15 is the best single method
If cosine similarity between candidate memento and first memento < 0.15, then candidate memento is marked as 'off-topic'
If cosine similarity between candidate memento and first memento < 0.10 OR word count between candidate memento and first memento has decreased by more than 85%, then candidate memento is marked as 'off-topic'
FP - classified as off-topic, but really on-topic
There are some collections where we didn’t find off-topic URIs.
And we have other collections that have a big chunk of off-topic mementos for lots of seed URIs.
For some collections, around 10-15% of the mementos in the collection are off-topic.
The URIs in the collection are crawled at varying frequencies: maybe daily, weekly, monthly, yearly, and so on.
http://wayback.archive-it.org/2358/20110201013235/http://news.egypt.com/en/
To gain insight about how to slice the collection, we visualized the memento-datetimes (the crawl times of the URIs) of many collections.
Earlier we talked about two dimensions for stories.
In this collection, here are the two dimensions: on the x-axis we have time, and the y-axis has the URIs.
Pretty much every URI is crawled over about the same span of time.
Textual methods will combine these two pages in the same cluster, so we needed an automated way to pick between them.
When the same news appears on two different sites, we pick the better memento in terms of quality.
Remember this: this is not a good snippet.
These are not good snippets; we can do better than this.
We override the favicon and add the date to the title.
So we generate a JSON object with the metadata of the stories and push them to Storify using the Storify API.
After we select the best set of pages to represent a story, we extract the metadata of the mementos and put it into JSON format for visualization.
We have done a lot of work to override the default metadata that Storify extracts.
We consider a story to be “good” if a person considers it to be indistinguishable from a human-generated story. Furthermore, the human and the automatic stories should be better than the random stories.
A successful evaluation should have:
Human and DSA stories indistinguishable,
and human and DSA stories better than random.
They are familiar with the collections; I gave them these guidelines.
We obtained 23 stories for 10 collections.
Some collections do not have stories because…
Random stories: when you choose randomly, you get what you get; sometimes it’s good and sometimes it’s bad.
The turkers are presented with two sets of comparisons; each comparison has two stories, and we ask them which story better summarizes the topic.
They can scroll down through the stories, and they can click on any memento.