Dr. Michael Nelson is a professor of computer science at Old Dominion University. Prior to joining ODU, he worked at NASA Langley Research Center from 1991 to 2002. He is a co-editor of the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), OAI-ORE (Open Archives Initiative Object Reuse and Exchange), Memento and ResourceSync specifications. His research interests include repository-object interaction and alternative approaches to digital preservation.
1. Summarizing archival collections
using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
2. Archive-It, a subscription-based service, allows creation of collections
• > 3,000 collections
• ~340 institutions
• > 10B archived pages
4. Collection understanding and collection summarization are not currently supported
Not easy to answer “what’s in that collection?” or “how is this collection different from others?”
5. There is more than one collection about the “Egyptian Revolution”
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
9. Our early attempts at collection understanding
tried to include everything…
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
10. 1000s of seeds × 1000s of archived pages == conventional visualization methods not applicable
12. Stories in literature
Story elements: setting, characters, sequence, exposition, conflict, climax, resolution
Once upon a time
http://www.learner.org/interactives/story/
13. Stories in social media
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
Basically, just arranging web pages in time
19. Use an interface people already know how to use to summarize collections
Archived collections + Storytelling services → Archive-enriched stories
20. We sample k mementos from the N pages of the collection (k << N) to create a summary story
[Diagram: the web containing Collections X, Y, and Z; sampled pages S1, S2, S3, … drawn from each collection form the story]
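As a minimal illustration only (not the DSA selection algorithm described later), sampling k of N mementos in Python might look like the sketch below; the memento list and its ordering are assumptions:

import random

def sample_mementos(mementos, k):
    # Naively pick k of the N mementos (k << N); the DSA framework
    # replaces this uniform sample with slicing, clustering, and
    # quality-based selection.
    return sorted(random.sample(mementos, k))  # assumes items sort by datetime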
21. Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
23. Collections have two dimensions: {Fixed, Sliding} × {Page, Time}
[Diagram: a collection as a grid of mementos, with URIs R1 … Rk on one axis and crawl times t1 … tk on the other; each row Ri holds that URI's mementos Ri,1 … Ri,n]
24. Fixed Page, Fixed Time
A desktop Chrome user-agent:
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
An Android Chrome user-agent:
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
25. Fixed Page, Sliding Time
[Screenshots of the same page on Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, and Feb 11]
27. Sliding Page, Sliding Time
[Screenshots of different pages on Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, and Feb 11]
28. The Dark and Stormy Archives (DSA) framework
• Establish a baseline: characteristics of human-generated stories; characteristics of Archive-It collections
• Reduce the candidate pool of archived pages: exclude duplicates, off-topic pages, and non-English pages
• Select good representative pages: dynamically slice the collection, cluster the pages in each slice, select high-quality pages from each cluster, order pages by time, and visualize
(image: https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg)
29. Establish a baseline of
social media stories
"Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
29
30. What is the length of a story (the number of resources per story)?
This story has 31 resources
31. What are the types of resources that compose a story?
This story has:
• 19 quotes
• 8 images
• 4 videos
32. What are the most frequently used domains?
This story has:
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
34. What differentiates a popular story?
(popular = stories in the top 25% by views)
[Example stories: 19,795 views vs. 64 views]
35. The distributions of the stories’ features
• Based on the Kruskal-Wallis test at the p ≤ 0.05 significance level, popular and unpopular stories differ on most of the features
• Popular stories tend to have:
  • more web elements (medians of 28 vs. 21)
  • a longer timespan (5 hours vs. 2 hours) than the unpopular stories
36. Do popular stories have a lower decay rate?
The 75th percentile of the decay rate is 10% of resources per popular story, versus 15% for unpopular stories
37. We found that 28 mementos is a good target for the number of resources in a story.
38. Establish a baseline of current Archive-It collections
“Characteristics of Social Media Stories. What makes a good story?”, International Journal on Digital Libraries 2016.
39. The mean and median number of
URIs in a collection
This collection has 435 seed URIs
40. The mean and median number of
mementos per URI
This seed URI has 16 mementos
41. The most frequently used domains
This collection has 30% abcnews.go.com, 10% blogspot.com, 3% facebook.com
48. Over 60% of archived versions of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
49. How do we automatically detect
off-topic pages?
50. We investigated 6 similarity metrics (the textual ones are sketched below)
• Textual content
  • cosine similarity of TF-IDF vectors
  • intersection of the 20 most frequent terms
  • Jaccard similarity coefficient
• Semantics
  • web-based kernel function using a search engine (SE)
• Structural
  • the change in the number of words
  • the change in content length
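As a rough sketch (not the exact implementation from the paper), the three textual measures could be computed in Python as below; the function names and the whitespace tokenizer are assumptions:

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_tfidf(text_a, text_b):
    # cosine similarity of the two pages' TF-IDF vectors
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def top_terms_overlap(text_a, text_b, n=20):
    # normalized intersection of the n most frequent terms
    top_a = {t for t, _ in Counter(text_a.split()).most_common(n)}
    top_b = {t for t, _ in Counter(text_b.split()).most_common(n)}
    return len(top_a & top_b) / n

def jaccard(text_a, text_b):
    # Jaccard coefficient over the pages' term sets
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0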
51. Textual content
cosine similarity, intersection of the most frequent terms, Jaccard similarity
Two similar pages:
Method           Similarity
cosine           0.7
TF-Intersection  0.6
Jaccard          0.5
52. Textual content
cosine similarity, intersection of the most frequent terms, Jaccard similarity
Two similar pages:
Method           Similarity
cosine           0.7
TF-Intersection  0.6
Jaccard          0.5
Two pages where one is off-topic:
Method           Similarity
cosine           0.0
TF-Intersection  0.0
Jaccard          0.0
53. Semantics of the text
Web-based kernel function using a search engine (SE)
Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006.
54. Semantics of the text
Web-based kernel function using a search engine (SE)
Method      Similarity
SE-Kernel   0.7
Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006.
57. We built a gold-standard dataset to evaluate the methods
58. We manually labeled 15,760 mementos
Egypt Revolution and Politics: 136 URI-Rs, 6,886 URI-Ms, 384 off-topic URI-Ms
Occupy Movement: 255 URI-Rs, 6,570 URI-Ms, 458 off-topic URI-Ms
Columbia Univ. Human Rights collection: 198 URI-Rs, 2,304 URI-Ms, 94 off-topic URI-Ms
59. Evaluated 6 methods at 21 thresholds
• Assumed the first memento was on-topic
• Combined two methods (‘OR’) to find the best combination method
  • 15 combinations
  • 6,615 tests (15 combinations × 21 thresholds × 21 thresholds)
• Averaged the results at each threshold over the three collections
(the best-performing combination is sketched after this list)
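The winning combination (slide 83: cosine at 0.1 OR an 85% drop in word count) can be expressed as a simple rule; cosine_tfidf() is the illustrative helper above, and the shape of this function is an assumption, not the paper's code:

def is_off_topic(candidate_text, first_text,
                 cosine_threshold=0.10, wordcount_change_threshold=-0.85):
    # A memento is flagged off-topic if its similarity to the first
    # (assumed on-topic) memento of the TimeMap falls below the cosine
    # threshold OR its word count has shrunk by more than 85%.
    cos = cosine_tfidf(candidate_text, first_text)
    base = max(len(first_text.split()), 1)
    change = (len(candidate_text.split()) - base) / base
    return cos < cosine_threshold or change < wordcount_change_threshold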
64. How do we dynamically divide the
collections into appropriate slices?
65. We expected to see more like this…
The Global Food Crisis collection at Archive-It
66. This is what we found
[Crawl-time plots for: Egypt Revolution and Politics; Human Rights; April 16 Archive; Virginia Tech Shooting; Jasmine Revolution 2011; Wikileaks Document Release]
68. Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as follows (transcribed in the sketch below):
  Mq = (1 − wm·Dm) + wql·Sql + wqc·Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI’s depth level
• Sqc is the snippet quality based on the URI’s category
• wm, wql, wqc are the weights for memento damage, level, and category
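A direct transcription of the formula into Python; the default weight values here are placeholders, not the tuned weights from the paper:

def memento_quality(d_m, s_ql, s_qc, w_m=1.0, w_ql=1.0, w_qc=1.0):
    # Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc
    # d_m:  memento damage (Brunelle et al., JCDL 2014); lower is better
    # s_ql: snippet quality from the URI's depth level
    # s_qc: snippet quality from the URI's category
    return (1 - w_m * d_m) + w_ql * s_ql + w_qc * s_qc

Higher Mq means a less damaged memento with a better-looking snippet, so within each cluster the memento with the highest Mq is preferred.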
69. We prefer a higher-quality (less damaged) memento (Dm)
http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
70. We prefer the page that gives an attractive snippet
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
71. We prefer deep links over high-level domains (Sql)
Feb. 11, 2011: the BBC homepage on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
Feb. 11, 2011: the homepage of the BBC Middle East section on Storify
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
Feb. 11, 2011: a BBC article on Storify
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
72. Social media pages may not produce good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
76. We extract the metadata of the pages and order them chronologically
{ "elements": [
    { "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
      "type": "link",
      "source": { "href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
      "type": "link",
      "source": { "href": "http://www.time.com", "name": "www.time.com @ 30, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
      "type": "link",
      "source": { "href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007" } },
    { "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
      "type": "link",
      "source": { "href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007" } },
    …
    { "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
      "type": "link",
      "source": { "href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007" } }
  ],
  "description": "This is an automatically generated story from an Archive-It collection.",
  "title": "April 16 Archive"
}
We override the default metadata to generate more attractive snippets (a generation sketch follows below)
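A hedged sketch of how such an element could be built; story_element() is a hypothetical helper, and the "domain @ date" source name mirrors the overridden metadata above:

import json
from urllib.parse import urlparse

def story_element(memento_uri, original_uri, datetime_str):
    # One Storify-style story element; the source name is overridden
    # to "domain @ archival date" so the snippet shows when the page
    # was captured.
    return {
        "permalink": memento_uri,
        "type": "link",
        "source": {
            "href": original_uri,
            "name": "%s @ %s" % (urlparse(original_uri).netloc, datetime_str),
        },
    }

story = {
    "elements": [story_element(
        "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
        "http://www.collegiatetimes.com", "30, May 2007")],
    "description": "This is an automatically generated story from an Archive-It collection.",
    "title": "April 16 Archive",
}
print(json.dumps(story, indent=1))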
77. Example of an automatically generated story
Notice the good metadata: images, titles with dates, favicons
79. What a successful evaluation looks like!
• We use Amazon's Mechanical Turk to compare the
following stories:
• Human-generated stories
• DSA (automatically) generated stories
• Randomly generated stories
• Successful evaluation should result in:
• Human and DSA stories are indistinguishable
• Human and DSA stories are better than random stories
80. Our guidelines for expert archivists at Archive-It
for generating stories from the collections
81. We received 23 stories for 10 Archive-It collections
SPST is “Sliding Page, Sliding Time”
SPFT is “Sliding Page, Fixed Time”
FPST is “Fixed Page, Sliding Time”
83. Automatically generated stories from archived collections
1. Obtain the seed list and the TimeMap of each URI from the front-end interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download from Archive-It the collections that we do not have in the ODU mirror
3. Extract the text of each page using the Boilerpipe library
4. Eliminate the off-topic pages using the best-performing method (Cosine, Word-Count) with the suggested thresholds (0.1, −0.85)
5. Exclude the duplicates within each TimeMap
6. Eliminate the non-English pages
7. Slice the collection dynamically, then cluster the mementos of each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
(a skeleton sketch of these steps follows this list)
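A skeleton of the nine steps; every helper named here is a placeholder for the component described above, not a real API:

def generate_story(collection_id):
    seeds, timemaps = fetch_seeds_and_timemaps(collection_id)      # step 1
    mementos = extract_html_from_warcs(timemaps)                   # step 2
    for m in mementos:
        m.text = boilerpipe_extract(m.html)                        # step 3
    mementos = [m for m in mementos if not is_off_topic(m)]        # step 4
    mementos = deduplicate_per_timemap(mementos)                   # step 5
    mementos = [m for m in mementos if is_english(m.text)]         # step 6
    slices = dynamic_slices(mementos)                              # step 7
    clusters = [c for s in slices for c in dbscan_clusters(s)]
    selected = [best_by_quality(c) for c in clusters]              # step 8
    selected.sort(key=lambda m: m.memento_datetime)                # step 9
    return to_story_json(selected)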
88. MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT:
  • HIT1: human vs. automatic, human vs. poor
  • HIT2: human vs. random, human vs. poor
  • HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with the Master qualification (high acceptance rate) for each HIT
• We rejected submissions that contained poorly generated stories and HITs completed in less than 10 seconds (mean time per HIT = 7 minutes)
• 989 of 1,035 (69 × 15) HITs were valid
• We awarded turkers $0.50 per HIT
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
95. Use an interface people already know how to use to summarize collections
Archived collections + Storytelling services → Archive-enriched stories
All the code, datasets,
papers, slides, etc.:
http://bit.ly/YasminPhD
Editor's Notes
First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content.
Lori created the collections and entered metadata about them: description, title, etc.
There is collection-level metadata, but it doesn’t help a lot.
Archive-It provides faceted browsing and search services on the resulting collection
There are about 3 or 4 collections about the Egyptian Revolution in Archive-It.
If I want to know about the Egyptian Revolution, which collection should I browse?
A collection has two dimensions: URIs, and copies of those URIs over time.
A historian with more than one collection will not know where to start.
Even with these visualizations, we still do not know what the content of these collections is. Users have to go through the mementos manually to understand a collection, so they have to inspect a lot of things by hand.
We concluded that the conventional visualization methods, which try to visualize everything in the collection, are not applicable.
So how about using storytelling?
Every story is made up of a set of events.
Stories in literature have elements, such as setting, characters, sequence, etc.
We use ``story'' in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc
What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
The definition of a story in social media is much looser and more relaxed.
In social media, it is more about arranging resources through time.
Storytelling may be seen as the set of cultural practices for representing events chronologically.
Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc., and arranging them in chronological order to support a particular narrative or “story”.
Storytelling looks promising, but what are its problems?
This is an example story about the Egyptian Revolution on Storify.
Storify is a storytelling service that lets the user create stories or narratives using social media and web pages. Storify was launched in September 2010 and has been open to the public since April 2011.
“Storytelling” is best typified by the company Storify.
http://storify.com/nzherald/mu
http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705546
The problem is that Storify operates like bookmarking; it doesn’t preserve the linked pages.
You have no clue what the person is saying about the link.
http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523
http://storify.com/nzherald/mu
So what we want to do is create persistent stories and then visualize them using a storytelling tool that users already know, such as Storify. We will integrate the storytelling services and the archived collections to generate archive-enriched stories.
So if this is the web, the archived collections are subsets of the web; we will sample from these collections to create a story.
Then place those generated samples in a social media interface that people already know: Storify.
I went through these collections and sampled what I thought were interesting pages, ordered them by time, and put them on Storify so Yousof and his generation can see them later.
It took hours to select the resources in these handcrafted stories.
Although I know the Egyptian Revolution very well, it wasn’t easy to select these pages from all the pages in the collection to represent the story.
For example, here is the same page; it differs between desktop and mobile.
The archives typically don’t have those versions, so currently we can’t generate this story.
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
http://america.aljazeera.com/
Personalized Web resources offer different representations based on the user-agent string and other values in the HTTP request headers, GeoIP, and other environmental factors.
Currently web archives don’t support browsing different representation.
This means Web crawlers capturing content for archives may receive representations based on the crawl environment which will differ from the representations returned to the interactive users.
For example, here is the CNN blog as it evolved over time.
You can get a sense of how the story evolved through time by looking at the images here.
http://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/*/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
For example, this is Feb. 11. We can see how CNN reported it, and we can see how the BBC covered the news.
Here is Feb. 11 from different news sites.
This story is very important for humanities researchers.
https://wayback.archive-it.org/2358/20110211074248/http://www.globalpost.com/dispatch/egypt/110210/mubarak-resign-obama-egypt
https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
https://wayback.archive-it.org/2358/20110211192142/http://www.modernegypt.info/
https://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
https://wayback.archive-it.org/2358/20110211191423/http://www.arabist.net/
https://wayback.archive-it.org/2358/20110211194239/http://www.globalpost.com/dispatch/egypt/110211/mubarak-quits-resigns-egypt-cairo
And here I want to get the broadest coverage possible for the Egyptian Revolution, sampling from the entire collection.
For example, we can see here the news from CNN about shutting down the Internet.
Also the news from the BBC about Mubarak resigning on Feb. 11.
To generate these stories, we introduced the Dark and Stormy Archives framework.
The framework has three main components.
First, establishing a baseline of human-generated stories and Archive-It collections: we examine human-generated stories from Storify to build a descriptive model of what good stories look like.
So we quantified the number of resources in the stories.
Twitter is the most popular domain in Storify stories, and you can notice here that twitter.com dominates the top of the list with a large percentage.
We looked at what makes a good story.
We looked at several features, and we found the following:
Popular stories tend to have
more web elements (medians of 28 vs. 21),
a longer timespan (5 hours vs. 2 hours) than the unpopular stories, and
longer editing time intervals than the unpopular stories.
It shows that the resources of the popular stories tend to stay longer than the resources of the unpopular ones.
This is the most important thing that you need to remember.
This will be used as a template for our automatically generated stories.
What we archive and what we put in our stories are different subsets of the web
Archive-It provides their partners with tools that allow them to build themed collections of archived Web pages hosted on Archive-It's machines. This is done by the user manually specifying a set of seeds, Uniform Resource Identifiers (URIs) that should be crawled periodically (the frequency is tunable by the user), and to what depth (e.g., follow the pages linked to from the seeds two levels out).
Archive-It provides curators with information about HTTP events, like how many HTML files, PDFs, and HTTP responses there are, file types, and so on.
However, the tools are currently focused on issues such as the mechanics of HTTP (e.g., how many HTML files vs. PDFs, how many HTTP 404 responses) and domain information (e.g., how many .uk sites vs. .com sites). Currently, there are no content-based tools that allow curators to detect when seed URIs go off-topic.
Here is a candidate running for office.
The page starts on-topic, goes off-topic because of a database error, then goes off again because of financial problems.
Then it goes on-topic again, then it is hacked, then the domain expires and is put up for sale.
We don’t necessarily want to get rid of this, because it documents what happened.
But if we’re going to choose pages for a story, a database error page won’t be a good candidate.
These are the scores of two similar pages
The textual content:
cosine similarity
intersection of the most frequent terms
Jaccard coefficient
The semantics of the text:
Web based kernel function using the search engine (SE)
Structural methods:
the change in number of words
the change in content length
And these are the scores for two mementos where one is off-topic because the domain is lost.
These are the scores of two similar pages
The textual content:
cosine similarity
intersection of the most frequent terms
Jaccard coefficient
The semantics of the text:
Web based kernel function using the search engine (SE)
Structural methods:
the change in number of words
the change in content length
These two mementos are both about Egypt, but term-wise they don’t overlap.
For three collections, we went through and manually determined whether the mementos were on-topic or off-topic, like the page I showed before.
I would never want to do this again.
Cosine similarity at threshold = 0.15 is the best single method
If cosine similarity between candidate memento and first memento < 0.15, then candidate memento is marked as 'off-topic'
If cosine similarity between candidate memento and first memento < 0.10 OR word count between candidate memento and first memento has decreased by more than 85%, then candidate memento is marked as 'off-topic'
FP - classified as off-topic, but really on-topic
There are some collections where we didn’t find off-topic URIs.
And we have other collections that have a big chunk of off-topic mementos for lots of seed URIs.
For some collections, around 10-15% of the mementos in the collection are off-topic.
The URIs in the collection are crawled at varying frequencies: maybe daily, weekly, monthly, yearly, and so on.
http://wayback.archive-it.org/2358/20110201013235/http://news.egypt.com/en/
To gain insight about how to slice the collection, we visualized the memento-datetimes (the crawl times of the URIs) of many collections.
Earlier we talked about two dimensions for stories.
In this collection, here are the two dimensions: on the x-axis we have time, and the y-axis has the URIs.
Pretty much every URI is crawled over about the same span of time.
Textual methods will combine these two pages in the same cluster, so we needed an automated way to pick between them.
When the same news appears on two different sites, we pick the better memento in terms of quality.
Remember this: this is not a good snippet.
These are not good snippets; we can do better than this.
We override the favicon and add the date to the title.
So we generate a JSON object with the metadata of the stories and push them to Storify using the Storify API.
After we select the best set of pages to represent a story, we extract the metadata of the mementos and put it into JSON format for visualization.
We have done a lot of work to override the default metadata that Storify extracts.
We consider a story to be “good” if a person considers it to be indistinguishable from a human-generated story. Furthermore, the human and the automatic stories should be better than the random stories.
A successful evaluation should have:
Human and DSA stories indistinguishable,
and human and DSA stories better than random.
They are familiar with the collections; I gave them these guidelines.
We obtained 23 stories for 10 collections.
Some collections do not have stories because…
Random stories: when you choose randomly, you get what you get; sometimes it’s good and sometimes it’s bad.
The turkers are presented with two sets of comparisons; each comparison has two stories, and we ask them which story better summarizes the topic.
They can scroll down through the stories, and they can click on any memento.