Summarizing archival collections
using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
Archive-It, a subscription-based service,
allows creation of collections
2
> 3,000
collections
~340
institutions
> 10B archived
pages
3
Collection
title
Collection
categorization
based on the
curator
Seed URI
Metadata
about the
collection
Text search
box
The group that
the resource
belongs to
List of
the seed
URIs
Timespan of the
resource
and the number
of times it has
been captured
Collection understanding and collection
summarization are not supported currently
Not easy to answer “what’s in that collection?” or
“how is this collection different from others?”
4
There is more than one collection about
“Egyptian Revolution”
5
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
6
7
8
Our early attempts at collection understanding
tried to include everything…
9
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
1000s of Seeds X 1000s of archived pages ==
Conventional Vis Methods Not Applicable
10
Idea:
Storytelling
11
Stories in literature
Story elements: setting, characters, sequence, exposition, conflict,
climax, resolution
Once upon a time
http://www.learner.org/interactives/story/
12
Stories in social media
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
basically, just arranging web pages in time
13
“Storytelling” is becoming a popular
technique in social media
14
What are the limitations of
storytelling services?
15
The Egyptian Revolution on Storify
16
Bookmarking, not preserving!
17
Despite these limitations, how do we
combine storytelling & archives?
18
Use interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched
stories
19
We sample k mementos from N (k << N)
pages of the collection to create a summary
story
[Figure: stories S1, S2, S3, … sampled from Collection X, Collection Y, and Collection Z]
20
Yasmin hand-crafted stories to summarize the
Egyptian Revolution collection for her son, Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
21
How do we generate this automatically?
22
Collections have two dimensions:
{Fixed, Sliding} X {Page, Time}
[Figure: grid of mementos R1 … Rk (captures 1 … n each) plotted by URI against Time (t1 … tk)]
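The four combinations of the two dimensions can be enumerated directly, and each one corresponds to a different walk over the URI-by-time grid. The following is a minimal sketch, assuming a hypothetical `grid` dictionary that maps each URI to its capture times; the helper names and the `t1`..`t3` labels are illustrative and are not taken from the DSA code.

```python
from itertools import product

# The two modes that apply to each dimension (page and time).
MODES = ("Fixed", "Sliding")

def story_types():
    """Return the four story types, e.g. 'Sliding Page, Fixed Time'."""
    return [f"{page} Page, {time} Time" for page, time in product(MODES, MODES)]

# Hypothetical grid: URI -> sorted capture times.
grid = {
    "R1": ["t1", "t2", "t3"],
    "R2": ["t1", "t2", "t3"],
    "R3": ["t1", "t2", "t3"],
}

def fixed_page_sliding_time(grid, uri):
    """One page followed through time: every capture of a single URI."""
    return [(uri, t) for t in grid[uri]]

def sliding_page_fixed_time(grid, time):
    """Different pages captured at the same time."""
    return [(uri, time) for uri, times in grid.items() if time in times]

def sliding_page_sliding_time(grid):
    """Different pages at different times: the i-th URI at its i-th capture."""
    return [(uri, times[i % len(times)]) for i, (uri, times) in enumerate(grid.items())]

print(story_types())
```

For example, `fixed_page_sliding_time(grid, "R1")` follows one URI across all of its captures, while `sliding_page_sliding_time(grid)` moves along both axes at once.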
23
Fixed Page, Fixed Time
A desktop Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Android Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al. “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013 .
24
Feb 1 Feb 1 Feb 2
Feb 4 Feb 5 Feb 7
Feb 9 Feb 11 Feb 11
25
Fixed Page, Sliding Time
Feb. 11, 2011
Mubarak resigns
26
Sliding Page, Fixed Time
Jan 27 Jan 31
Feb 7 Feb 4
Feb 11 Feb 11
Feb 2
Jan 25
Feb 10
27
Sliding Page, Sliding Time
The Dark and Stormy Archives (DSA) framework
Establish a baseline
• Characteristics of human-generated stories
• Characteristics of Archive-It collections
Reduce the candidate pool of archived pages
• Exclude duplicates
• Exclude off-topic pages
• Exclude non-English language pages
Select good representative pages
• Dynamically slice the collection
• Cluster the pages in each slice
• Select high-quality pages from each cluster
• Order pages by time
• Visualize
28
https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
Establish a baseline of
social media stories
"Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
29
What is the length of a story
(the number of resources per story)?
This story has
31 resources
1
3
2
30
What are the types of resources that
compose a story?
Quotes
Video
31
This story has
• 19 quotes
• 8 images
• 4 videos
What are the most frequently used domains?
Twitter.com
Twitter.com
Twitter.com
32
This story has
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
Top 25 domains represent 92%
of all domains
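A domain breakdown like the one above can be computed by counting the host of each resource in a story. This is a small sketch under stated assumptions: the `story` list is made-up data, and folding `www.` into the bare domain is a simplification chosen here, not necessarily what the study did.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_shares(uris):
    """Return {domain: fraction of resources}, most frequent domain first."""
    domains = []
    for uri in uris:
        host = urlparse(uri).netloc.lower()
        # Treat www.example.com and example.com as the same domain.
        if host.startswith("www."):
            host = host[4:]
        domains.append(host)
    counts = Counter(domains)
    total = len(domains)
    return {domain: count / total for domain, count in counts.most_common()}

# Hypothetical story with ten resources.
story = (
    ["https://twitter.com/user/status/%d" % i for i in range(8)]
    + ["https://www.instagram.com/p/abc/", "https://www.facebook.com/page"]
)
shares = domain_shares(story)
print(shares)
```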
33
What differentiates a popular story?
(popular = stories with the top 25% of views)
19,795 views 64 views
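The popularity split can be illustrated by ranking stories on views and taking the top quarter. In this sketch only the two view counts 19,795 and 64 come from the slide; the other values and the story ids are invented for illustration, and the rounding rule for the cutoff is an assumption.

```python
def split_by_popularity(views, top_fraction=0.25):
    """Split story ids into (popular, unpopular) by view count.

    `views` maps a story id to its number of views. The top `top_fraction`
    of stories by views are labelled popular.
    """
    ranked = sorted(views, key=views.get, reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    return ranked[:cutoff], ranked[cutoff:]

# Hypothetical view counts for eight stories.
views = {"a": 19795, "b": 64, "c": 500, "d": 120, "e": 90, "f": 75, "g": 3000, "h": 10}
popular, unpopular = split_by_popularity(views)
print(popular, unpopular)
```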
34
The distributions for the features of the stories
• Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and
unpopular stories differ in most of the features
• Popular stories tend to have:
• more web elements (medians of 28 vs. 21)
• longer timespan (5 hours vs. 2 hours) than the unpopular stories
35
Do popular stories have a lower decay rate?
The 75th percentile of decay rate per popular story is 10% of the resources,
while it is 15% in the unpopular stories
36
We found that 28 mementos is a good
number for the resources in the stories.
37
Establish a baseline of current
Archive-It collections
"Characteristics of Social Media Stories. What makes a good story?", International Journal on Digital Libraries 2016.
38
The mean and median number of
URIs in a collection
This collection has 435 seed URIs
39
The mean and median number of
mementos per URI
This seed URI has 16 mementos
40
The most frequently used domains
abcnews.go.com
blogspot.com
This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
41
Archive-It top 25 is fundamentally
different than Storify top 25
42
Archive-It top 25 is fundamentally
different than Storify top 25
43
Twitter
is #10
not #1
What we archive and what we put in our
stories are different subsets of the web
44
Detecting off-topic pages
"Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
45
Archive-It provides their partners with tools
that allow them to build themed collections
46
Archive-It tools are about HTTP events /
mechanics, not “content”
47
Over 60% of archived versions of
hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
48
How do we automatically detect
off-topic pages?
49
We investigated 6 similarity metrics
• Textual Content
• cosine similarity of TF-IDF
• intersection of the 20 most frequent terms
• Jaccard similarity coefficient
• Semantics
• Web-based kernel function using a search engine (SE)
• Structural
• the change in number of words
• the change in content length
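The three textual measures in the list above compare a later memento against an earlier one. The sketch below is a simplified illustration, not the authors' implementation: it uses raw term frequencies for the cosine measure (no IDF weighting, which would need a corpus), a naive tokenizer, and made-up page texts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_tf(a, b):
    """Cosine similarity of raw term-frequency vectors (no IDF weighting)."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """Jaccard coefficient of the two term sets."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def tf_intersection(a, b, k=20):
    """Share of the k most frequent terms of `a` also among the top k of `b`."""
    top_a = {t for t, _ in Counter(tokenize(a)).most_common(k)}
    top_b = {t for t, _ in Counter(tokenize(b)).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0

# Hypothetical page texts: the first memento and a later error page.
first = "egypt revolution protest in tahrir square"
error_page = "account suspended database failure"
print(cosine_tf(first, error_page), jaccard(first, error_page), tf_intersection(first, error_page))
```

A page that shares no vocabulary with the first memento scores 0.0 on all three measures, which is the pattern a database-error or expired-domain page would be expected to produce.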
50
Textual content
cosine similarity, intersection of the most frequent terms,
Jaccard similarity
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
51
Textual content
cosine similarity, intersection of the most frequent terms,
Jaccard similarity
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
Method Similarity
cosine 0.0
TF-Intersection 0.0
Jaccard 0.0
52
Semantics of the text
Web based kernel function using the search engine (SE)
53
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
Semantics of the text
Web based kernel function using the search engine (SE)
Method Similarity
SE-Kernel 0.7
54
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
Structural methods
no. of words, content-length
100 109
Method % change
WordCount 0.09
55
Structural methods
no. of words, content-length
100 109
100 5
Method % change
WordCount 0.09
Method % change
WordCount -0.95
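The two values in the tables above are consistent with a relative change measured against the first memento. A minimal sketch, assuming that definition (the function name is illustrative):

```python
def relative_change(first, later):
    """Relative change of a size measure with respect to the first memento."""
    if first == 0:
        raise ValueError("first memento has size 0")
    return (later - first) / first

print(relative_change(100, 109))  # 100 words -> 109 words
print(relative_change(100, 5))    # 100 words -> 5 words
```

The same function applies to content length in bytes.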
56
We built a gold standard data set to
evaluate the methods
57
We manually labeled 15,760 mementos
Egypt Revolution and Politics
URI-Rs: 136
URI-Ms: 6,886
Off-topic URI-Ms: 384
Occupy Movement
URI-Rs: 255
URI-Ms: 6,570
Off-topic URI-Ms: 458
Columbia Univ. Human Rights collection
URI-Rs: 198
URI-Ms: 2,304
Off-topic URI-Ms: 94
58
Evaluated 6 methods at 21 thresholds
• Assumed first memento was on-topic
• Combined two methods ('OR') to find best
combination method
• 15 combinations
• 6,615 tests (15 combinations x 21 thresholds x 21
thresholds)
• Averaged the results at each threshold over the three
collections
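The test count follows from choosing 2 of the 6 methods and sweeping 21 thresholds for each. The sketch below reproduces that arithmetic and shows one way an 'OR' combination could be applied. The rule "flag when a score falls below its threshold" is an assumption made for illustration; the default thresholds are the (Cosine, WordCount) values reported in these slides.

```python
from itertools import combinations

# Method names as they appear in the results table.
METHODS = ["Cosine", "Jaccard", "TF-Intersection", "SEKernel", "WordCount", "Bytes"]
N_THRESHOLDS = 21

pairs = list(combinations(METHODS, 2))          # unordered pairs of methods
n_tests = len(pairs) * N_THRESHOLDS * N_THRESHOLDS

def is_off_topic(cosine, word_change, cosine_threshold=0.10, word_threshold=-0.85):
    """'OR' combination: flag the memento if either measure is below its threshold."""
    return cosine < cosine_threshold or word_change < word_threshold

print(len(pairs), n_tests)
```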
59
Cosine Similarity performed well
Similarity Measure     Threshold      FP   FN   FP+FN  ACC    F1     AUC
(Cosine,WordCount)     (0.10,-0.85)   24   10   34     0.987  0.906  0.968
(Cosine,SEKernel)      (0.10,0.00)    6    35   40     0.990  0.901  0.934
Cosine                 0.15           31   22   53     0.983  0.881  0.961
(WordCount,SEKernel)   (-0.80,0.00)   14   27   42     0.985  0.818  0.885
WordCount              -0.85          6    44   50     0.982  0.806  0.870
SEKernel               0.05           64   83   147    0.965  0.683  0.865
Bytes                  -0.65          28   133  161    0.962  0.584  0.746
Jaccard                0.05           74   86   159    0.962  0.538  0.809
TF-Intersection        0.00           49   104  153    0.967  0.537  0.740
60
Average precision of 0.89 on 18
Archive-It collections
61
(Cosine,WordCount) with (0.10,-0.85) thresholds
Detecting duplicates in a TimeMap
62
9 mementos for news.egypt.com,
but 5 are duplicates
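The slides do not say how duplicates are detected. One simple possibility, shown here purely as an assumption, is to hash the extracted text of each memento and drop a capture whose hash matches the previously kept one. The `timemap` data below is invented to mirror the 9-mementos, 5-duplicates example.

```python
import hashlib

def drop_duplicates(timemap):
    """Keep a memento only if its text differs from the previously kept one.

    `timemap` is a list of (timestamp, text) pairs sorted by timestamp.
    """
    kept = []
    last_digest = None
    for timestamp, text in timemap:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest != last_digest:
            kept.append((timestamp, text))
            last_digest = digest
    return kept

# Hypothetical TimeMap: 9 captures, only 4 distinct page versions.
texts = ["A", "A", "A", "B", "B", "C", "C", "D", "D"]
timemap = [(f"2011020{i + 1}000000", text) for i, text in enumerate(texts)]
unique = drop_duplicates(timemap)
print(len(timemap) - len(unique), "duplicates removed")
```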
63
How do we dynamically divide the
collections into appropriate slices?
64
We expected to see more like this…
The Global Food Crisis collection at Archive-It
65
This is what we found
Egypt Revolution and Politics
Human Rights
April 16 Archive Virginia Tech Shooting
Jasmine Revolution 2011 Wikileaks Document Release
66
Selecting representative pages for
generating stories
(skipping clustering details, but goal is k=28)
67
Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as
follows:
Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI level
• Sqc is the snippet quality based on URI category
• wm, wql, wqc are the weights of memento damage, level,
and category
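The formula above translates directly into a scoring function. In this sketch the weights default to 1.0 and the candidate scores are made up, since the slides do not give the actual weight values or score scales.

```python
def memento_quality(damage, level_score, category_score, w_m=1.0, w_ql=1.0, w_qc=1.0):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc."""
    return (1 - w_m * damage) + w_ql * level_score + w_qc * category_score

# Hypothetical candidates in one cluster: name -> (Dm, Sql, Sqc).
candidates = {
    "damaged_article": (0.6, 1.0, 1.0),
    "clean_article": (0.1, 1.0, 1.0),
    "clean_homepage": (0.1, 0.2, 1.0),
}
best = max(candidates, key=lambda name: memento_quality(*candidates[name]))
print(best)
```

Lower damage and higher snippet scores both raise Mq, so the least damaged deep link wins among these candidates.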
68
We prefer a higher quality memento (Dm)
http://wayback.archive-it.org/2358/20110201231457/
http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/
http://www.bbc.co.uk/news/world/middle_east/
69
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
We consider the page that gives an
attractive snippet
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
70
We prefer deep links over
high level domains (Sql)
Feb. 11, 2011: the homepage of BBC on Storify
Feb. 11, 2011: the homepage of BBC Middle East section on Storify
Feb. 11, 2011: the article of BBC on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
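The memento URIs above follow the pattern collection id, 14-digit capture timestamp, original URI, so the original URI and its depth can be recovered from the URI-M alone. The sketch below parses that pattern and counts path segments. Using the segment count as a stand-in for the URI level is an assumption for illustration; the slides do not define how Sql is scored.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Archive-It memento URI: /<collection id>/<YYYYMMDDhhmmss>/<original URI>
MEMENTO_RE = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")

def parse_memento(uri_m):
    """Split a memento URI into (collection id, capture datetime, original URI)."""
    match = MEMENTO_RE.match(uri_m)
    if match is None:
        raise ValueError(f"not an Archive-It memento URI: {uri_m}")
    collection, stamp, uri_r = match.groups()
    return collection, datetime.strptime(stamp, "%Y%m%d%H%M%S"), uri_r

def path_depth(uri_r):
    """Number of non-empty path segments; 0 means a homepage."""
    return len([segment for segment in urlparse(uri_r).path.split("/") if segment])

homepage = "https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/"
article = "https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045"
for uri_m in (homepage, article):
    collection, captured, uri_r = parse_memento(uri_m)
    print(collection, captured.date(), path_depth(uri_r), uri_r)
```

Note that segment count alone would rank the Middle East section page (three segments) above the article (two segments), so a real level score needs more than depth.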
71
Social media pages may not produce
good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
72
Visualizing stories in Storify
73
Remember Yasmin’s hand-crafted stories?
74
Remember Yasmin’s hand-crafted stories?
75
We extract the metadata of the pages
and order them chronologically
{
  "elements": [
    {
      "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
      "type": "link",
      "source": {"href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
      "type": "link",
      "source": {"href": "http://www.time.com", "name": "www.time.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
      "type": "link",
      "source": {"href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
      "type": "link",
      "source": {"href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007"}
    },
    …
    {
      "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
      "type": "link",
      "source": {"href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007"}
    }
  ],
  "description": "This is an automatically generated story from Archive-It collection.",
  "title": "April 16 Archive"
}
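A JSON object of this shape can be assembled by sorting the selected mementos by capture time and formatting each source name as host plus date. The following is a sketch only: the function names and the input tuple layout are assumptions, and the date format string was chosen to match names such as "www.usatoday.com @ 23, May 2007".

```python
import json
from datetime import datetime
from urllib.parse import urlparse

def story_element(permalink, uri_r, captured):
    """Build one story element for a memento."""
    host = urlparse(uri_r).netloc
    return {
        "permalink": permalink,
        "type": "link",
        "source": {
            "href": f"http://{host}",
            "name": f"{host} @ {captured.strftime('%d, %b %Y')}",
        },
    }

def build_story(title, mementos):
    """`mementos` is a list of (permalink, original URI, capture datetime) tuples."""
    ordered = sorted(mementos, key=lambda m: m[2])
    return {
        "elements": [story_element(*m) for m in ordered],
        "description": "This is an automatically generated story from Archive-It collection.",
        "title": title,
    }

# Two mementos from the example, deliberately given out of order.
mementos = [
    ("http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
     "http://www.time.com/time/specials/2007/vatech_victims", datetime(2007, 5, 30, 18, 21, 59)),
    ("http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
     "http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm", datetime(2007, 5, 23, 18, 21, 34)),
]
story = build_story("April 16 Archive", mementos)
print(json.dumps(story, indent=2))
```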
76
We override the default
metadata to generate more
attractive snippets
Example of an automatically generated story
77
Notice the good
metadata: images,
titles with dates,
favicons
Evaluating the Dark and Stormy
Archives framework
78
What a successful evaluation looks like!
• We use Amazon's Mechanical Turk to compare the
following stories:
• Human-generated stories
• DSA (automatically) generated stories
• Randomly generated stories
• Successful evaluation should result in:
• Human and DSA stories are indistinguishable
• Human and DSA stories are better than Random
79
Our guidelines for expert archivists at Archive-It
for generating stories from the collections
80
We received 23 stories for 10
Archive-It collections
SPST is “Sliding Page, Sliding Time”
SPFT is “Sliding Page, Fixed Time”
FPST is “Fixed Page, Sliding Time”
81
https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b
82
• Generated by domain experts
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
Automatically generated stories from
archived collections
1. Obtain the seed list and the TimeMap of URIs from the front-end
interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally
hosted at ODU) and download the collections that we do not have in
the ODU mirror from Archive-It
3. Extract the text of the page using the Boilerpipe library
4. Eliminate the off-topic pages based on the best-performing method
(Cosine, WordCount) with the suggested thresholds (0.1, −0.85)
5. Exclude the duplicates of each TimeMap
6. Eliminate the non-English language pages
7. Slice the collection dynamically and then cluster the mementos of
each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically then put them and their
metadata in a JSON object
83
https://storify.com/mturk_exp/3649b0s
84
• Automatically generated story
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
Random stories
28 mementos were randomly selected from each
collection before excluding off-topic and duplicate
pages
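The random baseline amounts to a uniform sample of 28 mementos from the unfiltered collection. A minimal sketch, assuming mementos are (timestamp, URI) pairs; ordering the sample chronologically and the `seed` parameter are choices made here for illustration and are not stated on the slide.

```python
import random

def random_story(mementos, k=28, seed=None):
    """Pick k mementos uniformly at random, then order them by capture time."""
    rng = random.Random(seed)
    chosen = rng.sample(mementos, min(k, len(mementos)))
    return sorted(chosen, key=lambda m: m[0])

# Hypothetical unfiltered collection of 100 mementos.
pool = [(i, f"uri-{i}") for i in range(100)]
baseline = random_story(pool, seed=42)
print(len(baseline))
```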
85
https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc
86
• Randomly generated story
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
https://storify.com/mturk_exp/3649bads
87
if someone prefers this story,
we exclude their results
• Poorly generated story
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two
comparisons per HIT:
• HIT1: human vs. automatic, human vs. poor
• HIT2: human vs. random, human vs. poor
• HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with the Master qualification (high acceptance rate)
for each HIT
• We rejected submissions that preferred the poorly generated stories and
HITs that were completed in less than 10 seconds (mean time per
HIT = 7 minutes)
• 989 out of 1,035 (69*15) valid HITs
• We awarded the turker $0.50 per HIT
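The counts in the setup above can be checked with a few lines, and the two rejection rules can be written as a filter. The field names `chose_poor_story` and `seconds` are hypothetical; the slides do not describe the data format of a submission.

```python
STORIES = 23
HITS_PER_STORY = 3
TURKERS_PER_HIT = 15

total_hits = STORIES * HITS_PER_STORY             # HITs needed to cover all stories
total_submissions = total_hits * TURKERS_PER_HIT  # one submission per turker per HIT

def is_valid(submission, min_seconds=10):
    """Reject a submission that preferred the poor story or finished too quickly."""
    return not submission["chose_poor_story"] and submission["seconds"] >= min_seconds

valid_share = 989 / total_submissions
print(total_hits, total_submissions, round(valid_share, 3))
```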
88
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
A sample HIT
89
DSA == Human
(Human,DSA) > Random
90
Automatic versus Human
91
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
Human versus Random
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
92
Automatic versus Random
93
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
Success!
DSA-generated stories are just as good as stories
generated by human experts
94
Use interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched
stories
95
All the code, datasets,
papers, slides, etc.:
http://bit.ly/YasminPhD

 
VIP Girls Available Call or WhatsApp 9711199012
VIP Girls Available Call or WhatsApp 9711199012VIP Girls Available Call or WhatsApp 9711199012
VIP Girls Available Call or WhatsApp 9711199012ankitnayak356677
 
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书Fi L
 

Recently uploaded (20)

Vashi Escorts, {Pooja 09892124323}, Vashi Call Girls
Vashi Escorts, {Pooja 09892124323}, Vashi Call GirlsVashi Escorts, {Pooja 09892124323}, Vashi Call Girls
Vashi Escorts, {Pooja 09892124323}, Vashi Call Girls
 
26042024_First India Newspaper Jaipur.pdf
26042024_First India Newspaper Jaipur.pdf26042024_First India Newspaper Jaipur.pdf
26042024_First India Newspaper Jaipur.pdf
 
23042024_First India Newspaper Jaipur.pdf
23042024_First India Newspaper Jaipur.pdf23042024_First India Newspaper Jaipur.pdf
23042024_First India Newspaper Jaipur.pdf
 
College Call Girls Kolhapur Aanya 8617697112 Independent Escort Service Kolhapur
College Call Girls Kolhapur Aanya 8617697112 Independent Escort Service KolhapurCollege Call Girls Kolhapur Aanya 8617697112 Independent Escort Service Kolhapur
College Call Girls Kolhapur Aanya 8617697112 Independent Escort Service Kolhapur
 
Minto-Morley Reforms 1909 (constitution).pptx
Minto-Morley Reforms 1909 (constitution).pptxMinto-Morley Reforms 1909 (constitution).pptx
Minto-Morley Reforms 1909 (constitution).pptx
 
Call Girls in Mira Road Mumbai ( Neha 09892124323 ) College Escorts Service i...
Call Girls in Mira Road Mumbai ( Neha 09892124323 ) College Escorts Service i...Call Girls in Mira Road Mumbai ( Neha 09892124323 ) College Escorts Service i...
Call Girls in Mira Road Mumbai ( Neha 09892124323 ) College Escorts Service i...
 
HARNESSING AI FOR ENHANCED MEDIA ANALYSIS A CASE STUDY ON CHATGPT AT DRONE EM...
HARNESSING AI FOR ENHANCED MEDIA ANALYSIS A CASE STUDY ON CHATGPT AT DRONE EM...HARNESSING AI FOR ENHANCED MEDIA ANALYSIS A CASE STUDY ON CHATGPT AT DRONE EM...
HARNESSING AI FOR ENHANCED MEDIA ANALYSIS A CASE STUDY ON CHATGPT AT DRONE EM...
 
AP Election Survey 2024: TDP-Janasena-BJP Alliance Set To Sweep Victory
AP Election Survey 2024: TDP-Janasena-BJP Alliance Set To Sweep VictoryAP Election Survey 2024: TDP-Janasena-BJP Alliance Set To Sweep Victory
AP Election Survey 2024: TDP-Janasena-BJP Alliance Set To Sweep Victory
 
Brief biography of Julius Robert Oppenheimer
Brief biography of Julius Robert OppenheimerBrief biography of Julius Robert Oppenheimer
Brief biography of Julius Robert Oppenheimer
 
Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!
 
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptxLorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
 
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
 
Manipur-Book-Final-2-compressed.pdfsal'rpk
Manipur-Book-Final-2-compressed.pdfsal'rpkManipur-Book-Final-2-compressed.pdfsal'rpk
Manipur-Book-Final-2-compressed.pdfsal'rpk
 
25042024_First India Newspaper Jaipur.pdf
25042024_First India Newspaper Jaipur.pdf25042024_First India Newspaper Jaipur.pdf
25042024_First India Newspaper Jaipur.pdf
 
Roberts Rules Cheat Sheet for LD4 Precinct Commiteemen
Roberts Rules Cheat Sheet for LD4 Precinct CommiteemenRoberts Rules Cheat Sheet for LD4 Precinct Commiteemen
Roberts Rules Cheat Sheet for LD4 Precinct Commiteemen
 
Different Frontiers of Social Media War in Indonesia Elections 2024
Different Frontiers of Social Media War in Indonesia Elections 2024Different Frontiers of Social Media War in Indonesia Elections 2024
Different Frontiers of Social Media War in Indonesia Elections 2024
 
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
 
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptxKAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
 
VIP Girls Available Call or WhatsApp 9711199012
VIP Girls Available Call or WhatsApp 9711199012VIP Girls Available Call or WhatsApp 9711199012
VIP Girls Available Call or WhatsApp 9711199012
 
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
 

Summarizing Archives with Stories

  • 1. Summarizing archival collections using storytelling techniques. Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson. Old Dominion University, Web Science & Digital Libraries Research Group. www.cs.odu.edu/~mln/ @phonedude_mln. Research funded by IMLS LG-71-15-0077-15. Dodging the Memory Hole, Los Angeles, CA, 2016-10-14
  • 2. Archive-It, a subscription-based service, allows creation of collections: > 3,000 collections, ~340 institutions, > 10B archived pages
  • 3. Collection title; collection categorization based on the curator; seed URI; metadata about the collection; text search box; the group that the resource belongs to; list of the seed URIs; timespan of the resource and the number of times it has been captured
  • 4. Collection understanding and collection summarization are not currently supported. It is not easy to answer “what’s in that collection?” or “how is this collection different from others?”
  • 5. There is more than one collection about the “Egyptian Revolution” • “2010-2011 Arab Spring” https://archive-it.org/collections/3101 • “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349 • “Egypt Revolution and Politics” https://archive-it.org/collections/2358
  • 9. Our early attempts at collection understanding tried to include everything… “Visualizing digital collections at Archive-It”, JCDL 2012. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 10. 1000s of seeds × 1000s of archived pages == conventional visualization methods not applicable
  • 12. Stories in literature. Story elements: setting, characters, sequence, exposition, conflict, climax, resolution. “Once upon a time” http://www.learner.org/interactives/story/
  • 13. Stories in social media: “It's hard to define a story, but I know it when I see it” (Alexander, 2008). Basically, just arranging web pages in time
  • 14. “Storytelling” is becoming a popular technique in social media
  • 15. What are the limitations of storytelling services?
  • 16. The Egyptian Revolution on Storify
  • 18. Despite these limitations, how do we combine storytelling & archives?
  • 19. Use an interface people already know how to use to summarize collections. Archived collections + storytelling services = archived enriched stories
  • 20. We sample k mementos from N (k << N) pages of the collection to create a summary story (diagram: stories S1, S2, S3, … drawn from Collection X, Collection Y, and Collection Z)
  • 21. Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
  • 22. How do we generate this automatically?
  • 23. Collections have two dimensions: {Fixed, Sliding} X {Page, Time} (diagram: a grid of mementos with one row per URI, R1 … Rk, each with captures 1 … n, laid out along a time axis t1 … tk; axes: URI, Time)
  • 24. Fixed Page, Fixed Time. A desktop Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 An Android Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013. Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
  • 25. Fixed Page, Sliding Time (captures dated Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, Feb 11)
  • 26. Sliding Page, Fixed Time (Feb. 11, 2011: Mubarak resigns)
  • 27. Sliding Page, Sliding Time (captures dated Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, Feb 11)
  • 28. The Dark and Stormy Archives (DSA) framework • Establish a baseline: characteristics of human-generated stories; characteristics of Archive-It collections • Reduce the candidate pool of archived pages: exclude duplicates; exclude off-topic pages; exclude non-English language pages • Select good representative pages: dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize. https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
  • 29. Establish a baseline of social media stories. “Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
  • 30. What is the length of a story (the number of resources per story)? This story has 31 resources
  • 31. What are the types of resources that compose a story? This story has • 19 quotes • 8 images • 4 videos
  • 32. What are the most frequently used domains? This story has • 90% twitter.com • 7% instagram.com • 3% facebook.com
  • 33. The top 25 domains represent 92% of all domains
  • 34. What differentiates a popular story? (popular = stories with the top 25% of views) Examples: 19,795 views vs. 64 views
  • 35. The distributions for the features of the stories • Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories differ in most of the features • Popular stories tend to have more web elements (medians of 28 vs. 21) and a longer timespan (5 hours vs. 2 hours) than the unpopular stories
  • 36. Do popular stories have a lower decay rate? The 75th percentile of decay rate per popular story is 10% of the resources, while it is 15% in the unpopular stories
  • 37. We found that 28 mementos is a good number for the resources in the stories.
  • 38. Establish a baseline of current Archive-It collections. “Characteristics of Social Media Stories. What makes a good story?”, International Journal on Digital Libraries 2016.
  • 39. The mean and median number of URIs in a collection. This collection has 435 seed URIs
  • 40. The mean and median number of mementos per URI. This seed URI has 16 mementos
  • 41. The most frequently used domains (abcnews.go.com, blogspot.com). This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
  • 42. Archive-It top 25 is fundamentally different than Storify top 25
  • 43. Archive-It top 25 is fundamentally different than Storify top 25: Twitter is #10, not #1
  • 44. What we archive and what we put in our stories are different subsets of the web
  • 45. Detecting off-topic pages. “Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
  • 46. Archive-It provides their partners with tools that allow them to build themed collections
  • 47. Archive-It tools are about HTTP events / mechanics, not “content”
  • 48. Over 60% of archived versions of hamdeensabahy.com are off-topic • May 13, 2012: The page started as on-topic. • May 24, 2012: Off-topic due to a database error. • Mar. 21, 2013: Not working because of financial problems. • May 21, 2013: On-topic again. • June 5, 2014: The site has been hacked. • Oct. 10, 2014: The domain has expired. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 49. How do we automatically detect off-topic pages?
  • 50. We investigated 6 similarity metrics • Textual content: cosine similarity of TF-IDF; intersection of the 20 most frequent terms; Jaccard similarity coefficient • Semantics: Web-based kernel function using a search engine (SE) • Structural: the change in number of words; the change in content length
  • 51. Textual content (cosine similarity, intersection of the most frequent terms, Jaccard similarity). Scores for two similar mementos: cosine 0.7, TF-Intersection 0.6, Jaccard 0.5
  • 52. Textual content (cosine similarity, intersection of the most frequent terms, Jaccard similarity). Scores for two similar mementos: cosine 0.7, TF-Intersection 0.6, Jaccard 0.5. Scores when one memento is off-topic: cosine 0.0, TF-Intersection 0.0, Jaccard 0.0
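The three textual measures on slides 50 to 52 can be sketched in a few lines of Python. This is a simplified illustration, not the DSA implementation: it assumes a plain lowercase word tokenizer, uses raw term frequencies for the cosine measure instead of TF-IDF (which needs a background corpus), and normalizes the top-term intersection by the size of the larger top-term set. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_tf(a, b):
    """Cosine similarity over raw term frequencies (simplification of TF-IDF)."""
    ta, tb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient: shared distinct terms over all distinct terms."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_term_overlap(a, b, n=20):
    """Fraction of the n most frequent terms that the two texts share."""
    top_a = {t for t, _ in Counter(tokens(a)).most_common(n)}
    top_b = {t for t, _ in Counter(tokens(b)).most_common(n)}
    size = max(len(top_a), len(top_b))
    return len(top_a & top_b) / size if size else 0.0
```

Each function takes the extracted text of the first memento and of a later memento of the same URI; values near 1 suggest the later memento is still on-topic, and values near 0 suggest it has drifted.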
  • 53. Semantics of the text: Web-based kernel function using the search engine (SE). Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006
  • 54. Semantics of the text: Web-based kernel function using the search engine (SE). SE-Kernel similarity: 0.7. Sahami and Heilman, “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets”, WWW 2006
  • 55. Structural methods (number of words, content length). Word counts 100 and 109: WordCount change 0.09
  • 56. Structural methods (number of words, content length). Word counts 100 and 109: WordCount change 0.09. Word counts 100 and 5: WordCount change -0.95
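The WordCount values on slides 55 and 56 are relative changes with respect to the first memento. A minimal sketch, with a hypothetical function name, that reproduces both numbers:

```python
def relative_change(first, current):
    """Relative change of a count (words or bytes) against the first memento."""
    if first == 0:
        raise ValueError("first memento has a zero count; change is undefined")
    return (current - first) / first


# Slide 55: 100 words, then 109 words.
grew = relative_change(100, 109)
# Slide 56: 100 words, then 5 words.
shrank = relative_change(100, 5)
```

A large negative value, such as the drop from 100 words to 5, is the signal the structural methods look for.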
  • 57. We built a gold standard data set to evaluate the methods
  • 58. We manually labeled 15,760 mementos • Egypt Revolution and Politics: 136 URI-Rs, 6,886 URI-Ms, 384 off-topic URI-Ms • Occupy Movement: 255 URI-Rs, 6,570 URI-Ms, 458 off-topic URI-Ms • Columbia Univ. Human Rights collection: 198 URI-Rs, 2,304 URI-Ms, 94 off-topic URI-Ms
  • 59. Evaluated 6 methods at 21 thresholds • Assumed first memento was on-topic • Combined two methods ('OR') to find best combination method • 15 combinations • 6,615 tests (15 combinations x 21 thresholds x 21 thresholds) • Averaged the results at each threshold over the three collections
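The test count on slide 59 follows from choosing 2 of the 6 methods and trying every threshold pair. A short check of that arithmetic; the method names are taken from the results table on slide 60, and the 21 threshold values themselves are not given in the slides, so only their count is used:

```python
from itertools import combinations

methods = ["Cosine", "TF-Intersection", "Jaccard", "SEKernel", "WordCount", "Bytes"]
thresholds_per_method = 21  # count from slide 59; actual values are not listed

# Every unordered pair of distinct methods, combined with 'OR'.
pairs = list(combinations(methods, 2))

# Each pair is evaluated at every (threshold_a, threshold_b) combination.
total_tests = len(pairs) * thresholds_per_method * thresholds_per_method
```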
  • 60. Cosine Similarity performed well
      Similarity Measure      Threshold       FP   FN   FP+FN   ACC     F1      AUC
      (Cosine,WordCount)      (0.10,-0.85)    24   10   34      0.987   0.906   0.968
      (Cosine,SEKernel)       (0.10,0.00)     6    35   40      0.990   0.901   0.934
      Cosine                  0.15            31   22   53      0.983   0.881   0.961
      (WordCount,SEKernel)    (-0.80,0.00)    14   27   42      0.985   0.818   0.885
      WordCount               -0.85           6    44   50      0.982   0.806   0.870
      SEKernel                0.05            64   83   147     0.965   0.683   0.865
      Bytes                   -0.65           28   133  161     0.962   0.584   0.746
      Jaccard                 0.05            74   86   159     0.962   0.538   0.809
      TF-Intersection         0.00            49   104  153     0.967   0.537   0.740
  • 61. Average precision of 0.89 on 18 Archive-It collections, using (Cosine,WordCount) with (0.10,-0.85) thresholds
  • 62. Detecting duplicates in a TimeMap
  • 63. 9 mementos for news.egypt.com, but 5 are duplicates
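The slides do not say how duplicates are identified. As one simple possibility, and purely as an assumption for illustration, the sketch below keeps the first memento for each distinct extracted text, comparing whitespace-normalized text by SHA-256 digest. The function name and the `(datetime, text)` tuple layout are hypothetical.

```python
import hashlib


def drop_duplicates(timemap):
    """Keep the earliest memento for each distinct page text.

    timemap: list of (memento_datetime, extracted_text) tuples, assumed to be
    sorted by time. Exact-match hashing is an assumption, not the DSA method.
    """
    seen = set()
    kept = []
    for when, text in timemap:
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((when, text))
    return kept
```

With a TimeMap like the one on slide 63, where 5 of 9 mementos repeat earlier content, a filter of this kind would keep 4 mementos.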
  • 64. How do we dynamically divide the collections into appropriate slices?
  • 65. We expected to see more like this… (The Global Food Crisis collection at Archive-It)
  • 66. This is what we found: Egypt Revolution and Politics; Human Rights; April 16 Archive Virginia Tech Shooting; Jasmine Revolution 2011; Wikileaks Document Release
  • 67. Selecting representative pages for generating stories (skipping clustering details, but goal is k=28)
  • 68. Quality metrics for selecting mementos • In the DSA, memento quality Mq is calculated as follows: Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc • Dm is the memento damage (Brunelle, JCDL 2014) • Sql is the snippet quality based on the URI level • Sqc is the snippet quality based on URI category • wm, wql, wqc are the weights of memento damage, level, and category
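The quality formula on slide 68 translates directly into code. In this sketch the function name is hypothetical, and the weights are left as required parameters because the slides do not state their values; the numbers in the usage lines are made up for illustration only.

```python
def memento_quality(dm, sql, sqc, wm, wql, wqc):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc, as written on slide 68.

    dm:  memento damage (higher means more damaged)
    sql: snippet quality from the URI level
    sqc: snippet quality from the URI category
    wm, wql, wqc: weights for damage, level, and category
    """
    return (1 - wm * dm) + wql * sql + wqc * sqc


# Illustrative values only (not from the slides): pick the best of two candidates.
candidates = {
    "deep-link article": memento_quality(dm=0.2, sql=1.0, sqc=0.5, wm=1, wql=1, wqc=1),
    "damaged homepage": memento_quality(dm=0.6, sql=0.2, sqc=0.5, wm=1, wql=1, wqc=1),
}
best = max(candidates, key=candidates.get)
```

Because damage enters with a negative sign, a less damaged memento with a deeper link scores higher, which matches the preferences stated on the following slides.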
  • 69. We prefer a higher quality memento (Dm) http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/ Brunelle et al., “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
  • 70. We consider the page that gives an attractive snippet https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/ https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
  • 71. We prefer deep links over high level domains (Sql) • Feb. 11, 2011: the homepage of BBC on Storify https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/ • Feb. 11, 2011: the homepage of BBC Middle East section on Storify https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/ • Feb. 11, 2011: the article of BBC on Storify https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
  • 72. Social media pages may not produce good snippets (Sqc) http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/ http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
  • 76. We extract the metadata of the pages and order them chronologically. We override the default metadata to generate more attractive snippets.
      {
        "elements": [
          { "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm", "type": "link", "source": { "href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims", "type": "link", "source": { "href": "http://www.time.com", "name": "www.time.com @ 30, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/", "type": "link", "source": { "href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/", "type": "link", "source": { "href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007" } },
          …
          { "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/", "type": "link", "source": { "href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007" } }
        ],
        "description": "This is an automatically generated story from Archive-It collection.",
        "title": "April 16 Archive"
      }
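A story object of the shape shown on slide 76 could be assembled roughly as follows. This is a sketch under stated assumptions: the capture time is read from the 14-digit timestamp in the memento URI, the field names mirror the slide, and the helper names (`story_element`, `build_story`) are hypothetical rather than part of the DSA code or of any Storify API.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Memento URIs on the slide look like .../<collection>/<14-digit time>/<original URI>
MEMENTO_RE = re.compile(r"/(\d{14})/(https?://.+)$")


def story_element(urim):
    """Return (capture_datetime, element_dict) for one memento URI."""
    match = MEMENTO_RE.search(urim)
    if match is None:
        raise ValueError("not a recognizable memento URI: " + urim)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    host = urlsplit(match.group(2)).netloc
    label = captured.strftime("%d, %b %Y")  # e.g. "23, May 2007"
    element = {
        "permalink": urim,
        "type": "link",
        "source": {"href": "http://" + host, "name": host + " @ " + label},
    }
    return captured, element


def build_story(urims, title, description):
    """Order the selected mementos chronologically and wrap them in a story."""
    dated = sorted((story_element(u) for u in urims), key=lambda pair: pair[0])
    return {
        "elements": [element for _, element in dated],
        "description": description,
        "title": title,
    }
```

Passing the result to `json.dumps` yields text in the same layout as the slide. The month abbreviation produced by `%b` depends on the process locale, so the label matches the slide only under an English or default C locale.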
  • 77. Example of an automatically generated story. Notice the good metadata: images, titles with dates, favicons
  • 78. Evaluating the Dark and Stormy Archive framework
  • 79. What a successful evaluation looks like! • We use Amazon's Mechanical Turk to compare the following stories: human-generated stories; DSA (automatically) generated stories; randomly generated stories • A successful evaluation should result in: human and DSA stories are indistinguishable; human and DSA stories are better than random
  • 80. Our guidelines for expert archivists at Archive-It for generating stories from the collections
  • 81. We received 23 stories for 10 Archive-It collections. SPST is “Sliding Page, Sliding Time”; SPFT is “Sliding Page, Fixed Time”; FPST is “Fixed Page, Sliding Time”
  • 82. • Generated by domain experts • Sliding Page, Sliding Time • The Boston Marathon Bombing collection https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b
  • 83. Automatically generated stories from archived collections
      1. Obtain the seed list and the TimeMap of URIs from the front-end interface of Archive-It
      2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download the collections that we do not have in the ODU mirror from Archive-It
      3. Extract the text of the page using the Boilerpipe library
      4. Eliminate the off-topic pages based on the best-performing method ((Cosine, WordCount) with the suggested thresholds (0.1, −0.85))
      5. Exclude the duplicates of each TimeMap
      6. Eliminate the non-English language pages
      7. Slice the collection dynamically and then cluster the mementos of each slice using the DBSCAN algorithm
      8. Apply the quality metrics to select the best representative pages
      9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
  • 84. • Automatically generated story • Sliding Page, Sliding Time • The Boston Marathon Bombing collection https://storify.com/mturk_exp/3649b0s
  • 85. Random stories: 28 mementos were randomly selected from each collection before excluding off-topic and duplicate pages
  • 86. • Randomly generated story • Sliding Page, Sliding Time • The Boston Marathon Bombing collection https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc
  • 87. • Poorly generated story • Sliding Page, Sliding Time • The Boston Marathon Bombing collection • If someone prefers this story, we exclude their results https://storify.com/mturk_exp/3649bads
  • 88. MT experiment setup • Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT: HIT1: human vs. automatic, human vs. poor; HIT2: human vs. random, human vs. poor; HIT3: random vs. automatic, automatic vs. poor • 15 distinct turkers with the master qualification (a high acceptance rate) for each HIT • We rejected submissions that preferred the poorly generated stories and HITs that were completed in less than 10 seconds (mean time per HIT = 7 minutes) • 989 out of 1,035 (69*15) HITs were valid • We awarded the turker $0.50 per HIT https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
  • 91. Automatic versus Human (panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time)
  • 92. Human versus Random (panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time)
  • 93. Automatic versus Random (panels: Sliding Page, Sliding Time; Sliding Page, Fixed Time; Fixed Page, Sliding Time)
  • 94. Success! DSA-generated stories are just as good as stories generated by human experts
  • 95. Use an interface people already know how to use to summarize collections. Archived collections + storytelling services = archived enriched stories. All the code, datasets, papers, slides, etc.: http://bit.ly/YasminPhD

Editor's Notes

  1. First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content. 
  2. Lori created the collections and entered metadata about them (description, title, etc.). This is collection-level metadata, but it doesn’t help a lot. Archive-It provides faceted browsing and search services on the resulting collection.
  3. There are about 3 or 4 collections about the Egyptian Revolution in Archive-It. If I want to know about the Egyptian Revolution, which collection should I browse? A collection has two dimensions: URIs, and copies of these URIs. A historian with more than one collection will not know where to start.
  4. Even with these visualizations, we still do not know what the content of these collections is. Users have to go through the mementos manually to understand the collection, so the user has to inspect a lot of things by hand.
  5. We concluded that the conventional visualization methods, in which we try to visualize everything in the collection, are not applicable.
  6. So how about using storytelling?
  7. Every story is made up of a set of events. Stories in literature have elements such as setting, characters, sequence, etc. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc. What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
  8. The definition of a story in social media is much looser and more relaxed. In social media, it is more about arranging resources through time. Storytelling may be seen as the set of cultural practices for representing events chronologically.
  9. Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc. and arranging them in chronological order to support a particular narrative or “story”. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc.
  10. Storytelling looks promising, but what are the problems of storytelling?
  11. This is an example story for the Egyptian Revolution on Storify. Storify is a storytelling service that lets the user create stories or narratives using social media and web pages. Storify was launched in September 2010, and has been open to the public since April 2011. “Storytelling” is best typified by the company Storify. http://storify.com/nzherald/mu http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705546
  12. The problem is that Storify operates as bookmarking; it doesn’t preserve the links. You have no clue what the person is saying about the link. http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523 http://storify.com/nzherald/mu
  13. So what we want to do is create persistent stories, then visualize them using a storytelling tool that users already know, such as Storify. We will integrate the storytelling services and the archived collections to generate archived enriched stories.
  14. So if this is the web, the archived collections are subsets of the web, and we will sample from these collections to create a story. Then we place those generated samples in a social media interface that people already know: Storify.
  15. I went through these collections and sampled what I thought were interesting pages, ordered them by time, and put them on Storify so Yousof and his generation can see them later. It took me hours to select the resources in these handcrafted stories. Although I know the Egyptian Revolution very well, it wasn’t easy to select these pages from all the pages in the collection to represent the story.
  16. For example, here is the same page; it differs between desktop and mobile. The archives typically don’t have those versions, so currently we can’t generate this story. http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 http://america.aljazeera.com/ Personalized Web resources offer different representations based on the user-agent string and other values in the HTTP request headers, GeoIP, and other environmental factors. Currently web archives don’t support browsing different representations. This means Web crawlers capturing content for archives may receive representations based on the crawl environment, which will differ from the representations returned to interactive users.
  17. For example, here is the CNN blog as it evolved over time. You can get a sense of how the story evolved through time by looking at the images here. http://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ http://wayback.archive-it.org/2358/*/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
  18. For example, this is Feb. 11. We can see how CNN reported it, and we can see how the BBC covered the news. Here is Feb. 11 from different news sites. This story is very important for humanities researchers. https://wayback.archive-it.org/2358/20110211074248/http://www.globalpost.com/dispatch/egypt/110210/mubarak-resign-obama-egypt https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/ https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045 https://wayback.archive-it.org/2358/20110211192142/http://www.modernegypt.info/ https://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ https://wayback.archive-it.org/2358/20110211191423/http://www.arabist.net/ https://wayback.archive-it.org/2358/20110211194239/http://www.globalpost.com/dispatch/egypt/110211/mubarak-quits-resigns-egypt-cairo
  19. And here I want to get the broadest coverage possible for the Egyptian Revolution by sampling from the entire collection. For example, we can see here the news from CNN about shutting down the Internet, and also the news from the BBC about Mubarak resigning on Feb. 11.
  20. To generate these stories, we introduced the Dark and Stormy Archives framework. The framework has three main components. First, establishing a baseline of human-generated stories and Archive-It collections.
  21. First, we check the human-generated stories using stories from Storify, to have a descriptive model of what good stories look like.
  22. So we quantified the number of resources in the stories.
  23. Twitter is the most popular domain in Storify stories, and you can notice here that Twitter dominates the top list with a large percentage.
  24. We looked at what makes a good story.
  25. We looked at five things, and we found two things: popular stories tend to have more web elements (medians of 28 vs. 21) and a longer timespan (5 hours vs. 2 hours) than the unpopular stories. They also have longer editing time intervals than the unpopular stories.
  26. It shows that the resources of the popular stories tend to stay longer than the resources of the unpopular ones.
  27. This is the most important thing that you need to remember. This will be used as a template for our automatically generated stories.
  28. What we archive and what we put in our stories are different subsets of the web
  29. Archive-It provides its partners with tools that allow them to build themed collections of archived web pages hosted on Archive-It's machines. The user manually specifies a set of seeds, Uniform Resource Identifiers (URIs) that should be crawled periodically (the frequency is tunable by the user), and to what depth (e.g., follow the pages linked to from the seeds two levels out).
  30. Archive-It gives curators information about HTTP events, file types, and so on. However, the tools are currently focused on issues such as the mechanics of HTTP (e.g., how many HTML files vs. PDFs, how many HTTP 404 responses) and domain information (e.g., how many .uk sites vs. .com sites). Currently, there are no content-based tools that allow curators to detect when seed URIs go off-topic.
  31. Here is a dude running for office. The page starts as on-topic, went off-topic because of a DB error, then went off-topic because of financial problems, then went on-topic again, then it was hacked, then the domain expired and was put up for sale. We don't necessarily want to get rid of this, because it documents what happened. But if we are going to choose pages for a story, a DB error page won't be a good candidate.
  32. These are the scores of two similar pages. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: web-based kernel function using a search engine (SE). Structural methods: the change in the number of words, the change in content length.
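A minimal sketch of the textual and structural measures named on this slide, assuming plain-text input and raw term frequencies (the talk does not say how terms were weighted or tokenized, so `tokenize`, the top-`k` cutoff, and all function names here are illustrative choices). The web-based kernel function is left out because it depends on a search engine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard coefficient of the two term sets: |A & B| / |A | B|."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_term_overlap(a, b, k=20):
    """Fraction of the k most frequent terms of `a` that are also top-k terms of `b`."""
    top_a = {t for t, _ in Counter(tokenize(a)).most_common(k)}
    top_b = {t for t, _ in Counter(tokenize(b)).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0


def word_count_change(first, candidate):
    """Relative change in word count; negative when the candidate is shorter."""
    n_first = len(tokenize(first))
    n_candidate = len(tokenize(candidate))
    return (n_candidate - n_first) / n_first if n_first else 0.0
```

Each function compares a candidate memento's text against the text of the first memento of the same seed; a page replaced by a "domain for sale" notice would score near zero on the textual measures and strongly negative on the word-count change.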
  33. And these are the scores for two mementos in which one is off-topic because the domain was lost.
  34. These two mementos are about Egypt, but term-wise they don't overlap.
  35. For three collections, we went through and manually determined whether each memento of a URI-R is off-topic or on-topic, like the page I showed before that went off-topic. I would never want to do this again.
  36. Cosine similarity at threshold = 0.15 is the best single method: if the cosine similarity between the candidate memento and the first memento is < 0.15, then the candidate memento is marked as 'off-topic'. When combining cosine similarity with word count: if the cosine similarity between the candidate memento and the first memento is < 0.10 OR the word count of the candidate memento has decreased by more than 85% relative to the first memento, then the candidate memento is marked as 'off-topic'.
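The two rules on this slide can be sketched as follows. This is only an illustration under stated assumptions: the cosine score and the word counts are taken as already computed, and the function and parameter names are hypothetical; only the thresholds (0.15, 0.10, 85%) come from the slide.

```python
def off_topic_cosine_only(cosine_sim, threshold=0.15):
    """Single-method rule: off-topic if similarity to the first memento is below 0.15."""
    return cosine_sim < threshold


def word_count_drop(first_words, candidate_words):
    """Fraction by which the word count fell relative to the first memento (0.0 if it grew)."""
    if first_words == 0:
        return 0.0
    return max(0.0, (first_words - candidate_words) / first_words)


def off_topic_combined(cosine_sim, first_words, candidate_words,
                       cosine_threshold=0.10, drop_threshold=0.85):
    """Combined rule: off-topic if cosine < 0.10 OR the word count dropped by more than 85%."""
    return (cosine_sim < cosine_threshold
            or word_count_drop(first_words, candidate_words) > drop_threshold)
```

For example, a memento with cosine similarity 0.12 and almost the same length as the first memento is off-topic under the single-method rule but on-topic under the combined rule, whose cosine threshold is lower.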
  37. FP: classified as off-topic, but really on-topic. There are some collections where we didn't find off-topic URIs, and other collections that have a big chunk of off-topic mementos for lots of seed URIs. For some collections, around 10-15% of the mementos are off-topic.
  38. The URIs in the collection are crawled repeatedly, maybe daily, weekly, monthly, yearly, and so on. http://wayback.archive-it.org/2358/20110201013235/http://news.egypt.com/en/
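As a small illustration of how the crawl time can be read off a memento URI like the one above, here is a sketch that splits it into collection ID, memento-datetime, and original URI. It assumes the URI layout seen in the examples in this talk (collection number, then a 14-digit YYYYMMDDhhmmss timestamp, then the original URI); the pattern and the function name are not part of any Archive-It tool.

```python
import re
from datetime import datetime

# Assumed layout: http(s)://wayback.archive-it.org/<collection>/<14-digit timestamp>/<original URI>
URI_M_PATTERN = re.compile(
    r"^https?://wayback\.archive-it\.org/"
    r"(?P<collection>\d+)/(?P<timestamp>\d{14})/(?P<uri_r>.+)$"
)


def parse_memento_uri(uri_m):
    """Split a memento URI into collection ID, memento-datetime, and original URI."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognized Archive-It memento URI: {uri_m}")
    return {
        "collection": int(match["collection"]),
        # Naive datetime; Wayback-style timestamps are conventionally UTC.
        "memento_datetime": datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S"),
        "uri_r": match["uri_r"],
    }
```

Parsing every memento in a collection this way gives the (time, URI) pairs needed to plot crawl times per seed.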
  39. To gain insight into how to slice the collection, we visualized the memento-datetimes (the crawl times of the URIs) of many collections.
  40. Earlier we talked about two dimensions for stories. In this collection, here are the two dimensions: the x-axis is time and the y-axis is URIs. Pretty much for each URI, the crawls span the same amount of time.
  41. Textual methods will combine these two pages in the same cluster. So we needed an automated method to choose between them: when the same news comes from two different sites, we pick the better memento in terms of quality.
  42. https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/ https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
  43. Remember this: this is not a good snippet. These are not good snippets. We can do better than this: we override the favicon and add the date to the title.
  44. Remember this: this is not a good snippet. These are not good snippets. We can do better than this: we override the favicon and add the date to the title.
  45. So we generate a JSON object with the metadata of the stories and push it to Storify using the Storify API. After we select the best set of pages that represent a story, we extract the metadata of the mementos and put it into JSON format for visualization. We have done a lot of work to override the default metadata that Storify extracts.
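A rough sketch of what assembling such a JSON object could look like. The field names (`permalink`, `title`, `favicon`, `description`, `elements`) are hypothetical and are not the Storify API schema, and no call to Storify is made here; the sketch only shows the two overrides mentioned earlier, setting the favicon explicitly and adding the date to the title.

```python
import json
from datetime import datetime


def build_story_element(uri_m, title, memento_datetime, favicon_uri, snippet):
    """Describe one memento, with the capture date appended to the title."""
    return {
        "permalink": uri_m,                                  # hypothetical field name
        "title": f"{title} ({memento_datetime:%Y-%m-%d})",   # date added to the title
        "favicon": favicon_uri,                              # favicon chosen by us, not extracted
        "description": snippet,
    }


def build_story(story_title, elements):
    """Serialize the story and its elements as one JSON document."""
    return json.dumps({"title": story_title, "elements": elements}, indent=2)


element = build_story_element(
    "https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/",
    "CNN.com",
    datetime(2011, 2, 11, 19, 14, 45),
    "http://www.cnn.com/favicon.ico",   # placeholder favicon location
    "Placeholder snippet text for this memento.",
)
story_json = build_story("Egypt Revolution and Politics", [element])
```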
  46. We consider a story to be “good” if a person considers it to be indistinguishable from a human-generated story. Furthermore, the human and the automatic stories should be better than the random stories. A successful evaluation should show that human and DSA stories are indistinguishable, and that both are better than random stories.
  47. They are familiar with the collections. I gave them these guidelines.
  48. We obtained 23 stories for 10 collections. Some collections do not have stories because
  49. Random stories. When you choose randomly you get what you get: sometimes it is good and sometimes it is bad.
  50. The turkers are presented with two sets of comparisons; each comparison has two stories, and we ask them which story better summarizes the topic. They can scroll down through the stories and click on any memento.
  51. So what we want to do is create persistent stories, then visualize them using a storytelling tool that users already know, such as Storify. So we will integrate storytelling services and archived collections to generate archive-enriched stories.