Using Web Archives to Enrich the Live
Web Experience Through Storytelling
Yasmin AlNoamany
University of California, Berkeley
Web Science & Digital Libraries Research Group at ODU
@yasmina_anwar
@WebSciDL
Research Funded by IMLS LG-71-15-0077-15
csv,conf,v3
Portland, OR, 2017-05-03
My son, Yousof, was 2 on
January 17, 2011
https://www.facebook.com/elshaheeed.co.uk/
No worries! Multiple initiatives for
documenting the Egyptian Revolution
• 403 photos with information about the lives of the martyrs
• 3,525 images and 2,387 videos posted by people for the January demonstrations
• artwork produced during the Egyptian Revolution
Luckily these sites are archived at Archive-It
in the Egyptian Revolution collection
http://wayback.archive-it.org/2358/20110314134348/http://iamjan25.com/
http://wayback.archive-it.org/2358/20110211072306/http://1000memories.com/egypt/
https://wayback.archive-it.org/2358/20111128095924/http://iamtahrir.com/
Archived collections are important
for posterity, but there are
problems with archived collections
After 10 years, Yousof knows
about Archive-It
• > 3,500 collections
• ~340 institutions
• > 10B archived pages
Archive-It, a subscription-based service, hosts curated web collections
There is more than one collection about
the Egyptian Revolution
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
Current browsing and searching services for
the “Egypt Revolution and Politics” collection
Collection understanding and collection
summarization are not currently supported
Not easy to answer “what’s in that collection?” or
“how is this collection different from others?”
Our early attempts at collection understanding
tried to include everything…
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
1000s of seeds X 1000s of archived pages ==
Conventional Vis Methods Not Applicable
Use an interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched stories
Hand-crafted stories to summarize the
Egyptian Revolution collection for Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
Collections have two dimensions:
{Fixed, Sliding} X {Page, Time}
[Figure: mementos t1 … tk of the collection, plotted by URI against Time]
Fixed Page, Fixed Time
A desktop Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Android Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
The Dark and Stormy Archives (DSA) framework
• Establish a baseline
  • Characteristics of human-generated stories
  • Characteristics of Archive-It collections
• Reduce the candidate pool of archived pages
  • Exclude duplicates
  • Exclude off-topic pages
  • Exclude non-English language pages
• Select good representative pages
  • Dynamically slice the collection
  • Cluster the pages in each slice
  • Select high-quality pages from each cluster
  • Order pages by time
  • Visualize
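The selection stage can be pictured with a deliberately simplified sketch. It is not the DSA implementation. It assumes equal-count time slices in place of dynamic slicing, treats each slice as a single cluster, and relies on a hypothetical precomputed `quality` score per memento. The `Memento` class and `summarize` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Memento:
    uri: str
    timestamp: str   # 14-digit capture time, e.g. "20110211191429"
    quality: float   # hypothetical precomputed quality score


def summarize(mementos, k):
    """Pick up to k mementos: the best-quality page from each time slice."""
    ordered = sorted(mementos, key=lambda m: m.timestamp)
    if not ordered or k <= 0:
        return []
    k = min(k, len(ordered))
    n = len(ordered)
    # Assumption: equal-count slices stand in for dynamic slicing.
    bounds = [i * n // k for i in range(k + 1)]
    slices = [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]
    # Assumption: each slice is one cluster, so one pick per slice.
    picks = [max(s, key=lambda m: m.quality) for s in slices if s]
    # Order the selected pages by capture time.
    return sorted(picks, key=lambda m: m.timestamp)
```

Calling `summarize(mementos, 28)` would return at most 28 pages in chronological order, one per slice.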
https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
Establish a baseline of
social media stories
“Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
What is the length of a story
(the number of resources per story)?
This story has
31 resources
What are the types of resources that
compose a story?
Quotes
Video
This story has
• 19 quotes
• 8 images
• 4 videos
We found that 28 mementos is a good
number for the resources in the stories.
More than 60% of archived copies of
hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
Based on evaluating 6 similarity methods, we applied the best
performing method to automatically detect off-topic pages
Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as
follows:
Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI level
• Sqc is the snippet quality based on URI category
• wm, wql, wqc are the weights of memento damage, level,
and category
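As a minimal sketch, the formula above translates directly into a small function. The default weights of 1.0 are placeholders chosen for illustration, since the slide does not give weight values, and `memento_quality` is a hypothetical name.

```python
def memento_quality(d_m, s_ql, s_qc, w_m=1.0, w_ql=1.0, w_qc=1.0):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc.

    d_m  : memento damage (higher means more damaged)
    s_ql : snippet quality based on the URI level
    s_qc : snippet quality based on the URI category
    The default weights are illustrative assumptions only.
    """
    return (1 - w_m * d_m) + w_ql * s_ql + w_qc * s_qc


# With equal snippet scores, the less damaged memento ranks higher.
lightly_damaged = memento_quality(d_m=0.1, s_ql=0.5, s_qc=0.5)
heavily_damaged = memento_quality(d_m=0.8, s_ql=0.5, s_qc=0.5)
```

Because damage enters with a negative sign, increasing Dm lowers Mq, while better snippet scores raise it.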
We prefer higher-quality mementos, i.e., lower memento damage (Dm)
http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
We prefer pages with attractive snippets
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
Visualizing stories in Storify
“Generating Stories from Archived Collections”, WebSci 2017.
We extract the metadata of the pages
and order them chronologically
{
  "elements": [
    {
      "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
      "type": "link",
      "source": {"href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
      "type": "link",
      "source": {"href": "http://www.time.com", "name": "www.time.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
      "type": "link",
      "source": {"href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
      "type": "link",
      "source": {"href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007"}
    },
    …
    {
      "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
      "type": "link",
      "source": {"href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007"}
    }
  ],
  "description": "This is an automatically generated story from Archive-It collection.",
  "title": "April 16 Archive"
}
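A payload shaped like the one above could be assembled roughly as follows. This is a sketch under assumptions, not the code used for the talk: it reads the 14-digit capture timestamp out of each memento URI, sorts on it, and formats the source name as "host @ day, month year". The names `element` and `build_story` are hypothetical, and nothing here calls the Storify API.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Matches ".../<14-digit timestamp>/<original URI>" in a memento URI.
MEMENTO_RE = re.compile(r"/(\d{14})/(https?://.+)$")


def element(permalink):
    """Return (capture datetime, story element) for one memento URI."""
    match = MEMENTO_RE.search(permalink)
    if match is None:
        raise ValueError(f"not a memento URI: {permalink}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    original = urlsplit(match.group(2))
    return captured, {
        "permalink": permalink,
        "type": "link",
        "source": {
            "href": f"{original.scheme}://{original.netloc}",
            # Month abbreviation follows the current locale (English by default).
            "name": f"{original.netloc} @ {captured:%d, %b %Y}",
        },
    }


def build_story(permalinks, title, description):
    """Build a story dict whose elements are in chronological order."""
    dated = sorted((element(p) for p in permalinks), key=lambda pair: pair[0])
    return {
        "elements": [item for _, item in dated],
        "description": description,
        "title": title,
    }
```

Passing the result to `json.dumps(story, indent=2)` produces text in the same shape as the example payload.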
Using the Storify API, we
override the default
metadata to generate more
attractive snippets
Example of an automatically generated story
Notice the good
metadata: images,
titles with dates,
favicons
Evaluating the Dark and Stormy
Archives framework
(how good are the automatically generated stories?)
Evaluation is tricky!
(two perfectly good stories could have non-overlapping k=28
elements!)
• Successful evaluation means:
  • Human and DSA stories are indistinguishable
  • Human and DSA stories are better than Random
• We use human evaluators (via Amazon's Mechanical Turk) to compare:
  • Human-generated stories
  • DSA (automatically) generated stories
  • Randomly generated stories
Our guidelines for expert archivists at Archive-It
for generating stories from the collections
All the code, datasets,
papers, slides, etc.:
http://bit.ly/YasminPhD
@yasmina_anwar
Textual content
cosine similarity, intersection of the most frequent terms,
Jaccard similarity

Method            Similarity
cosine            0.7
TF-Intersection   0.6
Jaccard           0.5

Method            Similarity
cosine            0.0
TF-Intersection   0.0
Jaccard           0.0
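The three textual measures can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the evaluated implementation: it uses raw term counts with a naive tokenizer, and it defines TF-Intersection as the share of one text's `top` most frequent terms that also appear among the other's, which is one plausible reading of "intersection of the most frequent terms". The cutoff of 20 is a placeholder.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens (a naive tokenizer, for illustration)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between raw term-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Size of the shared vocabulary over the size of the combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tf_intersection(a, b, top=20):
    """Share of a's `top` most frequent terms also in b's `top` most frequent."""
    ta = {t for t, _ in Counter(tokens(a)).most_common(top)}
    tb = {t for t, _ in Counter(tokens(b)).most_common(top)}
    return len(ta & tb) / len(ta) if ta else 0.0
```

Two texts with no vocabulary in common score 0.0 on all three measures, the pattern shown in the second table.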
Semantics of the text
Web-based kernel function using the search engine (SE)

Method      Similarity
SE-Kernel   0.7
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
We built a gold standard data set to
evaluate the methods
We manually labeled 15,760 mementos

Collection                              URI-Rs   URI-Ms   Off-topic URI-Ms
Egypt Revolution and Politics           136      6,886    384
Occupy Movement                         255      6,570    458
Columbia Univ. Human Rights collection  198      2,304    94
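As a quick consistency check on the figures above, the per-collection memento counts can be summed and the off-topic share computed. The dictionary name `gold_standard` is only an illustrative label.

```python
# (URI-Ms, off-topic URI-Ms) per collection, copied from the slide.
gold_standard = {
    "Egypt Revolution and Politics": (6886, 384),
    "Occupy Movement": (6570, 458),
    "Columbia Univ. Human Rights collection": (2304, 94),
}

total_mementos = sum(m for m, _ in gold_standard.values())
total_off_topic = sum(o for _, o in gold_standard.values())

for name, (mementos, off_topic) in gold_standard.items():
    print(f"{name}: {off_topic / mementos:.1%} off-topic")
print(f"All: {total_off_topic} of {total_mementos} mementos off-topic")
```

The three URI-M counts add up to the 15,760 labeled mementos stated on the slide.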
Evaluated 6 methods on 15,760 manually
labeled mementos
• Textual content
  • cosine similarity
  • intersection of the most frequent terms
  • Jaccard similarity
• Semantics of the text
  • Web-based kernel function using the search engine (SE)
• Structural methods
  • no. of words
  • content-length
“Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
Mechanical Turk (MT) experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT:
  • HIT1: human vs. automatic, human vs. poor
  • HIT2: human vs. random, human vs. poor
  • HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with master qualification (i.e., high acceptance rate) for each HIT
• We rejected submissions that selected the poorly generated stories and HITs that were completed in less than 10 seconds (mean time per HIT = 7 minutes)
• 989 out of 1,035 (69*15) HITs were valid
• We awarded the turker $0.50 per HIT
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
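The counts in this setup follow from simple multiplication, which the short check below reproduces. The variable names are illustrative, and the valid-submission rate is derived here rather than quoted from the slides.

```python
stories = 23
hits_per_story = 3
turkers_per_hit = 15

hits = stories * hits_per_story        # distinct HITs
submissions = hits * turkers_per_hit   # one submission per turker per HIT
valid = 989                            # valid submissions, from the slide

rejected = submissions - valid
valid_rate = valid / submissions

print(f"{hits} HITs, {submissions} submissions, "
      f"{rejected} rejected, {valid_rate:.1%} valid")
```

The remaining submissions were the ones rejected by the two filters described above.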
We prefer deep links over
high-level domains (Sql)
Feb. 11, 2011: the BBC homepage on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
Feb. 11, 2011: the BBC Middle East section homepage on Storify
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
Feb. 11, 2011: the BBC article on Storify
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
Social media pages may not produce
good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
How do we dynamically divide the
collections into appropriate slices?
(in other words, how do we pick just 28?)
“Generating Stories from Archived Collections”, WebSci 2017.