Improving Understanding of Web Archive Collections Through Storytelling - PhD Candidacy Exam

@shawnmjones @WebSciDL
Improving Understanding of
Web Archive Collections
Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
Thanks to:

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
4. Proposed Research

Let’s say: you find a bag
There are thousands of different items inside.
Can you use the contents of this bag?
How quickly can you make this decision?

Now let’s say: there are thousands of bags
Which one might contain something useful for you?
Do any?
How do you know?
How do you decrease your chances of wasting your time?

What does this have to do with web archives?

Researchers create their own web archive collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Example collections: Tucson Shootings, 2008 Olympics, University of Utah

Web archive collections have many versions of the same page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: mementos from 2013, 2015, and 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: mementos from 2/12/2015, 3/5/2015, and 4/1/2015

Different versions allow us to see an unfolding news story
Memento from April 19, 2013 17:12: Searching for suspects, City on lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 24, 2013 2:24: Suspect found, Officer Collier lost his life, Obama speaks

Different versions allow us to see changes in an organization’s web presence
The White House: 2016
The White House: 2018

Archive-It allows curators to easily create collections
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.

… and these collections are used by other researchers
The collection curator is not the only user of the
collection!
These collections live a life after their curator
has stopped adding to them.

How do we tell the difference between collections?
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?

31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?

Archive-It provides fields for metadata
Collection-wide metadata
Metadata on individual seeds
Dublin Core + Custom Fields

But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare
metadata fields to understand the differences
between collections.
Example: one collection with 132,599 seeds has no metadata, while one with 9 seeds has metadata.
Paradox:
More seeds = more effort
More seeds = greater user need for metadata

Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection

More Archive-It collections are added every year
More than 8000 collections exist as of the end of 2016

The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.

Our proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a social media story of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
4. Proposed Research

Surrogates provide a visual summary of the content behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
[Image: the same URI represented by a browser thumbnail surrogate]
[Image: the same URI represented by a social card surrogate]

Social media storytelling uses surrogates to provide a “summary of summaries”
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.

Traditional surrogates contain metadata generated by humans to convey aboutness

Web surrogates provide a visual summary of a web resource drawn from the content of the resource
• Browser Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social Card (example from Facebook)
• Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.

Our research questions
• RQ1 (Collection Types): What types of web archive collections exist?
• RQ2 (Surrogate Types): What surrogates work best for understanding collections of mementos?
• RQ3 (Selecting Mementos): How do we select representative mementos for the different semantic types of collections?
• RQ4 (Evaluating Stories): How well do stories produced by different summarization algorithms work for collection understanding?

RQ2: What surrogates work best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet, and text snippet > thumbnail, in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; in information scent and user preference
• Capra (2013): in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card, text snippet ~= social card, and text + thumbnail ~= text snippet, in performance
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.

RQ3: How might we select representative mementos?
• Luhn (1958): automatic abstracts
• Silva (2014): word graphs from Luhn’s algorithm
• Radev (1998): automatic news briefs
• Mihalcea (2004): TextRank
• Dolan (2004): clustering news articles; Lede3 preferred by evaluators
• Xie (2008): MMR for meeting summaries
• Sipos (2008): scholarly corpus over time
• Zhang (2010)/Li (2011): aspects of disasters
• Hong (2014): word weighting
• DUC Datasets (2001-2007)
• Napoles (2012): Gigaword
• Lin (2014): ROUGE metrics
• Grusky (2018): NEWSROOM
• Existing reference summaries were built from news articles.
• Existing reference summaries were not built from web archives.

RQ3: How might we select representative mementos? – Related Concepts
• Scatter-Gather (Cutting 1992)
  • allows a user to explore a collection by drilling through topic clusters until they reach individual documents
  • we seek to provide a representative sample that a user can glance at quickly
• Recommender Systems
  • predict the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends
  • we want to provide a summary without any knowledge of the user
• Zero-Query Systems
  • predict the information a user will need based on time, location, environment, user interests, and other factors
  • again, we want to provide a summary with no knowledge of the user
Image reference:
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92). Copenhagen, Denmark, pp. 318-329. https://doi.org/10.1145/133160.133214

How have others explored collections?
Conta Me Histórias
ArchiveSpark
Archives Unleashed Cloud
Existing solutions allow users to query and develop statistics on collections.
Users must have some ideas of a topic or concept a priori.

How have others visualized collections for understanding?
Other attempts at visualizing Archive-It collections tried to visualize everything.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12), 15–18. DOI:10.1145/2232817.2232821

How have others told stories with web archive collections?
• AlNoamany told stories via the storytelling platform Storify
• She showed that test participants could not detect the difference between her automated stories and stories generated by human curators
Inputs: characteristics of human-generated stories, characteristics of Archive-It collections
Steps:
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
DOI:10.1145/3091478.3091508
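The order of the stages above can be sketched in a few lines of Python. This is only an illustration of the flow, not AlNoamany's implementation: the `Memento` fields (`digest`, `off_topic`, `cluster`, `quality`) are hypothetical stand-ins for real duplicate detection, off-topic detection, clustering, and quality scoring, and a fixed number of slices replaces the dynamic slicing.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    uri: str
    captured: datetime
    digest: str       # hypothetical content hash used to spot duplicates
    language: str
    off_topic: bool   # hypothetical flag from an earlier off-topic check
    cluster: int      # hypothetical precomputed cluster label
    quality: float    # hypothetical page-quality score


def summarize(mementos: list[Memento], slices: int = 3) -> list[Memento]:
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m.captured):
        # Exclude duplicates, off-topic pages, and non-English pages.
        if m.digest in seen or m.off_topic or m.language != "en":
            continue
        seen.add(m.digest)
        kept.append(m)
    if not kept:
        return []
    # Slice the time-ordered collection into roughly equal parts
    # (an assumption; the original slices dynamically).
    size = -(-len(kept) // slices)  # ceiling division
    chosen = []
    for start in range(0, len(kept), size):
        best = {}
        for m in kept[start:start + size]:
            # Keep the highest-quality page from each cluster in this slice.
            if m.cluster not in best or m.quality > best[m.cluster].quality:
                best[m.cluster] = m
        chosen.extend(best.values())
    # Order the selected pages by time, ready for visualization.
    return sorted(chosen, key=lambda m: m.captured)
```

The output is the short, time-ordered list of mementos that would then be rendered as surrogates in a story.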

• AlNoamany told stories via the storytelling platform Storify, which is no longer in service
• She showed that test participants could not detect the difference between her automated stories and stories generated by human curators
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
• There are other types of Archive-It collections
S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?” http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html, 2017.

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   1. RQ1: What types of web archive collections exist?
   2. Partial RQ2: What surrogates work best for understanding collections of mementos?
      1. How effective are existing curation platforms at producing mementos?
      2. Preliminary user surrogate study
   3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
      1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research

As collection users, we view Archive-It collections from outside…
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

As collection users, what structural features can we view from outside?
• Using only structural features is advantageous because it saves one from having to download a collection’s content.
• These structural features give us different insight than can be provided by text analysis or metadata.
Example collection: 81,014 seeds, 486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos

Was the collection built from web sites belonging to one domain or many?
[Figures: a collection with many domains; a collection with one domain]
Structural feature discussed here:
• domain diversity

Were most of the web pages in the collection top-level pages or specific articles deeper in a web site?
[Figures: a collection of top-level pages; a collection of deeper links]
Structural features discussed here:
• path depth diversity
• most frequent path depth
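As a rough illustration, these URI-based structural features can be computed from seed URIs alone, without downloading any content. The sketch below assumes a simple definition of diversity (unique values divided by total values); the definition used in the cited paper may differ, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def path_depth(uri: str) -> int:
    """Count non-empty path segments; a top-level page has depth 0."""
    return len([seg for seg in urlparse(uri).path.split("/") if seg])


def diversity(values: list) -> float:
    """Unique values divided by total values (an assumed, simple definition)."""
    return len(set(values)) / len(values) if values else 0.0


def structural_features(seed_uris: list[str]) -> dict:
    domains = [urlparse(u).netloc.lower() for u in seed_uris]
    depths = [path_depth(u) for u in seed_uris]
    return {
        "number_of_seeds": len(seed_uris),
        "domain_diversity": diversity(domains),
        "path_depth_diversity": diversity(depths),
        "most_frequent_path_depth": (
            Counter(depths).most_common(1)[0][0] if depths else None
        ),
    }
```

Under this definition, a collection drawn from a single domain has a domain diversity close to 0, while one in which every seed comes from a different domain scores 1.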

Growth curves provide some understanding of collection curation behavior
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figures: growth curves showing positive and negative skew]

Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.
Structural features discussed here:
• area under the seed memento growth curve
• lifespan of the collection
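One way to make the growth-curve feature concrete is the sketch below. It assumes a normalized curve: the x-axis is the fraction of the collection's lifespan elapsed, the y-axis is the fraction of mementos captured so far, and the curve is treated as a step function. The function name and the choice of `start` and `end` as the lifespan bounds are assumptions, not the paper's exact definition.

```python
from datetime import datetime


def growth_curve_auc(capture_times: list[datetime],
                     start: datetime, end: datetime) -> float:
    """Area under the normalized cumulative growth curve, between 0 and 1."""
    lifespan = (end - start).total_seconds()
    if lifespan <= 0 or not capture_times:
        return 0.0
    total = len(capture_times)
    area, prev_x, prev_y = 0.0, 0.0, 0.0
    for i, t in enumerate(sorted(capture_times), start=1):
        x = (t - start).total_seconds() / lifespan
        # The cumulative fraction stays flat until the next capture.
        area += (x - prev_x) * prev_y
        prev_x, prev_y = x, i / total
    # Extend the final level to the end of the lifespan.
    return area + (1.0 - prev_x) * prev_y
```

With this normalization, an area near 1 means most mementos were captured early in the collection's life, and an area near 0 means most were captured late, as in the example collection above.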

When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected throughout its life.
It has mementos as recent as July 2018.
Structural feature discussed here:
• area under the seed growth curve

Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected toward the beginning of its life.
Mementos were captured throughout its life.
Structural features discussed here:
• area under the seed growth curve
• area under the seed memento growth curve

We discovered four semantic categories in Archive-It collections…
In a study of 3,382 Archive-It collections:
• Self-Archiving
• Subject-based
• Time Bounded – Expected
• Time Bounded – Spontaneous

Self-Archiving: 54.1% of collections
Subject-based
Time Bounded – Expected
Time Bounded – Spontaneous

Self-Archiving
Subject-based
Time Bounded – Expected
Time Bounded – Spontaneous: 4.2% of collections, some evaluated by AlNoamany

We can bridge the structural to the descriptive…
Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720

We have identified different types of Archive-It collections
RQ1: Collection Types ✅
RQ2: Surrogate Types
RQ3: Selecting Mementos
RQ4: Evaluating Stories
Categories: Self-Archiving, Subject-based, Time Bounded – Expected, Time Bounded – Spontaneous
We can take these features into account to address the other research questions.
So, let’s tell some stories on social media!
Not so fast…

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   1. How effective are existing curation platforms at producing mementos?
4. Proposed Research

Existing platforms do not reliably produce surrogates for mementos…
If we cannot rely upon the service to generate a surrogate for a memento, our system must then do the work to create our own surrogates.
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.

Some services have stories, but not long-term storytelling
• Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/)
• Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/)
• Instagram stories (image ref: https://buffer.com/library/instagram-stories)
These platforms delete the user’s stories 24 hours after they are posted.
This form of social media storytelling is the opposite of what we are looking for.
We want the stories to be artifacts themselves.

Some services’ longevity is in doubt…
RIP: Storify 2018
RIP: Google+ 2019
RIP: Tumblr (soon?)
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.

Existing surrogate services create a confusing experience for mementos
Who published these resources? Archive-It? CNN?
Is the story author sharing fake news?
Examples shown: embed.rocks surrogate, embed.ly surrogate
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.

Neither social media services nor surrogate services were reliable for storytelling, so we created MementoEmbed…
Information in the MementoEmbed social card surrogate is separated to avoid confusion about attribution.
MementoEmbed is archive-aware. It can locate information about the memento that is not available in other surrogates.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.

MementoEmbed provides us with a tool for evaluating surrogates, a step on the road to answering RQ2…
RQ1: Collection Types ✅
RQ2: Surrogate Types ☑️
RQ3: Selecting Mementos ☑️
RQ4: Evaluating Stories

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   1. How effective are live web curation platforms at producing mementos?
4. Proposed Research

Using stories built from curator-selected mementos, we shared stories with Mechanical Turk (MT) participants…
Surrogate types:
• Archive-It like
• Social Card
• Browser thumbnails
• Social Card With Thumbnail as Image (sc/t)
• Social Card With Thumbnail to Right (sc+t)
• Social Card with Thumbnail on Hover (sc^t)
Study design:
• 4 stories of 15-17 mementos selected by human Archive-It curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
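The counts above can be checked with a short enumeration. The story names below are placeholders, and the per-combination figure assumes each participant saw exactly one story-surrogate combination and that participants were spread evenly; the slides do not state how participants were assigned.

```python
from itertools import product

# Placeholder story names; the study used 4 curator-built stories.
stories = [f"story-{i}" for i in range(1, 5)]
surrogates = [
    "Archive-It like",
    "Browser thumbnails",
    "Social Card",
    "sc/t",
    "sc+t",
    "sc^t",
]

combinations = list(product(stories, surrogates))
participants = 120

# Assumed even split of participants across combinations.
per_combination = participants / len(combinations)
print(len(combinations), per_combination)
```

Under that assumption, 4 stories times 6 surrogate types gives the 24 combinations, with 5 participants per combination.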

And then we asked them which 2 of 6 mementos come from the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.

Response times per surrogate had interesting means, but the differences were not statistically significant at p < 0.05
p = 0.190
p = 0.202

Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate
[Bar chart: Correct Answers Per Surrogate, median and mean on an axis from 0 to 2.5, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t]
p = 0.0569
p = 0.0770

Whenever thumbnails are present, more users interact with them
We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted with a thumbnail, regardless of surrogate.
For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate.

We have some results indicating that social cards perform better, but there is more to answering RQ2…
RQ1: Collection Types ✅
RQ2: Surrogate Types ☑️
RQ3: Selecting Mementos ☑️
RQ4: Evaluating Stories
[Bar chart: Correct Answers Per Surrogate, median and mean on an axis from 0 to 2.5, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t]

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
4. Proposed Research

Identifying off-topic mementos is key to choosing representative mementos
• Collections have a theme
• Seeds are selected to support that theme
• Mementos are versions of seeds
• Some of these versions are off-topic
• Identifying these off-topic mementos is key to summarization
Examples of off-topic mementos: Hacked, Moved on from topic, Web Page Gone, Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87

The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
Measure                             | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count                          | 0.0                    | -1.0                   | No                      | bytecount
Word Count                          | 0.0                    | -1.0                   | Yes                     | wordcount
Jaccard Distance                    | 0.0                    | 1.0                    | Yes                     | jaccard
Sørensen-Dice                       | 0.0                    | 1.0                    | Yes                     | sorensen
Simhash of Term Frequencies         | 0                      | 64                     | Yes                     | simhash-tf
Simhash of raw memento              | 0                      | 64                     | No                      | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0                    | 0                      | Yes                     | cosine
Cosine Similarity of LSI Vectors    | 1.0                    | 0                      | Yes                     | gensim_lsi
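Three of the measures in the table can be sketched with the standard library. The table gives only the score endpoints, so the formulas here are assumptions chosen to match them (relative change for word count; set-based distances for Jaccard and Sørensen-Dice). The tokenizer is a stand-in for whatever preprocessing the OTMT performs, and the function names are hypothetical.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercased word tokens (a stand-in for the toolkit's preprocessing)."""
    return re.findall(r"\w+", text.lower())


def word_count_change(first: str, later: str) -> float:
    """0.0 when the counts match; -1.0 when the later memento has no words."""
    n_first = len(tokens(first))
    if n_first == 0:
        return 0.0
    return (len(tokens(later)) - n_first) / n_first


def jaccard_distance(first: str, later: str) -> float:
    """0.0 for identical term sets; 1.0 for disjoint term sets."""
    a, b = set(tokens(first)), set(tokens(later))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first: str, later: str) -> float:
    """0.0 for identical term sets; 1.0 for disjoint term sets."""
    a, b = set(tokens(first)), set(tokens(later))
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

For example, if a seed's first memento holds a five-word headline and a later memento holds only "Account suspended", the word count change is (2 - 5) / 5 = -0.6 and both distances are 1.0. The threshold at which a memento is declared off-topic is a separate choice that this sketch does not make.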

After repeating AlNoamany’s experiment, Word Count had the best F1 score for identifying off-topic mementos…
We reused AlNoamany’s labeled dataset.
She did not try:
• Sørensen-Dice
• Simhash of raw content
• Simhash of TF
• Gensim LSI
Our word count accuracy came out ahead of AlNoamany’s.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5

Finding off-topic mementos is one of the first steps to addressing RQ3…
RQ1: Collection Types ✅
RQ2: Surrogate Types ☑️
RQ3: Selecting Mementos ☑️
RQ4: Evaluating Stories

Outline
1. Motivation
2. Research Questions
3. Preliminary Work
4. Proposed Research

This work requires a flexible framework: Dark and Stormy Archives (DSA) 2.0
Input: Web Archive Collection (thousands of HTML documents)
Output: Story (< 30 representative mementos, visualized as surrogates)
Components: OTMT, Hypercane, Raintale, MementoEmbed, Archive-It Utilities
[Diagram: the components are linked by “calls” and “provides input to” relationships and grouped into tools for selecting representative mementos and tools for visualizing mementos as a story; four ✅ marks appear on the diagram]
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.

Evaluation of RQ2: What surrogates work best for understanding collections of mementos?
How well do users perform with different types of surrogates?
1. Select 5 collections from each semantic category
2. Select the earliest memento of each of the first 20 seeds from each collection; this is the number of surrogates a user views if they open an Archive-It story and page down once
3. Present the participant with a story of 20 surrogates, varying the surrogate between participants
4. Ask them to address a user task
Variations:
• For step #3, vary the time for participants to view the story
  • participants view for 5, 10, 20, 30 seconds
  • may surface the ability to “glance” and understand
• some surrogates consist only of title, URI, etc.
  • may determine which surrogate elements perform best
• For step #4, ask the participant to:
  • determine if the collection behind the story is suited for a task, similar to traditional IR research
  • identify which items likely belong to the same collection
• Instead of steps 3 and 4, ask former participants which surrogate they prefer for a given task
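Step 2 of the procedure can be sketched as follows. The input shape is an assumption: a dictionary mapping each seed URI, in collection order, to a list of (capture time, memento URI) pairs. Seeds with no captures are skipped, which is also an assumption, and the function name is hypothetical.

```python
from datetime import datetime


def earliest_mementos(
    seed_timemaps: dict[str, list[tuple[datetime, str]]],
    limit: int = 20,
) -> list[str]:
    """Earliest memento URI for each of the first `limit` seeds that has one."""
    selected = []
    for seed, captures in seed_timemaps.items():  # dicts keep insertion order
        if len(selected) == limit:
            break
        if captures:
            # Tuples compare by capture time first, so min() finds the earliest.
            selected.append(min(captures)[1])
    return selected
```

The returned list is the set of mementos that would be rendered as the 20 surrogates shown to each participant.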

Evaluation of RQ2: What surrogates work best for understanding collections of mementos?
• What information is available to users of the existing Archive-It story? Discover patterns in metadata usage that may indicate the semantic type of collection.
• How well do our stories compare to the existing metadata? (Our Story Content vs. Existing Metadata For Seeds)
• How well do our stories cover the content of the underlying collection? (Collection Content vs. Our Story Content)
• How well does the Archive-It story cover the underlying collection? (Collection Content vs. Archive-It Story Content)
• How well do surrogates cover the content of their mementos? (Memento Content vs. Surrogate Content)
Similarity metrics will be used for evaluating coverage.

Evaluation of RQ3: How do we select representative mementos for different semantic types of collections?
We will develop different algorithms and compare their output with several metrics to determine which algorithms provide the best “aboutness” for the collection.
[Chart: DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 compared on a 0 to 10 scale across Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, and Performance]

RQ4: How well do stories produced by different summarization algorithms work for collection understanding?
How well do our generated stories compare to the existing Archive-It interface?
Do study participants understand key concepts of the collection represented by the story?
Using the stories, can participants tell the difference between similar collections?
Can participants compare stories and tell which are similar?
Does the addition of existing metadata improve the participant’s performance?
Does the layout of the surrogates improve the participant’s performance?
RQ1: Collection Types ✅
RQ2: Surrogate Types ☑️
RQ3: Selecting Mementos ☑️
RQ4: Evaluating Stories

We plan to have completed this research in 2021…
Venues on the timeline: DTMH 2017, JCDL 2018, iPres 2018 (two entries), CIKM 2019, ECIR 2020, WWW 2020, JCDL 2020, CIKM 2020, WebSci 2021

Our methods are not just for Archive-It
Our methods will be applicable to web archive collections created on other platforms, like Rhizome’s Webrecorder.

Motivation Summary
• Collection understanding is a problem with web archive collections
  • inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
  • relied upon Storify as a visualization medium

Contributions
• Existing work:
  • Derived semantic categories of web archive collections in Archive-It
    • Categories can be predicted by using structural features
    • Most collections are not about events
  • MementoEmbed: surrogates for the past web
  • Social cards probably provide better understanding of collections
  • Off-Topic Memento Toolkit: identifying off-topic mementos
• Future work:
  • Evaluate algorithms for surfacing a representative sample from a document collection
  • Evaluate different surrogate types via user evaluation
  • Show which surrogate-sample combinations work best for collection understanding via user evaluation

Discussion
