Improving Understanding of Web Archive Collections Through Storytelling - PhD Candidacy Exam

Shawn Jones, Research Assistant at Old Dominion University
@shawnmjones @WebSciDL
Improving Understanding of
Web Archive Collections
Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
4. Proposed Research
Let’s say: you find a bag
There are thousands of different items inside.
Can you use the contents of this bag?
How quickly can you make this decision?
Now let’s say: there are thousands of bags
Which one might contain something useful for
you?
Do any?
How do you know?
How do you decrease your chances of wasting
your time?
What does this have to do with web archives?
Researchers create their own web archive collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings | 2008 Olympics | University of Utah
Web archive collections have many versions of the same page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: 2013, 2015, 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015
Different versions allow us to see an unfolding news story
Memento from April 19, 2013 17:12: Searching for suspects, City on lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 24, 2013 2:24: Suspect Found, Officer Collier lost life, Obama speaks
Different versions allow us to see changes in an organization’s web presence
The White House: 2016 | The White House: 2018
Archive-It allows curators to easily create collections
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
… and these collections are used by other researchers
The collection curator is not the only user of the collection!
These collections live a life after their curator has stopped adding to them.
How do we tell the difference between collections?
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?
31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?
Archive-It provides fields for metadata
Collection-wide metadata | Metadata on individual seeds
Dublin Core + Custom Fields
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
9 seeds: with metadata
132,599 seeds: no metadata
Paradox:
More seeds = more effort
More seeds = greater user need for metadata
Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
More Archive-It collections are added every year
More than 8000 collections exist as of the end of 2016
The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
Our proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
• ensure that our representative mementos have good information scent
  • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a social media story of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
Surrogates provide a visual summary of the content behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI represented by a browser thumbnail surrogate:
The same URI represented by a social card surrogate:
Social media storytelling uses surrogates to provide a “summary of summaries”
2 resources are shown from this Wakelet story | 6 resources are shown from this Storify story
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
Traditional surrogates contain metadata generated by humans to convey aboutness
Web surrogates provide a visual summary of a web resource drawn from the content of the resource
• Browser Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social Card (example from Facebook)
• Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
Our research questions
• RQ1: What types of web archive collections exist?
• RQ2: What surrogates work best for understanding collections of mementos?
• RQ3: How do we select representative mementos for the different semantic types of collections?
• RQ4: How well do stories produced by different summarization algorithms work for collection understanding?
RQ1: Collection Types | RQ2: Surrogate Types | RQ3: Selecting Mementos | RQ4: Evaluating Stories
RQ2: What surrogates work best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
RQ3: How might we select representative mementos?
• Luhn (1958): automatic abstracts
• Silva (2014): word graphs from Luhn’s algorithm
• DUC Datasets (2001-2007)
• Napoles (2012): Gigaword
• Lin (2014): ROUGE metrics
• Grusky (2018): NEWSROOM
• Existing reference summaries were built from news articles.
• Existing reference summaries were not built from web archives.
• Mihalcea (2004): TextRank
• Dolan (2004): clustering news articles; Lede3 preferred by evaluators
• Xie (2008): MMR for meeting summaries
• Radev (1998): automatic news briefs
• Sipos (2008): scholarly corpus over time
• Zhang (2010)/Li (2011): aspects of disasters
• Hong (2014): word weighting
RQ3: How might we select representative mementos? – Related Concepts
• Scatter-Gather (Cutting 1992)
  • allows a user to explore a collection by drilling through topic clusters until they reach individual documents
  • we seek to provide a representative sample that a user can quickly glance at
• Recommender Systems
  • predict the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends
  • we want to provide a summary without any knowledge of the user
• Zero-Query Systems
  • predict the information a user will need based on time, location, environment, user interests, and other factors
  • again, we want to provide a summary with no knowledge of the user
Image reference:
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92). Copenhagen, Denmark, pp. 318-329. https://doi.org/10.1145/133160.133214
How have others explored collections?
• Conta Me Histórias
• ArchiveSpark
• Archives Unleashed Cloud
Existing solutions allow users to query and develop statistics on collections.
Users must have some idea of a topic or concept a priori.
How have others visualized collections for understanding?
Other attempts at visualizing Archive-It collections tried to visualize everything.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
How have others told stories with web archive collections?
• AlNoamany told stories via the storytelling platform Storify, which is no longer in service
• She showed that test participants could not detect the difference between her automated stories and stories generated by human curators
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
• There are other types of Archive-It collections
Characteristics of human-generated stories | Characteristics of Archive-It collections
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?” http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html, 2017.
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   1. RQ1: What types of web archive collections exist?
   2. Partial RQ2: What surrogates work best for understanding collections of mementos?
      1. How effective are existing curation platforms at producing mementos?
      2. Preliminary user surrogate study
   3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
      1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research
As collection users, we view Archive-It collections from outside…
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
As collection users, what structural features can we view from outside?
• Using only structural features is advantageous because it saves one from having to download a collection’s content.
• These structural features give us different insight than can be provided by text analysis or metadata.
81,014 seeds
486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Was the collection built from web sites belonging to one domain or many?
Many domains | One domain
Structural feature discussed here:
• domain diversity
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Were most of the web pages in the collection top-level pages or specific articles deeper in a web site?
Top-level pages | Deeper links
Structural features discussed here:
• path depth diversity
• most frequent path depth
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
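The structural features named on the last few slides (number of seeds, domain diversity, path depth diversity, most frequent path depth) can be computed from seed URIs alone. The following is a minimal sketch, not the implementation from the cited paper: the function names are hypothetical, and the ratio-based definitions of "diversity" (unique values divided by number of seeds) are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Count non-empty path segments, so a top-level page has depth 0."""
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


def structural_features(seed_uris):
    """Compute illustrative structural features for a list of seed URIs.

    Assumption: diversity = (number of unique values) / (number of seeds).
    The cited study may define these features differently.
    """
    if not seed_uris:
        raise ValueError("at least one seed URI is required")
    domains = {urlsplit(uri).netloc.lower() for uri in seed_uris}
    depths = Counter(path_depth(uri) for uri in seed_uris)
    return {
        "seed_count": len(seed_uris),
        "domain_diversity": len(domains) / len(seed_uris),
        "path_depth_diversity": len(depths) / len(seed_uris),
        "most_frequent_path_depth": depths.most_common(1)[0][0],
    }


if __name__ == "__main__":
    # Placeholder seeds for illustration only
    seeds = [
        "https://example.com/",
        "https://example.com/news/2016/flood.html",
        "https://example.org/a/b/c",
        "https://example.org/x/y/z",
    ]
    print(structural_features(seeds))
```

Under these assumed definitions, a collection drawn from a single site scores near 0 for domain diversity, and a collection of deep article links reports a most frequent path depth greater than 0.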
Growth curves provide some understanding of collection curation behavior
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figure: example growth curves labeled with positive and negative skew]
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.
Structural features discussed here:
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected all along its life.
It has mementos as recent as July 2018.
Structural features discussed here:
• area under the seed growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected toward the beginning of its life.
Mementos were captured all during its life.
Structural features discussed here:
• area under the seed growth curve
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
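To make "area under the growth curve" concrete, here is a minimal sketch that assumes a normalized curve: the x axis is the fraction of the collection's lifespan that has elapsed, and the y axis is the fraction of items (seeds or seed mementos) present by that point. The normalization, the step-function integration, and the function name are assumptions for illustration; the cited paper may compute the area differently.

```python
from datetime import datetime


def growth_curve_area(event_times, start, end):
    """Approximate the area under a normalized cumulative growth curve.

    event_times: datetimes at which items (seeds or seed mementos) appeared.
    start, end: the beginning and end of the collection's lifespan.
    Assumption: both axes are scaled to the range 0..1.
    """
    if end <= start or not event_times:
        raise ValueError("need a positive lifespan and at least one event")
    lifespan = (end - start).total_seconds()
    times = sorted(event_times)
    total = len(times)
    area = 0.0
    prev_x, prev_y = 0.0, 0.0
    for count, moment in enumerate(times, start=1):
        x = (moment - start).total_seconds() / lifespan
        # The cumulative curve is a step function: it stays flat
        # at prev_y until the next item arrives at x.
        area += (x - prev_x) * prev_y
        prev_x, prev_y = x, count / total
    # Extend the final step to the end of the lifespan.
    area += (1.0 - prev_x) * prev_y
    return area


if __name__ == "__main__":
    start, end = datetime(2020, 1, 1), datetime(2020, 1, 11)
    early = [datetime(2020, 1, 2)] * 3   # everything arrives on day 1 of 10
    late = [datetime(2020, 1, 10)] * 3   # everything arrives on day 9 of 10
    print(growth_curve_area(early, start, end))
    print(growth_curve_area(late, start, end))
```

With this normalization, an area well above 0.5 suggests that most items arrived early in the collection's life, and an area well below 0.5 suggests that most arrived late, as in the March 2010 example whose mementos mostly come from 2016 – 2018.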
We discovered four semantic categories in Archive-It collections…
In a study of 3,382 Archive-It collections:
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
We can bridge the structural to the descriptive…
Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
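For readers unfamiliar with the F1 score reported for the classifier, the sketch below shows how per-class F1 and a macro-averaged F1 can be computed from true and predicted category labels. The slide does not state which averaging was used, so macro averaging is an assumption here, and the labels in the usage example are invented for illustration.

```python
def f1_per_class(true_labels, predicted_labels):
    """Return a dict mapping each label to its F1 score."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores (an assumed averaging)."""
    scores = f1_per_class(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Invented labels for illustration only
    truth = ["self-archiving", "self-archiving", "subject-based", "expected"]
    guess = ["self-archiving", "subject-based", "subject-based", "expected"]
    print(f1_per_class(truth, guess))
    print(macro_f1(truth, guess))
```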
We have identified different types of Archive-It collections
RQ1: Collection Types | RQ2: Surrogate Types | RQ3: Selecting Mementos | RQ4: Evaluating Stories
✅
Self-Archiving | Subject-based | Time Bounded – Expected | Time Bounded – Spontaneous
We can take these features into account to address the other research questions.
So, let’s tell some stories on social media!
Not so fast…
Existing platforms do not reliably produce surrogates for mementos…
If we cannot rely upon the service to generate a surrogate for a memento, our system must then do the work to create our own surrogates.
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
Some services have stories, but not long-term storytelling?
• Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/)
• Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/)
• Instagram stories (image ref: https://buffer.com/library/instagram-stories)
These platforms delete the user’s stories 24 hours after they are posted.
This form of social media storytelling is the opposite of what we are looking for.
We want the stories to be artifacts themselves.
Some services’ longevity is in doubt…
RIP: Google+ 2019 | RIP: Tumblr (soon?) | RIP: Storify 2018
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
Existing surrogate services create a confusing experience for mementos
Who published these resources? Archive-It? CNN?
Is the story author sharing fake news?
embed.rocks surrogate | embed.ly surrogate
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
Neither social media services nor surrogate services were reliable for storytelling, so we created MementoEmbed…
Information in the MementoEmbed social card surrogate is separated to avoid issues of confusion about attribution.
MementoEmbed is archive-aware. It can locate information about the memento that is not available in other surrogates.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
MementoEmbed provides us with a tool for evaluating surrogates, a step on the road to answering RQ2…
RQ1: Collection Types | RQ2: Surrogate Types | RQ3: Selecting Mementos | RQ4: Evaluating Stories
✅ ☑️ ☑️
Using stories built from curator-selected mementos, we shared stories with Mechanical Turk (MT) participants…
Surrogate types: Archive-It-like | Social Card | Browser thumbnails | Social Card With Thumbnail as Image (sc/t) | Social Card With Thumbnail to Right (sc+t) | Social Card with Thumbnail on Hover (sc^t)
• 4 stories of 15-17 mementos selected by human Archive-It curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
And then we asked them which 2 of 6 mementos come from the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
Response times per surrogate had interesting means, but the differences were not statistically significant at p < 0.05
p = 0.190
p = 0.202
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate
[Chart: Correct Answers Per Surrogate (median and mean) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, sc^t]
p = 0.0569
p = 0.0770
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
Whenever thumbnails are present, more users interact with them
We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted
with a thumbnail, regardless of surrogate.
For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the
surrogate.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
We have some results indicating that social cards perform better, but there is more to answering RQ2…
RQ1: Collection Types | RQ2: Surrogate Types | RQ3: Selecting Mementos | RQ4: Evaluating Stories
✅ ☑️ ☑️
Identifying off-topic mementos is key to choosing representative mementos
• Collections have a theme
• Seeds are selected to support that theme
• Mementos are versions of seeds
• Some of these versions are off-topic
• Identifying these off-topic mementos is key to summarization
Examples of off-topic mementos: Hacked | Moved on from topic | Web Page Gone | Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
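To illustrate how a seed's first memento might be compared with a later one, the sketch below implements four of the measures from the table in plain Python. This is not the OTMT's code: the function names and tokenizer are hypothetical, the word count formula (relative change from the first memento) is an assumption chosen to match the table's 0.0 and -1.0 endpoints, and the cosine function uses raw term frequencies rather than the TF-IDF weighting named in the table.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_distance(first, other):
    """0.0 for identical term sets, 1.0 for disjoint term sets."""
    a, b = set(first), set(other)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, other):
    """0.0 for identical term sets, 1.0 for disjoint term sets."""
    a, b = set(first), set(other)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


def word_count_change(first, other):
    """Assumed formula: relative change in word count versus the first memento."""
    if not first:
        raise ValueError("the first memento has no words")
    return (len(other) - len(first)) / len(first)


def cosine_similarity(first, other):
    """Cosine of raw term-frequency vectors (TF-IDF weighting omitted)."""
    tf_a, tf_b = Counter(first), Counter(other)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm = math.sqrt(sum(v * v for v in tf_a.values())) * math.sqrt(
        sum(v * v for v in tf_b.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    first = tokenize("Flood relief efforts continue in Louisiana")
    later = tokenize("Account suspended")
    print(jaccard_distance(first, later), sorensen_dice_distance(first, later))
    print(word_count_change(first, later), cosine_similarity(first, later))
```

A memento would be flagged as off-topic when its score crosses a threshold in the "fully dissimilar" direction for the chosen measure; the threshold values themselves are not given on the slide.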
After repeating AlNoamany’s experiment, Word Count had the best F1 score for identifying off-topic mementos…
We reused AlNoamany’s labeled dataset.
She did not try:
• Sørensen-Dice
• Simhash of raw content
• Simhash of TF
• Gensim LSI
Our word count accuracy came out ahead of AlNoamany’s.
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
Finding off-topic mementos is one of the first steps to addressing RQ3…
RQ1: Collection Types | RQ2: Surrogate Types | RQ3: Selecting Mementos | RQ4: Evaluating Stories
✅ ☑️ ☑️
This work requires a flexible framework – Dark and Stormy Archives (DSA) 2.0
Components: OTMT, Hypercane, Raintale, MementoEmbed, Archive-It Utilities (four are marked ✅ in the diagram)
Component groups: tools for selecting representative mementos; tools for visualizing mementos as a story
Input: Web Archive Collection (thousands of HTML documents)
Output: Story (< 30 representative mementos, visualized as surrogates)
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
Evaluation of RQ2: What surrogates work best for
understanding collections of mementos?
77
How well do users perform with different types of surrogates?
1. Select 5 collections from each semantic category.
2. Select the earliest memento of each of the first 20 seeds from each collection. This is the number of surrogates a user views if they open an Archive-It story and page down once.
3. Present the participant with a story of 20 surrogates, varying the surrogate between participants.
4. Ask them to address a user task.
Variations:
• For step #3, vary the time for participants to view the story
  • participants view for 5, 10, 20, or 30 seconds
  • may surface the ability to “glance” and understand
• Some surrogates consist only of title, URI, etc.
  • may determine which surrogate elements perform best
• For step #4, ask the participant to:
  • determine if the collection behind the story is suited for a task, similar to traditional IR research
  • identify which items likely belong to the same collection
• Instead of steps 3 and 4, ask former participants which surrogate they prefer for a given task
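Step 2 of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the input structure (a list of seeds in display order, each paired with its captures as datetime and memento URI tuples), the function name, and the example.org URIs are hypothetical, and skipping seeds that have no captures is a design choice the slide does not specify.

```python
from datetime import datetime


def earliest_mementos(seeds, seed_limit=20):
    """Return the earliest memento URI for each of the first `seed_limit` seeds.

    `seeds` is a list of (seed_uri, captures) pairs in collection order, where
    `captures` is a list of (memento_datetime, memento_uri) tuples.
    """
    selected = []
    for _seed_uri, captures in seeds[:seed_limit]:
        if not captures:
            continue  # assumption: a seed with no captures yields no surrogate
        _when, memento_uri = min(captures, key=lambda capture: capture[0])
        selected.append(memento_uri)
    return selected


# Hypothetical collection with three seeds, one of which was never captured.
seeds = [
    ("http://example.org/a", [
        (datetime(2015, 4, 1), "http://archive.example.org/20150401/a"),
        (datetime(2013, 1, 5), "http://archive.example.org/20130105/a"),
    ]),
    ("http://example.org/b", []),
    ("http://example.org/c", [
        (datetime(2018, 7, 1), "http://archive.example.org/20180701/c"),
    ]),
]
story = earliest_mementos(seeds)
```

Because empty seeds are skipped, the resulting story can hold fewer than 20 surrogates; a study that needs exactly 20 would have to keep reading seeds until the quota is met.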
Evaluation of RQ2: What surrogates work best for
understanding collections of mementos?
78
What information is available to users of the existing Archive-It story?
Discover patterns in metadata usage that may indicate the semantic type of collection.
How well do our stories compare to the existing metadata? (Our Story Content vs. Existing Metadata For Seeds)
How well do our stories cover the content of the underlying collection? (Collection Content vs. Our Story Content)
How well does the Archive-It story cover the underlying collection? (Collection Content vs. Archive-It Story Content)
How well do surrogates cover the content of their mementos? (Memento Content vs. Surrogate Content)
Similarity metrics will be used for evaluating coverage.
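One way a similarity metric could stand in for coverage is sketched below: cosine similarity between the term frequencies of a story and those of the underlying collection. The slides do not say which metrics will be used, so this choice, the function name, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors (1.0 = same distribution)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document covers nothing
    return dot / (norm_a * norm_b)


# Hypothetical token lists standing in for collection and story content.
collection_tokens = "flood louisiana rescue shelter flood damage".split()
story_tokens = "flood louisiana rescue".split()
unrelated_tokens = "olympics medal ceremony".split()

story_coverage = cosine_similarity(collection_tokens, story_tokens)
unrelated_coverage = cosine_similarity(collection_tokens, unrelated_tokens)
```

The same function could be applied to each pair listed above, for example memento content against surrogate content, so that all four comparisons share one scale.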
Evaluation of RQ3: How do we select representative
mementos for different semantic types of collections?
79
We will develop different algorithms and compare their output with several metrics to determine which algorithms provide the best “aboutness” for the collection.
[Chart: scores on a 0 to 10 axis for DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 across six metrics: Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, and Performance.]
RQ4: How well do stories produced by different summarization
algorithms work for collection understanding?
80
How well do our generated stories compare to the existing Archive-It interface?
Do study participants understand key concepts of the collection represented by the story?
Using the stories, can participants tell the difference between similar collections?
Can participants compare stories and tell which are similar?
Does the addition of existing metadata improve the participant’s performance?
Does the layout of the surrogates improve the participant’s performance?
RQ1: Collection Types ✅
RQ2: Surrogate Types ☑️
RQ3: Selecting Mementos ☑️
RQ4: Evaluating Stories
We plan to have completed this research in 2021…
81
[Timeline of venues: DTMH 2017, JCDL 2018, iPRES 2018 (listed twice), CIKM 2019, ECIR 2020, WWW 2020, JCDL 2020, CIKM 2020, WebSci 2021]
Our methods are not just for Archive-It
82
Our methods will be applicable to web archive collections created on
other platforms, like Rhizome’s Webrecorder.
Motivation Summary
• Collection understanding is a problem with web archive collections
  • inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
  • relied upon Storify as a visualization medium
83
Contributions
• Existing work:
  • Derived semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
  • MementoEmbed – surrogates for the past web
  • Social cards probably provide better understanding of collections
  • Off-Topic Memento Toolkit – Identifying off-topic mementos
• Future work:
  • Evaluate algorithms for surfacing a representative sample from a document collection
  • Evaluate different surrogate types via user evaluation
  • Show which surrogate-sample combinations work best for collection understanding via user evaluation
84
Discussion
  • 1. @shawnmjones @WebSciDL Improving Understanding of Web Archive Collections Through Storytelling PhD Candidacy Exam for: Shawn M. Jones Committee: Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna Thanks to:
  • 2. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 4. Proposed Research 2
  • 4. @shawnmjones @WebSciDL Let’s say: you find a bag There are thousands of different items inside. Can you use the contents of this bag? How quickly can you make this decision? 4
  • 5. @shawnmjones @WebSciDL Now let’s say: there are thousands of bags Which one might contain something useful for you? Do any? How do you know? How do you decrease your chances of wasting your time? 5
  • 6. @shawnmjones @WebSciDL What does this have to do with web archives? 6
  • 7. @shawnmjones @WebSciDL Researchers create their own web archive collections 7 Archived web pages, or mementos, are used by journalists, sociologists, and historians. Tucson Shootings2008 OlympicsUniversity of Utah
  • 8. @shawnmjones @WebSciDL Web archive collections have many versions of the same page 8 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015
  • 9. @shawnmjones @WebSciDL Different versions allow us to see an unfolding news story 9 Memento from April 19, 2013 17:12 Searching for suspects, City on lockdown Memento from April 19, 2013 17:59 Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 24, 2013 2:24 Suspect Found, Office collier lost life, Obama speaks
  • 10. @shawnmjones @WebSciDL Different versions allow us to see changes in an organization’s web presence 10 The White House: 2016 The White House: 2018
  • 11. @shawnmjones @WebSciDL Archive-It allows curators to easily create collections Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 11
  • 12. @shawnmjones @WebSciDL … and these collections are used by other researchers 12 The collection curator is not the only user of the collection! These collections live a life after their curator has stopped adding to them.
  • 13. @shawnmjones @WebSciDL How do we tell the difference between collections? What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use? 13
  • 14. @shawnmjones @WebSciDL 14 31 Archive-It collections match the search query “human rights” How are they different from each other? Which one is best for my needs?
  • 15. @shawnmjones @WebSciDL Archive-It provides fields for metadata 15 Collection-wide metadata Metadata on individual seeds Dublin Core + Custom Fields
  • 16. @shawnmjones @WebSciDL But, alas the metadata does not help Because metadata is optional it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation 16 9 seeds with metadata 132,599 seeds no metadata
  • 17. @shawnmjones @WebSciDL But, alas the metadata does not help Because metadata is optional it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation • it is inconsistently applied This means that a user cannot reliably compare metadata fields to understand the differences between collections. 17 132,599 seeds no metadata 9 seeds with metadata Paradox: More seeds = more effort More seeds = greater user need for metadata
  • 18. @shawnmjones @WebSciDL Reviewing mementos manually is costly This collection has 132,599 seeds, many with multiple mementos Some collections have 1000s of seeds Each seed can have many mementos In some cases, this can require reviewing 100,000+ documents to understand the collection 18
  • 19. @shawnmjones @WebSciDL More Archive-It collections are added every year More than 8000 collections exist as of the end of 2016 19
  • 20. @shawnmjones @WebSciDL The problem, summarized  There are multiple collections about the same concept.  The metadata for each collection is non-existent, or inconsistently applied.  Many collections have 1000s of seeds with multiple mementos.  There are more than 8000 collections. 20
  • 21. @shawnmjones @WebSciDL The problem, summarized  There are multiple collections about the same concept.  The metadata for each collection is non-existent, or inconsistently applied.  Many collections have 1000s of seeds with multiple mementos.  There are more than 8000 collections.  Human review of these mementos for collection understanding is an expensive proposition. 21
  • 22. @shawnmjones @WebSciDL Our proposal: a visualization made of representative mementos  Our visualization is a summary that will act like an abstract  Pirolli and Card’s Information Foraging Theory:  maximize the value of the information gained from our summaries  minimize the cost of interacting with the collection  ensure that our representative mementos have good information scent  contain cues that the memento will address a user’s needs 22 From this: 318 seeds with 2421 mementos To something like this: a social media story of ~28 surrogates P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  • 23. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 4. Proposed Research 23
  • 24. @shawnmjones @WebSciDL Surrogates provide a visual summary of the content behind a URI… 24 https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35 .3644614,- 109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1 s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36 .8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1 d-106.287162!2d35.8440582 Long URI: The same URI represented by a browser thumbnail surrogate: The same URI represented by a social card surrogate:
  • 25. @shawnmjones @WebSciDL Social media storytelling uses surrogates to provide a “summary of summaries” 25 2 resources are shown from this Wakelet story6 resources are shown from this Storify story Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
  • 26. @shawnmjones @WebSciDL Traditional surrogates contain metadata generated by humans to convey aboutness 26
  • 27. @shawnmjones @WebSciDL Web surrogates provide a visual summary of a web resource drawn from the content of the resource 27 Browser Thumbnail (example from UK Web Archive)Text snippet (example from Bing) Social Card (example from Facebook) Text + Thumbnail (example from Internet Archive) S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018- 04-24-lets-get-visual-and-examine.html, 2018.
  • 28. @shawnmjones @WebSciDL Our research questions  RQ1: What types of web archive collections exist?  RQ2: What surrogates work best for understanding collections of mementos?  RQ3: How do we select representative mementos for the different semantic types of collections?  RQ4: How well do stories produced by different summarization algorithms work for collection understanding? 28 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types
  • 29. @shawnmjones @WebSciDL RQ2: What surrogates work best for web resources? 29 Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding. Li (2008) social cards > text snippets in performance Dziadosz (2002) text + thumbnail > text snippet text snippet > thumbnail in performance Woodruff (2001) thumbnails > text snippets in performance Teevan (2009) text snippets > thumbnails in performance Aula (2010) text snippets ~= thumbnails in performance Loumakis (2011) text snippets ~= social cards in performance social cards > text snippets in information scent and user preference Capra (2013) social cards > text snippets In performance (barely statistically significant) Al Maqbali (2010) text + thumbnail ~= social card text snippet ~= social card text + thumbnail ~= text snippet in performance S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018- 04-24-lets-get-visual-and-examine.html, 2018.
  • 30. @shawnmjones @WebSciDL RQ3: How might we select representative mementos? Luhn (1958) • automatic abstracts Silva (2014) • word graphs from Luhn’s algorithm DUC Datasets (2001-2007) Napoles (2012) • Gigaword Lin (2014) • ROUGE metrics Grusky (2018) • NEWSROOM • Existing reference summaries were built from news articles. • Existing reference summaries were not built from web archives. Mihalcea (2004) • TextRank Dolan (2004) • clustering news articles • Lede3 preferred by evaluators Xie (2008) • MMR for meeting summaries Radev (1998) • automatic news briefs Xie (2008) • MMR for meeting summaries Sipos (2008) • scholarly corpus over time Zhang (2010)/Li (2011) • aspects of disasters Hong (2014) • word weighting 30
  • 31. @shawnmjones @WebSciDL RQ3: How might we select representative mementos? – Related Concepts  Scatter-Gather (Cutting 1992)  allows a user to explore a collection by drilling through topic cluster until they reach individual documents  we seek to provide a representative sample that a user can quickly glance  Recommender Systems  predicts the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends  we want to provide a summary without any knowledge of the user  Zero-Query Systems  predicts the information a user will need based on time, location, environment, user interests, and other factors  again, we want to provide a summary with no knowledge of the user 31 Image reference: Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval(SIGIR '92). Copenhagen, Denmark, pp. 318- 329. https://doi.org/10.1145/133160.133214
  • 32. @shawnmjones @WebSciDL How have others explored collections? 32 Conta Me Histórias ArchiveSpark Archives Unleashed Cloud Existing solutions allow users to query and develop statistics on collections. Users must have some ideas of a topic or concept a priori.
  • 33. @shawnmjones @WebSciDL How have others visualized collections for understanding? 33 Other attempts at visualizing Archive-It collections tried to visualize everything. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis- visualizing.html K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12) 15 – 18. DOI:10.1145/2232817.2232821
  • 34. @shawnmjones @WebSciDL How have others told stories with web archive collections? 34  AlNoamany told stories via the storytelling platform Storify  She proved that test participants could not detect the difference between her automated stories and stories generated by human curators Characteristicsof human-generated Stories Characteristicsof Archive-It collections Exclude duplicates Exclude off-topic pages Exclude non-English Language Dynamically slice the collection Cluster the pages in each slice Select high-quality pages from each cluster Order pages by time Visualize Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
  • 35. @shawnmjones @WebSciDL How have others told stories with web archive collections? 35  AlNoamany told stories via the storytelling platform Storify – which is no longer in service  She proved that test participants could not detect the difference between her automated stories and stories generated by human curators Characteristicsof human-generated Stories Characteristicsof Archive-It collections Exclude duplicates Exclude off-topic pages Exclude non-English Language Dynamically slice the collection Cluster the pages in each slice Select high-quality pages from each cluster Order pages by time Visualize x S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?” http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html 2017. x
  • 36. @shawnmjones @WebSciDL How have others told stories with web archive collections?  AlNoamany told stories via the storytelling platform Storify – which is no longer in service  She proved that test participants could not detect the difference between her automated stories and stories generated by human curators  Did not evaluate if the resulting summaries were effective tools for collection understanding  Focused on summarizing collections about events  There are other types of Archive-It collections Characteristicsof human-generated Stories Characteristicsof Archive-It collections Exclude duplicates Exclude off-topic pages Exclude non-English Language Dynamically slice the collection Cluster the pages in each slice Select high-quality pages from each cluster Order pages by time Visualize 36 x x
  • 37. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 4. Proposed Research 37
  • 38. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 1. RQ1: What types of web archive collections exist? 2. Partial RQ2: What surrogates work best for understanding collections of mementos? 1. How effective are existing curation platforms at producing mementos? 2. Preliminary user surrogate study 3. Partial RQ3: How do we select representative mementos for the different semantic types of collections? 1. The Off-Topic Memento Toolkit (OTMT) 4. Proposed Research 38
  • 39. @shawnmjones @WebSciDL As collection users, we view Archive-It collections from outside… 39 • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 40. @shawnmjones @WebSciDL As collection users, what structural features can we view from outside? 40  Using only structural features is advantageous because it saves one from having to download a collection’s content.  These structural features give us different insight than can be provided by text analysis or metadata. 81,014 seeds 486,227 seed mementos Structural features shown here: • number of seeds • number of mementos S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 41. @shawnmjones @WebSciDL Was the collection built from web sites belonging to one domain or many? 41 Many domains One domain Structural feature discussed here: • domain diversity S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 42. @shawnmjones @WebSciDL Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? 42 Top-level pages Deeper links Structural feature discussed here: • path depth diversity • most frequent path depth S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 43. @shawnmjones @WebSciDL Growth curves provide some understanding of collection curation behavior 43 • Skew of the collection’s holdings • Indicates temporality of collection • Skew of the curatorial involvement with the collection • When seeds were added • When interest was lost or regained (Positive) (Positive) (Negative) (Negative) S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 44. @shawnmjones @WebSciDL Does most of the collection exist earlier or later in its life? 44 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life. Structural feature discussed here: • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 45. @shawnmjones @WebSciDL When did the curator select and archive a collection’s contents? 45 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected all along its life. It has mementos as recent as July 2018. Structural feature discussed here: • area under the seed growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 46. @shawnmjones @WebSciDL Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 46 This collection was created in June 2014. The seeds were selected toward the beginning of its life. Mementos were captured all during its life. Structural feature discussed here: • area under the seed growth curve • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 47. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… 47 Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 48. @shawnmjones @WebSciDL 48 Self-Archiving 54.1% of collections Subject-based Time Bounded – Expected Time Bounded – Spontaneous In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P We discovered four semantic categories in Archive-It collections…
  • 49. @shawnmjones @WebSciDL 49 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected Time Bounded – Spontaneous In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P We discovered four semantic categories in Archive-It collections…
  • 50. @shawnmjones @WebSciDL 50 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P We discovered four semantic categories in Archive-It collections…
  • 51. @shawnmjones @WebSciDL 51 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P We discovered four semantic categories in Archive-It collections…
  • 52. @shawnmjones @WebSciDL 52 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P We discovered four semantic categories in Archive-It collections…
  • 53. @shawnmjones @WebSciDL We can bridge the structural to the descriptive… 53 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720 S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 54. @shawnmjones @WebSciDL We have identified different types of Archive-It collections 54 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ We can take these features into account to address the other research questions. So, let’s tell some stories on social media! Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous
  • 55. @shawnmjones @WebSciDL We have identified different types of Archive-It collections 55 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ We can take these features into account to address the other research questions. So, let’s tell some stories on social media! Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous Not so fast…
  • 56. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 1. RQ1: What types of web archive collections exist? 2. Partial RQ2: What surrogates work best for understanding collections of mementos? 1. How effective are existing curation platforms at producing mementos? 2. Preliminary user surrogate study 3. Partial RQ3: How do we select representative mementos for the different semantic types of collections? 1. The Off-Topic Memento Toolkit (OTMT) 4. Proposed Research 56
  • 57. @shawnmjones @WebSciDL Existing platforms do not reliably produce surrogates for mementos… If we cannot rely upon the service to generate a surrogate for a memento, our system must then do the work to create our own surrogates. S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
  • 58. @shawnmjones @WebSciDL Some services have stories, but not long-term storytelling? Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/) Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/) Instagram stories (image ref: https://buffer.com/library/instagram-stories) These platforms delete the user’s stories 24 hours after they are posted. This form of social media storytelling is the opposite of what we are looking for. We want the stories to be artifacts themselves.
  • 59. @shawnmjones @WebSciDL Some services’ longevity is in doubt… RIP: Google+ 2019 RIP: Tumblr (soon?) RIP: Storify 2018 S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
  • 60. @shawnmjones @WebSciDL Existing surrogate services create a confusing experience for mementos. Who published these resources? Archive-It? CNN? Is the story author sharing fake news? S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018. embed.rocks surrogate embed.ly surrogate
  • 61. @shawnmjones @WebSciDL Neither social media services nor surrogate services were reliable for storytelling, so we created MementoEmbed… Information in the MementoEmbed social card surrogate is separated to avoid issues of confusion about attribution. MementoEmbed is archive-aware. It can locate information about the memento that is not available in other surrogates. S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
  • 62. @shawnmjones @WebSciDL MementoEmbed provides us with a tool for evaluating surrogates, a step on the road to answering RQ2… 62 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️
  • 63. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 1. RQ1: What types of web archive collections exist? 2. Partial RQ2: What surrogates work best for understanding collections of mementos? 1. How effective are live web curation platforms at producing mementos? 2. Preliminary user surrogate study 3. Partial RQ3: How do we select representative mementos for the different semantic types of collections? 1. The Off-Topic Memento Toolkit (OTMT) 4. Proposed Research 63
  • 64. @shawnmjones @WebSciDL Using stories built from curator-selected mementos, we shared stories with MT participants… Surrogate types shown: Archive-It-like, Social Card, Browser Thumbnails, Social Card With Thumbnail as Image (sc/t), Social Card With Thumbnail to Right (sc+t), Social Card with Thumbnail on Hover (sc^t) • 4 stories of 15-17 mementos selected by human Archive-It curators from their collections • 6 different surrogate types • 24 different story-surrogate combinations • 120 MT participants • Given 30 seconds to view each story S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
  • 65. @shawnmjones @WebSciDL And then we asked them which 2 of 6 mementos came from the same collection… • Each participant was shown a list of 6 surrogates of the same type as the story they just viewed. • They were asked to choose the 2 that they thought came from the same collection. • They were given as much time as they wished to answer the question. • This is similar to the Sentence Verification Task from reading comprehension studies. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
  • 66. @shawnmjones @WebSciDL Response times per surrogate had interesting means, but the differences were not statistically significant at p < 0.05. p = 0.190 p = 0.202 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
  • 67. @shawnmjones @WebSciDL Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate. [Bar chart: Correct Answers Per Surrogate, median and mean, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t.] p = 0.0569 p = 0.0770 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
  • 68. @shawnmjones @WebSciDL Whenever thumbnails are present, more users interact with them 68 We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted with a thumbnail, regardless of surrogate. For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
  • 69. @shawnmjones @WebSciDL We have some results indicating that social cards perform better, but there is more to answering RQ2… RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️ [Bar chart: Correct Answers Per Surrogate, median and mean, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t.]
  • 70. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 1. RQ1: What types of web archive collections exist? 2. Partial RQ2: What surrogates work best for understanding collections of mementos? 3. Partial RQ3: How do we select representative mementos for the different semantic types of collections? 1. The Off-Topic Memento Toolkit (OTMT) 4. Proposed Research 70
  • 71. @shawnmjones @WebSciDL Identifying off-topic mementos is key to choosing representative mementos 71 Hacked Moved on from topic Collections have a theme Seeds are selected to support that theme Mementos are versions of seeds Some of these versions are off-topic Identifying these off-topic mementos is key to summarization Web Page Gone Account Suspension S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
  • 72. @shawnmjones @WebSciDL The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
    Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
    Byte Count | 0.0 | -1.0 | No | bytecount
    Word Count | 0.0 | -1.0 | Yes | wordcount
    Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
    Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
    Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
    Simhash of raw memento | 0 | 64 | No | simhash-raw
    Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
    Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
    S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
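As a rough illustration of three of the measures named on this slide, the following Python sketch compares a seed's first memento with a later one. The function names, the simple regular-expression tokenizer, and the relative-change definition of the word count score are assumptions made for illustration; they are not taken from the OTMT source code.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic stand-in for preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_count_score(first, later):
    """Relative change in word count versus the first memento (assumed definition).

    0.0 means the count is unchanged; -1.0 means every word is gone.
    """
    first_count = len(tokenize(first))
    later_count = len(tokenize(later))
    if first_count == 0:
        # Undefined when the first memento has no words; treated as "no change" here.
        return 0.0
    return (later_count - first_count) / first_count


def jaccard_distance(first, later):
    """1 - |A ∩ B| / |A ∪ B| over the sets of terms: 0.0 if identical, 1.0 if disjoint."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, later):
    """1 - 2|A ∩ B| / (|A| + |B|) over the sets of terms: 0.0 if identical, 1.0 if disjoint."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

Under these assumptions, a later memento that has lost all of its text scores -1.0 on word count, and one that shares no terms with the first memento scores 1.0 on both set-based distances, matching the "fully dissimilar" values listed for those measures.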
  • 73. @shawnmjones @WebSciDL After repeating AlNoamany’s experiment, Word Count had the best F1 score for identifying off-topic mementos… We reused AlNoamany’s labeled dataset. She did not try: • Sørensen-Dice • Simhash of raw content • Simhash of TF • Gensim LSI Our word count accuracy came out ahead of AlNoamany’s. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
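To make the F1 comparison concrete, here is a minimal Python sketch of scoring a threshold-based off-topic detector against human labels. The function name, the candidate thresholds, and the sample scores and labels are all hypothetical and invented for illustration; they are not the values or data used in the experiment.

```python
def f1_for_threshold(scores, labels, threshold):
    """Flag a memento as off-topic when its score is at or below the threshold, then compute F1.

    scores: word count scores (0.0 = unchanged, -1.0 = all words gone)
    labels: True where a human labeled the memento off-topic
    """
    predicted = [score <= threshold for score in scores]
    pairs = list(zip(predicted, labels))
    true_pos = sum(1 for p, l in pairs if p and l)
    false_pos = sum(1 for p, l in pairs if p and not l)
    false_neg = sum(1 for p, l in pairs if l and not p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores and labels, invented purely for illustration.
scores = [0.0, -0.1, -0.9, -0.95, -0.5, -0.88]
labels = [False, False, True, True, True, False]

# Try a few hypothetical thresholds and keep the one with the highest F1.
candidates = [-0.3, -0.6, -0.85]
best = max(candidates, key=lambda t: f1_for_threshold(scores, labels, t))
```

The same loop could be repeated for each measure, with the comparison direction flipped for measures where a higher score means more dissimilar.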
  • 74. @shawnmjones @WebSciDL Finding off-topic mementos is one of the first steps to addressing RQ3… 74 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️
  • 75. @shawnmjones @WebSciDL Outline 1. Motivation 2. Research Questions 3. Preliminary Work 4. Proposed Research 75
  • 76. @shawnmjones @WebSciDL This work requires a flexible framework – Dark and Stormy Archives (DSA) 2.0 [Diagram: components OTMT, Hypercane, Raintale, MementoEmbed, and Archive-It Utilities, four of them marked ✅, connected by arrows labeled “calls”, “provides input to”, “input”, and “output”. Input: Web Archive Collection (thousands of HTML documents). Intermediate result: < 30 representative mementos. Output: Story (mementos visualized as surrogates). The components are grouped as tools for selecting representative mementos and tools for visualizing mementos as a story.] S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
  • 77. @shawnmjones @WebSciDL Evaluation of RQ2: What surrogates work best for understanding collections of mementos? How well do users perform with different types of surrogates?
    1. Select 5 collections from each semantic category
    2. Select the earliest memento of each of the first 20 seeds from each collection – this is the number of surrogates a user views if they open an Archive-It story and page down once
    3. Present the participant with a story of 20 surrogates, varying the surrogate between participants
    4. Ask them to address a user task
    Variations:
    • For step #3, vary the time for participants to view the story
      • participants view for 5, 10, 20, 30 seconds
      • may surface the ability to “glance” and understand
      • some surrogates consist only of title, URI, etc.
      • may determine which surrogate elements perform best
    • For step #4, ask the participant to:
      • determine if the collection behind the story is suited for a task – similar to traditional IR research
      • identify which items likely belong to the same collection
    • Instead of steps 3 and 4 – ask former participants which surrogate they prefer for a given task
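Step 2 of the procedure above could be sketched as follows in Python. The input structure (an ordered list of seeds, each with a list of memento datetimes and URIs), the function name, and the choice to skip seeds that have no mementos are assumptions for illustration, not a description of an existing tool.

```python
from datetime import datetime


def earliest_mementos(seeds, limit=20):
    """Return (seed_uri, memento_uri) for the earliest memento of each of the first `limit` seeds.

    seeds: ordered list of (seed_uri, [(memento_datetime, memento_uri), ...])
    Seeds among the first `limit` that have no mementos are skipped (an assumption).
    """
    selected = []
    for seed_uri, mementos in seeds[:limit]:
        if not mementos:
            continue
        # Tuples sort by their first element, so min() picks the earliest datetime.
        _, memento_uri = min(mementos)
        selected.append((seed_uri, memento_uri))
    return selected


# Hypothetical collection with placeholder URIs.
example_seeds = [
    ("http://example.org/a", [
        (datetime(2015, 3, 5), "http://archive.example/2015/a"),
        (datetime(2013, 1, 1), "http://archive.example/2013/a"),
    ]),
    ("http://example.org/b", []),
]
story_input = earliest_mementos(example_seeds)
```

The resulting list of memento URIs would then be rendered as surrogates of whichever type is under test in step 3.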
  • 78. @shawnmjones @WebSciDL Evaluation of RQ2: What surrogates work best for understanding collections of mementos? 78 What information is available to users of the existing Archive-It story? Discover patterns in metadata usage that may indicate the semantic type of collection. How well do our stories compare to the existing metadata? How well do our stories cover the content of the underlying collection? How well does the Archive-It story cover the underlying collection? How well do surrogates cover the content of their mementos? Collection Content Our Story Content Collection Content Archive-It Story Content Memento Content Surrogate Content Our Story Content Existing Metadata For Seeds Similarity metrics will be used for evaluating coverage.
  • 79. @shawnmjones @WebSciDL Evaluation of RQ3: How do we select representative mementos for different semantic types of collections? We will develop different algorithms and compare their output with several metrics to determine which algorithms provide the best “aboutness” for the collection. [Chart: scores on a 0 to 10 scale for DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 across Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, and Performance.]
  • 80. @shawnmjones @WebSciDL RQ4: How well do stories produced by different summarization algorithms work for collection understanding? 80 How well do our generated stories compare to the existing Archive-It interface? Do study participants understand key concepts of the collection represented by the story? Using the stories, can participants tell the difference between similar collections? Can participants compare stories and tell which are similar? Does the addition of existing metadata improve the participant’s performance? Does the layout of the surrogates improve the participant’s performance? RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️
  • 81. @shawnmjones @WebSciDL We plan to have completed this research in 2021… 81 iPres 2018 iPres 2018 CIKM 2019 ECIR 2020 WWW 2020 CIKM 2020 WebSci 2021 JCDL 2020 JCDL 2018 DTMH 2017
  • 82. @shawnmjones @WebSciDL Our methods are not just for Archive-It. Our methods will be applicable to web archive collections created on other platforms, like Rhizome’s Webrecorder.
  • 83. @shawnmjones @WebSciDL Motivation Summary
    • Collection understanding is a problem with web archive collections
      • inconsistent metadata
      • 1000s of mementos
      • 1000s of collections
      • costly for human review
    • We intend to produce a visualization that serves as an abstract to assist in collection understanding
    • Prior work in this area:
      • did not evaluate how well this method works for collection understanding
      • only focused on collections about events
      • relied upon Storify as a visualization medium
  • 84. @shawnmjones @WebSciDL Contributions
    • Existing work:
      • Derived semantic categories of web archive collections in Archive-It
      • Categories can be predicted by using structural features
      • Most collections are not about events
      • MementoEmbed – surrogates for the past web
      • Social cards probably provide better understanding of collections
      • Off-Topic Memento Toolkit – Identifying off-topic mementos
    • Future work:
      • Evaluate algorithms for surfacing a representative sample from a document collection
      • Evaluate different surrogate types via user evaluation
      • Show which surrogate-sample combinations work best for collection understanding via user evaluation
  • 85. @shawnmjones @WebSciDL Improving Understanding of Web Archive Collections Through Storytelling PhD Candidacy Exam for: Shawn M. Jones Committee: Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna Thanks to: