Improving Collection Understanding in Web Archives

@shawnmjones @WebSciDL
Shawn M. Jones
Web Science and Digital Libraries Research Group
Advisors: Michael L. Nelson and Michele C. Weigle
Thanks to:

Researchers Create Their Own Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Example collections: Tucson Shootings, 2008 Olympics, University of Utah

Web Archive Collections Have Many Versions of the Same Page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: mementos from 2013, 2015, and 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: mementos from 2/12/2015, 3/5/2015, and 4/1/2015

Different Versions Allow Us to See an Unfolding News Story
Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks

Different Versions Allow Us To See Changes In An Organization’s Web Presence
The White House: 2016
The White House: 2018

Archive-It Provides For Easy Collection Creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.

The Problem of Collection Understanding
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?

31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?

Archive-It provides fields for metadata
Collection-wide metadata and metadata on individual seeds:
Dublin Core + custom fields

But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
Examples: 9 seeds with metadata; 132,599 seeds with no metadata
Paradox of metadata: more seeds = more effort
Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection

More Archive-It collections are added every year
More than 8000 collections exist as of the end of 2016

The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.

The proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a visualization of ~28 social cards
Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20

Background and Related Work

Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their
memento-datetimes
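
As a rough illustration of what a TimeMap holds, the sketch below parses a small link-format TimeMap (the serialization described by the Memento protocol, RFC 7089) and lists each memento with its memento-datetime. The sample URIs and the `parse_timemap` helper are made up for this example; a real TimeMap would be fetched from the archive, and the regular expressions are a simplification, not a complete link-format parser.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link format; all URIs are placeholders.
SAMPLE_TIMEMAP = """\
<http://example.com/>; rel="original",
<http://archive.example/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT",
<http://archive.example/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT"
"""

# One link-value: <URI> followed by zero or more ;name="value" attributes.
LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento-datetime, URI-M) pairs sorted by capture time."""
    mementos = []
    for uri, raw_attrs in LINK.findall(text):
        attrs = dict(ATTR.findall(raw_attrs))
        # rel may hold several space-separated values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


for when, uri in parse_timemap(SAMPLE_TIMEMAP):
    print(when.isoformat(), uri)
```

The entry with rel="original" is skipped because it identifies the live seed, not a capture.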

Document collections have aspects
• Metadata on a publication:
  • used as a surrogate for understanding
  • answers anticipated questions
• Aspects:
  • The central concepts of the corpus
  • For example, aspects about a disaster:
    • time
    • place
    • cause
    • countermeasures
• Aspects correspond to the questions that a user might have about a collection
Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.

How can we surface aspects?
• Named Entity Recognition can answer questions of who or where
• Natural Language Processing can answer questions of what time period
• Topic modeling can surface general concepts from the corpus
• And we have to be cognizant of these concepts over time
Examples: Archive-It Collection 8121, “The Obama White House”; Archive-It Collection 8513, “Donald J Trump White House”

Visualizing web resources (surrogates)
Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)

Which surrogate is best for web resources?
Li (2008)
social cards > text snippets
in performance
Dziadosz (2002)
text + thumbnail > text snippet
text snippet > thumbnail
in performance
Woodruff (2001)
thumbnails > text snippets
in performance
Teevan (2009)
text snippets > thumbnails
in performance
Aula (2010)
text snippets ~= thumbnails
in performance
Loumakis (2011)
text snippets ~= social cards
in performance
in information scent and user preference
Capra (2013)
In performance
(barely statistically significant)
Al Maqbali (2010)
text + thumbnail ~= social card
text snippet ~= social card
text + thumbnail ~= text snippet
in performance
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html

Which surrogate is best for web resources?
Studies on visualizing web resources have focused primarily on
determining search engine result relevance and not collection understanding.

Visualizing Archive-It Collections
Other attempts at visualizing Archive-It collections tried to visualize everything.
Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html

Prior work by AlNoamany
• Visualized summaries via the storytelling platform Storify
• Showed that test participants could not detect the difference between her automated summaries and human-generated summaries
Inputs: characteristics of human-generated stories; characteristics of Archive-It collections
Pipeline:
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science
Conference, 309–318. DOI:10.1145/3091478.3091508
http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
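
To make the order of the stages on this slide concrete, here is a toy skeleton of such a pipeline. It is only a sketch: the `Memento` record, the `summarize` function, and every heuristic in it (exact-text duplicate detection, equal-size slices, first-word clustering, longest-text selection) are placeholders invented for illustration and are not AlNoamany's actual measures.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    """Hypothetical record; on_topic and language stand in for real detectors."""
    uri: str
    captured: datetime
    text: str
    on_topic: bool = True
    language: str = "en"


def summarize(mementos, slices=2):
    # Exclude duplicates, off-topic pages, and non-English pages.
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m.captured):
        if m.text in seen or not m.on_topic or m.language != "en":
            continue
        seen.add(m.text)
        kept.append(m)
    if not kept:
        return []

    # Slice the collection by time (placeholder: equal-size slices).
    size = -(-len(kept) // slices)  # ceiling division
    sliced = [kept[i:i + size] for i in range(0, len(kept), size)]

    chosen = []
    for time_slice in sliced:
        # Cluster the pages in each slice (placeholder: group by first word).
        clusters = {}
        for m in time_slice:
            words = m.text.split()
            key = words[0].lower() if words else ""
            clusters.setdefault(key, []).append(m)
        # Select one page per cluster (placeholder: the longest text).
        chosen.extend(max(c, key=lambda m: len(m.text)) for c in clusters.values())

    # Order pages by time; visualization would follow.
    return sorted(chosen, key=lambda m: m.captured)


pages = [
    Memento("a", datetime(2013, 4, 19, 17, 12), "lockdown in city"),
    Memento("b", datetime(2013, 4, 19, 17, 59), "lockdown in city"),
    Memento("c", datetime(2013, 4, 19, 18, 0), "lockdown loosened across the city"),
    Memento("d", datetime(2013, 4, 20, 2, 24), "suspect found"),
    Memento("e", datetime(2013, 4, 20, 3, 0), "page not found", on_topic=False),
]
print([m.uri for m in summarize(pages)])
```

Each placeholder can be swapped for a real measure without changing the overall shape: filter, slice, cluster, select, order.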

• The storytelling platform Storify is no longer in service
  http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
• Did not evaluate whether the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
• There are other types of Archive-It collections

Preliminary Work

Growth curves for understanding collection creation behavior
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
Figure: example growth curves showing positive and negative skew
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.

Structural features of Archive-It collections
• difference between seed curve AUC and diagonal
• difference between seed memento curve AUC and diagonal
• difference between seed memento curve AUC and seed curve AUC
• number of seeds
• number of mementos
• seed URI domain diversity
• seed URI path depth diversity
• most frequent seed URI path depth
• % query string usage in seed URIs
• lifespan of collection
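
The three AUC-based features can be sketched as follows. This assumes a growth curve plots the fraction of the collection's lifespan on the x-axis against the cumulative fraction of seeds (or seed mementos) on the y-axis, so the diagonal has an area of 0.5. The function names, the trapezoidal integration, and the sample dates are choices made for this example, not details taken from the slides.

```python
from datetime import datetime


def growth_curve(event_times, start, end):
    """Points (fraction of lifespan, cumulative fraction of events)."""
    span = (end - start).total_seconds()
    if span <= 0 or not event_times:
        raise ValueError("need a positive lifespan and at least one event")
    xs = sorted((t - start).total_seconds() / span for t in event_times)
    n = len(xs)
    points = [(0.0, 0.0)]
    points += [(x, (i + 1) / n) for i, x in enumerate(xs)]
    points.append((1.0, 1.0))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_minus_diagonal(event_times, start, end):
    """Positive when events cluster early in the lifespan, negative when late."""
    return auc(growth_curve(event_times, start, end)) - 0.5


# Hypothetical collection: seeds added up front, mementos captured late.
start, end = datetime(2016, 1, 1), datetime(2017, 1, 1)
seed_added = [start] * 4
memento_captured = [end] * 4

seed_diff = auc_minus_diagonal(seed_added, start, end)
memento_diff = auc_minus_diagonal(memento_captured, start, end)
print(seed_diff, memento_diff, memento_diff - seed_diff)
```

Under these assumptions, the third feature is simply the difference between the first two, since the diagonal term cancels.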

Semantic categories of Archive-It collections
In a study of 3,382 Archive-It collections:
• Self-Archiving: 54.1% of collections
• Subject-based
• Time Bounded – Expected
• Time Bounded – Spontaneous: 4.2% of collections, some evaluated by AlNoamany
Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
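
The slide reports only the final score. As a reminder of what an F1 score measures in a multi-class setting like this one, the sketch below computes per-class F1 from true and predicted labels and averages them. The macro average is an assumption made for this example (the slide does not say which averaging scheme was used), and the labels and predictions are invented.

```python
def f1_per_class(y_true, y_pred):
    """F1 for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented labels for four collections.
y_true = ["self-archiving", "self-archiving", "subject-based", "spontaneous"]
y_pred = ["self-archiving", "subject-based", "subject-based", "spontaneous"]

print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```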

Research Plan

Developing a Flexible Framework
Dark and Stormy Archives (DSA) 2.0
A framework based on AlNoamany’s work
Two concepts are embodied in this framework:
1. Selecting representative mementos
2. Visualizing those mementos
Framework diagram components: Off-Topic Memento Toolkit, Representative Memento Selection Utilities, Archive-It Utilities, MementoEmbed, DSA Visualization Interface
Input: Web Archive Collection
Output: Visualized Summary
Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.

Not just Archive-It
Our methods will be applicable to any web archive collection, such as those created with Rhizome’s Webrecorder.

Evaluation

Evaluation
1. Choose target collections for study
2. Develop user tasks for each collection
   Example tasks: Who is X? Where is Y? When does Z take place?
3. How well do users complete the tasks?

RQ1: How do we select representative mementos for the different semantic types of collections?
• Summarizing a collection involves:
  1. Grouping the mementos by their commonalities
  2. Selecting the highest-quality mementos from each group
• Different semantic categories may require different algorithms
• We want to reuse existing tools where possible:
  • Stanford NLP
  • Archives Unleashed Toolkit
  • gensim
  • spaCy

RQ1 Evaluation
1. How many user tasks were addressed by the mementos chosen? How many
user tasks failed?
2. How many mementos produced are not useful for any user task?
3. Which algorithm surfaces aspects satisfying the highest mean number of user
tasks for a given collection type?
4. What is the mean minimum number of mementos necessary to address the
most user tasks?
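
The counts asked for above can be sketched in a few lines once each chosen memento has been annotated with the user tasks it helps answer. Everything here is hypothetical: the `coverage_report` and `greedy_min_cover` names, the task and memento identifiers, and the use of a greedy set-cover heuristic to approximate the minimum number of mementos (greedy selection is a stand-in, not a method named on the slide, and it does not guarantee the true minimum).

```python
def coverage_report(tasks, addresses):
    """addresses maps a memento id to the set of task ids it helps answer."""
    tasks = set(tasks)
    covered = set().union(*addresses.values()) & tasks
    return {
        "addressed": sorted(covered),
        "failed": sorted(tasks - covered),
        "unused": sorted(m for m, t in addresses.items() if not (t & tasks)),
    }


def greedy_min_cover(tasks, addresses):
    """Greedy approximation of the smallest memento set covering the tasks."""
    remaining, chosen = set(tasks), []
    while remaining and addresses:
        # Pick the memento that answers the most still-unanswered tasks.
        best = max(sorted(addresses), key=lambda m: len(addresses[m] & remaining))
        gain = addresses[best] & remaining
        if not gain:
            break  # the remaining tasks cannot be addressed by any memento
        chosen.append(best)
        remaining -= gain
    return chosen


# Invented annotations for one collection.
tasks = {"who", "where", "when"}
addresses = {
    "m1": {"who", "where"},
    "m2": {"where"},
    "m3": set(),
    "m4": {"who"},
}

print(coverage_report(tasks, addresses))
print(greedy_min_cover(tasks, addresses))
```

Averaging these counts over several collections of the same semantic category would give the per-category means the questions refer to.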

RQ2: What visualizations (surrogates) work best for understanding individual mementos?
• There are many different possibilities for surrogates
• Does the choice of surrogate change depending on the collection’s semantic category?

RQ2 Evaluation
1. Does the depth, domain, or category of the URI play a role in which surrogate performs better?
2. Do different surrogates work better for different semantic categories?
3. For social cards, which elements of the social card need to be present to understand the underlying memento?
4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding?

RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
• Once we have:
  • Candidate summarization algorithms
  • Evaluated surrogates for individual mementos
• We can then evaluate the combination of summarization and visualization.
• There are many options:
  • arranging surrogates
  • headings
  • metadata
RQ1 (Summarization Algorithms) and RQ2 (Visualization Elements) feed into RQ3 (Visualization of Summary)
RQ3 Evaluation
1. How many user tasks are addressed by the visualization chosen? How many fail?
2. How many visualized mementos were not needed for any given user task?
3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization?
4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task?
5. Which visualizations work better for certain semantic types?

Research Plan
Timeline from 03/2017 to 05/2020 (axis ticks: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020)
Tasks:
• Preliminary work
• Implement a flexible framework
• Addressing RQ1: Develop new algorithms for selecting representative mementos
• Addressing RQ2: Evaluation of individual memento visualizations
• Dissertation Candidacy Exam
• Addressing RQ1: Evaluation of algorithms for selecting representative mementos
• Addressing RQ3: Develop candidate visualizations of groups of mementos
• Addressing RQ3: Evaluation of visualization of groups of mementos
• Dissertation Composition
• Dissertation Defense
Venues shown on the timeline: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021

Conclusion

Summary
• Collection understanding is a problem with web archive collections
  • Inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • Costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events

Contributions
• Existing work:
  • Semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
• Future work:
  • Investigate new ways of surfacing representative mementos
  • Contribute knowledge of collection understanding in web archive collections
  • Which visualization methods work best for understanding mementos in a collection
  • New algorithms for use in collection understanding
