We propose using visualization of representative mementos to aid collection understanding of web archive collections, inspired by AlNoamany's work.
Shawn Jones, Research Assistant at Old Dominion University
2. @shawnmjones @WebSciDL
Researchers Create Their Own Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
3. @shawnmjones @WebSciDL
Web Archive Collections Have Many Versions of the Same Page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: mementos from 2013, 2015, and 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: mementos from 2/12/2015, 3/5/2015, and 4/1/2015
4. @shawnmjones @WebSciDL
Different Versions Allow Us to See an Unfolding News Story
Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
6. @shawnmjones @WebSciDL
Archive-It Provides For Easy Collection Creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos.
7. @shawnmjones @WebSciDL
The Problem of Collection Understanding
What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016?
Which one should a researcher use?
8. @shawnmjones @WebSciDL
31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?
11. @shawnmjones @WebSciDL
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
Examples: one collection has 9 seeds, with metadata; another has 132,599 seeds and no metadata
Paradox of metadata: more seeds = more effort
12. @shawnmjones @WebSciDL
Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection
13. @shawnmjones @WebSciDL
More Archive-It collections are added every year
More than 8000 collections exist as of the end of 2016
15. @shawnmjones @WebSciDL
The problem, summarized
There are multiple collections about the same concept.
The metadata for each collection is nonexistent or inconsistently applied.
Many collections have 1000s of seeds with multiple mementos.
There are more than 8000 collections.
Human review of these mementos for collection understanding is an expensive proposition.
16. @shawnmjones @WebSciDL
The proposal: a visualization made of representative mementos
Our visualization is a summary that will act like an abstract
Pirolli and Card’s Information Foraging Theory:
• maximize the value of the information gained from our summaries
• minimize the cost of interacting with the collection
• ensure that our representative mementos have good information scent, meaning they contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a visualization of ~28 social cards
Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
18. @shawnmjones @WebSciDL
Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento-datetimes
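To make the TimeMap idea concrete, here is a minimal sketch that reads a link-format TimeMap (the serialization described in RFC 7089) and lists each memento with its memento-datetime. The sample TimeMap, its `archive.example` URIs, and the function name `parse_mementos` are illustrative assumptions, not taken from Archive-It.

```python
import re
from email.utils import parsedate_to_datetime

# A tiny hand-written TimeMap in link format; every URI here is a placeholder.
SAMPLE_TIMEMAP = """\
<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT",
<http://archive.example/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT"
"""

# One entry looks like: <URI>; name="value"; name="value"
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (memento-datetime, memento URI) pairs sorted by capture time."""
    mementos = []
    for uri, raw_params in ENTRY.findall(timemap_text):
        params = dict(PARAM.findall(raw_params))
        # Only entries whose rel value includes "memento" are captures.
        is_memento = "memento" in params.get("rel", "").split()
        if is_memento and "datetime" in params:
            when = parsedate_to_datetime(params["datetime"])
            mementos.append((when, uri))
    return sorted(mementos)


for when, uri in parse_mementos(SAMPLE_TIMEMAP):
    print(when.isoformat(), uri)
```

A real TimeMap would be fetched over HTTP from the archive; the parsing step stays the same.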
19. @shawnmjones @WebSciDL
Document collections have aspects
Metadata on a publication:
• used as a surrogate for understanding
• answers anticipated questions
Aspects:
• the central concepts of the corpus
• for example, aspects about a disaster: time, place, cause, countermeasures
Aspects correspond to the questions that a user might have about a collection
Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries
with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial
Intelligence (AAAI’12), 1727–1733.
20. @shawnmjones @WebSciDL
How can we surface aspects?
Named Entity Recognition can answer questions of who or where?
Natural Language Processing can answer questions of what time period?
Topic modeling can surface general concepts from the corpus
And we have to be cognizant of these concepts over time
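As a rough illustration of tracking concepts over time, the sketch below ranks the terms that are most distinctive for each time slice using a simple TF-IDF score computed across slices. This is a deliberately simple stand-in for topic modeling, not the method proposed in these slides; the function name `distinctive_terms` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def distinctive_terms(documents, top_n=3):
    """documents: iterable of (year, text) pairs. Returns {year: [terms]}."""
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year].update(re.findall(r"[a-z]+", text.lower()))
    n_slices = len(counts)
    # For each term, the number of time slices in which it appears.
    slice_freq = Counter(term for c in counts.values() for term in c)
    result = {}
    for year, c in counts.items():
        # Terms present in every slice score 0 and are dropped below.
        scores = {t: tf * math.log(n_slices / slice_freq[t]) for t, tf in c.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[year] = [t for t in ranked[:top_n] if scores[t] > 0]
    return result


# Hypothetical page texts, invented only to exercise the function.
docs = [
    (2015, "the white house announces health care policy"),
    (2015, "health care enrollment opens"),
    (2017, "the white house announces tax reform"),
    (2017, "tax reform bill passes"),
]
print(distinctive_terms(docs))
```

Terms shared by every slice (here "white" and "house") drop out, which leaves the concepts that changed between periods.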
Archive-It Collection 8121: “The Obama White House”
Archive-It Collection 8513: “Donald J Trump White House”
21. @shawnmjones @WebSciDL
Visualizing web resources (surrogates)
Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)
23. @shawnmjones @WebSciDL
Which surrogate is best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
Li (2008): social cards > text snippets in performance
Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
Woodruff (2001): thumbnails > text snippets in performance
Teevan (2009): text snippets > thumbnails in performance
Aula (2010): text snippets ~= thumbnails in performance
Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
Capra (2013): social cards > text snippets in performance (barely statistically significant)
Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
24. @shawnmjones @WebSciDL
Visualizing Archive-It Collections
Other attempts at visualizing Archive-It collections tried to visualize everything.
Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
25. @shawnmjones @WebSciDL
Prior work by AlNoamany
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science
Conference, 309–318. DOI:10.1145/3091478.3091508
http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
27. @shawnmjones @WebSciDL
Prior work by AlNoamany
Visualized summaries via the storytelling platform Storify, which is no longer in service
Showed that test participants could not detect the difference between her automated summaries and human-generated summaries
Did not evaluate if the resulting summaries were effective tools for collection understanding
Focused on summarizing collections about events
There are other types of Archive-It collections
Pipeline, drawing on characteristics of human-generated stories and characteristics of Archive-It collections:
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
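The shape of such a pipeline can be sketched in a few lines. The version below is a heavily simplified illustration and not AlNoamany's implementation: it covers only exact-duplicate removal, equal-width time slicing, one selection per slice, and ordering by time. Off-topic and language filtering and the clustering step are omitted, and text length stands in as a placeholder quality score. The function name `summarize` and the dictionary keys are hypothetical.

```python
import hashlib
from datetime import datetime


def summarize(mementos, n_slices=3):
    """mementos: list of dicts with 'uri', 'datetime', and 'text' keys."""
    # 1. Exclude exact duplicates by hashing the text (earliest copy wins).
    seen, unique = set(), []
    for m in sorted(mementos, key=lambda m: m["datetime"]):
        digest = hashlib.sha256(m["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(m)
    if not unique:
        return []
    # 2. Slice the collection into equal-width time intervals.
    start, end = unique[0]["datetime"], unique[-1]["datetime"]
    width = (end - start) / n_slices
    slices = [[] for _ in range(n_slices)]
    for m in unique:
        if not width:  # every memento shares one timestamp
            index = 0
        else:
            index = min(int((m["datetime"] - start) / width), n_slices - 1)
        slices[index].append(m)
    # 3. Select one memento per non-empty slice (placeholder: longest text).
    chosen = [max(s, key=lambda m: len(m["text"])) for s in slices if s]
    # 4. Order the selection by time.
    return sorted(chosen, key=lambda m: m["datetime"])


# Hypothetical mementos, invented only to exercise the function.
sample = [
    {"uri": "a", "datetime": datetime(2013, 4, 15), "text": "short"},
    {"uri": "b", "datetime": datetime(2013, 4, 16), "text": "a much longer page"},
    {"uri": "c", "datetime": datetime(2013, 4, 16), "text": "a much longer page"},
    {"uri": "d", "datetime": datetime(2013, 4, 21), "text": "final update"},
]
print([m["uri"] for m in summarize(sample)])
```

Each stage is a separate step over a plain list, so a real clustering or off-topic filter could be slotted in between slicing and selection without changing the rest.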
29. @shawnmjones @WebSciDL
Growth curves for understanding collection creation behavior
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
Each skew can be positive or negative.
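One way to put a number on the skew of a growth curve is to compare the area under the curve (AUC) with the area under the diagonal. The sketch below assumes both axes are normalized to [0, 1] (fraction of the collection's lifespan against fraction of items seen so far) and uses the trapezoid rule; these conventions, the function names, and the sign interpretation are assumptions made for illustration and may differ from the cited paper.

```python
def growth_curve(times):
    """Return normalized (x, y) points: fraction of lifespan vs. fraction of items seen."""
    ordered = sorted(times)
    start = ordered[0]
    span = ordered[-1] - ordered[0]
    n = len(ordered)
    points = [(0.0, 0.0)]
    for i, t in enumerate(ordered, start=1):
        x = (t - start) / span if span else 1.0
        points.append((x, i / n))
    return points


def auc_minus_diagonal(points):
    """Trapezoid-rule area under the curve, minus the 0.5 area under the diagonal."""
    area = sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )
    return area - 0.5


# Times can be plain numbers or datetime objects; these values are made up.
early = auc_minus_diagonal(growth_curve([0, 0, 0, 10]))   # most items arrive early
late = auc_minus_diagonal(growth_curve([0, 10, 10, 10]))  # most items arrive late
print(early, late)
```

Under these assumptions a positive value means most items arrived early in the collection's life and a negative value means they arrived late.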
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.
30. @shawnmjones @WebSciDL
Structural features of Archive-It collections
difference between seed curve AUC and diagonal
difference between seed memento curve AUC and diagonal
difference between seed memento curve AUC and seed curve AUC
number of seeds
number of mementos
seed URI domain diversity
seed URI path depth diversity
most frequent seed URI path depth
% query string usage in seed URIs
lifespan of collection
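Several of the URI-based features above can be computed directly from a list of seed URIs. The sketch below is an illustration only: it defines "diversity" as the number of unique values divided by the number of seeds, which is an assumption and may not match the definition used in the cited paper. The function name and the example seeds are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def seed_uri_features(seed_uris):
    """Compute simple structural features from a non-empty list of seed URIs."""
    parts = [urlsplit(u) for u in seed_uris]
    domains = [p.netloc.lower() for p in parts]
    # Path depth = number of non-empty path segments.
    depths = [len([s for s in p.path.split("/") if s]) for p in parts]
    n = len(parts)
    return {
        "number_of_seeds": n,
        "domain_diversity": len(set(domains)) / n,       # assumed definition
        "path_depth_diversity": len(set(depths)) / n,    # assumed definition
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100 * sum(1 for p in parts if p.query) / n,
    }


# Placeholder seeds, invented only to exercise the function.
seeds = [
    "http://example.com/",
    "http://example.com/a/b",
    "http://example.org/a/b?x=1",
    "http://example.net/a/b",
]
print(seed_uri_features(seeds))
```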
31. @shawnmjones @WebSciDL
Semantic categories of Archive-It collections
In a study of 3,382 Archive-It collections, four semantic categories: Self-Archiving; Subject-based; Time Bounded – Expected; Time Bounded – Spontaneous
37. @shawnmjones @WebSciDL
Semantic categories of Archive-It collections
Self-Archiving: 54.1% of collections
Subject-based: 27.6% of collections
Time Bounded – Expected: 14.1% of collections
Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features listed earlier, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
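For reference, F1 is the harmonic mean of precision and recall. The definition below is the standard per-class form, where TP, FP, and FN are the true positives, false positives, and false negatives for one category. The slides do not say how the per-category scores were averaged into the single reported value, so that part is left open here.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```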
39. @shawnmjones @WebSciDL
Developing a Flexible Framework
Dark and Stormy Archives (DSA) 2.0
A framework based on AlNoamany’s work
Components: Off-Topic Memento Toolkit, Representative Memento Selection Utilities, Archive-It Utilities, MementoEmbed, DSA Visualization Interface
Input: Web Archive Collection. Output: Visualized Summary.
Two concepts are embodied in this framework:
1. Selecting representative mementos
2. Visualizing those mementos
Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
40. @shawnmjones @WebSciDL
Not just Archive-It
Our methods will be applicable to any web archive collection, like those developed by Rhizome’s Webrecorder.
44. @shawnmjones @WebSciDL
Evaluation
1. Choose target collections for study
2. Develop user tasks for each collection, such as: Who is X? Where is Y? When does Z take place?
3. How well do users complete the tasks?
45. @shawnmjones @WebSciDL
RQ1: How do we select representative mementos for the different semantic types of collections?
Summarizing a collection involves:
1. Grouping the mementos by their commonalities
2. Selecting the highest quality mementos from each group
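The two steps above can be sketched as follows. This is an illustrative toy and not one of the candidate algorithms: it groups mementos greedily by Jaccard similarity of their word sets and uses the number of distinct words as a placeholder quality score. The threshold, the function names, and the sample texts are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_and_select(mementos, threshold=0.3):
    """mementos: list of (uri, text). Returns one URI per group."""
    groups = []  # each group is a list of (uri, token_set)
    for uri, text in mementos:
        toks = tokens(text)
        for group in groups:
            # Greedy shortcut: compare only against the group's first member.
            if jaccard(toks, group[0][1]) >= threshold:
                group.append((uri, toks))
                break
        else:
            groups.append([(uri, toks)])
    # Placeholder quality score: number of distinct tokens.
    return [max(group, key=lambda m: len(m[1]))[0] for group in groups]


# Hypothetical page texts, invented only to exercise the function.
pages = [
    ("a", "flood damage in baton rouge"),
    ("b", "flood damage in baton rouge louisiana homes"),
    ("c", "olympic games opening ceremony"),
]
print(group_and_select(pages))
```

Swapping in a different similarity measure or quality score changes only the two helper functions, which is one way different semantic categories could get different algorithms.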
Different semantic categories may require different algorithms
We want to reuse existing tools where possible:
• Stanford NLP
• Archives Unleashed Toolkit
• gensim
• spaCy
46. @shawnmjones @WebSciDL
RQ1 Evaluation
1. How many user tasks were addressed by the mementos chosen? How many user tasks failed?
2. How many mementos produced are not useful for any user task?
3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type?
4. What is the mean minimum number of mementos necessary to address the most user tasks?
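Questions 1, 2, and 4 can be expressed as simple set computations once each selected memento has been judged against each user task. The sketch below assumes those judgments are already available as a mapping from memento to the set of tasks it addresses. It uses a greedy set-cover heuristic for question 4, which approximates rather than guarantees the true minimum. All names and sample judgments are hypothetical.

```python
def evaluate_selection(tasks, coverage):
    """tasks: set of task ids. coverage: {memento_uri: set of task ids it addresses}."""
    covered = set().union(*coverage.values())
    addressed = covered & tasks
    failed = tasks - addressed
    # Mementos that help with none of the tasks.
    useless = [m for m, t in coverage.items() if not (t & tasks)]
    return addressed, failed, useless


def greedy_minimum_cover(tasks, coverage):
    """Greedy approximation of the fewest mementos addressing every addressable task."""
    remaining = set(tasks) & set().union(*coverage.values())
    chosen = []
    while remaining:
        # Pick the memento that addresses the most still-unaddressed tasks.
        best = max(sorted(coverage), key=lambda m: len(coverage[m] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen


# Hypothetical judgments, invented only to exercise the functions.
tasks = {"who", "where", "when"}
coverage = {"m1": {"who"}, "m2": {"who", "where"}, "m3": set()}
print(evaluate_selection(tasks, coverage))
print(greedy_minimum_cover(tasks, coverage))
```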
47. @shawnmjones @WebSciDL
RQ2: What visualizations (surrogates) work best for understanding individual mementos?
There are many different possibilities for surrogates
Does the choice of surrogate change depending on the collection’s semantic category?
48. @shawnmjones @WebSciDL
RQ2 Evaluation
1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better?
2. Do different surrogates work better for different semantic categories?
3. For social cards, which elements of the social card need to be present to understand the underlying memento?
4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding?
49. @shawnmjones @WebSciDL
RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
Once we have:
• candidate summarization algorithms
• evaluated surrogates for individual mementos
we can then evaluate the combination of summarization and visualization.
There are many options:
• arranging surrogates
• headings
• metadata
RQ1 (Summarization Algorithms) + RQ2 (Visualization Elements) → RQ3 (Visualization of Summary)
50. @shawnmjones @WebSciDL
RQ3 Evaluation
1. How many user tasks are addressed by the visualization chosen? How many fail?
2. How many visualized mementos were not needed for any given user task?
3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization?
4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task?
5. Which visualizations work better for certain semantic types?
51. @shawnmjones @WebSciDL
Research Plan
Timeline axis: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020
Tasks:
• Preliminary work
• Implement a flexible framework
• Addressing RQ1: Develop new algorithms for selecting representative mementos
• Addressing RQ2: Evaluation of individual memento visualizations
• Dissertation Candidacy Exam
• Addressing RQ1: Evaluation of algorithms for selecting representative mementos
• Addressing RQ3: Develop candidate visualizations of groups of mementos
• Addressing RQ3: Evaluation of visualization of groups of mementos
• Dissertation Composition
• Dissertation Defense
Venues shown on the timeline: SIGIR 2020, CHI 2020, iPRES 2018, iPRES 2019, JCDL 2019, JCDL 2020, JCDL 2021
53. @shawnmjones @WebSciDL
Summary
Collection understanding is a problem with web archive collections:
• Inconsistent metadata
• 1000s of mementos
• 1000s of collections
• Costly for human review
We intend to produce a visualization that serves as an abstract to assist in collection understanding
Prior work in this area:
• did not evaluate how well this method works for collection understanding
• only focused on collections about events
55. @shawnmjones @WebSciDL
Contributions
Existing work:
Semantic categories of web archive collections in Archive-It
Categories can be predicted by using structural features
Most collections are not about events
Future work:
Investigate new ways of surfacing representative mementos
Contribute knowledge of collection understanding in web archive collections
Which visualization methods work best for understanding mementos in a collection
New algorithms for use in collection understanding
Thanks: