@shawnmjones @WebSciDL
Improving Collection Understanding in Web Archives
Shawn M. Jones
Web Science and Digital Libraries Research Group
Advisors: Michael L. Nelson and Michele C. Weigle
Thanks to:
@shawnmjones @WebSciDL
Researchers Create Their Own Web Archive Collections
2
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
@shawnmjones @WebSciDL
Web Archive Collections Have Many Versions of the
Same Page
3
2013
2015
2018
University of Utah Office of Admissions
from the University of Utah Web Archive Collection
4/1/2015
3/5/2015
Tumblr Black Lives Matter Blog
from the #blacklivesmatter Collection
2/12/2015
@shawnmjones @WebSciDL
Different Versions Allow Us to See an Unfolding News
Story
4
Memento from
April 19, 2013 17:12
Searching for Suspects,
City on Lockdown
Memento from
April 19, 2013 17:59
Officer Donahue in hospital,
Lockdown loosened,
Will the Red Sox game be cancelled?
Memento from
April 11, 2013 2:24
Suspect Found,
Officer Collier Lost Life,
Obama speaks
@shawnmjones @WebSciDL
Different Versions Allow Us To See Changes In An
Organization’s Web Presence
5
The White House: 2016 The White House: 2018
@shawnmjones @WebSciDL
Archive-It Provides For Easy Collection Creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
6
@shawnmjones @WebSciDL
The Problem of Collection Understanding
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?
7
@shawnmjones @WebSciDL 8
31 Archive-It
collections match the
search query
“human rights”
How are they different
from each other?
Which one is best for my
needs?
@shawnmjones @WebSciDL
Archive-It provides fields for metadata
9
Collection-wide metadata and metadata on individual seeds:
Dublin Core + custom fields
@shawnmjones @WebSciDL
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare
metadata fields to understand the differences
between collections.
11
132,599 seeds
no metadata
9 seeds
with metadata
Paradox of metadata:
More seeds = more effort
@shawnmjones @WebSciDL
Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
12
@shawnmjones @WebSciDL
More Archive-It collections are added every year
More than 8000 collections exist as
of the end of 2016
13
More Archive-It collections
are added each year
@shawnmjones @WebSciDL
The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
15
@shawnmjones @WebSciDL
The proposal: a visualization made of representative
mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a visualization of ~28 social cards
16
Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
@shawnmjones @WebSciDL
Background and Related Work
17
@shawnmjones @WebSciDL
Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times (their memento-datetimes)
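As a rough illustration of the TimeMap idea above, the sketch below parses a small link-format TimeMap and lists each memento with its memento-datetime. The sample TimeMap, the archive.example URIs, and the parse_timemap helper are hypothetical; real TimeMaps may order attributes differently, so the regular expression is an assumption rather than a complete parser.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link format; all URIs are placeholders.
SAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example/timemap/link/http://example.com/>; '
    'rel="self"; type="application/link-format",\n'
    '<http://archive.example/20130419175900/http://example.com/>; '
    'rel="memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT",\n'
    '<http://archive.example/20130419171200/http://example.com/>; '
    'rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT"\n'
)

# Assumes rel comes before datetime in each entry (an assumption about layout).
MEMENTO_ENTRY = re.compile(
    r'<([^>]+)>;\s*rel="([^"]*\bmemento\b[^"]*)";\s*datetime="([^"]+)"'
)


def parse_timemap(text):
    """Return (memento-datetime, URI-M) pairs sorted by capture time."""
    mementos = []
    for uri, _rel, dt in MEMENTO_ENTRY.findall(text):
        mementos.append((parsedate_to_datetime(dt), uri))
    return sorted(mementos)


if __name__ == "__main__":
    for captured, uri in parse_timemap(SAMPLE_TIMEMAP):
        print(captured.isoformat(), uri)
```

Entries whose relation is not a memento (the original resource and the TimeMap itself) are skipped, so only captures remain.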
18
@shawnmjones @WebSciDL
Document collections have aspects
• Metadata on a publication:
  • used as a surrogate for understanding
  • answers anticipated questions
• Aspects:
  • The central concepts of the corpus
  • For example: aspects about a disaster
    • time
    • place
    • cause
    • countermeasures
• Aspects correspond to the questions that a user might have about a collection
19
Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries
with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial
Intelligence (AAAI’12), 1727–1733.
@shawnmjones @WebSciDL
How can we surface aspects?
• Named Entity Recognition can answer questions of who or where?
• Natural Language Processing can answer questions of what time period?
• Topic modeling can surface general concepts from the corpus
• And we have to be cognizant of these concepts over time
20
Archive-It Collection 8121:
“The Obama White House”
Archive-It Collection 8513:
“Donald J Trump White House”
@shawnmjones @WebSciDL
Visualizing web resources (surrogates)
21
• Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social card (example from Facebook)
• Text + thumbnail (example from Internet Archive)
@shawnmjones @WebSciDL
Which surrogate is best for web resources?
Studies on visualizing web resources have focused primarily on
determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
23
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
@shawnmjones @WebSciDL
Visualizing Archive-It Collections
24
Other attempts at
visualizing Archive-It
collections tried to
visualize everything.
Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
@shawnmjones @WebSciDL
Prior work by AlNoamany
• Visualized summaries via the storytelling platform Storify
• Showed that test participants could not detect the difference between her automated summaries and human-generated summaries
[Figure: summarization pipeline, informed by characteristics of human-generated stories and characteristics of Archive-It collections]
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
25
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science
Conference, 309–318. DOI:10.1145/3091478.3091508
http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
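The pipeline steps listed for AlNoamany's work can be sketched end to end on toy data. This is only an illustration of the order of operations: the Memento record, the word-overlap similarity, the equal-sized slices (standing in for dynamic slicing), the greedy clustering, and the precomputed quality score are all simplifying assumptions and not the algorithms used in the cited paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    uri: str              # hypothetical identifier
    captured: datetime    # memento-datetime
    text: str
    lang: str = "en"
    quality: float = 0.0  # assumed to be computed elsewhere


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def cluster(mementos, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar."""
    clusters = []
    for m in mementos:
        for c in clusters:
            if jaccard(m.text, c[0].text) >= threshold:
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters


def summarize(mementos, topic_text, n_slices=2, on_topic=0.1):
    ordered = sorted(mementos, key=lambda m: m.captured)
    # Exclude duplicates (identical text).
    seen, unique = set(), []
    for m in ordered:
        if m.text not in seen:
            seen.add(m.text)
            unique.append(m)
    # Exclude off-topic and non-English pages.
    kept = [m for m in unique
            if m.lang == "en" and jaccard(m.text, topic_text) >= on_topic]
    if not kept:
        return []
    # Slice the collection in time order (equal-sized slices as a stand-in).
    size = -(-len(kept) // n_slices)  # ceiling division
    slices = [kept[i:i + size] for i in range(0, len(kept), size)]
    # Cluster each slice and select the highest-quality page per cluster.
    chosen = [max(c, key=lambda m: m.quality)
              for s in slices for c in cluster(s)]
    # Order pages by time.
    return sorted(chosen, key=lambda m: m.captured)


TOPIC = "boston marathon bombing suspect lockdown"
SAMPLE = [
    Memento("a", datetime(2013, 4, 19, 17, 12), "boston lockdown search for suspect", "en", 0.9),
    Memento("b", datetime(2013, 4, 19, 17, 20), "boston lockdown search for suspect", "en", 0.5),
    Memento("c", datetime(2013, 4, 19, 17, 59), "boston lockdown loosened suspect search", "en", 0.4),
    Memento("d", datetime(2013, 4, 19, 18, 30), "domain for sale cheap hosting", "en", 0.9),
    Memento("e", datetime(2013, 4, 20, 2, 24), "suspect found in boston", "en", 0.8),
    Memento("f", datetime(2013, 4, 20, 3, 0), "sospechoso encontrado en boston", "es", 0.9),
]

if __name__ == "__main__":
    print([m.uri for m in summarize(SAMPLE, TOPIC)])
```

In this toy run the duplicate, the parked-domain page, and the Spanish page are dropped before slicing, which mirrors why the exclusion steps come first in the pipeline.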
@shawnmjones @WebSciDL
Prior work by AlNoamany
• Visualized summaries via the storytelling platform Storify, which is no longer in service
  http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
• Showed that test participants could not detect the difference between her automated summaries and human-generated summaries
• Did not evaluate whether the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
  • There are other types of Archive-It collections
27
@shawnmjones @WebSciDL
Preliminary Work
28
@shawnmjones @WebSciDL
Growth curves for understanding collection creation
behavior
29
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figure: growth curves with regions of positive and negative skew labeled]
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.
@shawnmjones @WebSciDL
Structural features of Archive-It collections
• difference between seed curve AUC and diagonal
• difference between seed memento curve AUC and diagonal
• difference between seed memento curve AUC and seed curve AUC
• number of seeds
• number of mementos
• seed URI domain diversity
• seed URI path depth diversity
• most frequent seed URI path depth
• % query string usage in seed URIs
• lifespan of collection
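A few of the structural features above can be sketched with the standard library. The definitions here are assumptions made for illustration: the growth curve is taken as the cumulative fraction of items plotted against the fraction of the collection's lifespan, its area is measured with the trapezoid rule and compared with the 0.5 area under the diagonal, and diversity is taken as unique values divided by the number of seeds. The cited paper may define these differently, and the example URIs are placeholders.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse


def growth_curve_auc(datetimes):
    """Area under a normalized cumulative growth curve (trapezoid rule)."""
    times = sorted(datetimes)
    span = (times[-1] - times[0]).total_seconds()
    if span == 0:
        raise ValueError("need at least two distinct datetimes")
    n = len(times)
    # x: fraction of lifespan elapsed, y: fraction of items added so far
    points = [(0.0, 0.0)] + [
        ((t - times[0]).total_seconds() / span, (i + 1) / n)
        for i, t in enumerate(times)
    ]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_minus_diagonal(datetimes):
    """Positive when items arrive early in the lifespan, negative when late."""
    return growth_curve_auc(datetimes) - 0.5


def seed_uri_features(seed_uris):
    """URI-based features; the diversity definitions are assumptions."""
    parsed = [urlparse(u) for u in seed_uris]
    depths = [len([seg for seg in p.path.split("/") if seg]) for p in parsed]
    n = len(parsed)
    return {
        "seed_count": n,
        "domain_diversity": len({p.netloc.lower() for p in parsed}) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100 * sum(1 for p in parsed if p.query) / n,
    }


SEED_TIMES = [datetime(2016, 1, d) for d in (1, 2, 3, 4)]
SEEDS = [
    "http://example.com/",
    "http://example.com/news/2016",
    "http://example.org/a?id=1",
    "http://example.org/b",
]

if __name__ == "__main__":
    print(auc_minus_diagonal(SEED_TIMES))
    print(seed_uri_features(SEEDS))
```

The same growth_curve_auc function can be applied to seed memento datetimes, so the third feature in the list is simply the difference between two calls.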
30
@shawnmjones @WebSciDL
Semantic categories of Archive-It collections
In a study of 3,382 Archive-It collections:
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
35
@shawnmjones @WebSciDL
Semantic categories of Archive-It collections
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features listed earlier, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
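For readers unfamiliar with the F1 measure quoted above, the sketch below computes per-class F1 and a support-weighted average for a multi-class prediction. The slide does not say which averaging was used, so the weighted average is an assumption, and the labels and predictions are invented toy data rather than results from the study.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1 for each label: harmonic mean of precision and recall."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by how often each label truly occurs."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)


# Toy labels only; not data from the study.
Y_TRUE = ["self-archiving", "self-archiving", "subject-based", "expected"]
Y_PRED = ["self-archiving", "subject-based", "subject-based", "expected"]

if __name__ == "__main__":
    print(f1_per_class(Y_TRUE, Y_PRED))
    print(weighted_f1(Y_TRUE, Y_PRED))
```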
37
@shawnmjones @WebSciDL
Research Plan
38
@shawnmjones @WebSciDL
Developing a Flexible Framework
Dark and Stormy Archives (DSA) 2.0
A framework based on AlNoamany’s work
[Figure: framework diagram leading from a Web Archive Collection to a Visualized Summary]
Components shown in the diagram:
• Archive-It Utilities
• Off-Topic Memento Toolkit
• Representative Memento Selection Utilities
• MementoEmbed
• DSA Visualization Interface
Two concepts are embodied in this framework:
1. Selecting representative mementos
2. Visualizing those mementos
39
Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The
Off-Topic Memento Toolkit. In International Conference on Digital
Preservation (iPRES) 2018.
@shawnmjones @WebSciDL
Not just Archive-It
40
Our methods will be applicable to any web archive collection,
like those developed by Rhizome’s Webrecorder.
@shawnmjones @WebSciDL
Evaluation
41
@shawnmjones @WebSciDL
Evaluation
1. Choose target
collections for study
2. Develop user tasks
for each collection
3. How well do users
complete the tasks?
44
Who is X? Where is Y? When does Z take place?
@shawnmjones @WebSciDL
RQ1: How do we select representative mementos for the
different semantic types of collections?
• Summarizing a collection involves:
  1. Grouping the mementos by their commonalities
  2. Selecting the highest quality mementos from each group
• Different semantic categories may require different algorithms
• We want to reuse existing tools where possible:
  • Stanford NLP
  • Archives Unleashed Toolkit
  • gensim
  • spaCy
45
@shawnmjones @WebSciDL
RQ1 Evaluation
1. How many user tasks were addressed by the mementos chosen? How many
user tasks failed?
2. How many mementos produced are not useful for any user task?
3. Which algorithm surfaces aspects satisfying the highest mean number of user
tasks for a given collection type?
4. What is the mean minimum number of mementos necessary to address the
most user tasks?
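The counts asked for in these evaluation questions could be tallied along the following lines. The input format (a mapping from each selected memento to the set of user tasks it addresses) and both function names are hypothetical, and the greedy cover is only an approximation of the minimum number of mementos, not a guaranteed optimum.

```python
def evaluate_selection(answers, all_tasks):
    """Return (tasks addressed, tasks failed, mementos useful for no task)."""
    tasks = set(all_tasks)
    addressed = set().union(*answers.values()) & tasks
    failed = tasks - addressed
    useless = [m for m, covered in answers.items() if not covered & tasks]
    return addressed, failed, useless


def greedy_cover(answers, all_tasks):
    """Pick mementos one at a time, each covering the most remaining tasks."""
    remaining = set(all_tasks)
    chosen = []
    while remaining and answers:
        best = max(answers, key=lambda m: len(answers[m] & remaining))
        gain = answers[best] & remaining
        if not gain:
            break  # nothing left that any memento can address
        chosen.append(best)
        remaining -= gain
    return chosen


# Toy example: task names follow the "Who is X? Where is Y? When does Z take place?" pattern.
TASKS = {"who", "where", "when"}
ANSWERS = {
    "m1": {"who", "where"},
    "m2": {"where"},
    "m3": set(),
    "m4": {"who"},
}

if __name__ == "__main__":
    print(evaluate_selection(ANSWERS, TASKS))
    print(greedy_cover(ANSWERS, TASKS))
```

Averaging these counts over several collections of the same semantic category would give the means that questions 3 and 4 refer to.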
46
@shawnmjones @WebSciDL
RQ2: What visualizations (surrogates) work best for
understanding individual mementos?
• There are many different possibilities for surrogates
• Does the choice of surrogate change depending on the collection’s semantic category?
47
@shawnmjones @WebSciDL
RQ2 Evaluation
1. Does the depth, domain, or category of
the URI play a factor in which surrogate
performs better?
2. Do different surrogates work better for
different semantic categories?
3. For social cards, which elements of the
social card need to be present to
understand the underlying memento?
4. For thumbnails, what size thumbnail
works best for understanding? How
much of the web page needs to be
rendered for a thumbnail to be useful for
understanding?
48
Evaluated via:
@shawnmjones @WebSciDL
RQ3: How well do visualizations of groups of mementos
produced by different summarization algorithms work for
collection understanding?
• Once we have:
  • Candidate summarization algorithms
  • Evaluated surrogates for individual mementos
• We can then evaluate the combination of summarization and visualization.
• There are many options:
  • arranging surrogates
  • headings
  • metadata
49
[Figure: RQ1 (Summarization Algorithms) and RQ2 (Visualization Elements) combine into RQ3 (Visualization of Summary)]
@shawnmjones @WebSciDL
RQ3 Evaluation
1. How many user tasks are addressed by
the visualization chosen? How many fail?
2. How many visualized mementos were not
needed for any given user task?
3. Given an aspect of the collection, can the
user address a user task concerning it by
visually scanning the visualization?
4. Given multiple aspects of the collection,
can the user successfully compare
different individual memento visualizations
to address a user task?
5. Which visualizations work better for certain
semantic types?
50
Evaluated via:
@shawnmjones @WebSciDL
Research Plan
51
[Figure: Gantt chart of the research plan; time axis runs from 03/2017 to 05/2020 in roughly quarterly steps]
Tasks shown on the chart:
• Preliminary work
• Implement a flexible framework
• Addressing RQ1: Develop new algorithms for selecting representative mementos
• Addressing RQ2: Evaluation of individual memento visualizations
• Dissertation Candidacy Exam
• Addressing RQ1: Evaluation of algorithms for selecting representative mementos
• Addressing RQ3: Develop candidate visualizations of groups of mementos
• Addressing RQ3: Evaluation of visualization of groups of mementos
• Dissertation Composition
• Dissertation Defense
Venue labels on the chart: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021
@shawnmjones @WebSciDL
Conclusion
52
@shawnmjones @WebSciDL
Summary
• Collection understanding is a problem with web archive collections
  • Inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • Costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
53
@shawnmjones @WebSciDL
Contributions
• Existing work:
  • Semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
• Future work:
  • Investigate new ways of surfacing representative mementos
  • Contribute knowledge of collection understanding in web archive collections
    • Which visualization methods work best for understanding mementos in a collection
    • New algorithms for use in collection understanding
54
Thanks:

Improving Collection Understanding in Web Archives

  • 1.
    @shawnmjones @WebSciDL Improving Collection Understandingin Web Archives Shawn M. Jones Web Science and Digital Libraries Research Group Advisors: Michael L. Nelson and Michele C. Weigle Thanks to:
  • 2.
    @shawnmjones @WebSciDL Researchers CreateTheir Own Web Archive Collections 2 Archived web pages, or mementos, are used by journalists, sociologists, and historians. Tucson Shootings2008 OlympicsUniversity of Utah
  • 3.
    @shawnmjones @WebSciDL Web ArchiveCollections Have Many Versions of the Same Page 3 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015
  • 4.
    @shawnmjones @WebSciDL Different VersionsAllow Us to See an Unfolding News Story 4 Memento from April 19, 2013 17:12 Searching for Suspects, City on Lockdown Memento from April 19, 2013 17:59 Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 11, 2013 2:24 Suspect Found, Office Collier Lost Life, Obama speaks
  • 5.
    @shawnmjones @WebSciDL Different VersionsAllow Us To See Changes In An Organization’s Web Presence 5 The White House: 2016 The White House: 2018
  • 6.
    @shawnmjones @WebSciDL Archive-It ProvidesFor Easy Collection Creation Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 6
  • 7.
    @shawnmjones @WebSciDL The Problemof Collection Understanding What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use? 7
  • 8.
    @shawnmjones @WebSciDL 8 31Archive-It collections match the search query “human rights” How are they different from each other? Which one is best for my needs?
  • 9.
    @shawnmjones @WebSciDL Archive-It providesfields for metadata 9 Collection wide Metadata Metadata on Individual Seeds Dublin Core + Custom Fields
  • 10.
    @shawnmjones @WebSciDL But, alasthe metadata does not help Because metadata is optional it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation 10 9 seeds with metadata 132,599 seeds no metadata
  • 11.
    @shawnmjones @WebSciDL But, alasthe metadata does not help Because metadata is optional it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation • it is inconsistently applied This means that a user cannot reliably compare metadata fields to understand the differences between collections. 11 132,599 seeds no metadata 9 seeds with metadata Paradox of metadata: More seeds = more effort
  • 12.
    @shawnmjones @WebSciDL Reviewing mementosmanually is costly This collection has 132,599 seeds, many with multiple mementos Some collections have 1000s of seeds Each seed can have many mementos In some cases, this can require reviewing 100,000+ documents to understand the collection 12
  • 13.
    @shawnmjones @WebSciDL More Archive-Itcollections are added every year More than 8000 collections exist as of the end of 2016 13 More Archive-It collections are added each year
  • 14.
    @shawnmjones @WebSciDL The problem,summarized  There are multiple collections about the same concept.  The metadata for each collection is non-existent, or inconsistently applied.  Many collections have 1000s of seeds with multiple mementos.  There are more than 8000 collections. 14
  • 15.
    @shawnmjones @WebSciDL The problem,summarized  There are multiple collections about the same concept.  The metadata for each collection is non-existent, or inconsistently applied.  Many collections have 1000s of seeds with multiple mementos.  There are more than 8000 collections.  Human review of these mementos for collection understanding is an expensive proposition. 15
  • 16.
    @shawnmjones @WebSciDL The proposal:a visualization made of representative mementos  Our visualization is a summary that will act like an abstract  Pirolli and Card’s Information Foraging Theory:  maximize the value of the information gained from our summaries  minimize the cost of interacting with the collection  ensure that our representative mementos have good information scent  contain cues that the memento will address a user’s needs From this: 318 seeds with 2421 mementos To something like this: a visualization of ~28 social cards 16 Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  • 17.
  • 18.
    @shawnmjones @WebSciDL Looking atArchive-It collections from the outside • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds • Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento-datetimes 18 Archive–It Collections
  • 19.
    @shawnmjones @WebSciDL Document collectionshave aspects  Metadata on a publication:  used as a surrogate for understanding  answers anticipated questions  Aspects:  The central concepts of the corpus  For example: aspects about a disaster  time  place  cause  countermeasures  Aspects correspond to the questions that a user might have about a collection 19 Archive–It Collections Summarize with Aspects Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.
  • 20.
    @shawnmjones @WebSciDL How canwe surface aspects?  Named Entity Recognition can answer questions of who or where?  Natural Language Processing can answer questions of what time period?  Topic modeling can surface general concepts from the corpus  And we have to be cognizant of these concepts over time 20 Archive-It Collection 8121: “The Obama White House” Archive-It Collection 8513: “Donald J Trump White House” Archive–It Collections Summarize with Aspects
  • 21.
    @shawnmjones @WebSciDL Visualizing webresources (surrogates) 21 Thumbnail (example from UK Web Archive)Text snippet (example from Bing) Social Card (example from Facebook) Text + Thumbnail (example from Internet Archive) Visualize MementosArchive–It Collections Summarize with Aspects
  • 22.
    @shawnmjones @WebSciDL Which surrogateis best for web resources? Li (2008) social cards > text snippets in performance Dziadosz (2002) text + thumbnail > text snippet text snippet > thumbnail in performance Woodruff (2001) thumbnails > text snippets in performance Teevan (2009) text snippets > thumbnails in performance Aula (2010) text snippets ~= thumbnails in performance Loumakis (2011) text snippets ~= social cards in performance social cards > text snippets in information scent and user preference Capra (2013) social cards > text snippets In performance (barely statistically significant) Al Maqbali (2010) text + thumbnail ~= social card text snippet ~= social card text + thumbnail ~= text snippet in performance 22 Visualize MementosArchive–It Collections Summarize with Aspects https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  • 23.
    @shawnmjones @WebSciDL Which surrogateis best for web resources? Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding. Li (2008) social cards > text snippets in performance Dziadosz (2002) text + thumbnail > text snippet text snippet > thumbnail in performance Woodruff (2001) thumbnails > text snippets in performance Teevan (2009) text snippets > thumbnails in performance Aula (2010) text snippets ~= thumbnails in performance Loumakis (2011) text snippets ~= social cards in performance social cards > text snippets in information scent and user preference Capra (2013) social cards > text snippets In performance (barely statistically significant) Al Maqbali (2010) text + thumbnail ~= social card text snippet ~= social card text + thumbnail ~= text snippet in performance 23 Visualize MementosArchive–It Collections Summarize with Aspects https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  • 24.
    @shawnmjones @WebSciDL Visualizing Archive-ItCollections 24 Other attempts at visualizing Archive-It collections tried to visualize everything. Visualize MementosArchive–It Collections Summarize with Aspects Visualize Summary Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12) 15 – 18. DOI:10.1145/2232817.2232821 http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis- visualizing.html
  • 25.
    @shawnmjones @WebSciDL Prior workby AlNoamany  Visualized summaries via the storytelling platform Storify  Proved that test participants could not detect the difference between her automated summaries and human-generated summaries Characteristicsof human-generated Stories Characteristicsof Archive-It collections Exclude duplicates Exclude off-topic pages Exclude non-English Language Dynamically slice the collection Cluster the pages in each slice Select high-quality pages from each cluster Order pages by time Visualize 25 Visualize MementosArchive–It Collections Summarize with Aspects Visualize Summary Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508 http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
  • 26.
    Prior work by AlNoamany
    - Visualized summaries via the storytelling platform Storify, which is no longer in service
    - Proved that test participants could not detect the difference between her automated summaries and human-generated summaries
    Figure: characteristics of human-generated stories; characteristics of Archive-It collections
    Figure steps: exclude duplicates; exclude off-topic pages; exclude non-English language; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize
    http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
  • 27.
    Prior work by AlNoamany
    - Visualized summaries via the storytelling platform Storify, which is no longer in service
    - Proved that test participants could not detect the difference between her automated summaries and human-generated summaries
    - Did not evaluate if the resulting summaries were effective tools for collection understanding
    - Focused on summarizing collections about events
    - There are other types of Archive-It collections
    Figure: characteristics of human-generated stories; characteristics of Archive-It collections
    Figure steps: exclude duplicates; exclude off-topic pages; exclude non-English language; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize
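The steps listed for AlNoamany's approach can be illustrated with a toy pipeline. This is only a sketch under stated assumptions, not her implementation: the memento fields (`content_hash`, `off_topic`, `language`, `quality`) are hypothetical, the slices are fixed equal-width time windows rather than dynamic ones, and each slice is treated as a single cluster.

```python
from collections import defaultdict
from datetime import datetime


def summarize(mementos, slice_count=2):
    """Pick one memento per time slice after filtering.

    Each memento is a dict with hypothetical keys: 'uri', 'datetime',
    'content_hash', 'off_topic', 'language', and 'quality'.
    """
    # Exclude duplicates (same content hash), off-topic pages,
    # and non-English pages, walking the collection in time order.
    seen = set()
    kept = []
    for m in sorted(mementos, key=lambda m: m["datetime"]):
        if m["content_hash"] in seen or m["off_topic"] or m["language"] != "en":
            continue
        seen.add(m["content_hash"])
        kept.append(m)
    if not kept:
        return []

    # Slice the collection into equal-width time windows (an assumption;
    # the slide describes dynamic slicing).
    start, end = kept[0]["datetime"], kept[-1]["datetime"]
    span = (end - start).total_seconds() or 1.0
    slices = defaultdict(list)
    for m in kept:
        offset = (m["datetime"] - start).total_seconds()
        index = min(int(offset / span * slice_count), slice_count - 1)
        slices[index].append(m)

    # Select the highest-quality memento from each slice, ordered by time.
    chosen = [max(group, key=lambda m: m["quality"]) for group in slices.values()]
    return sorted(chosen, key=lambda m: m["datetime"])
```

A real system would replace the boolean `off_topic` flag and the `quality` number with measurements taken from the mementos themselves.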
  • 28.
  • 29.
    Growth curves for understanding collection creation behavior
    - Skew of the collection's holdings
    - Indicates temporality of collection
    - Skew of the curatorial involvement with the collection
    - When seeds were added
    - When interest was lost or regained
    Figure labels: (Positive), (Positive), (Negative), (Negative)
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
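One way to quantify the skew of a growth curve is to compare the area under it with the area under the diagonal. The sketch below assumes a particular normalization that the slide does not spell out: x is the fraction of the collection's lifespan elapsed, y is the fraction of items (seeds or mementos) added so far, and the area is computed with the trapezoidal rule. The function names are illustrative only.

```python
from datetime import date


def growth_curve(event_dates):
    """Return (x, y) points: fraction of lifespan elapsed vs. fraction of items added."""
    ordered = sorted(event_dates)
    start, end = ordered[0], ordered[-1]
    lifespan = (end - start).days or 1  # avoid dividing by zero for one-day collections
    total = len(ordered)
    points = [(0.0, 0.0)]
    for i, d in enumerate(ordered, start=1):
        points.append(((d - start).days / lifespan, i / total))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def skew_from_diagonal(event_dates):
    """Difference between the curve's AUC and the diagonal's AUC (0.5)."""
    return auc(growth_curve(event_dates)) - 0.5
```

Under these assumptions, a value above zero means most items were added early in the collection's life, and a value below zero means they were added late.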
  • 30.
    Structural features of Archive-It collections
    - difference between seed curve AUC and diagonal
    - difference between seed memento curve AUC and diagonal
    - difference between seed memento curve AUC and seed curve AUC
    - number of seeds
    - number of mementos
    - seed URI domain diversity
    - seed URI path depth diversity
    - most frequent seed URI path depth
    - % query string usage in seed URIs
    - lifespan of collection
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
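The seed URI features in this list can be computed from the URIs alone. The following is a minimal sketch, assuming that "diversity" means the number of unique values divided by the number of seeds and that path depth is the count of non-empty path segments; neither definition is given on the slide, and the example URIs are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def seed_uri_features(seed_uris):
    """Compute simple structural features from a non-empty list of seed URIs."""
    parsed = [urlparse(u) for u in seed_uris]
    n = len(seed_uris)
    domains = {p.netloc.lower() for p in parsed}
    # Path depth: number of non-empty segments in the URI path.
    depths = [len([s for s in p.path.split("/") if s]) for p in parsed]
    return {
        "number_of_seeds": n,
        "domain_diversity": len(domains) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "query_string_usage": sum(1 for p in parsed if p.query) / n,
    }


features = seed_uri_features([
    "https://example.org/",
    "https://example.org/news/",
    "https://blog.example.org/2016/flood/?page=2",
    "https://example.org/about/",
])
```

A collection whose seeds all share one domain and sit at depth zero would score low on both diversity measures.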
  • 31.
    Semantic categories of Archive-It collections
    In a study of 3,382 Archive-It collections:
    - Self-Archiving
    - Subject-based
    - Time Bounded – Expected
    - Time Bounded – Spontaneous
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 32.
    Semantic categories of Archive-It collections
    In a study of 3,382 Archive-It collections:
    - Self-Archiving: 54.1% of collections
    - Subject-based
    - Time Bounded – Expected
    - Time Bounded – Spontaneous
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 33.
    Semantic categories of Archive-It collections
    In a study of 3,382 Archive-It collections:
    - Self-Archiving: 54.1% of collections
    - Subject-based: 27.6% of collections
    - Time Bounded – Expected
    - Time Bounded – Spontaneous
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 34.
    Semantic categories of Archive-It collections
    In a study of 3,382 Archive-It collections:
    - Self-Archiving: 54.1% of collections
    - Subject-based: 27.6% of collections
    - Time Bounded – Expected: 14.1% of collections
    - Time Bounded – Spontaneous
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 35.
    Semantic categories of Archive-It collections
    In a study of 3,382 Archive-It collections:
    - Self-Archiving: 54.1% of collections
    - Subject-based: 27.6% of collections
    - Time Bounded – Expected: 14.1% of collections
    - Time Bounded – Spontaneous: 4.2% of collections
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 36.
    Semantic categories of Archive-It collections
    - Self-Archiving: 54.1% of collections
    - Subject-based: 27.6% of collections
    - Time Bounded – Expected: 14.1% of collections
    - Time Bounded – Spontaneous: 4.2% of collections
    Some evaluated by AlNoamany
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 37.
    Semantic categories of Archive-It collections
    - Self-Archiving: 54.1% of collections
    - Subject-based: 27.6% of collections
    - Time Bounded – Expected: 14.1% of collections
    - Time Bounded – Spontaneous: 4.2% of collections
    Some evaluated by AlNoamany
    Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
    Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
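For readers unfamiliar with the F1 figure, the sketch below shows how a multi-class F1 score can be computed from true and predicted category labels. It assumes macro averaging (the unweighted mean of per-class F1); the slide does not say which averaging produced 0.720, and the labels here are invented for illustration rather than taken from the study.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for six collections.
y_true = ["self", "self", "subject", "expected", "spontaneous", "subject"]
y_pred = ["self", "subject", "subject", "expected", "expected", "subject"]
score = macro_f1(y_true, y_pred)
```

With class proportions as uneven as those above, macro and weighted averages can differ noticeably, so the averaging choice matters when comparing scores.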
  • 38.
  • 39.
    Developing a Flexible Framework
    Dark and Stormy Archives (DSA) 2.0: a framework based on AlNoamany's work
    Figure: Web Archive Collection, Visualized Summary
    Components shown: Off-Topic Memento Toolkit; Representative Memento Selection Utilities; Archive-It Utilities; MementoEmbed; DSA Visualization Interface
    Two concepts are embodied in this framework:
    1. Selecting representative mementos
    2. Visualizing those mementos
    Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
  • 40.
    Not just Archive-It
    Our methods will be applicable to any web archive collection, like those developed by Rhizome's Webrecorder.
  • 41.
  • 42.
    Evaluation
    1. Choose target collections for study
  • 43.
    Evaluation
    1. Choose target collections for study
    2. Develop user tasks for each collection
    Example tasks: Who is X? Where is Y? When does Z take place?
  • 44.
    Evaluation
    1. Choose target collections for study
    2. Develop user tasks for each collection
    3. How well do users complete the tasks?
    Example tasks: Who is X? Where is Y? When does Z take place?
  • 45.
    RQ1: How do we select representative mementos for the different semantic types of collections?
    - Summarizing a collection involves:
      1. Grouping the mementos by their commonalities
      2. Selecting the highest quality mementos from each group
    - Different semantic categories may require different algorithms
    - We want to reuse existing tools where possible:
      - Stanford NLP
      - Archives Unleashed Toolkit
      - gensim
      - spaCy
  • 46.
    RQ1 Evaluation
    1. How many user tasks were addressed by the mementos chosen? How many user tasks failed?
    2. How many mementos produced are not useful for any user task?
    3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type?
    4. What is the mean minimum number of mementos necessary to address the most user tasks?
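The first two evaluation questions reduce to set operations once each user task is linked to the mementos that can answer it. The sketch below assumes such a mapping exists (in practice it would come from a user study or manual annotation); the task names and memento identifiers are hypothetical.

```python
def evaluate_summary(task_answers, summary):
    """Score a summary against user tasks.

    task_answers maps a task id to the set of memento URIs that can answer it.
    summary is the list of memento URIs chosen by a summarization algorithm.
    """
    chosen = set(summary)
    # A task is addressed if at least one chosen memento can answer it.
    addressed = {task for task, uris in task_answers.items() if uris & chosen}
    failed = set(task_answers) - addressed
    # Chosen mementos that answer no task at all.
    useful = set().union(*task_answers.values()) if task_answers else set()
    not_useful = chosen - useful
    return {"addressed": addressed, "failed": failed, "not_useful": not_useful}


result = evaluate_summary(
    {"who": {"m1", "m2"}, "where": {"m3"}, "when": {"m4"}},
    ["m1", "m3", "m9"],
)
```

Averaging the size of `addressed` over several collections of the same semantic category would give the per-algorithm mean asked for in question 3.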
  • 47.
    RQ2: What visualizations (surrogates) work best for understanding individual mementos?
    - There are many different possibilities for surrogates
    - Does the choice in surrogate change depending on the collection's semantic category?
  • 48.
    RQ2 Evaluation
    1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better?
    2. Do different surrogates work better for different semantic categories?
    3. For social cards, which elements of the social card need to be present to understand the underlying memento?
    4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding?
    Evaluated via: [image in original slide]
  • 49.
    RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
    - Once we have:
      - Candidate summarization algorithms
      - Evaluated surrogates for individual mementos
    - We can then evaluate the combination of summarization and visualization.
    - There are many options:
      - arranging surrogates
      - headings
      - metadata
    Figure: RQ1: Summarization Algorithms; RQ2: Visualization Elements; RQ3: Visualization of Summary
  • 50.
    RQ3 Evaluation
    1. How many user tasks are addressed by the visualization chosen? How many fail?
    2. How many visualized mementos were not needed for any given user task?
    3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization?
    4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task?
    5. Which visualizations work better for certain semantic types?
    Evaluated via: [image in original slide]
  • 51.
    Research Plan
    Timeline axis: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020
    - Preliminary work
    - Implement a flexible framework
    - Addressing RQ1: Develop new algorithms for selecting representative mementos
    - Addressing RQ2: Evaluation of individual memento visualizations
    - Dissertation Candidacy Exam
    - Addressing RQ1: Evaluation of algorithms for selecting representative mementos
    - Addressing RQ3: Develop candidate visualizations of groups of mementos
    - Addressing RQ3: Evaluation of visualization of groups of mementos
    - Dissertation Composition
    - Dissertation Defense
    Venues shown: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021
  • 52.
  • 53.
    Summary
    - Collection understanding is a problem with web archive collections
      - Inconsistent metadata
      - 1000s of mementos
      - 1000s of collections
      - Costly for human review
    - We intend to produce a visualization that serves as an abstract to assist in collection understanding
    - Prior work in this area:
      - did not evaluate how well this method works for collection understanding
      - only focused on collections about events
  • 55.
    Contributions
    - Existing work:
      - Semantic categories of web archive collections in Archive-It
      - Categories can be predicted by using structural features
      - Most collections are not about events
    - Future work:
      - Investigate new ways of surfacing representative mementos
      - Contribute knowledge of collection understanding in web archive collections
      - Which visualization methods work best for understanding mementos in a collection
      - New algorithms for use in collection understanding
    Thanks: