@shawnmjones @WebSciDL
Improving Collection
Understanding in Web Archives
Shawn M. Jones
Web Science and Digital Libraries Research Group
Advisors: Michael L. Nelson and Michele C. Weigle
Thanks to:
@shawnmjones @WebSciDL
Researchers Create Their Own Web Archive Collections
2
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Example collections: Tucson Shootings, 2008 Olympics, University of Utah
@shawnmjones @WebSciDL
Web Archive Collections Have Many Versions of the
Same Page
3
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: 2013, 2015, 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015
@shawnmjones @WebSciDL
Different Versions Allow Us to See an Unfolding News
Story
4
Memento from
April 19, 2013 17:12
Searching for Suspects,
City on Lockdown
Memento from
April 19, 2013 17:59
Officer Donahue in hospital,
Lockdown loosened,
Will the Red Sox game be cancelled?
Memento from
April 11, 2013 2:24
Suspect Found,
Officer Collier Lost Life,
Obama speaks
@shawnmjones @WebSciDL
Different Versions Allow Us To See Changes In An
Organization’s Web Presence
5
The White House: 2016
The White House: 2018
@shawnmjones @WebSciDL
Archive-It Provides For Easy Collection Creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
6
@shawnmjones @WebSciDL
The Problem of Collection Understanding
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?
7
@shawnmjones @WebSciDL 8
31 Archive-It
collections match the
search query
“human rights”
How are they different
from each other?
Which one is best for my
needs?
@shawnmjones @WebSciDL
Archive-It provides fields for metadata
9
Collection-wide metadata and metadata on individual seeds
Fields available: Dublin Core + custom fields
@shawnmjones @WebSciDL
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• inconsistent application
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
11
Two example collections:
• 132,599 seeds: no metadata
• 9 seeds: with metadata
Paradox of metadata: more seeds = more effort
@shawnmjones @WebSciDL
Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
12
@shawnmjones @WebSciDL
More Archive-It collections are added every year
More than 8000 collections exist as
of the end of 2016
13
More Archive-It collections
are added each year
@shawnmjones @WebSciDL
The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
15
@shawnmjones @WebSciDL
The proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a visualization of ~28 social cards
16
Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
@shawnmjones @WebSciDL
Background and Related Work
17
@shawnmjones @WebSciDL
Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their
memento-datetimes
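A TimeMap like the one described above can be read with a few lines of code. The sketch below assumes the TimeMap is serialized in the link format defined by the Memento protocol (RFC 7089). The sample text, the archive.example host, and the parse_timemap helper are hypothetical and are not part of any Archive-It tooling.

```python
import re
from datetime import datetime

# Hypothetical TimeMap body in link format; all URIs are placeholders.
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT",
<http://archive.example/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT"
'''

# One entry is "<URI>" followed by its attributes, up to the next "<".
ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento-datetime, URI-M) pairs sorted by capture time."""
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        # Only entries whose rel contains "memento" are captures.
        if "memento" in attrs.get("rel", "").split():
            when = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when, uri))
    return sorted(mementos)


MEMENTOS = parse_timemap(SAMPLE_TIMEMAP)
```

Counting the entries in MEMENTOS for every seed gives the number of documents a reviewer would otherwise have to open by hand.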
18
@shawnmjones @WebSciDL
Document collections have aspects
• Metadata on a publication:
  • used as a surrogate for understanding
  • answers anticipated questions
• Aspects:
  • The central concepts of the corpus
  • For example: aspects about a disaster
    • time
    • place
    • cause
    • countermeasures
• Aspects correspond to the questions that a user might have about a collection
19
Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries
with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial
Intelligence (AAAI’12), 1727–1733.
@shawnmjones @WebSciDL
How can we surface aspects?
• Named Entity Recognition can answer questions of who or where
• Natural Language Processing can answer questions of what time period
• Topic modeling can surface general concepts from the corpus
• And we have to be cognizant of these concepts over time
20
Archive-It Collection 8121: “The Obama White House”
Archive-It Collection 8513: “Donald J Trump White House”
@shawnmjones @WebSciDL
Visualizing web resources (surrogates)
21
• Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social Card (example from Facebook)
• Text + Thumbnail (example from Internet Archive)
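To make the social card surrogate concrete, here is a minimal sketch that assembles card fields from a page's HTML. It assumes the page carries Open Graph meta tags (og:title, og:description, og:image) and falls back to the title element. The CardParser class, the build_card helper, and the sample page are hypothetical; this is not how any particular platform or tool named in this deck builds its cards.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collect Open Graph meta tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.meta[prop] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_card(html, uri):
    """Return the fields a social card would display for one page."""
    parser = CardParser()
    parser.feed(html)
    return {
        "uri": uri,
        "title": parser.meta.get("og:title") or parser.title.strip(),
        "description": parser.meta.get("og:description", ""),
        "image": parser.meta.get("og:image"),
    }


# Hypothetical page used only for illustration.
SAMPLE_HTML = """<html><head>
<title> Fallback title </title>
<meta property="og:title" content="City on Lockdown" />
<meta property="og:image" content="http://example.com/photo.jpg" />
</head><body></body></html>"""

CARD = build_card(SAMPLE_HTML, "http://example.com/story")
```

A text snippet surrogate would use only the title and description fields, while a thumbnail surrogate would need a rendered screenshot instead of the page's own image.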
@shawnmjones @WebSciDL
Which surrogate is best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
23
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
@shawnmjones @WebSciDL
Visualizing Archive-It Collections
24
Other attempts at visualizing Archive-It collections tried to visualize everything.
Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12), 15–18. DOI:10.1145/2232817.2232821
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
@shawnmjones @WebSciDL
Prior work by AlNoamany
• Visualized summaries via the storytelling platform Storify, which is no longer in service
• Showed that test participants could not detect the difference between her automated summaries and human-generated summaries
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
  • There are other types of Archive-It collections
Characteristics of human-generated stories
Characteristics of Archive-It collections
Exclude duplicates
Exclude off-topic pages
Exclude non-English language pages
Dynamically slice the collection
Cluster the pages in each slice
Select high-quality pages from each cluster
Order pages by time
Visualize
27
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
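The steps named on this slide (exclude, slice, cluster, select, order) can be sketched as a small pipeline. This is only an illustration of the flow, not AlNoamany's implementation: the record fields (text, lang, on_topic, topic, quality), the equal-size slicing, and the use of a precomputed topic label in place of real clustering are all assumptions made for the example.

```python
from collections import defaultdict


def summarize(mementos, slice_count=2):
    """Pick representative mementos following the slide's step order."""
    # Exclude duplicates (same text), off-topic pages, and non-English pages.
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m["datetime"]):
        if m["text"] in seen or not m["on_topic"] or m["lang"] != "en":
            continue
        seen.add(m["text"])
        kept.append(m)

    # Slice the collection: equal-size chunks in time order
    # (a stand-in for dynamic slicing).
    size = max(1, -(-len(kept) // slice_count))  # ceiling division
    slices = [kept[i:i + size] for i in range(0, len(kept), size)]

    chosen = []
    for time_slice in slices:
        # Cluster the pages in each slice: grouped here by a precomputed
        # "topic" label (a stand-in for a real clustering algorithm).
        clusters = defaultdict(list)
        for m in time_slice:
            clusters[m["topic"]].append(m)
        # Select the highest-quality page from each cluster.
        for members in clusters.values():
            chosen.append(max(members, key=lambda m: m["quality"]))

    # Order pages by time.
    return sorted(chosen, key=lambda m: m["datetime"])


# Hypothetical records; ISO 8601 strings sort chronologically.
SAMPLE = [
    {"uri": "u1", "datetime": "2013-04-19T17:12", "text": "a", "lang": "en",
     "on_topic": True, "topic": "search", "quality": 0.5},
    {"uri": "u2", "datetime": "2013-04-19T17:30", "text": "a", "lang": "en",
     "on_topic": True, "topic": "search", "quality": 0.6},
    {"uri": "u3", "datetime": "2013-04-19T17:59", "text": "b", "lang": "en",
     "on_topic": True, "topic": "search", "quality": 0.9},
    {"uri": "u4", "datetime": "2013-04-19T18:30", "text": "c", "lang": "es",
     "on_topic": True, "topic": "search", "quality": 0.8},
    {"uri": "u5", "datetime": "2013-04-20T02:24", "text": "d", "lang": "en",
     "on_topic": True, "topic": "capture", "quality": 0.7},
    {"uri": "u6", "datetime": "2013-04-20T03:00", "text": "e", "lang": "en",
     "on_topic": False, "topic": "capture", "quality": 0.9},
]

STORY = [m["uri"] for m in summarize(SAMPLE)]
```

Each stage is a separate step so that a different slicing or clustering method can be swapped in for other kinds of collections.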
@shawnmjones @WebSciDL
Preliminary Work
28
@shawnmjones @WebSciDL
Growth curves for understanding collection creation behavior
29
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figure: example growth curves with positive and negative skew]
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.
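As a rough illustration of how a growth curve's skew could be quantified, the sketch below normalizes time and cumulative count to the range 0 to 1, measures the area under the curve (AUC) with the trapezoid rule, and subtracts the 0.5 area under the diagonal. The normalization, the linear interpolation between points, and the sign convention (positive when items arrive early) are assumptions of this sketch and may differ from the cited paper.

```python
def growth_curve(timestamps):
    """Return (fraction of lifespan, fraction of items added) points."""
    ts = sorted(timestamps)
    start, end = ts[0], ts[-1]
    span = (end - start) or 1  # avoid dividing by zero for a single instant
    points = [(0.0, 0.0)]
    for i, t in enumerate(ts, start=1):
        points.append(((t - start) / span, i / len(ts)))
    return points


def auc(points):
    """Area under the curve using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def skew_from_diagonal(timestamps):
    """AUC minus the 0.5 area under the diagonal."""
    return auc(growth_curve(timestamps)) - 0.5


# Hypothetical capture times, in days since the collection was created.
EARLY = skew_from_diagonal([0, 1, 2, 3, 100])    # most items added early
LATE = skew_from_diagonal([0, 97, 98, 99, 100])  # most items added late
```

Running the same function on seed creation times and on memento capture times gives two curves that can then be compared with each other.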
@shawnmjones @WebSciDL
Structural features of Archive-It collections
• difference between seed curve AUC and diagonal
• difference between seed memento curve AUC and diagonal
• difference between seed memento curve AUC and seed curve AUC
• number of seeds
• number of mementos
• seed URI domain diversity
• seed URI path depth diversity
• most frequent seed URI path depth
• % query string usage in seed URIs
• lifespan of collection
30
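Several of the URI-based features listed above can be computed directly from a list of seed URIs. In this sketch, diversity is taken to be the number of unique values divided by the number of seeds; that definition, the seed_features helper, and the sample seeds are assumptions for illustration and may not match the definitions in the cited paper.

```python
from collections import Counter
from urllib.parse import urlparse


def path_depth(uri):
    """Number of non-empty path segments in a URI."""
    return len([seg for seg in urlparse(uri).path.split("/") if seg])


def seed_features(seeds):
    """Compute URI-based structural features for a list of seed URIs."""
    parsed = [urlparse(uri) for uri in seeds]
    depths = [path_depth(uri) for uri in seeds]
    n = len(seeds)
    return {
        "number_of_seeds": n,
        # Assumed definition: unique values divided by number of seeds.
        "domain_diversity": len({p.netloc.lower() for p in parsed}) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100 * sum(1 for p in parsed if p.query) / n,
    }


# Hypothetical seeds used only for illustration.
SEEDS = [
    "http://example.com/",
    "http://example.com/news/2016/flood",
    "http://example.org/a?id=1",
    "http://example.net/b",
]

FEATURES = seed_features(SEEDS)
```

A collection whose seeds all share one domain and a shallow path depth would score low on both diversity measures.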
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.
@shawnmjones @WebSciDL
Semantic categories of Archive-It collections
In a study of 3,382 Archive-It collections:
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
37
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It.
In International Conference on Digital Preservation (iPRES) 2018.
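For readers unfamiliar with the F1 score reported for the classifier, the sketch below shows how it can be computed for a multi-class prediction. The slide does not say how the score was averaged across categories, so the support-weighted average used here is an assumption, and the labels and predictions are made up for illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1 for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's true count."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * count for label, count in support.items()) / len(y_true)


# Made-up category labels and predictions.
Y_TRUE = ["self-archiving", "self-archiving", "subject-based", "time-bounded-expected"]
Y_PRED = ["self-archiving", "subject-based", "subject-based", "time-bounded-expected"]

SCORE = weighted_f1(Y_TRUE, Y_PRED)
```

Because more than half of the collections fall in one category, a per-class breakdown is more informative than accuracy alone.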
@shawnmjones @WebSciDL
Research Plan
38
@shawnmjones @WebSciDL
Developing a Flexible Framework
Dark and Stormy Archives (DSA) 2.0
A framework based on AlNoamany’s work
Two concepts are embodied in this framework:
1. Selecting representative mementos
2. Visualizing those mementos
Framework components:
• Archive-It Utilities
• Off-Topic Memento Toolkit
• Representative Memento Selection Utilities
• MementoEmbed
• DSA Visualization Interface
Input: Web Archive Collection
Output: Visualized Summary
39
Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The
Off-Topic Memento Toolkit. In International Conference on Digital
Preservation (iPRES) 2018.
@shawnmjones @WebSciDL
Not just Archive-It
40
Our methods will be applicable to any web archive collection,
like those developed by Rhizome’s Webrecorder.
@shawnmjones @WebSciDL
Evaluation
41
@shawnmjones @WebSciDL
Evaluation
1. Choose target
collections for study
2. Develop user tasks
for each collection
3. How well do users
complete the tasks?
44
Who is X? Where is Y? When does Z take place?
@shawnmjones @WebSciDL
RQ1: How do we select representative mementos for the different semantic types of collections?
• Summarizing a collection involves:
  1. Grouping the mementos by their commonalities
  2. Selecting the highest quality mementos from each group
• Different semantic categories may require different algorithms
• We want to reuse existing tools where possible:
  • Stanford NLP
  • Archives Unleashed Toolkit
  • gensim
  • spaCy
45
@shawnmjones @WebSciDL
RQ1 Evaluation
1. How many user tasks were addressed by the mementos chosen? How many
user tasks failed?
2. How many mementos produced are not useful for any user task?
3. Which algorithm surfaces aspects satisfying the highest mean number of user
tasks for a given collection type?
4. What is the mean minimum number of mementos necessary to address the
most user tasks?
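The first two evaluation questions reduce to simple set arithmetic once each selected memento has been judged against the user tasks. The sketch below assumes such judgments already exist as a mapping from memento to the tasks it helps answer; the evaluate_selection helper and the sample data are hypothetical.

```python
def evaluate_selection(tasks, addressed_by):
    """Score a set of selected mementos against a list of user tasks.

    tasks: list of task identifiers.
    addressed_by: dict mapping each selected memento to the set of
        tasks it helps answer (an empty set if it answers none).
    """
    covered = set().union(*addressed_by.values()) & set(tasks)
    return {
        "tasks_addressed": sorted(covered),
        "tasks_failed": sorted(set(tasks) - covered),
        "mementos_not_useful": sorted(
            m for m, helped in addressed_by.items() if not helped & set(tasks)
        ),
    }


# Hypothetical judgments for three selected mementos.
TASKS = ["who", "where", "when"]
JUDGMENTS = {
    "memento-1": {"who"},
    "memento-2": {"who", "where"},
    "memento-3": set(),
}

RESULT = evaluate_selection(TASKS, JUDGMENTS)
```

Averaging the size of tasks_addressed over collections of the same semantic category gives the mean used to compare algorithms in question 3.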
46
@shawnmjones @WebSciDL
RQ2: What visualizations (surrogates) work best for understanding individual mementos?
• There are many different possibilities for surrogates
• Does the choice of surrogate change depending on the collection’s semantic category?
47
@shawnmjones @WebSciDL
RQ2 Evaluation
1. Does the depth, domain, or category of
the URI affect which surrogate
performs better?
2. Do different surrogates work better for
different semantic categories?
3. For social cards, which elements of the
social card need to be present to
understand the underlying memento?
4. For thumbnails, what size thumbnail
works best for understanding? How
much of the web page needs to be
rendered for a thumbnail to be useful for
understanding?
48
@shawnmjones @WebSciDL
RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
• Once we have:
  • Candidate summarization algorithms
  • Evaluated surrogates for individual mementos
• We can then evaluate the combination of summarization and visualization.
• There are many options:
  • arranging surrogates
  • headings
  • metadata
49
RQ1: Summarization Algorithms + RQ2: Visualization Elements → RQ3: Visualization of Summary
@shawnmjones @WebSciDL
RQ3 Evaluation
1. How many user tasks are addressed by
the visualization chosen? How many fail?
2. How many visualized mementos were not
needed for any given user task?
3. Given an aspect of the collection, can the
user address a user task concerning it by
visually scanning the visualization?
4. Given multiple aspects of the collection,
can the user successfully compare
different individual memento visualizations
to address a user task?
5. Which visualizations work better for certain
semantic types?
50
@shawnmjones @WebSciDL
Research Plan
51
Timeline: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020
Preliminary work
Implement a flexible framework
Addressing RQ1: Develop new algorithms for selecting
representative mementos
Addressing RQ2: Evaluation of individual memento
visualizations
Dissertation Candidacy Exam
Addressing RQ1: Evaluation of algorithms for selecting
representative mementos
Addressing RQ3: Develop candidate visualizations of groups of
mementos
Addressing RQ3: Evaluation of visualization of groups of
mementos
Dissertation Composition
Dissertation Defense
SIGIR 2020
CHI 2020
iPres 2018
iPres 2019
JCDL 2019
CHI 2020
JCDL 2020
JCDL 2021
@shawnmjones @WebSciDL
Conclusion
52
@shawnmjones @WebSciDL
Summary
• Collection understanding is a problem with web archive collections
  • Inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • Costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
53
@shawnmjones @WebSciDL
Contributions
• Existing work:
  • Semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
• Future work:
  • Investigate new ways of surfacing representative mementos
  • Contribute knowledge of collection understanding in web archive collections
    • Which visualization methods work best for understanding mementos in a collection
    • New algorithms for use in collection understanding
54
@shawnmjones @WebSciDL
Contributions
 Existing work:
 Semantic categories of web archive collections in Archive-It
 Categories can be predicted by using structural features
 Most collections are not about events
 Future work:
 Investigate new ways of surfacing representative mementos
 Contribute knowledge of collection understanding in web archive collections
 Which visualization methods work best for understanding mementos in a collection
 New algorithms for use in collection understanding
55
Thanks:

More Related Content

What's hot

Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Shawn Jones
 

What's hot (20)

Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web Archives
 
Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItVisualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-It
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web Archives
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media Stories
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web Archives
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
Intro to Wikisource
Intro to WikisourceIntro to Wikisource
Intro to Wikisource
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
 
What is #LODLAM?! (revised January 2015)
What is #LODLAM?! (revised January 2015)What is #LODLAM?! (revised January 2015)
What is #LODLAM?! (revised January 2015)
 
Linked open data and libraries
Linked open data and librariesLinked open data and libraries
Linked open data and libraries
 
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
 
Environmental trends and OCLC Research, a presentation at the University of N...
Environmental trends and OCLC Research, a presentation at the University of N...Environmental trends and OCLC Research, a presentation at the University of N...
Environmental trends and OCLC Research, a presentation at the University of N...
 
The Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 WorkshopThe Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 Workshop
 

Similar to Improving Collection Understanding in Web Archives

Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Shawn Jones
 
LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014 LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014
PrattSILS
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
Jon Voss
 
The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?
Anna Fensel
 
20111120 warsaw learning curve by b hyland notes
20111120 warsaw   learning curve by b hyland notes20111120 warsaw   learning curve by b hyland notes
20111120 warsaw learning curve by b hyland notes
Bernadette Hyland-Wood
 

Similar to Improving Collection Understanding in Web Archives (20)

Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
 
LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014 LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
 
The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?
 
The CSO Open Data Experience
The CSO Open Data ExperienceThe CSO Open Data Experience
The CSO Open Data Experience
 
web 2.0, library systems and the library system
web 2.0, library systems and the library systemweb 2.0, library systems and the library system
web 2.0, library systems and the library system
 
Research Data Curation _ Grad Humanities Class
Improving Collection Understanding in Web Archives

  • 1. @shawnmjones @WebSciDL Improving Collection Understanding in Web Archives. Shawn M. Jones, Web Science and Digital Libraries Research Group. Advisors: Michael L. Nelson and Michele C. Weigle.
  • 2. Researchers Create Their Own Web Archive Collections. Archived web pages, or mementos, are used by journalists, sociologists, and historians. Example collections: Tucson Shootings, 2008 Olympics, University of Utah.
  • 3. Web Archive Collections Have Many Versions of the Same Page.
      • University of Utah Office of Admissions, from the University of Utah Web Archive Collection: 2013, 2015, 2018
      • Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015
  • 4. Different Versions Allow Us to See an Unfolding News Story.
      • Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
      • Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
      • Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
  • 5. Different Versions Allow Us To See Changes In An Organization’s Web Presence. The White House: 2016; The White House: 2018.
  • 6. Archive-It Provides For Easy Collection Creation. Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos.
  • 7. The Problem of Collection Understanding. What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use?
  • 8. 31 Archive-It collections match the search query “human rights”. How are they different from each other? Which one is best for my needs?
  • 9. Archive-It provides fields for metadata: collection-wide metadata and metadata on individual seeds (Dublin Core + custom fields).
  • 11. But, alas, the metadata does not help. Because metadata is optional, it is not always present. Metadata on Archive-It collections:
      • many different curators
      • different organizations
      • different content standards
      • different rules of interpretation
      • it is inconsistently applied
    This means that a user cannot reliably compare metadata fields to understand the differences between collections. Examples shown: 9 seeds with metadata; 132,599 seeds, no metadata. Paradox of metadata: more seeds = more effort.
  • 12. Reviewing mementos manually is costly. Some collections have 1000s of seeds, and each seed can have many mementos. The collection shown has 132,599 seeds, many with multiple mementos. In some cases, understanding a collection can require reviewing 100,000+ documents.
  • 13. More Archive-It collections are added every year. More than 8000 collections exist as of the end of 2016.
  • 15. The problem, summarized:
      • There are multiple collections about the same concept.
      • The metadata for each collection is non-existent or inconsistently applied.
      • Many collections have 1000s of seeds with multiple mementos.
      • There are more than 8000 collections.
      • Human review of these mementos for collection understanding is an expensive proposition.
  • 16. The proposal: a visualization made of representative mementos.
      • Our visualization is a summary that will act like an abstract.
      • Pirolli and Card’s Information Foraging Theory:
          • maximize the value of the information gained from our summaries
          • minimize the cost of interacting with the collection
          • ensure that our representative mementos have good information scent, that is, they contain cues that the memento will address a user’s needs
      • From this: 318 seeds with 2421 mementos. To something like this: a visualization of ~28 social cards.
      • Reference: Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  • 18. Looking at Archive-It collections from the outside.
      • Curators select seeds, which are captured as seed mementos.
      • Deep mementos are created from other pages linked to seeds.
      • Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times (their memento-datetimes).
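As a rough illustration of what a TimeMap carries, the sketch below parses a small link-format TimeMap and pulls out each memento URI with its memento-datetime. The sample text, the `archive.example` URIs, and the function name `parse_timemap` are all invented for illustration; the regular expressions are a simplification and are not a complete link-format parser.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap body in link format; every URI here is made up.
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT",
<http://archive.example/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT"'''

ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')   # one "<uri>; params" entry
ATTR = re.compile(r'(\w+)="([^"]*)"')         # key="value" pairs inside params


def parse_timemap(text):
    """Return a time-ordered list of (memento-datetime, memento URI) pairs."""
    mementos = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(ATTR.findall(params))
        rels = attrs.get("rel", "").split()
        # Only entries whose rel includes "memento" are captures.
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


for when, uri in parse_timemap(SAMPLE_TIMEMAP):
    print(when.isoformat(), uri)
```

Entries with `rel="original"` or `rel="self"` are skipped, so only the captures of the seed remain.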
  • 19. Document collections have aspects.
      • Metadata on a publication is used as a surrogate for understanding and answers anticipated questions.
      • Aspects are the central concepts of the corpus. For example, aspects about a disaster: time, place, cause, countermeasures.
      • Aspects correspond to the questions that a user might have about a collection.
      • Reference: Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.
  • 20. How can we surface aspects?
      • Named Entity Recognition can answer questions of who or where.
      • Natural Language Processing can answer questions of what time period.
      • Topic modeling can surface general concepts from the corpus.
      • And we have to be cognizant of these concepts over time.
      • Examples: Archive-It Collection 8121, “The Obama White House”; Archive-It Collection 8513, “Donald J Trump White House”.
  • 21. Visualizing web resources (surrogates).
      • Text snippet (example from Bing)
      • Thumbnail (example from UK Web Archive)
      • Social card (example from Facebook)
      • Text + thumbnail (example from Internet Archive)
  • 23. Which surrogate is best for web resources? Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
      • Li (2008): social cards > text snippets in performance
      • Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
      • Woodruff (2001): thumbnails > text snippets in performance
      • Teevan (2009): text snippets > thumbnails in performance
      • Aula (2010): text snippets ~= thumbnails in performance
      • Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
      • Capra (2013): social cards > text snippets in performance (barely statistically significant)
      • Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
      • Source: https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  • 24. Visualizing Archive-It Collections. Other attempts at visualizing Archive-It collections tried to visualize everything.
      • Reference: Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at Archive-It. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
      • http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 27. Prior work by AlNoamany.
      • Visualized summaries via the storytelling platform Storify, which is no longer in service (http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html).
      • Proved that test participants could not detect the difference between her automated summaries and human-generated summaries.
      • Did not evaluate if the resulting summaries were effective tools for collection understanding.
      • Focused on summarizing collections about events. There are other types of Archive-It collections.
      • Process, drawn from the characteristics of human-generated stories and the characteristics of Archive-It collections: exclude duplicates; exclude off-topic pages; exclude non-English language pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize.
      • Reference: Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
      • http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
  • 29. Growth curves for understanding collection creation behavior.
      • Skew of the collection’s holdings (positive or negative): indicates temporality of collection.
      • Skew of the curatorial involvement with the collection (positive or negative): when seeds were added; when interest was lost or regained.
      • Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
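A minimal sketch of how such a growth curve and its skew might be computed. It assumes, which the slides do not state, that the x axis is the fraction of the collection's lifespan elapsed, the y axis is the fraction of seeds (or mementos) added so far, and the curve is joined by straight lines; under those assumptions a perfectly steady collection follows the diagonal, whose area is 0.5. The function names and the numeric timestamps are hypothetical.

```python
def growth_curve(event_times, start, end):
    """Points (fraction of lifespan elapsed, fraction of items added so far)."""
    span = end - start
    times = sorted(event_times)
    points = [(0.0, 0.0)]
    for count, t in enumerate(times, start=1):
        points.append(((t - start) / span, count / len(times)))
    points.append((1.0, 1.0))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def difference_from_diagonal(points):
    """Positive when items arrive early in the lifespan, negative when late."""
    return area_under_curve(points) - 0.5


# Hypothetical collections with a 100-day lifespan and four seeds each.
early = growth_curve([0, 0, 0, 0], start=0, end=100)
steady = growth_curve([25, 50, 75, 100], start=0, end=100)
late = growth_curve([100, 100, 100, 100], start=0, end=100)
print(difference_from_diagonal(early),
      difference_from_diagonal(steady),
      difference_from_diagonal(late))
```

The sign convention here (positive means early) is a choice made for this sketch and may not match the labels on the slide's figure.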
  • 30. Structural features of Archive-It collections.
      • difference between seed curve AUC and diagonal
      • difference between seed memento curve AUC and diagonal
      • difference between seed memento curve AUC and seed curve AUC
      • number of seeds
      • number of mementos
      • seed URI domain diversity
      • seed URI path depth diversity
      • most frequent seed URI path depth
      • % query string usage in seed URIs
      • lifespan of collection
      • Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
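The URI-based features in this list can be approximated with the standard library, as in the sketch below. The slides do not define "diversity"; here it is assumed to be the ratio of distinct values to the number of seeds, which may differ from the definition used in the cited paper. The function names and the example seed URIs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments, e.g. /news/story.html has depth 2."""
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


def seed_uri_features(seed_uris):
    n = len(seed_uris)
    domains = [urlsplit(u).hostname for u in seed_uris]
    depths = [path_depth(u) for u in seed_uris]
    with_query = sum(1 for u in seed_uris if urlsplit(u).query)
    return {
        "number_of_seeds": n,
        # Assumed definition: distinct values divided by number of seeds.
        "domain_diversity": len(set(domains)) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100.0 * with_query / n,
    }


# Made-up seed list for illustration.
seeds = [
    "http://example.com/",
    "http://example.com/news/story.html",
    "http://example.org/a/b?id=1",
    "http://example.org/c/d",
]
print(seed_uri_features(seeds))
```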
  • 37. Semantic categories of Archive-It collections, in a study of 3,382 Archive-It collections:
      • Self-Archiving: 54.1% of collections
      • Subject-based: 27.6% of collections
      • Time Bounded – Expected: 14.1% of collections
      • Time Bounded – Spontaneous: 4.2% of collections
      • Some evaluated by AlNoamany.
      • Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720.
      • Reference: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 39. Developing a Flexible Framework: Dark and Stormy Archives (DSA) 2.0, a framework based on AlNoamany’s work.
      • Components: Off-Topic Memento Toolkit, Representative Memento Selection Utilities, Archive-It Utilities, MementoEmbed, DSA Visualization Interface.
      • Input: a web archive collection. Output: a visualized summary.
      • Two concepts are embodied in this framework: 1. selecting representative mementos; 2. visualizing those mementos.
      • Reference: Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
  • 40. Not just Archive-It. Our methods will be applicable to any web archive collection, like those developed by Rhizome’s Webrecorder.
  • 44. Evaluation.
      1. Choose target collections for study.
      2. Develop user tasks for each collection, for example: Who is X? Where is Y? When does Z take place?
      3. How well do users complete the tasks?
  • 45. RQ1: How do we select representative mementos for the different semantic types of collections?
      • Summarizing a collection involves: 1. grouping the mementos by their commonalities; 2. selecting the highest quality mementos from each group.
      • Different semantic categories may require different algorithms.
      • We want to reuse existing tools where possible: Stanford NLP, Archives Unleashed Toolkit, gensim, spaCy.
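The two summarization steps on this slide (group, then pick the best of each group) can be sketched generically as below. This is not the algorithm proposed in the talk: the grouping key, the quality score, and the `topic` and `quality` fields are placeholders that a real implementation would replace with clustering output and a quality measure.

```python
from collections import defaultdict


def select_representatives(mementos, group_key, quality):
    """Group mementos, then keep the highest-quality memento of each group."""
    groups = defaultdict(list)
    for memento in mementos:
        groups[group_key(memento)].append(memento)
    # Groups come out in order of first appearance (dicts preserve insertion order).
    return [max(members, key=quality) for members in groups.values()]


# Hypothetical mementos with a precomputed topic label and quality score.
mementos = [
    {"uri": "a", "topic": "flood", "quality": 0.4},
    {"uri": "b", "topic": "flood", "quality": 0.9},
    {"uri": "c", "topic": "relief", "quality": 0.7},
]
picked = select_representatives(
    mementos,
    group_key=lambda m: m["topic"],
    quality=lambda m: m["quality"],
)
print([m["uri"] for m in picked])
```

Passing the grouping and scoring functions as parameters keeps the skeleton the same while different semantic categories swap in different algorithms.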
  • 46. RQ1 Evaluation.
      1. How many user tasks were addressed by the mementos chosen? How many user tasks failed?
      2. How many mementos produced are not useful for any user task?
      3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type?
      4. What is the mean minimum number of mementos necessary to address the most user tasks?
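Questions 1 and 2 reduce to simple set arithmetic once someone has judged which mementos can answer which task. The sketch below assumes such judgments exist as a mapping from task to the set of mementos that address it; that mapping, the memento identifiers, and the function name are hypothetical.

```python
def evaluate_summary(selected, task_answers):
    """Count tasks addressed, tasks failed, and selected mementos useful for no task."""
    selected = set(selected)
    addressed = {task for task, answers in task_answers.items()
                 if selected & set(answers)}
    failed = set(task_answers) - addressed
    useful_for_some_task = set().union(*task_answers.values())
    not_useful = selected - useful_for_some_task
    return {
        "addressed": len(addressed),
        "failed": len(failed),
        "not_useful": len(not_useful),
    }


# Hypothetical judgments: which mementos can answer each user task.
task_answers = {
    "Who is X?": {"m1"},
    "Where is Y?": {"m1", "m4"},
    "When does Z take place?": {"m5"},
}
print(evaluate_summary(["m1", "m2", "m3"], task_answers))
```

Averaging these counts over several collections of the same semantic category would give the per-category means that questions 3 and 4 ask about.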
  • 47. RQ2: What visualizations (surrogates) work best for understanding individual mementos?
      • There are many different possibilities for surrogates.
      • Does the choice of surrogate change depending on the collection’s semantic category?
  • 48. RQ2 Evaluation.
      1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better?
      2. Do different surrogates work better for different semantic categories?
      3. For social cards, which elements of the social card need to be present to understand the underlying memento?
      4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding?
  • 49. RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
      • Once we have candidate summarization algorithms and evaluated surrogates for individual mementos, we can then evaluate the combination of summarization and visualization.
      • There are many options: arranging surrogates, headings, metadata.
      • Diagram labels: RQ1: Summarization Algorithms; RQ2: Visualization Elements; RQ3: Visualization of Summary.
  • 50. RQ3 Evaluation.
      1. How many user tasks are addressed by the visualization chosen? How many fail?
      2. How many visualized mementos were not needed for any given user task?
      3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization?
      4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task?
      5. Which visualizations work better for certain semantic types?
  • 51. Research Plan (timeline chart; axis runs from 03/2017 to 05/2020).
      • Preliminary work
      • Implement a flexible framework
      • Addressing RQ1: Develop new algorithms for selecting representative mementos
      • Addressing RQ2: Evaluation of individual memento visualizations
      • Dissertation Candidacy Exam
      • Addressing RQ1: Evaluation of algorithms for selecting representative mementos
      • Addressing RQ3: Develop candidate visualizations of groups of mementos
      • Addressing RQ3: Evaluation of visualization of groups of mementos
      • Dissertation Composition
      • Dissertation Defense
      • Venues shown on the chart: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021
  • 53. Summary.
      • Collection understanding is a problem with web archive collections:
          • inconsistent metadata
          • 1000s of mementos
          • 1000s of collections
          • costly for human review
      • We intend to produce a visualization that serves as an abstract to assist in collection understanding.
      • Prior work in this area:
          • did not evaluate how well this method works for collection understanding
          • only focused on collections about events
  • 55. Contributions.
      • Existing work:
          • Semantic categories of web archive collections in Archive-It
          • Categories can be predicted by using structural features
          • Most collections are not about events
      • Future work:
          • Investigate new ways of surfacing representative mementos
          • Contribute knowledge of collection understanding in web archive collections
          • Which visualization methods work best for understanding mementos in a collection
          • New algorithms for use in collection understanding