Improving Collection Understanding in Web Archives

Shawn Jones, Research Assistant at Old Dominion University
@shawnmjones @WebSciDL
Shawn M. Jones
Web Science and Digital Libraries Research Group
Advisors: Michael L. Nelson and Michele C. Weigle
Researchers Create Their Own Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
Web Archive Collections Have Many Versions of the Same Page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: 2013, 2015, 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015
Different Versions Allow Us to See an Unfolding News Story
Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
Different Versions Allow Us To See Changes In An Organization’s Web Presence
The White House: 2016
The White House: 2018
Archive-It Provides For Easy Collection Creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
The Problem of Collection Understanding
What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016?
Which one should a researcher use?
31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?
Archive-It provides fields for metadata
Collection-wide metadata and metadata on individual seeds
Dublin Core + custom fields
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
132,599 seeds: no metadata
9 seeds: with metadata
Paradox of metadata: more seeds = more effort
Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection
More Archive-It collections are added every year
More than 8000 collections exist as of the end of 2016
The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
The proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a visualization of ~28 social cards
Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
Background and Related Work
Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento-datetimes
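To make the TimeMap idea concrete, here is a minimal sketch that reads a link-format TimeMap and lists each memento with its memento-datetime. The sample TimeMap, its URIs, and the `parse_timemap` helper are illustrative assumptions, not output from Archive-It; real TimeMaps carry more entries and parameters.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by ;name="value" parameters.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento URI, memento-datetime) pairs, oldest first."""
    mementos = []
    for uri, raw_params in ENTRY.findall(text):
        params = dict(PARAM.findall(raw_params))
        # rel can hold several values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((uri, parsedate_to_datetime(params["datetime"])))
    return sorted(mementos, key=lambda pair: pair[1])


# Hypothetical TimeMap for one seed; every URI here is a placeholder.
SAMPLE_TIMEMAP = '''<http://seed.example.org/>; rel="original",
<http://archive.example.org/timemap/http://seed.example.org/>; rel="self"; type="application/link-format",
<http://archive.example.org/20130419175900/http://seed.example.org/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT",
<http://archive.example.org/20130419171200/http://seed.example.org/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT"'''

timemap = parse_timemap(SAMPLE_TIMEMAP)
for urim, captured in timemap:
    print(captured.isoformat(), urim)
```

The regular expression matches whole entries rather than splitting on commas, because the datetime values themselves contain commas.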
Document collections have aspects
• Metadata on a publication:
  • used as a surrogate for understanding
  • answers anticipated questions
• Aspects:
  • The central concepts of the corpus
  • For example: aspects about a disaster
    • time
    • place
    • cause
    • countermeasures
  • Aspects correspond to the questions that a user might have about a collection
Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.
How can we surface aspects?
• Named Entity Recognition can answer questions of who or where?
• Natural Language Processing can answer questions of what time period?
• Topic modeling can surface general concepts from the corpus
• And we have to be cognizant of these concepts over time
Archive-It Collection 8121: “The Obama White House”
Archive-It Collection 8513: “Donald J Trump White House”
Visualizing web resources (surrogates)
• Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social Card (example from Facebook)
• Text + Thumbnail (example from Internet Archive)
Which surrogate is best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
Visualizing Archive-It Collections
Other attempts at visualizing Archive-It collections tried to visualize everything.
Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
Prior work by AlNoamany
• Visualized summaries via the storytelling platform Storify, which is no longer in service
• Proved that test participants could not detect the difference between her automated summaries and human-generated summaries
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
  • There are other types of Archive-It collections
Characteristics of human-generated stories
Characteristics of Archive-It collections
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html
http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
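The steps on the preceding slide can be sketched as a small pipeline. This is a simplified illustration under loud assumptions, not AlNoamany's implementation: the `Memento` fields (`off_topic`, `quality`), the equal-sized time slices, and the greedy word-overlap clustering are all stand-ins chosen so the example runs with the standard library alone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    urim: str
    captured: datetime
    text: str
    language: str = "en"
    off_topic: bool = False  # hypothetical flag, e.g. set by an off-topic detector
    quality: float = 0.0     # hypothetical score; higher is better


def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def summarize(mementos, n_slices=2, threshold=0.5):
    # Exclude duplicates (identical text), off-topic pages, and non-English pages.
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m.captured):
        if m.text in seen or m.off_topic or m.language != "en":
            continue
        seen.add(m.text)
        kept.append(m)
    if not kept:
        return []
    # Slice the time-ordered mementos into roughly equal-sized groups.
    size = -(-len(kept) // n_slices)  # ceiling division
    slices = [kept[i:i + size] for i in range(0, len(kept), size)]
    story = []
    for chunk in slices:
        # Greedy clustering: join the first cluster whose first member is similar enough.
        clusters = []
        for m in chunk:
            for cluster in clusters:
                if jaccard(m.text, cluster[0].text) >= threshold:
                    cluster.append(m)
                    break
            else:
                clusters.append([m])
        # Keep the highest-quality memento from each cluster.
        story.extend(max(c, key=lambda m: m.quality) for c in clusters)
    # Order the selected mementos by capture time.
    return sorted(story, key=lambda m: m.captured)


# Made-up mementos for illustration only.
collection = [
    Memento("m1", datetime(2013, 4, 19, 17, 12), "suspect search city lockdown", quality=0.9),
    Memento("m2", datetime(2013, 4, 19, 17, 20), "suspect search city lockdown", quality=0.5),
    Memento("m3", datetime(2013, 4, 19, 17, 59), "suspect search city lockdown continues", quality=0.4),
    Memento("m4", datetime(2013, 4, 19, 18, 5), "unrelated advertisement page", off_topic=True),
    Memento("m5", datetime(2013, 4, 19, 18, 30), "la ciudad cerrada", language="es"),
    Memento("m6", datetime(2013, 4, 20, 2, 24), "suspect found obama speaks", quality=0.7),
]
story = summarize(collection)
print([m.urim for m in story])
```

Each stage is deliberately naive so it can be swapped out; for example, the clustering step could be replaced by any method that groups similar pages within a slice.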
Preliminary Work
Growth curves for understanding collection creation behavior
• Skew of the collection’s holdings (positive or negative)
  • Indicates temporality of collection
• Skew of the curatorial involvement with the collection (positive or negative)
  • When seeds were added
  • When interest was lost or regained
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
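A growth curve can be illustrated with a short sketch. It assumes, since the slide does not spell this out, that the curve plots the fraction of the collection's lifespan elapsed against the cumulative fraction of seeds (or seed mementos) added so far, and that its area under the curve (AUC) is compared with the 0.5 area under the diagonal. The dates and function names are made up for illustration.

```python
from datetime import date


def growth_curve(event_dates, start, end):
    """Points (fraction of lifespan elapsed, fraction of events so far)."""
    lifespan = (end - start).days
    if lifespan <= 0:
        raise ValueError("collection lifespan must be at least one day")
    ordered = sorted(event_dates)
    points = [(0.0, 0.0)]
    for count, day in enumerate(ordered, start=1):
        points.append(((day - start).days / lifespan, count / len(ordered)))
    points.append((1.0, 1.0))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def auc_minus_diagonal(points):
    """Positive when growth happens early in the lifespan, negative when late."""
    return auc(points) - 0.5


# Two hypothetical collections with the same ten-day lifespan.
start, end = date(2018, 1, 1), date(2018, 1, 11)
early_seeds = [date(2018, 1, d) for d in (1, 2, 3, 4)]
late_seeds = [date(2018, 1, d) for d in (8, 9, 10, 11)]

early_diff = auc_minus_diagonal(growth_curve(early_seeds, start, end))
late_diff = auc_minus_diagonal(growth_curve(late_seeds, start, end))
print(early_diff, late_diff)
```

Under these assumptions, a collection whose seeds were all added at the start sits above the diagonal, and one whose seeds arrived near the end sits below it.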
Structural features of Archive-It collections
• difference between seed curve AUC and diagonal
• difference between seed memento curve AUC and diagonal
• difference between seed memento curve AUC and seed curve AUC
• number of seeds
• number of mementos
• seed URI domain diversity
• seed URI path depth diversity
• most frequent seed URI path depth
• % query string usage in seed URIs
• lifespan of collection
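Several of the seed URI features listed above can be computed directly from the seed list. The sketch below is only an approximation: it assumes "diversity" means the number of distinct values divided by the number of seeds, which the slide does not define, and the seed URIs are placeholders.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments in a URI."""
    return len([segment for segment in urlsplit(uri).path.split("/") if segment])


def seed_uri_features(seed_uris):
    n = len(seed_uris)
    if n == 0:
        raise ValueError("need at least one seed URI")
    domains = {urlsplit(uri).hostname for uri in seed_uris}
    depths = [path_depth(uri) for uri in seed_uris]
    with_query = sum(1 for uri in seed_uris if urlsplit(uri).query)
    return {
        "number_of_seeds": n,
        # Assumed definition: distinct values divided by number of seeds.
        "domain_diversity": len(domains) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100 * with_query / n,
    }


# Placeholder seeds for illustration only.
seeds = [
    "https://www.example.org/",
    "https://admissions.example.org/apply/",
    "https://admissions.example.org/apply/freshman/?term=fall",
    "https://www.example.org/news/",
]
features = seed_uri_features(seeds)
print(features)
```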
Semantic categories of Archive-It collections
In a study of 3,382 Archive-It collections:
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
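For readers unfamiliar with the F1 score quoted for the classifier, the sketch below computes per-category F1 and an unweighted mean over categories. The slide does not say how its F1 was averaged, so the macro average here is an assumption, and the labels are toy data rather than results from the study.

```python
def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores (an assumed averaging choice)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for six hypothetical collections.
y_true = ["self", "self", "subject", "expected", "spontaneous", "self"]
y_pred = ["self", "subject", "subject", "expected", "expected", "self"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(per_class, overall)
```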
Research Plan
Developing a Flexible Framework
Dark and Stormy Archives (DSA) 2.0
A framework based on AlNoamany’s work
Two concepts are embodied in this framework:
1. Selecting representative mementos
2. Visualizing those mementos
Diagram: a Web Archive Collection passes through the framework components (Off-Topic Memento Toolkit, Representative Memento Selection Utilities, Archive-It Utilities, MementoEmbed, DSA Visualization Interface) to produce a Visualized Summary
Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
Not just Archive-It
Our methods will be applicable to any web archive collection,
like those developed by Rhizome’s Webrecorder.
Evaluation
1. Choose target collections for study
2. Develop user tasks for each collection
3. How well do users complete the tasks?
Example user tasks: Who is X? Where is Y? When does Z take place?
RQ1: How do we select representative mementos for the different semantic types of collections?
• Summarizing a collection involves:
  1. Grouping the mementos by their commonalities
  2. Selecting the highest quality mementos from each group
• Different semantic categories may require different algorithms
• We want to reuse existing tools where possible:
  • Stanford NLP
  • Archives Unleashed Toolkit
  • gensim
  • spaCy
RQ1 Evaluation
1. How many user tasks were addressed by the mementos chosen? How many user tasks failed?
2. How many mementos produced are not useful for any user task?
3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type?
4. What is the mean minimum number of mementos necessary to address the most user tasks?
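The first, second, and fourth questions above can be expressed as simple set computations once each selected memento has been judged against the user tasks. The sketch below is one possible bookkeeping scheme, not the study's protocol: the task identifiers, the `addressed_by` mapping, and the greedy cover (an approximation of the true minimum, not an exact answer) are all assumptions.

```python
def evaluate_selection(tasks, addressed_by):
    """Summarize how well a set of selected mementos covers the user tasks.

    tasks: iterable of task identifiers.
    addressed_by: dict mapping a memento to the set of tasks it helps answer.
    """
    tasks = set(tasks)
    addressed = set().union(*addressed_by.values()) & tasks
    unused = sorted(m for m, helped in addressed_by.items() if not helped & tasks)
    return {
        "tasks_addressed": addressed,
        "tasks_failed": tasks - addressed,
        "mementos_not_useful": unused,
    }


def greedy_cover(tasks, addressed_by):
    """Pick mementos one at a time, each covering the most still-open tasks."""
    remaining = set(tasks) & set().union(*addressed_by.values())
    chosen = []
    while remaining:
        best = max(sorted(addressed_by),
                   key=lambda m: len(addressed_by[m] & remaining))
        chosen.append(best)
        remaining -= addressed_by[best]
    return chosen


# Hypothetical judgments for four selected mementos and three tasks.
tasks = {"who", "where", "when"}
addressed_by = {
    "m1": {"who", "where"},
    "m2": {"where"},
    "m3": set(),
    "m4": {"who"},
}
report = evaluate_selection(tasks, addressed_by)
cover = greedy_cover(tasks, addressed_by)
print(report, cover)
```

Averaging these counts over several collections of the same semantic category would give the mean values the third and fourth questions ask about.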
RQ2: What visualizations (surrogates) work best for understanding individual mementos?
• There are many different possibilities for surrogates
• Does the choice in surrogate change depending on the collection’s semantic category?
RQ2 Evaluation
1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better?
2. Do different surrogates work better for different semantic categories?
3. For social cards, which elements of the social card need to be present to understand the underlying memento?
4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding?
RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding?
• Once we have:
  • Candidate summarization algorithms
  • Evaluated surrogates for individual mementos
• We can then evaluate the combination of summarization and visualization.
• There are many options:
  • arranging surrogates
  • headings
  • metadata
RQ1: Summarization Algorithms
RQ2: Visualization Elements
RQ3: Visualization of Summary
RQ3 Evaluation
1. How many user tasks are addressed by the visualization chosen? How many fail?
2. How many visualized mementos were not needed for any given user task?
3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization?
4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task?
5. Which visualizations work better for certain semantic types?
Research Plan
Timeline: 03/2017 to 05/2020
• Preliminary work
• Implement a flexible framework
• Addressing RQ1: Develop new algorithms for selecting representative mementos
• Addressing RQ2: Evaluation of individual memento visualizations
• Dissertation Candidacy Exam
• Addressing RQ1: Evaluation of algorithms for selecting representative mementos
• Addressing RQ3: Develop candidate visualizations of groups of mementos
• Addressing RQ3: Evaluation of visualization of groups of mementos
• Dissertation Composition
• Dissertation Defense
Venues shown on the timeline: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, JCDL 2020, JCDL 2021
Conclusion
Summary
• Collection understanding is a problem with web archive collections
  • Inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • Costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
Contributions
• Existing work:
  • Semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
• Future work:
  • Investigate new ways of surfacing representative mementos
  • Contribute knowledge of collection understanding in web archive collections
  • Which visualization methods work best for understanding mementos in a collection
  • New algorithms for use in collection understanding
OCLC Research1.1K views
Broad Data (India 2015) by James Hendler
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)
James Hendler2.3K views
Linked Data and Locah, UKSG2011 by Jane Stevenson
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
Jane Stevenson1K views

More from Shawn Jones

Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea... by
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Shawn Jones
115 views12 slides
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea... by
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Shawn Jones
10 views12 slides
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G... by
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...Shawn Jones
321 views19 slides
Automatically Selecting Striking Images for Social Cards by
Automatically Selecting Striking Images for Social CardsAutomatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social CardsShawn Jones
151 views20 slides
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration) by
SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)Shawn Jones
259 views13 slides
Reference Rot by
Reference RotReference Rot
Reference RotShawn Jones
243 views44 slides

More from Shawn Jones(10)

Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea... by Shawn Jones
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Shawn Jones115 views
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea... by Shawn Jones
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Shawn Jones10 views
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G... by Shawn Jones
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
Shawn Jones321 views
Automatically Selecting Striking Images for Social Cards by Shawn Jones
Automatically Selecting Striking Images for Social CardsAutomatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social Cards
Shawn Jones151 views
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration) by Shawn Jones
SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)
Shawn Jones259 views
Avoiding Spoilers On MediaWiki Fan Sites Using Memento by Shawn Jones
Avoiding Spoilers On MediaWiki Fan Sites Using MementoAvoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Shawn Jones2.8K views
Continuous Integration: Finding problems soonest by Shawn Jones
Continuous Integration: Finding problems soonestContinuous Integration: Finding problems soonest
Continuous Integration: Finding problems soonest
Shawn Jones1.2K views
A Brief Introduction to Test-Driven Development by Shawn Jones
A Brief Introduction to Test-Driven DevelopmentA Brief Introduction to Test-Driven Development
A Brief Introduction to Test-Driven Development
Shawn Jones1.3K views
Reconstructing the past with media wiki by Shawn Jones
Reconstructing the past with media wikiReconstructing the past with media wiki
Reconstructing the past with media wiki
Shawn Jones2.2K views

Recently uploaded

WITS Deck by
WITS DeckWITS Deck
WITS DeckW.I.T.S.
27 views22 slides
40th TWNIC Open Policy Meeting: A quick look at QUIC by
40th TWNIC Open Policy Meeting: A quick look at QUIC40th TWNIC Open Policy Meeting: A quick look at QUIC
40th TWNIC Open Policy Meeting: A quick look at QUICAPNIC
73 views20 slides
ATPMOUSE_융합2조.pptx by
ATPMOUSE_융합2조.pptxATPMOUSE_융합2조.pptx
ATPMOUSE_융합2조.pptxkts120898
35 views70 slides
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx by
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptxCracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptxLeasedLinesQuote
5 views8 slides
40th TWNIC Open Policy Meeting: APNIC PDP update by
40th TWNIC Open Policy Meeting: APNIC PDP update40th TWNIC Open Policy Meeting: APNIC PDP update
40th TWNIC Open Policy Meeting: APNIC PDP updateAPNIC
69 views20 slides
cis5-Project-11a-Harry Lai by
cis5-Project-11a-Harry Laicis5-Project-11a-Harry Lai
cis5-Project-11a-Harry Laiharrylai126
9 views11 slides

Recently uploaded(13)

WITS Deck by W.I.T.S.
WITS DeckWITS Deck
WITS Deck
W.I.T.S.27 views
40th TWNIC Open Policy Meeting: A quick look at QUIC by APNIC
40th TWNIC Open Policy Meeting: A quick look at QUIC40th TWNIC Open Policy Meeting: A quick look at QUIC
40th TWNIC Open Policy Meeting: A quick look at QUIC
APNIC73 views
ATPMOUSE_융합2조.pptx by kts120898
ATPMOUSE_융합2조.pptxATPMOUSE_융합2조.pptx
ATPMOUSE_융합2조.pptx
kts12089835 views
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx by LeasedLinesQuote
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptxCracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx
40th TWNIC Open Policy Meeting: APNIC PDP update by APNIC
40th TWNIC Open Policy Meeting: APNIC PDP update40th TWNIC Open Policy Meeting: APNIC PDP update
40th TWNIC Open Policy Meeting: APNIC PDP update
APNIC69 views
cis5-Project-11a-Harry Lai by harrylai126
cis5-Project-11a-Harry Laicis5-Project-11a-Harry Lai
cis5-Project-11a-Harry Lai
harrylai1269 views
Penetration Testing for Cybersecurity Professionals by 211 Check
Penetration Testing for Cybersecurity ProfessionalsPenetration Testing for Cybersecurity Professionals
Penetration Testing for Cybersecurity Professionals
211 Check40 views
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download by APNIC
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download
APNIC75 views
The Dark Web : Hidden Services by Anshu Singh
The Dark Web : Hidden ServicesThe Dark Web : Hidden Services
The Dark Web : Hidden Services
Anshu Singh19 views

Improving Collection Understanding in Web Archives

  • 1. @shawnmjones @WebSciDL Improving Collection Understanding in Web Archives Shawn M. Jones Web Science and Digital Libraries Research Group Advisors: Michael L. Nelson and Michele C. Weigle Thanks to:
  • 2. @shawnmjones @WebSciDL Researchers Create Their Own Web Archive Collections Archived web pages, or mementos, are used by journalists, sociologists, and historians. Tucson Shootings; 2008 Olympics; University of Utah
  • 3. @shawnmjones @WebSciDL Web Archive Collections Have Many Versions of the Same Page 3 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015
  • 4. @shawnmjones @WebSciDL Different Versions Allow Us to See an Unfolding News Story Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown. Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 11, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
  • 5. @shawnmjones @WebSciDL Different Versions Allow Us To See Changes In An Organization’s Web Presence 5 The White House: 2016 The White House: 2018
  • 6. @shawnmjones @WebSciDL Archive-It Provides For Easy Collection Creation Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 6
  • 7. @shawnmjones @WebSciDL The Problem of Collection Understanding What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use? 7
  • 8. @shawnmjones @WebSciDL 8 31 Archive-It collections match the search query “human rights” How are they different from each other? Which one is best for my needs?
  • 9. @shawnmjones @WebSciDL Archive-It provides fields for metadata 9 Collection wide Metadata Metadata on Individual Seeds Dublin Core + Custom Fields
  • 11. @shawnmjones @WebSciDL But, alas, the metadata does not help. Because metadata is optional, it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation • it is inconsistently applied This means that a user cannot reliably compare metadata fields to understand the differences between collections. 132,599 seeds, no metadata; 9 seeds, with metadata. Paradox of metadata: More seeds = more effort
  • 12. @shawnmjones @WebSciDL Reviewing mementos manually is costly. This collection has 132,599 seeds, many with multiple mementos. Some collections have 1000s of seeds. Each seed can have many mementos. In some cases, this can require reviewing 100,000+ documents to understand the collection.
  • 13. @shawnmjones @WebSciDL More Archive-It collections are added every year. More than 8000 collections exist as of the end of 2016.
  • 15. @shawnmjones @WebSciDL The problem, summarized • There are multiple collections about the same concept. • The metadata for each collection is non-existent, or inconsistently applied. • Many collections have 1000s of seeds with multiple mementos. • There are more than 8000 collections. • Human review of these mementos for collection understanding is an expensive proposition.
  • 16. @shawnmjones @WebSciDL The proposal: a visualization made of representative mementos • Our visualization is a summary that will act like an abstract • Pirolli and Card’s Information Foraging Theory: • maximize the value of the information gained from our summaries • minimize the cost of interacting with the collection • ensure that our representative mementos have good information scent • contain cues that the memento will address a user’s needs From this: 318 seeds with 2421 mementos To something like this: a visualization of ~28 social cards Peter Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  • 18. @shawnmjones @WebSciDL Looking at Archive-It collections from the outside • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds • Each seed has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento-datetimes 18 Archive–It Collections
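The TimeMap described on slide 18 can be illustrated with a small parser. This is a minimal sketch, assuming the TimeMap is serialized in the link format used by the Memento protocol; the function name `parse_timemap`, the regular expressions, and the `archive.example.org` URIs are illustrative assumptions and not part of the presentation or of any Archive-It tool.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry looks like: <URI>; rel="..."; datetime="..."
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a sorted list of (memento-datetime, URI-M) pairs from a link-format TimeMap."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # rel may hold several space-separated values, e.g. "first memento"
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


# Hypothetical TimeMap for one seed (URIs are made up for illustration)
SAMPLE = '''<http://example.com/>; rel="original",
<http://archive.example.org/timemap/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example.org/20130419175900/http://example.com/>; rel="last memento"; datetime="Fri, 19 Apr 2013 17:59:00 GMT",
<http://archive.example.org/20130419171200/http://example.com/>; rel="first memento"; datetime="Fri, 19 Apr 2013 17:12:00 GMT"'''

for when, uri in parse_timemap(SAMPLE):
    print(when.isoformat(), uri)
```

Sorting by memento-datetime gives the capture order for a seed, which is the information needed to count how many mementos each seed has and when they were captured.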
  • 19. @shawnmjones @WebSciDL Document collections have aspects • Metadata on a publication: • used as a surrogate for understanding • answers anticipated questions • Aspects: • The central concepts of the corpus • For example: aspects about a disaster • time • place • cause • countermeasures • Aspects correspond to the questions that a user might have about a collection Archive–It Collections Summarize with Aspects Renxian Zhang, Wenjie Li, and Dehong Gao. 2012. Generating Coherent Summaries with Textual Aspects. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI’12), 1727–1733.
  • 20. @shawnmjones @WebSciDL How can we surface aspects? • Named Entity Recognition can answer questions of who or where? • Natural Language Processing can answer questions of what time period? • Topic modeling can surface general concepts from the corpus • And we have to be cognizant of these concepts over time Archive-It Collection 8121: “The Obama White House” Archive-It Collection 8513: “Donald J Trump White House” Archive–It Collections Summarize with Aspects
  • 21. @shawnmjones @WebSciDL Visualizing web resources (surrogates) Thumbnail (example from UK Web Archive); Text snippet (example from Bing); Social Card (example from Facebook); Text + Thumbnail (example from Internet Archive) Visualize Mementos Archive–It Collections Summarize with Aspects
  • 23. @shawnmjones @WebSciDL Which surrogate is best for web resources? Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding. Li (2008): social cards > text snippets in performance. Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance. Woodruff (2001): thumbnails > text snippets in performance. Teevan (2009): text snippets > thumbnails in performance. Aula (2010): text snippets ~= thumbnails in performance. Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference. Capra (2013): social cards > text snippets in performance (barely statistically significant). Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance. Visualize Mementos Archive–It Collections Summarize with Aspects https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html
  • 24. @shawnmjones @WebSciDL Visualizing Archive-It Collections Other attempts at visualizing Archive-It collections tried to visualize everything. Visualize Mementos Archive–It Collections Summarize with Aspects Visualize Summary Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. 2012. Visualizing digital collections at Archive-It. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’12), 15–18. DOI:10.1145/2232817.2232821 http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 27. @shawnmjones @WebSciDL Prior work by AlNoamany • Visualized summaries via the storytelling platform Storify, which is no longer in service • Showed that test participants could not detect the difference between her automated summaries and human-generated summaries • Did not evaluate if the resulting summaries were effective tools for collection understanding • Focused on summarizing collections about events • There are other types of Archive-It collections Characteristics of human-generated stories; Characteristics of Archive-It collections; Exclude duplicates; Exclude off-topic pages; Exclude non-English language; Dynamically slice the collection; Cluster the pages in each slice; Select high-quality pages from each cluster; Order pages by time; Visualize. Visualize Mementos Archive–It Collections Summarize with Aspects Visualize Summary Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508 http://ws-dl.blogspot.com/2016/09/2016-09-20-promising-scene-at-end-of.html http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html
  • 29. @shawnmjones @WebSciDL Growth curves for understanding collection creation behavior 29 Archive–It Collections • Skew of the collection’s holdings • Indicates temporality of collection • Skew of the curatorial involvement with the collection • When seeds were added • When interest was lost or regained (Positive) (Positive) (Negative) (Negative) Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
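The growth-curve idea on slide 29 can be sketched numerically. This is a minimal sketch under stated assumptions: the curve is taken to plot the cumulative fraction of items (seeds or seed mementos) against the elapsed fraction of the collection's lifespan, timestamps are plain numbers such as epoch seconds, and the area is estimated with the trapezoid rule. The function names are hypothetical, and the published study may define the curve and its area differently.

```python
def growth_curve(timestamps):
    """Return (x, y) points: x is the elapsed fraction of the lifespan,
    y is the cumulative fraction of items seen so far.
    Assumes a non-empty list of numeric timestamps."""
    ts = sorted(timestamps)
    start, end = ts[0], ts[-1]
    span = (end - start) or 1  # avoid dividing by zero if all items share one timestamp
    n = len(ts)
    points = [(0.0, 0.0)]
    for i, t in enumerate(ts, start=1):
        points.append(((t - start) / span, i / n))
    return points


def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_minus_diagonal(timestamps):
    """Difference between the growth curve's area and the diagonal's area (0.5)."""
    return auc(growth_curve(timestamps)) - 0.5


# Items added mostly at the start of the lifespan versus mostly at the end
early = [0, 0, 0, 10]
late = [0, 10, 10, 10]
print(auc_minus_diagonal(early), auc_minus_diagonal(late))
```

Under these assumptions, a curve above the diagonal (positive difference) means most items arrived early in the collection's life, and a curve below it means they arrived late. The difference between the seed memento curve and the seed curve could be computed as `auc(growth_curve(memento_times)) - auc(growth_curve(seed_times))`.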
  • 30. @shawnmjones @WebSciDL Structural features of Archive-It collections • difference between seed curve AUC and diagonal • difference between seed memento curve AUC and diagonal • difference between seed memento curve AUC and seed curve AUC • number of seeds • number of mementos • seed URI domain diversity • seed URI path depth diversity • most frequent seed URI path depth • % query string usage in seed URIs • lifespan of collection Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
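The URI-based features on slide 30 can be sketched with the Python standard library. This is an illustration only: "diversity" is assumed here to mean the number of unique values divided by the number of seeds, and path depth is assumed to be the count of non-empty path segments. Both definitions, and the function names, are assumptions that may differ from the ones used in the cited study.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments, e.g. /a/b -> 2 and / -> 0."""
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


def seed_features(seed_uris):
    """Compute simple structural features from a non-empty list of seed URIs."""
    n = len(seed_uris)
    domains = [urlsplit(u).hostname for u in seed_uris]
    depths = [path_depth(u) for u in seed_uris]
    with_query = sum(1 for u in seed_uris if urlsplit(u).query)
    return {
        "number_of_seeds": n,
        # Assumed definition of diversity: unique values / total seeds
        "domain_diversity": len(set(domains)) / n,
        "path_depth_diversity": len(set(depths)) / n,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
        "pct_query_string": 100.0 * with_query / n,
    }


# Hypothetical seed list for illustration
seeds = [
    "http://example.com/",
    "http://example.com/a/b",
    "http://example.org/a/b?x=1",
    "http://example.net/a/b",
]
print(seed_features(seeds))
```

Features like these depend only on the seed list, so they can be computed without downloading any memento content.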
  • 35. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections 35 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. In a study of 3,382 Archive-It collections
  • 37. @shawnmjones @WebSciDL Semantic categories of Archive-It collections Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany Using the structural features on the previous slide, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720 37 Archive–It Collections Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson. 2018. The Many Shapes of Archive-It: Characteristics of Archive-It. In International Conference on Digital Preservation (iPRES) 2018.
  • 39. @shawnmjones @WebSciDL Developing a Flexible Framework Dark and Stormy Archives (DSA) 2.0: a framework based on AlNoamany’s work. Diagram labels: Web Archive Collection; Off-Topic Memento Toolkit; Representative Memento Selection Utilities; Archive-It Utilities; MementoEmbed; DSA Visualization Interface; Visualized Summary. Two concepts are embodied in this framework: 1. Selecting representative mementos 2. Visualizing those mementos Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018.
  • 40. @shawnmjones @WebSciDL Not just Archive-It 40 Our methods will be applicable to any web archive collection, like those developed by Rhizome’s Webrecorder.
  • 42. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 42
  • 43. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 2. Develop user tasks for each collection 43 Who is X? Where is Y? When does Z take place?
  • 44. @shawnmjones @WebSciDL Evaluation 1. Choose target collections for study 2. Develop user tasks for each collection 3. How well do users complete the tasks? 44 Who is X? Where is Y? When does Z take place?
  • 45. @shawnmjones @WebSciDL RQ1: How do we select representative mementos for the different semantic types of collections? • Summarizing a collection involves: 1. Grouping the mementos by their commonalities 2. Selecting the highest quality mementos from each group • Different semantic categories may require different algorithms • We want to reuse existing tools where possible: • Stanford NLP • Archives Unleashed Toolkit • gensim • spaCy Archive–It Collections Summarize with Aspects
  • 46. @shawnmjones @WebSciDL RQ1 Evaluation 1. How many user tasks were addressed by the mementos chosen? How many user tasks failed? 2. How many mementos produced are not useful for any user task? 3. Which algorithm surfaces aspects satisfying the highest mean number of user tasks for a given collection type? 4. What is the mean minimum number of mementos necessary to address the most user tasks? 46 Archive–It Collections Summarize with Aspects
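The counts asked for in the RQ1 evaluation on slide 46 can be sketched as simple set operations. This is a hypothetical illustration: it assumes study data in the form of a mapping from each selected memento to the set of user tasks it helped answer, and it uses a greedy set-cover heuristic as an approximation of the "minimum number of mementos" question. None of these names or data come from the presentation.

```python
def rq1_metrics(tasks, addressed_by):
    """Count tasks addressed, tasks failed, and mementos useful for no task."""
    tasks = set(tasks)
    covered = set()
    for memento_tasks in addressed_by.values():
        covered |= set(memento_tasks) & tasks
    unused = [m for m, t in addressed_by.items() if not set(t) & tasks]
    return {
        "tasks_addressed": len(covered),
        "tasks_failed": len(tasks - covered),
        "unused_mementos": len(unused),
    }


def greedy_cover(tasks, addressed_by):
    """Greedy approximation of the fewest mementos covering the most tasks.
    Set cover is hard in general, so this may not be the true minimum."""
    remaining = set(tasks)
    chosen = []
    while remaining and addressed_by:
        best = max(addressed_by, key=lambda m: len(set(addressed_by[m]) & remaining))
        gain = set(addressed_by[best]) & remaining
        if not gain:
            break  # no memento addresses any of the remaining tasks
        chosen.append(best)
        remaining -= gain
    return chosen


# Hypothetical study data: three tasks, four selected mementos
tasks = {"who", "where", "when"}
addressed_by = {
    "m1": {"who", "where"},
    "m2": {"where"},
    "m3": set(),
    "m4": {"who"},
}
print(rq1_metrics(tasks, addressed_by))
print(greedy_cover(tasks, addressed_by))
```

Averaging these counts over the collections in each semantic category would give per-algorithm means of the kind the evaluation questions describe.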
  • 47. @shawnmjones @WebSciDL RQ2: What visualizations (surrogates) work best for understanding individual mementos? • There are many different possibilities for surrogates • Does the choice of surrogate change depending on the collection’s semantic category? Visualize Mementos Archive–It Collections Summarize with Aspects
  • 48. @shawnmjones @WebSciDL RQ2 Evaluation 1. Does the depth, domain, or category of the URI play a factor in which surrogate performs better? 2. Do different surrogates work better for different semantic categories? 3. For social cards, which elements of the social card need to be present to understand the underlying memento? 4. For thumbnails, what size thumbnail works best for understanding? How much of the web page needs to be rendered for a thumbnail to be useful for understanding? 48 Visualize MementosArchive–It Collections Summarize with Aspects Evaluated via:
  • 49. @shawnmjones @WebSciDL RQ3: How well do visualizations of groups of mementos produced by different summarization algorithms work for collection understanding? • Once we have: • Candidate summarization algorithms • Evaluated surrogates for individual mementos • We can then evaluate the combination of summarization and visualization. • There are many options: • arranging surrogates • headings • metadata RQ1: Summarization Algorithms RQ2: Visualization Elements RQ3: Visualization of Summary Visualize Mementos Archive–It Collections Summarize with Aspects Visualize Summary
  • 50. @shawnmjones @WebSciDL RQ3 Evaluation 1. How many user tasks are addressed by the visualization chosen? How many fail? 2. How many visualized mementos were not needed for any given user task? 3. Given an aspect of the collection, can the user address a user task concerning it by visually scanning the visualization? 4. Given multiple aspects of the collection, can the user successfully compare different individual memento visualizations to address a user task? 5. Which visualizations work better for certain semantic types? 50 Visualize MementosArchive–It Collections Summarize with Aspects Visualize Summary Evaluated via:
  • 51. @shawnmjones @WebSciDL Research Plan Timeline axis: 03/2017, 05/2017, 08/2017, 11/2017, 02/2018, 05/2018, 08/2018, 11/2018, 02/2019, 05/2019, 08/2019, 11/2019, 02/2020, 05/2020. Tasks: Preliminary work; Implement a flexible framework; Addressing RQ1: Develop new algorithms for selecting representative mementos; Addressing RQ2: Evaluation of individual memento visualizations; Dissertation Candidacy Exam; Addressing RQ1: Evaluation of algorithms for selecting representative mementos; Addressing RQ3: Develop candidate visualizations of groups of mementos; Addressing RQ3: Evaluation of visualization of groups of mementos; Dissertation Composition; Dissertation Defense. Venues shown: SIGIR 2020, CHI 2020, iPres 2018, iPres 2019, JCDL 2019, CHI 2020, JCDL 2020, JCDL 2021
  • 53. @shawnmjones @WebSciDL Summary • Collection understanding is a problem with web archive collections • Inconsistent metadata • 1000s of mementos • 1000s of collections • Costly for human review • We intend to produce a visualization that serves as an abstract to assist in collection understanding • Prior work in this area: • did not evaluate how well this method works for collection understanding • only focused on collections about events
  • 54. @shawnmjones @WebSciDL Contributions • Existing work: • Semantic categories of web archive collections in Archive-It • Categories can be predicted by using structural features • Most collections are not about events • Future work: • Investigate new ways of surfacing representative mementos • Contribute knowledge of collection understanding in web archive collections • Which visualization methods work best for understanding mementos in a collection • New algorithms for use in collection understanding