The Many Shapes of Archive-It
Shawn M. Jones Alexander Nwala Michele C. Weigle Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Research Group
@WebSciDL
sjone@cs.odu.edu
@shawnmjones
anwala@cs.odu.edu
@acnwala
mweigle@cs.odu.edu
@weiglemc
mln@cs.odu.edu
@phonedude_mln
Thanks to:
@shawnmjones @WebSciDL
Researchers Create Their Own Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
Web Archive Collections Have Many Versions of the Same Page
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: 2013, 2015, 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015
Different Versions Allow Us to See an Unfolding News Story
Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 21, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks
Different Versions Allow Us To See Changes In An Organization’s Web Presence
The White House: 2016 and The White House: 2018
The Internet Archive created Archive-It so organizations could create their own web archive collections
Curators supply live web resources as seeds and set crawling schedules for those seeds, creating mementos of them at different points in time.
But this is the interface available for browsing those collections…
How do we tell the difference without going through them all?
What types of collections exist?
How can we understand an Archive-It collection?
We Can Understand It Based On Metadata
Collection-wide metadata and metadata on individual seeds
Dublin Core + custom fields
We Can Understand It Based On Metadata,
but the Metadata Does Not Always Help…
132,599 seeds, no metadata
9 seeds, with metadata
Because metadata is optional, it is not always present.
When it is present, metadata on Archive-It collections is:
• generated by many different curators
• from different organizations
• with different content standards
• and different rules of interpretation
It is inconsistently applied!
This means that a user cannot reliably compare metadata
fields to understand the differences between collections.
We Can Understand It Based on Content
• We can use techniques such as text mining and network analysis
The same collection in the Archives Unleashed Cloud: https://archivesunleashed.org
We Can Understand It Based on Content,
but all of that Content Must Be Dereferenced…
There are 486,227 seed mementos that must be downloaded and processed to understand this collection.
Remember:
• Each result is a seed
• Each seed has multiple mementos
These 333 seeds correspond to 278,306 seed mementos. They must be downloaded and processed.
And what if we do not know the language?
English-speaking, non-German speakers can discern:
• About University of Utah
• About shootings in Tucson
• ???
How else can we understand an Archive-It collection?
What kinds of questions can be answered with Structural Features?
• Using only structural features is advantageous because it saves one from having to dereference all of the URIs in a collection.
• These structural features also give us different insight than can be provided by text analysis or metadata.
81,014 seeds, 486,227 seed mementos
Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.
When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected throughout its life.
It has mementos as recent as July 2018.
Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected at the beginning of its life.
Mementos were captured throughout its life.
Was the collection built from web sites belonging to one domain or many?
Examples shown: many domains, one domain
Were most of the web pages in the collection top-level pages or specific articles deeper in a web site?
Examples shown: top-level pages, deeper links
Other questions answered by structural features:
• Was there renewed interest at some point later in the collection’s life?
• Did the curator nurture the selected web pages throughout the collection’s life and add content continuously?
• What time period does the collection span?
• What is the temporal skew of the collection?
• What is the lifetime of the collection?
Can we bridge the structural to the descriptive?
• We can categorize Archive-It’s collections into four main semantic categories.
• We can predict these categories with a Random Forest classifier that uses structural features.
Let’s go over a few things…
Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• In this work, we focus on seeds and seed mementos
TimeMaps from the Memento Protocol
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 04:41:56 GMT"
…
Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times (memento-datetimes).
Labeled in the example: original resource URI, TimeMap URI (URI-T), Memento URI (URI-M), entries for mementos, memento-datetime
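To make the TimeMap structure concrete, here is a minimal sketch of pulling memento-datetimes out of a link-format TimeMap like the one above. It is a simplified regex-based reader, not a full link-format parser: it assumes every parameter value is double-quoted, and the function name `memento_datetimes` is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" parameters.
# Assumption: all parameter values are double-quoted, as in the example above.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def memento_datetimes(timemap_text):
    """Return sorted (memento-datetime, URI-M) pairs from a link-format TimeMap."""
    pairs = []
    for uri, raw_params in LINK_RE.findall(timemap_text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel may hold several space-separated values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            pairs.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(pairs)


EXAMPLE = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT"
'''

mementos = memento_datetimes(EXAMPLE)
for when, uri_m in mementos:
    print(when.isoformat(), uri_m)
```

The original resource, the TimeMap itself, and the TimeGate are skipped because their `rel` values do not include `memento`.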
@shawnmjones @WebSciDL
What other work is related to web collections?
Related Work
Nwala (2018)
Mull (2014)
Wang (2016)
Ogden (2017)
features of digital collections
Fenlon (2017)
selecting seeds for web archive collections
Milligan (2016)
motivations for creating collections
behavior of web archivists
Crook (2009)
Slania (2013)
Deutch (2016)
studies of using Archive-It
capabilities of web archive user interfaces
Niu (2012)
• We focus on web archive collections
• We examine web archive collections after they have been created
• We look to structural features of web archives rather than user studies of live web curation platforms
• We focus on the output of web archivists rather than studying their behavior in real time
• We focus on structural features rather than challenges with using Archive-It as a tool
• We focus on structural features of the archives rather than their user interfaces
Related Work
Sağlam (2014)
Abramson (2012)
AlSum (2014)
metadata standards
EAD
topics in web archive collections
AlNoamany (2016)
classification of URIs
web archive growth analysis
studies of using Archive-It
Dublin Core
AlNoamany (2016)
• We look at the structural features rather than metadata
• We are not looking at the content, but the structural features of collections
• We examine different features of URIs like domain and path depth
• We apply AlSum’s methods to specific collections rather than entire archives
• We look at collections as units rather than analyzing Archive-It as a whole
How did we acquire the data for this study?
Acquiring 9,351 Archive-It collections
We used BeautifulSoup to scrape the web pages of 9,351 Archive-It collections.
From this scraping we discovered:
• If the collection was public or private
• Seed URIs
Using the seed URIs, we discovered TimeMaps listing all seed mementos and their memento-datetimes.
Remove 4,823 private collections
Private collections do not allow access to seeds, seed mementos, or TimeMaps
Remove 440 young collections
Collections younger than a year may still be building, possibly skewing results
Remove empty collections
Empty collections have no data to analyze
Remove 48 collections with errors
Collections with download or processing errors may skew the results
Remove 357 collections with a single memento
Singletons consist of a single seed with a single memento, offering no behavior to study
Remove 21 instantaneous collections
Single-second collections were captured in a single second, offering no behavior over time to study
Remove 32 test collections
Collections clearly marked as test or trial do not represent regular collection behavior
We study the remaining 3,382 collections
This leaves us with 3,382 collections for study, with a total of:
• 700,835 seeds
• 6,943,677 seed mementos
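As a quick consistency check on the filtering steps above, the sketch below adds up the stated removals. The count of empty collections is not given on the slides, so the value computed here is only what the other numbers imply, assuming the filters were applied in sequence without overlap and that no other filter was used.

```python
# Counts stated on the slides.
scraped = 9351
removed = {
    "private": 4823,
    "young": 440,
    "errors": 48,
    "single memento": 357,
    "instantaneous": 21,
    "test": 32,
}
remaining = 3382

stated_removals = sum(removed.values())

# Inferred, not stated in the slides: whatever is left over would have to be
# the empty collections, if the listed filters are complete and disjoint.
implied_empty = scraped - stated_removals - remaining

print("removed by stated filters:", stated_removals)
print("implied empty collections:", implied_empty)
```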
Understanding Collection Growth Through Time
Collections that do not grow are not interesting for us
Growth curves help us understand collection growth, but require normalization for comparison
We want to compare memento count
• “2014 Primaries” has 219,084 mementos
• “The Obama White House” has 140 mementos
• We normalize the number as a percentage
We want to compare seed count
• “Hurricane Sandy” has 174,884 seeds
• “Scottish Politics” has 58 seeds
• We normalize the number as a percentage
We want to compare time
• “Indiana: State and Local Documents” spans 2005 – 2018
• “Japan: Election 2016 House of Councilors” spans less than 2 days in July 2016
• We normalize time as a percentage of the lifespan of the collection, from the first memento-datetime to the last
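The normalization described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the `{seed URI: [memento-datetimes]}` input shape are assumptions. Following the backup slides, the seed curve is built from the first memento of each seed and the seed memento curve from all mementos.

```python
from datetime import datetime, timedelta


def growth_curve(event_times, start, end):
    """Return (x, y) points of a normalized growth curve.

    x is the fraction of the collection's lifespan elapsed (0.0 to 1.0);
    y is the fraction of all events that have occurred by that time.
    """
    span = (end - start).total_seconds()
    if span <= 0:
        raise ValueError("lifespan must be positive to normalize time")
    ordered = sorted(event_times)
    total = len(ordered)
    return [((t - start).total_seconds() / span, (i + 1) / total)
            for i, t in enumerate(ordered)]


def collection_curves(timemaps):
    """Build (seed_curve, memento_curve) from {seed URI: [memento-datetimes]}."""
    all_mementos = [dt for dts in timemaps.values() for dt in dts]
    # Lifespan runs from the first memento-datetime to the last.
    start, end = min(all_mementos), max(all_mementos)
    # A seed is counted at the memento-datetime of its first memento.
    first_captures = [min(dts) for dts in timemaps.values() if dts]
    return (growth_curve(first_captures, start, end),
            growth_curve(all_mementos, start, end))


# Hypothetical two-seed collection spanning ten days.
base = datetime(2016, 7, 1)
timemaps = {
    "http://a.example.org/": [base, base + timedelta(days=5), base + timedelta(days=10)],
    "http://b.example.org/": [base + timedelta(days=5)],
}
seed_curve, memento_curve = collection_curves(timemaps)
print(seed_curve)
print(memento_curve)
```

Because both axes are fractions, a collection with 58 seeds and one with 174,884 seeds produce curves on the same unit square.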
Once normalized, we can compare behavior in the seed growth…
• Skew of the curator’s involvement with the collection
• When seeds were added
• When interest was lost or regained
Examples shown: seeds added all up front; seeds added early, but not all up front
And, we can compare behavior in the memento growth…
• Built from all mementos in the collection’s TimeMaps
• Skew of the collection’s holdings
• Indicates temporality of collection
Examples shown: mementos crawled all along; mementos crawled later
We can classify different behaviors of growth curves
• Using two features:
  • Area under the seed curve (AUCseed)
  • Area under the seed memento curve (AUCsmem)
• We can classify a collection’s growth curve into 9 categories
  • If AUC > 0.55, then those points occur early
  • If AUC < 0.45, then those points occur late
  • If 0.55 > AUC > 0.45, then those points occur continuously
Seeds: early (AUCseed > 0.55), continuously (0.55 > AUCseed > 0.45), late (AUCseed < 0.45)
Seed mementos: early (AUCsmem > 0.55), continuously (0.55 > AUCsmem > 0.45), late (AUCsmem < 0.45)
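A minimal sketch of the classification rule above, given normalized (x, y) curve points. Two things here are assumptions: the area is computed with the trapezoidal rule (the slides do not say how), and values exactly equal to 0.45 or 0.55 are treated as continuous. Function names are hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points on the unit square."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def classify_auc(auc, early=0.55, late=0.45):
    """Map an AUC to 'Early', 'Late', or 'Continuously' using the slide thresholds."""
    if auc > early:
        return "Early"
    if auc < late:
        return "Late"
    # Assumption: the boundary values themselves count as continuous.
    return "Continuously"


def growth_category(seed_points, smem_points):
    """Return one of the 9 categories, e.g. 'Seeds Early, Seed Mementos Late'."""
    seed_label = classify_auc(area_under_curve(seed_points))
    smem_label = classify_auc(area_under_curve(smem_points))
    return f"Seeds {seed_label}, Seed Mementos {smem_label}"


# Hypothetical curves: seeds all added at the start, mementos mostly at the end.
seeds = [(0.0, 0.25), (0.0, 0.5), (0.0, 0.75), (1.0, 1.0)]
mementos = [(0.0, 0.25), (1.0, 0.5), (1.0, 0.75), (1.0, 1.0)]
print(growth_category(seeds, mementos))
```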
Seeds Early
The curators added most of the seeds at the beginning of the collection’s life and then crawled them on different schedules.
Seeds Continuously
The curators keep adding new seeds to these collections throughout each collection’s life.
Seeds Late
There was renewed interest in adding seeds at some point in these collections’ lives.
From These Growth Curves we have some simple Structural Features
• Number of Seeds
• Number of Seed Mementos
• Collection Lifespan
  • Time between first and last memento
We also have complex Growth Curve Features: Difference of Seed Curve AUC and Diagonal
Subtracting the AUC of the diagonal from the AUC of the seed curve:
• We can more easily see if the seed curve is early or late
• Early is positive
• Late is negative
• “Close” to 0 means continuous
More complex Growth Curve Features: Difference of Seed Memento Curve AUC and Diagonal
Subtracting the AUC of the diagonal from the AUC of the seed memento curve:
• We can more easily see if the seed memento curve is early or late
• Early is positive
• Late is negative
• “Close” to 0 means continuous
More complex Growth Curve Features: Difference of Seed Curve AUC and Seed Memento Curve AUC
The difference between the seed curve AUC and the seed memento curve AUC indicates how close the two are.
• A value of 0 means that the two overlap, likely meaning that there is one memento per seed.
• A positive value means that the seeds are added earlier than the seed mementos.
• A negative value means that the seed memento growth has overtaken the seed growth.
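The three difference features can be written out in a few lines. This is a sketch with hypothetical names; the only fixed fact it relies on is that the diagonal of the unit square has an area of 0.5 beneath it.

```python
# Area under the diagonal y = x on the unit square.
DIAGONAL_AUC = 0.5


def growth_curve_features(auc_seed, auc_smem):
    """Return the three AUC difference features (dictionary keys are hypothetical)."""
    return {
        # Positive: seeds skew early. Negative: seeds skew late.
        "seed_minus_diagonal": auc_seed - DIAGONAL_AUC,
        # Positive: seed mementos skew early. Negative: they skew late.
        "smem_minus_diagonal": auc_smem - DIAGONAL_AUC,
        # Positive: seeds were added earlier than the seed mementos.
        "seed_minus_smem": auc_seed - auc_smem,
    }


features = growth_curve_features(auc_seed=0.75, auc_smem=0.25)
print(features)
```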
What About Structural Features of Seeds?
Seed URI domain diversity
Domain diversity: 0 (duplicate cnn.com hosts)
• http://www.cnn.com/path/to/story0
• http://news.cnn.com/path/to/story1
• http://top.cnn.com/path/to/story2
Domain diversity: 1 (no duplicate domains)
• http://www.cnn.com/path/to/story0
• http://www.vox.com/path/to/story
• http://www.foxnews.com/path/to/story
Domain diversity: 0.5 (1 duplicate cnn.com host)
• http://www.cnn.com/path/to/story0
• http://www.cnn.com/path/to/story1
• http://www.vox.com/path/to/story
U = number of unique domains
C = number of seeds
D = diversity
D’ = normalized diversity
* Now known as the WSDL Diversity Index
Observation: Some collections only archive a single domain while others have more variety.
Source: Alexander Nwala. (2018 May) An Exploration of URL Diversity Measures. Web Science and Digital Libraries Research Group Blog.
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
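The formula itself did not survive on this slide, so the sketch below uses a normalized form, (U - 1) / (C - 1), chosen because it reproduces all three examples above; treat it as an assumption rather than the definitive index. The domain extraction is also a naive stand-in that keeps the last two host labels and would mishandle suffixes such as `.co.uk`.

```python
from urllib.parse import urlsplit


def naive_domain(uri):
    """Keep the last two host labels, e.g. news.cnn.com -> cnn.com (naive assumption)."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def normalized_diversity(values):
    """(U - 1) / (C - 1), with U unique values among C items (assumed formula)."""
    values = list(values)
    if len(values) < 2:
        # Assumption: a single item has no diversity to measure.
        return 0.0
    return (len(set(values)) - 1) / (len(values) - 1)


def domain_diversity(seed_uris):
    return normalized_diversity(naive_domain(u) for u in seed_uris)


same_domain = ["http://www.cnn.com/path/to/story0",
               "http://news.cnn.com/path/to/story1",
               "http://top.cnn.com/path/to/story2"]
all_different = ["http://www.cnn.com/path/to/story0",
                 "http://www.vox.com/path/to/story",
                 "http://www.foxnews.com/path/to/story"]
one_duplicate = ["http://www.cnn.com/path/to/story0",
                 "http://www.cnn.com/path/to/story1",
                 "http://www.vox.com/path/to/story"]

for uris in (same_domain, all_different, one_duplicate):
    print(domain_diversity(uris))
```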
@shawnmjones @WebSciDL
Path Depth
• Path depth measures how many items exist in a URI’s path
• Based on McCown’s work, we also add 1 for any path containing a query string:
Example URI                                                           Path Depth
http://example.com/                                                   0
http://example.com/directory                                          1
http://example.com/dir1/dir2/dir3/dir4                                4
http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3           3
Observation: Top-level pages tend to have more general information whereas deeper pages tend to have a more specific focus.
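The path depth rule can be sketched with the standard library. The function name is hypothetical, and counting non-empty path segments is an assumption that happens to agree with the four example rows in the table.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Count non-empty path segments, plus 1 if the URI has a query string."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    if parts.query:
        depth += 1
    return depth


examples = [
    "http://example.com/",
    "http://example.com/directory",
    "http://example.com/dir1/dir2/dir3/dir4",
    "http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3",
]
for uri in examples:
    print(path_depth(uri), uri)
```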
@shawnmjones @WebSciDL
Seed URI Path Depth Diversity
Path depth diversity: 0 (all path depths are 3)
• http://www.cnn.com/path/to/story0
• http://news.vox.com/path/to/story1
• http://top.cnn.com/path/to/story2
Path depth diversity: 1 (all completely different path depths)
• http://www.cnn.com/
• http://news.vox.com/path/
• http://top.cnn.com/path/to/story
Path depth diversity: 0.5 (1 path depth of 0, 2 with depth of 3)
• http://www.cnn.com/
• http://news.vox.com/path/to/story1
• http://top.cnn.com/path/to/story2
Observation: Some collections only have seeds at the top level where others only link to deeper articles.
We reuse the WSDL Diversity Index, but this time apply it to path depth.
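Applying the same index to path depth might look like the sketch below. As before, the normalized form (U - 1) / (C - 1) is an assumption that matches the three examples, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Count non-empty path segments, plus 1 if the URI has a query string."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    return depth + 1 if parts.query else depth


def path_depth_diversity(seed_uris):
    """(U - 1) / (C - 1) over path depths (assumed formula)."""
    depths = [path_depth(u) for u in seed_uris]
    if len(depths) < 2:
        # Assumption: a single seed has no diversity to measure.
        return 0.0
    return (len(set(depths)) - 1) / (len(depths) - 1)


all_depth_three = ["http://www.cnn.com/path/to/story0",
                   "http://news.vox.com/path/to/story1",
                   "http://top.cnn.com/path/to/story2"]
all_different = ["http://www.cnn.com/",
                 "http://news.vox.com/path/",
                 "http://top.cnn.com/path/to/story"]
mixed = ["http://www.cnn.com/",
         "http://news.vox.com/path/to/story1",
         "http://top.cnn.com/path/to/story2"]

for uris in (all_depth_three, all_different, mixed):
    print(path_depth_diversity(uris))
```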
@shawnmjones @WebSciDL
Other Seed Features
• Most Frequent Path Depth
  • The path depth that appears most often in the seed URIs
  • Observation: For some collections, most seeds exist at the top level while others link to deeper articles.
• % Query String Usage
  • The percentage of seed URIs that contain a query string
  • Observation: Some collections have many URIs with query strings, while others have none.
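Both of these features are short computations over the seed URIs. The sketch below uses hypothetical function names; how ties for the most frequent depth are broken is not stated on the slides, so this version simply returns the depth that was seen first.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Count non-empty path segments, plus 1 if the URI has a query string."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    return depth + 1 if parts.query else depth


def most_frequent_path_depth(seed_uris):
    """Return the path depth that occurs most often among the seeds."""
    counts = Counter(path_depth(u) for u in seed_uris)
    # Assumption: on a tie, the depth encountered first wins.
    return counts.most_common(1)[0][0]


def query_string_usage(seed_uris):
    """Return the percentage of seed URIs that carry a query string."""
    uris = list(seed_uris)
    with_query = sum(1 for u in uris if urlsplit(u).query)
    return 100.0 * with_query / len(uris)


# Hypothetical seed list.
seeds = [
    "http://example.com/",
    "http://example.com/a",
    "http://example.com/b?x=1",
    "http://example.com/c",
]
print(most_frequent_path_depth(seeds), query_string_usage(seeds))
```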
Mapping the structural to the descriptive is hard…
At first, we tried to map the structural features to metadata directly…
• We tried using machine learning to predict the topics found in the metadata of a collection
• There are problems with this approach:
  • Not all collections have topics.
  • Many collections have multiple topics.
  • Many collections have user-supplied topics.
Instead, we established semantic categories of Archive-It collections
• We reviewed the descriptions of 3,382 Archive-It collections
• Based on their metadata and seeds, we placed them into 4 semantic categories
Self-Archiving dominates Archive-It
Self-Archiving                54.1%
Subject-based                 27.6%
Time Bounded – Expected       14.1%
Time Bounded – Spontaneous     4.2%
We can predict the semantic category with structural features, without processing the page content
We found that a Random Forest classifier was best able to predict the semantic category using a collection’s structural features.
The Random Forest classifier works best with collections in the Self-Archiving category.
Tables shown: results for different machine learning algorithms; Random Forest results by semantic category
We optimized our prediction
Using Kendall’s tau, we were able to determine which features had a strong correlation with the semantic category.
Removing the “number of mementos” feature improved F1 scores for all categories except Self-Archiving.
Tables shown: original results; results with the feature removed
Where do we go from here?
78
@shawnmjones @WebSciDL
Future Work
• We will adapt these structural features for our collection summarization work
  • The skew of growth curves may affect which mementos are chosen for review
  • The seed analysis features will help us better choose seeds to be included
  • We can incorporate this classifier to tailor summarization algorithms to specific semantic categories
• We intend to work further with Archive-It to make metadata and other data more accessible so that screen-scraping is not necessary
Conclusion
We adapted Growth Curves for collections
We can normalize & visualize curator engagement with the collection
We introduced Seed Features
• Seed features also help us understand the curation strategy of a collection
  • Are most of the seeds from the same domain?
  • Are most of the seeds top-level pages or deeper pages?
We bridged the structural to the descriptive
Results of Random Forest Classifier
We can understand web archive collections using only structural features
Thanks to:
Metadata scraping code available: https://github.com/oduwsdl/archiveit_utilities
@shawnmjones @WebSciDL
Backup Slides
Growth curves allow us to understand collection curation behavior
Seed memento growth curve:
• Built from all mementos in the collection’s TimeMaps
• Skew of the collection’s holdings
• Indicates temporality of collection
Seed growth curve:
• Built from the first memento for each seed in the collection’s TimeMaps
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
Seeds Early, Seed Mementos Early
AUCseed > 0.55
AUCsmem > 0.55
• Most curatorial decisions were made early in this collection’s life
• Most crawling was done early in its life
• The temporalness of these collections skews early
Seeds Early, Seed Mementos Continuously
AUCseed > 0.55
0.55 > AUCsmem > 0.45
• Most curatorial decisions were made early in this collection’s life
• Seed mementos were added continuously
• The temporalness of these collections spreads throughout their lives
Seeds Early, Seed Mementos Late
AUCseed > 0.55
AUCsmem < 0.45
• Most curatorial decisions were made early in this collection’s life
• Seed mementos were added later
• The temporalness of these collections skews more recent
Seeds Continuously, Seed Mementos Early
0.55 > AUCseed > 0.45
AUCsmem > 0.55
• Seeds are added throughout a collection’s life.
• Seed mementos were added earlier.
• This means that most of the content of the collection comes from earlier in its life.
Seeds Continuously, Seed Mementos Continuously
0.55 > AUCseed > 0.45
0.55 > AUCsmem > 0.45
• Seeds are added throughout and their seed mementos are collected continuously.
• These collections have a lot of curatorial involvement throughout their life.
• Their contents are spread throughout their life.
Seeds Continuously, Seed Mementos Late
0.55 > AUCseed > 0.45
AUCsmem < 0.45
• Seeds are added throughout, but the collection is built from mementos that were collected later.
Seeds Late, Seed Mementos Early
AUCseed < 0.45
AUCsmem > 0.55
• Most curatorial decisions were made later in this collection’s life.
• But most of the mementos were added earlier.
• The temporalness of the collection skews earlier.
• Most of the mementos belong to these early seeds.
Seeds Late, Seed Mementos Continuously
AUCseed < 0.45
0.55 > AUCsmem > 0.45
• The collection’s contents are spread throughout its life, but many seeds were added later.
• This means that some of those early seeds have more mementos.
Seeds Late, Seed Mementos Late
AUCseed < 0.45
AUCsmem < 0.45
• In these cases, the collection appears to have experienced a “resurgence in interest” later in its life.

Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
 
Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItVisualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-It
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
The Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 WorkshopThe Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 Workshop
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
Enabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesEnabling Personal Use of Web Archives
Enabling Personal Use of Web Archives
 
Open the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesOpen the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School Libraries
 
WS-DL’s Work towards Enabling Personal Use of Web Archives
WS-DL’s Work towards Enabling Personal Use of Web ArchivesWS-DL’s Work towards Enabling Personal Use of Web Archives
WS-DL’s Work towards Enabling Personal Use of Web Archives
 
Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 

Similar to The Many Shapes of Archive-It

Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked DataRichard Wallis
 
Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
Creating Collection Growth Curves With Archives Unleashed Toolkit And HypercaneCreating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
Creating Collection Growth Curves With Archives Unleashed Toolkit And HypercaneTravisReid5
 
Contributing to the global commons: Repositories and Wikimedia
Contributing to the global commons: Repositories and WikimediaContributing to the global commons: Repositories and Wikimedia
Contributing to the global commons: Repositories and WikimediaNick Sheppard
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked DataRichard Wallis
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...Shawn Jones
 
Using digital collections como2015
Using digital collections como2015Using digital collections como2015
Using digital collections como2015LYRASIS_PRODEV
 
Linked data - A radical change?
Linked data - A radical change?Linked data - A radical change?
Linked data - A radical change?Richard Wallis
 
The Archives of American Art on Wikipedia and Wikimedia SAA2012
The Archives of American Art on Wikipedia and Wikimedia SAA2012The Archives of American Art on Wikipedia and Wikimedia SAA2012
The Archives of American Art on Wikipedia and Wikimedia SAA2012Sara Snyder
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for ArchivesCliff Landis
 
Eastern Shores Library System digitization project
Eastern Shores Library System digitization projectEastern Shores Library System digitization project
Eastern Shores Library System digitization projectRecollection Wisconsin
 
Something about links
Something about linksSomething about links
Something about linksRoderic Page
 
Eureka! research
Eureka! researchEureka! research
Eureka! researchcybraryman
 
Collaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsCollaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsAnna Perricci
 
Lodlam.slideshare
Lodlam.slideshareLodlam.slideshare
Lodlam.slideshareHafabe
 

Similar to The Many Shapes of Archive-It (20)

Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 
Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
Creating Collection Growth Curves With Archives Unleashed Toolkit And HypercaneCreating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
 
Virtual Libraries
Virtual LibrariesVirtual Libraries
Virtual Libraries
 
One Big Library
One Big LibraryOne Big Library
One Big Library
 
MSO4991 June 2020
MSO4991 June 2020MSO4991 June 2020
MSO4991 June 2020
 
Contributing to the global commons: Repositories and Wikimedia
Contributing to the global commons: Repositories and WikimediaContributing to the global commons: Repositories and Wikimedia
Contributing to the global commons: Repositories and Wikimedia
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked Data
 
Breaking through invisible walls: Developing a new discovery catalog for a me...
Breaking through invisible walls: Developing a new discovery catalog for a me...Breaking through invisible walls: Developing a new discovery catalog for a me...
Breaking through invisible walls: Developing a new discovery catalog for a me...
 
From Record to Graph
From Record to GraphFrom Record to Graph
From Record to Graph
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
 
Using digital collections como2015
Using digital collections como2015Using digital collections como2015
Using digital collections como2015
 
Linked data - A radical change?
Linked data - A radical change?Linked data - A radical change?
Linked data - A radical change?
 
The Archives of American Art on Wikipedia and Wikimedia SAA2012
The Archives of American Art on Wikipedia and Wikimedia SAA2012The Archives of American Art on Wikipedia and Wikimedia SAA2012
The Archives of American Art on Wikipedia and Wikimedia SAA2012
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for Archives
 
Eastern Shores Library System digitization project
Eastern Shores Library System digitization projectEastern Shores Library System digitization project
Eastern Shores Library System digitization project
 
Something about links
Something about linksSomething about links
Something about links
 
Eureka! research
Eureka! researchEureka! research
Eureka! research
 
Collaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsCollaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive Awards
 
Lodlam.slideshare
Lodlam.slideshareLodlam.slideshare
Lodlam.slideshare
 

More from Shawn Jones

Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Shawn Jones
 
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...Shawn Jones
 
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Shawn Jones
 
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...Shawn Jones
 
Automatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social CardsAutomatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social CardsShawn Jones
 
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)Shawn Jones
 
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoAvoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoShawn Jones
 
Continuous Integration: Finding problems soonest
Continuous Integration: Finding problems soonestContinuous Integration: Finding problems soonest
Continuous Integration: Finding problems soonestShawn Jones
 
A Brief Introduction to Test-Driven Development
A Brief Introduction to Test-Driven DevelopmentA Brief Introduction to Test-Driven Development
A Brief Introduction to Test-Driven DevelopmentShawn Jones
 
Reconstructing the past with media wiki
Reconstructing the past with media wikiReconstructing the past with media wiki
Reconstructing the past with media wikiShawn Jones
 

More from Shawn Jones (11)

Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
 
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
 
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
 
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
 
Automatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social CardsAutomatically Selecting Striking Images for Social Cards
Automatically Selecting Striking Images for Social Cards
 
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)SHARI(StoryGraph Hypercane ArchiveNow Raintale Integration)
SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)
 
Reference Rot
Reference RotReference Rot
Reference Rot
 
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoAvoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
 
Continuous Integration: Finding problems soonest
Continuous Integration: Finding problems soonestContinuous Integration: Finding problems soonest
Continuous Integration: Finding problems soonest
 
A Brief Introduction to Test-Driven Development
A Brief Introduction to Test-Driven DevelopmentA Brief Introduction to Test-Driven Development
A Brief Introduction to Test-Driven Development
 
Reconstructing the past with media wiki
Reconstructing the past with media wikiReconstructing the past with media wiki
Reconstructing the past with media wiki
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

The Many Shapes of Archive-It

  • 1. The Many Shapes of Archive-It Shawn M. Jones Alexander Nwala Michele C. Weigle Michael L. Nelson Old Dominion University Web Science and Digital Libraries Research Group @WebSciDL sjone@cs.odu.edu @shawnmjones anwala@cs.odu.edu @acnwala mweigle@cs.odu.edu @weiglemc mln@cs.odu.edu @phonedude_mln Thanks to:
  • 2. @shawnmjones @WebSciDL Researchers Create Their Own Web Archive Collections 2 Archived web pages, or mementos, are used by journalists, sociologists, and historians. Tucson Shootings, 2008 Olympics, University of Utah
  • 3. @shawnmjones @WebSciDL Web Archive Collections Have Many Versions of the Same Page 3 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015
  • 4. @shawnmjones @WebSciDL Different Versions Allow Us to See an Unfolding News Story 4 Memento from April 19, 2013 17:12 Searching for Suspects, City on Lockdown Memento from April 19, 2013 17:59 Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 21, 2013 2:24 Suspect Found, Officer Collier Lost Life, Obama speaks
  • 5. @shawnmjones @WebSciDL Different Versions Allow Us To See Changes In An Organization’s Web Presence 5 The White House: 2016 The White House: 2018
  • 6. @shawnmjones @WebSciDL The Internet Archive created Archive-It so organizations could create their own web archive collections Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos of these seeds at different points in time. 6
  • 7. @shawnmjones @WebSciDL But this is the interface available for browsing those collections… 7 How do we tell the difference without going through them all? What types of collections exist?
  • 8. @shawnmjones @WebSciDL How can we understand an Archive-It collection?
  • 9. @shawnmjones @WebSciDL We Can Understand It Based On Metadata 9 Collection wide Metadata Metadata on Individual Seeds Dublin Core + Custom Fields
  • 10. @shawnmjones @WebSciDL We Can Understand It Based On Metadata, but the Metadata Does Not Always Help… 10 132,599 seeds no metadata 9 seeds with metadata Because metadata is optional it is not always present.
  • 11. @shawnmjones @WebSciDL We Can Understand It Based On Metadata, but the Metadata Does Not Always Help… 11 Because metadata is optional it is not always present. When it is present, metadata on Archive-It collections is: • generated by many different curators • from different organizations • with different content standards • and different rules of interpretation
  • 12. @shawnmjones @WebSciDL We Can Understand It Based On Metadata, but the Metadata Does Not Always Help… 12 Because metadata is optional it is not always present. When it is present, metadata on Archive-It collections is: • generated by many different curators • from different organizations • with different content standards • and different rules of interpretation It is inconsistently applied! This means that a user cannot reliably compare metadata fields to understand the differences between collections.
  • 13. @shawnmjones @WebSciDL We Can Understand It Based on Content • We can use techniques such as text mining and network analysis. The same collection in the Archives Unleashed Cloud https://archivesunleashed.org 13
  • 14. @shawnmjones @WebSciDL We Can Understand It Based on Content, but all of that Content Must Be Dereferenced… 14
  • 15. @shawnmjones @WebSciDL We Can Understand It Based on Content, but all of that Content Must Be Dereferenced… 15 Remember: • Each result is a seed • Each seed has multiple mementos
  • 16. @shawnmjones @WebSciDL We Can Understand It Based on Content, but all of that Content Must Be Dereferenced… 16 There are 486,227 seed mementos that must be downloaded and processed to understand this collection. Remember: • Each result is a seed • Each seed has multiple mementos
  • 17. @shawnmjones @WebSciDL We Can Understand It Based on Content, but all of that Content Must Be Dereferenced… 17 There are 486,227 seed mementos that must be downloaded and processed to understand this collection. Remember: • Each result is a seed • Each seed has multiple mementos These 333 seeds correspond to 278,306 seed mementos. They must be downloaded and processed.
  • 18. @shawnmjones @WebSciDL and what if we do not know the language? 18 ??? About University of Utah English non-German Speakers can discern: About shootings in Tucson
  • 19. @shawnmjones @WebSciDL How else can we understand an Archive-It collection?
  • 20. @shawnmjones @WebSciDL What kinds of questions can be answered with Structural Features? • Using only structural features is advantageous because it saves one from having to dereference all of the URIs in a collection. • These structural features also give us different insight than can be provided by text analysis or metadata. 20 81,014 seeds 486,227 seed mementos
  • 21. @shawnmjones @WebSciDL Does most of the collection exist earlier or later in its life? 21 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life.
  • 22. @shawnmjones @WebSciDL When did the curator select and archive a collection’s contents? 22 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected throughout its life. It has mementos as recent as July 2018.
  • 23. @shawnmjones @WebSciDL Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 23 This collection was created in June 2014. The seeds were selected at the beginning of its life. Mementos were captured throughout its life.
  • 24. @shawnmjones @WebSciDL Was the collection built from web sites belonging to one domain or many? 24 Many domains One domain
  • 25. @shawnmjones @WebSciDL Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? 25 Top-level pages Deeper Links
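The two questions on slides 24 and 25 can be approximated from seed URIs alone, without downloading any content. The following is a minimal sketch, not the authors' method: the slides do not say how domains or page depth were measured, so using the URI hostname as the "domain" and the number of non-empty path segments as the "depth" are both assumptions, and the seed URIs and function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical seed URIs, for illustration only.
SEEDS = [
    "http://www.example.edu/",
    "http://www.example.edu/admissions/apply.html",
    "http://news.example.com/2013/04/19/story",
    "http://blog.example.org/",
]


def domain_counts(seed_uris):
    """Count seeds per hostname (assumption: hostname stands in for 'domain')."""
    return Counter(urlsplit(uri).hostname for uri in seed_uris)


def path_depth(uri):
    """Number of non-empty path segments; 0 means a top-level page."""
    return len([segment for segment in urlsplit(uri).path.split("/") if segment])


if __name__ == "__main__":
    counts = domain_counts(SEEDS)
    print("distinct hostnames:", len(counts))
    for uri in SEEDS:
        print(path_depth(uri), uri)
```

A collection whose seeds mostly share one hostname and have depth 0 would look like a "one domain, top-level pages" collection in this sketch. Treating subdomains of the same site as one domain would need extra handling that is not shown here.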
  • 26. @shawnmjones @WebSciDL Other questions answered by structural features: • Was there renewed interest at some point later in the collection’s life? • Did the curator nurture the selected web pages throughout the collection’s life and add content continuously? • What time period does the collection span? • What is the temporal skew of the collection? • What is the lifetime of the collection? 26
  • 27. @shawnmjones @WebSciDL Can we bridge the structural to the descriptive? • We can categorize Archive-It’s collections into four main semantic categories. • We can predict these categories from structural features using a Random Forest classifier. 27
  • 28. @shawnmjones @WebSciDL Let’s go over a few things…
  • 29. @shawnmjones @WebSciDL Looking at Archive-It collections from the outside • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds • In this work, we focus on seeds and seed mementos 29
  • 30. @shawnmjones @WebSciDL TimeMaps from the Memento Protocol. Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times (memento-datetimes). Example TimeMap: <http://a.example.org>;rel="original", <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format" ; from="Tue, 20 Jun 2000 18:02:59 GMT" ; until="Wed, 21 Jun 2000 04:41:56 GMT", <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate", <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT", <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT", <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT" … (Figure annotations: original resource URI; TimeMap URI (URI-T); Memento URI (URI-M); memento-datetime; entries for mementos)
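A TimeMap in this link-format style can be turned into a list of memento-datetimes with a few lines of code. The following is a minimal sketch, not the authors' tooling: the function name `parse_timemap` and the regular expressions are illustrative assumptions, and the parser only handles the simple quoted-attribute layout shown on the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One entry looks like: <URI>; attr="value"; attr="value",
# Attribute text runs until the next "<", which starts the following entry.
ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return (URI-M, memento-datetime) pairs, sorted by capture time.

    Hypothetical helper: keeps only entries whose rel value includes
    "memento" (this also covers "first memento" and "last memento").
    """
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return sorted(mementos, key=lambda pair: pair[1])
```

Entries such as the original resource, the TimeMap itself, and the TimeGate carry no `datetime` attribute and are skipped, so the result holds only the capture events used in the rest of the analysis.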
  • 31. @shawnmjones @WebSciDL What other work is related to web collections?
  • 38. @shawnmjones @WebSciDL Related Work. Prior work cited: Nwala (2018), Mull (2014), Wang (2016), Ogden (2017), Fenlon (2017), Milligan (2016), Crook (2009), Slania (2013), Deutch (2016), Niu (2012). Topics covered by that work: features of digital collections; selecting seeds for web archive collections; motivations for creating collections; behavior of web archivists; studies of using Archive-It; capabilities of web archive user interfaces. How this work differs: • We focus on web archive collections • We examine web archive collections after they have been created • We look to structural features of web archives rather than user studies of live web curation platforms • We focus on the output of web archivists rather than studying their behavior in real time • We focus on structural features rather than challenges with using Archive-It as a tool • We focus on structural features of the archives rather than their user interfaces
  • 44. @shawnmjones @WebSciDL Related Work. Prior work cited: Sağlam (2014), Abramson (2012), AlSum (2014), AlNoamany (2016). Topics covered by that work: metadata standards (EAD, Dublin Core); topics in web archive collections; classification of URIs; web archive growth analysis; studies of using Archive-It. How this work differs: • We look at the structural features rather than metadata • We are not looking at the content, but the structural features of collections • We examine different features of URIs like domain and path depth • We apply AlSum’s methods to specific collections rather than entire archives • We look at collections as units rather than analyzing Archive-It as a whole
  • 45. @shawnmjones @WebSciDL How did we acquire the data for this study?
  • 46. @shawnmjones @WebSciDL Acquiring 9,351 Archive-It collections. We used BeautifulSoup to scrape the web pages of 9,351 Archive-It Collections. From this scraping we discovered: • Whether the collection was public or private • Seed URIs. Using the Seed URIs, we discovered TimeMaps listing all seed mementos and their memento-datetimes.
  • 47. @shawnmjones @WebSciDL Remove 4,823 private collections: Private collections do not allow access to seeds, seed mementos, or TimeMaps
  • 48. @shawnmjones @WebSciDL Remove 440 young collections: Collections younger than a year may still be building, possibly skewing results
  • 49. @shawnmjones @WebSciDL Remove empty collections: Empty collections have no data to analyze
  • 50. @shawnmjones @WebSciDL Remove 48 collections with errors: Collections with download or processing errors may skew the results
  • 51. @shawnmjones @WebSciDL Remove 357 collections with a single memento: Singletons consist of a single seed with a single memento, offering no behavior to study
  • 52. @shawnmjones @WebSciDL Remove 21 instantaneous collections: Single-second collections were captured in a single second, offering no behavior over time to study
  • 53. @shawnmjones @WebSciDL Remove 32 test collections: Collections clearly marked as test or trial do not represent regular collection behavior
  • 54. @shawnmjones @WebSciDL We study the remaining 3,382 collections. This leaves us with 3,382 collections for study with a total of: • 700,835 seeds • 6,943,677 seed mementos
  • 55. @shawnmjones @WebSciDL Understanding Collection Growth Through Time: collections that do not grow are not interesting for us
  • 56. @shawnmjones @WebSciDL Growth curves help us understand collection growth, but require normalization for comparison. We want to compare memento count: • “2014 Primaries” has 219,084 mementos • “The Obama White House” has 140 mementos • We normalize the number as a percentage. We want to compare seed count: • “Hurricane Sandy” has 174,884 seeds • “Scottish Politics” has 58 seeds • We normalize the number as a percentage. We want to compare time: • “Indiana: State and Local Documents” spans 2005 – 2018 • “Japan: Election 2016 House of Councilors” spans less than 2 days in July 2016 • We normalize time as a percentage of the lifespan of the collection, from the first memento-datetime to the last
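The normalization described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `growth_curve` and `collection_curves` are hypothetical names, the input is assumed to be a dict mapping each seed URI to a list of its memento-datetimes, and the seed curve is assumed to use the first memento of each seed as the moment that seed was added.

```python
def growth_curve(event_times, start, end):
    """Return cumulative (x, y) points with both axes scaled to [0, 1].

    x: fraction of the collection lifespan elapsed at each event.
    y: fraction of all events (seeds or mementos) seen so far.
    """
    lifespan = (end - start).total_seconds()
    if lifespan <= 0:
        raise ValueError("collection has no lifespan to normalize against")
    times = sorted(event_times)
    total = len(times)
    return [((t - start).total_seconds() / lifespan, (i + 1) / total)
            for i, t in enumerate(times)]


def collection_curves(timemaps):
    """Build (seed_curve, memento_curve) from {seed_uri: [datetimes]}."""
    all_times = [t for times in timemaps.values() for t in times]
    # Lifespan runs from the first memento-datetime to the last.
    start, end = min(all_times), max(all_times)
    # Assumption: a seed "arrives" when its first memento is captured.
    seed_times = [min(times) for times in timemaps.values() if times]
    return (growth_curve(seed_times, start, end),
            growth_curve(all_times, start, end))
```

Because both axes are fractions, a collection with 58 seeds spanning two days and one with 174,884 seeds spanning thirteen years produce curves on the same unit square and can be compared directly.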
  • 57. @shawnmjones @WebSciDL Once normalized, we can compare behavior in the seed growth… • Skew of the curator’s involvement with the collection • When seeds were added • When interest was lost or regained (Figure panels: Seeds added all up front; Seeds added early, but not all up front)
  • 58. @shawnmjones @WebSciDL And, we can compare behavior in the memento growth… • Built from all mementos in the collection’s TimeMaps • Skew of the collection’s holdings • Indicates temporality of collection (Figure panels: Mementos crawled all along; Mementos crawled later)
  • 59. @shawnmjones @WebSciDL We can classify different behaviors of Growth Curves • Using two features: area under the seed curve (AUCseed) and area under the seed memento curve (AUCsmem) • We can classify a collection’s growth curve into 9 categories (3 seed behaviors × 3 seed memento behaviors) • If AUC > 0.55, then those points occur early • If AUC < 0.45, then those points occur late • If 0.55 > AUC > 0.45, then those points occur continuously. Seeds: Early (AUCseed > 0.55); Continuously (0.55 > AUCseed > 0.45); Late (AUCseed < 0.45). Seed Mementos: Early (AUCsmem > 0.55); Continuously (0.55 > AUCsmem > 0.45); Late (AUCsmem < 0.45).
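The thresholds above translate into a small classifier. The sketch below makes two assumptions that the slides do not settle: the area is computed with the trapezoidal rule over normalized (x, y) points, and values exactly equal to 0.45 or 0.55 are treated as "Continuously". The names `auc`, `timing`, and `growth_category` are hypothetical.

```python
def auc(points):
    """Trapezoidal area under a curve given as (x, y) points on the unit square."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def timing(auc_value, early=0.55, late=0.45):
    """Map an AUC value to Early, Late, or Continuously."""
    if auc_value > early:
        return "Early"
    if auc_value < late:
        return "Late"
    return "Continuously"  # boundary values land here by assumption


def growth_category(auc_seed, auc_smem):
    """Combine both curves into one of the 9 growth-curve categories."""
    return f"Seeds {timing(auc_seed)}, Seed Mementos {timing(auc_smem)}"
```

A curve that hugs the diagonal has an area of 0.5 and is classified as continuous; a curve that rises steeply near the start has a larger area and is classified as early.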
  • 60. @shawnmjones @WebSciDL Seeds Early: The curators added most of the seeds at the beginning of the collection’s life and then ran crawls on different schedules.
  • 61. @shawnmjones @WebSciDL Seeds Continuously: The curators keep adding new seeds to these collections throughout each collection’s life.
  • 62. @shawnmjones @WebSciDL Seeds Late: There was renewed interest in adding seeds at some point in these collections’ lives.
  • 63. @shawnmjones @WebSciDL From These Growth Curves we have some simple Structural Features • Number of Seeds • Number of Seed Mementos • Collection Lifespan (time between first and last memento)
  • 64. @shawnmjones @WebSciDL We also have complex Growth Curve Features: Difference of Seed Curve AUC and Diagonal. Subtracting the AUC of the diagonal from the AUC of the seed curve: • We can more easily see if the seed curve is early or late • Early is positive • Late is negative • “Close” to 0 means continuous
  • 65. @shawnmjones @WebSciDL More complex Growth Curve Features: Difference of Seed Memento Curve AUC and Diagonal. Subtracting the AUC of the diagonal from the AUC of the seed memento curve: • We can more easily see if the seed memento curve is early or late • Early is positive • Late is negative • “Close” to 0 means continuous
  • 66. @shawnmjones @WebSciDL More complex Growth Curve Features: Diff. of Seed Curve AUC and Seed Memento Curve AUC 66 Difference between the seed curve AUC and the seed memento curve AUC indicates how close the two are. A value of 0 means that the two overlap, likely meaning that there is one memento per seed. A positive value means that the seeds are added earlier than the seed mementos. A negative value means that the seed memento growth has overtaken the seed growth.
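The three difference features can be written down directly. This is a small sketch with hypothetical names; it assumes both curves are normalized to the unit square, so the diagonal has an area of exactly 0.5.

```python
DIAGONAL_AUC = 0.5  # area under y = x on the unit square


def curve_difference_features(auc_seed, auc_smem):
    """Return the three AUC difference features for one collection."""
    return {
        # Positive: seeds added early. Negative: seeds added late.
        "seed_minus_diagonal": auc_seed - DIAGONAL_AUC,
        # Positive: mementos captured early. Negative: captured late.
        "smem_minus_diagonal": auc_smem - DIAGONAL_AUC,
        # Positive: seeds were added earlier than the seed mementos.
        "seed_minus_smem": auc_seed - auc_smem,
    }
```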
  • 67. @shawnmjones @WebSciDL What About Structural Features of Seeds? 67
  • 68. @shawnmjones @WebSciDL Seed URI domain diversity. Observation: Some collections only archive a single domain while others have more variety. Notation: U = number of unique domains; C = number of seeds; D = diversity; D’ = normalized diversity. Note: now known as the WSDL Diversity Index. Domain diversity: 0 (duplicate cnn.com hosts): http://www.cnn.com/path/to/story0, http://news.cnn.com/path/to/story1, http://top.cnn.com/path/to/story2. Domain diversity: 1 (no duplicate domains): http://www.cnn.com/path/to/story0, http://www.vox.com/path/to/story, http://www.foxnews.com/path/to/story. Domain diversity: 0.5 (1 duplicate cnn.com host): http://www.cnn.com/path/to/story0, http://www.cnn.com/path/to/story1, http://www.vox.com/path/to/story. Source: Alexander Nwala. (2018 May) An Exploration of URL Diversity Measures. Web Science and Digital Libraries Research Group Blog. http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
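The formula itself did not survive in this transcript. The sketch below assumes the normalized form D’ = (U − 1) / (C − 1), which reproduces all three examples on the slide (0, 1, and 0.5). The helper `registered_domain` is a deliberately naive stand-in that keeps the last two host labels; a real implementation would likely need a public-suffix list to handle hosts such as `example.co.uk`.

```python
from urllib.parse import urlparse


def registered_domain(uri):
    """Naive assumption: the domain is the last two labels of the host."""
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def normalized_diversity(values):
    """Assumed form of D': (unique - 1) / (count - 1), in [0, 1]."""
    values = list(values)
    if len(values) < 2:
        return 0.0  # assumption: a single item has no diversity
    return (len(set(values)) - 1) / (len(values) - 1)


def domain_diversity(seed_uris):
    """Diversity of the registered domains across a collection's seeds."""
    return normalized_diversity(registered_domain(u) for u in seed_uris)
```

`normalized_diversity` is kept separate from the domain extraction so that the same measure can be applied to any other per-seed value.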
  • 69. @shawnmjones @WebSciDL Path Depth. Path Depth is a concept measuring how many items exist in a URI’s path. • Based on McCown’s work, we also add 1 for any path containing a query string. Example URIs and their path depths: http://example.com/ (0); http://example.com/directory (1); http://example.com/dir1/dir2/dir3/dir4 (4); http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3 (3). Observation: Top-level pages tend to have more general information whereas deeper pages tend to have a more specific focus.
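The path depth rule can be sketched with the standard library. The function name `path_depth` is a hypothetical choice, and the handling of trailing slashes and empty segments (they are ignored) is an assumption that the slide does not spell out.

```python
from urllib.parse import urlparse


def path_depth(uri):
    """Count non-empty path segments, plus 1 if the URI has a query string."""
    parsed = urlparse(uri)
    segments = [s for s in parsed.path.split("/") if s]
    return len(segments) + (1 if parsed.query else 0)
```

Applied to the four example URIs on the slide, this yields depths of 0, 1, 4, and 3.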
  • 70. @shawnmjones @WebSciDL Seed URI Path Depth Diversity. We reuse the WSDL Diversity Index, but this time apply it to path depth. Observation: Some collections only have seeds at the top level where others only link to deeper articles. Path depth diversity: 0 (all path depths are 3): http://www.cnn.com/path/to/story0, http://news.vox.com/path/to/story1, http://top.cnn.com/path/to/story2. Path depth diversity: 1 (all completely different path depths): http://www.cnn.com/, http://news.vox.com/path/, http://top.cnn.com/path/to/story. Path depth diversity: 0.5 (1 with path depth of 0, 2 with path depth of 3): http://www.cnn.com/, http://news.vox.com/path/to/story1, http://top.cnn.com/path/to/story2.
  • 71. @shawnmjones @WebSciDL Other Seed Features • Most Frequent Path Depth: the path depth that appears most often in the seed URIs. Observation: For some collections, most seeds exist at the top level while others link to deeper articles. • % Query String Usage: the percentage of seed URIs that contain query strings. Observation: Some collections have many URIs with query strings, while others have none.
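Both of these seed features are simple aggregates. In the sketch below, the function names are hypothetical, `most_frequent_path_depth` takes a list of already-computed integer depths, and ties are broken in favor of the shallowest depth, which is an assumption not stated on the slide.

```python
from collections import Counter
from urllib.parse import urlparse


def most_frequent_path_depth(depths):
    """Return the most common path depth (shallowest depth wins a tie)."""
    counts = Counter(depths)
    best = max(counts.values())
    return min(depth for depth, count in counts.items() if count == best)


def query_string_usage(seed_uris):
    """Return the percentage of seed URIs that carry a query string."""
    uris = list(seed_uris)
    if not uris:
        return 0.0  # assumption: an empty seed list counts as 0%
    with_query = sum(1 for u in uris if urlparse(u).query)
    return 100.0 * with_query / len(uris)
```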
  • 72. @shawnmjones @WebSciDL Mapping the structural to the descriptive is hard… 72
  • 73. @shawnmjones @WebSciDL At first, we tried to map the structural features to metadata directly… • We tried using machine learning to predict the topics found in the metadata of a collection • There are problems with this approach: not all collections have topics; many collections have multiple topics; many collections have user-supplied topics.
  • 74. @shawnmjones @WebSciDL Instead, we established semantic categories of Archive-It collections • We reviewed the descriptions of 3,382 Archive-It Collections • Based on their metadata and seeds, we placed them into 4 semantic categories
  • 75. @shawnmjones @WebSciDL Self-Archiving dominates Archive-It: Self-Archiving 54.1%; Subject-based 27.6%; Time Bounded – Expected 14.1%; Time Bounded – Spontaneous 4.2%
  • 76. @shawnmjones @WebSciDL We can predict the semantic category with structural features, without processing the page content. We found that a Random Forest classifier was best able to predict the semantic category using a collection’s structural features. The Random Forest classifier works best with collections in the Self-Archiving category. (Figure panels: Results for different Machine Learning algorithms; Random Forest Results by Semantic Category)
  • 77. @shawnmjones @WebSciDL We optimized our prediction. Using Kendall Tau, we were able to determine which features had a strong correlation with the semantic category. Removing the “number of mementos” feature improved F1 scores for all categories except Self-Archiving. (Figure panels: Original; With feature removed)
  • 78. @shawnmjones @WebSciDL Where do we go from here? 78
  • 79. @shawnmjones @WebSciDL Future Work • We will adapt these structural features for our collection summarization work: the skew of growth curves may affect which mementos are chosen for review; the seed analysis features will help us better choose seeds to be included; we can incorporate this classifier to tailor summarization algorithms to specific semantic categories • We intend to work further with Archive-It to make metadata and other data more accessible so that screen-scraping is not necessary
  • 81. @shawnmjones @WebSciDL We adapted Growth Curves for collections: We can normalize & visualize curator engagement with the collection
  • 82. @shawnmjones @WebSciDL We introduced Seed Features • Seed features also help us understand the curation strategy of a collection • Are most of the seeds from the same domain? • Are most of the seeds top-level pages or deeper pages?
  • 83. @shawnmjones @WebSciDL We bridged the structural to the descriptive (Figure: Results of Random Forest Classifier)
  • 84. @shawnmjones @WebSciDL We can understand web archive collections using only structural features. Metadata scraping code available: https://github.com/oduwsdl/archiveit_utilities
  • 86. @shawnmjones @WebSciDL Growth curves allow us to understand collection curation behavior. Seed memento curve: • Built from all mementos in the collection’s TimeMaps • Skew of the collection’s holdings • Indicates temporality of collection. Seed curve: • Built from the first memento for each seed in the collection’s TimeMaps • Skew of the curatorial involvement with the collection • When seeds were added • When interest was lost or regained
  • 87. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Early (AUCseed > 0.55, AUCsmem > 0.55): Most curatorial decisions were made early in this collection’s life. Most crawling was done early in its life. The temporality of these collections skews early.
  • 88. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Continuously (AUCseed > 0.55, 0.55 > AUCsmem > 0.45): Most curatorial decisions were made early in this collection’s life. Seed mementos were added continuously. The temporality of these collections spreads throughout their lives.
  • 89. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Late (AUCseed > 0.55, AUCsmem < 0.45): Most curatorial decisions were made early in this collection’s life. Seed mementos were added later. The temporality of these collections skews more recent.
  • 90. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Early (0.55 > AUCseed > 0.45, AUCsmem > 0.55): Seeds are added throughout a collection’s life. Seed mementos were added earlier. This means that most of the content of the collection comes from earlier in its life.
  • 91. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Continuously (0.55 > AUCseed > 0.45, 0.55 > AUCsmem > 0.45): Seeds are added throughout and their seed mementos are collected continuously. These collections have a lot of curatorial involvement throughout their lives. Their contents are spread throughout their lives.
  • 92. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Late (0.55 > AUCseed > 0.45, AUCsmem < 0.45): Seeds are added throughout, but the collection is built from mementos that were collected later.
  • 93. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Early (AUCseed < 0.45, AUCsmem > 0.55): Most curatorial decisions were made later in this collection’s life, but most of the mementos were added earlier. The temporality of the collection skews earlier. Most of the mementos belong to these early seeds.
  • 94. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Continuously (AUCseed < 0.45, 0.55 > AUCsmem > 0.45): The collection’s contents are spread throughout its life, but many seeds were added later. This means that some of those early seeds have more mementos.
  • 95. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Late (AUCseed < 0.45, AUCsmem < 0.45): In these cases, the collection appears to have experienced a “resurgence in interest” later in its life.