The Many Shapes of Archive-It

Shawn M. Jones, Alexander Nwala, Michele C. Weigle, Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Research Group
@WebSciDL
sjone@cs.odu.edu
@shawnmjones
anwala@cs.odu.edu
@acnwala
mweigle@cs.odu.edu
@weiglemc
mln@cs.odu.edu
@phonedude_mln
Thanks to:

Researchers Create Their Own Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah

Web Archive Collections Have Many Versions of the Same Page
2013
2015
2018
University of Utah Office of Admissions
from the University of Utah Web Archive Collection
4/1/2015
3/5/2015
Tumblr Black Lives Matter Blog
from the #blacklivesmatter Collection
2/12/2015

Different Versions Allow Us to See an Unfolding News Story
Memento from
April 19, 2013 17:12
Searching for Suspects,
City on Lockdown
Memento from
April 19, 2013 17:59
Officer Donahue in hospital,
Lockdown loosened,
Will the Red Sox game be cancelled?
Memento from
April 21, 2013 2:24
Suspect Found,
Officer Collier Lost Life,
Obama speaks

Different Versions Allow Us To See Changes In An Organization’s Web Presence
The White House: 2016 The White House: 2018

The Internet Archive created Archive-It so organizations could create their own web archive collections
Curators can supply live web resources as seeds and establish crawling schedules for those seeds to create mementos of them at different points in time.

But this is the interface available for browsing those collections…
How do we tell the difference without going through them all?
What types of collections exist?

How can we understand an Archive-It collection?

We Can Understand It Based On Metadata
Collection-wide metadata and metadata on individual seeds: Dublin Core + custom fields

We Can Understand It Based On Metadata, but the Metadata Does Not Always Help…
• 132,599 seeds, no metadata
• 9 seeds, with metadata
Because metadata is optional, it is not always present.

When it is present, metadata on Archive-It collections is:
• generated by many different curators
• from different organizations
• with different content standards
• and different rules of interpretation
It is inconsistently applied!
This means that a user cannot reliably compare metadata
fields to understand the differences between collections.

We Can Understand It Based on Content
• We can use techniques such as text mining and network analysis
The same collection in the Archives Unleashed Cloud: https://archivesunleashed.org

We Can Understand It Based on Content, but all of that Content Must Be Dereferenced…

Remember:
• Each result is a seed
• Each seed has multiple mementos

There are 486,227 seed mementos that must be downloaded and processed to understand this collection.
These 333 seeds correspond to 278,306 seed mementos. They must be downloaded and processed.

And what if we do not know the language?
English, non-German speakers can discern:
• About University of Utah
• About shootings in Tucson
• ???

How else can we understand an Archive-It collection?

What kinds of questions can be answered with Structural Features?
• Using only structural features is advantageous because it saves one from having to dereference all of the URIs in a collection.
• These structural features also give us different insight than can be provided by text analysis or metadata.
81,014 seeds
486,227 seed mementos

Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.

When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected all along its life.
It has mementos as recent as July 2018.

Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected at the beginning of its life.
Mementos were captured throughout its life.

Was the collection built from web sites belonging to one domain or many?
Many domains, One domain

Were most of the web pages in the collection top-level pages or specific articles deeper in a web site?
Top-level pages, Deeper Links

Other questions answered by structural features:
• Was there renewed interest at some point later in the collection’s life?
• Did the curator nurture the selected web pages throughout the collection’s life and add content continuously?
• What time period does the collection span?
• What is the temporal skew of the collection?
• What is the lifetime of the collection?

Can we bridge the structural to the descriptive?
• We can categorize Archive-It’s collections into four main semantic categories.
• We can predict these categories with a Random Forest classifier trained on structural features.

Let’s go over a few things…

Looking at Archive-It collections from the outside
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
• In this work, we focus on seeds and seed mementos

TimeMaps from the Memento Protocol
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 04:41:56 GMT"
…
Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times (memento-datetimes).
Labeled parts of the TimeMap: original resource URI, TimeMap URI (URI-T), Memento URI (URI-M), memento-datetime, entries for mementos
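
To make the TimeMap structure concrete, here is a minimal sketch of pulling the URI-Ms and memento-datetimes out of a link-format TimeMap like the one above. The function name `parse_timemap` and the regular expressions are illustrative assumptions, not the tooling used in this work, and the parser only handles the simple quoted-attribute form shown on the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by zero or more ;name="value" attributes.
# Values are matched as quoted strings, so commas inside datetimes are safe.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (URI-M, memento-datetime) pairs, oldest first."""
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return sorted(mementos, key=lambda pair: pair[1])
```

Entries with other relations (original, self, timegate) are skipped, so the result holds only what the structural features need: one timestamp per memento.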

What other work is related to web collections?

Related Work
Nwala (2018), Mull (2014), Wang (2016), Ogden (2017)
features of digital collections
Fenlon (2017)
selecting seeds for web archive collections
Milligan (2016)
motivations for creating collections
behavior of web archivists
Crook (2009), Slania (2013), Deutch (2016)
studies of using Archive-It
capabilities of web archive user interfaces
Niu (2012)
• We focus on web archive collections
• We examine web archive collections after they have been created
• We look to structural features of web archives rather than user studies of live web curation platforms
• We focus on the output of web archivists rather than studying their behavior in real time
• We focus on structural features rather than challenges with using Archive-It as a tool
• We focus on structural features of the archives rather than their user interfaces

Related Work
Sağlam (2014), Abramson (2012)
AlSum (2014)
metadata standards: EAD, Dublin Core
topics in web archive collections
AlNoamany (2016)
classification of URIs
web archive growth analysis
studies of using Archive-It
AlNoamany (2016)

Related Work
• We look at the structural features rather than metadata
• We are not looking at the content, but the structural features of collections
• We examine different features of URIs like domain and path depth
• We apply AlSum’s methods to specific collections rather than entire archives
• We look at collections as units rather than analyzing Archive-It as a whole

How did we acquire the data for this study?

Acquiring 9,351 Archive-It collections
We used BeautifulSoup to scrape the web pages of 9,351 Archive-It collections.
From this scraping we discovered:
• If the collection was public or private
• Seed URIs
Using the Seed URIs, we discovered TimeMaps listing all seed mementos and their memento-datetimes.
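
As a rough sketch of the step from seed URIs to TimeMaps, the snippet below builds one TimeMap URI (URI-T) per seed. The URI pattern is an assumption based on how Archive-It's Wayback endpoint has commonly been addressed and should be checked against the live service; the helper names and the collection ID in the usage note are hypothetical.

```python
from urllib.request import urlopen

# Assumed URI-T pattern for Archive-It link-format TimeMaps; verify before use.
TIMEMAP_PATTERN = "http://wayback.archive-it.org/{collection_id}/timemap/link/{seed_uri}"


def timemap_uri(collection_id, seed_uri):
    """Build the assumed TimeMap URI for one seed in one collection."""
    return TIMEMAP_PATTERN.format(collection_id=collection_id, seed_uri=seed_uri)


def fetch_timemap(collection_id, seed_uri, timeout=30):
    """Download the TimeMap text for a seed (performs a network request)."""
    with urlopen(timemap_uri(collection_id, seed_uri), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `timemap_uri(1234, "http://example.com/")` yields a single URI-T string; looping that over a collection's seed list gives the full set of TimeMaps to download.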

Remove 4,823 private collections
Private collections do not allow access to seeds, seed mementos, or TimeMaps

Remove 440 young collections
Collections younger than a year may still be building, possibly skewing results

Remove empty collections
Empty collections have no data to analyze

Remove 48 collections with errors
Collections with download or processing errors may skew the results

Remove 357 collections with a single memento
Singletons consist of a single seed with a single memento, offering no behavior to study

Remove 21 instantaneous collections
Single-second collections were captured in a single second, offering no behavior over time to study

Remove 32 test collections
Collections clearly marked as test or trial do not represent regular collection behavior

We study the remaining 3,382 collections
This leaves us with 3,382 collections for study with a total of:
• 700,835 seeds
• 6,943,677 seed mementos
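
The filtering steps can be checked with simple bookkeeping. The sketch below uses only the counts stated on the slides; the slides do not say how many empty collections were removed, so that figure is derived here under the assumption that the removal categories do not overlap and that the stated counts are exact.

```python
# Counts taken from the slides.
scraped = 9351
removed = {
    "private": 4823,
    "young": 440,
    "errors": 48,
    "single memento": 357,
    "instantaneous": 21,
    "test": 32,
}
remaining = 3382

# Derived, not stated on the slides: whatever is left over must be the empty collections.
implied_empty = scraped - sum(removed.values()) - remaining
print(f"Implied number of empty collections: {implied_empty}")
```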

Understanding Collection Growth Through Time
collections that do not grow are not interesting for us

Growth curves help us understand collection growth, but require normalization for comparison
We want to compare memento count
• “2014 Primaries” has 219,084 mementos
• “The Obama White House” has 140
• We normalize the number as a percentage
We want to compare seed count
• “Hurricane Sandy” has 174,884 seeds
• “Scottish Politics” has 58 seeds
• We normalize the number as a percentage
We want to compare time
• “Indiana: State and Local Documents” spans 2005 – 2018
• “Japan: Election 2016 House of Councilors” spans less than 2 days in July 2016
• We normalize time as a percentage of the lifespan of the collection, from the first memento-datetime to the last
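
The normalization described above can be sketched as follows. This is an illustration under assumptions, not the code used in the study: the input shape (a dict from seed URI to its memento-datetimes) and the function names are hypothetical, and fractions in [0, 1] stand in for percentages. Following the backup slides, the seed curve is built from the first memento of each seed and the seed memento curve from all mementos.

```python
from datetime import datetime


def growth_curve(event_times, start, end):
    """Return sorted (x, y) points: x = fraction of lifespan elapsed, y = fraction of events so far."""
    span = (end - start).total_seconds()  # zero only for instantaneous collections, which were removed
    times = sorted(event_times)
    return [((t - start).total_seconds() / span, (i + 1) / len(times))
            for i, t in enumerate(times)]


def collection_curves(timemaps):
    """timemaps: dict of seed URI -> list of memento-datetimes (hypothetical input shape)."""
    all_mementos = [t for times in timemaps.values() for t in times]
    start, end = min(all_mementos), max(all_mementos)
    first_captures = [min(times) for times in timemaps.values()]
    return growth_curve(first_captures, start, end), growth_curve(all_mementos, start, end)


# Toy collection: two seeds, three mementos over ten days.
timemaps = {
    "seed-a": [datetime(2015, 1, 1), datetime(2015, 1, 11)],
    "seed-b": [datetime(2015, 1, 6)],
}
seed_curve, memento_curve = collection_curves(timemaps)
```

Because both axes run from 0 to 1, a two-day collection and a thirteen-year collection end up on the same scale and can be compared directly.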

Once normalized, we can compare behavior in the seed growth…
• Skew of the curator’s involvement with the collection
• When seeds were added
• When interest was lost or regained
Seeds added all up front
Seeds added early, but not all up front

And, we can compare behavior in the memento growth…
• Built from all mementos in the collection’s TimeMaps
• Skew of the collection’s holdings
• Indicates temporality of collection
Mementos crawled all along
Mementos crawled later

We can classify different behaviors of Growth Curves
• Using two features:
  • Area under the seed curve (AUCseed)
  • Area under the seed memento curve (AUCsmem)
• We can classify a collection’s growth curve into 9 categories
  • If AUC > 0.55, then those points occur early
  • If AUC < 0.45, then those points occur late
  • If 0.55 > AUC > 0.45, then those points occur continuously
Seeds Early: AUCseed > 0.55
Seeds Continuously: 0.55 > AUCseed > 0.45
Seeds Late: AUCseed < 0.45
Seed Mementos Early: AUCsmem > 0.55
Seed Mementos Continuously: 0.55 > AUCsmem > 0.45
Seed Mementos Late: AUCsmem < 0.45
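
A minimal sketch of the nine-way classification follows. The thresholds come from the slide; everything else is an assumption made for illustration: the curve is treated as a cumulative step function on the unit square, values exactly at 0.45 or 0.55 are treated as continuous, and the function names are hypothetical.

```python
def step_auc(points):
    """Area under a cumulative step curve given sorted (x, y) points with x and y in [0, 1]."""
    area = 0.0
    # Each y value holds until the next point's x; the last one holds until x = 1.
    for (x0, y0), (x1, _) in zip(points, points[1:] + [(1.0, None)]):
        area += y0 * (x1 - x0)
    return area


def skew_label(auc):
    """Map an AUC to Early / Late / Continuously using the slide's thresholds."""
    if auc > 0.55:
        return "Early"
    if auc < 0.45:
        return "Late"
    return "Continuously"  # boundary values land here by assumption


def classify(auc_seed, auc_smem):
    """Combine both labels into one of the 9 growth curve categories."""
    return f"Seeds {skew_label(auc_seed)}, Seed Mementos {skew_label(auc_smem)}"
```

For instance, a seed curve with points `[(0.0, 0.5), (0.5, 1.0)]` has an area of 0.75, which falls in the early band.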

Seeds Early
The curators added most of the seeds at the beginning of the collection’s life and then crawled them on different schedules.

Seeds Continuously
The curators keep adding new things to these collections throughout each collection’s life.

Seeds Late
There was renewed interest in adding seeds at some point in these collections’ lives.

From These Growth Curves we have some simple Structural Features
• Number of Seeds
• Number of Seed Mementos
• Collection Lifespan
  • Time between first and last memento

We also have complex Growth Curve Features:
Difference of Seed Curve AUC and Diagonal
Subtracting the AUC of the diagonal from the AUC of the seed curve:
• We can more easily see if the seed curve is early or late
• Early is positive (pos.)
• Late is negative (neg.)
• “Close” to 0 means continuous

More complex Growth Curve Features:
Difference of Seed Memento Curve AUC and Diagonal
Subtracting the AUC of the diagonal from the AUC of the seed memento curve:
• We can more easily see if the seed memento curve is early or late
• Early is positive (pos.)
• Late is negative (neg.)
• “Close” to 0 means continuous

More complex Growth Curve Features:
Diff. of Seed Curve AUC and Seed Memento Curve AUC
Difference between the seed curve AUC and the seed memento curve AUC indicates how close the two are.
A value of 0 means that the two overlap, likely meaning that there is one memento per seed.
A positive value means that the seeds are added earlier than the seed mementos.
A negative value means that the seed memento growth has overtaken the seed growth.
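
The three difference features reduce to simple subtraction once both AUC values are known. In this sketch the dictionary keys and function name are made up for illustration; the only outside fact used is that the diagonal of the unit square has an area of 0.5 beneath it.

```python
DIAGONAL_AUC = 0.5  # area under y = x on the normalized unit square


def growth_curve_features(auc_seed, auc_smem):
    """Return the three difference features described on the slides."""
    return {
        # Positive: seeds skew early; negative: late; near 0: continuous.
        "seed_minus_diagonal": auc_seed - DIAGONAL_AUC,
        # Same reading, applied to the seed memento curve.
        "smem_minus_diagonal": auc_smem - DIAGONAL_AUC,
        # Positive: seeds arrive earlier than seed mementos; 0: the curves overlap.
        "seed_minus_smem": auc_seed - auc_smem,
    }
```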

What About Structural Features of Seeds?

Seed URI domain diversity
Alexander Nwala. (2018 May) An Exploration of URL Diversity Measures. Web Science and Digital Libraries Research Group Blog.
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
Domain diversity: 0
(duplicate cnn.com hosts)
http://www.cnn.com/path/to/story0
http://news.cnn.com/path/to/story1
http://top.cnn.com/path/to/story2
Domain diversity: 1
(no duplicate domains)
http://www.vox.com/path/to/story
http://www.foxnews.com/path/to/story
Domain diversity: 0.5
(1 duplicate cnn.com host)
http://www.vox.com/path/to/story
U = # of unique domains
C = number of seeds
D = diversity
D’ = normalized diversity
* Now known as the WSDL Diversity Index
Observation: Some collections only archive a single domain while others have more variety.
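
The formula itself did not survive in this transcript, so the sketch below uses a form that reproduces all three examples above: D = U / C, normalized as D' = (U - 1) / (C - 1). Treat that form as an assumption and consult the cited blog post for the authoritative definition. Reducing a host to its last two labels (so news.cnn.com becomes cnn.com) is also a simplification that breaks for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def domain_of(uri):
    """Approximate the registered domain as the last two labels of the host."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def normalized_diversity(items):
    """Assumed form D' = (U - 1) / (C - 1); 0 when every item is the same."""
    count = len(items)
    if count < 2:
        return 0.0  # diversity is undefined for fewer than two items; 0 by convention here
    return (len(set(items)) - 1) / (count - 1)


def domain_diversity(seed_uris):
    """Normalized diversity of the seed URIs' domains."""
    return normalized_diversity([domain_of(uri) for uri in seed_uris])
```

With this form, three cnn.com hosts score 0, two distinct domains score 1, and one vox.com seed plus two cnn.com seeds score 0.5, matching the slide.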

Path Depth
Path Depth is a concept measuring how many items exist in a URI’s path
• Based on McCown’s work, we also add 1 for any path containing a query string:

Example URI                                                        Path Depth
http://example.com/                                                0
http://example.com/directory                                       1
http://example.com/dir1/dir2/dir3/dir4                             4
http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3        3

Observation: Top-level pages tend to have more general information whereas deeper pages tend to have a more specific focus.
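
The path depth rule can be written in a few lines with the standard library. This is a sketch that follows the rule as stated on the slide (count non-empty path segments, add 1 if there is a query string); the function name is an assumption.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Count non-empty path segments, plus 1 when the URI has a query string."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    if parts.query:
        depth += 1
    return depth
```

Applied to the four example URIs in the table, it returns 0, 1, 4, and 3.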

Seed URI Path Depth Diversity
Path depth diversity: 0
(All path depths are 3)
http://news.vox.com/path/to/story1
Path depth diversity: 1
(all completely different path depths)
http://www.cnn.com/
http://news.vox.com/path/
http://top.cnn.com/path/to/story
Path depth diversity: 0.5
(1 path depth of 0, 2 with depth of 3)
http://www.cnn.com/
http://news.vox.com/path/to/story1
Observation: Some collections only have seeds at the top level where others only link to deeper articles.
We reuse the WSDL Diversity Index, but this time apply it to path depth.
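
Applying the same index to path depths might look like the sketch below. As with domain diversity, the normalized form (U - 1) / (C - 1) is an assumption chosen because it reproduces the examples on this slide, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Non-empty path segments, plus 1 if a query string is present."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    return depth + 1 if parts.query else depth


def path_depth_diversity(seed_uris):
    """Assumed normalized diversity (U - 1) / (C - 1) over the seeds' path depths."""
    depths = [path_depth(uri) for uri in seed_uris]
    if len(depths) < 2:
        return 0.0  # undefined for fewer than two seeds; 0 by convention here
    return (len(set(depths)) - 1) / (len(depths) - 1)
```

Three seeds with depths 0, 1, and 3 score 1, while one root page plus two depth-3 articles score 0.5.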

Other Seed Features
• Most Frequent Path Depth
  • The path depth that appears most in the seed URIs
  • Observation: For some collections, most seeds exist at the top level while others link to deeper articles.
• % Query String Usage
  • How many URIs contain query strings
  • Observation: Some collections have many URIs with query strings, while others have none.
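
Both of these features are simple aggregates over the seed list. The sketch below is illustrative only: the function names are assumptions, and ties for the most frequent depth are resolved by whichever depth appears first, which the slides do not specify.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Non-empty path segments, plus 1 if a query string is present."""
    parts = urlsplit(uri)
    depth = len([segment for segment in parts.path.split("/") if segment])
    return depth + 1 if parts.query else depth


def most_frequent_path_depth(seed_uris):
    """Return the path depth that occurs most often among the seeds."""
    return Counter(path_depth(uri) for uri in seed_uris).most_common(1)[0][0]


def query_string_usage(seed_uris):
    """Return the percentage of seed URIs that carry a query string."""
    with_query = sum(1 for uri in seed_uris if urlsplit(uri).query)
    return 100.0 * with_query / len(seed_uris)
```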

Mapping the structural to the descriptive is hard…

At first, we tried to map the structural features to metadata directly…
• We tried using machine learning to predict the topics found in the metadata of a collection
• There are problems with this approach:
  • Not all collections have topics.
  • Many collections have multiple topics.
  • Many collections have user-supplied topics.

Instead, we established semantic categories of Archive-It collections
• We reviewed the descriptions of 3,382 Archive-It Collections
• Based on their metadata and seeds, we placed them into 4 semantic categories

Self-Archiving dominates Archive-It
• Self-Archiving: 54.1%
• Subject-based: 27.6%
• Time Bounded – Expected: 14.1%
• Time Bounded – Spontaneous: 4.2%

We can predict the semantic category with structural features, without processing the page content
Figures: Results for different Machine Learning algorithms; Random Forest Results by Semantic Category
We found that a Random Forest classifier was best able to predict the semantic category using a collection’s structural features.
The Random Forest classifier works best with collections in the Self-Archiving category.

We optimized our prediction
Using Kendall Tau, we were able to determine which features had a strong correlation with the semantic category.
Removing the “number of mementos” feature improved F1 scores for all categories, except Self-Archiving.
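
For readers unfamiliar with the measure, here is a small pure-Python sketch of Kendall's tau (the tau-b variant, which adjusts for ties). It is not the code used in this work. The slides do not say how the semantic categories were encoded for the correlation, so passing them in as integer codes is an assumption.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    denominator = sqrt((pairs - ties_x) * (pairs - ties_y))
    # A constant input has no rankable pairs; report 0 rather than dividing by zero.
    return (concordant - discordant) / denominator if denominator else 0.0
```

Here `xs` would hold one structural feature per collection and `ys` the integer-coded category; values near 1 or -1 indicate a strong monotonic relationship, values near 0 a weak one.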
Figure panels: Original; With feature removed

Where do we go from here?

Future Work
• We will adapt these structural features for our collection summarization work
  • The skew of growth curves may affect which mementos are chosen for review
  • The seed analysis features will help us better choose seeds to be included
  • We can incorporate this classifier to tailor summarization algorithms to specific semantic categories
• We intend to work further with Archive-It to make metadata and other data more accessible so that screen-scraping is not necessary

Conclusion

We adapted Growth Curves for collections
We can normalize & visualize curator engagement with the collection

We introduced Seed Features
• Seed features also help us understand the curation strategy of a collection
  • Are most of the seeds from the same domain?
  • Are most of the seeds top-level pages or deeper pages?

We bridged the structural to the descriptive
Results of Random Forest Classifier

We can understand web archive collections using only structural features
Thanks to:
Metadata scraping code available: https://github.com/oduwsdl/archiveit_utilities

Backup Slides

Growth curves allow us to understand collection curation behavior
Seed memento growth curve:
• Built from all mementos in the collection’s TimeMaps
• Skew of the collection’s holdings
• Indicates temporality of collection
Seed growth curve:
• Built from the first memento for each seed in the collection’s TimeMaps
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained

Seeds Early, Seed Mementos Early
Most curatorial decisions were made early in this collection’s life.
Most crawling was done early in its life.
The temporalness of these collections skews early.
AUCseed > 0.55
AUCsmem > 0.55

Seeds Early, Seed Mementos Continuously
Most curatorial decisions were made early in this collection’s life.
Seed mementos were added continuously.
The temporalness of these collections spreads throughout their lives.
AUCseed > 0.55
0.55 > AUCsmem > 0.45

Seeds Early, Seed Mementos Late
Most curatorial decisions were made early in this collection’s life.
Seed mementos were added later.
The temporalness of these collections skews more recent.
AUCseed > 0.55
AUCsmem < 0.45

Seeds Continuously, Seed Mementos Early
0.55 > AUCseed > 0.45
AUCsmem > 0.55
Seeds are added throughout a collection’s life.
Seed mementos were added earlier.
This means that most of the content of the collection comes from earlier in its life.

Seeds Continuously, Seed Mementos Continuously
0.55 > AUCseed > 0.45
0.55 > AUCsmem > 0.45
Seeds are added throughout and their seed mementos are collected continuously.
These collections have a lot of curatorial involvement throughout their life.
Their contents are spread throughout their life.

Seeds Continuously, Seed Mementos Late
0.55 > AUCseed > 0.45
AUCsmem < 0.45
Seeds are added throughout, but the collection is built from mementos that were collected later.

Seeds Late, Seed Mementos Early
AUCseed < 0.45
AUCsmem > 0.55
Most curatorial decisions were made later in this collection’s life.
But most of the mementos were added earlier.
The temporalness of the collection skews earlier.
Most of the mementos belong to these early seeds.

Seeds Late, Seed Mementos Continuously
AUCseed < 0.45
0.55 > AUCsmem > 0.45
The collection’s contents are spread throughout its life, but many seeds were added later.
This means that some of those early seeds have more mementos.

Seeds Late, Seed Mementos Late
AUCseed < 0.45
AUCsmem < 0.45
In these cases, the collection appears to have experienced a “resurgence in interest” later in its life.
