Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane

Travis Reid
Web Science and Digital Libraries Research Group
Old Dominion University
@TReid803 @WebSciDL @oducs

A Seed Is A URI Selected By An Archivist
Archive-It Collection: https://archive-it.org/collections/366
Figure label: Seeds

A Memento Is An Archived Web Page And A TimeMap Is A List Of Mementos
Figure labels: Seeds; TimeMap: List Of Mementos; Mementos of a seed

Examples Of Seed Mementos
Figure labels: Seeds; TimeMap: List Of Mementos; Mementos of a seed; A Seed Memento

Collection Growth Curves
● A collection growth curve is used for gaining a better understanding of:
○ Seed curation
○ Crawling behavior
● “The Many Shapes of Archive-It” first applied the concept of collection growth curves to Archive-It collections
○ https://arxiv.org/abs/1806.06878
● I created a Google Colab notebook that can be used to create collection growth curves.
Figure: The Anatomy of a Collection Growth Curve (Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878)

The Shape Of A Seed Line Depends On When The Seeds Are Added
Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
● The line will be near the upper left corner if most seeds are added early, like 900 seeds added in the first 20 days
● The line will be closer to the diagonal when seeds are added regularly, like 2 seeds added per day for 500 days
● The line will be near the lower right corner if most seeds are added late, like 850 seeds added during the last 30 days

The Shape Of A Seed Memento Line Depends On When The Seed Mementos Are Added
Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
● The line will be near the upper left corner if most seed mementos are added early, like 8,000 seed mementos added during the first 40 days
● The line will be closer to the diagonal when seed mementos are added regularly, like 20 seed mementos added per day for 500 days
● The line will be near the lower right corner if most seed mementos are added late, like 9,500 seed mementos added during the last 50 days

Growth Curves With Different Durations Can Be Compared Since The x-axis Is A Percentage
● Start of the x-axis: the beginning of the collection’s life
● End of the x-axis: either the end of the collection’s life or the current time when the growth curve is created
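The normalization described on the last three slides can be sketched in a few lines of Python. This is only an illustration, not the code used in the growth curve notebook: the function name growth_curve and the example dates are assumptions. It turns a list of dates (when each seed or seed memento was added) into points whose x value is the percentage of the collection's lifespan and whose y value is the percentage of items added so far.

```python
from datetime import date, timedelta


def growth_curve(event_dates, start, end):
    """Return (percent of lifespan, percent of items added) for each item.

    event_dates: dates on which seeds (or seed mementos) were added.
    start, end: beginning and end of the collection's life.
    """
    lifespan_days = (end - start).days
    if lifespan_days <= 0:
        raise ValueError("end must be later than start")
    ordered = sorted(event_dates)
    total = len(ordered)
    points = []
    for count, day in enumerate(ordered, start=1):
        x = 100.0 * (day - start).days / lifespan_days
        y = 100.0 * count / total
        points.append((x, y))
    return points


# Hypothetical collection: 1,000 seeds over a 500-day lifespan,
# 900 seeds added during the first 20 days and 100 on the last day.
start = date(2020, 1, 1)
end = start + timedelta(days=500)
early = [start + timedelta(days=i % 20) for i in range(900)]
late = [end] * 100
points = growth_curve(early + late, start, end)

print(points[899])  # (3.8, 90.0): 90% of seeds within the first 3.8% of the lifespan
print(points[-1])   # (100.0, 100.0)
```

Plotting these points gives a seed line that hugs the upper left corner, matching the "most seeds added early" case. Because both axes are percentages, curves from collections with different lifespans and sizes share the same scale.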

Main Tools Used
● Docker
● Hypercane
● Archives Unleashed Toolkit (AUT)
● Archive-It Utilities (AIU)

Docker
Docker is used in the examples in these slides because it makes it easier to install and set up the dependencies needed for AUT and Hypercane.
Docker Desktop: https://www.docker.com/products/docker-desktop
Source: www.docker.com/company/newsroom/media-resources

Hypercane
● When you do not own the collection, Hypercane is needed to get the WARCs for the collection
● If you already have WARC files for a collection, then you do not need to use Hypercane
● GitHub repository: https://github.com/oduwsdl/hypercane
● Hypercane blog post part 2: https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html
Workflow: given the collection ID for a public Archive-It collection, Hypercane creates WARC files for the collection, which can then be processed with Archives Unleashed Toolkit
Source: github.com/oduwsdl/hypercane

Owners of a collection can use AU Toolkit or AU Cloud to create the derivative needed for the growth curve notebook
AUT documentation: https://aut.docs.archivesunleashed.org/docs/home
Working With Archives Unleashed Cloud: https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html
Flowchart decisions (Yes/No): User owns the collection; Process locally with AUT; Have WARC files
Flowchart steps: Provide the collection ID (Hypercane); Provide WARC files for the collection (Archives Unleashed Toolkit); Archives Unleashed Cloud; Create web page text derivative

Archive-It Utilities (AIU)
This tool is needed for getting seed metadata that cannot be determined from the WARC files alone.
For more information: https://ws-dl.blogspot.com/2018/07/2018-07-03-extracting-metadata-from.html
GitHub repository: https://github.com/oduwsdl/archiveit_utilities
Workflow: given the collection ID for a public Archive-It collection, AIU extracts collection metadata from the collection page on Archive-It

Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
If you already have WARCs, then you can go directly to step 2.
If you have a web page text derivative from Archives Unleashed Cloud, then you can go directly to step 3.

Steps For Creating WARC Files With Hypercane
git clone https://github.com/oduwsdl/hypercane.git
cd hypercane
docker-compose run hypercane hc --help
mkdir ../hypercane_workspace
cp ./docker-compose.yml ../hypercane_workspace/docker-compose.yml
cd ../hypercane_workspace
docker-compose run hypercane hc synthesize warcs -i archiveit -a 4006 -o 4006_warcs
If you use this example to create WARC files, then only the collection ID (4006) and the output directory (4006_warcs) need to be changed.
If Docker Compose is installed, then this example should work on Windows (with PowerShell) and Unix systems.
Docker Compose: https://docs.docker.com/compose/install/
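Since only the collection ID and the output directory change between runs, the last command can be generated rather than retyped. The sketch below is an illustration only: the helper name hypercane_warc_command is hypothetical, and the arguments are copied from the example command on this slide. It builds the command without running it.

```python
import shlex


def hypercane_warc_command(collection_id, output_dir):
    """Build the docker-compose command from the slide for one collection."""
    return [
        "docker-compose", "run", "hypercane",
        "hc", "synthesize", "warcs",
        "-i", "archiveit",
        "-a", str(collection_id),  # Archive-It collection ID
        "-o", output_dir,          # directory where the WARC files are written
    ]


command = hypercane_warc_command(4006, "4006_warcs")
print(shlex.join(command))
# docker-compose run hypercane hc synthesize warcs -i archiveit -a 4006 -o 4006_warcs
```

To actually run it, the list could be passed to subprocess.run(command, check=True) from inside the hypercane_workspace directory.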

Steps For Creating A Web Page Text Derivative With AUT
1. Use Docker to launch an Apache Spark shell with AUT
2. Create web page text derivative file(s)
3. If there are multiple web page text derivative files, then combine them into one file

Using AUT With Docker
Unix Example:
docker run --rm -it -v "/tr/hypercane_workspace/4006_warcs:/4006_warcs" archivesunleashed/docker-aut
Windows Example:
docker run --rm -it -v "C:\Users\tr\hypercane_workspace\4006_warcs:/4006_warcs" archivesunleashed/docker-aut
In the Windows example, the Unix file naming convention still applies to the container side of the volume mapping (:/4006_warcs).
AUT Docker Image: https://hub.docker.com/r/archivesunleashed/docker-aut
Creating Derivatives Similar To Web Page Text Derivative: Scala DF
Make sure to use :paste before pasting the statements below:
scala> :paste
import io.archivesunleashed._
import io.archivesunleashed.udfs._

// Input path: change this to the location of the WARC files
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .webpages()
  .select($"crawl_date", removePrefixWWW(extractDomain($"url")).as("domain"),
    $"url", $"mime_type_web_server", $"mime_type_tika", $"language",
    removeHTML(removeHTTPHeader($"content")).alias("content"))
  // Output path: this path should also be updated
  .write.csv("/path/to/warcs/full-text-df/")
https://github.com/archivesunleashed/aut/blob/39fc370e814fa294545213e918529260dadae261/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L24
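The CSV files written by the statements above have no header row (Spark's CSV writer omits it unless asked), so the column names have to be supplied when reading them. The following Python sketch shows one way to do that, assuming the column order of the select() call above. The sample row and the function name read_derivative are made up for illustration, and this is not necessarily how the growth curve notebook reads the file.

```python
import csv
import io

# Column order taken from the select() call in the Scala statements.
COLUMNS = [
    "crawl_date", "domain", "url",
    "mime_type_web_server", "mime_type_tika", "language", "content",
]


def read_derivative(handle):
    """Read a headerless web page text derivative into a list of dicts."""
    return list(csv.DictReader(handle, fieldnames=COLUMNS))


# Hypothetical one-row derivative used only to demonstrate the reader.
sample = io.StringIO(
    '20200101,example.com,http://example.com/,text/html,text/html,en,"Hello, world"\n'
)
rows = read_derivative(sample)
print(rows[0]["crawl_date"], rows[0]["url"])  # 20200101 http://example.com/
```

Real derivatives may need extra care: if the page text contains quote characters, the reader's escapechar and doublequote settings may have to be adjusted to match how Spark escaped them.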

Combine All The Text Derivative Files
Example (Unix):
mkdir /path/to/warcs/full-text-df/Combined
cat /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell 7):
mkdir /path/to/warcs/full-text-df/Combined
Get-Content -Encoding utf8NoBOM /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
The utf8NoBOM encoding makes it easy to read the file in Python, but it is not available in older versions of PowerShell like version 5.1.
Update PowerShell: docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-windows
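If neither shell is convenient, the same combine-and-sort step can be approximated in Python, which also sidesteps the PowerShell encoding issue. This is a sketch under assumptions: the function name combine_csv_parts and the temporary example files are hypothetical, and Python sorts lines by code point, which may order some lines differently than the sort command does.

```python
import tempfile
from pathlib import Path


def combine_csv_parts(parts_dir, combined_path):
    """Concatenate every *.csv file in parts_dir, sort the lines, write one file."""
    lines = []
    for part in sorted(Path(parts_dir).glob("*.csv")):
        with part.open(encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    lines.sort()
    combined_path = Path(combined_path)
    combined_path.parent.mkdir(parents=True, exist_ok=True)
    # UTF-8 without a byte order mark, with Unix line endings
    with combined_path.open("w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)


# Demonstration with two small hypothetical part files.
with tempfile.TemporaryDirectory() as workdir:
    parts = Path(workdir)
    (parts / "part-00000.csv").write_text("b,2\na,1\n", encoding="utf-8")
    (parts / "part-00001.csv").write_text("c,3\n", encoding="utf-8")
    combined = parts / "Combined" / "collectionID-fulltext.csv"
    line_count = combine_csv_parts(parts, combined)
    combined_text = combined.read_text(encoding="utf-8")

print(line_count)  # 3
```

Like the shell examples, this works line by line, so it assumes each record in the part files occupies a single line.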

Compress The Text Derivative
Example (Unix):
gzip -k /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell):
Compress-Archive -Path \path\to\warcs\full-text-df\Combined\collectionID-fulltext.csv -DestinationPath \path\to\warcs\full-text-df\collectionID-fulltext.zip
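The Unix example can also be reproduced on any platform with Python's standard library. The sketch below mirrors gzip -k by writing a .gz file next to the original and leaving the original in place. The function name gzip_keep and the temporary example file are assumptions made for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_keep(path):
    """Compress path to path + '.gz' and keep the original file (like gzip -k)."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Demonstration with a small hypothetical derivative file.
with tempfile.TemporaryDirectory() as workdir:
    csv_path = Path(workdir) / "collectionID-fulltext.csv"
    csv_path.write_bytes(b"20200101,example.com\n")
    gz_path = gzip_keep(csv_path)
    original_kept = csv_path.exists()
    with gzip.open(gz_path, "rb") as handle:
        round_trip = handle.read()

print(gz_path.name)  # collectionID-fulltext.csv.gz
```

Whichever compression is used (gzip or zip), it has to match the compression type that is later entered in the growth curve notebook.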

Go To Zenodo
https://zenodo.org/

Select Upload

Select New Upload

Publish The Record After Uploading Files

Files Cannot Be Modified After Publishing A Record

When Files Need To Be Changed, Create A New Version

Published Upload

Copy A Derivative’s Link

Download The Derivative From Zenodo
The link copied in the previous step is used in the collection growth curve notebook to download the derivative.
Growth Curve Notebook: https://colab.research.google.com/drive/1xpas-80K3yygMsK8DnRE2l83jfpiO-xs

Go To Collection Growth Curve Notebook
https://github.com/treid003/Collection-Growth-Curve-Notebook/blob/main/Collection_Growth_Curve.ipynb

Update The Variables In The First Code Cell
● Collection ID for the public Archive-It collection
● Name of the web page text derivative file
● The type of file compression used for the downloaded file
● The URL needed to download the compressed derivative file

Run The Second Code Cell
When certain Python modules need to be upgraded, the runtime needs to be restarted

Restart The Runtime And Run All
This step must be done after the second code cell has finished executing

The Collection Growth Curve Will Be Displayed At The Bottom Of The Notebook
Common reasons why seeds are missing:
● New seeds could have been added to the collection after the text derivative was created
● A seed may not have any captures
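One way to see which seeds are affected is to compare the collection's seed list with the URLs that appear in the derivative. The sketch below is only an illustration with made-up URIs. The function name seeds_without_mementos is hypothetical, and it compares URLs as exact strings, so differences such as a trailing slash or http versus https would need to be normalized first.

```python
def seeds_without_mementos(seed_uris, derivative_urls):
    """Return the seeds that never appear among the derivative's URLs."""
    captured = set(derivative_urls)
    return sorted(set(seed_uris) - captured)


# Hypothetical seed list (for example from AIU) and derivative URL column.
seeds = ["http://example.com/", "http://example.org/", "http://example.net/"]
captured_urls = ["http://example.com/", "http://example.com/", "http://example.net/"]

missing = seeds_without_mementos(seeds, captured_urls)
print(missing)  # ['http://example.org/']
```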

Useful Resources
● Archives Unleashed Toolkit Documentation
○ https://archivesunleashed.org/aut/
○ https://aut.docs.archivesunleashed.org/docs/home
● AUT Docker Image (https://hub.docker.com/r/archivesunleashed/docker-aut)
● DataFrame Schemas (https://aut.docs.archivesunleashed.org/docs/dataframe-schemas)
● DataFrame Filters (https://aut.docs.archivesunleashed.org/docs/filters-df)
● DataFrame Results (https://aut.docs.archivesunleashed.org/docs/df-results)
● RDD Filters (https://aut.docs.archivesunleashed.org/docs/filters-rdd)
● Apache Spark Documentation (https://spark.apache.org/docs/latest/)
● Hypercane (https://oduwsdl.github.io/hypercane/)
● Hypercane Documentation (https://hypercane.readthedocs.io/en/latest/)
● Hypercane Blog Post Part 2 (https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html)
● Working With Archives Unleashed Cloud (https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html)
