Creating Collection Growth Curves
With Archives Unleashed Toolkit
And Hypercane
Travis Reid
Web Science and Digital Libraries Research Group
Old Dominion University
@TReid803 @WebSciDL @oducs
A Seed Is A URI Selected By An Archivist
[Figure: Archive-It collection https://archive-it.org/collections/366, with its seeds labeled]
A Memento Is An Archived Web Page And A TimeMap Is A List Of Mementos
[Figure: Archive-It collection https://archive-it.org/collections/366, with labels for the seeds, the TimeMap (list of mementos), and the mementos of a seed]
Examples Of Seed Mementos
[Figure: Archive-It collection https://archive-it.org/collections/366, with labels for the seeds, the TimeMap (list of mementos), the mementos of a seed, and a single seed memento]
Collection Growth Curves
● A collection growth curve is used for gaining a better understanding of:
○ Seed curation
○ Crawling behavior
● “The Many Shapes of Archive-It” first applied the concept of collection growth curves to Archive-It collections
○ https://arxiv.org/abs/1806.06878
I created a Google Colab notebook that can be used to create collection growth curves.
[Figure: The Anatomy of a Collection Growth Curve. Source: Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878]
The Shape Of A Seed Line Depends On When The Seeds Are Added
Line will be near the upper left corner if most seeds are added early, like 900 seeds added in the first 20 days
Line will be closer to the diagonal when seeds are added regularly, like 2 seeds added per day for 500 days
Line will be near the lower right corner if most seeds are added later, like 850 seeds added during the last 30 days
Shawn Jones et al., The Many Shapes of Archive-It,
https://arxiv.org/abs/1806.06878
Collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
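The three shapes can be checked numerically. The following is a minimal Python sketch, not the notebook's code: the helper name `growth_curve` and the three synthetic schedules are assumptions built from the example numbers above (1,000 seeds, 500-day lifespan). Where the slide gives no dates for the remaining seeds, their placement is made up for illustration.

```python
from bisect import bisect_right

def growth_curve(added_days, lifespan_days, steps=10):
    """Return (percent of lifespan, percent of items added) points.

    added_days: day offset (0-based) at which each item was added.
    Hypothetical helper for illustration only.
    """
    days = sorted(added_days)
    total = len(days)
    points = []
    for i in range(steps + 1):
        pct_life = 100 * i / steps
        cutoff = lifespan_days * i / steps
        # Count the items added on or before the cutoff day
        added = bisect_right(days, cutoff)
        points.append((pct_life, 100 * added / total))
    return points

LIFESPAN = 500

# 900 seeds in the first 20 days (45 per day); the other 100 on the last day (assumed)
early = [day for day in range(20) for _ in range(45)] + [499] * 100
# 2 seeds per day for 500 days
steady = [day for day in range(500) for _ in range(2)]
# 150 seeds on day 0 (assumed), 850 seeds spread over the last 30 days
late = [0] * 150 + [470 + (i % 30) for i in range(850)]

for name, schedule in [("early", early), ("steady", steady), ("late", late)]:
    halfway = growth_curve(schedule, LIFESPAN)[5]  # point at 50% of the lifespan
    print(name, halfway)
```

At 50% of the lifespan the early schedule is already far above the diagonal, the steady schedule sits on it, and the late schedule is far below it.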
The Shape Of A Seed Memento Line Depends On When The Seed Mementos Are Added
Line will be near the upper left corner if most seed mementos are added early, like 8,000 seed mementos added during the first 40 days
Line will be closer to the diagonal when seed mementos are added regularly, like 20 seed mementos added per day for 500 days
Line will be near the lower right corner if most seed mementos are added later, like 9,500 seed mementos added during the last 50 days
Shawn Jones et al., The Many Shapes of Archive-It,
https://arxiv.org/abs/1806.06878
Collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
Growth Curves With Different Durations Can Be Compared Since x-axis Is A Percentage
0% on the x-axis: the beginning of the collection’s life
100% on the x-axis: either the end of the collection’s life or the current time when the growth curve is created
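A minimal sketch of that normalization, assuming the x-axis value is the elapsed fraction of the collection's life. The function name `percent_of_life` and both example collections are hypothetical, not taken from the notebook.

```python
from datetime import datetime, timedelta

def percent_of_life(timestamp, start, end):
    """Map a datetime to a 0-100 position on the x-axis.

    start: beginning of the collection's life.
    end: end of the collection's life, or the current time for a
         collection that is still active.
    """
    lifespan = (end - start).total_seconds()
    return 100 * (timestamp - start).total_seconds() / lifespan

# Two made-up collections with very different durations
short_start = datetime(2020, 1, 1)
short_end = short_start + timedelta(days=100)
long_start = datetime(2010, 1, 1)
long_end = long_start + timedelta(days=1000)

# Day 25 of 100 and day 250 of 1,000 land at the same x position
a = percent_of_life(short_start + timedelta(days=25), short_start, short_end)
b = percent_of_life(long_start + timedelta(days=250), long_start, long_end)
print(a, b)
```

Because both values are percentages, the two curves can be drawn on the same axes even though one collection ran ten times longer than the other.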
Main Tools Used
● Docker
● Hypercane
● Archives Unleashed Toolkit (AUT)
● Archive-It Utilities (AIU)
Docker
Docker is used in the examples included in these slides because it makes it easier to install and set up the dependencies needed for AUT and Hypercane.
Docker Desktop: https://www.docker.com/products/docker-desktop
[Image source: www.docker.com/company/newsroom/media-resources]
Hypercane
● When you do not own the collection, Hypercane is needed to get the WARCs for the collection
● If you already have WARC files for a collection, then you do not need to use Hypercane
● GitHub repository: https://github.com/oduwsdl/hypercane
● Hypercane blog post part 2: https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html
[Diagram: given the collection ID for a public Archive-It collection, Hypercane creates WARC files for the collection, which are passed to the Archives Unleashed Toolkit. Image source: github.com/oduwsdl/hypercane]
Owners of a collection can use AU Toolkit or AU Cloud to create the derivative needed for the growth curve notebook
AUT documentation: https://aut.docs.archivesunleashed.org/docs/home
Working With Archives Unleashed Cloud: https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html
[Flowchart with the yes/no decisions "User owns the collection", "Have WARC Files", and "Process locally with AUT". Its paths run through Hypercane (provide the collection ID), Archives Unleashed Toolkit (provide WARC files for the collection), or Archives Unleashed Cloud, and end at "Create web page text derivative"]
Archive-It Utilities (AIU)
This tool is needed for getting seed metadata that cannot be determined from the WARC files alone.
For more information: https://ws-dl.blogspot.com/2018/07/2018-07-03-extracting-metadata-from.html
GitHub repository: https://github.com/oduwsdl/archiveit_utilities
[Diagram: given the collection ID for a public Archive-It collection, AIU extracts collection metadata from the collection page on Archive-It]
Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
If you already have WARCs, then you can go directly to step 2
If you have a web page text derivative from Archives Unleashed Cloud, then you can go directly to step 3
Steps For Creating WARC Files With Hypercane
git clone https://github.com/oduwsdl/hypercane.git
cd hypercane
docker-compose run hypercane hc --help
mkdir ../hypercane_workspace
cp ./docker-compose.yml ../hypercane_workspace/docker-compose.yml
cd ../hypercane_workspace
docker-compose run hypercane hc synthesize warcs -i archiveit -a 4006 -o 4006_warcs
If you use this example to create WARC files, then only the collection ID and output directory need to be changed
If Docker Compose is installed, then this example should work on Windows (with PowerShell) and Unix systems
Docker Compose: https://docs.docker.com/compose/install/
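Since only the collection ID and output directory change between runs, the last command can be generated rather than retyped. The sketch below is a hypothetical Python helper (`hypercane_warc_command` is not part of Hypercane); it only rebuilds the command shown above, and the default output directory name is an assumption modeled on `4006_warcs`.

```python
import shlex

def hypercane_warc_command(collection_id, output_dir=None):
    """Build the argument list for the WARC synthesis command.

    Hypothetical helper: it reproduces the command from the slide and
    changes only the collection ID and the output directory.
    """
    # Assumed default, following the "<collection ID>_warcs" pattern above
    output_dir = output_dir or f"{collection_id}_warcs"
    return [
        "docker-compose", "run", "hypercane",
        "hc", "synthesize", "warcs",
        "-i", "archiveit",
        "-a", str(collection_id),
        "-o", output_dir,
    ]

cmd = hypercane_warc_command(4006)
print(shlex.join(cmd))
# To execute it, run subprocess.run(cmd, check=True) from the
# hypercane_workspace directory that holds docker-compose.yml.
```

Keeping the command as a list avoids shell quoting problems if the output directory ever contains spaces.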
Steps For Creating A Web Page Text Derivative With AUT
1. Use Docker to launch an Apache Spark shell with AUT
2. Create web page text derivative file(s)
3. If there are multiple web page text derivative files, then combine the text derivative files into one file
Using AUT With Docker
Unix Example:
docker run --rm -it -v "/tr/hypercane_workspace/4006_warcs:/4006_warcs" archivesunleashed/docker-aut
Windows Example:
docker run --rm -it -v "C:\Users\tr\hypercane_workspace\4006_warcs:/4006_warcs" archivesunleashed/docker-aut
The part of the -v argument after the last colon (/4006_warcs) is the path inside the container, so the Unix file naming convention applies to it on both systems
AUT Docker Image: https://hub.docker.com/r/archivesunleashed/docker-aut
Creating Derivatives Similar To Web Page Text Derivative: Scala DF
Make sure to use :paste before pasting the statements below, then press Ctrl+D to run them
scala> :paste
import io.archivesunleashed._
import io.archivesunleashed.udfs._

// Change this path to the location of the WARC files
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .webpages()
  .select($"crawl_date", removePrefixWWW(extractDomain($"url")).as("domain"),
    $"url", $"mime_type_web_server", $"mime_type_tika", $"language",
    removeHTML(removeHTTPHeader($"content")).alias("content"))
  // This output path should also be updated
  .write.csv("/path/to/warcs/full-text-df/")
Source: https://github.com/archivesunleashed/aut/blob/39fc370e814fa294545213e918529260dadae261/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L24
Combine All The Text Derivative Files
Example (Unix):
mkdir /path/to/warcs/full-text-df/Combined
cat /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell 7):
mkdir /path/to/warcs/full-text-df/Combined
Get-Content -Encoding utf8NoBOM /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
The utf8NoBOM encoding makes it easy to read the file in Python, but it is not available in older versions of PowerShell like version 5.1.
Update PowerShell: docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-windows
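Where neither shell is convenient, the same merge can be sketched in Python, which always writes UTF-8 without a BOM here. This is an illustrative alternative, not part of the original workflow: `combine_csv_parts` is a hypothetical name, the demonstration files are made up, and Python sorts by code point, so the line order may differ from a locale-aware `sort`.

```python
import tempfile
from pathlib import Path

def combine_csv_parts(derivative_dir, output_name):
    """Merge every *.csv part file in derivative_dir into one sorted file."""
    derivative_dir = Path(derivative_dir)
    lines = []
    # glob("*.csv") is not recursive, so the Combined folder is never re-read
    for part in sorted(derivative_dir.glob("*.csv")):
        with part.open(encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    combined_dir = derivative_dir / "Combined"
    combined_dir.mkdir(exist_ok=True)
    output_path = combined_dir / output_name
    with output_path.open("w", encoding="utf-8", newline="\n") as handle:
        for line in sorted(lines):
            handle.write(line + "\n")
    return output_path

# Small demonstration with two made-up part files
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "part-0.csv").write_text("20200102,example.org\n", encoding="utf-8")
    Path(workdir, "part-1.csv").write_text("20200101,example.com\n", encoding="utf-8")
    combined = combine_csv_parts(workdir, "collectionID-fulltext.csv")
    print(combined.read_text(encoding="utf-8"))
```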
Compress The Text Derivative
Example (Unix):
gzip -k /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell):
Compress-Archive -Path \path\to\warcs\full-text-df\Combined\collectionID-fulltext.csv -DestinationPath \path\to\warcs\full-text-df\collectionID-fulltext.zip
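A platform-independent way to get the same result as `gzip -k` is Python's standard `gzip` module. The sketch below is an assumption-labeled alternative, not a step from the slides; `gzip_keep_original` is a hypothetical helper name and the demonstration file is made up.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_keep_original(csv_path):
    """Write <file>.gz next to the input and leave the input in place,
    which mirrors what `gzip -k` does."""
    csv_path = Path(csv_path)
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    with csv_path.open("rb") as source, gzip.open(gz_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return gz_path

# Small demonstration with a made-up derivative file
with tempfile.TemporaryDirectory() as workdir:
    derivative = Path(workdir, "collectionID-fulltext.csv")
    derivative.write_text("20200101,example.com\n", encoding="utf-8")
    compressed = gzip_keep_original(derivative)
    print(compressed.name, derivative.exists())
```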
Go To Zenodo
https://zenodo.org/
Select Upload
Select New Upload
Publish The Record After Uploading Files
Files Cannot Be Modified After Publishing A Record
When Files Need To Be Changed Create A New Version
Published Upload
Copy A Derivative’s Link
Download The Derivative From Zenodo
The link from the previous step will be used in the collection growth curve notebook
Growth Curve Notebook: https://colab.research.google.com/drive/1xpas-80K3yygMsK8DnRE2l83jfpiO-xs
Go To Collection Growth Curve Notebook
https://github.com/treid003/Collection-Growth-Curve-Notebook/blob/main/Collection_Growth_Curve.ipynb
Update The Variables In The First Code Cell
The variables to update are:
● Collection ID for the public Archive-It collection
● Name of the web page text derivative file
● The type of file compression used for the downloaded file
● The URL needed to download the compressed derivative file
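The compression type matters because a gzip file and a zip archive are unpacked differently. The sketch below illustrates that branch only. It is not the notebook's code: the function name, its parameters, and the accepted values "gzip" and "zip" are all assumptions, and the notebook's real variable names may differ.

```python
import gzip
import shutil
import tempfile
import zipfile
from pathlib import Path

def extract_derivative(archive_path, compression, derivative_name, out_dir):
    """Unpack a downloaded derivative according to its compression type.

    compression: "gzip" or "zip" (assumed values for illustration).
    derivative_name: name of the CSV file to produce or pull from the zip.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / derivative_name
    if compression == "gzip":
        with gzip.open(archive_path, "rb") as source, target.open("wb") as dest:
            shutil.copyfileobj(source, dest)
    elif compression == "zip":
        with zipfile.ZipFile(archive_path) as archive:
            with archive.open(derivative_name) as source, target.open("wb") as dest:
                shutil.copyfileobj(source, dest)
    else:
        raise ValueError(f"unsupported compression type: {compression!r}")
    return target

# Small demonstration with a made-up zip archive
with tempfile.TemporaryDirectory() as workdir:
    archive_file = Path(workdir, "collectionID-fulltext.zip")
    with zipfile.ZipFile(archive_file, "w") as archive:
        archive.writestr("collectionID-fulltext.csv", "20200101,example.com\n")
    csv_file = extract_derivative(archive_file, "zip", "collectionID-fulltext.csv",
                                  Path(workdir, "extracted"))
    print(csv_file.read_text(encoding="utf-8"))
```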
Run The Second Code Cell
When certain Python modules need to be upgraded, the runtime needs to be restarted
Restart The Runtime And Run All
This step must be done after the second code cell has finished executing
The Collection Growth Curve Will Be Displayed At The Bottom Of The Notebook
Common reasons why seeds are missing:
● New seeds could have been added to the collection after the text derivative was created
● A seed may not have any captures
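One way to see which seeds are affected is to compare the seed list from the collection metadata with the URLs present in the text derivative. The sketch below is a hypothetical check, not the notebook's method: `missing_seeds` and the example URLs are made up, and it assumes exact string matches, whereas real data may need URL normalization first.

```python
def missing_seeds(collection_seeds, derivative_urls):
    """Return the seeds that never appear among the derivative's URLs.

    Hypothetical helper: compares URLs as exact strings.
    """
    captured = set(derivative_urls)
    return sorted(seed for seed in set(collection_seeds) if seed not in captured)

# Made-up example: one seed was added after the derivative was created
seeds = ["http://example.com/", "http://example.org/", "http://example.net/new"]
urls_in_derivative = ["http://example.com/", "http://example.org/",
                      "http://example.org/about"]
print(missing_seeds(seeds, urls_in_derivative))
```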
Useful Resources
● Archives Unleashed Toolkit Documentation
○ https://archivesunleashed.org/aut/
○ https://aut.docs.archivesunleashed.org/docs/home
● AUT Docker Image (https://hub.docker.com/r/archivesunleashed/docker-aut)
● DataFrame Schemas (https://aut.docs.archivesunleashed.org/docs/dataframe-schemas)
● DataFrame Filters (https://aut.docs.archivesunleashed.org/docs/filters-df)
● DataFrame Results (https://aut.docs.archivesunleashed.org/docs/df-results)
● RDD Filters (https://aut.docs.archivesunleashed.org/docs/filters-rdd)
● Apache Spark Documentation (https://spark.apache.org/docs/latest/)
● Hypercane (https://oduwsdl.github.io/hypercane/)
● Hypercane Documentation (https://hypercane.readthedocs.io/en/latest/)
● Hypercane Blog Post Part 2 (https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html)
● Working With Archives Unleashed Cloud (https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html)

More Related Content

Similar to Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane

The Impact of Bibframe
The Impact of BibframeThe Impact of Bibframe
The Impact of Bibframe
Thomas Meehan
 
Digital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea PresentationDigital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea Presentation
Ian Mulvany
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked Data
Richard Wallis
 
Wikimedia Game Jam 20015: Wikimedia APIs
Wikimedia Game Jam 20015: Wikimedia APIsWikimedia Game Jam 20015: Wikimedia APIs
Wikimedia Game Jam 20015: Wikimedia APIs
Lucie-Aimée Kaffee
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3Essam Obaid
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
Richard Wallis
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
Sawood Alam
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
Ankit Solanki
 
Informal presentation about RES
Informal presentation about RESInformal presentation about RES
Informal presentation about RES
Christophe Guéret
 
Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...
Anna Perricci
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael Nelson
 
Applied marine science 2017
Applied marine science 2017Applied marine science 2017
Applied marine science 2017
UCT
 
Open Design Definition @ Fab* @ Future Everything
Open Design Definition @ Fab* @ Future EverythingOpen Design Definition @ Fab* @ Future Everything
Open Design Definition @ Fab* @ Future Everything
Massimo Menichinelli
 
My First Hadoop Program !!!
My First Hadoop Program !!!My First Hadoop Program !!!
My First Hadoop Program !!!
Ayapparaj SKS
 
Research data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciencesResearch data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciences
Blue BRIDGE
 
Biological Science Honours class of 2017
Biological Science Honours class of 2017Biological Science Honours class of 2017
Biological Science Honours class of 2017
UCT
 
Wikipedia Day 2011 Talk
Wikipedia Day 2011 TalkWikipedia Day 2011 Talk
Wikipedia Day 2011 Talk
Mark Reynolds
 
Introduction to Omeka
Introduction to OmekaIntroduction to Omeka
Introduction to Omeka
Shawn Day
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
Ahmed AlSum
 
LOCAH Project and Considerations of Linked Data Approaches
LOCAH Project and Considerations of Linked Data ApproachesLOCAH Project and Considerations of Linked Data Approaches
LOCAH Project and Considerations of Linked Data Approaches
Adrian Stevenson
 

Similar to Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane (20)

The Impact of Bibframe
The Impact of BibframeThe Impact of Bibframe
The Impact of Bibframe
 
Digital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea PresentationDigital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea Presentation
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked Data
 
Wikimedia Game Jam 20015: Wikimedia APIs
Wikimedia Game Jam 20015: Wikimedia APIsWikimedia Game Jam 20015: Wikimedia APIs
Wikimedia Game Jam 20015: Wikimedia APIs
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
 
Informal presentation about RES
Informal presentation about RESInformal presentation about RES
Informal presentation about RES
 
Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Applied marine science 2017
Applied marine science 2017Applied marine science 2017
Applied marine science 2017
 
Open Design Definition @ Fab* @ Future Everything
Open Design Definition @ Fab* @ Future EverythingOpen Design Definition @ Fab* @ Future Everything
Open Design Definition @ Fab* @ Future Everything
 
My First Hadoop Program !!!
My First Hadoop Program !!!My First Hadoop Program !!!
My First Hadoop Program !!!
 
Research data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciencesResearch data catalogues and data interoperability in life sciences
Research data catalogues and data interoperability in life sciences
 
Biological Science Honours class of 2017
Biological Science Honours class of 2017Biological Science Honours class of 2017
Biological Science Honours class of 2017
 
Wikipedia Day 2011 Talk
Wikipedia Day 2011 TalkWikipedia Day 2011 Talk
Wikipedia Day 2011 Talk
 
Introduction to Omeka
Introduction to OmekaIntroduction to Omeka
Introduction to Omeka
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
LOCAH Project and Considerations of Linked Data Approaches
LOCAH Project and Considerations of Linked Data ApproachesLOCAH Project and Considerations of Linked Data Approaches
LOCAH Project and Considerations of Linked Data Approaches
 

Recently uploaded

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
QuickwayInfoSystems3
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 

Recently uploaded (20)

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 

Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane

  • 1. Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane Travis Reid Web Science and Digital Libraries Research Group Old Dominion University @TReid803 @WebSciDL @oducs
  • 2. A Seed Is A URI Selected By An Archivist
      Archive-It Collection: https://archive-it.org/collections/366
      Figure labels: Seeds
  • 3. A Memento Is An Archived Web Page And A TimeMap Is A List Of Mementos
      Archive-It Collection: https://archive-it.org/collections/366
      Figure labels: Seeds; TimeMap: List Of Mementos; Mementos of a seed
  • 4. Examples Of Seed Mementos
      Archive-It Collection: https://archive-it.org/collections/366
      Figure labels: Seeds; TimeMap: List Of Mementos; A Seed Memento; Mementos of a seed
  • 5. Collection Growth Curves
      ● A collection growth curve is used for gaining a better understanding of:
          ○ Seed curation
          ○ Crawling behavior
      ● “The Many Shapes of Archive-It” first applied the concept of collection growth curves to Archive-It collections
          ○ https://arxiv.org/abs/1806.06878
      I created a Google Colab notebook that can be used to create collection growth curves.
      Figure: The Anatomy of a Collection Growth Curve. Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878
  • 6. The Shape Of A Seed Line Depends On When The Seeds Are Added
      Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
      ● The line will be near the upper left corner if most seeds are added early, like 900 seeds added in the first 20 days
      ● The line will be closer to the diagonal when seeds are added regularly, like 2 seeds added per day for 500 days
      ● The line will be near the lower right corner if most seeds are added later, like 850 seeds added during the last 30 days
      Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878
  • 7. The Shape Of A Seed Memento Line Depends On When The Seed Mementos Are Added
      Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
      ● The line will be near the upper left corner if most seed mementos are added early, like 8,000 seed mementos added during the first 40 days
      ● The line will be closer to the diagonal when seed mementos are added regularly, like 20 seed mementos added per day for 500 days
      ● The line will be near the lower right corner if most seed mementos are added later, like 9,500 seed mementos added during the last 50 days
      Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878
  • 8. Growth Curves With Different Durations Can Be Compared Since The x-axis Is A Percentage
      ● The start of the x-axis is the beginning of the collection’s life
      ● The end of the x-axis is either the end of the collection’s life or the current time when the growth curve is created
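
The percentage axes described in slides 6 through 8 can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the growth curve notebook: the function name `growth_curve` is hypothetical, and it assumes the collection's life runs from the first to the last date supplied unless `start` and `end` are given.

```python
from datetime import date, timedelta


def growth_curve(event_dates, start=None, end=None):
    """Return (percent of collection life, percent of items added) points.

    event_dates holds one date per seed (seed line) or one date per
    seed memento (seed memento line). Assumption: the collection's life
    spans the earliest to the latest date unless start/end are given.
    """
    dates = sorted(event_dates)
    if not dates:
        return []
    start = start or dates[0]
    end = end or dates[-1]
    lifespan = (end - start).days or 1  # avoid dividing by zero
    total = len(dates)
    points = []
    for count, day in enumerate(dates, start=1):
        x = 100.0 * (day - start).days / lifespan  # percent of life elapsed
        y = 100.0 * count / total                  # percent of items added
        points.append((x, y))
    return points


# Slide 6 example: 900 of 1,000 seeds added in the first 20 days of a
# 500-day collection; the remaining 100 are placed on the last day here.
first_day = date(2020, 1, 1)
early = [first_day + timedelta(days=i % 20) for i in range(900)]
early += [first_day + timedelta(days=500)] * 100
early_points = growth_curve(early)
```

With these inputs the line reaches 90% of the seeds before 4% of the collection's life has passed, which is why it hugs the upper left corner. Because both axes are percentages, a 500-day collection and a 5,000-day collection produce directly comparable curves.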
  • 9. Main Tools Used
      ● Docker
      ● Hypercane
      ● Archives Unleashed Toolkit (AUT)
      ● Archive-It Utilities (AIU)
  • 10. Docker
      Docker is used in the examples included in these slides because it makes it easier to install and set up the dependencies needed for AUT and Hypercane.
      Docker Desktop: https://www.docker.com/products/docker-desktop
      Image source: www.docker.com/company/newsroom/media-resources
  • 11. Hypercane
      ● When you do not own the collection, Hypercane is needed to get the WARCs for the collection
      ● If you already have WARC files for a collection, then you do not need to use Hypercane
      ● GitHub repository: https://github.com/oduwsdl/hypercane
      ● Hypercane blog post part 2: https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html
      Diagram: given the collection ID for a public Archive-It collection, Hypercane creates WARC files for the collection, which then go to Archives Unleashed Toolkit
      Image source: github.com/oduwsdl/hypercane
  • 12. Owners of a collection can use AU Toolkit or AU Cloud to create the derivative needed for the growth curve notebook
      AUT documentation: https://aut.docs.archivesunleashed.org/docs/home
      Working With Archives Unleashed Cloud: https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html
      Flowchart (Yes/No decisions): User owns the collection; Process locally with AUT; Have WARC Files
      Flowchart (actions): Provide the collection ID (Hypercane); Provide WARC files for the collection (Archives Unleashed Toolkit); Archives Unleashed Cloud
      Flowchart (result): Create web page text derivative
  • 13. Archive-It Utilities (AIU)
      This tool is needed for getting seed metadata that cannot be determined from just the WARC files.
      For more information: https://ws-dl.blogspot.com/2018/07/2018-07-03-extracting-metadata-from.html
      GitHub repository: https://github.com/oduwsdl/archiveit_utilities
      Diagram: given the collection ID for a public Archive-It collection, AIU extracts collection metadata from the collection page on Archive-It
  • 14. Steps For Creating Collection Growth Curves
      1. Use Hypercane to create WARCs associated with a public Archive-It collection
      2. Create a web page text derivative with Archives Unleashed Toolkit
      3. Upload the compressed web page text derivative to Zenodo
      4. Use the collection growth curve notebook
      If you already have WARCs, then you can go directly to step 2
      If you have a web page text derivative from Archives Unleashed Cloud, then you can go directly to step 3
  • 15. Steps For Creating WARC Files With Hypercane
        git clone https://github.com/oduwsdl/hypercane.git
        cd hypercane
        docker-compose run hypercane hc --help
        mkdir ../hypercane_workspace
        cp ./docker-compose.yml ../hypercane_workspace/docker-compose.yml
        cd ../hypercane_workspace
        docker-compose run hypercane hc synthesize warcs -i archiveit -a 4006 -o 4006_warcs
      If you use this example to create WARC files, then only the collection ID (4006) and the output directory (4006_warcs) need to be changed
      If Docker Compose is installed, then this example should work on Windows (with PowerShell) and Unix systems
      Docker Compose: https://docs.docker.com/compose/install/
  • 16. Steps For Creating Collection Growth Curves (steps 1 to 4, as listed on slide 14)
  • 17. Steps For Creating A Web Page Text Derivative With AUT
      1. Use Docker to launch an Apache Spark shell with AUT
      2. Create web page text derivative file(s)
      3. If there are multiple web page text derivative files, then combine them into one file
  • 18. Using AUT With Docker
      Unix example:
        docker run --rm -it -v "/tr/hypercane_workspace/4006_warcs:/4006_warcs" archivesunleashed/docker-aut
      Windows example:
        docker run --rm -it -v "C:\Users\tr\hypercane_workspace\4006_warcs:/4006_warcs" archivesunleashed/docker-aut
      The Unix file-naming convention applies to the container-side part of the path (/4006_warcs), even on Windows
      AUT Docker Image: https://hub.docker.com/r/archivesunleashed/docker-aut
  • 19. Creating Derivatives Similar To Web Page Text Derivative: Scala DF
      Make sure to use :paste before pasting the statements below
        scala> :paste
        import io.archivesunleashed._
        import io.archivesunleashed.udfs._
        RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
          .webpages()
          .select($"crawl_date",
                  removePrefixWWW(extractDomain($"url")).as("domain"),
                  $"url",
                  $"mime_type_web_server",
                  $"mime_type_tika",
                  $"language",
                  removeHTML(removeHTTPHeader(($"content"))).alias("content"))
          .write.csv("/path/to/warcs/full-text-df/")
      The input path ("/path/to/warcs/*.gz") needs to be changed to create this derivative
      The output path ("/path/to/warcs/full-text-df/") should also be updated
      Source: https://github.com/archivesunleashed/aut/blob/39fc370e814fa294545213e918529260dadae261/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L24
  • 20. Combine All The Text Derivative Files
      Example (Unix):
        mkdir /path/to/warcs/full-text-df/Combined
        cat /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
      Example (Windows PowerShell 7):
        mkdir /path/to/warcs/full-text-df/Combined
        Get-Content -Encoding utf8NoBOM /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
      The utf8NoBOM encoding makes it easy to read the file in Python, but it is not available in older versions of PowerShell like version 5.1.
      Update PowerShell: docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-windows
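
For readers who prefer one script that behaves the same on Unix and Windows, the combine step can also be sketched in Python. This is an illustrative alternative to the shell commands on slide 20, not part of the original workflow: `combine_csv_parts` is a hypothetical helper, and it assumes the part files fit in memory.

```python
import csv
from pathlib import Path

# Page text can be long, so raise the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def combine_csv_parts(parts_dir, output_file):
    """Merge every *.csv part file in parts_dir into one sorted CSV file.

    Rows are parsed with the csv module, so a quoted field that contains
    line breaks stays in one record. Returns the number of rows written.
    """
    rows = []
    for part in sorted(Path(parts_dir).glob("*.csv")):
        with part.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.reader(handle))
    rows.sort()  # the first column is crawl_date, so this orders by date
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```

A call such as `combine_csv_parts("/path/to/warcs/full-text-df", "/path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv")` mirrors the paths used above. The output is written as UTF-8 without a byte order mark, which matches the intent of the `utf8NoBOM` option.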
  • 21. Steps For Creating Collection Growth Curves (steps 1 to 4, as listed on slide 14)
  • 22. Compress The Text Derivative
      Example (Unix):
        gzip -k /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
      Example (Windows PowerShell):
        Compress-Archive -Path \path\to\warcs\full-text-df\Combined\collectionID-fulltext.csv -DestinationPath \path\to\warcs\full-text-df\collectionID-fulltext.zip
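
The same compression can be done from Python when neither `gzip` nor `Compress-Archive` is convenient. The sketch below is an assumption-labelled equivalent of `gzip -k` (compress and keep the original); `gzip_keep` is a hypothetical helper name.

```python
import gzip
import shutil
from pathlib import Path


def gzip_keep(path):
    """Write path + '.gz' next to the original file and keep the original."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return target
```

For example, `gzip_keep("/path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv")` would produce `collectionID-fulltext.csv.gz` in the same directory.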
  • 23. Go To Zenodo: https://zenodo.org/
  • 24. Select Upload
  • 25. Select New Upload
  • 26. Publish The Record After Uploading Files
  • 27. Files Cannot Be Modified After Publishing A Record
  • 28. When Files Need To Be Changed Create A New Version
  • 29. Published Upload
  • 30. Copy A Derivative’s Link
  • 31. Download The Derivative From Zenodo
      The link from the previous step will be used in the collection growth curve notebook
      Growth Curve Notebook: https://colab.research.google.com/drive/1xpas-80K3yygMsK8DnRE2l83jfpiO-xs
  • 32. Steps For Creating Collection Growth Curves (steps 1 to 4, as listed on slide 14)
  • 33. Go To Collection Growth Curve Notebook
      https://github.com/treid003/Collection-Growth-Curve-Notebook/blob/main/Collection_Growth_Curve.ipynb
  • 34. Update The Variables In The First Code Cell
      ● Collection ID for the public Archive-It collection
      ● Name of the web page text derivative file
      ● The type of file compression used for the downloaded file
      ● The URL needed to download the compressed derivative file
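
To make the role of the file name and compression type concrete, here is a hedged sketch of how a downloaded derivative could be read in Python. It is not the notebook's code: `read_derivative` and `COLUMNS` are hypothetical names, and the column order is assumed to follow the `select` statement on slide 19, since Spark's `write.csv` does not write a header row by default.

```python
import csv
import gzip
import io
import zipfile

# Column order assumed from the select() call on slide 19.
COLUMNS = ["crawl_date", "domain", "url", "mime_type_web_server",
           "mime_type_tika", "language", "content"]

# Page text can be long, so raise the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def _rows(text_handle):
    """Yield one dict per well-formed row, skipping rows of the wrong width."""
    for row in csv.reader(text_handle):
        if len(row) == len(COLUMNS):
            yield dict(zip(COLUMNS, row))


def read_derivative(path, compression="gz"):
    """Read a compressed web page text derivative ('gz' or 'zip')."""
    if compression == "gz":
        with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
            yield from _rows(handle)
    elif compression == "zip":
        with zipfile.ZipFile(path) as archive:
            member = archive.namelist()[0]  # assumes one CSV per archive
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from _rows(text)
    else:
        raise ValueError(f"unsupported compression: {compression}")
```

The two branches correspond to the two compression examples on slide 22: `gzip` on Unix produces a `.gz` file, and `Compress-Archive` on Windows produces a `.zip` file.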
  • 35. Run The Second Code Cell
      When certain Python modules need to be upgraded, the runtime needs to be restarted
  • 36. Restart The Runtime And Run All
      This step must be done after the second code cell has finished executing
  • 37. The Collection Growth Curve Will Be Displayed At The Bottom Of The Notebook
      Common reasons why seeds are missing:
      ● New seeds could have been added to the collection after the text derivative was created
      ● A seed may not have any captures
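
One way to see which seeds are missing is to compare the seed list from the collection metadata with the URLs present in the derivative. The sketch below is an illustration only: `missing_seeds` is a hypothetical helper, and the URL normalization (trimming whitespace and a trailing slash) is a simplifying assumption that will not catch every equivalent URL form.

```python
def missing_seeds(collection_seeds, derivative_urls):
    """Return seeds that have no matching URL in the text derivative."""

    def normalize(url):
        # Simplistic assumption: ignore surrounding spaces and a final "/".
        return url.strip().rstrip("/")

    captured = {normalize(url) for url in derivative_urls}
    return sorted(seed for seed in collection_seeds
                  if normalize(seed) not in captured)
```

A seed returned by this function is either newer than the derivative or has no captures in it, which are the two reasons listed above.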
  • 38. Useful Resources
      ● Archives Unleashed Toolkit Documentation
          ○ https://archivesunleashed.org/aut/
          ○ https://aut.docs.archivesunleashed.org/docs/home
      ● AUT Docker Image (https://hub.docker.com/r/archivesunleashed/docker-aut)
      ● DataFrame Schemas (https://aut.docs.archivesunleashed.org/docs/dataframe-schemas)
      ● DataFrame Filters (https://aut.docs.archivesunleashed.org/docs/filters-df)
      ● DataFrame Results (https://aut.docs.archivesunleashed.org/docs/df-results)
      ● RDD Filters (https://aut.docs.archivesunleashed.org/docs/filters-rdd)
      ● Apache Spark Documentation (https://spark.apache.org/docs/latest/)
      ● Hypercane (https://oduwsdl.github.io/hypercane/)
      ● Hypercane Documentation (https://hypercane.readthedocs.io/en/latest/)
      ● Hypercane Blog Post Part 2 (https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html)
      ● Working With Archives Unleashed Cloud (https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html)