A tutorial on how to create collection growth curves with a collection growth curve notebook.
Collection growth curve example:
https://twitter.com/TReid803/status/1329193051764494337
A collection growth curve shows when the seeds for an Archive-It collection were added (green line) and when the seed mementos were added (red line). The green line is associated with the seed curation process and the red line is associated with the crawling behavior.
Collection Growth Curve Notebook:
https://github.com/treid003/Collection-Growth-Curve-Notebook/blob/main/Collection_Growth_Curve.ipynb
Blog post:
https://ws-dl.blogspot.com/2020/11/2020-11-18-creating-collection-growth.html
Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane
1. Creating Collection Growth Curves
With Archives Unleashed Toolkit
And Hypercane
Travis Reid
Web Science and Digital Libraries Research Group
Old Dominion University
@TReid803 @WebSciDL @oducs
2. A Seed Is A URI Selected By An Archivist
Archive-It Collection: https://archive-it.org/collections/366
[Figure labels: Seeds]
3. A Memento Is An Archived Web Page And A TimeMap Is A List Of Mementos
Archive-It Collection: https://archive-it.org/collections/366
[Figure labels: Seeds; TimeMap: List Of Mementos; Mementos of a seed]
4. Examples Of Seed Mementos
Archive-It Collection: https://archive-it.org/collections/366
[Figure labels: Seeds; TimeMap: List Of Mementos; A Seed Memento; Mementos of a seed]
5. Collection Growth Curves
● A collection growth curve is used for gaining a better understanding of:
○ Seed curation
○ Crawling behavior
● “The Many Shapes of Archive-It” first applied the concept of collection growth curves to Archive-It collections
○ https://arxiv.org/abs/1806.06878
I created a Google Colab notebook that can be used to create collection growth curves.
[Figure: The Anatomy of a Collection Growth Curve. Source: Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878]
6. The Shape Of A Seed Line Depends On When The Seeds Are Added
Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
● The line will be near the upper left corner if most seeds are added early, like 900 seeds added in the first 20 days
● The line will be closer to the diagonal when seeds are added regularly, like 2 seeds added per day for 500 days
● The line will be near the lower right corner if most seeds are added later, like 850 seeds added during the last 30 days
Source: Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878
7. The Shape Of A Seed Memento Line Depends On When The Seed Mementos Are Added
Example: a collection with 1,000 seeds, 10,000 seed mementos, and a lifespan of 500 days
● The line will be near the upper left corner if most seed mementos are added early, like 8,000 seed mementos added during the first 40 days
● The line will be closer to the diagonal when seed mementos are added regularly, like 20 seed mementos added per day for 500 days
● The line will be near the lower right corner if most seed mementos are added later, like 9,500 seed mementos added during the last 50 days
Source: Shawn Jones et al., The Many Shapes of Archive-It, https://arxiv.org/abs/1806.06878
8. Growth Curves With Different Durations Can Be Compared Since The x-axis Is A Percentage
● Left end of the x-axis (0%): the beginning of the collection’s life
● Right end of the x-axis (100%): either the end of the collection’s life or the current time when the growth curve is created
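The percentage axes can be illustrated with a short Python sketch. This is not the notebook's code: the function name growth_curve and the example dates are assumptions made for illustration. It turns a list of dates (when seeds or seed mementos were added) into points where x is the percentage of the collection's lifespan and y is the percentage of items added so far.

```python
from datetime import date, timedelta


def growth_curve(added_dates, start, end):
    """Return (percent_of_lifespan, percent_of_items_added) points in time order.

    added_dates: dates on which seeds (or seed mementos) were added.
    start, end: first and last day of the collection's life (or today's date).
    """
    lifespan_days = (end - start).days
    if lifespan_days <= 0:
        raise ValueError("end must be later than start")
    total = len(added_dates)
    points = []
    for count, added in enumerate(sorted(added_dates), start=1):
        x = 100.0 * (added - start).days / lifespan_days  # % of lifespan elapsed
        y = 100.0 * count / total                         # % of items added so far
        points.append((x, y))
    return points


# Hypothetical collection with a lifespan of 500 days and 1,000 seeds.
start = date(2020, 1, 1)
end = start + timedelta(days=500)

# Simplified "added early" case: 900 seeds on day 20, the last 100 on the final day.
early = [start + timedelta(days=20)] * 900 + [end] * 100
early_curve = growth_curve(early, start, end)
print(early_curve[899])
```

In this example 90% of the seeds have been added by 4% of the lifespan, so the seed line sits near the upper left corner. Because both axes are percentages, a curve for a 500-day collection can be drawn next to one for a 5,000-day collection.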
9. Main Tools Used
● Docker
● Hypercane
● Archives Unleashed Toolkit (AUT)
● Archive-It Utilities (AIU)
10. Docker
Docker is used in the examples included in these slides because it makes it easier to install and set up the dependencies needed for AUT and Hypercane.
Docker Desktop: https://www.docker.com/products/docker-desktop
Source: www.docker.com/company/newsroom/media-resources
11. Hypercane
● When you do not own the collection, Hypercane is needed to get the WARCs for the collection
● If you already have WARC files for a collection, then you do not need to use Hypercane
● GitHub repository: https://github.com/oduwsdl/hypercane
● Hypercane blog post part 2: https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html
Source: github.com/oduwsdl/hypercane
[Diagram labels: Given collection ID for public Archive-It collection; Creates WARC files for the collection; Archives Unleashed Toolkit]
12. Owners of a collection can use AU Toolkit or AU Cloud to create the derivative needed for the growth curve notebook
AUT documentation: https://aut.docs.archivesunleashed.org/docs/home
Working With Archives Unleashed Cloud: https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html
[Flowchart. Yes/No decisions: User owns the collection; Process locally with AUT; Have WARC Files. Steps: Provide the collection ID; Archives Unleashed Cloud; Hypercane; Provide WARC files for the collection; Archives Unleashed Toolkit; Create web page text derivative]
13. Archive-It Utilities (AIU)
This tool is needed for getting seed metadata that cannot be determined from just the WARC files.
For more information: https://ws-dl.blogspot.com/2018/07/2018-07-03-extracting-metadata-from.html
GitHub repository: https://github.com/oduwsdl/archiveit_utilities
[Diagram labels: Given collection ID for public Archive-It collection; AIU; Extracts collection metadata from collection page on Archive-It]
14. Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
If you already have WARCs, then you can go directly to step 2
If you have a web page text derivative from Archives Unleashed Cloud, then you can go directly to step 3
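The two shortcuts above amount to a small decision rule, sketched here in Python. The names STEPS, first_step, and remaining_steps are hypothetical and exist only for this illustration; the step texts are copied from the slide.

```python
STEPS = [
    "Use Hypercane to create WARCs associated with a public Archive-It collection",
    "Create a web page text derivative with Archives Unleashed Toolkit",
    "Upload the compressed web page text derivative to Zenodo",
    "Use the collection growth curve notebook",
]


def first_step(has_text_derivative, has_warcs):
    """Return the 1-based number of the step to start from."""
    if has_text_derivative:
        return 3  # a derivative (for example from Archives Unleashed Cloud) skips steps 1 and 2
    if has_warcs:
        return 2  # WARCs already exist, so Hypercane is not needed
    return 1      # nothing yet: start by creating WARCs with Hypercane


def remaining_steps(has_text_derivative, has_warcs):
    """Return the step descriptions that still need to be carried out."""
    start = first_step(has_text_derivative, has_warcs)
    return STEPS[start - 1:]


for step in remaining_steps(has_text_derivative=False, has_warcs=True):
    print(step)
```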
15. Steps For Creating WARC Files With Hypercane
git clone https://github.com/oduwsdl/hypercane.git
cd hypercane
docker-compose run hypercane hc --help
mkdir ../hypercane_workspace
cp ./docker-compose.yml ../hypercane_workspace/docker-compose.yml
cd ../hypercane_workspace
docker-compose run hypercane hc synthesize warcs -i archiveit -a 4006 -o 4006_warcs
If you use this example to create WARC files, then only the collection ID and output directory need to be changed
If Docker Compose is installed, then this example should work on Windows (with PowerShell) and Unix systems
Docker Compose: https://docs.docker.com/compose/install/
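Since only the collection ID and output directory change between runs, the last command can be generated from those two values. The Python helper below is a hypothetical sketch (the name hypercane_warc_command is not part of Hypercane). It only builds the argument list from the flags used in the example above and does not run Docker.

```python
import shlex


def hypercane_warc_command(collection_id, output_dir=None):
    """Build the docker-compose command that asks Hypercane to synthesize WARCs.

    collection_id: ID of a public Archive-It collection, such as 4006.
    output_dir: directory for the WARC files; defaults to "<collection_id>_warcs".
    """
    if output_dir is None:
        output_dir = f"{collection_id}_warcs"
    return [
        "docker-compose", "run", "hypercane",
        "hc", "synthesize", "warcs",
        "-i", "archiveit",
        "-a", str(collection_id),
        "-o", output_dir,
    ]


command = hypercane_warc_command(4006)
print(shlex.join(command))
```

The list form can be passed to subprocess.run from inside the hypercane_workspace directory, which avoids quoting problems if the output directory contains spaces.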
16. Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
17. Steps For Creating A Web Page Text Derivative With AUT
1. Use Docker to launch an Apache Spark shell with AUT
2. Create web page text derivative file(s)
3. If there are multiple web page text derivative files, then combine the text derivative files into one file
18. Using AUT With Docker
Unix Example:
docker run --rm -it -v "/tr/hypercane_workspace/4006_warcs:/4006_warcs" archivesunleashed/docker-aut
Windows Example:
docker run --rm -it -v "C:\Users\tr\hypercane_workspace\4006_warcs:/4006_warcs" archivesunleashed/docker-aut
In the Windows example, the Unix file naming convention still applies to the container side of the volume mapping (/4006_warcs)
AUT Docker Image: https://hub.docker.com/r/archivesunleashed/docker-aut
19. Creating Derivatives Similar To Web Page Text Derivative: Scala DF
Make sure to use :paste before pasting the statements below
scala> :paste
import io.archivesunleashed._
import io.archivesunleashed.udfs._
// This path needs to be changed to create this derivative
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .webpages()
  .select($"crawl_date", removePrefixWWW(extractDomain($"url")).as("domain"),
    $"url", $"mime_type_web_server", $"mime_type_tika", $"language",
    removeHTML(removeHTTPHeader(($"content"))).alias("content"))
  // This path should also be updated
  .write.csv("/path/to/warcs/full-text-df/")
https://github.com/archivesunleashed/aut/blob/39fc370e814fa294545213e918529260dadae261/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L24
20. Combine All The Text Derivative Files
Example (Unix):
mkdir /path/to/warcs/full-text-df/Combined
cat /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell 7):
mkdir /path/to/warcs/full-text-df/Combined
Get-Content -Encoding utf8NoBOM /path/to/warcs/full-text-df/*.csv | sort > /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
The utf8NoBOM encoding makes it easy to read the file in Python, but is not available in older versions of PowerShell like version 5.1.
Update PowerShell: docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-windows
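If neither shell is convenient, the same combine step can be approximated in Python, which writes UTF-8 without a byte order mark by default. This is a sketch under assumptions: the function name combine_derivatives is hypothetical, all lines are held in memory, and Python sorts by code point, so the order may differ slightly from a locale-aware sort.

```python
from pathlib import Path


def combine_derivatives(derivative_dir, collection_id):
    """Merge every *.csv file in derivative_dir into one sorted CSV file.

    The result is written to <derivative_dir>/Combined/<collection_id>-fulltext.csv,
    mirroring the shell examples above.
    """
    derivative_dir = Path(derivative_dir)
    lines = []
    for part in sorted(derivative_dir.glob("*.csv")):
        with part.open(encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    lines.sort()

    combined_dir = derivative_dir / "Combined"
    combined_dir.mkdir(exist_ok=True)
    output = combined_dir / f"{collection_id}-fulltext.csv"
    # encoding="utf-8" writes no BOM; newline="\n" keeps Unix line endings on Windows.
    with output.open("w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    return output
```

A call such as combine_derivatives("/path/to/warcs/full-text-df", "4006") would create Combined/4006-fulltext.csv inside that directory.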
21. Creating Collection Growth Curves With
Archives Unleashed Toolkit And Hypercane
@TReid803 @WebSciDL
Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It
collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
Compress The Text Derivative
Example (Unix):
gzip -k /path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv
Example (Windows PowerShell):
Compress-Archive -Path \path\to\warcs\full-text-df\Combined\collectionID-fulltext.csv -DestinationPath \path\to\warcs\full-text-df\collectionID-fulltext.zip
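The same compression step can be sketched in Python, which behaves the same way on Unix and Windows. This is a minimal illustration rather than part of the original workflow: the helper name gzip_keep_original is hypothetical, and it imitates gzip -k by writing a .gz file next to the original and leaving the original in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_keep_original(source: str) -> Path:
    """Write source + ".gz" beside the original file and keep the original."""
    source_path = Path(source)
    target_path = source_path.with_name(source_path.name + ".gz")
    with source_path.open("rb") as raw, gzip.open(target_path, "wb") as packed:
        # Copy in chunks, because a full-text derivative can be large
        shutil.copyfileobj(raw, packed)
    return target_path
```

For example, gzip_keep_original("/path/to/warcs/full-text-df/Combined/collectionID-fulltext.csv") would produce collectionID-fulltext.csv.gz in the same directory.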
Go To Zenodo
https://zenodo.org/
Select Upload
Select New Upload
Publish The Record After Uploading Files
Files Cannot Be Modified After Publishing A Record
When Files Need To Be Changed Create A New Version
Published Upload
Copy A Derivative’s Link
Download The Derivative From Zenodo
The link from the previous step will be used in the collection growth curve notebook
Growth Curve Notebook: https://colab.research.google.com/drive/1xpas-80K3yygMsK8DnRE2l83jfpiO-xs
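To make the download step concrete, here is a minimal Python sketch of fetching a compressed derivative from a copied link and unpacking it. It is not the notebook's own code: the function name fetch_derivative and its parameters are hypothetical, and it assumes the archive is either a .gz file or a zip file holding a single CSV, matching the two compression examples shown earlier.

```python
import gzip
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_derivative(url: str, archive_name: str, compression: str, out_dir: str) -> Path:
    """Download a compressed derivative and return the path of the extracted CSV."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive_path = out / archive_name

    # Stream the download to disk instead of holding it in memory
    with urllib.request.urlopen(url) as response, archive_path.open("wb") as handle:
        shutil.copyfileobj(response, handle)

    if compression == "gzip":
        # Assumes archive_name ends in ".gz"; dropping it gives the CSV name
        csv_path = archive_path.with_suffix("")
        with gzip.open(archive_path, "rb") as packed, csv_path.open("wb") as handle:
            shutil.copyfileobj(packed, handle)
        return csv_path
    if compression == "zip":
        with zipfile.ZipFile(archive_path) as archive:
            member = archive.namelist()[0]  # assumes a single-file archive
            return Path(archive.extract(member, path=out))
    raise ValueError(f"unsupported compression type: {compression}")
```

The url argument would be the derivative link copied from the published Zenodo record.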
Steps For Creating Collection Growth Curves
1. Use Hypercane to create WARCs associated with a public Archive-It collection
2. Create a web page text derivative with Archives Unleashed Toolkit
3. Upload the compressed web page text derivative to Zenodo
4. Use the collection growth curve notebook
Go To Collection Growth Curve Notebook
https://github.com/treid003/Collection-Growth-Curve-Notebook/blob/main/Collection_Growth_Curve.ipynb
Update The Variables In The First Code Cell
● Collection ID for the public Archive-It collection
● Name of the web page text derivative file
● The type of file compression used for the downloaded file
● The URL needed to download the compressed derivative file
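As a rough picture of what these four values look like together, the sketch below groups them in a small Python data class. Every name here is hypothetical and is not taken from the notebook; the actual variable names are in the notebook's first code cell. The set of accepted compression types is also an assumption based on the gzip and zip examples shown earlier, and the URL is a placeholder.

```python
from dataclasses import dataclass

# Assumption: only the two formats produced in the compression step
SUPPORTED_COMPRESSION = {"gzip", "zip"}


@dataclass(frozen=True)
class NotebookSettings:
    collection_id: str    # ID of the public Archive-It collection
    derivative_name: str  # name of the web page text derivative file
    compression: str      # type of compression used for the downloaded file
    download_url: str     # URL needed to download the compressed derivative

    def __post_init__(self) -> None:
        if self.compression not in SUPPORTED_COMPRESSION:
            raise ValueError(f"unsupported compression type: {self.compression}")
        if not self.download_url.startswith(("http://", "https://")):
            raise ValueError("download_url should be an http(s) link")


settings = NotebookSettings(
    collection_id="collectionID",
    derivative_name="collectionID-fulltext.csv",
    compression="gzip",
    download_url="https://example.org/collectionID-fulltext.csv.gz",  # placeholder
)
```

Checking the values up front like this surfaces a mistyped compression type or link before any download is attempted.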
Run The Second Code Cell
When certain Python modules need to be upgraded, the runtime needs to be restarted
Restart The Runtime And Run All
This step must be done after the second code cell is finished executing
The Collection Growth Curve Will Be Displayed At The Bottom Of The Notebook
Common reasons why seeds are missing:
● New seeds could have been added to the collection after the text derivative was created
● A seed may not have any captures
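One way to check which seeds are affected is to compare a seed list against the URLs recorded in the derivative. The sketch below is an assumption about how such a check could be done, not a description of what the notebook does: the function name seeds_without_captures is hypothetical, the column position follows the order of the select statement used to create the derivative (crawl_date, domain, url, ...), and seeds are matched by exact string, so differences such as a trailing slash would need extra normalization.

```python
import csv
from typing import Iterable, Set

URL_COLUMN = 2  # crawl_date, domain, url, ... in the derivative's column order


def seeds_without_captures(derivative_csv: str, seeds: Iterable[str]) -> Set[str]:
    """Return the seed URLs that never appear in the derivative's url column."""
    # Page text can exceed the csv module's default field size limit
    csv.field_size_limit(2**31 - 1)
    captured = set()
    with open(derivative_csv, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) > URL_COLUMN:
                captured.add(row[URL_COLUMN])
    return {seed for seed in seeds if seed not in captured}
```

Any seed returned by this function either had no captures when the derivative was created or was added to the collection afterwards.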
Useful Resources
● Archives Unleashed Toolkit Documentation
○ https://archivesunleashed.org/aut/
○ https://aut.docs.archivesunleashed.org/docs/home
● AUT Docker Image (https://hub.docker.com/r/archivesunleashed/docker-aut)
● DataFrame Schemas (https://aut.docs.archivesunleashed.org/docs/dataframe-schemas)
● DataFrame Filters (https://aut.docs.archivesunleashed.org/docs/filters-df)
● DataFrame Results (https://aut.docs.archivesunleashed.org/docs/df-results)
● RDD Filters (https://aut.docs.archivesunleashed.org/docs/filters-rdd)
● Apache Spark Documentation (https://spark.apache.org/docs/latest/)
● Hypercane (https://oduwsdl.github.io/hypercane/)
● Hypercane Documentation (https://hypercane.readthedocs.io/en/latest/)
● Hypercane Blog Post Part 2 (https://ws-dl.blogspot.com/2020/06/2020-06-10-hypercane-part-2.html)
● Working With Archives Unleashed Cloud (https://ws-dl.blogspot.com/2020/07/2020-07-29-working-with-archives.html)