CDX Summary: Web Archival Collection Insights
Sawood Alam and Mark Graham
Wayback Machine, Internet Archive, San Francisco, CA 94118, USA
{sawood,mark}@archive.org
@ibnesayeed @waybackmachine @internetarchive
TPDL, September 21, 2022, Padua, Italy
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Internet Archive Collections
Collections of various media types
Metadata Rich Collections
Distinguishable cover art/thumbnails
Title, authors, year, and other metadata
Web Collection Items Listing
Generic thumbnails
Similar titles
Web Collection Metadata
Limited metadata
How many captures, URLs, domains, …? What else might be reported here?
Web Collection CDX Summary
Now, some IA web collections include a summary of their archival contents
Rendered HTML vs. Source Code
HTTP Response vs. WARC Record
HTTP headers
Payload
WARC headers
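The relationship between the three labels above can be sketched in Python: a WARC response record is a block of WARC headers wrapped around the captured HTTP headers and payload. This is a minimal sketch, assuming a reduced header set (real records usually carry more fields, such as digests); the function name `wrap_http_response` and the sample payload are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def wrap_http_response(target_uri: str, http_response: bytes, captured_at: datetime) -> bytes:
    """Wrap a raw HTTP response (headers + payload) in a minimal WARC response record."""
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {captured_at.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        # Content-Length covers the whole HTTP message: HTTP headers plus payload.
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + http_response + b"\r\n\r\n"


# Hypothetical HTTP response; the payload is made up for illustration.
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<h1>Hello DWeb</h1>"
)
record = wrap_http_response(
    "https://cs.odu.edu/~salam/dweb/",
    http_response,
    datetime(2018, 8, 2, 1, 20, 13, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```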
Capture Index (CDX)
edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc
edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc
● URL Key (SURT)
● Datetime
● Original URL
● Media Type
● Status Code
● Digest
● Length
● Offset
● WARC File Path
https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
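The nine fields listed above map directly onto the space-separated columns of each CDX line. The following is a minimal sketch of reading one line, assuming exactly this nine-column layout; `CdxRecord`, `parse_cdx_line`, and `simple_surt` are hypothetical names, and `simple_surt` handles only the host and path (real SURT canonicalization has more rules). All values stay as strings because a CDX column may hold a placeholder instead of a number.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str     # URL Key (SURT)
    datetime: str   # 14-digit capture datetime (YYYYMMDDhhmmss)
    original: str   # Original URL
    mimetype: str   # Media Type
    status: str     # Status Code
    digest: str     # Digest
    length: str     # Length
    offset: str     # Offset
    filename: str   # WARC File Path


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one CDX line into its nine named fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return CdxRecord(*fields)


def simple_surt(host: str, path: str) -> str:
    """Simplified URL key: reversed, comma-joined host labels, then ')' and the path."""
    return ",".join(reversed(host.lower().split("."))) + ")" + path


# First sample line from the slide; the digest is truncated there and stays truncated here.
line = (
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc"
)
rec = parse_cdx_line(line)
print(rec.mimetype, rec.status, rec.filename)
print(simple_surt("cs.odu.edu", "/~salam/dweb/") == rec.urlkey)
```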
CDX Summary: CLI Installation and Usage
$ pip install cdxsummary
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]

Summarize web archive capture index (CDX) files.

positional arguments:
  input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')

optional arguments:
  -h, --help            show this help message and exit
  -a [QUERY], --api [QUERY]
                        CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
  -i, --item            Treat the input argument as a Petabox item identifier instead of a file path
  -j, --json            Generate summary in JSON format
  -l, --load            Load JSON report instead of CDX
  -o [FILE], --out [FILE]
                        Write output to the given file (default: STDOUT)
  -r, --report          Generate non-summarized JSON report
  -s [N], --samples [N]
                        Number of sample memento URLs in summary (default: 10)
  -t [N], --tophosts [N]
                        Number of hosts with maximum captures in summary (default: 10)
  -v, --version         Show version number
https://github.com/internetarchive/cdx-summary/
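When the tool is driven from a script, the command line can be assembled from the options in the help text above. This is a rough sketch: the flag names come from that help text, while the helper `build_cdxsummary_command`, its parameter names, and the choice to place options before the input are assumptions. It only builds the argument list and does not run the tool.

```python
import shlex
from typing import List, Optional


def build_cdxsummary_command(
    input_path: str,
    as_json: bool = False,
    item: bool = False,
    out: Optional[str] = None,
    samples: Optional[int] = None,
    tophosts: Optional[int] = None,
) -> List[str]:
    """Assemble a cdxsummary argument list from a few of the documented options."""
    cmd = ["cdxsummary"]
    if item:
        cmd.append("--item")
    if as_json:
        cmd.append("--json")
    if out is not None:
        cmd += ["--out", out]
    if samples is not None:
        cmd += ["--samples", str(samples)]
    if tophosts is not None:
        cmd += ["--tophosts", str(tophosts)]
    # Assumption: the positional input goes last, after all options.
    cmd.append(input_path)
    return cmd


cmd = build_cdxsummary_command("covid.cdx.gz", as_json=True, samples=5)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` once `cdxsummary` is installed.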
CDX Summary: JSON Rendition in a Web Component
$ cdxsummary --json covid.cdx.gz > covid.summary.json
https://github.com/internetarchive/cdx-summary/tree/main/webcomponent
<cdx-summary src="covid.summary.json"
type="collection"
name="COVID-19 Collection"
format="short"
thumbs="6"
playback="https://archive.example.com/memento/"
fold="thumbs samples">
</cdx-summary>
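For pages that embed summaries of many collections, the element above could be generated rather than written by hand. The sketch below assumes only the attribute names shown on the slide; the helper `render_cdx_summary_tag` is hypothetical and is not part of the cdx-summary package.

```python
from html import escape
from typing import Dict


def render_cdx_summary_tag(attributes: Dict[str, str]) -> str:
    """Render a <cdx-summary> element with HTML-escaped attribute values."""
    attrs = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attributes.items()
    )
    return f"<cdx-summary {attrs}></cdx-summary>"


# Attribute values copied from the slide's example.
tag = render_cdx_summary_tag(
    {
        "src": "covid.summary.json",
        "type": "collection",
        "name": "COVID-19 Collection",
        "format": "short",
        "thumbs": "6",
        "playback": "https://archive.example.com/memento/",
        "fold": "thumbs samples",
    }
)
print(tag)
```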
CDX Summary: Overview
CDX Summary: MIME Type and Status Code
Various number formats in tooltip
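A cross-tabulation of media type against status code can be computed in a single pass over the CDX records. The sketch below is an illustration, not the tool's own implementation; grouping status codes into classes such as "2XX" is an assumption, and `tally_mime_and_status` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_mime_and_status(captures: Iterable[Tuple[str, str]]) -> Counter:
    """Count captures per (media type, status class) pair, e.g. ("text/html", "2XX")."""
    counts: Counter = Counter()
    for mimetype, status in captures:
        # Assumption: non-numeric status values (such as "-") fall into "Other".
        status_class = f"{status[0]}XX" if status[:1].isdigit() else "Other"
        counts[(mimetype, status_class)] += 1
    return counts


# (media type, status code) pairs; the first two come from the sample CDX lines,
# the rest are made up for illustration.
captures = [
    ("text/html", "200"),
    ("text/css", "200"),
    ("text/html", "404"),
    ("text/html", "-"),
]
counts = tally_mime_and_status(captures)
for (mimetype, status_class), n in sorted(counts.items()):
    print(mimetype, status_class, n)
```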
CDX Summary: Path Segment and Query Parameter
Root URLs
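One way to bucket captures by path depth and number of query parameters is sketched below, with root URLs falling in the (0, 0) cell. This is a sketch using Python's standard URL parsing; how the tool itself treats trailing slashes or empty parameters is not stated in the slides, so those choices are assumptions, and `path_query_shape` is a hypothetical name.

```python
from collections import Counter
from typing import Tuple
from urllib.parse import parse_qsl, urlsplit


def path_query_shape(url: str) -> Tuple[int, int]:
    """Return (number of non-empty path segments, number of query parameters)."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    params = parse_qsl(parts.query, keep_blank_values=True)
    return len(segments), len(params)


# The example.com URLs are made up for illustration.
urls = [
    "https://example.com/",
    "https://cs.odu.edu/~salam/dweb/",
    "https://cs.odu.edu/~salam/dweb/style.css",
    "https://example.com/search?q=cdx&page=2",
]
shape_counts = Counter(path_query_shape(url) for url in urls)
print(shape_counts[(0, 0)], "root URL(s)")
```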
CDX Summary: Year and Month
CDX Summary: Top Hosts
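Ranking hosts by capture count amounts to tallying the host of each original URL and keeping the N largest, which mirrors the `--tophosts` option. A minimal sketch, assuming the host is taken from the original URL column; `top_hosts` is a hypothetical name and the `example.com` URL is made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def top_hosts(urls: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n hosts with the most captures, most frequent first."""
    counts: Counter = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip URLs without a parsable host
            counts[host] += 1
    return counts.most_common(n)


urls = [
    "https://cs.odu.edu/~salam/dweb/",
    "https://cs.odu.edu/~salam/dweb/style.css",
    "https://example.com/",
]
print(top_hosts(urls, 1))
```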
CDX Summary: Random HTML Capture Samples
Load samples
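A fixed number of random captures can be drawn from a CDX stream of unknown length in one pass with reservoir sampling. The slides do not say which sampling method the tool uses, so this is a swapped-in technique for illustration. The playback URL layout (prefix, then datetime, then original URL) is an assumption modelled on Wayback-style URLs, and both function names are hypothetical.

```python
import random
from typing import Iterable, List, Tuple

Capture = Tuple[str, str]  # (14-digit datetime, original URL)


def sample_captures(captures: Iterable[Capture], k: int, rng: random.Random) -> List[Capture]:
    """Reservoir sampling (Algorithm R): a uniform sample of up to k items in one pass."""
    reservoir: List[Capture] = []
    for i, capture in enumerate(captures):
        if i < k:
            reservoir.append(capture)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = capture
    return reservoir


def memento_url(playback_prefix: str, datetime14: str, original: str) -> str:
    """Build a playback URL, assuming a <prefix><datetime>/<original URL> layout."""
    return f"{playback_prefix}{datetime14}/{original}"


# Hypothetical HTML captures; only the first matches a sample CDX line from the slides.
captures = [
    ("20180802012013", "https://cs.odu.edu/~salam/dweb/"),
    ("20190101000000", "https://example.com/"),
    ("20200101000000", "https://example.com/a"),
    ("20210101000000", "https://example.com/b"),
    ("20220101000000", "https://example.com/c"),
]
samples = sample_captures(captures, 3, random.Random(42))
for datetime14, original in samples:
    print(memento_url("https://archive.example.com/memento/", datetime14, original))
```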
<CDX-Summary> Web Component Testing and Customization
https://internetarchive.github.io/cdx-summary/webcomponent/
HTML element attributes
Custom style variables
Copy custom output
Customized rendition/testing
Future Work
● Statistics on unique URIs
● Visualizations
● Scalability
● Incremental updates
● Heuristics-based highlights
Conclusions
● CLI tool for archival CDX files and API summarization
● Human and machine friendly reports (colorized text and JSON)
● Varying levels of detail
● Collection understanding and QA
● Efficient random URL sampling
● Custom web component (<cdx-summary> HTML element)
● Web component testing and customization
● Open source
○ https://github.com/internetarchive/cdx-summary
○ https://internetarchive.github.io/cdx-summary/webcomponent/
○ https://pypi.org/project/cdxsummary/
○ https://www.npmjs.com/package/@internetarchive/cdxsummary

CDX Summary: Web Archival Collection Insights

  • 1. CDX Summary: Web Archival Collection Insights Sawood Alam and Mark Graham Wayback Machine, Internet Archive, San Francisco, CA 94118, USA {sawood,mark}@archive.org @ibnesayeed @waybackmachine @internetarchive TPDL, September 21, 2022, Padua, Italy
  • 2. Internet Archive Collections: collections of various media types (slide footer: Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>)
  • 3. Metadata Rich Collections: distinguishable cover art/thumbnails; title, authors, year, and other metadata
  • 4. Web Collection Items Listing: generic thumbnails; similar titles
  • 5. Web Collection Metadata: limited metadata. How many captures, URLs, domains, …? What else might be reported here?
  • 6. Web Collection CDX Summary: some IA web collections now include a summary of their archival contents
  • 7. Rendered HTML vs. Source Code
  • 8. HTTP Response vs. WARC Record: HTTP headers, payload, WARC headers
  • 9. Capture Index (CDX)
       edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc
       edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc
       Fields, in column order:
       ● URL Key (SURT)
       ● Datetime
       ● Original URL
       ● Media Type
       ● Status Code
       ● Digest
       ● Length
       ● Offset
       ● WARC File Path
       https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
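The nine space-separated columns on slide 9 can be split into named fields with a few lines of Python. This is a minimal sketch, not code from `cdxsummary`: the names `CdxRecord`, `parse_cdx_line`, and `capture_time` are hypothetical, and it assumes every line carries exactly the nine fields listed on the slide. The truncated digest is kept as it appears on the slide.

```python
from collections import namedtuple
from datetime import datetime

# Attribute names mirror the nine columns on the slide; the identifiers
# themselves are illustrative choices, not names taken from cdxsummary.
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp original mimetype status digest length offset filename",
)


def parse_cdx_line(line):
    """Split one space-delimited CDX line into its nine fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    return CdxRecord(*parts)


def capture_time(record):
    """Convert the 14-digit datetime field (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(record.timestamp, "%Y%m%d%H%M%S")


# Second sample line from the slide (digest is truncated in the source).
line = (
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/style.css text/css 200 "
    "NOWM53D5… 427 922 hello-dweb.warc"
)
rec = parse_cdx_line(line)
print(rec.mimetype, rec.status, capture_time(rec).isoformat())
# prints: text/css 200 2018-08-02T01:20:13
```

All fields are left as strings here; a caller that needs byte arithmetic on `length` and `offset` would convert them with `int()`.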
  • 10. CDX Summary: CLI Installation and Usage
        $ pip install cdxsummary
        $ cdxsummary --help
        usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]

        Summarize web archive capture index (CDX) files.

        positional arguments:
          input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')

        optional arguments:
          -h, --help            show this help message and exit
          -a [QUERY], --api [QUERY]
                                CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
          -i, --item            Treat the input argument as a Petabox item identifier instead of a file path
          -j, --json            Generate summary in JSON format
          -l, --load            Load JSON report instead of CDX
          -o [FILE], --out [FILE]
                                Write output to the given file (default: STDOUT)
          -r, --report          Generate non-summarized JSON report
          -s [N], --samples [N]
                                Number of sample memento URLs in summary (default: 10)
          -t [N], --tophosts [N]
                                Number of hosts with maximum captures in summary (default: 10)
          -v, --version         Show version number

        https://github.com/internetarchive/cdx-summary/
  • 11. CDX Summary: JSON Rendition in a Web Component
        $ cdxsummary --json covid.cdx.gz > covid.summary.json

        <cdx-summary src="covid.summary.json"
                     type="collection"
                     name="COVID-19 Collection"
                     format="short"
                     thumbs="6"
                     playback="https://archive.example.com/memento/"
                     fold="thumbs samples">
        </cdx-summary>

        https://github.com/internetarchive/cdx-summary/tree/main/webcomponent
  • 12. CDX Summary: Overview
  • 13. CDX Summary: MIME Type and Status Code (various number formats in tooltip)
  • 14. CDX Summary: Path Segment and Query Parameter (root URLs)
  • 15. CDX Summary: Year and Month
  • 16. CDX Summary: Top Hosts
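The breakdowns named on slides 13, 15, and 16 (MIME type, status code, year and month, top hosts) amount to counting captures by a key derived from each CDX line. The following is a minimal sketch of that idea, assuming the nine-column layout from slide 9. The function name `summarize` and the keys of the returned dictionary are hypothetical and do not reflect the JSON schema that `cdxsummary` emits.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(cdx_lines, top_hosts=10):
    """Tally captures by MIME type, status code, year-month, and host."""
    mime, status, month, host = Counter(), Counter(), Counter(), Counter()
    for line in cdx_lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip header or malformed lines
        _key, ts, url, mimetype, code = fields[:5]
        mime[mimetype] += 1
        status[code] += 1
        month[f"{ts[:4]}-{ts[4:6]}"] += 1  # YYYYMMDDhhmmss -> YYYY-MM
        host[urlsplit(url).hostname or ""] += 1
    return {
        # Illustrative keys only; not the cdxsummary JSON layout.
        "captures": sum(mime.values()),
        "mime": dict(mime),
        "status": dict(status),
        "month": dict(month),
        "tophosts": host.most_common(top_hosts),
    }


# The two sample lines from slide 9 (digests truncated in the source).
sample = [
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc",
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 "
    "hello-dweb.warc",
]
result = summarize(sample)
print(result["captures"], result["tophosts"])
# prints: 2 [('cs.odu.edu', 2)]
```

The `top_hosts` parameter plays the same role as the `-t`/`--tophosts` option in the help text on slide 10, with the same default of 10.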
  • 17. CDX Summary: Random HTML Capture Samples (load samples)
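The slides do not say how the random HTML samples are drawn. One standard way to pick a fixed number of items uniformly from a stream in a single pass is reservoir sampling, sketched below as an illustration rather than as the tool's actual method. The names `sample_html_captures` and `memento_url` are hypothetical, and the default playback prefix assumes Wayback Machine style URLs of the form prefix, timestamp, slash, original URL.

```python
import random


def sample_html_captures(cdx_lines, k=10, rng=None):
    """Return up to k uniformly sampled text/html captures in one pass."""
    rng = rng or random.Random()
    reservoir = []
    seen = 0  # number of text/html captures encountered so far
    for line in cdx_lines:
        fields = line.split()
        if len(fields) != 9 or fields[3] != "text/html":
            continue
        seen += 1
        capture = (fields[1], fields[2])  # (timestamp, original URL)
        if len(reservoir) < k:
            reservoir.append(capture)
        else:
            # Keep the new item with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = capture
    return reservoir


def memento_url(capture, playback="https://web.archive.org/web/"):
    """Build a playback URL, assuming a <prefix><timestamp>/<url> scheme."""
    ts, url = capture
    return f"{playback}{ts}/{url}"


line = (
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc"
)
for capture in sample_html_captures([line], k=10):
    print(memento_url(capture))
# prints: https://web.archive.org/web/20180802012013/https://cs.odu.edu/~salam/dweb/
```

Memory use stays proportional to `k` regardless of the size of the CDX input, which is why this technique suits very large collections. The default `k=10` matches the default of the `-s`/`--samples` option shown on slide 10.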
  • 18. <CDX-Summary> Web Component Testing and Customization: HTML element attributes; custom style variables; copy custom output; customized rendition/testing. https://internetarchive.github.io/cdx-summary/webcomponent/
  • 19. Future Work
        ● Statistics on unique URIs
        ● Visualizations
        ● Scalability
        ● Incremental updates
        ● Heuristics-based highlights
  • 20. Conclusions
        ● CLI tool for archival CDX files and API summarization
        ● Human and machine friendly reports (colorized text and JSON)
        ● Varying levels of details
        ● Collection understanding and QA
        ● Efficient random URL sampling
        ● Custom web component (<cdx-summary> HTML element)
        ● Web component testing and customization
        ● Open source
          ○ https://github.com/internetarchive/cdx-summary
          ○ https://internetarchive.github.io/cdx-summary/webcomponent/
          ○ https://pypi.org/project/cdxsummary/
          ○ https://www.npmjs.com/package/@internetarchive/cdxsummary