SlideShare a Scribd company logo
1 of 20
Download to read offline
CDX Summary:
Web Archival Collection Insights
Sawood Alam and Mark Graham
Wayback Machine, Internet Archive, San Francisco, CA 94118, USA
{sawood,mark}@archive.org
@ibnesayeed @waybackmachine @internetarchive
TPDL, September 21, 2022, Padua, Italy
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Internet Archive
Collections
2
Various media types
Collections of
various media types
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Metadata Rich
Collections
3
Distinguishable
cover art/thumbnails
Title, authors, year,
and other metadata
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Web Collection
Items Listing
4
Generic thumbnails
Similar titles
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Web Collection
Metadata
5
Limited metadata
How many captures,
URLs, domains, …? What else might be report here?
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Web Collection
CDX Summary
6
Now, some IA web
collections include
summary of their
archival contents
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Rendered HTML vs. Source Code
7
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
HTTP Response vs. WARC Record
8
HTTP headers
Payload
WARC headers
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> 9
Capture Index (CDX)
edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc
edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc
● URL Key (SURT)
● Datetime
● Original URL
● Media Type
● Status Code
● Digest
● Length
● Offset
● WARC File Path
https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: CLI Installation and Usage
10
$ pip install cdxsummary
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]
Summarize web archive capture index (CDX) files.
positional arguments:
input CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')
optional arguments:
-h, --help show this help message and exit
-a [QUERY], --api [QUERY]
CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
-i, --item Treat the input argument as a Petabox item identifier instead of a file path
-j, --json Generate summary in JSON format
-l, --load Load JSON report instead of CDX
-o [FILE], --out [FILE]
Write output to the given file (default: STDOUT)
-r, --report Generate non-summarized JSON report
-s [N], --samples [N]
Number of sample memento URLs in summary (default: 10)
-t [N], --tophosts [N]
Number of hosts with maximum captures in summary (default: 10)
-v, --version Show version number
https://github.com/internetarchive/cdx-summary/
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: JSON Rendition in a Web Component
11
$ cdxsummary --json covid.cdx.gz > covid.summary.json
https://github.com/internetarchive/cdx-summary/tree/main/webcomponent
<cdx-summary src="covid.summary.json"
type="collection"
name="COVID-19 Collection"
format="short"
thumbs="6"
playback="https://archive.example.com/memento/"
fold="thumbs samples">
</cdx-summary>
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: Overview
12
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: MIME Type and Status Code
13
Various number
formats in tooltip
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: Path Segment and Query Parameter
14
Root URLs
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: Year and Month
15
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: Top Hosts
16
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
CDX Summary: Random HTML Capture Samples
17
Load samples
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
<CDX-Summary> Web Component Testing and Customization
18
https://internetarchive.github.io/cdx-summary/webcomponent/
HTML element
attributes
Custom style
variables
Copy custom
output
Customized
rendition/testing
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Future Work
19
● Statistics on unique URIs
● Visualizations
● Scalability
● Incremental updates
● Heuristics-based highlights
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Conclusions
20
● CLI tool for archival CDX files and API summarization
● Human and machine friendly reports (colorized text and JSON)
● Varying levels of details
● Collection understanding and QA
● Efficient random URL sampling
● Custom web component (<cdx-summary> HTML element)
● Web component testing and customization
● Open source
○ https://github.com/internetarchive/cdx-summary
○ https://internetarchive.github.io/cdx-summary/webcomponent/
○ https://pypi.org/project/cdxsummary/
○ https://www.npmjs.com/package/@internetarchive/cdxsummary

More Related Content

Similar to CDX Summary: Web Archival Collection Insights

Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3adamsilverstein
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Cynthia Saracco
 
Sandboxing WebKitGTK (GUADEC 2019)
Sandboxing WebKitGTK (GUADEC 2019)Sandboxing WebKitGTK (GUADEC 2019)
Sandboxing WebKitGTK (GUADEC 2019)Igalia
 
Introduction to web technology and it's implementation
Introduction to web technology and it's implementationIntroduction to web technology and it's implementation
Introduction to web technology and it's implementationSureshSingh142
 
How Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fastHow Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fastAtlassian
 
Orientation to Contentdm QuickStart for AMICAL
Orientation to Contentdm QuickStart for AMICALOrientation to Contentdm QuickStart for AMICAL
Orientation to Contentdm QuickStart for AMICALRuss Hunt
 
Container Monitoring with Sysdig
Container Monitoring with SysdigContainer Monitoring with Sysdig
Container Monitoring with SysdigSreenivas Makam
 
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17GreeceJS
 
A peek into the world of WordPress plugin development
A peek into the world of WordPress plugin developmentA peek into the world of WordPress plugin development
A peek into the world of WordPress plugin developmentR-Cubed Design Forge
 
SpotifyX Architectural Review
SpotifyX Architectural ReviewSpotifyX Architectural Review
SpotifyX Architectural ReviewMorteza Zakeri
 
Monitoring web application response times, a new approach
Monitoring web application response times, a new approachMonitoring web application response times, a new approach
Monitoring web application response times, a new approachMark Friedman
 
Consuming GRIN GLOBAL Webservices
Consuming GRIN GLOBAL WebservicesConsuming GRIN GLOBAL Webservices
Consuming GRIN GLOBAL WebservicesEdwin Rojas
 
Web API Design 2013
Web API Design 2013Web API Design 2013
Web API Design 2013gidgreen
 
Thin Server Architecture SPA, 5 years old presentation
Thin Server Architecture SPA, 5 years old presentationThin Server Architecture SPA, 5 years old presentation
Thin Server Architecture SPA, 5 years old presentationDavid Amend
 
Build Fast WordPress Site With Gatsby
Build Fast WordPress Site With GatsbyBuild Fast WordPress Site With Gatsby
Build Fast WordPress Site With GatsbyImran Sayed
 
GE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTGE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTKai Zhao
 
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるIt is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるSadaaki HIRAI
 

Similar to CDX Summary: Web Archival Collection Insights (20)

CDNs para el SharePoint Framework (SPFx)
CDNs para el SharePoint Framework (SPFx)CDNs para el SharePoint Framework (SPFx)
CDNs para el SharePoint Framework (SPFx)
 
Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
 
Sandboxing WebKitGTK (GUADEC 2019)
Sandboxing WebKitGTK (GUADEC 2019)Sandboxing WebKitGTK (GUADEC 2019)
Sandboxing WebKitGTK (GUADEC 2019)
 
Introduction to web technology and it's implementation
Introduction to web technology and it's implementationIntroduction to web technology and it's implementation
Introduction to web technology and it's implementation
 
How Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fastHow Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fast
 
Orientation to Contentdm QuickStart for AMICAL
Orientation to Contentdm QuickStart for AMICALOrientation to Contentdm QuickStart for AMICAL
Orientation to Contentdm QuickStart for AMICAL
 
Container Monitoring with Sysdig
Container Monitoring with SysdigContainer Monitoring with Sysdig
Container Monitoring with Sysdig
 
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17
The rise of Polymer and Web Components (Kostas Karolemeas) - GreeceJS #17
 
Google Polymer Framework
Google Polymer FrameworkGoogle Polymer Framework
Google Polymer Framework
 
A peek into the world of WordPress plugin development
A peek into the world of WordPress plugin developmentA peek into the world of WordPress plugin development
A peek into the world of WordPress plugin development
 
SpotifyX Architectural Review
SpotifyX Architectural ReviewSpotifyX Architectural Review
SpotifyX Architectural Review
 
Monitoring web application response times, a new approach
Monitoring web application response times, a new approachMonitoring web application response times, a new approach
Monitoring web application response times, a new approach
 
Consuming GRIN GLOBAL Webservices
Consuming GRIN GLOBAL WebservicesConsuming GRIN GLOBAL Webservices
Consuming GRIN GLOBAL Webservices
 
Web API Design 2013
Web API Design 2013Web API Design 2013
Web API Design 2013
 
Thin Server Architecture SPA, 5 years old presentation
Thin Server Architecture SPA, 5 years old presentationThin Server Architecture SPA, 5 years old presentation
Thin Server Architecture SPA, 5 years old presentation
 
Build Fast WordPress Site With Gatsby
Build Fast WordPress Site With GatsbyBuild Fast WordPress Site With Gatsby
Build Fast WordPress Site With Gatsby
 
GE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTGE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoT
 
Development Workflows on AWS
Development Workflows on AWSDevelopment Workflows on AWS
Development Workflows on AWS
 
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるIt is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
 

More from Sawood Alam

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingSawood Alam
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSawood Alam
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupSawood Alam
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 

More from Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research Group
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 

Recently uploaded

The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxSulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxnoordubaliya2003
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 

Recently uploaded (20)

The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptx
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxSulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 

CDX Summary: Web Archival Collection Insights

  • 1. CDX Summary: Web Archival Collection Insights Sawood Alam and Mark Graham Wayback Machine, Internet Archive, San Francisco, CA 94118, USA {sawood,mark}@archive.org @ibnesayeed @waybackmachine @internetarchive TPDL, September 21, 2022, Padua, Italy
  • 2. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Internet Archive Collections 2 Various media types Collections of various media types
  • 3. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Metadata Rich Collections 3 Distinguishable cover art/thumbnails Title, authors, year, and other metadata
  • 4. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Web Collection Items Listing 4 Generic thumbnails Similar titles
  • 5. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Web Collection Metadata 5 Limited metadata How many captures, URLs, domains, …? What else might be report here?
  • 6. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Web Collection CDX Summary 6 Now, some IA web collections include summary of their archival contents
  • 7. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Rendered HTML vs. Source Code 7
  • 8. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> HTTP Response vs. WARC Record 8 HTTP headers Payload WARC headers
  • 9. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> 9 Capture Index (CDX) edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc ● URL Key (SURT) ● Datetime ● Original URL ● Media Type ● Status Code ● Digest ● Length ● Offset ● WARC File Path https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
  • 10. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: CLI Installation and Usage 10 $ pip install cdxsummary $ cdxsummary --help usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input] Summarize web archive capture index (CDX) files. positional arguments: input CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-') optional arguments: -h, --help show this help message and exit -a [QUERY], --api [QUERY] CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL -i, --item Treat the input argument as a Petabox item identifier instead of a file path -j, --json Generate summary in JSON format -l, --load Load JSON report instead of CDX -o [FILE], --out [FILE] Write output to the given file (default: STDOUT) -r, --report Generate non-summarized JSON report -s [N], --samples [N] Number of sample memento URLs in summary (default: 10) -t [N], --tophosts [N] Number of hosts with maximum captures in summary (default: 10) -v, --version Show version number https://github.com/internetarchive/cdx-summary/
  • 11. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: JSON Rendition in a Web Component 11 $ cdxsummary --json covid.cdx.gz > covid.summary.json https://github.com/internetarchive/cdx-summary/tree/main/webcomponent <cdx-summary src="covid.summary.json" type="collection" name="COVID-19 Collection" format="short" thumbs="6" playback="https://archive.example.com/memento/" fold="thumbs samples"> </cdx-summary>
  • 12. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: Overview 12
  • 13. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: MIME Type and Status Code 13 Various number formats in tooltip
  • 14. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: Path Segment and Query Parameter 14 Root URLs
  • 15. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: Year and Month 15
  • 16. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: Top Hosts 16
  • 17. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> CDX Summary: Random HTML Capture Samples 17 Load samples
  • 18. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> <CDX-Summary> Web Component Testing and Customization 18 https://internetarchive.github.io/cdx-summary/webcomponent/ HTML element attributes Custom style variables Copy custom output Customized rendition/testing
  • 19. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Future Work 19 ● Statistics on unique URIs ● Visualizations ● Scalability ● Incremental updates ● Heuristics-based highlights
  • 20. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive> Conclusions 20 ● CLI tool for archival CDX files and API summarization ● Human and machine friendly reports (colorized text and JSON) ● Varying levels of details ● Collection understanding and QA ● Efficient random URL sampling ● Custom web component (<cdx-summary> HTML element) ● Web component testing and customization ● Open source ○ https://github.com/internetarchive/cdx-summary ○ https://internetarchive.github.io/cdx-summary/webcomponent/ ○ https://pypi.org/project/cdxsummary/ ○ https://www.npmjs.com/package/@internetarchive/cdxsummary