CDX Summary: Web Archival Collection Insights
Sawood Alam and Mark Graham
Wayback Machine, Internet Archive, San Francisco, CA 94118, USA
{sawood,mark}@archive.org
@ibnesayeed @waybackmachine @internetarchive
TPDL, September 21, 2022, Padua, Italy
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Internet Archive Collections
Collections of various media types
Metadata Rich Collections
Distinguishable cover art/thumbnails
Title, authors, year, and other metadata
Web Collection Items Listing
Generic thumbnails
Similar titles
Web Collection Metadata
Limited metadata
How many captures, URLs, domains, …? What else might be reported here?
Web Collection CDX Summary
Now, some IA web collections include a summary of their archival contents
Rendered HTML vs. Source Code
HTTP Response vs. WARC Record
HTTP headers
Payload
WARC headers
Capture Index (CDX)
edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc
edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc
● URL Key (SURT)
● Datetime
● Original URL
● Media Type
● Status Code
● Digest
● Length
● Offset
● WARC File Path
https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
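The nine fields listed above map directly onto the whitespace-separated columns of each CDX line. A minimal sketch of reading one line into named fields follows; the `CdxRecord` type and `parse_cdx_line` function are hypothetical names chosen for illustration and are not part of the cdxsummary tool. The digests are left truncated exactly as they appear on the slide.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One capture, using the field order shown on the slide (hypothetical type)."""
    urlkey: str      # URL Key (SURT)
    datetime: str    # 14-digit capture datetime
    original: str    # Original URL
    mimetype: str    # Media Type
    status: str      # Status Code (kept as text, since it is not always numeric)
    digest: str      # Payload digest
    length: int      # Record length in the WARC file
    offset: int      # Byte offset in the WARC file
    filename: str    # WARC File Path


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a 9-field, space-separated CDX line into a CdxRecord."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    return CdxRecord(*parts[:6], int(parts[6]), int(parts[7]), parts[8])


# The two sample lines from the slide (digests truncated in the source).
lines = [
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc",
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc",
]
for record in map(parse_cdx_line, lines):
    print(record.original, record.mimetype, record.status)
```

Real CDX files may use other field layouts, so this sketch only covers the nine-column form shown here.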
CDX Summary: CLI Installation and Usage
$ pip install cdxsummary
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]
Summarize web archive capture index (CDX) files.
positional arguments:
input CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')
optional arguments:
-h, --help show this help message and exit
-a [QUERY], --api [QUERY]
CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
-i, --item Treat the input argument as a Petabox item identifier instead of a file path
-j, --json Generate summary in JSON format
-l, --load Load JSON report instead of CDX
-o [FILE], --out [FILE]
Write output to the given file (default: STDOUT)
-r, --report Generate non-summarized JSON report
-s [N], --samples [N]
Number of sample memento URLs in summary (default: 10)
-t [N], --tophosts [N]
Number of hosts with maximum captures in summary (default: 10)
-v, --version Show version number
https://github.com/internetarchive/cdx-summary/
CDX Summary: JSON Rendition in a Web Component
$ cdxsummary --json covid.cdx.gz > covid.summary.json
https://github.com/internetarchive/cdx-summary/tree/main/webcomponent
<cdx-summary src="covid.summary.json"
type="collection"
name="COVID-19 Collection"
format="short"
thumbs="6"
playback="https://archive.example.com/memento/"
fold="thumbs samples">
</cdx-summary>
CDX Summary: Overview
CDX Summary: MIME Type and Status Code
Various number formats in tooltip
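A cross-tabulation of media type against status code can be approximated with a simple counter. The sketch below is an illustration only: the grouping of status codes into classes such as `2XX`, the `Other` bucket, and the two number formats are assumptions, and the function names are hypothetical rather than taken from cdxsummary.

```python
from collections import Counter


def status_group(status: str) -> str:
    """Collapse a status code into a class like '2XX' (assumed grouping)."""
    if len(status) == 3 and status.isdigit() and status[0] in "2345":
        return f"{status[0]}XX"
    return "Other"


def mime_status_counts(records):
    """Count captures per (media type, status class) pair.

    `records` is an iterable of (mimetype, status) string pairs.
    """
    counts = Counter()
    for mimetype, status in records:
        counts[(mimetype, status_group(status))] += 1
    return counts


def number_formats(count: int, total: int):
    """Return the same count as a grouped integer and as a percentage."""
    return f"{count:,}", f"{100 * count / total:.1f}%"


sample = [("text/html", "200"), ("text/css", "200"), ("text/html", "404")]
for key, count in mime_status_counts(sample).items():
    print(key, number_formats(count, len(sample)))
```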
CDX Summary: Path Segment and Query Parameter
Root URLs
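One way to read this breakdown is to count, for each URL, how many non-empty path segments and how many query parameters it has; root URLs are then the ones with zero of both. The following is a sketch under that assumption, using only the Python standard library. `path_query_shape` and `shape_counts` are hypothetical helper names, not functions of the tool.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def path_query_shape(url: str) -> tuple:
    """Return (number of path segments, number of query parameters)."""
    parts = urlsplit(url)
    # Ignore empty segments so that a trailing slash does not add a segment.
    segments = [s for s in parts.path.split("/") if s]
    params = parse_qsl(parts.query, keep_blank_values=True)
    return len(segments), len(params)


def shape_counts(urls):
    """Count how many URLs fall into each (segments, parameters) cell."""
    return Counter(path_query_shape(url) for url in urls)


urls = [
    "https://cs.odu.edu/~salam/dweb/",
    "https://cs.odu.edu/~salam/dweb/style.css",
    "https://example.com/",
]
counts = shape_counts(urls)
print("root URLs:", counts[(0, 0)])
```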
CDX Summary: Year and Month
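Because the CDX datetime field is a 14-digit string (for example `20180802012013`), a year-by-month breakdown only needs its first six characters. A minimal sketch, with `year_month_counts` as a hypothetical helper name and malformed timestamps silently skipped by assumption:

```python
from collections import Counter


def year_month_counts(timestamps):
    """Count captures per (year, month) from CDX datetime strings."""
    counts = Counter()
    for ts in timestamps:
        # Skip values that do not start with six digits (YYYYMM).
        if len(ts) >= 6 and ts[:6].isdigit():
            counts[(ts[:4], ts[4:6])] += 1
    return counts


counts = year_month_counts(["20180802012013", "20180802012013", "20190115000000"])
for (year, month), n in sorted(counts.items()):
    print(year, month, n)
```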
CDX Summary: Top Hosts
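Ranking hosts by capture count can be sketched with a counter over the host part of each original URL. The default of 10 mirrors the `--tophosts` default in the CLI help; everything else here, including the `top_hosts` name and the choice to read hosts from the original URL rather than the SURT key, is an assumption made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(urls, n=10):
    """Return the n hosts with the most captures as (host, count) pairs."""
    hosts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip URLs without a parsable host
            hosts[host] += 1
    return hosts.most_common(n)


urls = [
    "https://cs.odu.edu/~salam/dweb/",
    "https://cs.odu.edu/~salam/dweb/style.css",
    "https://example.com/",
]
for host, count in top_hosts(urls, n=2):
    print(host, count)
```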
CDX Summary: Random HTML Capture Samples
Load samples
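The slides do not say how the samples are drawn. One common way to pick a fixed number of random items in a single pass over a large index is reservoir sampling, sketched below as an assumption rather than as the tool's actual method. The filter on `text/html` with status `200` and the name `sample_html_captures` are also illustrative choices; the default of 10 mirrors the `--samples` default in the CLI help.

```python
import random


def sample_html_captures(records, k=10, seed=None):
    """Pick up to k random HTML captures in one pass (reservoir sampling).

    `records` is an iterable of (datetime, url, mimetype, status) tuples.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of eligible captures encountered so far
    for datetime, url, mimetype, status in records:
        if mimetype != "text/html" or status != "200":
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append((datetime, url))
        else:
            # Keep the new item with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = (datetime, url)
    return reservoir


records = [
    ("20180802012013", "https://cs.odu.edu/~salam/dweb/", "text/html", "200"),
    ("20180802012013", "https://cs.odu.edu/~salam/dweb/style.css", "text/css", "200"),
]
print(sample_html_captures(records, k=10, seed=1))
```

Memory use stays proportional to `k` regardless of how many captures the index holds, which is why this technique suits large CDX files.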
<CDX-Summary> Web Component Testing and Customization
https://internetarchive.github.io/cdx-summary/webcomponent/
HTML element attributes
Custom style variables
Copy custom output
Customized rendition/testing
Future Work
● Statistics on unique URIs
● Visualizations
● Scalability
● Incremental updates
● Heuristics-based highlights
Conclusions
● CLI tool for archival CDX files and API summarization
● Human and machine friendly reports (colorized text and JSON)
● Varying levels of detail
● Collection understanding and QA
● Efficient random URL sampling
● Custom web component (<cdx-summary> HTML element)
● Web component testing and customization
● Open source
○ https://github.com/internetarchive/cdx-summary
○ https://internetarchive.github.io/cdx-summary/webcomponent/
○ https://pypi.org/project/cdxsummary/
○ https://www.npmjs.com/package/@internetarchive/cdxsummary
