Large web archival collections are often opaque about their holdings. We created an open-source tool called CDX Summary to generate statistical reports based on URIs, hosts, TLDs, paths, query parameters, status codes, media types, dates and times, etc. present in the CDX index of a collection of WARC files. The tool also surfaces a configurable number of potentially good random memento samples from the collection for visual inspection, quality assurance, representative thumbnail generation, etc. It generates both human-readable and machine-readable reports with varying levels of detail for different use cases. Furthermore, we implemented a Web Component that can render the generated JSON summaries in HTML documents. Early exploration of CDX insights on Wayback Machine collections helped us improve our crawl operations.
Venue: TPDL 2022
Recording: https://www.youtube.com/watch?v=K5i3XShqW6A
CDX Summary: Web Archival Collection Insights
1. CDX Summary: Web Archival Collection Insights
Sawood Alam and Mark Graham
Wayback Machine, Internet Archive, San Francisco, CA 94118, USA
{sawood,mark}@archive.org
@ibnesayeed @waybackmachine @internetarchive
TPDL, September 21, 2022, Padua, Italy
2. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Internet Archive Collections
Collections of various media types
3. Metadata Rich Collections
Distinguishable cover art/thumbnails
Title, authors, year, and other metadata
4. Web Collection Items Listing
Generic thumbnails
Similar titles
5. Web Collection Metadata
Limited metadata
How many captures, URLs, domains, …? What else might be reported here?
6. Web Collection CDX Summary
Now, some IA web collections include a summary of their archival contents
7. Rendered HTML vs. Source Code
8. HTTP Response vs. WARC Record
HTTP headers
Payload
WARC headers
9. Capture Index (CDX)
edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ text/html 200 P2RBRGLK… 921 0 hello-dweb.warc
edu,odu,cs)/~salam/dweb/style.css 20180802012013 https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc
● URL Key (SURT)
● Datetime
● Original URL
● Media Type
● Status Code
● Digest
● Length
● Offset
● WARC File Path
https://www.slideshare.net/ibnesayeed/web-archive-warc-file-format
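As a rough illustration of the field layout listed above, the following Python sketch splits one of the sample lines into its nine space-separated fields. The `CdxRecord` type and `parse_cdx_line` function are hypothetical names introduced here, not part of the cdxsummary package, and the sketch assumes every line carries exactly these nine fields in this order. The digest stays truncated as it appears on the slide.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One CDX line, using the nine fields named on the slide (hypothetical type)."""
    urlkey: str      # URL Key (SURT)
    datetime: str    # 14-digit capture datetime
    original: str    # Original URL
    mimetype: str    # Media Type
    status: str      # Status Code
    digest: str      # Digest
    length: str      # Length
    offset: str      # Offset
    filename: str    # WARC File Path


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a space-separated CDX line; assumes exactly nine fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return CdxRecord(*fields)


line = (
    "edu,odu,cs)/~salam/dweb/ 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/ text/html 200 "
    "P2RBRGLK… 921 0 hello-dweb.warc"
)
record = parse_cdx_line(line)
print(record.original, record.mimetype, record.status)
```

Fields are kept as strings because real-world CDX files may carry placeholders rather than numbers in some columns; converting `length` or `offset` to integers is left to the caller.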
10. CDX Summary: CLI Installation and Usage
$ pip install cdxsummary
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]

Summarize web archive capture index (CDX) files.

positional arguments:
  input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')

optional arguments:
  -h, --help            show this help message and exit
  -a [QUERY], --api [QUERY]
                        CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
  -i, --item            Treat the input argument as a Petabox item identifier instead of a file path
  -j, --json            Generate summary in JSON format
  -l, --load            Load JSON report instead of CDX
  -o [FILE], --out [FILE]
                        Write output to the given file (default: STDOUT)
  -r, --report          Generate non-summarized JSON report
  -s [N], --samples [N]
                        Number of sample memento URLs in summary (default: 10)
  -t [N], --tophosts [N]
                        Number of hosts with maximum captures in summary (default: 10)
  -v, --version         Show version number
https://github.com/internetarchive/cdx-summary/
11. CDX Summary: JSON Rendition in a Web Component
$ cdxsummary --json covid.cdx.gz > covid.summary.json
https://github.com/internetarchive/cdx-summary/tree/main/webcomponent
<cdx-summary src="covid.summary.json"
type="collection"
name="COVID-19 Collection"
format="short"
thumbs="6"
playback="https://archive.example.com/memento/"
fold="thumbs samples">
</cdx-summary>
12. CDX Summary: Overview
13. CDX Summary: MIME Type and Status Code
Various number formats in tooltip
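A cross-tabulation of media types against status codes can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's implementation: the function names are hypothetical, the grouping of status codes into classes such as `2XX` is an assumption, and media types are counted as-is rather than grouped into broader categories.

```python
from collections import Counter


def status_class(status: str) -> str:
    """Map a status like '200' to '2XX'; anything else becomes 'Other' (assumed grouping)."""
    if len(status) == 3 and status.isdigit() and status[0] in "2345":
        return status[0] + "XX"
    return "Other"


def tally_mime_status(lines):
    """Count captures per (media type, status class) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip lines too short to hold both columns
        counts[(fields[3], status_class(fields[4]))] += 1
    return counts


# The two sample lines from the Capture Index (CDX) slide.
SAMPLE = [
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc",
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc",
]
counts = tally_mime_status(SAMPLE)
for (mime, klass), n in sorted(counts.items()):
    print(mime, klass, n)
```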
14. CDX Summary: Path Segment and Query Parameter
Root URLs
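One plausible way to obtain path segment and query parameter counts for a URL is shown below, using only the Python standard library. The `url_shape` function is a hypothetical helper, and the counting rules (empty path segments ignored, blank query values kept) are assumptions rather than the tool's documented behavior. Under these rules a root URL is one with zero path segments and zero query parameters. The `example.com` URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def url_shape(url: str) -> tuple:
    """Return (number of path segments, number of query parameters) for a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]  # drop empty segments
    params = parse_qsl(parts.query, keep_blank_values=True)
    return len(segments), len(params)


urls = [
    "https://cs.odu.edu/~salam/dweb/",            # from the CDX slide
    "https://cs.odu.edu/~salam/dweb/style.css",   # from the CDX slide
    "https://example.com/",                       # hypothetical root URL
    "https://example.com/search?q=cdx&page=2",    # hypothetical URL with a query
]
shapes = Counter(url_shape(u) for u in urls)
for (segments, params), n in sorted(shapes.items()):
    print(f"{segments} path segment(s), {params} query parameter(s): {n}")
```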
15. CDX Summary: Year and Month
16. CDX Summary: Top Hosts
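A top-hosts table amounts to counting captures per host and keeping the largest counts. The sketch below shows that idea under simple assumptions: `top_hosts` is a hypothetical function, the host is taken from the original URL column (the third field), and the default of 10 mirrors the `--tophosts` default shown in the help output. It is not the tool's own code.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(lines, n=10):
    """Return the n hosts with the most captures as (host, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # no original URL column
        host = urlsplit(fields[2]).hostname
        if host:
            counts[host] += 1
    return counts.most_common(n)


# The two sample lines from the Capture Index (CDX) slide.
SAMPLE = [
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc",
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc",
]
hosts = top_hosts(SAMPLE)
print(hosts)
```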
17. CDX Summary: Random HTML Capture Samples
Load samples
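One common way to draw a fixed number of random samples in a single pass over an index of unknown size is reservoir sampling. The sketch below illustrates that technique only; it is not a description of how cdxsummary samples internally. The function name is hypothetical, and filtering on media type `text/html` with status `200` is an assumed stand-in for "potentially good" captures.

```python
import random


def sample_html_captures(lines, k=10, seed=None):
    """Reservoir-sample up to k (datetime, original URL) pairs from HTML 200 captures."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of qualifying captures seen so far
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[3] != "text/html" or fields[4] != "200":
            continue
        seen += 1
        candidate = (fields[1], fields[2])
        if len(reservoir) < k:
            reservoir.append(candidate)   # fill the reservoir first
        else:
            j = rng.randrange(seen)       # uniform in [0, seen)
            if j < k:
                reservoir[j] = candidate  # replace with probability k/seen
    return reservoir


# The two sample lines from the Capture Index (CDX) slide.
SAMPLE = [
    "edu,odu,cs)/~salam/dweb/ 20180802012013 https://cs.odu.edu/~salam/dweb/ "
    "text/html 200 P2RBRGLK… 921 0 hello-dweb.warc",
    "edu,odu,cs)/~salam/dweb/style.css 20180802012013 "
    "https://cs.odu.edu/~salam/dweb/style.css text/css 200 NOWM53D5… 427 922 hello-dweb.warc",
]
for datetime, url in sample_html_captures(SAMPLE, k=10):
    # Wayback Machine playback URLs follow the /web/<datetime>/<url> pattern.
    print(f"https://web.archive.org/web/{datetime}/{url}")
```

Each qualifying capture ends up in the result with equal probability, and memory use stays proportional to the sample size rather than to the size of the index.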
18. <CDX-Summary> Web Component Testing and Customization
https://internetarchive.github.io/cdx-summary/webcomponent/
HTML element attributes
Custom style variables
Copy custom output
Customized rendition/testing
19. Future Work
● Statistics on unique URIs
● Visualizations
● Scalability
● Incremental updates
● Heuristics-based highlights
20. Conclusions
● CLI tool for archival CDX files and API summarization
● Human and machine friendly reports (colorized text and JSON)
● Varying levels of detail
● Collection understanding and QA
● Efficient random URL sampling
● Custom web component (<cdx-summary> HTML element)
● Web component testing and customization
● Open source
○ https://github.com/internetarchive/cdx-summary
○ https://internetarchive.github.io/cdx-summary/webcomponent/
○ https://pypi.org/project/cdxsummary/
○ https://www.npmjs.com/package/@internetarchive/cdxsummary