Web Data Engineering:
A Technical Perspective on Web Archives
Dr. Helge Holzmann
Web Data Engineer
Internet Archive
helge@archive.org
Open Repositories 2019
Hamburg, Germany
June 12, 2019
What is a web archive?
• Web archives preserve our history as documented on the web…
• … in huge datasets, consisting of all kinds of web resources
• e.g., HTML pages, images, video, scripts, …
• … stored as big files in the standardized (W)ARC format
• along with metadata + request / response headers
• next to lightweight capture index files (CDX)
• … to provide access to webpages from the past
• for users through close reading
• replayed by the Wayback Machine
• for data analysis at scale through distant-reading
• enabled by Big Data processing methods, like Hadoop / Spark, …
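As a concrete illustration of the lightweight capture index, a CDX line is just a space-separated capture summary. The exact field layout varies by CDX flavor; the sketch below assumes the six fields seen in the examples later in this deck (SURT key, timestamp, original URL, MIME type, HTTP status, content digest) and is not tied to any particular tool's API.

```python
# Parse a simplified 6-field CDX line (field layout is an assumption
# mirroring the examples shown later in this deck).
from collections import namedtuple

CdxRecord = namedtuple(
    "CdxRecord", ["surt", "timestamp", "url", "mime", "status", "digest"])

def parse_cdx_line(line):
    surt, ts, url, mime, status, digest = line.split()[:6]
    return CdxRecord(surt, ts, url, mime, int(status), digest)

line = ("com,yahoo,answers,es)/ 20180904025943 "
        "https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X")
record = parse_cdx_line(line)
print(record.surt, record.status)  # com,yahoo,answers,es)/ 200
```

Because such lines are tiny compared to the archived payloads, they can be scanned and filtered long before any (W)ARC file is opened.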
Helge Holzmann (helge@archive.org) 2019-06-12
Not today's topic …
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
The (archived) web…
• ... is a very valuable dataset to study the web (and the offline world)
• Access to very diverse knowledge from various disciplines (history, politics, …)
• The whole web at your fingertips / processable snapshots
• Adds a temporal dimension to the Web / captures dynamics
• ... is a largely unstructured collection of data
• Access and analysis at scale is challenging
• Processing petabytes of data is expensive and time-consuming
• Difficult to discover, identify, extract records and contained information
• Potentially highly technical, complex access and parsing process
• Low-level details users / researchers / data scientists don't want to / can't deal with
• Data engineering is needed before the data can be used in downstream applications / studies
Different perspectives on web archives
• User-centric View
• (Temporal) Search / Information Retrieval
• Direct access / replaying archived pages
• Data-centric View
• (W)ARC and CDX (metadata) datasets
• Big data processing: Hadoop, Spark, …
• Content analysis, historical / evolution studies
• Graph-centric View
• Structural view on the dataset
• Graph algorithms / analysis, structured information
• Hyperlink and host graphs, entity / social networks, facts and more
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
Web (archives) as graph
• Foundational model for most downstream applications / analysis tasks
• E.g., Search index construction, term / entity co-occurrence studies, …
• Different ways / approaches to construct / extract (temporal) graphs
• (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
• Technical challenges that users don't want to / can't deal with:
• Efficient generation, effective representation, …
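One way to make the graph construction concrete: page-level hyperlinks can be aggregated into a host-level graph while keeping the capture timestamps, so the result stays temporal. The (timestamp, source, target) triples and helper names below are illustrative assumptions, not the API of any particular tool.

```python
# Minimal sketch: derive a temporal host-level graph from
# page-level hyperlinks (data and names are illustrative).
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc.lower()

links = [  # (capture timestamp, source URL, target URL)
    ("20180904", "https://es.answers.yahoo.com/q/1", "https://es.wikipedia.org/wiki/A"),
    ("20180905", "https://es.answers.yahoo.com/q/2", "https://es.wikipedia.org/wiki/B"),
]

# Aggregate URL-level links into host-level edges, keeping the set
# of timestamps per edge so the temporal dimension is preserved.
edges = defaultdict(set)
for ts, src, dst in links:
    edges[(host(src), host(dst))].add(ts)

print(dict(edges))
```

The same fold can run distributed (e.g., as a Spark aggregation) when the link triples come from petabyte-scale archives.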
(Temporal) search in web archives
• Wanted: Enter a textual query, find relevant captures
• Challenges:
• Documents are temporal / consist of multiple versions
• New captures could be near-duplicates or contain relevant changes
• Temporal relevance in addition to textual relevance
• Relevance to the query is not always encoded in the content
• Information needs / query intents are different from traditional IR
• Mostly navigational: Under which URL can I find a specific resource?
• How to turn (temporal) graphs into a searchable index?
• Integrate full-text, titles, headlines, anchor texts, ...?
• Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch
• Adaptation of existing retrieval models
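To sketch the indexing step above: multiple captures of one URL can be folded into a single searchable document that carries an explicit temporal dimension. The document shape here is an assumption for illustration, not a prescribed Elasticsearch mapping.

```python
# Fold all captures of one URL into one index document
# (fields and capture data are illustrative).
captures = [
    {"url": "http://example.org/", "ts": "20060101000000", "title": "Old title"},
    {"url": "http://example.org/", "ts": "20180101000000", "title": "New title"},
]

doc = {
    "url": captures[0]["url"],
    "first_capture": min(c["ts"] for c in captures),
    "last_capture": max(c["ts"] for c in captures),
    "titles": sorted({c["title"] for c in captures}),  # keep all versions searchable
}
print(doc["first_capture"], doc["last_capture"])
```

Full-text, headlines, and anchor texts would be merged in the same way, with per-version fields where temporal relevance matters.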
Web Data Engineering
• Transforming data into useful information
• Making it usable for downstream applications
• Search, data science, digital humanities, content analysis, ...
• Regular users, researchers, data scientists / analysts, ...
• Enabling efficient and effective access through...
• ... infrastructures
• ... suitable data formats
• ... simple tools / APIs
• ... optimized indexes
• Technical considerations made by computer scientists
• to help users / researchers focus on their application / study / research
• to hide complexity / low-level details through flexible abstractions
Example: Language Analysis (1)
• Possible research questions:
• Which pages of a language exist outside the country's ccTLD?
• Which languages are used the most in a certain area / topic?
• How has a language evolved over time on the web?
• Requirements:
• Tools for (W)ARC access, HTML parsing, language detection
• Language-annotated pages / captures
• Challenges:
• Texts too short to detect a language / confidence scores
• Multiple languages on one page / filtering and weighting
• Slow and expensive processing due to large-scale content analysis (weeks)
Example: Language Analysis (2)
• Wanted:
• Efficient access to comprehensive results
• Lightweight, reusable exchange format
• Dynamic threshold / flexible post-filtering
• Solution: (CDX) Attachment Format (ATT / CDXA)
• Lightweight, efficient loading, integrated data validation, decoupled from data
# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
*.cdx.lang_2017-18_v2.cdxa.gz
CDX (Capture Index) with pointers to corresponding (W)ARC records: *.cdx
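The attachment idea above can be sketched in a few lines: CDXA entries carry the content digest plus per-language confidence scores, decoupled from the CDX itself. Joining by digest and applying a dynamic threshold then happens without touching any (W)ARC payload. The value layout mirrors the examples above; the helper names are illustrative.

```python
# Parse a CDXA language attachment value like "fr:54,en:7" and
# filter by a dynamic confidence threshold (names are illustrative).
def parse_attachment(value):
    # "fr:54,en:7" -> {"fr": 54, "en": 7}; empty value -> no languages
    if not value:
        return {}
    return {lang: int(score)
            for lang, score in (p.split(":") for p in value.split(","))}

def filter_languages(scores, threshold):
    return {lang: s for lang, s in scores.items() if s >= threshold}

scores = parse_attachment("fr:54,en:7")
print(filter_languages(scores, threshold=50))  # {'fr': 54}
```

Because the threshold is applied at read time rather than baked into the attachment, the same CDXA file supports strict and lenient analyses alike.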
We have more available (examples)
• Dataset of all homepages in Global Wayback (GWB) – web.archive.org
• Extracted from snapshot 20180911224740
• GWB-20180911224740_homepages.cdx.gz
• Pre-processed attachments
• GWB-20180911224740_homepages-*.cdx.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last.cdxa.gz
# The last available capture
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
Fatcat.wiki (beta)
Archive and knowledge graph of every publicly-accessible scholarly output with a priority on long-tail, at-risk publications.
Fatcat.wiki (big catalog)
• At-scale web harvesting of scholarly works
• with descriptive metadata and full-text
• linked with versions and secondary outputs
• API-first accessible / editable system
Challenge: the Internet Archive is big
• Web archive / Wayback Machine
• 20+ years of web
• 625+ library and other partners
• 753,932,022,000 (captured) URLs
• 362 billion web pages
• More than 5,000 URLs archived every second
• 40+ petabytes
• And there's more:
Challenge: web archives are Big Data
• Processing requires computing clusters
• e.g., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Common header / metadata format, but various / diverse payloads
• Requires cleaning, filtering, selection, extraction before processing
• MapReduce or variants
• Homogeneous data types / formats
• Distributed batch processing
• load → transform
• aggregate → write
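The load → transform → aggregate → write pattern above can be shown locally as a stand-in for a distributed MapReduce/Spark job, here counting MIME types across capture records (the records themselves are illustrative):

```python
# load -> transform -> aggregate, shown on in-memory records as a
# local stand-in for a distributed batch job (data is illustrative).
from collections import Counter

records = [                              # load
    {"mime": "text/html"},
    {"mime": "image/png"},
    {"mime": "text/html"},
]
mimes = (r["mime"] for r in records)     # transform
mime_counts = Counter(mimes)             # aggregate
print(mime_counts.most_common(1))        # write / report: [('text/html', 2)]
```

In a real cluster job, only the data source and the execution engine change; the shape of the pipeline stays the same.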
Trade-off: data locality vs. random access
• Direct access allows for exploiting data locality
• Moving computations to the data / sequential scans
• Indirect access with selective random accesses
• Scanning sequentially results in wasted reads (PB)
Efficient processing
• Indirect access via lightweight metadata (CDX)
• Basic operations on metadata before touching the archive (filter, group, sort)
• E.g., offline pages, data types (scripts, styles, images, ...), domains
• Enriching records with data from payload for downstream applications
• E.g., titles, headlines, links, part-of-speech, named entities, ...
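A small sketch of the metadata-first idea: selecting the latest successful (HTTP 200) capture per URL can be done entirely on lightweight CDX tuples, so only the selected records ever require (W)ARC access. The tuple layout is illustrative.

```python
# Pick the latest HTTP-200 capture per SURT key from CDX metadata
# alone, before any (W)ARC file is touched (tuples are illustrative).
captures = [  # (SURT key, timestamp, HTTP status)
    ("com,example)/", "20170101000000", 200),
    ("com,example)/", "20180101000000", 200),
    ("com,example)/", "20190101000000", 404),
]

latest = {}
for surt, ts, status in captures:
    # 14-digit timestamps compare correctly as strings
    if status == 200 and ts > latest.get(surt, ""):
        latest[surt] = ts

print(latest)  # {'com,example)/': '20180101000000'}
```

The filtered result is typically orders of magnitude smaller than the archive, which is what makes the subsequent random accesses affordable.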
Sparkling data processing ☆
• (Internal) data processing library based on Apache Spark
• Goal to integrate all APIs to work with (temporal) web data in one library
• Continuous work in progress, growing with every new task
• Rich set of features
• Efficient CDX / (W)ARC loading, parsing and storing from HDFS, PetaBox, …
• Fast HTML processing without expensive DOM parsing (SAX-like)
• Internal PetaBox authentication / access features
• ATT / CDXA attachment loaders and writers
• Shell / Python integration for computing derivations
• Distributed budget-aware repartitioning (e.g., 1GB per partition / file)
• Advanced retry / timeout / failure handling
• Lots of utilities for logging, file handling, string operations, URL/SURT formatting, …
• Easily configurable, library-wide constants and settings
• …
ArchiveSpark
• Expressive and efficient data access and processing
• Declarative workflows, seamless two-step loading approach
• Open source
• Available on GitHub: https://github.com/helgeho/ArchiveSpark
• with documentation, docker image, and recipes for common tasks
• Modular / extensible
• Various DataSpecifications and EnrichFunctions
• ArchiveSpark-server: Web service API for ArchiveSpark
• https://github.com/helgeho/ArchiveSpark-server
• Generalizable for archival collections beyond Web archives
• …
[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
Simple and expressive interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative / no deep Scala or Spark knowledge required
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val entities = onlineHtml.enrich(Entities)
entities.saveAsJson("entities.gz")
Familiar, readable, reusable output
• Nested JSON output encodes lineage of applied enrichments
(Figure: example nested JSON output with fields such as title, text, and entities with nested persons)
Benchmarks vs. Spark / HBase
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
New ArchiveSpark (3.0) very soon
• Major overhaul
• Streamlined dependencies and package structure
• Even more simplified API
• Lots of bug fixes and improvements
• Will be largely based on / include parts of Sparkling
• org.archive.archivespark.sparkling
• Will benefit from Sparkling fixes and updates
• Almost ready
• Please have a little patience and check back soon…
• Follow / star / watch on GitHub
• https://github.com/helgeho/ArchiveSpark
We're at your service!
• Archive-It Research Services (ARS)
• WAT (extended metadata files)
• LGA (temporal graphs)
• WANE (named entities)
• Special Seed Services (Artificial Zone Files)
• Language + GeoIP analysis
• Nation Wide Web (NWW) Search
• Customized / regional web + media search
• APIs
• WASAPI data-transfer API (Archive-It)
• Availability API + CDX Server (Wayback)
• More to come soon, stay tuned…
Thank you!
Helge Holzmann (helge@archive.org)
• archive.org
• archive-it.org
• fatcat.wiki
• github.com/helgeho/ArchiveSpark
Questions?
www.HelgeHolzmann.de
If interested in our work, please get in touch!
