Web Data Engineering:
A Technical Perspective on Web Archives
Dr. Helge Holzmann
Web Data Engineer
Internet Archive
helge@archive.org
Open Repositories 2019
Hamburg, Germany
June 12, 2019
What is a web archive?
• Web archives preserve our history as documented on the web…
• … in huge datasets, consisting of all kinds of web resources
• e.g., HTML pages, images, video, scripts, …
• … stored as big files in the standardized (W)ARC format
• along with metadata + request / response headers
• next to lightweight capture index files (CDX)
• … to provide access to webpages from the past
• for users through close reading
• replayed by the Wayback Machine
• for data analysis at scale through distant-reading
• enabled by Big Data processing methods, like Hadoop / Spark, …
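As a concrete illustration of the lightweight capture index, a CDX line is just a space-separated capture summary. The exact field layout varies by CDX flavor; the sketch below assumes the six fields seen in the examples later in this deck (SURT key, timestamp, original URL, MIME type, HTTP status, content digest) and is not tied to any particular tool's API.

```python
# Parse a simplified 6-field CDX line (field layout is an assumption
# mirroring the examples shown later in this deck).
from collections import namedtuple

CdxRecord = namedtuple(
    "CdxRecord", ["surt", "timestamp", "url", "mime", "status", "digest"])

def parse_cdx_line(line):
    surt, ts, url, mime, status, digest = line.split()[:6]
    return CdxRecord(surt, ts, url, mime, int(status), digest)

line = ("com,yahoo,answers,es)/ 20180904025943 "
        "https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X")
record = parse_cdx_line(line)
print(record.surt, record.status)  # com,yahoo,answers,es)/ 200
```

Because such lines are tiny compared to the archived payloads, they can be scanned and filtered long before any (W)ARC file is opened.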
Helge Holzmann (helge@archive.org) 2019-06-12
Not today's topic …
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
The (archived) web…
• ... is a very valuable dataset to study the web (and the offline world)
• Access to very diverse knowledge from various disciplines (history, politics, …)
• The whole web at your fingertips / processable snapshots
• Adds a temporal dimension to the Web / captures dynamics
• ... is a largely unstructured collection of data
• Access and analysis at scale is challenging
• Processing petabytes of data is expensive and time-consuming
• Difficult to discover, identify, extract records and contained information
• Potentially highly technical, complex access and parsing process
• Low-level details users / researchers / data scientists don't want to / can't deal with
• Data engineering is needed before the data can be used in downstream applications / studies
Different perspectives on web archives
• User-centric View
• (Temporal) Search / Information Retrieval
• Direct access / replaying archived pages
• Data-centric View
• (W)ARC and CDX (metadata) datasets
• Big data processing: Hadoop, Spark, …
• Content analysis, historical / evolution studies
• Graph-centric View
• Structural view on the dataset
• Graph algorithms / analysis, structured information
• Hyperlink and host graphs, entity / social networks, facts and more
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
Web (archives) as graph
• Foundational model for most downstream applications / analysis tasks
• E.g., Search index construction, term / entity co-occurrence studies, …
• Different ways / approaches to construct / extract (temporal) graphs
• (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
• Technical challenges that users don't want to / can't deal with:
• Efficient generation, effective representation, …
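One way to make the graph construction concrete: page-level hyperlinks can be aggregated into a host-level graph while keeping the capture timestamps, so the result stays temporal. The (timestamp, source, target) triples and helper names below are illustrative assumptions, not the API of any particular tool.

```python
# Minimal sketch: derive a temporal host-level graph from
# page-level hyperlinks (data and names are illustrative).
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc.lower()

links = [  # (capture timestamp, source URL, target URL)
    ("20180904", "https://es.answers.yahoo.com/q/1", "https://es.wikipedia.org/wiki/A"),
    ("20180905", "https://es.answers.yahoo.com/q/2", "https://es.wikipedia.org/wiki/B"),
]

# Aggregate URL-level links into host-level edges, keeping the set
# of timestamps per edge so the temporal dimension is preserved.
edges = defaultdict(set)
for ts, src, dst in links:
    edges[(host(src), host(dst))].add(ts)

print(dict(edges))
```

The same fold can run distributed (e.g., as a Spark aggregation) when the link triples come from petabyte-scale archives.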
(Temporal) search in web archives
• Wanted: Enter a textual query, find relevant captures
• Challenges:
• Documents are temporal / consist of multiple versions
• New captures could be near-duplicates or contain relevant changes
• Temporal relevance in addition to textual relevance
• Relevance to the query is not always encoded in the content
• Information needs / query intents are different from traditional IR
• Mostly navigational: Under which URL can I find a specific resource?
• How to turn (temporal) graphs into a searchable index?
• Integrate full-text, titles, headlines, anchor texts, ...?
• Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch
• Adaptation of existing retrieval models
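To sketch the indexing step above: multiple captures of one URL can be folded into a single searchable document that carries an explicit temporal dimension. The document shape here is an assumption for illustration, not a prescribed Elasticsearch mapping.

```python
# Fold all captures of one URL into one index document
# (fields and capture data are illustrative).
captures = [
    {"url": "http://example.org/", "ts": "20060101000000", "title": "Old title"},
    {"url": "http://example.org/", "ts": "20180101000000", "title": "New title"},
]

doc = {
    "url": captures[0]["url"],
    "first_capture": min(c["ts"] for c in captures),
    "last_capture": max(c["ts"] for c in captures),
    "titles": sorted({c["title"] for c in captures}),  # keep all versions searchable
}
print(doc["first_capture"], doc["last_capture"])
```

Full-text, headlines, and anchor texts would be merged in the same way, with per-version fields where temporal relevance matters.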
Web Data Engineering
• Transforming data into useful information
• Making it usable for downstream applications
• Search, data science, digital humanities, content analysis, ...
• Regular users, researchers, data scientists / analysts, ...
• Enabling efficient and effective access through...
• ... infrastructures
• ... suitable data formats
• ... simple tools / APIs
• ... optimized indexes
• Technical considerations made by computer scientists
• to help users / researchers focus on their application / study / research
• to hide complexity / low-level details through flexible abstractions
Example: Language Analysis (1)
• Possible research questions:
• Which pages of a language exist outside the country's ccTLD?
• Which languages are used the most in a certain area / topic?
• How has a language evolved over time on the web?
• Requirements:
• Tools for (W)ARC access, HTML parsing, language detection
• Language-annotated pages / captures
• Challenges:
• Texts too short to detect a language / confidence scores
• Multiple languages on one page / filtering and weighting
• Slow and expensive processing due to large-scale content analysis (weeks)
Example: Language Analysis (2)
• Wanted:
• Efficient access to comprehensive results
• Lightweight, reusable exchange format
• Dynamic threshold / flexible post-filtering
• Solution: (CDX) Attachment Format (ATT / CDXA)
• Lightweight, efficient loading, integrated data validation, decoupled from data
# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
*.cdx.lang_2017-18_v2.cdxa.gz
CDX (Capture Index) with pointers to corresponding (W)ARC records: *.cdx
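The attachment idea above can be sketched in a few lines: CDXA entries carry the content digest plus per-language confidence scores, decoupled from the CDX itself. Joining by digest and applying a dynamic threshold then happens without touching any (W)ARC payload. The value layout mirrors the examples above; the helper names are illustrative.

```python
# Parse a CDXA language attachment value like "fr:54,en:7" and
# filter by a dynamic confidence threshold (names are illustrative).
def parse_attachment(value):
    # "fr:54,en:7" -> {"fr": 54, "en": 7}; empty value -> no languages
    if not value:
        return {}
    return {lang: int(score)
            for lang, score in (p.split(":") for p in value.split(","))}

def filter_languages(scores, threshold):
    return {lang: s for lang, s in scores.items() if s >= threshold}

scores = parse_attachment("fr:54,en:7")
print(filter_languages(scores, threshold=50))  # {'fr': 54}
```

Because the threshold is applied at read time rather than baked into the attachment, the same CDXA file supports strict and lenient analyses alike.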
We have more available (examples)
• Dataset of all homepages in Global Wayback (GWB) – web.archive.org
• Extracted from snapshot 20180911224740
• GWB-20180911224740_homepages.cdx.gz
• Pre-processed attachments
• GWB-20180911224740_homepages-*.cdx.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last.cdxa.gz
# The last available capture
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
Fatcat.wiki (beta)
Archive and knowledge graph of every publicly-accessible scholarly output with a priority on long-tail, at-risk publications.
Fatcat.wiki (big catalog)
• At-scale web harvesting of scholarly works
• with descriptive metadata and full-text
• linked with versions and secondary outputs
• API-first accessible / editable system
Challenge: the Internet Archive is big
• Web archive / Wayback Machine
• 20+ years of web
• 625+ library and other partners
• 753,932,022,000 (captured) URLs
• 362 billion web pages
• More than 5,000 URLs archived every second
• 40+ petabytes
• And there's more:
Challenge: web archives are Big Data
• Processing requires computing clusters
• e.g., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Common header / metadata format, but various / diverse payloads
• Requires cleaning, filtering, selection, extraction before processing
• MapReduce or variants
• Homogeneous data types / formats
• Distributed batch processing
• load → transform
• aggregate → write
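The load → transform → aggregate → write pattern above can be shown locally as a stand-in for a distributed MapReduce/Spark job, here counting MIME types across capture records (the records themselves are illustrative):

```python
# load -> transform -> aggregate, shown on in-memory records as a
# local stand-in for a distributed batch job (data is illustrative).
from collections import Counter

records = [                              # load
    {"mime": "text/html"},
    {"mime": "image/png"},
    {"mime": "text/html"},
]
mimes = (r["mime"] for r in records)     # transform
mime_counts = Counter(mimes)             # aggregate
print(mime_counts.most_common(1))        # write / report: [('text/html', 2)]
```

In a real cluster job, only the data source and the execution engine change; the shape of the pipeline stays the same.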
Trade-off: data locality vs. random access
• Direct access allows for exploiting data locality
• Moving computations to the data / sequential scans
• Indirect access with selective random accesses
• Scanning sequentially results in wasted reads (PB)
Efficient processing
• Indirect access via lightweight metadata (CDX)
• Basic operations on metadata before touching the archive (filter, group, sort)
• E.g., offline pages, data types (scripts, styles, images, ...), domains
• Enriching records with data from payload for downstream applications
• E.g., titles, headlines, links, part-of-speech, named entities, ...
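A small sketch of the metadata-first idea: selecting the latest successful (HTTP 200) capture per URL can be done entirely on lightweight CDX tuples, so only the selected records ever require (W)ARC access. The tuple layout is illustrative.

```python
# Pick the latest HTTP-200 capture per SURT key from CDX metadata
# alone, before any (W)ARC file is touched (tuples are illustrative).
captures = [  # (SURT key, timestamp, HTTP status)
    ("com,example)/", "20170101000000", 200),
    ("com,example)/", "20180101000000", 200),
    ("com,example)/", "20190101000000", 404),
]

latest = {}
for surt, ts, status in captures:
    # 14-digit timestamps compare correctly as strings
    if status == 200 and ts > latest.get(surt, ""):
        latest[surt] = ts

print(latest)  # {'com,example)/': '20180101000000'}
```

The filtered result is typically orders of magnitude smaller than the archive, which is what makes the subsequent random accesses affordable.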
Sparkling data processing ☆
• (Internal) data processing library based on Apache Spark
• Goal to integrate all APIs to work with (temporal) web data in one library
• Continuous work in progress, growing with every new task
• Rich set of features
• Efficient CDX / (W)ARC loading, parsing and storing from HDFS, PetaBox, …
• Fast HTML processing without expensive DOM parsing (SAX-like)
• Internal PetaBox authentication / access features
• ATT / CDXA attachment loaders and writers
• Shell / Python integration for computing derivations
• Distributed budget-aware repartitioning (e.g., 1GB per partition / file)
• Advanced retry / timeout / failure handling
• Lots of utilities for logging, file handling, string operations, URL/SURT formatting, …
• Easily configurable, library-wide constants and settings
• …
ArchiveSpark
• Expressive and efficient data access and processing
• Declarative workflows, seamless two-step loading approach
• Open source
• Available on GitHub: https://github.com/helgeho/ArchiveSpark
• with documentation, docker image, and recipes for common tasks
• Modular / extensible
• Various DataSpecifications and EnrichFunctions
• ArchiveSpark-server: Web service API for ArchiveSpark
• https://github.com/helgeho/ArchiveSpark-server
• Generalizable for archival collections beyond Web archives
• …
[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
Simple and expressive interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative / no deep Scala or Spark knowledge required
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val entities = onlineHtml.enrich(Entities)
entities.saveAsJson("entities.gz")
Familiar, readable, reusable output
• Nested JSON output encodes lineage of applied enrichments
(Figure: example nested JSON output with fields such as title, text, and entities with nested persons)
Benchmarks vs. Spark / HBase
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
New ArchiveSpark (3.0) very soon
• Major overhaul
• Streamlined dependencies and package structure
• Even more simplified API
• Lots of bug fixes and improvements
• Will be largely based on / include parts of Sparkling
• org.archive.archivespark.sparkling
• Will benefit from Sparkling fixes and updates
• Almost ready
• Please have a little patience and check back soon…
• Follow / star / watch on GitHub
• https://github.com/helgeho/ArchiveSpark
We're at your service!
• Archive-It Research Services (ARS)
• WAT (extended metadata files)
• LGA (temporal graphs)
• WANE (named entities)
• Special Seed Services (Artificial Zone Files)
• Language + GeoIP analysis
• Nation Wide Web (NWW) Search
• Customized / regional web + media search
• APIs
• WASAPI data-transfer API (Archive-It)
• Availability API + CDX Server (Wayback)
• More to come soon, stay tuned…
Thank you!
Helge Holzmann (helge@archive.org)
• archive.org
• archive-it.org
• fatcat.wiki
• github.com/helgeho/ArchiveSpark
Questions?
www.HelgeHolzmann.de
If interested in our work, please get in touch!
