Web Data Engineering:
A Technical Perspective on Web Archives
Dr. Helge Holzmann
Web Data Engineer
Internet Archive
helge@archive.org
Open Repositories 2019
Hamburg, Germany
June 12, 2019
What is a web archive?
• Web archives preserve our history as documented on the web…
• … in huge datasets, consisting of all kinds of web resources
• e.g., HTML pages, images, video, scripts, …
• … stored as big files in the standardized (W)ARC format
• along with metadata + request / response headers
• next to lightweight capture index files (CDX)
• … to provide access to webpages from the past
• for users through close reading
• replayed by the Wayback Machine
• for data analysis at scale through distant reading
• enabled by Big Data processing methods, like Hadoop / Spark, …
Helge Holzmann (helge@archive.org) 2019-06-12
Not today's topic …
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
The (archived) web…
• ... is a very valuable dataset to study the web (and the offline world)
• Access to very diverse knowledge from various disciplines (history, politics, …)
• The whole web at your fingertips / processable snapshots
• Adds a temporal dimension to the Web / captures dynamics
• ... is a widely unstructured collection of data
• Access and analysis at scale are challenging
• Processing petabytes of data is expensive and time-consuming
• Difficult to discover, identify, extract records and contained information
• Potentially highly technical, complex access and parsing process
• Low-level details users / researchers / data scientists don't want to / can't deal with
• Data engineering is needed before the data can be used in downstream applications / studies
Different perspectives on web archives
• User-centric View
• (Temporal) Search / Information Retrieval
• Direct access / replaying archived pages
• Data-centric View
• (W)ARC and CDX (metadata) datasets
• Big data processing: Hadoop, Spark, …
• Content analysis, historical / evolution studies
• Graph-centric View
• Structural view on the dataset
• Graph algorithms / analysis, structured information
• Hyperlink and host graphs, entity / social networks, facts and more
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
Web (archives) as graph
• Foundational model for most downstream applications / analysis tasks
• E.g., Search index construction, term / entity co-occurrence studies, …
• Different ways / approaches to construct / extract (temporal) graphs
• (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
• Technical challenges that users don't want to / can't deal with:
• Efficient generation, effective representation, …
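One way to picture the "hosts vs. URLs" distinction is to aggregate URL-level hyperlinks into a weighted, temporal host graph. The following is a minimal sketch under stated assumptions: the function names, the per-year granularity, and the toy links are hypothetical and not taken from the talk.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def host_graph(url_links):
    """Aggregate (year, source URL, target URL) links into weighted host edges."""
    edges = Counter()
    for year, src, dst in url_links:
        s, d = host_of(src), host_of(dst)
        if s and d and s != d:  # skip unparsable URLs and host-internal links
            edges[(year, s, d)] += 1
    return edges


# Hypothetical toy links, not taken from any real crawl.
links = [
    (2006, "http://es.answers.yahoo.com/a", "http://www.example.org/x"),
    (2006, "http://es.answers.yahoo.com/b", "http://example.org/y"),
    (2006, "http://es.answers.yahoo.com/a", "http://es.answers.yahoo.com/b"),
    (2018, "https://es.answers.yahoo.com/", "https://example.org/"),
]
graph = host_graph(links)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

The host-level graph is much smaller than the URL-level one, which is one reason the choice of representation matters for downstream analysis.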
(Temporal) search in web archives
• Wanted: Enter a textual query, find relevant captures
• Challenges:
• Documents are temporal / consist of multiple versions
• New captures could be near-duplicates or contain relevant changes
• Temporal relevance in addition to textual relevance
• Relevance to the query is not always encoded in the content
• Information needs / query intents are different from traditional IR
• Mostly navigational: Under which URL can I find a specific resource?
• How to turn (temporal) graphs into a searchable index?
• Integrate full-text, titles, headlines, anchor texts, ...?
• Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch
• Adaptation of existing retrieval models
Web Data Engineering
• Transforming data into useful information
• Making it usable for downstream applications
• Search, data science, digital humanities, content analysis, ...
• Regular users, researchers, data scientists / analysts, ...
• Enabling efficient and effective access through...
• ... infrastructures
• ... suitable data formats
• ... simple tools / APIs
• ... optimized indexes
• Technical considerations made by computer scientists
• to help users / researchers focus on their application / study / research
• to hide complexity / low-level details through flexible abstractions
Example: Language Analysis (1)
• Possible research questions:
• Which pages of a language exist outside the country's ccTLD?
• Which languages are used the most in a certain area / topic?
• How has a language evolved over time on the web?
• Requirements:
• Tools for (W)ARC access, HTML parsing, language detection
• Language-annotated pages / captures
• Challenges:
• Texts too short to detect a language / confidence scores
• Multiple languages on one page / filtering and weighting
• Slow and expensive processing due to large-scale content analysis (weeks)
Example: Language Analysis (2)
• Wanted:
• Efficient access to comprehensive results
• Lightweight, reusable exchange format
• Dynamic threshold / flexible post-filtering
• Solution: (CDX) Attachment Format (ATT / CDXA)
• Lightweight, efficient loading, integrated data validation, decoupled from data
*.cdx.lang_2017-18_v2.cdxa.gz:
# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
CDX (Capture Index) with pointers to corresponding (W)ARC records (*.cdx):
com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
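The "dynamic threshold / flexible post-filtering" idea can be illustrated with a small parser for the language attachment lines shown on this slide. This is a sketch, not the actual ATT / CDXA tooling: the line layout ("digest lang:score,lang:score") is inferred from the sample alone, reading the numbers as confidence scores is an assumption, and both function names are hypothetical.

```python
def parse_lang_attachment(lines):
    """Parse 'digest lang:score,lang:score' lines into {digest: {lang: score}}."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and the comment header
            continue
        digest, _, value = line.partition(" ")
        scores = {}
        for item in value.split(","):
            lang, sep, score = item.partition(":")
            if sep:  # a digest without a value yields no scores
                scores[lang] = int(score)
        result[digest] = scores
    return result


def main_language(scores, threshold):
    """Return the top-scoring language if it reaches the threshold, else None."""
    best = max(scores.items(), key=lambda kv: kv[1], default=None)
    return best[0] if best and best[1] >= threshold else None


# Sample lines copied from the slide.
attachment = """# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
"""
langs = parse_lang_attachment(attachment.splitlines())
for digest, scores in langs.items():
    print(digest, main_language(scores, threshold=60))
```

Because the threshold is applied after loading, the same attachment can be reused with stricter or looser cut-offs without re-running the expensive language detection.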
We have more available (examples)
• Dataset of all homepages in Global Wayback (GWB) – web.archive.org
• Extracted from snapshot 20180911224740
• GWB-20180911224740_homepages.cdx.gz
• Pre-processed attachments
• GWB-20180911224740_homepages-*.cdx.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
• GWB-20180911224740_homepages-*.cdx.last.cdxa.gz
# The last available capture
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
Fatcat.wiki
(beta)
Archive and knowledge graph
of every publicly-accessible
scholarly output with a priority
on long-tail, at-risk publications.
Fatcat.wiki (big catalog)
• At-scale web harvesting of scholarly works
• with descriptive metadata and full-text
• linked with versions and secondary outputs
• API-first accessible / editable system
Challenge: the Internet Archive is big
• Web archive / Wayback Machine
• 20+ years of web
• 625+ library and other partners
• 753,932,022,000 (captured) URLs
• 362 billion web pages
• More than 5,000 URLs archived every second
• 40+ petabytes
• And there's more:
Challenge: web archives are Big Data
• Processing requires computing clusters
• e.g., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Common header / metadata format, but various / diverse payloads
• Requires cleaning, filtering, selection, extraction before processing
• MapReduce or variants
• Homogeneous data types / formats
• Distributed batch processing
• load → transform
• aggregate → write
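The load → transform → aggregate → write pattern can be sketched on a single machine as follows. This is only an illustration of the batch-processing shape, not how a cluster framework implements it; the CDX-like sample lines and the field position of the MIME type are hypothetical.

```python
from collections import Counter
from functools import reduce


def load(lines):
    """Load: split raw CDX-like lines into fields."""
    return (line.split() for line in lines if line.strip())


def transform(records):
    """Map: emit one (mime, 1) pair per record (MIME type assumed in field 3)."""
    return ((fields[3], 1) for fields in records)


def aggregate(pairs):
    """Reduce: sum the counts per key."""
    def add(acc, pair):
        acc[pair[0]] += pair[1]
        return acc
    return reduce(add, pairs, Counter())


def write(counts):
    """Write: format the result as tab-separated lines."""
    return [f"{mime}\t{n}" for mime, n in sorted(counts.items())]


# Hypothetical toy records.
lines = [
    "org,example)/ 20180101000000 http://example.org/ text/html 200",
    "org,example)/a.png 20180101000001 http://example.org/a.png image/png 200",
    "org,example)/b 20180102000000 http://example.org/b text/html 200",
]
output = write(aggregate(transform(load(lines))))
print("\n".join(output))
```

The pattern works well once the records are homogeneous, which is why heterogeneous archive payloads need cleaning, filtering and extraction first.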
Trade-off: data locality vs. random access
• Direct access allows for exploiting data locality
• Moving computations to the data / sequential scans
• Indirect access with selective random accesses
• Scanning sequentially results in wasted reads (PB)
Efficient processing
• Indirect access via lightweight metadata (CDX)
• Basic operations on metadata before touching the archive (filter, group, sort)
• E.g., offline pages, data types (scripts, styles, images, ...), domains
• Enriching records with data from payload for downstream applications
• E.g., titles, headlines, links, part-of-speech, named entities, ...
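The indirect access described here can be sketched with a toy archive: filter on lightweight metadata first, then read only the selected records by offset and length. The sketch assumes each record is stored as its own gzip member, as is common for (W)ARC files; the records, field names, and helper functions are hypothetical, and the payloads are not real WARC records.

```python
import gzip
import io


def build_archive(payloads):
    """Concatenate gzip members and return (bytes, [(offset, length), ...])."""
    buf, positions = io.BytesIO(), []
    for payload in payloads:
        member = gzip.compress(payload)
        positions.append((buf.tell(), len(member)))
        buf.write(member)
    return buf.getvalue(), positions


def read_record(archive, offset, length):
    """Random access: decompress only the member at the given position."""
    return gzip.decompress(archive[offset:offset + length])


# Hypothetical toy records: (url, mime, status, payload).
records = [
    ("http://example.org/", "text/html", 200, b"<html>home</html>"),
    ("http://example.org/logo.png", "image/png", 200, b"binary image data"),
    ("http://example.org/gone", "text/html", 404, b"not found"),
]
archive, positions = build_archive([r[3] for r in records])

# CDX-like index: metadata plus a pointer into the archive.
index = [
    {"url": url, "mime": mime, "status": status, "offset": off, "length": length}
    for (url, mime, status, _), (off, length) in zip(records, positions)
]

# Step 1: filter on metadata only, without touching the archive.
selected = [e for e in index if e["status"] == 200 and e["mime"] == "text/html"]

# Step 2: read just the selected records.
pages = {e["url"]: read_record(archive, e["offset"], e["length"]) for e in selected}
print(pages)
```

Only one of the three members is decompressed here; at petabyte scale, skipping the unneeded reads is the point of the approach.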
Sparkling data processing ☆
• (Internal) data processing library based on Apache Spark
• Goal to integrate all APIs to work with (temporal) web data in one library
• Continuous work in progress, growing with every new task
• Rich in features
• Efficient CDX / (W)ARC loading, parsing and storing from HDFS, PetaBox, …
• Fast HTML processing without expensive DOM parsing (SAX-like)
• Internal PetaBox authentication / access features
• ATT / CDXA attachment loaders and writers
• Shell / Python integration for computing derivations
• Distributed budget-aware repartitioning (e.g., 1GB per partition / file)
• Advanced retry / timeout / failure handling
• Lots of utilities for logging, file handling, string operations, URL/SURT formatting, …
• Easily configurable, library-wide constants and settings
• …
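As an illustration of the URL/SURT formatting mentioned in the list, the sketch below turns a URL into a SURT-style key like the ones in the CDX lines shown earlier (e.g., com,yahoo,answers,es)/). It is a simplified approximation, not Sparkling's implementation: real canonicalization handles many more cases (ports, session IDs, parameter ordering, ...), and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT-style key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


print(to_surt("http://es.answers.yahoo.com/"))
print(to_surt("https://www.example.org/a?b=1"))
```

Reversing the host labels makes all captures of a domain and its subdomains sort next to each other, which is what makes prefix lookups on a sorted CDX file cheap.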
ArchiveSpark
• Expressive and efficient data access and processing
• Declarative workflows, seamless two-step loading approach
• Open source
• Available on GitHub: https://github.com/helgeho/ArchiveSpark
• with documentation, docker image, and recipes for common tasks
• Modular / extensible
• Various DataSpecifications and EnrichFunctions
• ArchiveSpark-server: Web service API for ArchiveSpark
• https://github.com/helgeho/ArchiveSpark-server
• Generalizable for archival collections beyond Web archives
• …
[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
Simple and expressive interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative / no deep Scala or Spark knowledge required
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
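// Load the records from the CDX index, with pointers to the (W)ARC files
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
// Keep only successful HTML captures, based on metadata alone
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
// Enrich the remaining records with named entities from their payload
val entities = onlineHtml.enrich(Entities)
// Store the enriched records as JSON
entities.saveAsJson("entities.gz")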
Familiar, readable, reusable output
• Nested JSON output encodes lineage of applied enrichments
[Figure: example JSON output with the nested fields title, text, entities, persons]
Benchmarks vs. Spark / HBase
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
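The three benchmark scenarios can be expressed as plain selections over CDX-like metadata. The sketch below only illustrates what each scenario selects; it says nothing about how the benchmarked systems execute them. The Capture fields, the toy captures, the function names, and reading scenario (c) as "latest per URL" are all assumptions.

```python
from collections import namedtuple

Capture = namedtuple("Capture", "key timestamp url mime status")

# Hypothetical toy captures.
captures = [
    Capture("org,example)/", "20180903120000", "http://example.org/", "text/html", 200),
    Capture("org,example)/", "20180920080000", "http://example.org/", "text/html", 200),
    Capture("org,example)/", "20180925090000", "http://example.org/", "text/html", 503),
    Capture("org,example)/logo.png", "20180910000000", "http://example.org/logo.png", "image/png", 200),
    Capture("org,example,blog)/", "20181001000000", "http://blog.example.org/", "text/html", 200),
]


def by_url(caps, key):
    """(a) Select all captures of one particular URL."""
    return [c for c in caps if c.key == key]


def html_under_domain(caps, domain_key):
    """(b) Select all text/html captures under a domain, including subdomains."""
    def in_domain(c):
        host = c.key.split(")", 1)[0]
        return host == domain_key or host.startswith(domain_key + ",")
    return [c for c in caps if in_domain(c) and c.mime == "text/html"]


def latest_success_in_month(caps, month):
    """(c) Per URL, select the latest capture with status 200 in month 'YYYYMM'."""
    latest = {}
    for c in caps:
        if c.status == 200 and c.timestamp.startswith(month):
            if c.key not in latest or c.timestamp > latest[c.key].timestamp:
                latest[c.key] = c
    return latest


print(len(by_url(captures, "org,example)/")))
print(len(html_under_domain(captures, "org,example")))
print(latest_success_in_month(captures, "201809")["org,example)/"].timestamp)
```

All three selections need only metadata, so none of them has to read a payload before the derivations run.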
New ArchiveSpark (3.0) very soon
• Major overhaul
• Streamlined dependencies and package structure
• Even more simplified API
• Lots of bug fixes and improvements
• Will be largely based on / include parts of Sparkling
• org.archive.archivespark.sparkling
• Will benefit from Sparkling fixes and updates
• Almost ready
• Please have a little patience and check back soon…
• Follow / star / watch on GitHub
• https://github.com/helgeho/ArchiveSpark
We're at your service!
• Archive-It Research Services (ARS)
• WAT (extended metadata files)
• LGA (temporal graphs)
• WANE (named entities)
• Special Seed Services (Artificial Zone Files)
• Language + GeoIP analysis
• Nation Wide Web (NWW) Search
• Customized / regional web + media search
• APIs
• WASAPI data-transfer API (Archive-It)
• Availability API + CDX Server (Wayback)
• More to come soon, stay tuned…
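For the Wayback APIs listed above, request URLs can be assembled as sketched below. The sketch only builds URLs and makes no network calls. The endpoints and the parameters used (url, output, limit, filter, timestamp) are given as commonly documented for the Wayback CDX Server and Availability API and should be checked against the current documentation; the helper names are hypothetical.

```python
from urllib.parse import urlencode

CDX_SERVER = "https://web.archive.org/cdx/search/cdx"
AVAILABILITY = "https://archive.org/wayback/available"


def cdx_query(url, **params):
    """Build a CDX Server query URL for the captures of the given URL."""
    return CDX_SERVER + "?" + urlencode({"url": url, **params})


def availability_query(url, timestamp=None):
    """Build an Availability API URL, optionally asking for the capture closest to a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY + "?" + urlencode(params)


print(cdx_query("example.org", output="json", limit=5, filter="statuscode:200"))
print(availability_query("example.org", "20180911"))
```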
Thank you!
Helge Holzmann (helge@archive.org)
• archive.org
• archive-it.org
• fatcat.wiki
• github.com/helgeho/ArchiveSpark
Questions?
www.HelgeHolzmann.de
If interested in our work,
please get in touch!

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 

Recently uploaded (20)

ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 

Web Data Engineering - A Technical Perspective on Web Archives

  • 1. Web Data Engineering: A Technical Perspective on Web Archives Dr. Helge Holzmann Web Data Engineer Internet Archive helge@archive.org Open Repositories 2019 Hamburg, Germany June 12, 2019
  • 2. What is a web archive? • Web archives preserve our history as documented on the web… • … in huge datasets, consisting of all kinds of web resources • e.g., HTML pages, images, video, scripts, … • … stored as big files in the standardized (W)ARC format • along with metadata + request / response headers • next to lightweight capture index files (CDX) • … to provide access to webpages from the past • for users through close reading • replayed by the Wayback Machine • for data analysis at scale through distant-reading • enabled by Big Data processing methods, like Hadoop / Spark, … Helge Holzmann (helge@archive.org)2019-06-12
  • 5. Not today's topic … http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
  • 6. The (archived) web… • ... is a very valuable dataset to study the web (and the offline world) • Access to very diverse knowledge from various disciplines (history, politics, …) • The whole web at your fingertips / processable snapshots • Adds a temporal dimension to the Web / captures dynamics • ... is a widely unstructured collection of data • Access and analysis at scale is challenging • Processing petabytes of data is expensive and time-consuming • Difficult to discover, identify, extract records and contained information • Potentially highly technical, complex access and parsing process • Low-level details users / researchers / data scientists don't want to / can't deal with • Data engineering needed to be used in downstream applications / studies
  • 7. Different perspectives on web archives • User-centric View • (Temporal) Search / Information Retrieval • Direct access / replaying archived pages • Data-centric View • (W)ARC and CDX (metadata) datasets • Big data processing: Hadoop, Spark, … • Content analysis, historical / evolution studies • Graph-centric View • Structural view on the dataset • Graph algorithms / analysis, structured information • Hyperlink and host graphs, entity / social networks, facts and more [Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
  • 8. Web (archives) as graph • Foundational model for most downstream applications / analysis tasks • E.g., Search index construction, term / entity co-occurrence studies, … • Different ways / approaches to construct / extract (temporal) graphs • (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc. • Technical challenges that users don't want to / can't deal with: • Efficient generation, effective representation, …
  • 9. (Temporal) search in web archives • Wanted: Enter a textual query, find relevant captures • Challenges: • Documents are temporal / consist of multiple versions • New captures could be near-duplicates or relevant changes • Temporal relevance in addition to textual relevance • Relevance to the query is not always encoded in the content • Information needs / query intents are different from traditional IR • Mostly navigational: Under which URL can I find a specific resource? • How to turn (temporal) graphs into a searchable index? • Integrate full-text, titles, headlines, anchor texts, ...? • Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch • Adaptation of existing retrieval models
  • 10. Web Data Engineering • Transforming data into useful information • Making it usable for downstream applications • Search, data science, digital humanities, content analysis, ... • Regular users, researchers, data scientists / analysts, ... • Enabling efficient and effective access through... • ... infrastructures • ... suitable data formats • ... simple tools / APIs • ... optimized indexes • Technical considerations made by computer scientists • to help users / researchers focus on their application / study / research • to hide complexity / low-level details through flexible abstractions
  • 11. Example: Language Analysis (1) • Possible research questions: • Which pages of a language exist outside the country's ccTLD? • Which languages are used the most in a certain area / topic? • How has a language evolved over time on the web? • Requirements: • Tools for (W)ARC access, HTML parsing, language detection • Language-annotated pages / captures • Challenges: • Texts too short to detect a language / confidence scores • Multiple languages on one page / filtering and weighting • Slow and expensive processing due to large-scale content analysis (weeks)
  • 12. Example: Language Analysis (2) • Wanted: • Efficient access to comprehensive results • Lightweight, reusable exchange format • Dynamic threshold / flexible post-filtering • Solution: (CDX) Attachment Format (ATT / CDXA) • Lightweight, efficient loading, integrated data validation, decoupled from data
    CDX (Capture Index) with pointers to corresponding (W)ARC records: *.cdx
      com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
      com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
      com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
      com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
      com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
      com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
      com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
    Attachment: *.cdx.lang_2017-18_v2.cdxa.gz
      # Language detection using 'square leaf' approach
      Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
      RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
      3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
      5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
      XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
      7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
      45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
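The "dynamic threshold / flexible post-filtering" idea on this slide can be sketched in a few lines of Scala. This is a minimal illustration, not the Internet Archive's loader: it assumes each attachment line is a record digest followed by an optional comma-separated list of `lang:score` pairs (as in the sample above), and that the number after the colon is a score that can be thresholded. The names `LangAttachment`, `parseLine` and `withMinScore` are hypothetical.

```scala
object LangAttachment {
  // Parse one attachment line: "<digest> <lang>:<score>,<lang>:<score>".
  // The value part may be missing, e.g. when no language was detected.
  def parseLine(line: String): (String, Map[String, Int]) = {
    val parts = line.trim.split("\\s+", 2)
    val digest = parts(0)
    val langs: Map[String, Int] =
      if (parts.length < 2 || parts(1).isEmpty) Map.empty
      else parts(1).split(",").map { entry =>
        val idx = entry.indexOf(':')
        entry.substring(0, idx) -> entry.substring(idx + 1).toInt
      }.toMap
    (digest, langs)
  }

  // Post-filtering: keep only languages whose score reaches the threshold.
  // The threshold is chosen at analysis time, not at detection time.
  def withMinScore(langs: Map[String, Int], threshold: Int): Map[String, Int] =
    langs.filter { case (_, score) => score >= threshold }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7",
      "5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI",
      "7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97"
    )
    for (line <- lines) {
      val (digest, langs) = parseLine(line)
      println(s"$digest -> ${withMinScore(langs, 50)}")
    }
  }
}
```

Because the scores are stored rather than a single pre-filtered label, the same attachment file can serve studies with different confidence requirements without re-running the expensive detection.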
  • 13. We have more available (examples) • Dataset of all homepages in Global Wayback (GWB) – web.archive.org • Extracted from snapshot 20180911224740 • GWB-20180911224740_homepages.cdx.gz • Pre-processed attachments • GWB-20180911224740_homepages-*.cdx.gz • GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz • GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz • GWB-20180911224740_homepages-*.cdx.last.cdxa.gz
      # The last available capture
      Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
      RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
      3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
      5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
      XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
      7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
      45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
  • 14. Fatcat.wiki (beta) Archive and knowledge graph of every publicly-accessible scholarly output with a priority on long-tail, at-risk publications.
  • 15. Fatcat.wiki (big catalog) • At-scale web harvesting of scholarly works • with descriptive metadata and full-text • linked with versions and secondary outputs • API-first accessible / editable system
  • 16. Challenge: the Internet Archive is big • Web archive / Wayback Machine • 20+ years of web • 625+ library and other partners • 753,932,022,000 (captured) URLs • 362 billion web pages • More than 5,000 URLs archived every second • 40+ petabytes • And there's more:
  • 17. Challenge: web archives are Big Data • Processing requires computing clusters • e.g., Hadoop, YARN, Spark, … • Web archive data is heterogeneous, may include text, video, images, … • Common header / metadata format, but various / diverse payloads • Requires cleaning, filtering, selection, extraction before processing • MapReduce or variants • Homogeneous data types / formats • Distributed batch processing • load → transform • aggregate → write
  • 18. Trade-off: data locality vs. random access • Direct access allows for exploiting data locality • Moving computations to the data / sequential scans • Indirect access with selective random accesses • Scanning sequentially results in wasted reads (PB)
  • 19. Efficient processing • Indirect access via lightweight metadata (CDX) • Basic operations on metadata before touching the archive (filter, group, sort) • E.g., offline pages, data types (scripts, styles, images, ...), domains • Enriching records with data from payload for downstream applications • E.g., titles, headlines, links, part-of-speech, named entities, ...
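The "metadata first" step on this slide can be illustrated without any cluster: filter and group on CDX lines alone, and only afterwards fetch the (W)ARC payloads of the records that survive. The sketch below assumes space-separated CDX lines whose first six fields are SURT key, timestamp, original URL, MIME type, HTTP status and digest; the sample lines are made up for illustration, and `CdxFilter`, `CdxRecord`, `parse` and `onlineHtml` are hypothetical names, not part of any Internet Archive library.

```scala
object CdxFilter {
  final case class CdxRecord(surt: String, timestamp: String, url: String,
                             mime: String, status: Int, digest: String)

  // Parse one CDX line; lines with too few fields or a non-numeric
  // status (e.g. "-") are skipped by returning None.
  def parse(line: String): Option[CdxRecord] = {
    val f = line.trim.split(" ")
    if (f.length < 6) None
    else scala.util.Try(f(4).toInt).toOption.map { status =>
      CdxRecord(f(0), f(1), f(2), f(3), status, f(5))
    }
  }

  // Metadata-only filter: successfully captured HTML pages.
  def onlineHtml(records: Seq[CdxRecord]): Seq[CdxRecord] =
    records.filter(r => r.status == 200 && r.mime == "text/html")

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, for illustration only.
    val lines = Seq(
      "org,example)/ 20180101000000 http://example.org/ text/html 200 AAAA",
      "org,example)/style.css 20180101000001 http://example.org/style.css text/css 200 BBBB",
      "org,example)/gone 20180101000002 http://example.org/gone text/html 404 CCCC"
    )
    val selected = onlineHtml(lines.flatMap(parse))
    // Group by host part of the SURT key (everything before ')').
    val perHost = selected.groupBy(_.surt.takeWhile(_ != ')')).map { case (h, rs) => h -> rs.size }
    println(perHost)
  }
}
```

Only the records in `selected` would need a random access into the archive, which is the point of the indirect access pattern described above.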
  • 20. Sparkling data processing ☆ • (Internal) data processing library based on Apache Spark • Goal to integrate all APIs to work with (temporal) web data in one library • Continuous work in progress, growing with every new task • Rich in features • Efficient CDX / (W)ARC loading, parsing and storing from HDFS, Petabox, … • Fast HTML processing without expensive DOM parsing (SAX-like) • Internal PetaBox authentication / access features • ATT / CDXA attachment loaders and writers • Shell / Python integration for computing derivations • Distributed budget-aware repartitioning (e.g., 1GB per partition / file) • Advanced retry / timeout / failure handling • Lots of utilities for logging, file handling, string operations, URL/SURT formatting, … • Easily configurable, library-wide constants and settings • …
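How Sparkling implements its "budget-aware repartitioning" is not described on the slide. As a rough, single-machine illustration of the general idea only, the following sketch greedily packs sized items into groups so that each group stays within a byte budget where possible. The object `BudgetPartitioning` and the function `pack` are hypothetical, and a distributed version would need additional coordination that is not shown here.

```scala
object BudgetPartitioning {
  // Greedy sequential packing: start a new group whenever adding the next
  // item would exceed the budget. An item larger than the budget ends up
  // in a group of its own.
  def pack[A](items: Seq[(A, Long)], budget: Long): Vector[Vector[A]] = {
    val start = (Vector.empty[Vector[A]], Vector.empty[A], 0L)
    val (closed, open, _) = items.foldLeft(start) {
      case ((done, current, used), (item, size)) =>
        if (current.nonEmpty && used + size > budget) (done :+ current, Vector(item), size)
        else (done, current :+ item, used + size)
    }
    if (open.isEmpty) closed else closed :+ open
  }

  def main(args: Array[String]): Unit = {
    // Sizes in megabytes for readability; the budget is 1000 MB (about 1GB).
    val files = Seq("a" -> 600L, "b" -> 500L, "c" -> 300L, "d" -> 1200L)
    println(pack(files, 1000L))
  }
}
```

Sequential packing keeps the input order, which matters when the items are already sorted (for example by SURT key); a bin-packing heuristic that reorders items could fill groups more tightly at the cost of that order.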
  • 21. ArchiveSpark • Expressive and efficient data access and processing • Declarative workflows, seamless two step loading approach • Open source • Available on GitHub: https://github.com/helgeho/ArchiveSpark • with documentation, docker image, and recipes for common tasks • Modular / extensible • Various DataSpecifications and EnrichFunctions • ArchiveSpark-server: Web service API for ArchiveSpark • https://github.com/helgeho/ArchiveSpark-server • Generalizable for archival collections beyond Web archives • … [Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016] [Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
  • 22. Simple and expressive interface • Based on Spark, powered by Scala • This does not mean you have to learn a new programming language! • The interface is rather declarative / no deep Scala or Spark knowledge required • Simple data accessors are included • Provide simplified access to the underlying data model • Easy extraction / enrichment mechanisms • Customizable and extensible by advanced users
      val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
      val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
      val entities = onlineHtml.enrich(Entities)
      entities.saveAsJson("entities.gz")
  • 23. Familiar, readable, reusable output • Nested JSON output encodes lineage of applied enrichments [Figure: nested JSON output with the fields title, text, entities, persons]
  • 24. Benchmarks vs. Spark / HBase • Three scenarios, from basic to more sophisticated: a) Select one particular URL b) Select all pages (MIME type text/html) under a specific domain c) Select the latest successful capture (HTTP status 200) in a specific month • Benchmarks do not include derivations • Those are applied on top of all three methods and involve third-party libraries
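Scenario (c) is the least obvious of the three selections, so here is what it amounts to on plain collections. This is a sketch of the selection logic only, not the benchmarked ArchiveSpark, Spark or HBase code. It assumes 14-digit `yyyyMMddHHmmss` timestamps, so that a month is a 6-character prefix and string order equals time order. `LatestCapture`, `Capture` and `latestSuccessful` are hypothetical names, and the sample records are made up.

```scala
object LatestCapture {
  final case class Capture(surt: String, timestamp: String, status: Int)

  // For every URL key, pick the latest capture with HTTP status 200
  // whose timestamp falls into the given month (format "yyyyMM").
  def latestSuccessful(captures: Seq[Capture], month: String): Map[String, Capture] =
    captures
      .filter(c => c.status == 200 && c.timestamp.startsWith(month))
      .groupBy(_.surt)
      .map { case (surt, cs) => surt -> cs.maxBy(_.timestamp) }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, for illustration only.
    val captures = Seq(
      Capture("org,example)/", "20180903120000", 200),
      Capture("org,example)/", "20180920080000", 200),
      Capture("org,example)/", "20180925080000", 503),
      Capture("org,example)/", "20181001000000", 200)
    )
    println(latestSuccessful(captures, "201809"))
  }
}
```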
  • 25. New ArchiveSpark (3.0) very soon • Major overhaul • Streamlined dependencies and package structure • Even more simplified API • Lots of bug fixes and improvements • Will be widely based on / include parts of Sparkling • org.archive.archivespark.sparkling • Will benefit from Sparkling fixes and updates • Almost ready • Please have a little patience and check back soon… • Follow / star / watch on GitHub • https://github.com/helgeho/ArchiveSpark
  • 26. We're at your service! • Archive-It Research Services (ARS) • WAT (extended metadata files) • LGA (temporal graphs) • WANE (named entities) • Special Seed Services (Artificial Zone Files) • Language + GeoIP analysis • Nation Wide Web (NWW) Search • Customized / regional web + media search • APIs • WASAPI data-transfer API (Archive-It) • Availability API + CDX Server (Wayback) • More to come soon, stay tuned…
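For the two Wayback APIs named on this slide, a request is just an HTTP GET with query parameters. The sketch below only builds the request URLs and performs no network access. The endpoints and the `url`, `timestamp`, `from`, `to` and `limit` parameters follow the public Wayback Availability API and CDX Server documentation as I understand them; please check the current documentation before relying on them. The helper names in `WaybackUrls` are hypothetical.

```scala
import java.net.URLEncoder

object WaybackUrls {
  private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  // Availability API: asks for the capture closest to the given timestamp
  // (1 to 14 digits of yyyyMMddHHmmss).
  def availability(url: String, timestamp: String): String =
    s"https://archive.org/wayback/available?url=${enc(url)}&timestamp=$timestamp"

  // CDX Server: lists captures of a URL within a time range.
  def cdxSearch(url: String, from: String, to: String, limit: Int): String =
    s"https://web.archive.org/cdx/search/cdx?url=${enc(url)}&from=$from&to=$to&limit=$limit"

  def main(args: Array[String]): Unit = {
    println(availability("es.answers.yahoo.com", "20060616"))
    println(cdxSearch("es.answers.yahoo.com", "2006", "2006", 5))
  }
}
```

The Availability API answers with a small JSON document, while the CDX Server returns capture lines in the same spirit as the CDX samples shown earlier in the deck, so the same parsing approach applies.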
  • 27. Thank you! Helge Holzmann (helge@archive.org) • archive.org • archive-it.org • fatcat.wiki • github.com/helgeho/ArchiveSpark Questions? www.HelgeHolzmann.de If interested in our work, please get in touch!