© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Public Data Sets: How to Stage
Petabytes of Data for Analysis in AWS
WPS326
Jed Sundwall
Manager, AWS Open Data
Amazon Web Services
Sebastian Nagel
Crawl Engineer
Common Crawl
Dave Rocamora
Solutions Architect
Amazon Web Services
About Common Crawl
• We’re a non-profit that makes web data accessible to anyone
• Lower the barrier to “the web as a data set”
• Running your own large-scale web crawl is expensive and challenging
• Innovation occurs by using the data, rarely through new collection methods
• 10 years of web data
• Common Crawl founded in 2007 by Gil Elbaz
• First crawl 2008
• Since 2012 as public data set on AWS
• 3 Petabytes of web data on s3://commoncrawl/
10 years or 3 petabytes of web data
• Monthly “snapshots” over the last few years
• Broad sample crawls covering about 3 billion pages every month
• Partially overlapping with previous months
• HTML pages only (small percentage of other document formats)
• Released for free without additional intellectual property restrictions
• Used for natural language processing, data mining, information
retrieval, web science, market research, linked data and semantic web,
internet security, graph processing, benchmark and test ...
Diverse community
• Research and education vs. company or startup
• Not (yet) familiar with AWS vs. already on AWS
• Available time and budget
• Web archiving vs. big data community
• Conservative or open regarding data formats and processing tools
• Diverse use cases prefer different data formats
• Natural language processing: HTML converted to (annotated) plain text
• Internet security: binary payload plus HTTP headers
Data formats
• WARC (Web ARChive)
• HTML content + HTTP header + capture metadata
• Every WARC file is a random sample on its own
• Secondary formats
• Preprocessed and smaller in size
• WAT: page metadata and outgoing links as JSON
• WET: plain text with little metadata
• URL index to WARC files
• As a service – https://index.commoncrawl.org/
• Columnar index (Parquet format)
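The secondary files sit alongside the WARC files in the bucket. As a rough sketch, a WAT or WET path can be derived from its WARC path; note the naming convention below is inferred from the public `*.paths.gz` listings (and the example segment path is illustrative), so verify it against the listings for the crawl you use:

```python
# Derive the WAT/WET path for a given WARC path.
# NOTE: naming convention inferred from Common Crawl's path listings
# (warc.paths.gz, wat.paths.gz, wet.paths.gz); may change between crawls.
def warc_to_secondary(warc_path, fmt):
    assert fmt in ('wat', 'wet')
    return (warc_path
            .replace('/warc/', '/{}/'.format(fmt))
            .replace('.warc.gz', '.warc.{}.gz'.format(fmt)))

# hypothetical example path, for illustration only
warc = ('crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/'
        'warc/CC-MAIN-20181015080248-20181015101748-00000.warc.gz')
print(warc_to_secondary(warc, 'wet'))
```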
WARC (Web ARChive) in detail
• WARC is a sustainable format
• ISO standard since 2009
• Wrappers for many programming languages (Python, Java, Go, R, PHP …)
• It’s easy to read
<WARC record metadata: URL, capture date and time, IP address, checksum,
record-length>
\r\n\r\n
<HTTP header>
\r\n\r\n
<payload: HTML or binary (PDF, JPEG, etc.)>
\r\n\r\n
• Random-access .warc.gz
• Compression per record (gzip spec allows multiple deflate blocks)
• About 10% larger than whole-file gzip compression
• Random access with indexed record offsets
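The per-record compression can be illustrated with nothing but the standard library: each record is its own gzip member, members concatenate into one valid stream, and any single record can be decompressed from its byte offset. A minimal sketch with toy payloads, not real WARC data:

```python
import gzip

# Compress two "records" as independent gzip members, like .warc.gz does.
rec1 = gzip.compress(b"WARC record one\r\n\r\n")
rec2 = gzip.compress(b"WARC record two\r\n\r\n")
archive = rec1 + rec2

# The concatenation is still one valid gzip stream ...
assert gzip.decompress(archive) == b"WARC record one\r\n\r\nWARC record two\r\n\r\n"

# ... and with an indexed (offset, length) a single record can be read
# without touching the rest of the file.
offset, length = len(rec1), len(rec2)
assert gzip.decompress(archive[offset:offset + length]) == b"WARC record two\r\n\r\n"
```

This is exactly the property the URL index exploits: an S3 range request for `(offset, length)` returns one decompressable record.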
URL index
• The URL index lets you query by URL to:
• Filter by metadata (HTTP status, capture time, MIME type, content language)
• Get the WARC file path, record offset, and length, and
• Retrieve WARC records from Amazon S3
• CDX index https://index.commoncrawl.org/
• Look up URLs (also by domain or prefix)
• Memento / Wayback Machine
• Columnar index on Amazon S3 (Parquet format)
• SQL analytics with Amazon Athena, Spark, Hive …
• Support the “vertical” use case – process web pages of:
• A single domain or a single content language
• URLs matching a pattern
• ...
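A CDX index lookup returns, per capture, the fields needed for an S3 range request. A sketch of parsing one response line; the line below is illustrative, not a real response, and the exact field set varies by crawl:

```python
import json

# One JSON line as returned by e.g.
#   https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=example.com&output=json
# (illustrative values; field set assumed from typical index server output)
line = ('{"urlkey": "com,example)/", "timestamp": "20181015080248", '
        '"url": "https://example.com/", "status": "200", '
        '"filename": "crawl-data/CC-MAIN-2018-43/segments/.../warc/'
        'CC-MAIN-...-00000.warc.gz", '
        '"offset": "1234567", "length": "5432"}')

hit = json.loads(line)
# filename, offset, and length are all an S3 range request needs:
byte_range = 'bytes={}-{}'.format(int(hit['offset']),
                                  int(hit['offset']) + int(hit['length']) - 1)
print(hit['filename'], byte_range)
```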
Get help from Common Crawl and our community
• Links to data sets and a short description of formats and tools
https://commoncrawl.org/the-data/get-started/
• Examples and tutorials from our community
https://commoncrawl.org/the-data/examples/
https://commoncrawl.org/the-data/tutorials/
• Tools and examples maintained by Common Crawl
https://github.com/commoncrawl/
• Developer group
https://github.com/commoncrawl/
Example use case: Icelandic Word Count
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
df.createOrReplaceTempView('ccindex')
sqldf = session.sql("""SELECT url, warc_filename, warc_record_offset,
                              warc_record_length
                       FROM ccindex
                       WHERE crawl = 'CC-MAIN-2018-43'
                         AND subset = 'warc'
                         AND content_languages = 'isl'""")
# alternatively, load the result of a SQL query run in Amazon Athena
sqldf = session.read.format("csv").option("header", True) \
    .option("inferSchema", True).load(".../path/to/result.csv")
Example use case: Icelandic Word Count
warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                         "warc_record_length").rdd
word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
    .reduceByKey(lambda a, b: a + b)

# imports to fetch and process WARC records
import re
from io import BytesIO
import boto3
import botocore
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed

# simple Unicode-aware word tokenization (not suitable for CJK languages)
word_pattern = re.compile(r'\w+', re.UNICODE)
Example use case: Icelandic Word Count
def fetch_process_warc_records(rows):
    s3client = boto3.client('s3')
    for row in rows:
        url = row['url']
        warc_path = row['warc_filename']
        offset = int(row['warc_record_offset'])
        length = int(row['warc_record_length'])
        rangereq = 'bytes={}-{}'.format(offset, (offset + length - 1))
        response = s3client.get_object(Bucket='commoncrawl',
                                       Key=warc_path,
                                       Range=rangereq)
        record_stream = BytesIO(response["Body"].read())
        for record in ArchiveIterator(record_stream):
            page = record.content_stream().read()
            text = html_to_text(page)
            words = map(lambda w: w.lower(), word_pattern.findall(text))
            for word in words:
                yield word, 1
Example use case: Icelandic Word Count
# get text from HTML
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
def html_to_text(page):
    try:
        encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
        soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
        for script in soup(["script", "style"]):
            script.extract()
        return soup.get_text(" ", strip=True)
    except Exception:
        return ""
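For quick experiments without BeautifulSoup, a rough standard-library-only equivalent is sketched below; it is far less robust than the bs4 version (no encoding detection, no malformed-HTML recovery) and is shown only to illustrate the script/style stripping:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text_stdlib(page):
    parser = TextExtractor()
    parser.feed(page)
    return ' '.join(parser.chunks)

print(html_to_text_stdlib('<p>Halló <b>heimur</b></p><script>x=1</script>'))
# → Halló heimur
```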
Example use case: Icelandic Word Count
# most frequent words in about 1 million Icelandic web pages
22994307 og
19802034 í
15765245 að
15724978 á
8290840 er
8088372 um
6578254 sem
6313945 til
5264266 við
4872877 1
4432790 með
4423975 fyrir
3893682 2018
3817965 2
3308209 ekki
3205015 is
3165578 af
3051413 en
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Miningsathish sak
 
Embedding Data & Analytics With Looker
Embedding Data & Analytics With LookerEmbedding Data & Analytics With Looker
Embedding Data & Analytics With LookerLooker
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkLuiz Henrique Zambom Santana
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Website Analysis Seo Report
Website Analysis Seo ReportWebsite Analysis Seo Report
Website Analysis Seo ReportSEO Google Guru
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of SearchBill Slawski
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsNeo4j
 

What's hot (20)

On Page SEO
On Page SEOOn Page SEO
On Page SEO
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Big query
Big queryBig query
Big query
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Mining
 
Web mining
Web miningWeb mining
Web mining
 
Embedding Data & Analytics With Looker
Embedding Data & Analytics With LookerEmbedding Data & Analytics With Looker
Embedding Data & Analytics With Looker
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with Spark
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Website Analysis Seo Report
Website Analysis Seo ReportWebsite Analysis Seo Report
Website Analysis Seo Report
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of Search
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of Graphs
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Amazon Web Services
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumAmazon Web Services
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczAmazon Web Services
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018 (20)

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS W P S 3 2 6 Jed Sundwall Manager, AWS Open Data Amazon Web Services Sebastian Nagel Crawl Engineer Common Crawl Dave Rocamora Solutions Architect Amazon Web Services
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. About Common Crawl • We’re a non-profit that makes web data accessible to anyone • Lower the barrier to “the web as a data set” • Running your own large-scale web crawl is expensive and challenging • Innovation occurs by using the data, rarely through new collection methods • 10 years of web data • Common Crawl founded in 2007 by Gil Elbaz • First crawl 2008 • Since 2012 as public data set on AWS • 3 Petabytes of web data on s3://commoncrawl/
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. 10 Years or 3 petabytes of web data • Monthly “snapshots” during the last years • Broad sample crawls covering about 3 billion pages every month • Partially overlapping with previous months • HTML pages only (small percentage of other document formats) • Released for free without additional intellectual property restrictions • Used for natural language processing, data mining, information retrieval, web science, market research, linked data and semantic web, internet security, graph processing, benchmark and test ...
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Diverse community • Research and education vs. company or startup • Not (yet) familiar with AWS vs. already on AWS • Available time and budget • Web archiving vs. big data community • Conservative or open regarding data formats and processing tools • Diverse use cases prefer different data formats • Natural language processing: HTML converted to (annotated) plain text • Internet security: binary payload plus HTTP headers
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data formats • WARC (Web ARChive) • HTML content + HTTP header + capture metadata • Every WARC file is a random sample by its own • Secondary formats • Preprocessed and smaller in size • WAT: page metadata and outgoing links as JSON • WET: plain text with little metadata • URL index to WARC files • As a service – https://index.commoncrawl.org/ • Columnar index (Parquet format)
• 7. WARC (Web ARChive) in detail
  • WARC is a sustainable format
    • ISO standard since 2009
    • Wrappers for many programming languages (Python, Java, Go, R, PHP, ...)
    • It’s easy to read:
      <WARC record metadata: URL, capture date and time, IP address, checksum, record length> \r\n\r\n
      <HTTP header> \r\n\r\n
      <payload: HTML or binary (PDF, JPEG, etc.)> \r\n\r\n
  • Random-access .warc.gz
    • Compression per record (the gzip spec allows multiple deflate blocks)
    • About 10% larger than a fully gzipped file
    • Random access with indexed record offsets
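The per-record compression scheme can be demonstrated with nothing but the standard library: each record is its own gzip member, so a reader that knows a record's byte offset and length can decompress it in isolation (an illustrative sketch, not Common Crawl code):

```python
import gzip

# Two records, each compressed as its own gzip member, then concatenated:
# the same layout a .warc.gz file uses.
rec1 = gzip.compress(b"WARC record one\r\n\r\n")
rec2 = gzip.compress(b"WARC record two\r\n\r\n")
archive = rec1 + rec2

# Random access: with the offset and length of record two (as provided by
# the URL index), it can be decompressed without touching record one.
offset, length = len(rec1), len(rec2)
member = archive[offset:offset + length]
assert gzip.decompress(member) == b"WARC record two\r\n\r\n"
```

This is exactly why a ranged S3 GET of `offset` to `offset + length - 1` is enough to fetch and decode a single page out of a multi-gigabyte WARC file.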
• 8. URL index
  • The URL index allows querying by URL and:
    • Metadata (HTTP status, capture time, MIME type, content language)
    • Get the WARC file path, record offset, and length, and
    • Retrieve WARC records from Amazon S3
  • CDX index: https://index.commoncrawl.org/
    • Look up URLs (also by domain or prefix)
    • Memento / Wayback Machine
  • Columnar index on Amazon S3 (Parquet format)
    • SQL analytics with Amazon Athena, Spark, Hive, ...
    • Supports the “vertical” use case: process web pages of
      • a single domain or a single content language
      • URLs matching a pattern
      • ...
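As a sketch of how the CDX service might be queried programmatically: the API returns one JSON object per matching capture, and each object carries the WARC filename, offset, and length needed for a ranged S3 request. The sample response line below is fabricated for illustration; the collection name in `CDX_API` is one example crawl:

```python
import json
from urllib.parse import urlencode

# One index endpoint per crawl; CC-MAIN-2018-43 is used as an example.
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2018-43-index"

def cdx_query_url(url_pattern, **params):
    """Build a CDX API query URL (the response has one JSON object per line)."""
    params.update({"url": url_pattern, "output": "json"})
    return CDX_API + "?" + urlencode(params)

# A fabricated response line with the fields relevant for record retrieval:
line = ('{"urlkey": "is,example)/", "timestamp": "20181017000000", '
        '"url": "https://example.is/", "status": "200", '
        '"filename": "crawl-data/CC-MAIN-2018-43/segments/example.warc.gz", '
        '"offset": "1234", "length": "5678"}')
rec = json.loads(line)

# Translate offset/length into an HTTP Range header for S3.
byte_range = "bytes={}-{}".format(
    int(rec["offset"]), int(rec["offset"]) + int(rec["length"]) - 1)
# byte_range == "bytes=1234-6911"
```

The same `filename`/`offset`/`length` triple is what the columnar index exposes as `warc_filename`, `warc_record_offset`, and `warc_record_length` in the Spark example that follows.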
• 9. Get help from Common Crawl and our community
  • Links to the data sets and a short description of formats and tools: https://commoncrawl.org/the-data/get-started/
  • Examples and tutorials from our community: https://commoncrawl.org/the-data/examples/ and https://commoncrawl.org/the-data/tutorials/
  • Tools and examples maintained by Common Crawl: https://github.com/commoncrawl/
  • Developer group: https://github.com/commoncrawl/
• 10. Example use case: Icelandic Word Count

  from pyspark.sql import SparkSession

  session = SparkSession.builder.getOrCreate()
  df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
  df.createOrReplaceTempView('ccindex')
  sqldf = session.sql(
      "SELECT url, warc_filename, warc_record_offset, warc_record_length "
      "FROM ccindex "
      "WHERE crawl = 'CC-MAIN-2018-43' AND subset = 'warc' "
      "  AND content_languages = 'isl'")

  # alternatively, load the result of a SQL query run by Amazon Athena
  sqldf = session.read.format("csv").option("header", True) \
      .option("inferSchema", True).load(".../path/to/result.csv")
• 11. Example use case: Icelandic Word Count

  warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                           "warc_record_length").rdd
  word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
      .reduceByKey(lambda a, b: a + b)

  # imports to fetch and process WARC records
  import re
  from io import BytesIO

  import boto3
  import botocore
  from warcio.archiveiterator import ArchiveIterator
  from warcio.recordloader import ArchiveLoadFailed

  # simple Unicode-aware word tokenization (not suitable for CJK languages)
  word_pattern = re.compile(r'\w+', re.UNICODE)
• 12. Example use case: Icelandic Word Count

  def fetch_process_warc_records(rows):
      s3client = boto3.client('s3')
      for row in rows:
          url = row['url']
          warc_path = row['warc_filename']
          offset = int(row['warc_record_offset'])
          length = int(row['warc_record_length'])
          rangereq = 'bytes={}-{}'.format(offset, offset + length - 1)
          response = s3client.get_object(Bucket='commoncrawl',
                                         Key=warc_path, Range=rangereq)
          record_stream = BytesIO(response["Body"].read())
          for record in ArchiveIterator(record_stream):
              page = record.content_stream().read()
              text = html_to_text(page)
              words = map(lambda w: w.lower(), word_pattern.findall(text))
              for word in words:
                  yield word, 1
• 13. Example use case: Icelandic Word Count

  # get text from HTML
  from bs4 import BeautifulSoup
  from bs4.dammit import EncodingDetector

  def html_to_text(page):
      try:
          encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
          soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
          for script in soup(["script", "style"]):
              script.extract()
          return soup.get_text(" ", strip=True)
      except:
          return ""
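To see what the pipeline computes without a cluster, the `reduceByKey`-and-rank step can be mimicked locally with the standard library (a stand-in sketch, not part of the original Spark job; the sample pairs are made up):

```python
from collections import Counter

# (word, 1) pairs as fetch_process_warc_records would emit them
pairs = [("og", 1), ("í", 1), ("og", 1), ("að", 1), ("og", 1)]

# Equivalent of reduceByKey(lambda a, b: a + b): sum counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Rank by frequency, highest first
top = counts.most_common(2)
```

On the real data this ranking produces the frequency table shown on the next slide.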
• 14. Example use case: Icelandic Word Count

  # most frequent words in about 1 million Icelandic web pages
  22994307 og
  19802034 í
  15765245 að
  15724978 á
   8290840 er
   8088372 um
   6578254 sem
   6313945 til
   5264266 við
   4872877 1
   4432790 með
   4423975 fyrir
   3893682 2018
   3817965 2
   3308209 ekki
   3205015 is
   3165578 af
   3051413 en
• 15. Thank you!