© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Public Data Sets: How to Stage
Petabytes of Data for Analysis in AWS
WPS326
Jed Sundwall
Manager, AWS Open Data
Amazon Web Services
Sebastian Nagel
Crawl Engineer
Common Crawl
Dave Rocamora
Solutions Architect
Amazon Web Services
About Common Crawl
• We’re a non-profit that makes web data accessible to anyone
• Lower the barrier to “the web as a data set”
• Running your own large-scale web crawl is expensive and challenging
• Innovation occurs by using the data, rarely through new collection methods
• 10 years of web data
• Common Crawl founded in 2007 by Gil Elbaz
• First crawl 2008
• Since 2012 as public data set on AWS
• 3 Petabytes of web data on s3://commoncrawl/
10 years, 3 petabytes of web data
• Monthly “snapshots” in recent years
• Broad sample crawls covering about 3 billion pages every month
• Partially overlapping with previous months
• HTML pages only (small percentage of other document formats)
• Released for free without additional intellectual property restrictions
• Used for natural language processing, data mining, information
retrieval, web science, market research, linked data and the semantic web,
internet security, graph processing, benchmarking and testing ...
Diverse community
• Research and education vs. company or startup
• Not (yet) familiar with AWS vs. already on AWS
• Available time and budget
• Web archiving vs. big data community
• Conservative or open regarding data formats and processing tools
• Diverse use cases prefer different data formats
• Natural language processing: HTML converted to (annotated) plain text
• Internet security: binary payload plus HTTP headers
Data formats
• WARC (Web ARChive)
• HTML content + HTTP header + capture metadata
• Every WARC file is a random sample on its own
• Secondary formats
• Preprocessed and smaller in size
• WAT: page metadata and outgoing links as JSON
• WET: plain text with little metadata
• URL index to WARC files
• As a service – https://index.commoncrawl.org/
• Columnar index (Parquet format)
WARC (Web ARChive) in detail
• WARC is a sustainable format
• ISO standard since 2009
• Wrappers for many programming languages (Python, Java, Go, R, PHP …)
• It’s easy to read
<WARC record metadata: URL, capture date and time, IP address, checksum,
record length>
\r\n\r\n
<HTTP header>
\r\n\r\n
<payload: HTML or binary (PDF, JPEG, etc.)>
\r\n\r\n
• Random-access .warc.gz
• Each record compressed separately (the gzip spec allows concatenated members)
• Only about 10% larger than compressing the whole file as one stream
• Random access via indexed record offsets
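The member-per-record layout described above can be illustrated with the Python standard library alone. This is a toy sketch of the idea, not Common Crawl's code: two "records" are gzipped independently and concatenated, and either one can then be decompressed in isolation given its offset and length.

```python
import gzip

# Two toy "records", each gzipped independently, then concatenated --
# the same member-per-record layout that .warc.gz files use.
rec1 = gzip.compress(b"WARC/1.0\r\nrecord one\r\n\r\n")
rec2 = gzip.compress(b"WARC/1.0\r\nrecord two\r\n\r\n")
warc_gz = rec1 + rec2

# Given an index of (offset, length) pairs, any single record can be
# decompressed on its own -- no need to read the whole file.
offset, length = len(rec1), len(rec2)
member = warc_gz[offset:offset + length]
assert gzip.decompress(member) == b"WARC/1.0\r\nrecord two\r\n\r\n"

# Decompressing the whole file still works, because gzip readers
# process concatenated members transparently.
assert gzip.decompress(warc_gz).count(b"WARC/1.0") == 2
```

This is exactly why a byte-range request against a `.warc.gz` file on Amazon S3, paired with an offset index, is enough to fetch a single page.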
URL index
• The URL index lets you query by URL and:
• Metadata (HTTP status, capture time, MIME type, content language)
• Get the WARC file path, record offset, and length
• Retrieve WARC records from Amazon S3
• CDX index https://index.commoncrawl.org/
• Look up URLs (also by domain or prefix)
• Memento / Wayback Machine–style lookups
• Columnar index on Amazon S3 (Parquet format)
• SQL analytics with Amazon Athena, Spark, Hive …
• Support the “vertical” use case – process web pages of:
• A single domain or a single content language
• URLs matching a pattern
• ...
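As a sketch of how the index service is queried: the public CDX API at index.commoncrawl.org takes a collection name and a URL pattern. The snippet below only builds the query URL (no request is made); the collection name is the one used later in this deck.

```python
from urllib.parse import urlencode

# Build a CDX API query for all captures of a domain in one crawl.
# CC-MAIN-2018-43 is the collection queried later in this deck.
endpoint = "https://index.commoncrawl.org/CC-MAIN-2018-43-index"
params = {"url": "commoncrawl.org/*", "output": "json"}
query_url = endpoint + "?" + urlencode(params)
print(query_url)

# Each JSON line in the response includes 'filename', 'offset', and
# 'length' -- exactly what a ranged GET against s3://commoncrawl/ needs.
```

The same three fields come back from the columnar (Parquet) index, which is what the Spark example later in this deck uses.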
Get help from Common Crawl and our community
• Links to data sets and a short description of formats and tools
https://commoncrawl.org/the-data/get-started/
• Examples and tutorials from our community
https://commoncrawl.org/the-data/examples/
https://commoncrawl.org/the-data/tutorials/
• Tools and examples maintained by Common Crawl
https://github.com/commoncrawl/
• Developer group
https://github.com/commoncrawl/
Example use case: Icelandic Word Count
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
df.createOrReplaceTempView('ccindex')
sqldf = session.sql("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2018-43'
      AND subset = 'warc'
      AND content_languages = 'isl'""")

# alternatively, load the result of a SQL query run by Amazon Athena
sqldf = session.read.format("csv").option("header", True) \
    .option("inferSchema", True).load(".../path/to/result.csv")
Example use case: Icelandic Word Count
warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                         "warc_record_length").rdd
word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
    .reduceByKey(lambda a, b: a + b)

# imports to fetch and process WARC records
import re
import boto3
import botocore
from io import BytesIO
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed

# simple Unicode-aware word tokenization (not suitable for CJK languages)
word_pattern = re.compile(r'\w+', re.UNICODE)
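A quick check of this tokenization pattern on a made-up Icelandic phrase (standard library only): `\w+` with `re.UNICODE` matches letters across scripts, so accented Icelandic words stay whole.

```python
import re

# Unicode-aware tokenizer, as used in the word-count example.
word_pattern = re.compile(r'\w+', re.UNICODE)

text = "Hún er að lesa bók"
words = [w.lower() for w in word_pattern.findall(text)]
print(words)  # ['hún', 'er', 'að', 'lesa', 'bók']
```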
Example use case: Icelandic Word Count
def fetch_process_warc_records(rows):
    s3client = boto3.client('s3')
    for row in rows:
        url = row['url']
        warc_path = row['warc_filename']
        offset = int(row['warc_record_offset'])
        length = int(row['warc_record_length'])
        rangereq = 'bytes={}-{}'.format(offset, offset + length - 1)
        response = s3client.get_object(Bucket='commoncrawl',
                                       Key=warc_path,
                                       Range=rangereq)
        record_stream = BytesIO(response["Body"].read())
        for record in ArchiveIterator(record_stream):
            page = record.content_stream().read()
            text = html_to_text(page)
            words = map(lambda w: w.lower(), word_pattern.findall(text))
            for word in words:
                yield word, 1
Example use case: Icelandic Word Count
# get text from HTML
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

def html_to_text(page):
    try:
        encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
        soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
        for script in soup(["script", "style"]):
            script.extract()
        return soup.get_text(" ", strip=True)
    except Exception:
        return ""
Example use case: Icelandic Word Count
# most frequent words in about 1 million Icelandic web pages
22994307 og
19802034 í
15765245 að
15724978 á
8290840 er
8088372 um
6578254 sem
6313945 til
5264266 við
4872877 1
4432790 með
4423975 fyrir
3893682 2018
3817965 2
3308209 ekki
3205015 is
3165578 af
3051413 en
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Chris Bizer
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
Primal Pappachan
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Yongho Ha
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
alexbaranau
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
Jeff Klukas
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developers
Patrick Savalle
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Scaling Credible Content
Scaling Credible ContentScaling Credible Content
Scaling Credible Content
Joe Griffin
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introduction
Dinesh K
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
Flink Forward
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 

What's hot (20)

Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developers
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Scaling Credible Content
Scaling Credible ContentScaling Credible Content
Scaling Credible Content
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introduction
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Amazon Web Services
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
Amazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
Amazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Amazon Web Services
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Amazon Web Services
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
Cloudera, Inc.
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018 (20)

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

Data formats
• WARC (Web ARChive)
  • HTML content + HTTP header + capture metadata
  • Every WARC file is a random sample on its own
• Secondary formats
  • Preprocessed and smaller in size
  • WAT: page metadata and outgoing links as JSON
  • WET: plain text with little metadata
• URL index to WARC files
  • As a service – https://index.commoncrawl.org/
  • Columnar index (Parquet format)
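The WAT records mentioned above carry page metadata and outgoing links as JSON. A minimal sketch of pulling the link targets out of one record, using a synthetic payload; the Envelope → Payload-Metadata → HTTP-Response-Metadata → HTML-Metadata nesting follows the commonly documented WAT layout, but verify the field names against a real record before relying on them:

```python
import json

# A tiny synthetic WAT payload; the nesting mirrors the documented WAT
# layout (an assumption here -- check a real record from the data set).
wat_json = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/about"},
                        {"path": "A@/href", "url": "http://example.org/"},
                    ]
                }
            }
        }
    }
})

def outgoing_links(wat_record):
    """Return the list of link targets found in one WAT JSON record."""
    record = json.loads(wat_record)
    html_meta = (record.get("Envelope", {})
                       .get("Payload-Metadata", {})
                       .get("HTTP-Response-Metadata", {})
                       .get("HTML-Metadata", {}))
    return [link["url"] for link in html_meta.get("Links", []) if "url" in link]

print(outgoing_links(wat_json))
```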
WARC (Web ARChive) in detail
• WARC is a sustainable format
  • ISO standard since 2009
  • Wrappers for many programming languages (Python, Java, Go, R, PHP …)
• It’s easy to read:
  <WARC record metadata: URL, capture date and time, IP address, checksum, record length>\r\n\r\n
  <HTTP header>\r\n\r\n
  <payload: HTML or binary (PDF, JPEG, etc.)>\r\n\r\n
• Random-access .warc.gz
  • Compression per record (the gzip spec allows multiple deflate blocks)
  • About 10% larger than compressing the whole file as one gzip stream
  • Random access with indexed record offsets
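The per-record compression above works because the gzip format allows several compressed members to be concatenated into one file. A small stdlib demonstration of why a .warc.gz can be both read end-to-end and entered at a known record offset (the record bytes here are placeholders, not real WARC records):

```python
import gzip

# Two "records" compressed as separate gzip members, then concatenated --
# the same trick .warc.gz files use so each record can be fetched and
# decompressed independently via its byte offset.
rec1 = gzip.compress(b"WARC/1.0\r\n...record one...\r\n\r\n")
rec2 = gzip.compress(b"WARC/1.0\r\n...record two...\r\n\r\n")
concatenated = rec1 + rec2

# Decompressing the whole file yields both records back to back ...
assert gzip.decompress(concatenated) == gzip.decompress(rec1) + gzip.decompress(rec2)

# ... while a reader that knows the offset of the second member can skip
# straight to it and decompress only that record.
offset = len(rec1)
assert gzip.decompress(concatenated[offset:]) == b"WARC/1.0\r\n...record two...\r\n\r\n"
```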
URL index
• The URL index allows you to query by URL and:
  • Metadata (HTTP status, capture time, MIME type, content language)
  • Get the WARC file path, record offset, and length, and
  • Retrieve WARC records from Amazon S3
• CDX index https://index.commoncrawl.org/
  • Look up URLs (also by domain or prefix)
  • Memento / Wayback Machine
• Columnar index on Amazon S3 (Parquet format)
  • SQL analytics with Amazon Athena, Spark, Hive …
  • Supports the “vertical” use case – process web pages of:
    • A single domain or a single content language
    • URLs matching a pattern
    • ...
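The offset and length returned by the index map directly onto an HTTP Range header, so a single WARC record can be fetched with one ranged GET against the commoncrawl bucket. A minimal sketch of that mapping; the boto3 call itself is left commented out since it needs AWS access, and `warc_filename`/`offset`/`length` stand in for fields from an index lookup:

```python
def record_range(offset, length):
    """HTTP Range header fetching exactly one WARC record (bounds are inclusive)."""
    return 'bytes={}-{}'.format(offset, offset + length - 1)

# e.g. a record of 1,024 bytes starting at byte offset 4,096:
print(record_range(4096, 1024))   # bytes=4096-5119

# The ranged request itself would then look like (not executed here):
# s3.get_object(Bucket='commoncrawl', Key=warc_filename,
#               Range=record_range(offset, length))
```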
Get help from Common Crawl and our community
• Links to data sets and a short description of formats and tools
  https://commoncrawl.org/the-data/get-started/
• Examples and tutorials from our community
  https://commoncrawl.org/the-data/examples/
  https://commoncrawl.org/the-data/tutorials/
• Tools and examples maintained by Common Crawl
  https://github.com/commoncrawl/
• Developer group
  https://github.com/commoncrawl/
Example use case: Icelandic Word Count

    from pyspark.sql import SparkSession

    session = SparkSession.builder.getOrCreate()
    df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
    df.createOrReplaceTempView('ccindex')
    sqldf = session.sql(
        "SELECT url, warc_filename, warc_record_offset, warc_record_length "
        "FROM ccindex "
        "WHERE crawl = 'CC-MAIN-2018-43' "
        "  AND subset = 'warc' "
        "  AND content_languages = 'isl'")

    # alternatively, load the result of a SQL query run by Amazon Athena
    sqldf = session.read.format("csv").option("header", True) \
        .option("inferSchema", True).load(".../path/to/result.csv")
    # imports to fetch and process WARC records
    import re

    import boto3
    import botocore
    from warcio.archiveiterator import ArchiveIterator
    from warcio.recordloader import ArchiveLoadFailed

    # simple Unicode-aware word tokenization (not suitable for CJK languages)
    word_pattern = re.compile(r'\w+', re.UNICODE)

    warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                             "warc_record_length").rdd
    word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
                           .reduceByKey(lambda a, b: a + b)
    from io import BytesIO

    def fetch_process_warc_records(rows):
        # fetch each WARC record with an S3 range request, extract its text,
        # and emit (word, 1) pairs for the subsequent reduceByKey
        s3client = boto3.client('s3')
        for row in rows:
            url = row['url']
            warc_path = row['warc_filename']
            offset = int(row['warc_record_offset'])
            length = int(row['warc_record_length'])
            rangereq = 'bytes={}-{}'.format(offset, offset + length - 1)
            response = s3client.get_object(Bucket='commoncrawl',
                                           Key=warc_path,
                                           Range=rangereq)
            record_stream = BytesIO(response["Body"].read())
            for record in ArchiveIterator(record_stream):
                page = record.content_stream().read()
                text = html_to_text(page)
                words = map(lambda w: w.lower(), word_pattern.findall(text))
                for word in words:
                    yield word, 1
    # get text from HTML
    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector

    def html_to_text(page):
        try:
            encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
            soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
            for script in soup(["script", "style"]):
                script.extract()
            return soup.get_text(" ", strip=True)
        except Exception:
            return ""
    # most frequent words in about 1 million Icelandic web pages
    22994307  og
    19802034  í
    15765245  að
    15724978  á
     8290840  er
     8088372  um
     6578254  sem
     6313945  til
     5264266  við
     4872877  1
     4432790  með
     4423975  fyrir
     3893682  2018
     3817965  2
     3308209  ekki
     3205015  is
     3165578  af
     3051413  en
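The word counter above relies on a simple Unicode-aware `\w+` tokenizer. A quick standalone check that it keeps Icelandic characters inside a single token (the sample sentence is illustrative, not from the data set):

```python
import re

# Same tokenization as the word-count job: \w+ under re.UNICODE, lowercased.
# Accented and Icelandic letters (Þ, ð, á, ...) count as word characters,
# so they stay inside one token; CJK text would need a real segmenter.
word_pattern = re.compile(r'\w+', re.UNICODE)

tokens = [w.lower() for w in word_pattern.findall('Það er gott að lesa á íslensku!')]
print(tokens)
```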
Thank you!