© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Public Data Sets: How to Stage
Petabytes of Data for Analysis in AWS
WPS326
Jed Sundwall
Manager, AWS Open Data
Amazon Web Services
Sebastian Nagel
Crawl Engineer
Common Crawl
Dave Rocamora
Solutions Architect
Amazon Web Services
About Common Crawl
• We’re a non-profit that makes web data accessible to anyone
• Lower the barrier to “the web as a data set”
• Running your own large-scale web crawl is expensive and challenging
• Innovation occurs by using the data, rarely through new collection methods
• 10 years of web data
• Common Crawl founded in 2007 by Gil Elbaz
• First crawl 2008
• Since 2012 as public data set on AWS
• 3 Petabytes of web data on s3://commoncrawl/
10 years or 3 petabytes of web data
• Monthly “snapshots” over the last few years
• Broad sample crawls covering about 3 billion pages every month
• Partially overlapping with previous months
• HTML pages only (small percentage of other document formats)
• Released for free without additional intellectual property restrictions
• Used for natural language processing, data mining, information
retrieval, web science, market research, linked data and semantic web,
internet security, graph processing, benchmark and test ...
Diverse community
• Research and education vs. company or startup
• Not (yet) familiar with AWS vs. already on AWS
• Available time and budget
• Web archiving vs. big data community
• Conservative or open regarding data formats and processing tools
• Diverse use cases prefer different data formats
• Natural language processing: HTML converted to (annotated) plain text
• Internet security: binary payload plus HTTP headers
Data formats
• WARC (Web ARChive)
• HTML content + HTTP header + capture metadata
• Every WARC file is a random sample on its own
• Secondary formats
• Preprocessed and smaller in size
• WAT: page metadata and outgoing links as JSON
• WET: plain text with little metadata
• URL index to WARC files
• As a service – https://index.commoncrawl.org/
• Columnar index (Parquet format)
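The secondary files sit alongside the WARC files in the bucket. As a rough sketch, a WAT or WET path can be derived from its WARC path; note the naming convention below is inferred from the public `*.paths.gz` listings (and the example segment path is illustrative), so verify it against the listings for the crawl you use:

```python
# Derive the WAT/WET path for a given WARC path.
# NOTE: naming convention inferred from Common Crawl's path listings
# (warc.paths.gz, wat.paths.gz, wet.paths.gz); may change between crawls.
def warc_to_secondary(warc_path, fmt):
    assert fmt in ('wat', 'wet')
    return (warc_path
            .replace('/warc/', '/{}/'.format(fmt))
            .replace('.warc.gz', '.warc.{}.gz'.format(fmt)))

# hypothetical example path, for illustration only
warc = ('crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/'
        'warc/CC-MAIN-20181015080248-20181015101748-00000.warc.gz')
print(warc_to_secondary(warc, 'wet'))
```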
WARC (Web ARChive) in detail
• WARC is a sustainable format
• ISO standard since 2009
• Wrappers for many programming languages (Python, Java, Go, R, PHP …)
• It’s easy to read
<WARC record metadata: URL, capture date and time, IP address, checksum,
record-length>
\r\n\r\n
<HTTP header>
\r\n\r\n
<payload: HTML or binary (PDF, JPEG, etc.)>
\r\n\r\n
• Random-access .warc.gz
• Compression per record (gzip spec allows multiple deflate blocks)
• About 10% larger than whole-file gzip compression
• Random access with indexed record offsets
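The per-record compression can be illustrated with nothing but the standard library: each record is its own gzip member, members concatenate into one valid stream, and any single record can be decompressed from its byte offset. A minimal sketch with toy payloads, not real WARC data:

```python
import gzip

# Compress two "records" as independent gzip members, like .warc.gz does.
rec1 = gzip.compress(b"WARC record one\r\n\r\n")
rec2 = gzip.compress(b"WARC record two\r\n\r\n")
archive = rec1 + rec2

# The concatenation is still one valid gzip stream ...
assert gzip.decompress(archive) == b"WARC record one\r\n\r\nWARC record two\r\n\r\n"

# ... and with an indexed (offset, length) a single record can be read
# without touching the rest of the file.
offset, length = len(rec1), len(rec2)
assert gzip.decompress(archive[offset:offset + length]) == b"WARC record two\r\n\r\n"
```

This is exactly the property the URL index exploits: an S3 range request for `(offset, length)` returns one decompressable record.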
URL index
• The URL index lets you query by URL to:
• Filter by metadata (HTTP status, capture time, MIME type, content language)
• Get the WARC file path, record offset, and length, and
• Retrieve WARC records from Amazon S3
• CDX index https://index.commoncrawl.org/
• Look up URLs (also by domain or prefix)
• Memento / Wayback Machine
• Columnar index on Amazon S3 (Parquet format)
• SQL analytics with Amazon Athena, Spark, Hive …
• Support the “vertical” use case – process web pages of:
• A single domain or a single content language
• URLs matching a pattern
• ...
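A CDX index lookup returns, per capture, the fields needed for an S3 range request. A sketch of parsing one response line; the line below is illustrative, not a real response, and the exact field set varies by crawl:

```python
import json

# One JSON line as returned by e.g.
#   https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=example.com&output=json
# (illustrative values; field set assumed from typical index server output)
line = ('{"urlkey": "com,example)/", "timestamp": "20181015080248", '
        '"url": "https://example.com/", "status": "200", '
        '"filename": "crawl-data/CC-MAIN-2018-43/segments/.../warc/'
        'CC-MAIN-...-00000.warc.gz", '
        '"offset": "1234567", "length": "5432"}')

hit = json.loads(line)
# filename, offset, and length are all an S3 range request needs:
byte_range = 'bytes={}-{}'.format(int(hit['offset']),
                                  int(hit['offset']) + int(hit['length']) - 1)
print(hit['filename'], byte_range)
```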
Get help from Common Crawl and our community
• Links to data sets and a short description of formats and tools
https://commoncrawl.org/the-data/get-started/
• Examples and tutorials from our community
https://commoncrawl.org/the-data/examples/
https://commoncrawl.org/the-data/tutorials/
• Tools and examples maintained by Common Crawl
https://github.com/commoncrawl/
• Developer group
https://github.com/commoncrawl/
Example use case: Icelandic Word Count
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
df.createOrReplaceTempView('ccindex')
sqldf = session.sql("""SELECT url, warc_filename, warc_record_offset,
                              warc_record_length
                       FROM ccindex
                       WHERE crawl = 'CC-MAIN-2018-43'
                         AND subset = 'warc'
                         AND content_languages = 'isl'""")
# alternatively, load the result of a SQL query run in Amazon Athena
sqldf = session.read.format("csv").option("header", True) \
    .option("inferSchema", True).load(".../path/to/result.csv")
Example use case: Icelandic Word Count
warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                         "warc_record_length").rdd
word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
    .reduceByKey(lambda a, b: a + b)

# imports to fetch and process WARC records
import re
from io import BytesIO
import boto3
import botocore
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed

# simple Unicode-aware word tokenization (not suitable for CJK languages)
word_pattern = re.compile(r'\w+', re.UNICODE)
Example use case: Icelandic Word Count
def fetch_process_warc_records(rows):
    s3client = boto3.client('s3')
    for row in rows:
        url = row['url']
        warc_path = row['warc_filename']
        offset = int(row['warc_record_offset'])
        length = int(row['warc_record_length'])
        rangereq = 'bytes={}-{}'.format(offset, (offset + length - 1))
        response = s3client.get_object(Bucket='commoncrawl',
                                       Key=warc_path,
                                       Range=rangereq)
        record_stream = BytesIO(response["Body"].read())
        for record in ArchiveIterator(record_stream):
            page = record.content_stream().read()
            text = html_to_text(page)
            words = map(lambda w: w.lower(), word_pattern.findall(text))
            for word in words:
                yield word, 1
Example use case: Icelandic Word Count
# get text from HTML
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
def html_to_text(page):
    try:
        encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
        soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
        for script in soup(["script", "style"]):
            script.extract()
        return soup.get_text(" ", strip=True)
    except Exception:
        return ""
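For quick experiments without BeautifulSoup, a rough standard-library-only equivalent is sketched below; it is far less robust than the bs4 version (no encoding detection, no malformed-HTML recovery) and is shown only to illustrate the script/style stripping:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text_stdlib(page):
    parser = TextExtractor()
    parser.feed(page)
    return ' '.join(parser.chunks)

print(html_to_text_stdlib('<p>Halló <b>heimur</b></p><script>x=1</script>'))
# → Halló heimur
```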
Example use case: Icelandic Word Count
# most frequent words in about 1 million Icelandic web pages
22994307 og
19802034 í
15765245 að
15724978 á
8290840 er
8088372 um
6578254 sem
6313945 til
5264266 við
4872877 1
4432790 með
4423975 fyrir
3893682 2018
3817965 2
3308209 ekki
3205015 is
3165578 af
3051413 en
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Miningsathish sak
 
Embedding Data & Analytics With Looker
Embedding Data & Analytics With LookerEmbedding Data & Analytics With Looker
Embedding Data & Analytics With LookerLooker
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkLuiz Henrique Zambom Santana
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Website Analysis Seo Report
Website Analysis Seo ReportWebsite Analysis Seo Report
Website Analysis Seo ReportSEO Google Guru
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of SearchBill Slawski
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsNeo4j
 

What's hot (20)

On Page SEO
On Page SEOOn Page SEO
On Page SEO
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Big query
Big queryBig query
Big query
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Mining
 
Web mining
Web miningWeb mining
Web mining
 
Embedding Data & Analytics With Looker
Embedding Data & Analytics With LookerEmbedding Data & Analytics With Looker
Embedding Data & Analytics With Looker
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with Spark
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Website Analysis Seo Report
Website Analysis Seo ReportWebsite Analysis Seo Report
Website Analysis Seo Report
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of Search
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of Graphs
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Amazon Web Services
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumAmazon Web Services
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczAmazon Web Services
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 

Similar to AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018 (20)

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
Database Freedom. Database migration approaches to get to the Cloud - Marcus ...
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS W P S 3 2 6 Jed Sundwall Manager, AWS Open Data Amazon Web Services Sebastian Nagel Crawl Engineer Common Crawl Dave Rocamora Solutions Architect Amazon Web Services
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. About Common Crawl • We’re a non-profit that makes web data accessible to anyone • Lower the barrier to “the web as a data set” • Running your own large-scale web crawl is expensive and challenging • Innovation occurs by using the data, rarely through new collection methods • 10 years of web data • Common Crawl founded in 2007 by Gil Elbaz • First crawl 2008 • Since 2012 as public data set on AWS • 3 Petabytes of web data on s3://commoncrawl/
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. 10 Years or 3 petabytes of web data • Monthly “snapshots” during the last years • Broad sample crawls covering about 3 billion pages every month • Partially overlapping with previous months • HTML pages only (small percentage of other document formats) • Released for free without additional intellectual property restrictions • Used for natural language processing, data mining, information retrieval, web science, market research, linked data and semantic web, internet security, graph processing, benchmark and test ...
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Diverse community • Research and education vs. company or startup • Not (yet) familiar with AWS vs. already on AWS • Available time and budget • Web archiving vs. big data community • Conservative or open regarding data formats and processing tools • Diverse use cases prefer different data formats • Natural language processing: HTML converted to (annotated) plain text • Internet security: binary payload plus HTTP headers
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data formats • WARC (Web ARChive) • HTML content + HTTP header + capture metadata • Every WARC file is a random sample by its own • Secondary formats • Preprocessed and smaller in size • WAT: page metadata and outgoing links as JSON • WET: plain text with little metadata • URL index to WARC files • As a service – https://index.commoncrawl.org/ • Columnar index (Parquet format)
• 7. WARC (Web ARChive) in detail
  • WARC is a sustainable format
    • ISO standard since 2009
    • Wrappers for many programming languages (Python, Java, Go, R, PHP, ...)
    • It’s easy to read:
      <WARC record metadata: URL, capture date and time, IP address, checksum, record length> \r\n\r\n
      <HTTP header> \r\n\r\n
      <payload: HTML or binary (PDF, JPEG, etc.)> \r\n\r\n
  • Random-access .warc.gz
    • Compression per record (the gzip spec allows multiple deflate blocks)
    • About 10% larger than a fully gzipped file
    • Random access with indexed record offsets
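The per-record compression scheme can be demonstrated with nothing but the standard library: each record is its own gzip member, so a reader that knows a record's byte offset and length can decompress it in isolation (an illustrative sketch, not Common Crawl code):

```python
import gzip

# Two records, each compressed as its own gzip member, then concatenated:
# the same layout a .warc.gz file uses.
rec1 = gzip.compress(b"WARC record one\r\n\r\n")
rec2 = gzip.compress(b"WARC record two\r\n\r\n")
archive = rec1 + rec2

# Random access: with the offset and length of record two (as provided by
# the URL index), it can be decompressed without touching record one.
offset, length = len(rec1), len(rec2)
member = archive[offset:offset + length]
assert gzip.decompress(member) == b"WARC record two\r\n\r\n"
```

This is exactly why a ranged S3 GET of `offset` to `offset + length - 1` is enough to fetch and decode a single page out of a multi-gigabyte WARC file.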
• 8. URL index
  • The URL index allows querying by URL and:
    • Metadata (HTTP status, capture time, MIME type, content language)
    • Get the WARC file path, record offset, and length, and
    • Retrieve WARC records from Amazon S3
  • CDX index: https://index.commoncrawl.org/
    • Look up URLs (also by domain or prefix)
    • Memento / Wayback Machine
  • Columnar index on Amazon S3 (Parquet format)
    • SQL analytics with Amazon Athena, Spark, Hive, ...
    • Supports the “vertical” use case: process web pages of
      • a single domain or a single content language
      • URLs matching a pattern
      • ...
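As a sketch of how the CDX service might be queried programmatically: the API returns one JSON object per matching capture, and each object carries the WARC filename, offset, and length needed for a ranged S3 request. The sample response line below is fabricated for illustration; the collection name in `CDX_API` is one example crawl:

```python
import json
from urllib.parse import urlencode

# One index endpoint per crawl; CC-MAIN-2018-43 is used as an example.
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2018-43-index"

def cdx_query_url(url_pattern, **params):
    """Build a CDX API query URL (the response has one JSON object per line)."""
    params.update({"url": url_pattern, "output": "json"})
    return CDX_API + "?" + urlencode(params)

# A fabricated response line with the fields relevant for record retrieval:
line = ('{"urlkey": "is,example)/", "timestamp": "20181017000000", '
        '"url": "https://example.is/", "status": "200", '
        '"filename": "crawl-data/CC-MAIN-2018-43/segments/example.warc.gz", '
        '"offset": "1234", "length": "5678"}')
rec = json.loads(line)

# Translate offset/length into an HTTP Range header for S3.
byte_range = "bytes={}-{}".format(
    int(rec["offset"]), int(rec["offset"]) + int(rec["length"]) - 1)
# byte_range == "bytes=1234-6911"
```

The same `filename`/`offset`/`length` triple is what the columnar index exposes as `warc_filename`, `warc_record_offset`, and `warc_record_length` in the Spark example that follows.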
• 9. Get help from Common Crawl and our community
  • Links to the data sets and a short description of formats and tools: https://commoncrawl.org/the-data/get-started/
  • Examples and tutorials from our community: https://commoncrawl.org/the-data/examples/ and https://commoncrawl.org/the-data/tutorials/
  • Tools and examples maintained by Common Crawl: https://github.com/commoncrawl/
  • Developer group: https://github.com/commoncrawl/
• 10. Example use case: Icelandic Word Count

  from pyspark.sql import SparkSession

  session = SparkSession.builder.getOrCreate()
  df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
  df.createOrReplaceTempView('ccindex')
  sqldf = session.sql(
      "SELECT url, warc_filename, warc_record_offset, warc_record_length "
      "FROM ccindex "
      "WHERE crawl = 'CC-MAIN-2018-43' AND subset = 'warc' "
      "  AND content_languages = 'isl'")

  # alternatively, load the result of a SQL query run by Amazon Athena
  sqldf = session.read.format("csv").option("header", True) \
      .option("inferSchema", True).load(".../path/to/result.csv")
• 11. Example use case: Icelandic Word Count

  warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                           "warc_record_length").rdd
  word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
      .reduceByKey(lambda a, b: a + b)

  # imports to fetch and process WARC records
  import re
  from io import BytesIO

  import boto3
  import botocore
  from warcio.archiveiterator import ArchiveIterator
  from warcio.recordloader import ArchiveLoadFailed

  # simple Unicode-aware word tokenization (not suitable for CJK languages)
  word_pattern = re.compile(r'\w+', re.UNICODE)
• 12. Example use case: Icelandic Word Count

  def fetch_process_warc_records(rows):
      s3client = boto3.client('s3')
      for row in rows:
          url = row['url']
          warc_path = row['warc_filename']
          offset = int(row['warc_record_offset'])
          length = int(row['warc_record_length'])
          rangereq = 'bytes={}-{}'.format(offset, offset + length - 1)
          response = s3client.get_object(Bucket='commoncrawl',
                                         Key=warc_path, Range=rangereq)
          record_stream = BytesIO(response["Body"].read())
          for record in ArchiveIterator(record_stream):
              page = record.content_stream().read()
              text = html_to_text(page)
              words = map(lambda w: w.lower(), word_pattern.findall(text))
              for word in words:
                  yield word, 1
• 13. Example use case: Icelandic Word Count

  # get text from HTML
  from bs4 import BeautifulSoup
  from bs4.dammit import EncodingDetector

  def html_to_text(page):
      try:
          encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
          soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
          for script in soup(["script", "style"]):
              script.extract()
          return soup.get_text(" ", strip=True)
      except:
          return ""
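To see what the pipeline computes without a cluster, the `reduceByKey`-and-rank step can be mimicked locally with the standard library (a stand-in sketch, not part of the original Spark job; the sample pairs are made up):

```python
from collections import Counter

# (word, 1) pairs as fetch_process_warc_records would emit them
pairs = [("og", 1), ("í", 1), ("og", 1), ("að", 1), ("og", 1)]

# Equivalent of reduceByKey(lambda a, b: a + b): sum counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Rank by frequency, highest first
top = counts.most_common(2)
```

On the real data this ranking produces the frequency table shown on the next slide.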
• 14. Example use case: Icelandic Word Count

  # most frequent words in about 1 million Icelandic web pages
  22994307 og
  19802034 í
  15765245 að
  15724978 á
   8290840 er
   8088372 um
   6578254 sem
   6313945 til
   5264266 við
   4872877 1
   4432790 með
   4423975 fyrir
   3893682 2018
   3817965 2
   3308209 ekki
   3205015 is
   3165578 af
   3051413 en
• 15. Thank you!