AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326) - AWS re:Invent 2018


AWS hosts a variety of public data sets that anyone can access for free. Previously, large data sets such as satellite imagery or genomic data have required hours or days to locate, download, customize, and analyze. When data is made publicly available on AWS, anyone can analyze any volume of data without downloading or storing it themselves. In this session, the AWS Open Data Team shares tips and tricks, patterns and anti-patterns, and tools to help you effectively stage your data for analysis in the cloud.

  1. AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS326)
     © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
     Jed Sundwall, Manager, AWS Open Data, Amazon Web Services
     Sebastian Nagel, Crawl Engineer, Common Crawl
     Dave Rocamora, Solutions Architect, Amazon Web Services
  2. About Common Crawl
     • We’re a non-profit that makes web data accessible to anyone
     • Lower the barrier to “the web as a data set”
     • Running your own large-scale web crawl is expensive and challenging
     • Innovation occurs by using the data, rarely through new collection methods
     • 10 years of web data:
       • Common Crawl founded in 2007 by Gil Elbaz
       • First crawl in 2008
       • Public data set on AWS since 2012
       • 3 petabytes of web data on s3://commoncrawl/
  3. 10 years, 3 petabytes of web data
     • Monthly “snapshot” crawls in recent years
     • Broad sample crawls covering about 3 billion pages every month
     • Partially overlapping with previous months
     • Primarily HTML pages (a small percentage of other document formats)
     • Released for free without additional intellectual property restrictions
     • Used for natural language processing, data mining, information retrieval, web science, market research, linked data and the semantic web, internet security, graph processing, benchmarking and testing, ...
  4. Diverse community
     • Research and education vs. company or startup
     • Not (yet) familiar with AWS vs. already on AWS
     • Varying available time and budget
     • Web archiving vs. big data community
     • Conservative or open regarding data formats and processing tools
     • Diverse use cases prefer different data formats:
       • Natural language processing: HTML converted to (annotated) plain text
       • Internet security: binary payload plus HTTP headers
  5. Data formats
     • WARC (Web ARChive)
       • HTML content + HTTP header + capture metadata
       • Every WARC file is a random sample on its own
     • Secondary formats: preprocessed and smaller in size
       • WAT: page metadata and outgoing links as JSON
       • WET: plain text with little metadata
     • URL index to WARC files
       • As a service: https://index.commoncrawl.org/
       • Columnar index (Parquet format)
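To make the WARC layout concrete, here is a toy response record and a minimal header parse using only the standard library. The record content (URI, date, lengths) is invented for illustration; the on-disk shape follows the WARC format described on the next slide.

```python
# Illustrative only: a toy WARC response record and a minimal parse of
# its header block, showing what "HTML content + HTTP header + capture
# metadata" looks like on disk. All field values here are made up.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"WARC-Date: 2018-11-26T12:00:00Z\r\n"
          b"Content-Length: 60\r\n"
          b"\r\n"
          b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"\r\n"
          b"<html>...</html>")

# The WARC header block ends at the first blank line; what follows is
# the payload (here: HTTP header + HTML body).
headers_blob, _, body = record.partition(b"\r\n\r\n")
headers = dict(line.split(b": ", 1)
               for line in headers_blob.split(b"\r\n")[1:])

assert headers[b"WARC-Type"] == b"response"
assert body.startswith(b"HTTP/1.1 200 OK")
```

Real-world code should use a WARC library such as warcio (shown on later slides) rather than parsing by hand; this sketch only illustrates the record structure.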
  6. WARC (Web ARChive) in detail
     • WARC is a sustainable format
       • ISO standard since 2009
       • Wrappers for many programming languages (Python, Java, Go, R, PHP, ...)
     • It’s easy to read:
       <WARC record metadata: URL, capture date and time, IP address, checksum, record length>\r\n\r\n
       <HTTP header>\r\n\r\n
       <payload: HTML or binary (PDF, JPEG, etc.)>\r\n\r\n
     • Random-access .warc.gz
       • Compression per record (the gzip spec allows multiple deflate blocks)
       • About 10% larger than compressing the whole file at once
       • Random access with indexed record offsets
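The per-record compression trick relies on a property of gzip itself: a stream may consist of multiple concatenated members, and decompressors read through all of them. A minimal standard-library sketch (the record bytes are placeholders):

```python
# Sketch of why .warc.gz supports random access: each record is
# compressed as its own gzip member, and members can be concatenated.
import gzip

record1 = b"WARC/1.0\r\n...record one...\r\n\r\n"
record2 = b"WARC/1.0\r\n...record two...\r\n\r\n"

member1 = gzip.compress(record1)
archive = member1 + gzip.compress(record2)

# Reading the whole archive transparently yields both records ...
assert gzip.decompress(archive) == record1 + record2

# ... while a reader that knows a record's byte offset (e.g. from the
# URL index) can decompress just that member, skipping the rest.
offset = len(member1)
assert gzip.decompress(archive[offset:]) == record2
```

This is why the index stores a byte offset and length per record: a consumer can range-read exactly one gzip member from Amazon S3 and decompress it in isolation.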
  7. URL index
     • The URL index allows you to query by URL and:
       • Get metadata (HTTP status, capture time, MIME type, content language)
       • Get the WARC file path, record offset, and length, and
       • Retrieve WARC records from Amazon S3
     • CDX index: https://index.commoncrawl.org/
       • Look up URLs (also by domain or prefix)
       • Memento / way-back machine
     • Columnar index on Amazon S3 (Parquet format)
       • SQL analytics with Amazon Athena, Spark, Hive, ...
       • Supports the “vertical” use case: process the web pages of
         • a single domain or a single content language
         • URLs matching a pattern
         • ...
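The offset and length returned by the index translate directly into an HTTP Range request against the `commoncrawl` bucket. A small sketch of that arithmetic (the function name is mine, not part of any Common Crawl tooling):

```python
# The URL index yields (warc_filename, warc_record_offset,
# warc_record_length); fetching the record is a ranged S3 GET.
# HTTP Range headers are inclusive on both ends.

def warc_record_range(offset, length):
    """Range header value covering exactly one WARC record."""
    return 'bytes={}-{}'.format(offset, offset + length - 1)

# e.g. a 1,000-byte record starting at offset 5,000:
assert warc_record_range(5000, 1000) == 'bytes=5000-5999'
```

The resulting string is passed as the `Range` argument to `S3.Client.get_object`, exactly as the Icelandic word-count example does a few slides later.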
  8. Get help from Common Crawl and our community
     • Links to the data sets and a short description of formats and tools: https://commoncrawl.org/the-data/get-started/
     • Examples and tutorials from our community: https://commoncrawl.org/the-data/examples/ and https://commoncrawl.org/the-data/tutorials/
     • Tools and examples maintained by Common Crawl: https://github.com/commoncrawl/
     • Developer group
  9. Example use case: Icelandic Word Count

     session = SparkSession.builder.getOrCreate()
     df = session.read.load('s3://commoncrawl/cc-index/table/cc-main/warc/')
     df.createOrReplaceTempView('ccindex')
     sqldf = session.sql(
         "SELECT url, warc_filename, warc_record_offset, warc_record_length "
         "FROM ccindex "
         "WHERE crawl = 'CC-MAIN-2018-43' "
         "  AND subset = 'warc' "
         "  AND content_languages = 'isl'")

     # alternatively, load the result of a SQL query run by Amazon Athena
     sqldf = session.read.format("csv").option("header", True) \
         .option("inferSchema", True).load(".../path/to/result.csv")
  10. Example use case: Icelandic Word Count

      # imports to fetch and process WARC records
      import re
      import boto3
      import botocore
      from io import BytesIO
      from warcio.archiveiterator import ArchiveIterator
      from warcio.recordloader import ArchiveLoadFailed

      # simple Unicode-aware word tokenization (not suitable for CJK languages)
      word_pattern = re.compile(r'\w+', re.UNICODE)

      warc_recs = sqldf.select("url", "warc_filename", "warc_record_offset",
                               "warc_record_length").rdd
      word_counts = warc_recs.mapPartitions(fetch_process_warc_records) \
          .reduceByKey(lambda a, b: a + b)
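The tokenizer is Unicode-aware because `\w` with `re.UNICODE` matches letters beyond ASCII, which matters for Icelandic characters like ð and accented vowels. A quick standalone check (the sample phrase is mine):

```python
# Verify that the \w+ tokenizer handles Icelandic letters, which
# plain [a-z]+ would split or drop.
import re

word_pattern = re.compile(r'\w+', re.UNICODE)
words = [w.lower() for w in word_pattern.findall('Góðan daginn, heimur!')]
assert words == ['góðan', 'daginn', 'heimur']
```

As the slide notes, this simple approach is unsuitable for CJK languages, where word boundaries are not marked by non-word characters.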
  11. Example use case: Icelandic Word Count

      def fetch_process_warc_records(rows):
          s3client = boto3.client('s3')
          for row in rows:
              url = row['url']
              warc_path = row['warc_filename']
              offset = int(row['warc_record_offset'])
              length = int(row['warc_record_length'])
              rangereq = 'bytes={}-{}'.format(offset, offset + length - 1)
              response = s3client.get_object(Bucket='commoncrawl',
                                             Key=warc_path, Range=rangereq)
              record_stream = BytesIO(response["Body"].read())
              for record in ArchiveIterator(record_stream):
                  page = record.content_stream().read()
                  text = html_to_text(page)
                  words = map(lambda w: w.lower(), word_pattern.findall(text))
                  for word in words:
                      yield word, 1
  12. Example use case: Icelandic Word Count

      # get text from HTML
      from bs4 import BeautifulSoup
      from bs4.dammit import EncodingDetector

      def html_to_text(page):
          try:
              encoding = EncodingDetector.find_declared_encoding(page, is_html=True)
              soup = BeautifulSoup(page, "lxml", from_encoding=encoding)
              for script in soup(["script", "style"]):
                  script.extract()
              return soup.get_text(" ", strip=True)
          except Exception:
              return ""
  13. Example use case: Icelandic Word Count

      # most frequent words in about 1 million Icelandic web pages
      22994307  og
      19802034  í
      15765245  að
      15724978  á
       8290840  er
       8088372  um
       6578254  sem
       6313945  til
       5264266  við
       4872877  1
       4432790  með
       4423975  fyrir
       3893682  2018
       3817965  2
       3308209  ekki
       3205015  is
       3165578  af
       3051413  en
  14. Thank you!
