
ABD215_Serverless Data Prep with AWS Glue


In this session, you learn how to set up a crawler to automatically discover your data and build your AWS Glue Data Catalog. You then auto-generate an AWS Glue ETL script, download it, and interactively edit it in a Zeppelin notebook connected to an AWS Glue development endpoint. After that, you upload the script to Amazon S3, reuse it across multiple jobs, and add trigger conditions to run the jobs. The resulting datasets are automatically registered in the AWS Glue Data Catalog, and you can then query them from Amazon EMR and Amazon Athena. Prerequisites: Knowledge of Python and familiarity with big data applications are preferred but not required. Attendees must bring their own laptops.
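
The crawler and trigger steps described above can also be driven from code. The following is a minimal sketch, not the workshop's own material: it assumes the boto3 Glue client and its `create_crawler`, `start_crawler`, and `create_trigger` operations, and every resource name, role ARN, and S3 path passed in is a hypothetical placeholder. The request parameters are built as plain dictionaries so they can be inspected before anything is sent to AWS.

```python
"""Sketch: request parameters for a Glue crawler and a conditional trigger.

All names, ARNs, and paths supplied by the caller are placeholders.
"""


def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue client's create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def conditional_trigger_request(name, upstream_job, downstream_job):
    """Build keyword arguments for create_trigger().

    The trigger starts downstream_job once upstream_job has succeeded.
    """
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
    }


def apply_requests(glue_client, crawler_kwargs, trigger_kwargs):
    """Send both requests using a Glue client supplied by the caller."""
    glue_client.create_crawler(**crawler_kwargs)
    glue_client.start_crawler(Name=crawler_kwargs["Name"])
    glue_client.create_trigger(**trigger_kwargs)
```

With boto3 installed and credentials configured, `apply_requests(boto3.client("glue"), crawler_request(...), conditional_trigger_request(...))` would issue the calls. The jobs named in the trigger must already exist.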


  1. AWS re:Invent: Serverless Data Prep with AWS Glue (ABD215)
     © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
     • Roy Hasson, Global Business Development Manager
     • Santosh Chandrachood, Software Development Manager
     • Lia Vader, Enterprise Solutions Architect
     November 2017
  2. Agenda: Chat on AWS Glue & Spark; Data Transformation; Machine Learning; Explore; Review workshop architecture; We talk; You build; Check access to required products
  3. AWS Glue: Overview
     Data Catalog
     • Hive Metastore compatible with enhanced functionality
     • Crawlers automatically extract metadata and create tables
     • Integrated with Amazon Athena and Amazon Redshift Spectrum
     Job Execution
     • Runs jobs on a serverless Spark platform
     • Provides flexible scheduling
     • Handles dependency resolution, monitoring, and alerting
     Job Authoring
     • Auto-generates ETL code
     • Built on open frameworks: Python and Spark
     • Developer endpoint with interactive notebook
  4. AWS Glue: Data Catalog. Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3, accessible via Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and the API.
  5. AWS Glue: ETL. Automatically generated ETL code running on serverless Apache Spark, with the power and flexibility to bring data together.
  6. AWS Glue: Developer Endpoint. Explore, visualize, and develop using a personal, serverless environment with an interactive REPL and notebooks.
  7. Apache Spark. Apache Spark is a fast, easy-to-use general engine for large-scale data processing and machine learning. Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX.
  8. Real World Application
     1. Web scraping: automate a process to scrape forum comments and analyze customer experience and challenges with a product
        • Automate scraping, parsing, and reformatting of data
        • Prepare data for machine learning
        • Build machine learning models to extract insight from data
     2. Venue ratings: build a graph representation of users, venues, and ratings
        • Consume a collection of venue check-ins and ratings
        • Map users to venues
        • Map venues to ratings
  9. Architecture (diagram): Web Forums, Venue Ratings, Zeppelin Notebook, AWS Glue, Amazon S3
  10. Getting Started
      1. Make sure your AWS user account has the following permissions:
         • AmazonEC2FullAccess
         • IAMFullAccess
      2. Visit the link below to set up permissions and launch your dev endpoint
      3. At the same link, download the 3 workshop notebooks to your machine
      4. Log in to Zeppelin running on your dev endpoint and upload the notebooks
      5. Work through each notebook at your own pace
      http://workshop-public.s3-website-us-east-1.amazonaws.com/
  11. Cleanup
      To avoid unnecessary costs, remove all resources created during the workshop.
      1. From the AWS CloudFormation console, select the AWS Glue Notebook stack and delete it
      2. From the AWS Glue console, select the dev endpoint and delete it
      3. From the AWS Glue console, select the databases, tables, and crawlers created during the session and delete them
      4. From the S3 console, select any buckets or prefixes (folders) you used for the workshop and delete them
  12. Continue Learning
      • AWS Glue
      • Apache Spark
      • Apache Zeppelin
      • Hands-on workshop using AWS Glue, Amazon Athena, and Amazon Redshift Spectrum
  13. THANK YOU!
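
The first three steps on the Cleanup slide could also be scripted rather than clicked through in the console. This is a rough sketch under stated assumptions, not part of the workshop: it assumes boto3-style clients for CloudFormation (`delete_stack`) and AWS Glue (`delete_dev_endpoint`, `delete_crawler`, `delete_database`), and every resource name is a hypothetical placeholder to be replaced with whatever was created during the session. S3 deletion (step 4) is deliberately left out so that data is not removed by accident.

```python
"""Sketch: delete the workshop's CloudFormation stack and Glue resources.

The clients are passed in by the caller; all resource names are placeholders.
"""


def cleanup(glue, cloudformation, stack_name, endpoint_name,
            crawler_names, database_names):
    """Issue delete calls in the same order as the Cleanup slide."""
    # Step 1: remove the notebook stack.
    cloudformation.delete_stack(StackName=stack_name)

    # Step 2: remove the development endpoint.
    glue.delete_dev_endpoint(EndpointName=endpoint_name)

    # Step 3: remove crawlers, then databases. Deleting a Data Catalog
    # database also removes the tables registered in it.
    for crawler in crawler_names:
        glue.delete_crawler(Name=crawler)
    for database in database_names:
        glue.delete_database(Name=database)
```

With boto3 installed, a call might look like `cleanup(boto3.client("glue"), boto3.client("cloudformation"), "my-notebook-stack", "my-dev-endpoint", ["my-crawler"], ["my_database"])`, where each name is an example only.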
