© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Serverless Data Prep with AWS Glue
ABD215
Roy Hasson – Global Business Development Manager
Santosh Chandrachood – Software Development Manager
Lia Vader – Enterprise Solutions Architect
November 2017
Agenda
Chat on AWS Glue & Spark
Data Transformation Machine Learning Explore
Review workshop architecture
We talk
You build
Check access to required products
AWS Glue – Overview
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena and Amazon Redshift Spectrum
Job Execution
• Runs jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks – Python and Spark
• Developer Endpoint with interactive notebook
AWS Glue – Data Catalog
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3, accessible via Amazon Athena, Amazon Redshift Spectrum,
Amazon EMR, and the AWS Glue API.
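As a rough illustration of the API access mentioned above, the sketch below reads table schemas from the Data Catalog. It assumes a Glue client object such as the one returned by `boto3.client("glue")` and relies on its `get_tables` call with `NextToken` pagination. The function name `list_table_schemas` and the database name in the usage note are hypothetical, not part of the workshop material.

```python
def list_table_schemas(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    `glue` is assumed to be a Glue client (for example boto3.client("glue")).
    It is passed in rather than created here so the function can be
    exercised without AWS credentials.
    """
    schemas = {}
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # Column definitions live under StorageDescriptor; it may be absent.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [
                (col["Name"], col.get("Type", "")) for col in columns
            ]
        token = page.get("NextToken")
        if not token:
            return schemas
        # Ask for the next page of tables.
        kwargs["NextToken"] = token
```

With boto3 installed and credentials configured, a call might look like `list_table_schemas(boto3.client("glue"), "workshop_db")`, where `workshop_db` is a placeholder database name.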
AWS Glue – ETL
Automatically generated ETL code running on serverless Apache Spark with the power
and flexibility to bring data together.
AWS Glue – Developer Endpoint
Explore, visualize and develop using a personal, serverless environment with
an interactive REPL and notebooks.
Apache Spark
Apache Spark is a fast, easy-to-use, general-purpose engine for large-scale data
processing and machine learning.
Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
Real World Application
1. Web scraping – Automate a process to scrape forum comments and analyze customer
experience and challenges with a product
• Automate scraping, parsing and reformatting of data
• Prepare data for machine learning
• Build machine learning models to extract insight from data
2. Venue Ratings – Build a graph representation of users, venues and ratings
• Consume a collection of venue check-ins and ratings
• Map users to venues
• Map venues to ratings
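To make the venue ratings mapping concrete, here is a minimal plain-Python sketch. It is not the workshop notebook's code, which runs on Spark. It assumes each check-in arrives as a `(user, venue, rating)` tuple; the record layout and the function names `build_rating_graph` and `average_ratings` are illustrative assumptions.

```python
from collections import defaultdict


def build_rating_graph(checkins):
    """Build two adjacency maps from (user, venue, rating) tuples.

    Returns (user_to_venues, venue_to_ratings):
      user_to_venues   maps each user to the set of venues they visited
      venue_to_ratings maps each venue to the list of ratings it received
    """
    user_to_venues = defaultdict(set)
    venue_to_ratings = defaultdict(list)
    for user, venue, rating in checkins:
        user_to_venues[user].add(venue)          # edge: user -> venue
        venue_to_ratings[venue].append(rating)   # edge: venue -> rating
    return user_to_venues, venue_to_ratings


def average_ratings(venue_to_ratings):
    """Collapse each venue's ratings into a single mean score."""
    return {
        venue: sum(ratings) / len(ratings)
        for venue, ratings in venue_to_ratings.items()
    }


# Made-up sample records, for illustration only.
sample_checkins = [
    ("user_a", "cafe", 4),
    ("user_b", "cafe", 5),
    ("user_a", "bar", 3),
]
users, venues = build_rating_graph(sample_checkins)
venue_scores = average_ratings(venues)
```

The same two mappings could be expressed as edges of a distributed graph once the data no longer fits on one machine; the dictionaries here only show the shape of the result.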
Architecture
[Architecture diagram: Web Forums, Venue Ratings, Amazon S3, AWS Glue, Zeppelin Notebook]
Getting Started
1. Make sure your AWS user account has the following permissions:
• AmazonEC2FullAccess
• IAMFullAccess
2. Visit the link below to set up permissions and launch your dev endpoint
3. At the same link, download the 3 workshop notebooks to your machine
4. Log in to Zeppelin running on your dev endpoint and upload the notebooks
5. Work through each notebook at your own pace
http://workshop-public.s3-website-us-east-1.amazonaws.com/
Cleanup
To avoid unnecessary costs, remove all resources created during the workshop.
1. From the AWS CloudFormation console, select the AWS Glue Notebook stack and delete it
2. From the AWS Glue console, select the Dev Endpoint and delete it
3. From the AWS Glue console, select the databases, tables and crawlers created during
the session and delete them
4. From the S3 console, select any buckets or prefixes (folders) you used for the workshop
and delete them
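Steps 1 to 3 above can also be scripted. The sketch below is a hedged outline rather than an official cleanup tool: it assumes client objects like those returned by `boto3.client("cloudformation")` and `boto3.client("glue")`, and uses their `delete_stack`, `delete_dev_endpoint`, `delete_crawler` and `delete_database` calls. The function name and every resource name passed in are placeholders you would replace with your own. S3 cleanup (step 4) is left out on purpose, since deleting buckets or prefixes deserves a manual check.

```python
def cleanup_workshop(cloudformation, glue, stack_name, endpoint_name,
                     crawler_names, database_names):
    """Delete workshop resources and return a log of what was requested.

    `cloudformation` and `glue` are assumed to be service clients
    (for example created with boto3.client). They are passed in so the
    function can be tried against stand-in objects first.
    """
    deleted = []

    # Step 1: remove the notebook stack.
    cloudformation.delete_stack(StackName=stack_name)
    deleted.append(("stack", stack_name))

    # Step 2: remove the dev endpoint.
    glue.delete_dev_endpoint(EndpointName=endpoint_name)
    deleted.append(("dev_endpoint", endpoint_name))

    # Step 3: remove crawlers, then databases.
    for name in crawler_names:
        glue.delete_crawler(Name=name)
        deleted.append(("crawler", name))
    for name in database_names:
        # Per the Glue API documentation, deleting a database also
        # removes the tables it contains.
        glue.delete_database(Name=name)
        deleted.append(("database", name))

    return deleted
```

Returning the list of requested deletions makes it easy to review what the script asked for before checking the consoles to confirm that nothing billable is left.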
Continue Learning
• AWS Glue
• Apache Spark
• Apache Zeppelin
• Hands-on workshop using AWS Glue, Amazon Athena and Amazon Redshift Spectrum
THANK YOU!

More Related Content

What's hot

GPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyGPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyAmazon Web Services
 
AMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAmazon Web Services
 
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017Amazon Web Services
 
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage Management
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage ManagementSTG311_Deep Dive on Amazon S3 & Amazon Glacier Storage Management
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage ManagementAmazon Web Services
 
DAT322_The Nanoservices Architecture That Powers BBC Online
DAT322_The Nanoservices Architecture That Powers BBC OnlineDAT322_The Nanoservices Architecture That Powers BBC Online
DAT322_The Nanoservices Architecture That Powers BBC OnlineAmazon Web Services
 
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...Amazon Web Services
 
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017Serverless DevOps to the Rescue - SRV330 - re:Invent 2017
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017Amazon Web Services
 
CON209_Interstella 8888 Learn How to Use Docker on AWS
CON209_Interstella 8888 Learn How to Use Docker on AWSCON209_Interstella 8888 Learn How to Use Docker on AWS
CON209_Interstella 8888 Learn How to Use Docker on AWSAmazon Web Services
 
GPSTEC305-Machine Learning in Capital Markets
GPSTEC305-Machine Learning in Capital MarketsGPSTEC305-Machine Learning in Capital Markets
GPSTEC305-Machine Learning in Capital MarketsAmazon Web Services
 
Building Serverless Real-time Data Processing (workshop)
Building Serverless Real-time Data Processing (workshop)Building Serverless Real-time Data Processing (workshop)
Building Serverless Real-time Data Processing (workshop)Amazon Web Services
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceAmazon Web Services
 
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017Amazon Web Services
 
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in Minutes
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in MinutesSRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in Minutes
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in MinutesAmazon Web Services
 
NET309_Best Practices for Securing an Amazon Virtual Private Cloud
NET309_Best Practices for Securing an Amazon Virtual Private CloudNET309_Best Practices for Securing an Amazon Virtual Private Cloud
NET309_Best Practices for Securing an Amazon Virtual Private CloudAmazon Web Services
 
ARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWSARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWSAmazon Web Services
 
SRV304_Building High-Throughput Serverless Data Processing Pipelines
SRV304_Building High-Throughput Serverless Data Processing PipelinesSRV304_Building High-Throughput Serverless Data Processing Pipelines
SRV304_Building High-Throughput Serverless Data Processing PipelinesAmazon Web Services
 
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud Data
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud DataGPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud Data
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud DataAmazon Web Services
 
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...Amazon Web Services
 
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)Amazon Web Services
 

What's hot (20)

GPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyGPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made Easy
 
AMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AI
 
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
 
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage Management
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage ManagementSTG311_Deep Dive on Amazon S3 & Amazon Glacier Storage Management
STG311_Deep Dive on Amazon S3 & Amazon Glacier Storage Management
 
ARC205_Born in the Cloud
ARC205_Born in the CloudARC205_Born in the Cloud
ARC205_Born in the Cloud
 
DAT322_The Nanoservices Architecture That Powers BBC Online
DAT322_The Nanoservices Architecture That Powers BBC OnlineDAT322_The Nanoservices Architecture That Powers BBC Online
DAT322_The Nanoservices Architecture That Powers BBC Online
 
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...
ATC303-Cache Me If You Can Minimizing Latency While Optimizing Cost Through A...
 
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017Serverless DevOps to the Rescue - SRV330 - re:Invent 2017
Serverless DevOps to the Rescue - SRV330 - re:Invent 2017
 
CON209_Interstella 8888 Learn How to Use Docker on AWS
CON209_Interstella 8888 Learn How to Use Docker on AWSCON209_Interstella 8888 Learn How to Use Docker on AWS
CON209_Interstella 8888 Learn How to Use Docker on AWS
 
GPSTEC305-Machine Learning in Capital Markets
GPSTEC305-Machine Learning in Capital MarketsGPSTEC305-Machine Learning in Capital Markets
GPSTEC305-Machine Learning in Capital Markets
 
Building Serverless Real-time Data Processing (workshop)
Building Serverless Real-time Data Processing (workshop)Building Serverless Real-time Data Processing (workshop)
Building Serverless Real-time Data Processing (workshop)
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL Performance
 
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017
Reinforcement Learning – The Ultimate AI - ARC320 - re:Invent 2017
 
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in Minutes
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in MinutesSRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in Minutes
SRV314_Building a Serverless Pipeline to Transcode a Two-Hour Video in Minutes
 
NET309_Best Practices for Securing an Amazon Virtual Private Cloud
NET309_Best Practices for Securing an Amazon Virtual Private CloudNET309_Best Practices for Securing an Amazon Virtual Private Cloud
NET309_Best Practices for Securing an Amazon Virtual Private Cloud
 
ARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWSARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWS
 
SRV304_Building High-Throughput Serverless Data Processing Pipelines
SRV304_Building High-Throughput Serverless Data Processing PipelinesSRV304_Building High-Throughput Serverless Data Processing Pipelines
SRV304_Building High-Throughput Serverless Data Processing Pipelines
 
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud Data
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud DataGPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud Data
GPSTEC315_GPS Optimizing Tips Amazon Redshift for Cloud Data
 
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...
GPSTEC313_GPS Real-Time Data Processing with AWS Lambda Quickly, at Scale, an...
 
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)
STG307_Deep Dive on Amazon Elastic File System (Amazon EFS)
 

Similar to Serverless Data Prep with AWS Glue

CON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSCON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSAmazon Web Services
 
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Amazon Web Services
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)Amazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Amazon Web Services
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAmazon Web Services
 
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Amazon Web Services
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Amazon Web Services
 
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...Amazon Web Services
 
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfDEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfAmazon Web Services
 
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017Amazon Web Services
 
LFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfLFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfAmazon Web Services
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...Amazon Web Services
 
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...Amazon Web Services
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudAmazon Web Services
 
Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Brendan Bouffler
 
Design, Build, and Modernize Your Web Applications with AWS
 Design, Build, and Modernize Your Web Applications with AWS Design, Build, and Modernize Your Web Applications with AWS
Design, Build, and Modernize Your Web Applications with AWSDonnie Prakoso
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture PatternsAmazon Web Services
 

Similar to Serverless Data Prep with AWS Glue (20)

CON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSCON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWS
 
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
 
Building Web Apps on AWS
Building Web Apps on AWSBuilding Web Apps on AWS
Building Web Apps on AWS
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
 
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
 
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
 
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfDEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
 
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
 
LFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfLFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdf
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
 
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the Cloud
 
Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018
 
Design, Build, and Modernize Your Web Applications with AWS
 Design, Build, and Modernize Your Web Applications with AWS Design, Build, and Modernize Your Web Applications with AWS
Design, Build, and Modernize Your Web Applications with AWS
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture Patterns
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 

Serverless Data Prep with AWS Glue

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
      AWS re:INVENT
      Serverless Data Prep with AWS Glue (ABD215)
      Roy Hasson – Global Business Development Manager
      Santosh Chandrachood – Software Development Manager
      Lia Vader – Enterprise Solutions Architect
      November 2017
  • 2. Agenda: Chat on AWS Glue & Spark; Data Transformation; Machine Learning; Explore; Review workshop architecture; We talk; You build; Check access to required products
  • 3. AWS Glue – Overview
      Data Catalog
        • Hive Metastore compatible with enhanced functionality
        • Crawlers automatically extract metadata and create tables
        • Integrated with Amazon Athena and Amazon Redshift Spectrum
      Job Execution
        • Runs jobs on a serverless Spark platform
        • Provides flexible scheduling
        • Handles dependency resolution, monitoring and alerting
      Job Authoring
        • Auto-generates ETL code
        • Built on open frameworks – Python and Spark
        • Developer Endpoint with Interactive Notebook
  • 4. AWS Glue – Data Catalog: Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3, accessible via Amazon Athena, Amazon Redshift Spectrum, Amazon EMR and API.
  • 5. AWS Glue – ETL: Automatically generated ETL code running on serverless Apache Spark with the power and flexibility to bring data together.
  • 6. AWS Glue – Developer Endpoint: Explore, visualize and develop using a personal, serverless environment with interactive REPL and Notebooks.
  • 7. Apache Spark: Apache Spark is a fast, easy-to-use general engine for large-scale data processing and machine learning. Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX.
  • 8. Real World Application
      1. Web scraping – Automate a process to scrape forum comments in order to analyze customer experience and challenges with a product
         • Automate scraping, parsing and reformatting of data
         • Prepare data for machine learning
         • Build machine learning models to extract insight from data
      2. Venue Ratings – Build a graph representation of users, venues and ratings
         • Consume a collection of venue check-ins and ratings
         • Map users to venues
         • Map venues to ratings
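The venue-ratings scenario above can be sketched in plain Python to show the two mappings it describes. This is only an illustration under stated assumptions: the record fields (`user`, `venue`, `rating`), the sample values, and the `build_graph` helper are hypothetical, and the workshop notebooks themselves run on Spark rather than on in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical check-in records; the workshop's real dataset and
# field names are not shown in the slides.
checkins = [
    {"user": "u1", "venue": "v1", "rating": 4},
    {"user": "u1", "venue": "v2", "rating": 5},
    {"user": "u2", "venue": "v1", "rating": 2},
]


def build_graph(records):
    """Return (user -> set of venues, venue -> list of ratings)."""
    user_venues = defaultdict(set)
    venue_ratings = defaultdict(list)
    for rec in records:
        # Edge from a user to each venue they checked in to.
        user_venues[rec["user"]].add(rec["venue"])
        # Edge from a venue to each rating it received.
        venue_ratings[rec["venue"]].append(rec["rating"])
    return dict(user_venues), dict(venue_ratings)


user_venues, venue_ratings = build_graph(checkins)

# A simple aggregate over the venue -> ratings edges.
average_rating = {v: sum(r) / len(r) for v, r in venue_ratings.items()}

print(user_venues)
print(average_rating)
```

Keeping the two mappings separate mirrors the slide's "map users to venues" and "map venues to ratings" steps; a graph library or Spark job would represent the same edges at larger scale.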
  • 9. Architecture (diagram labels): Web Forums, Venue Ratings, Zeppelin Notebook, AWS Glue, Amazon S3
  • 10. Getting Started
      1. Make sure your AWS user account has the following permissions:
         • AmazonEC2FullAccess
         • IAMFullAccess
      2. Visit the link below to set up permissions and launch your dev endpoint
      3. At the same link, download the 3 workshop notebooks to your machine
      4. Log in to Zeppelin running on your dev endpoint and upload the notebooks
      5. Work through each notebook at your own pace
      http://workshop-public.s3-website-us-east-1.amazonaws.com/
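The permission check in step 1 can be expressed as a small helper. This is a minimal sketch, not part of the workshop material: the function name `missing_policies` and the sample input are hypothetical, and it only compares policy names that you supply.

```python
# Managed policy names listed on the Getting Started slide.
REQUIRED_POLICIES = ("AmazonEC2FullAccess", "IAMFullAccess")


def missing_policies(attached_policy_names, required=REQUIRED_POLICIES):
    """Return the required policy names that are not in the attached list."""
    attached = set(attached_policy_names)
    return [name for name in required if name not in attached]


# Hypothetical example: a user with only one of the two policies attached.
example_attached = ["AmazonEC2FullAccess", "AmazonS3ReadOnlyAccess"]
print(missing_policies(example_attached))
```

With boto3, the attached names could come from `iam.list_attached_user_policies(UserName=...)`, reading `PolicyName` from each entry in `AttachedPolicies`. Policies granted through groups or roles would not appear in that list, so an empty result is a stronger signal than a non-empty one.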
  • 11. Cleanup
      To avoid unnecessary costs, remove all resources created during the workshop.
      1. In the AWS CloudFormation console, select the AWS Glue Notebook stack and delete it
      2. In the AWS Glue console, select the Dev Endpoint and delete it
      3. In the AWS Glue console, select the databases, tables and crawlers created during the session and delete them
      4. In the S3 console, select any buckets or prefixes (folders) you used for the workshop and delete them
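The first three cleanup steps can also be scripted. The sketch below assumes `cfn` and `glue` behave like boto3 CloudFormation and Glue clients; the function name `cleanup_workshop` and every resource name passed to it are hypothetical placeholders that you would replace with the names used in your own session. S3 cleanup (step 4) is deliberately left out, because deleting buckets or prefixes is destructive and depends on what else they contain.

```python
def cleanup_workshop(cfn, glue, stack_name, endpoint_name,
                     database_name, table_names, crawler_names):
    """Delete workshop resources in the order given on the Cleanup slide.

    cfn and glue are expected to expose the same methods as
    boto3.client("cloudformation") and boto3.client("glue").
    """
    # Step 1: remove the notebook stack.
    cfn.delete_stack(StackName=stack_name)

    # Step 2: remove the dev endpoint.
    glue.delete_dev_endpoint(EndpointName=endpoint_name)

    # Step 3: remove crawlers, tables, and finally the database itself.
    for crawler in crawler_names:
        glue.delete_crawler(Name=crawler)
    for table in table_names:
        glue.delete_table(DatabaseName=database_name, Name=table)
    glue.delete_database(Name=database_name)


# Example call (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   cleanup_workshop(boto3.client("cloudformation"), boto3.client("glue"),
#                    "my-notebook-stack", "my-dev-endpoint",
#                    "my_database", ["my_table"], ["my_crawler"])
```

Passing the clients in as arguments keeps the function easy to dry-run with a stand-in object before pointing it at a real account.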
  • 12. Continue Learning
      • AWS Glue
      • Apache Spark
      • Apache Zeppelin
      • Hands-on workshop using AWS Glue, Amazon Athena and Amazon Redshift Spectrum
  • 13. THANK YOU!