SlideShare a Scribd company logo
1 of 59
Download to read offline
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Serverless data prep with AWS Glue
Raghu Prabhu
Senior BDM
Data Lakes on AWS
A D B 3 0 6
Radhika Ravirala
Specialist Solutions Architect
Data and Analytics, AWS
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
You are a data engineer at AnyCompany*
Create a dataset for
reporting and visualization
Cleanse
Transform
Optimize for reporting queries
Help solve a
Machine Learning problem
AnyCompany’s data scientists need
to understand passenger tipping
behavior
* a completely made-up ride-hailing startup
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Your dataset: NYC Taxi Trips
Yellow and Green taxi trips
• Pick-up and drop-off dates/times
• Pick-up and drop-off locations
• Trip distances
• Itemized fares
• Tip amount
• Driver-reported passenger counts
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
For-hire Vehicle (FHV)
• Pick-up and drop-off dates/times
• Pick-up and drop-off locations
• Dispatching base
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Your dataset: NYC Taxi Trips
Original raw dataset
Green and yellow taxi + FHV
Years 2009 to 2018
~1.6Bn rows
215 files
253 GB total
Simplified raw dataset
Only yellow taxi + few look-ups
Jan to March 2017
~2M rows
3 files
2.5 GB+ uncompressed
Ready in a publicly accessible Amazon
S3 bucket
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Navigate to the AWS Glue workshop (ADB306) website
Use Firefox or Chrome. Keep the website open in a separate tab.
http://bit.ly/2JuCqs5
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key resources AWS CloudFormation will create
1. A new data lake Amazon S3 bucket
2. Necessary IAM policies and roles for AWS Glue, Amazon Athena, and Amazon
SageMaker
3. An AWS Glue development endpoint
4. A number of named queries in Amazon Athena
Finally, the NYC Taxi trips raw dataset is copied into your Amazon S3 bucket
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue automates
the undifferentiated heavy lifting of ETL
Automatically discover and categorize your data making it immediately searchable and
queryable across data sources.
Generate code to clean, enrich, and reliably move data between various data sources
Run your jobs on a serverless, fully managed, scale-out environment. No compute
resources to manage.
Discover
Develop
Deploy
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data Catalog
▪ Hive Metastore compatible with enhanced functionality
▪ Crawlers automatically extracts metadata and creates tables
▪ Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
▪ Run jobs on a serverless Spark platform
▪ Provides flexible scheduling
▪ Handles dependency resolution, monitoring and alerting
Job Authoring
▪ Auto-generates ETL code
▪ Build on open frameworks – Python and Spark
▪ Developer-centric – editing, debugging, sharing
AWS Glue components
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue Crawlers and the AWS Glue Data Catalog
A crawler…
• Samples and classifies your data
• Extracts metadata
• Infers schema and partitioning format
• Creates tables in your account’s AWS Glue Data Catalog
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Why query with Amazon Athena?
• Interactive query service
• Makes it easy to analyze data in Amazon S3 using standard SQL
• Out-of-the-box integrated with AWS Glue Data Catalog
• Serverless
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue
Crawler
AWS Glue
Data Catalog
NYC Taxi
Raw Dataset
1
Concepts in action
2
3
4
1. Crawler crawls raw dataset in
Amazon S3 bucket
2. Crawler writes metadata into AWS
Glue Data Catalog
3. You query raw dataset in Athena
4. Athena uses schema definition to
read raw dataset from Amazon S3
and returns results
You
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Interactive ETL development with AWS Glue and Amazon SageMaker
What is Amazon SageMaker?
• ... is a fully-managed ML platform
• … enables you to build, train, and deploy machine
learning models at any scale
• … provides a Jupyter notebook environment
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Connect your IDE to an AWS Glue development endpoint.
Environment to interactively develop, debug, and test ETL code.
AWS Glue Spark environment
Interpreter
Server
Remote
Interpreter
What is an AWS Glue development endpoint?
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Interactive ETL development with AWS Glue and Amazon SageMaker
AWS Glue enables you to
… create an Amazon SageMaker notebook environment
... connect it to an AWS Glue dev endpoint
… and finally, access your Jupyter notebook
All without leaving the AWS Glue console
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue
Data Catalog
NYC Taxi
Raw Dataset
3
Concepts in action
2
1
1. You author ETL code in an Amazon
SageMaker notebook
2. ETL code runs on AWS Glue Dev
endpoint
3. ETL code…
a. Reads raw dataset
b. Applies transformations
c. Writes optimized dataset back to your
Amazon S3 bucket
4. You use your ETL code in notebook
to create a script file
Amazon SageMaker
Notebook
NYC Taxi
Optimized Dataset
4
You
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Transformations we’ll apply to raw data
• Join NYC trips dataset with look-up tables (denormalization)
• Create new Timestamp columns for pick-up & drop-off
• Drop unnecessary columns
• Partition the dataset
• Convert to a columnar file format (parquet)
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Concepts in action
NYC Taxi
Optimized Dataset
AWS Glue
Crawler
AWS Glue
Data Catalog
1
2
3
4
1. Crawler crawls optimized dataset
in Amazon S3 bucket
2. Crawler writes metadata into AWS
Glue Data Catalog
3. You query optimized dataset in
Athena
4. Athena uses schema definition to
read data from Amazon S3 and
returns results
You
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs extract,
transform, and load (ETL) work.
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, monitored
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
What is an AWS Glue trigger?
Triggers are the ‘glue’ in your AWS Glue ETL pipeline.
Triggers...
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State Machine-driven
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Schedule-driven AWS Glue ETL pipeline
We work our way backwards from a daily SLA deadline
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
State Machine-driven AWS Glue ETL pipeline
Let AWS Step Functions drive the pipeline.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Concepts in action
NYC Taxi
Optimized Dataset
AWS Glue
Crawler for
Raw dataset
AWS Glue
Data Catalog
1
4
We’ll build a Schedule-driven pipeline.
1. Scheduled crawler runs on raw dataset and
updates AWS Glue Data Catalog
2. AWS Glue job starts based on trigger(s)
3. Job reads raw dataset, applies
transformations, and writes optimized
dataset to your S3 bucket
4. Scheduled crawler runs on optimized
dataset and updates Data Catalog
AWS Glue Job to
created
Optimized dataset
NYC Taxi
Raw Dataset
AWS Glue
Crawler for
Optimized
dataset
3
2
Trigger
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Workshop map
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
ML Problem Definition
Tip given to Taxi Drivers is substantial part of their Income
They can also serve as loose metric for Service Quality that can be useful for taxi
companies
What are influences for higher tips (e.g., Taxi starting and ending in Richer part of
cities?)
We present analysis of factors affecting Tips and use these factors to predict Tips
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
ML Architecture
Results
transform()
Test Data
fit()
XGBoost-Regression
Data Splitting
Feature Engineering
Endpoint
Endpoint Configuration
Model Creation
Training Job
Amazon SageMaker
Notebook Instance – Glue Spark
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Concepts in action
NYC Taxi
Optimized
1. Launch Notebook
2. Read Optimized Dataset
3. Train the ML Model
4. Host the trained model
and perform Prediction
AWS Glue
Data Catalog
AWS Glue Development Endpoint
SageMaker
notebook
2
3
4
1
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hands-on:
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Run ML Notebook interactively
• Initialize variables and import spark libraries
• Retrieve Optimized NYC Taxi trips Dataset
• Observe features against target label (tip_amount)
• Perform feature engineering
• Split feature engineered dataset into Train and Test
• Launch Spark XGBoostEstimator (Observe hyperparameters)
• Perform Prediction and observe accuracy
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
You made it!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Raghu Prabhu (prabhuru@amazon.com)
Radhika Ravirala (ravirala@amazon.com)

More Related Content

What's hot

利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統Amazon Web Services
 
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
 No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ... No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...AWS Summits
 
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...Amazon Web Services
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Summits
 
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...Making CI/CD pipelines safer with application monitoring and tracing - MAD202...
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...Amazon Web Services
 
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...Amazon Web Services
 
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...Amazon Web Services
 
Certificate management concepts in AWS - SEC205 - New York AWS Summit
Certificate management concepts in AWS - SEC205 - New York AWS SummitCertificate management concepts in AWS - SEC205 - New York AWS Summit
Certificate management concepts in AWS - SEC205 - New York AWS SummitAmazon Web Services
 
Manage your database in the cloud like a pro with Cloud Volumes Service for A...
Manage your database in the cloud like a pro with Cloud Volumes Service for A...Manage your database in the cloud like a pro with Cloud Volumes Service for A...
Manage your database in the cloud like a pro with Cloud Volumes Service for A...Amazon Web Services
 
Progetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWSProgetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWSAmazon Web Services
 
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...Amazon Web Services
 
Introducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-RegionIntroducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-RegionAmazon Web Services
 
Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...Amazon Web Services
 
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Amazon Web Services
 
Create, map, and drive performance with Amazon FSx for Windows File Server - ...
Create, map, and drive performance with Amazon FSx for Windows File Server - ...Create, map, and drive performance with Amazon FSx for Windows File Server - ...
Create, map, and drive performance with Amazon FSx for Windows File Server - ...Amazon Web Services
 
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Amazon Web Services
 
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS Summit
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS SummitConnecting your devices at scale, ft. Discovery - SVC205 - New York AWS Summit
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS SummitAmazon Web Services
 
Getting Started with Microservices, Containers, and Serverless Architectures
Getting Started with Microservices, Containers, and Serverless ArchitecturesGetting Started with Microservices, Containers, and Serverless Architectures
Getting Started with Microservices, Containers, and Serverless ArchitecturesAmazon Web Services
 
Using ML to detect and prevent fraud without compromising user experience - F...
Using ML to detect and prevent fraud without compromising user experience - F...Using ML to detect and prevent fraud without compromising user experience - F...
Using ML to detect and prevent fraud without compromising user experience - F...Amazon Web Services
 

What's hot (20)

Data_Analytics_and_AI_ML
Data_Analytics_and_AI_MLData_Analytics_and_AI_ML
Data_Analytics_and_AI_ML
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
 
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
 No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ... No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
 
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...Making CI/CD pipelines safer with application monitoring and tracing - MAD202...
Making CI/CD pipelines safer with application monitoring and tracing - MAD202...
 
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...
Fulfilling_a_Billion_Requests_from_a_Global_SaaS_Company_Insights_into_AfterS...
 
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
 
Certificate management concepts in AWS - SEC205 - New York AWS Summit
Certificate management concepts in AWS - SEC205 - New York AWS SummitCertificate management concepts in AWS - SEC205 - New York AWS Summit
Certificate management concepts in AWS - SEC205 - New York AWS Summit
 
Manage your database in the cloud like a pro with Cloud Volumes Service for A...
Manage your database in the cloud like a pro with Cloud Volumes Service for A...Manage your database in the cloud like a pro with Cloud Volumes Service for A...
Manage your database in the cloud like a pro with Cloud Volumes Service for A...
 
Progetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWSProgetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWS
 
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...
Deploy and scale your first cloud application with Amazon Lightsail - CMP208 ...
 
Introducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-RegionIntroducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-Region
 
Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...
 
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
 
Create, map, and drive performance with Amazon FSx for Windows File Server - ...
Create, map, and drive performance with Amazon FSx for Windows File Server - ...Create, map, and drive performance with Amazon FSx for Windows File Server - ...
Create, map, and drive performance with Amazon FSx for Windows File Server - ...
 
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
 
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS Summit
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS SummitConnecting your devices at scale, ft. Discovery - SVC205 - New York AWS Summit
Connecting your devices at scale, ft. Discovery - SVC205 - New York AWS Summit
 
Getting Started with Microservices, Containers, and Serverless Architectures
Getting Started with Microservices, Containers, and Serverless ArchitecturesGetting Started with Microservices, Containers, and Serverless Architectures
Getting Started with Microservices, Containers, and Serverless Architectures
 
Using ML to detect and prevent fraud without compromising user experience - F...
Using ML to detect and prevent fraud without compromising user experience - F...Using ML to detect and prevent fraud without compromising user experience - F...
Using ML to detect and prevent fraud without compromising user experience - F...
 

Similar to Serverless data prep with AWS Glue - ADB306 - New York AWS Summit

Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Amazon Web Services
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Amazon Web Services
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitAmazon Web Services
 
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...Amazon Web Services
 
Well Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdfWell Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdfConradoDeBiasi
 
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS SummitBuilding well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS SummitAmazon Web Services
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesAmazon Web Services
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSAmazon Web Services
 
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...Amazon Web Services
 
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS Summit
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS SummitDeveloping with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS Summit
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS SummitAmazon Web Services
 
Enhancing Your Developer eXperience on AWS - AWS Summit Sydney
Enhancing Your Developer eXperience on AWS - AWS Summit SydneyEnhancing Your Developer eXperience on AWS - AWS Summit Sydney
Enhancing Your Developer eXperience on AWS - AWS Summit SydneyAmazon Web Services
 
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019javier ramirez
 
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...Amazon Web Services
 
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...Amazon Web Services
 
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...Amazon Web Services
 
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Julien SIMON
 
Build, train and deploy machine learning models at scale using AWS
Build, train and deploy machine learning models at scale using AWSBuild, train and deploy machine learning models at scale using AWS
Build, train and deploy machine learning models at scale using AWSAmazon Web Services
 
Breaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersBreaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersAmazon Web Services
 
AWS e SAP - Il futuro delle Enterprise Applications
AWS e SAP - Il futuro delle Enterprise ApplicationsAWS e SAP - Il futuro delle Enterprise Applications
AWS e SAP - Il futuro delle Enterprise ApplicationsAmazon Web Services
 

Similar to Serverless data prep with AWS Glue - ADB306 - New York AWS Summit (20)

Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
 
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
 
Well Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdfWell Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdf
 
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS SummitBuilding well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless Architectures
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWS
 
You're in the Cloud, now What?
You're in the Cloud, now What?You're in the Cloud, now What?
You're in the Cloud, now What?
 
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
 
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS Summit
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS SummitDeveloping with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS Summit
Developing with .NET Core on AWS - What's new - MAD306 - Santa Clara AWS Summit
 
Enhancing Your Developer eXperience on AWS - AWS Summit Sydney
Enhancing Your Developer eXperience on AWS - AWS Summit SydneyEnhancing Your Developer eXperience on AWS - AWS Summit Sydney
Enhancing Your Developer eXperience on AWS - AWS Summit Sydney
 
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
 
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
 
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...
Built & Delivered in Six Months Using Serverless Technical Patterns and Micro...
 
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...
Open by Design: Accelerating the Enterprise Cloud Journey - DEM02-S - Anaheim...
 
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
 
Build, train and deploy machine learning models at scale using AWS
Build, train and deploy machine learning models at scale using AWSBuild, train and deploy machine learning models at scale using AWS
Build, train and deploy machine learning models at scale using AWS
 
Breaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersBreaking Up the Monolith with Containers
Breaking Up the Monolith with Containers
 
AWS e SAP - Il futuro delle Enterprise Applications
AWS e SAP - Il futuro delle Enterprise ApplicationsAWS e SAP - Il futuro delle Enterprise Applications
AWS e SAP - Il futuro delle Enterprise Applications
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Serverless data prep with AWS Glue - ADB306 - New York AWS Summit

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Serverless data prep with AWS Glue Raghu Prabhu Senior BDM Data Lakes on AWS A D B 3 0 6 Radhika Ravirala Specialist Solutions Architect Data and Analytics, AWS
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 4. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T You are a data engineer at AnyCompany* Create a dataset for reporting and visualization Cleanse Transform Optimize for reporting queries Help solve a Machine Learning problem AnyCompany’s data scientists need to understand passenger tipping behavior * a completely made-up ride-hailing startup
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Your dataset: NYC Taxi Trips Yellow and Green taxi trips • Pick-up and drop-off dates/times • Pick-up and drop-off locations • Trip distances • Itemized fares • Tip amount • Driver-reported passenger counts http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml For-hire Vehicle (FHV) • Pick-up and drop-off dates/times • Pick-up and drop-off locations • Dispatching base
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Your dataset: NYC Taxi Trips Original raw dataset Green and yellow taxi + FHV Years 2009 to 2018 ~1.6Bn rows 215 files 253 GB total Simplified raw dataset Only yellow taxi + few look-ups Jan to March 2017 ~2M rows 3 files 2.5 GB+ uncompressed Ready in a publicly accessible Amazon S3 bucket
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Navigate to the AWS Glue workshop (ADB306) website Use Firefox or Chrome. Keep the website open in a separate tab. http://bit.ly/2JuCqs5
  • 11. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Key resources AWS CloudFormation will create 1. A new data lake Amazon S3 bucket 2. Necessary IAM policies and roles for AWS Glue, Amazon Athena, and Amazon SageMaker 3. An AWS Glue development endpoint 4. A number of named queries in Amazon Athena Finally, the NYC Taxi trips raw dataset is copied into your Amazon S3 bucket
  • 13. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue automates the undifferentiated heavy lifting of ETL Automatically discover and categorize your data making it immediately searchable and queryable across data sources. Generate code to clean, enrich, and reliably move data between various data sources Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to manage. Discover Develop Deploy
  • 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Data Catalog ▪ Hive Metastore compatible with enhanced functionality ▪ Crawlers automatically extracts metadata and creates tables ▪ Integrated with Amazon Athena, Amazon Redshift Spectrum Job Execution ▪ Run jobs on a serverless Spark platform ▪ Provides flexible scheduling ▪ Handles dependency resolution, monitoring and alerting Job Authoring ▪ Auto-generates ETL code ▪ Build on open frameworks – Python and Spark ▪ Developer-centric – editing, debugging, sharing AWS Glue components
  • 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Crawlers and the AWS Glue Data Catalog A crawler… • Samples and classifies your data • Extracts metadata • Infers schema and partitioning format • Creates tables in your account’s AWS Glue Data Catalog
  • 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Why query with Amazon Athena? • Interactive query service • Makes it easy to analyze data in Amazon S3 using standard SQL • Out-of-the-box integrated with AWS Glue Data Catalog • Serverless
  • 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Crawler AWS Glue Data Catalog NYC Taxi Raw Dataset 1 Concepts in action 2 3 4 1. Crawler crawls raw dataset in Amazon S3 bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query raw dataset in Athena 4. Athena uses schema definition to read raw dataset from Amazon S3 and returns results You
  • 21. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 22. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 23. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 24. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 25. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 26. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Interactive ETL development with AWS Glue and Amazon SageMaker What is Amazon SageMaker? • ... is a fully-managed ML platform • … enables you to build, train, and deploy machine learning models at any scale • … provides a Jupyter notebook environment
  • 27. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Connect your IDE to an AWS Glue development endpoint. Environment to interactively develop, debug, and test ETL code. AWS Glue Spark environment Interpreter Server Remote Interpreter What is an AWS Glue development endpoint?
  • 28. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Interactive ETL development with AWS Glue and Amazon SageMaker AWS Glue enables you to … create an Amazon SageMaker notebook environment ... connect it to an AWS Glue dev endpoint … and finally, access your Jupyter notebook All without leaving the AWS Glue console
  • 29. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Data Catalog NYC Taxi Raw Dataset 3 Concepts in action 2 1 1. You author ETL code in an Amazon SageMaker notebook 2. ETL code runs on AWS Glue Dev endpoint 3. ETL code… a. Reads raw dataset b. Applies transformations c. Writes optimized dataset back to your Amazon S3 bucket 4. You use your ETL code in notebook to create a script file Amazon SageMaker Notebook NYC Taxi Optimized Dataset 4 You
  • 30. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 31. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 32. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Transformations we’ll apply to raw data • Join NYC trips dataset with look-up tables (denormalization) • Create new Timestamp columns for pick-up & drop-off • Drop unnecessary columns • Partition the dataset • Convert to a columnar file format (parquet)
  • 33. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 34. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 35. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Concepts in action NYC Taxi Optimized Dataset AWS Glue Crawler AWS Glue Data Catalog 1 2 3 4 1. Crawler crawls optimized dataset in Amazon S3 bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query optimized dataset in Athena 4. Athena uses schema definition to read data from Amazon S3 and returns results You
  • 36. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 37. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 38. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 39. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 40. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T What is an AWS Glue job? An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work. • A core building block in your production ETL pipeline • Provide your PySpark ETL script or have one automatically generated • Supports a rich set of built-in AWS Glue transformations • Jobs can be started, stopped, monitored
  • 41. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T What is an AWS Glue trigger? Triggers are the ‘glue’ in your AWS Glue ETL pipeline. Triggers... • Can be used to chain multiple AWS Glue jobs in a series • Can start multiple jobs at once • Can be scheduled, on-demand, or based on job events • Can pass unique parameters to customize AWS Glue job runs
  • 42. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Three ways to set up an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State Machine-driven
  • 43. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Schedule-driven AWS Glue ETL pipeline We work our way backwards from a daily SLA deadline
  • 44. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Event-driven AWS Glue ETL pipeline Let Amazon CloudWatch Events and AWS Lambda drive the pipeline.
  • 45. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T State Machine-driven AWS Glue ETL pipeline Let AWS Step Functions drive the pipeline.
  • 46. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Concepts in action NYC Taxi Optimized Dataset AWS Glue Crawler for Raw dataset AWS Glue Data Catalog 1 4 We’ll build a Schedule-driven pipeline. 1. Scheduled crawler runs on raw dataset and updates AWS Glue Data Catalog 2. AWS Glue job starts based on trigger(s) 3. Job reads raw dataset, applies transformations, and writes optimized dataset to your S3 bucket 4. Scheduled crawler runs on optimized dataset and updates Data Catalog AWS Glue Job to created Optimized dataset NYC Taxi Raw Dataset AWS Glue Crawler for Optimized dataset 3 2 Trigger
  • 47. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 48. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 49. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 50. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 51. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 52. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Workshop map
  • 53. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T ML Problem Definition Tip given to Taxi Drivers is substantial part of their Income They can also serve as loose metric for Service Quality that can be useful for taxi companies What are influences for higher tips (e.g., Taxi starting and ending in Richer part of cities?) We present analysis of factors affecting Tips and use these factors to predict Tips
  • 54. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T ML Architecture Results transform() Test Data fit() XGBoost-Regression Data Splitting Feature Engineering Endpoint Endpoint Configuration Model Creation Training Job Amazon SageMaker Notebook Instance – Glue Spark
  • 55. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Concepts in action NYC Taxi Optimized 1. Launch Notebook 2. Read Optimized Dataset 3. Train the ML Model 4. Host the trained model and perform Prediction AWS Glue Data Catalog AWS Glue Development Endpoint SageMaker notebook 2 3 4 1
  • 56. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hands-on:
  • 57. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Run ML Notebook interactively • Initialize variables and import spark libraries • Retrieve Optimized NYC Taxi trips Dataset • Observe features against target label (tip_amount) • Perform feature engineering • Split feature engineered dataset into Train and Test • Launch Spark XGBoostEstimator (Observe hyperparameters) • Perform Prediction and observe accuracy
  • 58. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T You made it!
  • 59. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Thank you! S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Raghu Prabhu (prabhuru@amazon.com) Radhika Ravirala (ravirala@amazon.com)