AWS Glue
Let's Get Stuck In!
Introduction
to Containers
Chris Taylor
- Worked with SQL Server since 2001
- MCSE – Data Platform
- SQLNE PASS Chapter Group Leader
- SQLRelay Organiser
- Cricket/Football Coaching
Agenda
• Session Aim
• The Problem
• What is AWS Glue?
• Use Cases
• Demos
• Costs
• Q&A
Not on the Agenda
• Comparison with other cloud offerings
Session Aim
• An understanding of the issues faced with ETL development
• Learn by example
• Enough of a taste to get the Glue bug and start experimenting!
The Problem
“[ETL] … consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse”
R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
The Problem
70% of ETL Jobs are hand-coded
with no use of ETL Tools
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Why hand-code?
• Flexible
• Powerful
• Unit test
• Deploy with other code
• You know your dev tools
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Involves a lot of effort
• Data formats change
• Source/target schemas change
• You add sources
• Data volume grows
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
What is AWS Glue?
• Fully managed ETL service
• Serverless
• Automates the undifferentiated heavy lifting of ETL
• Discover, Develop, Deploy
• For Developers by Developers
Components
• Data Catalog
  • Hive Metastore compatible
  • Crawlers automatically extract metadata and create tables
  • Integrated with Amazon Athena and Amazon Redshift Spectrum
• Job Authoring
  • Auto-generates ETL code
  • Built on open frameworks – Python and Spark
  • Developer-centric
• Job Execution
  • Runs jobs on a serverless Spark platform
  • Provides flexible scheduling
  • Handles dependency resolution, monitoring and alerting
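The Discover step (crawl a source, then read what landed in the Data Catalog) can be scripted. This is a minimal sketch, not the demo from the session. It assumes boto3 as the client library, and the crawler and database names in the usage note are hypothetical. The `glue` argument is anything that behaves like `boto3.client("glue")`, using its `start_crawler`, `get_crawler` and `get_tables` calls.

```python
import time


def crawl_and_list_tables(glue, crawler_name, database_name, poll_seconds=15):
    """Run a crawler, wait for it to finish, then return the table names it catalogued.

    `glue` is expected to behave like boto3.client("glue").
    """
    glue.start_crawler(Name=crawler_name)

    # A crawler reports RUNNING, then STOPPING, then READY once it is idle again.
    # Assumption: the state has already left READY when start_crawler returns.
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # First page of results only; a large catalog needs NextToken pagination.
    response = glue.get_tables(DatabaseName=database_name)
    return [table["Name"] for table in response["TableList"]]
```

With credentials configured, a call might look like `crawl_and_list_tables(boto3.client("glue"), "sales-crawler", "sales_db")`. Passing the client in keeps the function easy to exercise with a stub before pointing it at a real account.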
Use Cases?
Understand your data
https://aws.amazon.com/glue/
Query your data lake on Amazon S3
https://aws.amazon.com/glue/
Build event-driven ETL pipelines
https://aws.amazon.com/glue/
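One common shape for an event-driven pipeline is an AWS Lambda function that starts a Glue job whenever a file lands in S3. The sketch below is an illustration under assumptions, not the deck's demo. The job name `nightly-sales-etl` and the argument names `--source_bucket` and `--source_key` are hypothetical. The `glue` argument is anything that behaves like `boto3.client("glue")`.

```python
import urllib.parse


def build_job_arguments(event):
    """Extract bucket and key from the first record of an S3 event notification."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
    key = urllib.parse.unquote_plus(s3["object"]["key"])
    return {"--source_bucket": s3["bucket"]["name"], "--source_key": key}


def handle_s3_event(event, glue, job_name="nightly-sales-etl"):
    """Start one Glue job run for the uploaded object and return its run id."""
    response = glue.start_job_run(JobName=job_name, Arguments=build_job_arguments(event))
    return response["JobRunId"]


def lambda_handler(event, context):
    # Entry point for AWS Lambda; boto3 ships with the Lambda Python runtime.
    import boto3

    return handle_s3_event(event, boto3.client("glue"))
```

Keeping the boto3 import inside `lambda_handler` means the parsing logic can be tested locally with a stub client and a sample event.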
DEMO
Costs
• https://aws.amazon.com/free/
• 1 million objects stored in the AWS Glue Data Catalog**
• 1 million requests made per month to the AWS Glue Data Catalog**
** These free tier offers do not automatically expire at the end of your 12-month AWS Free Tier term, but are available to both existing and new AWS customers indefinitely.
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Costs
• DPU (Data Processing Unit)
• Compute-based usage:
  • ETL jobs, development endpoints and crawlers: $0.44 per DPU-Hour
  • Billed in 1-minute increments
  • 10-minute minimum
  • A single DPU = 4 vCPU and 16 GB of memory
• Data Catalog usage:
  • Data Catalog storage:
    • Free for the first million objects stored
    • $1 per 100,000 objects, per month, stored above 1M
  • Data Catalog requests:
    • Free for the first million requests per month
    • $1 per million requests above 1M
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Costs Example #1
• ETL job
  • Ran for 10 minutes on a 6 DPU environment.
  • The price of 1 DPU-Hour in US East (N. Virginia) is $0.44.
  • Cost of this job run = 6 DPUs * (10/60) hour * $0.44 per DPU-Hour = $0.44.
• Development endpoint
  • Active for 24 minutes.
  • Each development endpoint is provisioned with 5 DPUs.
  • Cost of the development endpoint = 5 DPUs * (24/60) hour * $0.44 per DPU-Hour = $0.88.
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
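The two calculations above can be reproduced with a few lines of Python. This is a sketch of the pricing rules as quoted on the slides ($0.44 per DPU-Hour, 1-minute increments, 10-minute minimum). Current AWS pricing may differ, and the function and constant names are made up for illustration.

```python
import math

DPU_HOUR_RATE = 0.44   # USD per DPU-Hour, as quoted on the slide
MINIMUM_MINUTES = 10   # minimum billed duration per run, as quoted on the slide


def dpu_cost(dpus, minutes, rate=DPU_HOUR_RATE):
    """Cost in USD of running `dpus` DPUs for `minutes` minutes."""
    # Round up to whole minutes, then apply the 10-minute minimum.
    billed_minutes = max(MINIMUM_MINUTES, math.ceil(minutes))
    return dpus * billed_minutes / 60 * rate


print(f"ETL job:              ${dpu_cost(6, 10):.2f}")
print(f"Development endpoint: ${dpu_cost(5, 24):.2f}")
print(f"3-minute job, 2 DPUs: ${dpu_cost(2, 3):.2f}  (billed as 10 minutes)")
```

The last line shows why the 10-minute minimum matters for short jobs: a 3-minute run is charged the same as a 10-minute one.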
Costs Example #2
• Store 1 million tables in your Data Catalog in a given month and make 1 million requests to access these tables.
  • You pay $0 for using the Data Catalog.
  • You are covered under the Data Catalog free tier.
• Your requests double to 2 million.
  • You pay only for the one million requests above the free tier, which is $1.
• You use crawlers to find new tables; they run for 30 minutes and use 2 DPUs.
  • You pay 2 DPUs * (30/60) hour * $0.44 per DPU-Hour = $0.44.
• Your total monthly bill = $0 + $1 + $0.44 = $1.44.
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
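The monthly bill in this example can be worked through in code. This is a sketch using the rates quoted on the slides. The function names are invented for illustration, and charging partial blocks of objects or requests pro rata is an assumption, since the slides do not say how partial blocks are rounded.

```python
FREE_OBJECTS = 1_000_000    # Data Catalog objects stored free each month
FREE_REQUESTS = 1_000_000   # Data Catalog requests free each month


def catalog_storage_cost(objects_stored):
    """$1 per 100,000 objects above the free tier (pro rata is an assumption)."""
    billable = max(0, objects_stored - FREE_OBJECTS)
    return billable / 100_000 * 1.00


def catalog_request_cost(requests):
    """$1 per million requests above the free tier (pro rata is an assumption)."""
    billable = max(0, requests - FREE_REQUESTS)
    return billable / 1_000_000 * 1.00


def crawler_cost(dpus, minutes, rate=0.44):
    """DPU-Hour charge; 30 minutes is already above the 10-minute minimum."""
    return dpus * minutes / 60 * rate


storage = catalog_storage_cost(1_000_000)   # within the free tier
requests = catalog_request_cost(2_000_000)  # one million billable requests
crawler = crawler_cost(2, 30)
total = storage + requests + crawler
print(f"storage ${storage:.2f} + requests ${requests:.2f} + crawler ${crawler:.2f} = ${total:.2f}")
```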
Why can’t I just use Data Pipeline?
Glue
• Discovering unstructured data
• Managed ETL service
• Runs on a serverless Apache Spark environment
• Takes a data-first approach
• Provides an integrated data catalog / metadata
• Querying via Amazon Athena and Amazon Redshift Spectrum
• ETL jobs are Scala or Python based
Data Pipeline
• Simple data replication tasks
• Managed orchestration service
• Greater flexibility (environment, access and compute resources)
• Launches compute resources in your account, allowing you direct access to the Amazon EC2 instances or Amazon EMR clusters
• Runs on a different engine (Hive, Pig)
Conclusion
Good
• Fully managed ETL
• Serverless
• Crawlers for discovering and relationalizing semi-structured / unstructured data
• Development endpoints
Not so good
• Complex costing
• 10-minute minimum job run
• Development endpoints £££££££
• AWS documentation is lacking
• Multiple files in folder (Athena)
• Complex non-scheduled automation
  • None for crawlers!
Summary
• Session Aim
• The Problem
• What is AWS Glue?
• Use Cases
• Demos
• Costs
Questions?
Contact
Links
• http://aws.amazon.com/documentation/glue
• https://www.slideshare.net/search/slideshow?searchfrom=header&q=aws+glue
• https://www.slideshare.net/AmazonWebServices/building-serverless-etl-pipelines-with-aws-glue-aws-summit-sydney-2018?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=6
• https://www.slideshare.net/MichaelRainey3/going-serverless-an-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=5
• https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
• https://gluent.com/access-catalog-query-enterprise-data-gluent-cloud-sync-aws-glue/
Best Practices and Questions
• https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
• https://aws.amazon.com/glue/faqs/
• https://www.accenture.com/us-en/blogs/blogs-kalyani-sayyed-amazon-glue-etl

AWS Glue - let's get stuck in!

  • 2. Gold Sponsors Silver Sponsors Bronze Sponsors Local Partners
  • 3. Introduction to Containers Chris Taylor - Worked with SQL Server since 2001 - MCSE – Data Platform - SQLNE PASS Chapter Group Leader - SQLRelay Organiser - Cricket/Football Coaching
  • 4. Agenda • Session Aim • The Problem • What is AWS Glue? • Use Cases • Demos • Costs • Q&A
  • 5. Not on the Agenda • Comparison with other cloud offerings
  • 6. Session Aim An understanding of the issues faced with ETL Development Learn by example Enough of a taste to get the Glue bug and start experimenting!
  • 7. The Problem “….consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse” R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
  • 8. The Problem 70% of ETL Jobs are hand-coded with no use of ETL Tools https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 9. Why hand-code? • Flexible • Powerful • Unit test • Deploy with other code • You know your dev tools https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 10. Involves a lot of effort • Data formats change • Source/target schemas change • You add sources • Data volume grows https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 11. What is AWS Glue? • Fully managed ETL service • Serverless • Automates the undifferentiated heavy lifting of ETL • Discover, Develop, Deploy • For Developers by Developers
  • 12. Components • Data Catalog • Hive Metastore compatible • Crawlers automatically extract metadata and create tables • Integrated with Amazon Athena, Amazon Redshift Spectrum • Job Authoring • Auto-generates ETL code • Built on open frameworks – Python and Spark • Developer-centric • Job Execution • Runs jobs on a serverless Spark platform • Provides flexible scheduling • Handles dependency resolution, monitoring and alerting
  • 15. Query your data lake on Amazon S3 https://aws.amazon.com/glue/
  • 16. Build event driven ETL pipelines https://aws.amazon.com/glue/
  • 17. DEMO
  • 18. Costs • https://aws.amazon.com/free/ • 1 Million objects stored in the AWS Glue Data Catalog** • 1 Million requests made per month to the AWS Glue Data Catalog** ** These free tier offers do not automatically expire at the end of your 12 month AWS Free Tier term, but are available to both existing and new AWS customers indefinitely https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 19. Costs • DPU • Compute-based usage: • AWS Glue pricing for ETL jobs, development endpoints, and crawlers: $0.44 per DPU-Hour • 1-minute increments • 10-minute minimum • A single DPU = 4 vCPU and 16 GB of memory • Data Catalog usage: • Data Catalog Storage: • Free for the first million objects stored • $1 per 100,000 objects, per month, stored above 1M • Data Catalog Requests: • Free for the first million requests per month • $1 per million requests above 1M https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 20. Costs Example #1 • ETL job • Ran for 10 minutes on a 6 DPU environment. • The price of 1 DPU-Hour in US East (N. Virginia) is $0.44. • The cost for this job run = 6 DPUs * (10/60) hour * $0.44 per DPU-Hour = $0.44. • Development Endpoint • Active for 24 min. • Each development endpoint is provisioned with 5 DPUs. • The cost to use the development endpoint = 5 DPUs * (24/60) hour * $0.44 per DPU-Hour = $0.88. https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 21. Costs Example #2 • Store 1 million tables in your Data Catalog in a given month and make 1 million requests to access these tables. • You pay $0 for using the Data Catalog. • You are covered under the Data Catalog free tier. • Your requests double to 2 million requests. • You pay only for the one million requests above the free tier, which is $1. • If you use crawlers to find new tables and they run for 30 min using 2 DPUs, you pay 2 DPUs * (30/60) hour * $0.44 per DPU-Hour = $0.44. • Your total monthly bill = $0 + $1 + $0.44 = $1.44 https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
  • 22. Why can’t I just use Data Pipeline? Glue: • Discovering unstructured data • Managed ETL service • Runs on a serverless Apache Spark environment • Takes a data-first approach • Provides an integrated data catalog / metadata • Querying via Amazon Athena and Amazon Redshift Spectrum • ETL jobs are Scala or Python based. Data Pipeline: • Simple data replication tasks • Managed orchestration service • Greater flexibility (environment, access and compute resources) • Launches compute resources in your account, allowing you direct access to the Amazon EC2 instances or Amazon EMR clusters • Runs on a different engine (Hive, Pig)
  • 23. Conclusion Good: • Fully Managed ETL • Serverless • Crawlers for discovering and relationalizing semi / unstructured data • Development Endpoints Not so good: • Complex costing • 10-minute minimum job run • Development Endpoint £££££££ • AWS Documentation is lacking • Multiple Files in folder (Athena) • Complex non-scheduled automation • None for Crawlers!
  • 24. Summary • Session Aim • The Problem • What is AWS Glue? • Use Cases • Demos • Costs
  • 27. Links • http://aws.amazon.com/documentation/glue • https://www.slideshare.net/search/slideshow?searchfrom=header&q=aws+glue • https://www.slideshare.net/AmazonWebServices/building-serverless-etl-pipelines-with-aws-glue-aws-summit-sydney-2018?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=6 • https://www.slideshare.net/MichaelRainey3/going-serverless-an-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=5 • https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/ • https://gluent.com/access-catalog-query-enterprise-data-gluent-cloud-sync-aws-glue/
  • 28. Best Practices and Questions • https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html • https://aws.amazon.com/glue/faqs/ • https://www.accenture.com/us-en/blogs/blogs-kalyani-sayyed-amazon-glue-etl
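
The pricing rules and worked examples on the Costs slides (19 to 21) can be checked with a few lines of Python. This is a minimal sketch, not an official calculator: the function names `dpu_cost` and `catalog_cost` are made up for illustration, the rates are the ones quoted on the slides (US East, N. Virginia, at the time of the talk) and may have changed since, and two behaviours are assumptions rather than anything the slides state: partial minutes are rounded up, and usage above the free tier is charged pro rata.

```python
import math

# Rates as quoted on the slides; check the current AWS pricing page before relying on them.
DPU_HOUR_RATE = 0.44          # USD per DPU-Hour
MIN_BILLED_MINUTES = 10       # 10-minute minimum
FREE_OBJECTS = 1_000_000      # free Data Catalog objects per month
FREE_REQUESTS = 1_000_000     # free Data Catalog requests per month


def dpu_cost(dpus, minutes):
    """Cost in USD of an ETL job, crawler, or development endpoint run."""
    # Assumption: partial minutes round up, then the 10-minute minimum applies.
    billed_minutes = max(math.ceil(minutes), MIN_BILLED_MINUTES)
    return round(dpus * billed_minutes * DPU_HOUR_RATE / 60, 2)


def catalog_cost(objects_stored, requests):
    """Monthly Data Catalog cost in USD."""
    # Assumption: usage above the free tier is charged pro rata.
    storage = max(objects_stored - FREE_OBJECTS, 0) / 100_000 * 1.00
    request = max(requests - FREE_REQUESTS, 0) / 1_000_000 * 1.00
    return round(storage + request, 2)


# Costs Example #1
print(dpu_cost(6, 10))    # ETL job: 0.44
print(dpu_cost(5, 24))    # development endpoint: 0.88

# Costs Example #2
catalog = catalog_cost(1_000_000, 2_000_000)   # 1.0
crawler = dpu_cost(2, 30)                      # 0.44
print(round(catalog + crawler, 2))             # monthly bill: 1.44
```

Under these assumptions, a job that finishes in 3 minutes costs the same as one that runs for 10, which is the "10-minute minimum job run" drawback listed in the conclusion.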