SlideShare a Scribd company logo
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Presenter Name
• AWS Glue
Introduction to AWS Glue
Ryan Malecky
Solutions Architect
rmalecky@amazon.com
AWS Glue
Simple, flexible, cost-effective ETL
§ AWS Glue is a fully managed ETL (extract, transform, and load) service
§ Categorize your data, clean it, enrich it and move it reliably between
various data stores
§ Once catalogued, your data is immediately searchable and queryable
across your data silos
§ Simple and cost-effective
§ Serverless; runs on a fully managed, auto-scaling Spark environment
Why would AWS get into the ETL space?
We have lots of ETL partners
Amazon Redshift Partner Page for Data Integration
Fivetran
but...Customers are still hand-coding ETL…
70% of ETL jobs are hand-coded
With no use of ETL tools
Actually…
It’s over 90% in the cloud
Code is flexible Code is powerful
You can unit test You can deploy with other code You know your dev tools
Why do we see so much hand-coding?
Hand-coding involves a lot of
undifferentiated heavy lifting…
Brittle Error-prone Laborious
► As data formats change
► As target schemas change
► As you add sources
► As data volume grows
Glue automates the undifferentiated heavy-lifting of ETL
ü Discover and organize data, regardless of
where it lives
ü ETL jobs run under a Serverless execution model
ü Focus on writing transformations,
not handling undifferentiating heavy lifting
AWS Glue: big picture
AWS Glue: components
Data Catalog
§ Discover and Organize your data in various databases,
data warehouses and data lakes
Job Execution
§ Runs jobs in Spark containers – automatic scaling based on SLA
§ Glue is serverless – only pay for the resources you consume
Job Authoring
§ Focus on the writing transformations
§ Generate code through a wizard
§ Write your own code
Glue data catalog
Discover and organize your data sets
Transformations
Data Lakes
Datawarehouses
Databases
Glue data
catalog
Amazon
RDS
Amazon
Redshift
Amazon S3
Describes
Describes
Describes
Glue ETL
Uses
Ad-hoc
investigation
Data warehouse
Analytics
Amazon
Athena
Redshift
Spectrum
EMR
(Hadoop/Spark)
Uses
Uses
Uses
Discover and Organize Data
Glue data catalog
Manage table metadata through a
Web Interface, Hive metastore API,
Hive SQL, or automated through
crawlers
Listening to our customers, we’ve added:
§ Search metadata for data discovery
§ Connections to RDS, Redshift and JDBC
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas
Crawlers: auto-populate data catalog
Run crawlers on-demand and on a schedule to discover new data and schema changes.
Serverless – only pay when crawls run.
Automatic schema inference:
• Built-in classifiers
• Detect file type
• Extract schema
• Identify partitions
• Add your own classifiers
• Grok for each of use
Crawlers: Classifiers
IAM Role
Glue Crawler
Data Lakes
Datawarehouses
Databases
Amazon
RDS
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostreSQL
Aurora
Redshift
Avro
Parquet
ORC
JSON & BJSON
Logs
(Apache, Linux, MS, Ruby, Redis, and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressed Formats
(ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom
Classifiers with Grok!
Job authoring in Glue
Script generated by AWS Glue
Existing script brought into AWS Glue
Blank script authored by you
You have
choices on how
to get started…
1. Pick sources and targets from the data catalog
2. Glue generates transformation graph and Python code
3. Specify trigger condition and execute job
Every Friday
at 3PM GMT
Source table
@ Amazon S3
Transform
Relationalize
Transform
Filter table
Target table
@ Amazon S3
Target table
@ Amazon Redshift
Automatic Code Generation
Glue transformations are flexible and adaptive
Semi-structured schema Relational schema
FKA B B C.X C.Y
PK ValueOffset
A C D [ ]
X Y
B B
§ Flatten semi-structured objects with arbitrary complexity into relational tables, on-the-fly.
§ Pivot arrays and other collection types into a separate table, generating key-foreign key values.
§ Modify mapping as the source schema changes, and modify the target schemas as needed.
§ Human-readable code run on a scalable platform, PySpark
§ Forgiving in the face of failures – handles bad data and crashes
§ Flexible: handles complex semi-structured data, and adapts to source schema changes
Glue ETL scripts are forgiving and flexible
Add Custom Modules and Files
§ Add external Python modules
§ Java JARs required by the script
§ Additional files such as configuration, etc.
Orchestration & resource management
Fully managed, serverless job execution
Serverless job execution
§ Warm pools: pre-configured fleets of
instances to reduce job startup time
§ Auto-configure VPC and role-based access
§ Automatically scale resources to meet SLA
and cost objectives
§ You pay only for the resources you
consume while consuming them.
There is no need to provision, configure, or
manage servers
Customer VPC Customer VPC
Warm pool of instances
Job composition and triggers
Compose jobs globally with event-
based dependencies
§ Easy to reuse and leverage work
across organization boundaries
Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., data availability, job
completion
§ External sources: e.g., AWS Lambda
Marketing: Ad-spend by
customer segment
Event Based
Lambda Trigger
Sales: Revenue by
customer segment
Schedule
Data
based
Central: ROI by
customer
segment
weekly
sales
Data
based
Developer Endpoints
Environment to iteratively develop
and test ETL scripts.
Develop your script in a notebook
and point to an AWS Glue endpoint
to test it.
When you are satisfied with the
results of your development process
you can create an ETL job that runs
your script.
Customer VPC
Glue Instances
S3 Endpoint
Amazon
RDS
Amazon
Redshift
Development
Notebook
Glue ETL Security Model
• ETL Jobs that do not require special handling can be launched
without additional configuration
• For jobs requiring restricted access to data Glue utilizes a VPC
endpoint launching dedicated ENIs into the customer’s VPC which
are assigned private IP address from the customer’s subnet.
• For such jobs, Glue also enforces the use of an S3 VPC endpoint
Usage and Pricing
Glue ETL Pricing
• With Glue, you only pay for the time your ETL job takes to run.
• There are no resources to manage and no upfront costs, and you are not
charged for startup or shutdown time.
• You are charged an hourly rate based on the number of Data Processing Units
(or DPUs) used to run your ETL job.
• A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of
memory and corresponding networking capabilities.
• You are billed per hour in increments of 1 minute, rounded up to the nearest
minute, with a 10-minute minimum duration for each job.
Glue Data Catalog and Crawler Pricing
Data catalog:
• With the AWS Glue data catalog, you can store up to a million objects per month for free. If you
store more than a million objects, you will be charged per 100,000 objects over a million.
• An object in the AWS Glue data catalog is a table, a partition, or a database.
• The first million access requests per month to the AWS Glue data catalog are free. If you exceed
a million requests in a month, you will be charged per million requests over the first million.
• Some common requests are CreateTable, CreatePartition, GetTable and GetPartitions.
Crawlers:
• You will pay an hourly rate for AWS Glue crawler runtime to populate the Glue data catalog,
based on the number of Data Processing Units (or DPUs) used to run your crawler.
• A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory and
corresponding networking capabilities.
• You are billed in increments of 1 minute, rounded up to the nearest minute.
• Use of AWS Glue crawlers is optional, and you can populate the Glue data catalog directly
through the API.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Hands-on Lab: Amazon Athena & Glue
http://github.com/wrbaldwin/da-week/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS

More Related Content

What's hot

Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
Info Alchemy Corporation
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
Lam Le
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Amazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
Amazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon Web Services
 
Intro to AWS Lambda
Intro to AWS Lambda Intro to AWS Lambda
Intro to AWS Lambda
Amazon Web Services
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
Amazon Web Services
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
Amazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
Amazon Web Services
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Amazon Web Services
 
Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWS
AWS Germany
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
Wasm1953
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
Amazon Web Services
 

What's hot (20)

Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Intro to AWS Lambda
Intro to AWS Lambda Intro to AWS Lambda
Intro to AWS Lambda
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWS
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 

Similar to Introduction to AWS Glue

Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Amazon Web Services
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
Sridevi Murugayen
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
Paolo latella
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
Neal Davis
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
Amazon Web Services
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Amazon Web Services
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS Glue
Michael Rainey
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
Marek Kuczynski
 
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
Amazon Web Services
 
Database Freedom - ADB304 - Santa Clara AWS Summit
Database Freedom - ADB304 - Santa Clara AWS SummitDatabase Freedom - ADB304 - Santa Clara AWS Summit
Database Freedom - ADB304 - Santa Clara AWS Summit
Amazon Web Services
 
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
Amazon Web Services Korea
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Amazon Web Services
 

Similar to Introduction to AWS Glue (20)

Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS Glue
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
 
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Database Freedom - ADB304 - Santa Clara AWS Summit
Database Freedom - ADB304 - Santa Clara AWS SummitDatabase Freedom - ADB304 - Santa Clara AWS Summit
Database Freedom - ADB304 - Santa Clara AWS Summit
 
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
AWS 마이그레이션 서비스 - 김일호 :: 2015 리인벤트 리캡 게이밍
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Introduction to AWS Glue

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved Pop-up Loft Presenter Name • AWS Glue Introduction to AWS Glue Ryan Malecky Solutions Architect rmalecky@amazon.com
  • 2. AWS Glue Simple, flexible, cost-effective ETL § AWS Glue is a fully managed ETL (extract, transform, and load) service § Categorize your data, clean it, enrich it and move it reliably between various data stores § Once catalogued, your data is immediately searchable and queryable across your data silos § Simple and cost-effective § Serverless; runs on a fully managed, auto-scaling Spark environment
  • 3. Why would AWS get into the ETL space?
  • 4. We have lots of ETL partners Amazon Redshift Partner Page for Data Integration Fivetran
  • 5. but...Customers are still hand-coding ETL… 70% of ETL jobs are hand-coded With no use of ETL tools Actually… It’s over 90% in the cloud
  • 6. Code is flexible Code is powerful You can unit test You can deploy with other code You know your dev tools Why do we see so much hand-coding?
  • 7. Hand-coding involves a lot of undifferentiated heavy lifting… Brittle Error-prone Laborious ► As data formats change ► As target schemas change ► As you add sources ► As data volume grows
  • 8. Glue automates the undifferentiated heavy-lifting of ETL ü Discover and organize data, regardless of where it lives ü ETL jobs run under a Serverless execution model ü Focus on writing transformations, not handling undifferentiating heavy lifting
  • 9. AWS Glue: big picture
  • 10. AWS Glue: components Data Catalog § Discover and Organize your data in various databases, data warehouses and data lakes Job Execution § Runs jobs in Spark containers – automatic scaling based on SLA § Glue is serverless – only pay for the resources you consume Job Authoring § Focus on the writing transformations § Generate code through a wizard § Write your own code
  • 11. Glue data catalog Discover and organize your data sets
  • 12. Transformations Data Lakes Datawarehouses Databases Glue data catalog Amazon RDS Amazon Redshift Amazon S3 Describes Describes Describes Glue ETL Uses Ad-hoc investigation Data warehouse Analytics Amazon Athena Redshift Spectrum EMR (Hadoop/Spark) Uses Uses Uses Discover and Organize Data
  • 13. Glue data catalog Manage table metadata through a Web Interface, Hive metastore API, Hive SQL, or automated through crawlers Listening to our customers, we’ve added: § Search metadata for data discovery § Connections to RDS, Redshift and JDBC § Classification for identifying and parsing files § Versioning of table metadata as schemas
  • 14. Crawlers: auto-populate data catalog Run crawlers on-demand and on a schedule to discover new data and schema changes. Serverless – only pay when crawls run. Automatic schema inference: • Built-in classifiers • Detect file type • Extract schema • Identify partitions • Add your own classifiers • Grok for each of use
  • 15. Crawlers: Classifiers IAM Role Glue Crawler Data Lakes Datawarehouses Databases Amazon RDS Amazon Redshift Amazon S3 JDBC Connection Object Connection Built-In Classifiers MySQL MariaDB PostreSQL Aurora Redshift Avro Parquet ORC JSON & BJSON Logs (Apache, Linux, MS, Ruby, Redis, and many others) Delimited (comma, pipe, tab, semicolon) Compressed Formats (ZIP, BZIP, GZIP, LZ4, Snappy) Create additional Custom Classifiers with Grok!
  • 16. Job authoring in Glue Script generated by AWS Glue Existing script brought into AWS Glue Blank script authored by you You have choices on how to get started…
  • 17. 1. Pick sources and targets from the data catalog 2. Glue generates transformation graph and Python code 3. Specify trigger condition and execute job Every Friday at 3PM GMT Source table @ Amazon S3 Transform Relationalize Transform Filter table Target table @ Amazon S3 Target table @ Amazon Redshift Automatic Code Generation
  • 18. Glue transformations are flexible and adaptive Semi-structured schema Relational schema FKA B B C.X C.Y PK ValueOffset A C D [ ] X Y B B § Flatten semi-structured objects with arbitrary complexity into relational tables, on-the-fly. § Pivot arrays and other collection types into a separate table, generating key-foreign key values. § Modify mapping as the source schema changes, and modify the target schemas as needed.
  • 19. § Human-readable code run on a scalable platform, PySpark § Forgiving in the face of failures – handles bad data and crashes § Flexible: handles complex semi-structured data, and adapts to source schema changes Glue ETL scripts are forgiving and flexible
  • 20. Add Custom Modules and Files § Add external Python modules § Java JARs required by the script § Additional files such as configuration, etc.
  • 21. Orchestration & resource management Fully managed, serverless job execution
  • 22. Serverless job execution § Warm pools: pre-configured fleets of instances to reduce job startup time § Auto-configure VPC and role-based access § Automatically scale resources to meet SLA and cost objectives § You pay only for the resources you consume while consuming them. There is no need to provision, configure, or manage servers Customer VPC Customer VPC Warm pool of instances
  • 23. Job composition and triggers Compose jobs globally with event- based dependencies § Easy to reuse and leverage work across organization boundaries Multiple triggering mechanisms § Schedule-based: e.g., time of day § Event-based: e.g., data availability, job completion § External sources: e.g., AWS Lambda Marketing: Ad-spend by customer segment Event Based Lambda Trigger Sales: Revenue by customer segment Schedule Data based Central: ROI by customer segment weekly sales Data based
  • 24. Developer Endpoints Environment to iteratively develop and test ETL scripts. Develop your script in a notebook and point to an AWS Glue endpoint to test it. When you are satisfied with the results of your development process you can create an ETL job that runs your script. Customer VPC Glue Instances S3 Endpoint Amazon RDS Amazon Redshift Development Notebook
  • 25. Glue ETL Security Model • ETL Jobs that do not require special handling can be launched without additional configuration • For jobs requiring restricted access to data Glue utilizes a VPC endpoint launching dedicated ENIs into the customer’s VPC which are assigned private IP address from the customer’s subnet. • For such jobs, Glue also enforces the use of an S3 VPC endpoint
  • 27. Glue ETL Pricing • With Glue, you only pay for the time your ETL job takes to run. • There are no resources to manage and no upfront costs, and you are not charged for startup or shutdown time. • You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your ETL job. • A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory and corresponding networking capabilities. • You are billed per hour in increments of 1 minute, rounded up to the nearest minute, with a 10-minute minimum duration for each job.
  • 28. Glue Data Catalog and Crawler Pricing Data catalog: • With the AWS Glue data catalog, you can store up to a million objects per month for free. If you store more than a million objects, you will be charged per 100,000 objects over a million. • An object in the AWS Glue data catalog is a table, a partition, or a database. • The first million access requests per month to the AWS Glue data catalog are free. If you exceed a million requests in a month, you will be charged per million requests over the first million. • Some common requests are CreateTable, CreatePartition, GetTable and GetPartitions. Crawlers: • You will pay an hourly rate for AWS Glue crawler runtime to populate the Glue data catalog, based on the number of Data Processing Units (or DPUs) used to run your crawler. • A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory and corresponding networking capabilities. • You are billed in increments of 1 minute, rounded up to the nearest minute. • Use of AWS Glue crawlers is optional, and you can populate the Glue data catalog directly through the API.
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved Pop-up Loft Hands-on Lab: Amazon Athena & Glue http://github.com/wrbaldwin/da-week/
  • 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved Pop-up Loft aws.amazon.com/activate Everything and Anything Startups Need to Get Started on AWS