AWS Glue Technical
Enablement Training
Kyle Escosia
Jr. Data Science Specialist
Info Alchemy
Agenda
AWS Glue Overview
AWS Glue Concepts
AWS Glue Deep Dive Components
AWS Glue Configurations (VPC, Security Groups, VPN, etc.)
Reference Architectures
Recent innovations
Complementary AWS Services (DataBrew, Lake Formation, AWS API)
Data at scale
Growing
exponentially
From new
sources
Increasingly
diverse
Used by
many people
Analyzed by
many applications
Why data
preparation?
Data preparation is the first mile of:
Analytics
Business Intelligence
Machine Learning
Data preparation is hard
Lots of data!
• Data grows fast: 10x every 5 years
• Data is more diverse
Needs customization
• Most jobs hand-coded
• Brittle and error prone
Infrastructure management
• Machine / instance sizing
• Cluster lifecycle management
• Scheduling and monitoring
• Managing metastores
AWS Glue has evolved
Then: Fully managed extract-transform-load (ETL) service
For developers, built by developers
Now: Serverless data preparation service
For ETL developers, data engineers, data scientists, business analysts, and more
Select AWS Glue customers
Building data lakes
Break silos, store data in Amazon S3
Sources: Amazon RDS, other databases, on-premises data, streaming data
AWS Glue crawlers load and maintain the Data Catalog
AWS Glue jobs and workflows ingest, process, and refine data in stages
Amazon S3 data lake storage
Access data lakes via a variety of cloud analytic engines
AWS Lake Formation permissions secure the data lake
AWS Glue Concepts
AWS
Glue
Fully managed, serverless ETL service
for developers and data scientists
Serverless review
No infrastructure provisioning, no management
Automatic scaling
Pay for value
Highly available and secure
Easily de-duplicate your data with ML transforms
AWS Glue Usage and Pricing
With AWS Glue, you only pay for the time your ETL job takes to run.
ETL Jobs
No resources to manage
Charged hourly based on Data Processing Units (DPUs): $0.44 per DPU-hour
1 DPU provides 4 vCPU and 16 GB of memory
Three types: Apache Spark, Python Shell, Spark Streaming
Data Catalog
Free for the first million objects stored (table, table version, partition, or database)
$1.00 per 100,000 objects stored above 1M, per month
Crawlers
Charged hourly based on Data Processing Units (DPUs)
$0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run
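The pricing rules above can be turned into a quick estimate. This is only a sketch using the rates quoted on the slide; the function names are made up for illustration, the catalog charge is assumed to be prorated linearly above the free tier, and current AWS rates may differ.

```python
DPU_HOUR_USD = 0.44                 # rate quoted on the slide
CRAWLER_MIN_SECONDS = 10 * 60       # 10-minute minimum per crawler run
FREE_CATALOG_OBJECTS = 1_000_000    # first million objects are free
CATALOG_USD_PER_100K = 1.00         # per 100,000 objects above the free tier


def job_cost(dpus, seconds, min_seconds=0):
    """DPUs x billed hours x rate, with an optional minimum billed duration."""
    billed_seconds = max(seconds, min_seconds)
    return dpus * (billed_seconds / 3600) * DPU_HOUR_USD


def crawler_cost(dpus, seconds):
    """Crawler runs are billed for at least 10 minutes."""
    return job_cost(dpus, seconds, CRAWLER_MIN_SECONDS)


def catalog_monthly_cost(objects):
    """Assumption: objects above the free tier are prorated linearly."""
    billable = max(0, objects - FREE_CATALOG_OBJECTS)
    return billable / 100_000 * CATALOG_USD_PER_100K


print(round(job_cost(dpus=10, seconds=1800), 2))      # 10 DPUs for 30 minutes
print(round(crawler_cost(dpus=2, seconds=120), 4))    # billed as 10 minutes
print(catalog_monthly_cost(1_500_000))                # 500,000 billable objects
```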
AWS Glue Deep Dive Components
Security: IAM Permissions – A refresher
IAM Users
consist of a username and a password
IAM Groups
collection of users
IAM Role
an identity used to delegate access to AWS resources
IAM Service Role
a role that a service assumes to perform actions in your
account on your behalf
IAM Policy
an entity that, when attached to an identity, defines that identity's permissions
AWS Glue Permissions
Follow the least privilege access principle
Requires an IAM Role
AWS Managed Policy: AWSGlueServiceRole
Custom Policy – fine-grained access
Some related services
Amazon S3, Amazon Redshift, Amazon CloudWatch
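As a rough illustration of the role setup above, the two policy documents can be written as plain Python dictionaries. The trust policy lets the Glue service assume the role; the custom policy is one possible least-privilege example. The bucket name and prefix are hypothetical, and a real job would typically need more permissions (for example CloudWatch logging) than shown here.

```python
import json

# ARN of the AWS managed policy named on the slide.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Trust policy that allows the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_policy(bucket, prefix):
    """Custom fine-grained policy: read-only access to a single S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
        ],
    }


# "example-raw-bucket" and "pos/" are placeholder values.
print(json.dumps(glue_trust_policy(), indent=2))
print(json.dumps(s3_read_policy("example-raw-bucket", "pos/"), indent=2))
```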
AWS Glue Components
Crawlers
• Load and maintain the Data Catalog
• Infer metadata: schema, table structure
• Support schema evolution
AWS Glue Data Catalog
• Apache Hive Metastore compatible
• Many integrated analytic services
Extract, transform, and load
• Serverless execution
• Apache Spark / Python shell jobs
• Interactive development
• Auto-generate ETL code
Workflow management
• Orchestrate triggers, crawlers, and jobs
• Build and monitor complex flows
• Reliable execution
AWS Glue is used to cleanse, prep, and catalog
AWS Glue Data Catalog
Workflows orchestrate data flows
Process data in stages
Crawlers populate/maintain catalog
Jobs execute ETL transforms
What are crawlers?
Automatically discover new data and extract schema definitions
• Detect schema changes and maintain tables
• Detect Apache Hive-style partitions on Amazon S3
Built-in classifiers for popular data types
• Create your own custom classifier using Grok expressions
Run on demand, on a schedule, or as part of workflows
Crawlers discover structure
Handles complex, nested fields
Detects Hive-style partitions
What can crawlers classify?
Use exclude patterns to remove unnecessary files
To ignore all METADATA.txt files under the year=2017 folders of the location s3://mydatasets:
Location: s3://mydatasets
Exclude pattern: year=2017/**/METADATA.txt
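The exclude pattern above maps onto the `Exclusions` list of an S3 target in a crawler definition. The sketch below only builds the request parameters; the crawler name, role ARN, and database name are hypothetical placeholders.

```python
def crawler_params(name, role_arn, database, s3_path, exclusions):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {"Path": s3_path, "Exclusions": list(exclusions)}
            ]
        },
    }


params = crawler_params(
    name="mydatasets-crawler",                                   # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical
    database="mydatasets_db",                                    # hypothetical
    s3_path="s3://mydatasets",
    exclusions=["year=2017/**/METADATA.txt"],
)
print(params["Targets"])

# With AWS credentials configured, the crawler could then be created with:
#   import boto3
#   boto3.client("glue").create_crawler(**params)
```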
Improve performance with multiple crawlers
Periodically audit long-running crawlers to balance workloads
Often crawlers are processing multiple datasets / tables
Improve performance by using multiple crawlers
Crawler granularity is table or dataset
What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that
performs extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, and monitored
Under the hood: Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support
for complex analytics
• AWS Glue builds on the Apache Spark runtime to offer ETL-specific
functionality
[Layer diagram, top to bottom: SparkSQL and AWS Glue ETL; Spark DataFrames and AWS Glue DynamicFrames; Spark Core: RDDs]
Apache Spark – What is it?
Processing Framework Layer: MapReduce, Spark, Tez
Cluster Resource Management: YARN, Mesos
Distributed Storage Layer: HDFS, Cassandra, NoSQL
Let’s try that again...
Think of a beehive as your distributed storage
A beehive needs a queen
The queen serves as your Spark driver
The worker bees serve as your worker nodes
Putting it together...
Spark Driver (the Queen)
• Main method
• Generates the Spark Context
• Access to the Resource Manager
Resource Manager
Executors, each with its own cache (the Worker Bees)
DataFrames and DynamicFrames
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema upfront
Each row has same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured
data, e.g., JSON, Avro, Apache logs
Schema per record, no upfront schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in a single pass
Dynamic Frame internals
Dynamic records (each record carries its own fields: id, type)
{"id":"2489", "type": "CreateEvent", "payload": {"creator":…}, …}
{"id":4391, "type": "PullEvent", "payload": {"assets":…}, …}
{"id":"6510", "type": "PushEvent", "payload": {"pusher":…}, …}
Dynamic Frame schema: id, type
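To make the "schema per record" idea concrete, here is a small pure-Python illustration. It is not how AWS Glue implements DynamicFrames; it only shows why the three records above cannot share one fixed schema, since `id` is a string in two of them and an integer in the third. The elided `payload` field is left out.

```python
def record_schema(record):
    """Schema of a single self-describing record: field name -> type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def merged_schema(records):
    """Union of per-record schemas; a field may end up with several types."""
    merged = {}
    for record in records:
        for field, type_name in record_schema(record).items():
            merged.setdefault(field, set()).add(type_name)
    return merged


records = [
    {"id": "2489", "type": "CreateEvent"},
    {"id": 4391, "type": "PullEvent"},
    {"id": "6510", "type": "PushEvent"},
]

schema = merged_schema(records)
print(sorted(schema["id"]))    # both 'int' and 'str' appear for id
print(sorted(schema["type"]))  # only 'str' appears for type
```

In Glue itself, a field with conflicting types like this is surfaced as a choice type, which a job can settle with the DynamicFrame `resolveChoice` method.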
AWS Glue execution model: jobs and stages
[Diagram: Job 1 consists of Stage 1 (Read, Filter, Apply Mapping) and Stage 2 (Repartition, Write). Job 2 consists of a single Stage 1 (Read, Filter, Apply Mapping, Show). Callouts label the actions and the jobs.]
AWS Glue execution model: data partitions
• Apache Spark and AWS Glue
are data parallel.
• Data is divided into partitions
that are processed
concurrently.
• 1 stage x 1 partition = 1 task
[Diagram: Driver and Executors]
Overall throughput is limited
by the number of partitions
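The "1 stage x 1 partition = 1 task" rule leads to some simple arithmetic. The sketch below assumes each executor core runs one task at a time, which is a simplification; the executor and core counts are made-up example values.

```python
import math


def task_count(stages, partitions):
    """One task per (stage, partition) pair."""
    return stages * partitions


def waves(partitions, executors, cores_per_executor):
    """Rounds of tasks needed to process one stage (assumes 1 task per core)."""
    slots = executors * cores_per_executor
    return math.ceil(partitions / slots)


def utilization(partitions, executors, cores_per_executor):
    """Fraction of task slots that a single stage can keep busy at once."""
    slots = executors * cores_per_executor
    return min(partitions, slots) / slots


print(task_count(stages=2, partitions=8))                         # tasks in total
print(waves(partitions=100, executors=4, cores_per_executor=4))   # rounds per stage
print(utilization(partitions=2, executors=4, cores_per_executor=4))
```

With only 2 partitions on 16 slots, most of the cluster sits idle, which is what the slide means by throughput being limited by the number of partitions.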
Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
File formats
• Text – xSV, JSON
  • May or may not be compressed
  • Human readable when uncompressed
  • Not optimized for analytics
• Columnar – Parquet & ORC
  • Compressed in a binary format
  • Integrated indexes and stats
  • Optimized read performance when selecting only a subset of columns
• Row – Avro
  • Compressed in a binary format
  • Optimized read performance when selecting all columns of a subset of rows
Partitioning guidance
• Choose columns that have low cardinality (few unique values)
• Partitioning on day/month/year has 365 unique values per year
• Partitioning on seconds has millions of values per year
• You can partition on any column, not just date
• For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx
• Look at your query patterns: what data do you want to query, and what do
you want to filter out?
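A short sketch of the guidance above: building a Hive-style partition prefix from key/value pairs, and comparing the cardinality of daily versus per-second partitioning. The partition values (`us`, `wa`, `retail`) are placeholders for the `xx` in the slide's example.

```python
def partition_prefix(base, **keys):
    """Build a Hive-style key=value prefix; keys keep the order they are passed in."""
    parts = [f"{key}={value}" for key, value in keys.items()]
    return "/".join([base.rstrip("/")] + parts) + "/"


DAYS_PER_YEAR = 365
SECONDS_PER_YEAR = DAYS_PER_YEAR * 24 * 60 * 60

prefix = partition_prefix(
    "s3://abc-corp-sales-data", country="us", state="wa", bu="retail"
)
print(prefix)
print(DAYS_PER_YEAR, "daily partitions per year")
print(SECONDS_PER_YEAR, "per-second partitions per year")
```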
Worker Types
Standard
Provide the maximum capacity of DPUs (max. 100)
4 vCPUs of compute capacity and 16 GB of memory, 50 GB disk and 2 executors
G.1X
Provide the number of workers (max. 299)
A worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs
G.2X
Provide the number of workers (max. 149)
A worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs that run ML Transforms
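The worker-type table can be encoded to check how much capacity a given configuration provides. The figures come from the slide and may not match current AWS limits; the helper name is made up for illustration. `WorkerType` and `NumberOfWorkers` are the corresponding job settings in the Glue API.

```python
# Per-worker figures as listed on the slide.
WORKER_TYPES = {
    "G.1X": {"dpu": 1, "vcpu": 4, "memory_gb": 16, "disk_gb": 64, "max_workers": 299},
    "G.2X": {"dpu": 2, "vcpu": 8, "memory_gb": 32, "disk_gb": 128, "max_workers": 149},
}


def capacity(worker_type, workers):
    """Total capacity for a worker type and count, validated against the slide's limits."""
    spec = WORKER_TYPES[worker_type]
    if not 1 <= workers <= spec["max_workers"]:
        raise ValueError(f"{worker_type} allows 1 to {spec['max_workers']} workers")
    return {
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
        "dpus": workers * spec["dpu"],
        "vcpus": workers * spec["vcpu"],
        "memory_gb": workers * spec["memory_gb"],
    }


print(capacity("G.1X", 10))
print(capacity("G.2X", 10))
```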
• Use G.1X and G.2X instances when your jobs need lots of memory
• Executor memory issues happen most often during sort and shuffle
operations
• The driver most often runs out of memory when processing a very
large number of input partitions
What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
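The trigger behaviors listed above correspond to parameters of the Glue CreateTrigger API. The sketch below only builds the request dictionaries; the job names, trigger names, cron expression, and the `--run_date` argument are hypothetical examples.

```python
def conditional_trigger(name, upstream_job, downstream_jobs, arguments=None):
    """Start one or more jobs after an upstream job succeeds (job chaining)."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        # One action per downstream job; each can receive its own arguments.
        "Actions": [
            {"JobName": job, "Arguments": dict(arguments or {})}
            for job in downstream_jobs
        ],
        "StartOnCreation": True,
    }


def scheduled_trigger(name, cron, jobs):
    """Start jobs on a schedule, e.g. 'cron(0 2 * * ? *)' for 02:00 UTC daily."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job} for job in jobs],
        "StartOnCreation": True,
    }


chain = conditional_trigger(
    "after-ingest", "ingest-pos", ["clean-pos", "profile-pos"],
    arguments={"--run_date": "2021-01-31"},
)
nightly = scheduled_trigger("nightly-ingest", "cron(0 2 * * ? *)", ["ingest-pos"])
print(chain)
print(nightly)

# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").create_trigger(**chain)
```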
Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
Schedule-driven AWS Glue ETL pipeline
We work our way backward from a daily SLA deadline
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
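One way the event-driven variant could look is a Lambda handler that reacts to a Glue job state-change event and starts the next job. This is a sketch under assumptions: the event is assumed to carry `source`, `detail.state`, and `detail.jobRunId` fields, the job name `refine-pos-data` and the `--upstream_run_id` argument are hypothetical, and the Glue client is injectable so the logic can be exercised without AWS access.

```python
def handler(event, context=None, glue=None, next_job="refine-pos-data"):
    """Start `next_job` when an upstream Glue job reports SUCCEEDED."""
    detail = event.get("detail", {})
    if event.get("source") != "aws.glue" or detail.get("state") != "SUCCEEDED":
        return None  # ignore unrelated events and failed or running jobs

    if glue is None:
        import boto3  # imported lazily; only needed when running against AWS
        glue = boto3.client("glue")

    response = glue.start_job_run(
        JobName=next_job,
        Arguments={"--upstream_run_id": detail.get("jobRunId", "")},
    )
    return response["JobRunId"]


class FakeGlue:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def start_job_run(self, **kwargs):
        self.calls.append(kwargs)
        return {"JobRunId": "jr_local_test"}


fake = FakeGlue()
event = {"source": "aws.glue", "detail": {"state": "SUCCEEDED", "jobRunId": "jr_123"}}
print(handler(event, glue=fake))
print(fake.calls)
```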
Example ETL flow
Goal: prepare and analyze POS data
Create and run a job that will
• Consume data in S3
• Join the data
• Select only the required columns
• Fill null values
• Write the results to a data lake on Amazon Simple Storage
Service (Amazon S3)
Monitor the running job
Analyze the resulting dataset
[Diagram: Join Data, Select Columns, Fill null values]
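The three transforms in this flow (join, select columns, fill null values) can be illustrated on in-memory rows. This is a plain-Python sketch of the logic only, not a Glue script; the sample sales and store rows and their column names are invented for the example.

```python
def join(left, right, key):
    """Inner join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def select_columns(rows, columns):
    """Keep only the listed columns, in the listed order."""
    return [{col: row.get(col) for col in columns} for row in rows]


def fill_nulls(rows, defaults):
    """Replace None with a per-column default where one is given."""
    return [
        {col: (defaults.get(col) if value is None else value)
         for col, value in row.items()}
        for row in rows
    ]


# Hypothetical POS sample data.
sales = [
    {"store_id": 1, "item": "milk", "qty": 2},
    {"store_id": 2, "item": "bread", "qty": None},
    {"store_id": 3, "item": "eggs", "qty": 12},   # no matching store row
]
stores = [
    {"store_id": 1, "city": "Manila"},
    {"store_id": 2, "city": "Cebu"},
]

joined = join(sales, stores, "store_id")
selected = select_columns(joined, ["city", "item", "qty"])
result = fill_nulls(selected, {"qty": 0})
print(result)
```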
What are workflows and how do they work?
DAGs with triggers, jobs, and crawlers
Graphical canvas for authoring workflows
Run / rerun and monitor workflow executions
Share parameters across entities in the workflow
Workflow building blocks
Building workflows
Build workflows with:
Graphical canvas
APIs
AWS CloudFormation templates
Monitoring workflows
Easily monitor / see:
workflows running now
completed workflows
status / errors
Incremental data processing with job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs.
They persist the state of sources, transforms, and sinks on each run.
Track previously processed data
Enable | disable | pause bookmarks on sources
Roll back to a previous state if necessary
Example uses:
Process POS data files daily
Process log files hourly
Track timestamps or primary keys in DBs
Track generated foreign keys for normalization
Job bookmark options
Option     Behavior
Enable     Pick up from where you left off
Disable    Ignore and process the entire dataset every time
Pause      Temporarily disable advancing the bookmark
Examples:
Enable: Process the newest githubarchive partition
Disable: Process the entire githubarchive table
Pause: Process the previous githubarchive partition
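The three options are selected per run through the `--job-bookmark-option` job argument. The sketch below builds the arguments for a StartJobRun call; the job name `process-githubarchive` is a hypothetical placeholder.

```python
# Values accepted by the --job-bookmark-option job argument.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def job_run_arguments(job_name, bookmark):
    """Build StartJobRun parameters with the chosen bookmark behavior."""
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": BOOKMARK_OPTIONS[bookmark]},
    }


for option in ("enable", "disable", "pause"):
    print(job_run_arguments("process-githubarchive", option))

# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").start_job_run(**job_run_arguments("process-githubarchive", "enable"))
```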
Job bookmark example
[Diagram: an input table partitioned by year / month / day / hour (year 2017, months 11 and 12, days 27 and 28 shown); run 1 and run 2 each read new input partitions and write to the output table]
Periodically run a job:
• avoid reprocessing previous input
• avoid generating duplicate output
Questions?
AWS Glue Configurations
Key Concepts
Virtual Private Cloud (VPC)
allows you to specify an IP address range for the VPC, add subnets, associate security
groups, and configure route tables.
Subnet
is a range of IP addresses in your VPC.
Public Subnet: has Internet access
Private Subnet: no Internet access
VPN connection
Virtual Private Gateway (VGW): the Amazon side
Customer Gateway (CGW): a physical device on your corporate network
Security Groups
control inbound and outbound traffic for your instances
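Accessing on-premises network
Detailed Architecture
[Diagram: AWS VPC (10.10.0.0/16) with subnets 10.10.10.0/24 (AWS Glue ENIs: 10.10.10.x) and 10.10.11.0/24, Amazon RDS, a NAT gateway (NAT-GW), an internet gateway (IGW), a virtual private gateway (VGW), and a VPC endpoint (VPCe) for Amazon S3; a VPN tunnel connects to the customer gateway (CGW); labels: JDBC Connection, Internet]
Route tables shown in the diagram:

Destination      Target
10.10.0.0/16     local
0.0.0.0/0        NAT-GW-id

Destination      Target
10.10.0.0/16     local
0.0.0.0/0        IGW-id

Destination      Target
10.10.0.0/16     local
0.0.0.0/0        NAT-GW-id
172.31.0.0/16    VGW-id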
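The last route table can be read as a longest-prefix-match lookup: the most specific matching route decides where a packet goes. The sketch below uses Python's standard `ipaddress` module and assumes that 172.31.0.0/16 is the on-premises range reached through the VGW, which the diagram implies but the transcript does not state.

```python
import ipaddress

# Route table from the slide (third table); the default route is written as 0.0.0.0/0.
ROUTES = [
    ("10.10.0.0/16", "local"),
    ("0.0.0.0/0", "NAT-GW-id"),
    ("172.31.0.0/16", "VGW-id"),
]


def next_hop(destination_ip, routes=ROUTES):
    """Return the target of the most specific route that contains the address."""
    address = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    return max(matches)[1]  # highest prefix length wins


print(next_hop("10.10.11.7"))    # inside the VPC
print(next_hop("172.31.5.9"))    # assumed on-premises range
print(next_hop("8.8.8.8"))       # everything else
```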
Questions?
Reference Architecture
AWS Glue
CPFI Data lake Architecture
Recent innovations
Recent AWS Glue innovations
50+ new features and regions
Merge/transition/purge
SageMaker notebooks
AWS Glue streaming
Vertical scaling
Partition Index
Pause and resume workflows
Spark UI
Crawler performance
Custom JDBC certificates
AWS Glue VPC sharing
AWS Glue 2.0
C-based libraries
MongoDB
Amazon DocumentDB
Self-managed Kafka support
AWS Glue Studio
Spark 2.4.3
AVRO support
Continuous logging
Resource tags
Python shell jobs
AWS Glue workflows
Python 3.7 on Spark
Wheel dependency
Job bookmarks
FindMatches ML transforms
AWS Glue ETL binaries
Regions: Bahrain, Sao Paulo, Milan, Hong Kong, GovCloud, Stockholm, China Regions
AWS Glue 2.0: New engine for real-time workloads
Cost effective
New job execution engine with a new scheduler
10x faster job start times
Predictable job latencies
Enables micro-batching
Latency-sensitive workloads
Fast and predictable
Diverse workloads
1-minute minimum billing
45% cost savings on average
AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets
Complementary AWS Services
AWS Glue DataBrew
Visual data preparation for analytics and machine learning
Generally Available!
Amazon Managed Workflows for Apache Airflow
Highly available, secure, and managed workflow orchestration for Apache Airflow
Preview
AWS Lake Formation
Build a secure data lake in days
Build data lakes quickly
• Move, store, catalog, and clean your data faster
• Transform to open formats like Parquet and ORC
• ML-based deduplication and record matching
Simplify security management
• Centrally define security, governance, and auditing policies
• Enforce policies consistently across multiple services
• Integrates with IAM and KMS
Provide self-service access to data
• Build a data catalog that describes your data
• Enable analysts and data scientists to easily find relevant data
• Analyze with multiple analytics services without moving data
AWS API
Boto3 for Python
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html
Examples:
Upload files to S3
Download files from S3
Run a Glue Job
Run a Workflow
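The four examples listed above could look roughly like this with Boto3. The helper functions take the client as a parameter, so the sketch stays independent of any particular account; the bucket, key, file, job, and workflow names in the usage comment are hypothetical.

```python
def upload(s3, local_path, bucket, key):
    """Upload a local file to S3."""
    s3.upload_file(local_path, bucket, key)


def download(s3, bucket, key, local_path):
    """Download an S3 object to a local file."""
    s3.download_file(bucket, key, local_path)


def run_job(glue, job_name, arguments=None):
    """Start a Glue job run and return its run ID."""
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    return response["JobRunId"]


def run_workflow(glue, workflow_name):
    """Start a Glue workflow run and return its run ID."""
    response = glue.start_workflow_run(Name=workflow_name)
    return response["RunId"]


# With AWS credentials configured, usage might look like:
#   import boto3
#   s3 = boto3.client("s3")
#   glue = boto3.client("glue")
#   upload(s3, "pos.csv", "example-raw-bucket", "pos/2021/pos.csv")
#   download(s3, "example-raw-bucket", "pos/2021/pos.csv", "pos_copy.csv")
#   run_job(glue, "ingest-pos")
#   run_workflow(glue, "pos-daily")
```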
Thank you!
Kyle Escosia
kescosia@info-alchemy.net

Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Recently uploaded (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

AWS Glue Technical Enablement Training

  • 1. AWS Glue Technical Enablement Training Kyle Escosia Jr. Data Science Specialist Info Alchemy
  • 2. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services (DataBrew, Lake Formation, AWS API)
  • 3. Data at scale Growing exponentially From new sources Increasingly diverse Used by many people Analyzed by many applications
  • 4. Why data preparation? Data preparation is the first mile of Analytics Business Intelligence Machine Learning
  • 5. Data preparation is hard. Lots of data: data grows fast (10x every 5 years); data is more diverse. Infrastructure management: machine / instance sizing; cluster lifecycle management; scheduling and monitoring; managing metastores. Needs customization: most jobs hand-coded; brittle and error prone.
  • 6. AWS Glue has evolved. Then: fully managed extract-transform-load (ETL) service; for developers, built by developers. Now: serverless data preparation service; for ETL developers, data engineers, data scientists, business analysts, and more.
  • 8. Building data lakes. Amazon S3 data lake storage: break silos, store data in Amazon S3. AWS Glue jobs and workflows to ingest, process, and refine data in stages. Access data lakes via a variety of cloud analytic engines. Sources: Amazon RDS, other databases, on-premises data, streaming data. AWS Glue crawlers load and maintain the Data Catalog. AWS Lake Formation permissions to secure the data lake.
  • 10. AWS Glue: fully managed, serverless ETL service for developers and data scientists
  • 11. Serverless review: no infrastructure provisioning, no management; automatic scaling; pay for value; highly available and secure
  • 12. Easily de-duplicate your data with ML transforms
  • 13. AWS Glue Usage and Pricing. ETL Jobs: no resources to manage; charged hourly based on Data Processing Units (DPUs), $0.44 per DPU-hour; one DPU provides 4 vCPU and 16 GB of memory; three types: Apache Spark, Python Shell, Spark Streaming. Data Catalog: free for the first million objects stored (table, table version, partition, or database); $1.00 per 100,000 objects stored above 1M, per month. Crawlers: charged hourly based on Data Processing Units (DPUs); $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run. With AWS Glue, you only pay for the time your ETL job takes to run.
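The pricing rules on this slide reduce to simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, the rates are the ones quoted on the slide (they may differ from current AWS pricing), and no minimum billing duration is applied to ETL jobs because the slide does not state one.

```python
# Rates as quoted on the slide (USD); verify against current AWS pricing.
DPU_HOUR_RATE = 0.44
CRAWLER_MIN_SECONDS = 10 * 60        # 10-minute minimum per crawler run
CATALOG_FREE_OBJECTS = 1_000_000     # first million objects are free
CATALOG_RATE_PER_100K = 1.00         # per 100,000 objects above 1M, per month


def etl_job_cost(dpus: int, runtime_seconds: float) -> float:
    """Cost of one ETL job run: DPUs x hours x rate (no minimum applied here)."""
    return dpus * (runtime_seconds / 3600) * DPU_HOUR_RATE


def crawler_cost(dpus: int, runtime_seconds: float) -> float:
    """Cost of one crawler run, billed per second with a 10-minute minimum."""
    billed_seconds = max(runtime_seconds, CRAWLER_MIN_SECONDS)
    return dpus * (billed_seconds / 3600) * DPU_HOUR_RATE


def catalog_storage_cost(objects_stored: int) -> float:
    """Monthly Data Catalog storage cost above the free tier."""
    billable = max(objects_stored - CATALOG_FREE_OBJECTS, 0)
    return billable / 100_000 * CATALOG_RATE_PER_100K
```

For example, a 10-DPU job that runs for 30 minutes costs 10 x 0.5 x $0.44, and a 2-minute crawler run is billed as if it ran for 10 minutes.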
  • 14. AWS Glue Deep Dive Components
  • 15. Security: IAM Permissions – A refresher. IAM Users: consist of a username and a password. IAM Groups: a collection of users. IAM Role: an identity used to delegate access to AWS resources. IAM Service Role: a role that a service assumes to perform actions in your account on your behalf. IAM Policy: an entity that, when attached to an identity, defines its permissions.
  • 16. AWS Glue Permissions. Follow the least privilege access principle. Requires an IAM Role. AWS Managed Policy: AWSGlueServiceRole. Custom Policy: fine-grained access. Some related services: Amazon S3, Amazon Redshift, Amazon CloudWatch.
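As a rough sketch of what these permissions look like in practice, the block below builds the two policy documents a Glue role typically needs: a trust policy that lets the Glue service assume the role, and a custom read-only S3 policy in the spirit of least privilege. The helper names, bucket, and prefix are hypothetical; the documents are plain dictionaries that could be serialized with `json.dumps` and passed to IAM.

```python
import json

# ARN of the AWS managed policy named on the slide.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy allowing the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_policy(bucket: str, prefix: str) -> dict:
    """Fine-grained policy: list one bucket, read objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy_json = json.dumps(s3_read_policy("my-raw-bucket", "pos"), indent=2)
```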
  • 17. AWS Glue Components. Crawlers: load and maintain the Data Catalog; infer metadata (schema, table structure); support schema evolution. AWS Glue Data Catalog: Apache Hive Metastore compatible; many integrated analytic services. Extract, transform, and load: serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generate ETL code. Workflow management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; reliable execution.
  • 18. AWS Glue is used to cleanse, prep, and catalog. AWS Glue Data Catalog. Crawlers populate / maintain the catalog. Jobs execute ETL transforms. Workflows orchestrate data flows and process data in stages.
  • 19. What are crawlers? Crawlers automatically discover new data and extract schema definitions: they detect schema changes and maintain tables, and detect Apache Hive style partitions on Amazon S3. Built-in classifiers for popular data types; create your own custom classifier using Grok expressions. Run on demand, on a schedule, or as part of workflows.
  • 20. Crawlers discover structure. Handles complex, nested fields. Detects Hive-style partitions.
  • 21. What can crawlers classify?
  • 22. Use exclude patterns to remove unnecessary files. To ignore all metadata files in the folder year=2017 for the location s3://mydatasets: include path s3://mydatasets, exclude pattern year=2017/**/METADATA.txt
  • 23. Improve performance with multiple crawlers. Periodically audit long-running crawlers to balance workloads. Often crawlers are processing multiple datasets / tables. Improve performance by using multiple crawlers. Crawler granularity is table or dataset.
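One way to express this include path and exclude pattern is through the crawler definition itself. The sketch below only builds the request dictionary; the crawler name, role ARN, and database name are hypothetical, and the resulting dictionary is assumed to be passed to boto3 as `boto3.client("glue").create_crawler(**request)`.

```python
def crawler_request(name: str, role_arn: str, database: str,
                    include_path: str, exclusions: list) -> dict:
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    "Path": include_path,
                    # Exclude patterns are evaluated relative to the include path.
                    "Exclusions": list(exclusions),
                }
            ]
        },
    }


# Values from the slide plus hypothetical names for the crawler, role, and database.
request = crawler_request(
    name="mydatasets-crawler",
    role_arn="arn:aws:iam::123456789012:role/MyGlueRole",
    database="mydatasets_db",
    include_path="s3://mydatasets",
    exclusions=["year=2017/**/METADATA.txt"],
)
```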
  • 24. What is an AWS Glue job? An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work • A core building block in your production ETL pipeline • Provide your PySpark ETL script or have one automatically generated • Supports a rich set of built-in AWS Glue transformations • Jobs can be started, stopped, monitored
  • 25. Under the hood: Apache Spark and AWS Glue ETL • Apache Spark is a distributed data processing engine with rich support for complex analytics • AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality. Stack: SparkSQL and AWS Glue ETL; Spark DataFrames and AWS Glue DynamicFrames; Spark Core: RDDs
  • 26. Apache Spark – What is it? HDFS YARN MapReduce Spark Cassandra NoSQL Mesos Tez Distributed Storage Layer Cluster Resource Management Processing Framework Layer
  • 27. Let’s try that again. Think of a beehive as your distributed storage. A beehive needs to have a queen. The queen serves as your Spark driver. The worker bees serve as your worker nodes.
  • 28. Putting it together. Spark Driver (the Queen): runs the main method, generates the Spark Context, and has access to the Resource Manager. Resource Manager. Executors, each with a cache (the Worker Bees).
  • 29. DataFrames and DynamicFrames. DataFrames: core data structure for SparkSQL; like structured tables; need schema upfront; each row has the same structure; suited for SQL-like analytics. DynamicFrames: like DataFrames for ETL; designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
  • 30. Dynamic Frame internals. Schema per record, no upfront schema needed. Easy to restructure, tag, modify. Can be more compact than DataFrame rows. Many flows can be done in a single pass. Diagram: a Dynamic Frame schema (id, type) over dynamic records that each carry their own schema: {"id":"2489", "type":"CreateEvent", "payload":{"creator":…}, …} {"id":4391, "type":"PullEvent", "payload":{"assets":…}, …} {"id":"6510", "type":"PushEvent", "payload":{"pusher":…}, …}
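The idea of a per-record schema can be illustrated without Spark. The plain-Python sketch below is a conceptual model only, not how AWS Glue implements DynamicFrames: each record reports its own field types, and the frame-level schema is the union of what was observed. The payload values are placeholders because the slide elides them. Note that `id` arrives both as a string and as an integer; in Glue this kind of ambiguity surfaces as a choice type that can be resolved with the DynamicFrame `resolveChoice` transform.

```python
def record_schema(record: dict) -> dict:
    """Schema of a single record: field name -> Python type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def frame_schema(records: list) -> dict:
    """Union of per-record schemas: field name -> sorted list of observed types."""
    merged = {}
    for record in records:
        for field, type_name in record_schema(record).items():
            merged.setdefault(field, set()).add(type_name)
    return {field: sorted(types) for field, types in merged.items()}


# Records modeled on the slide; payload contents are placeholder values.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "placeholder"}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "placeholder"}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "placeholder"}},
]

schema = frame_schema(records)
ambiguous_fields = [field for field, types in schema.items() if len(types) > 1]
```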
  • 31. AWS Glue execution model: jobs and stages Filter Read Read Stage 1 Repartition Write Stage 2 Job 1 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping
  • 32. AWS Glue execution model: jobs and stages Filter Read Repartition Write Read Job 1 Stage 1 Stage 2 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping Actions
  • 33. AWS Glue execution model: jobs and stages Filter Read Read Job 1 Stage 1 Repartition Write Stage 2 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping Jobs
  • 34. AWS Glue execution model: data partitions • Apache Spark and AWS Glue are data parallel. • Data is divided into partitions that are processed concurrently. • 1 stage x 1 partition = 1 task. Driver, Executors. Overall throughput is limited by the number of partitions
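The relationship "1 stage x 1 partition = 1 task" makes it easy to estimate how well a dataset fills a cluster. The sketch below is a back-of-the-envelope model with hypothetical function names; it assumes each executor core runs one task at a time, which is the Spark default but is not stated on the slide.

```python
import math


def task_count(stages: int, partitions: int) -> int:
    """Total tasks when every stage processes every partition once."""
    return stages * partitions


def scheduling_waves(partitions: int, executors: int, cores_per_executor: int) -> int:
    """Rounds of tasks needed to process one stage (one task per core at a time)."""
    slots = executors * cores_per_executor
    return math.ceil(partitions / slots)


def idle_slots(partitions: int, executors: int, cores_per_executor: int) -> int:
    """Task slots left unused when there are fewer partitions than slots."""
    slots = executors * cores_per_executor
    return max(slots - partitions, 0)
```

With 10 executors of 4 cores each, 100 partitions need 3 waves per stage, while 8 partitions leave 32 of the 40 slots idle. This is why throughput is bounded by the number of partitions.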
  • 35. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type
  • 36. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type Jobs Filter Read Job 1 Stage 1 Repartition Write Stage 2 Apply Mapping Read Filter Apply Mapping Job 2 Show
  • 38. File formats • Text – xSV, JSON • May or may not be compressed • Human readable when uncompressed • Not optimized for analytics • Columnar – Parquet & ORC • Compressed in a binary format • Integrated indexes and stats • Optimized read performance when selecting only a subset of columns • Row – Avro • Compressed in a binary format • Optimized read performance when selecting all columns of a subset of rows
  • 39. Partitioning guidance • Choose columns that have low cardinality (few unique values) • Partitioning on day/month/year has 365 unique values per year • Partitioning on seconds has millions of values per year • You can partition on any column, not just date (for example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx) • Look at your query patterns: what data do you want to query, and what do you want to filter out?
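A small sketch of the guidance above: one helper builds a Hive-style partition prefix like the one on the slide, and another counts how many distinct partitions a choice of columns would create for a sample of rows. The helper names and the sample values are hypothetical.

```python
def partition_prefix(bucket: str, partitions: dict) -> str:
    """Build a Hive-style S3 prefix: s3://bucket/key1=value1/key2=value2/."""
    path = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"s3://{bucket}/{path}/"


def partition_count(rows: list, columns: list) -> int:
    """Number of distinct partitions produced by partitioning rows on columns."""
    return len({tuple(row[column] for column in columns) for row in rows})


# Hypothetical sample rows used to compare candidate partition columns.
rows = [
    {"country": "us", "state": "wa", "order_id": 1},
    {"country": "us", "state": "wa", "order_id": 2},
    {"country": "us", "state": "ca", "order_id": 3},
    {"country": "ph", "state": "ncr", "order_id": 4},
]

prefix = partition_prefix("abc-corp-sales-data", {"country": "us", "state": "wa", "bu": "retail"})
```

Partitioning the sample on country and state yields 3 partitions, while partitioning on the high-cardinality order_id yields one partition per row.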
  • 41. Worker Types. Standard: provide the maximum capacity of DPUs (max. 100); 4 vCPUs of compute capacity and 16 GB of memory, 50 GB disk and 2 executors. G.1X: provide the number of workers (max. 299); a worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk) and 1 executor per worker; recommended for memory-intensive jobs. G.2X: provide the number of workers (max. 149); a worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk) and 1 executor per worker; recommended for memory-intensive jobs that run ML Transforms.
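To compare worker types when sizing a job, the per-worker figures above can be multiplied out. The sketch below uses only the G.1X and G.2X numbers quoted on the slide (limits may have changed since); the function name is hypothetical. In boto3, the corresponding job settings are the `WorkerType` and `NumberOfWorkers` parameters.

```python
# Per-worker figures as quoted on the slide.
WORKER_TYPES = {
    "G.1X": {"dpu": 1, "vcpu": 4, "memory_gb": 16, "disk_gb": 64, "max_workers": 299},
    "G.2X": {"dpu": 2, "vcpu": 8, "memory_gb": 32, "disk_gb": 128, "max_workers": 149},
}


def cluster_capacity(worker_type: str, number_of_workers: int) -> dict:
    """Total DPUs, vCPUs, memory, and disk for a worker type and worker count."""
    spec = WORKER_TYPES[worker_type]
    if not 1 <= number_of_workers <= spec["max_workers"]:
        raise ValueError(
            f"{worker_type} supports 1 to {spec['max_workers']} workers"
        )
    return {
        "dpus": spec["dpu"] * number_of_workers,
        "vcpus": spec["vcpu"] * number_of_workers,
        "memory_gb": spec["memory_gb"] * number_of_workers,
        "disk_gb": spec["disk_gb"] * number_of_workers,
    }
```

Ten G.2X workers and twenty G.1X workers consume the same number of DPUs, but G.2X gives each executor twice the memory, which matters for sort and shuffle-heavy jobs.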
  • 42. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type • Use G.1X and G.2X instances when your jobs need lots of memory • Executor memory issues happen most often during sort and shuffle operations • The driver most often runs out of memory when processing a very large number of input partitions
  • 43. What is an AWS Glue trigger? Triggers are the “glue” in your AWS Glue ETL pipeline. Triggers • Can be used to chain multiple AWS Glue jobs in a series • Can start multiple jobs at once • Can be scheduled, on-demand, or based on job events • Can pass unique parameters to customize AWS Glue job runs
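The trigger behaviors listed above map onto the fields of the Glue CreateTrigger API. The sketch below builds two request dictionaries: a scheduled trigger that starts two jobs at once, and a conditional trigger that chains a job after another succeeds and passes it a custom parameter. All trigger names, job names, and argument names are hypothetical; each dictionary is assumed to be passed as `boto3.client("glue").create_trigger(**request)`.

```python
def scheduled_trigger(name: str, cron: str, job_names: list) -> dict:
    """Trigger that starts one or more jobs on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job} for job in job_names],
        "StartOnCreation": True,
    }


def chained_trigger(name: str, upstream_job: str, downstream_job: str,
                    arguments: dict) -> dict:
    """Trigger that starts a job when an upstream job run succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        # Arguments customize the downstream job run.
        "Actions": [{"JobName": downstream_job, "Arguments": dict(arguments)}],
        "StartOnCreation": True,
    }


# Hypothetical pipeline: two ingest jobs at 02:00 UTC, then a refine job.
nightly = scheduled_trigger("nightly-ingest", "0 2 * * ? *", ["ingest-pos", "ingest-logs"])
refine = chained_trigger("after-ingest-pos", "ingest-pos", "refine-pos", {"--stage": "refined"})
```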
  • 44. Three ways to set up an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State machine–driven
  • 45. Schedule-driven AWS Glue ETL pipeline. We work our way backward from a daily SLA deadline
  • 46. Event-driven AWS Glue ETL pipeline. Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
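A minimal sketch of the event-driven pattern, assuming a CloudWatch Events rule that forwards "Glue Job State Change" events to a Lambda function. The handler starts the next job when an upstream job succeeds. The job names and the `make_handler` factory are hypothetical; the Glue client is injected so the logic can be exercised without AWS access, and in Lambda it would be `boto3.client("glue")`.

```python
def make_handler(glue_client, next_job_by_upstream: dict):
    """Create a Lambda-style handler that chains Glue jobs on success events."""

    def handler(event, context):
        detail = event.get("detail", {})
        # Only react to successful Glue job runs.
        if event.get("source") != "aws.glue" or detail.get("state") != "SUCCEEDED":
            return None
        next_job = next_job_by_upstream.get(detail.get("jobName"))
        if next_job is None:
            return None
        response = glue_client.start_job_run(JobName=next_job)
        return response["JobRunId"]

    return handler


# In AWS Lambda (assumption):
#   import boto3
#   lambda_handler = make_handler(boto3.client("glue"), {"ingest-pos": "refine-pos"})
```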
  • 47. Example ETL flow. Goal: prepare and analyze POS data. Create and run a job that will • Consume data in S3 • Join the data • Select only the required columns • Fill null values • Write the results to a data lake on Amazon Simple Storage Service (Amazon S3). Monitor the running job. Analyze the resulting dataset. Diagram: Join Data, Select Columns, Fill null values.
  • 48. What are workflows and how do they work? DAGs with triggers, jobs, and crawlers. Graphical canvas for authoring workflows. Run / rerun and monitor workflow executions. Share parameters across entities in the workflow.
  • 50. Building workflows. Build workflows with: graphical canvas, APIs, AWS CloudFormation templates
  • 51. Monitoring workflows. Easily monitor / see: workflows running now, completed workflows, status / errors
  • 52. Incremental data processing with job bookmarks. Track previously processed data. Enable | disable | pause bookmarks on sources. Roll back to a previous state if necessary.
  • 53. Incremental data processing with job bookmarks. Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run (run 1, run 2, run 3). Example uses: process POS data files daily; process log files hourly; track timestamps or primary keys in DBs; track generated foreign keys for normalization.
  • 54. Job bookmark options. Option: behavior. Enable: pick up from where you left off. Disable: ignore and process the entire dataset every time. Pause: temporarily disable advancing the bookmark. Examples: Enable: process the newest githubarchive partition. Disable: process the entire githubarchive table. Pause: process the previous githubarchive partition. (Diagram: run 1, run 2, run 3 under enable, disable, pause.)
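These three options correspond to values of the `--job-bookmark-option` job argument. The sketch below maps the slide's option names to those values; the helper name and job name are hypothetical, and the result is assumed to be passed as the `Arguments` of a boto3 `start_job_run` call.

```python
# Values accepted by the --job-bookmark-option job argument.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",    # pick up from where the last run left off
    "disable": "job-bookmark-disable",  # process the entire dataset every time
    "pause": "job-bookmark-pause",      # read incrementally without advancing the bookmark
}


def bookmark_arguments(option: str) -> dict:
    """Job arguments selecting a bookmark behavior for a single run."""
    try:
        return {"--job-bookmark-option": BOOKMARK_OPTIONS[option]}
    except KeyError:
        raise ValueError(f"unknown bookmark option: {option!r}") from None


# Assumed usage with boto3 (job name is hypothetical):
#   glue.start_job_run(JobName="githubarchive-etl", Arguments=bookmark_arguments("enable"))
```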
  • 55. Job bookmark example. Periodically run a job: avoid reprocessing previous input; avoid generating duplicate output. (Diagram: an input table and an output table partitioned by year, month, day, and hour, with run 1 and run 2 each processing a different partition.)
  • 59. Key Concepts. Virtual Private Cloud (VPC): allows you to specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables. Subnet: a range of IP addresses in your VPC. Public subnet: Internet. Private subnet: no Internet. VPN connection: Virtual Private Gateway (VGW) on the Amazon side; Customer Gateway (CGW), a physical device on your corporate network. Security groups: control inbound and outbound traffic for your instances.
  • 61. Detailed Architecture. AWS VPC (10.10.0.0/16) with subnets 10.10.10.0/24 and 10.10.11.0/24. Components: NAT-GW, IGW, AWS Glue ENIs (10.10.10.x), Amazon RDS (JDBC connection), VGW, Amazon S3 VPCe, VPN tunnel to CGW, Internet. Route tables (destination: target): table 1: 10.10.0.0/16: local; 0.0.0.0: NAT-GW-id. Table 2: 10.10.0.0/16: local; 0.0.0.0: IGW-id. Table 3: 10.10.0.0/16: local; 0.0.0.0: NAT-GW-id; 172.31.0.0/16: VGW-id.
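To see how the route tables in this architecture decide where traffic goes, the sketch below performs a longest-prefix match using Python's standard `ipaddress` module. It models the third route table from the slide, under the assumption that the default route written as 0.0.0.0 stands for 0.0.0.0/0; the function name and sample addresses are hypothetical.

```python
import ipaddress

# Third route table from the slide; "0.0.0.0/0" is an assumed reading of "0.0.0.0".
PRIVATE_ROUTES = {
    "10.10.0.0/16": "local",
    "0.0.0.0/0": "NAT-GW-id",
    "172.31.0.0/16": "VGW-id",
}


def route_target(routes: dict, destination_ip: str) -> str:
    """Return the target of the most specific route matching the destination."""
    address = ipaddress.ip_address(destination_ip)
    matches = [
        (network, target)
        for network, target in (
            (ipaddress.ip_network(cidr), target) for cidr, target in routes.items()
        )
        if address in network
    ]
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    # The longest prefix (most specific network) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# The Glue subnet must sit inside the VPC address range.
glue_subnet_in_vpc = ipaddress.ip_network("10.10.10.0/24").subnet_of(
    ipaddress.ip_network("10.10.0.0/16")
)
```

Traffic to another address in the VPC stays local, traffic to the 172.31.0.0/16 network goes through the VGW and the VPN tunnel, and everything else leaves through the NAT gateway.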
  • 65. CPFI Data lake Architecture
  • 68. Recent AWS Glue innovations: 50+ new features and regions. Merge / transition / purge; SageMaker notebooks; AWS Glue streaming; Vertical scaling; Partition Index; Pause and resume workflows; Bahrain; Spark UI; Crawler performance; Sao Paulo; Custom JDBC certificates; Milan; AWS Glue VPC sharing; AWS Glue 2.0; C-based libraries; MongoDB; Amazon DocumentDB; Self-managed Kafka support; AWS Glue Studio; Spark 2.4.3; AVRO support; Continuous logging; Hong Kong; Resource tags; Python shell jobs; GovCloud; AWS Glue workflows; Python 3.7 on Spark; Stockholm; Wheel dependency; Job bookmarks; FindMatches ML transforms; China Regions; AWS Glue ETL binaries
  • 69. AWS Glue 2.0: New engine for real-time workloads. New job execution engine with a new scheduler. Fast and predictable: 10x faster job start times; predictable job latencies. Diverse workloads: enables micro-batching; latency-sensitive workloads. Cost effective: 1-minute minimum billing; 45% cost savings on average.
  • 70. AWS Glue Studio: New visual ETL interface. Makes it easy to author, run, and monitor AWS Glue ETL jobs. Author AWS Glue jobs visually without coding. Monitor 1000s of jobs through a single pane of glass. Distributed processing without the learning curve. Advanced transforms through code snippets.
  • 73. AWS Glue DataBrew: visual data preparation for analytics and machine learning. Generally Available!
  • 74. Amazon Managed Workflows for Apache Airflow: highly available, secure, and managed workflow orchestration for Apache Airflow. Preview
  • 75. AWS Lake Formation: build a secure data lake in days. Simplify security management: centrally define security, governance and auditing policies; enforce policies consistently across multiple services; integrates with IAM and KMS. Provide self-service access to data: build a data catalog that describes your data; enable analysts and data scientists to easily find relevant data; analyze with multiple analytics services without moving data. Build data lakes quickly: move, store, catalog, and clean your data faster; transform to open formats like Parquet and ORC; ML-based deduplication and record matching.
  • 76. AWS API: Boto3 for Python. https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html Examples: upload files to S3; download files from S3; run a Glue job; run a workflow
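The four examples named on this slide can be sketched as thin wrappers around Boto3 client calls. The wrapper names, bucket, keys, and job and workflow names are hypothetical; the clients are passed in as parameters (in real use, `boto3.client("s3")` and `boto3.client("glue")`) so the functions stay easy to test.

```python
def upload_file(s3_client, local_path: str, bucket: str, key: str) -> None:
    """Upload a local file to s3://bucket/key."""
    s3_client.upload_file(local_path, bucket, key)


def download_file(s3_client, bucket: str, key: str, local_path: str) -> None:
    """Download s3://bucket/key to a local file."""
    s3_client.download_file(bucket, key, local_path)


def run_glue_job(glue_client, job_name: str, arguments=None) -> str:
    """Start a Glue job run and return its run ID."""
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    return response["JobRunId"]


def run_workflow(glue_client, workflow_name: str) -> str:
    """Start a Glue workflow run and return its run ID."""
    response = glue_client.start_workflow_run(Name=workflow_name)
    return response["RunId"]


# Assumed usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   s3, glue = boto3.client("s3"), boto3.client("glue")
#   upload_file(s3, "pos.csv", "my-raw-bucket", "pos/pos.csv")
#   run_id = run_glue_job(glue, "refine-pos", {"--stage": "refined"})
```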