5. Data preparation is hard
Lots of data!
Data grows fast: 10x every 5 years
Data is more diverse
Most jobs are hand-coded
Brittle and error prone
Needs customization
Infrastructure management
Machine / instance sizing
Cluster lifecycle management
Scheduling and monitoring
Managing metastores
6. AWS Glue has evolved
Then: a fully managed extract-transform-load (ETL) service, for developers, built by developers
Now: a serverless data preparation service for ETL developers, data engineers, data scientists, business analysts, and more
8. Building data lakes
Break silos: store data from Amazon RDS, other databases, on-premises sources, and streaming data in Amazon S3 (data lake storage)
AWS Glue jobs and workflows ingest, process, and refine data in stages
AWS Glue crawlers load and maintain the Data Catalog
AWS Lake Formation permissions secure the data lake
Access data lakes via a variety of cloud analytic engines
13. AWS Glue Usage and Pricing
ETL Jobs
No resources to manage
Charged based on Data Processing Units (DPUs) at $0.44 per DPU-hour; each DPU provides 4 vCPUs and 16 GB of memory
Three job types: Apache Spark, Python shell, and Spark Streaming
Data Catalog
Free for the first million objects stored (table, table version, partition, or database)
$1.00 per 100,000 objects stored above 1M, per month
Crawlers
Charged based on DPUs: $0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run
With AWS Glue, you only pay for the time your ETL job takes to run.
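Worked example of the job pricing above: an Apache Spark job that uses 10 DPUs and runs for 15 minutes costs 10 DPUs × 0.25 hours × $0.44 per DPU-hour = $1.10.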
15. Security: IAM Permissions – A refresher
IAM User: an identity with credentials such as a username and password
IAM Group: a collection of users
IAM Role: an identity used to delegate access to AWS resources
IAM Service Role: a role that a service assumes to perform actions in your account on your behalf
IAM Policy: an entity that, when attached to an identity, defines its permissions
16. AWS Glue Permissions
Follow the least-privilege access principle
Requires an IAM role
AWS managed policy: AWSGlueServiceRole
Custom policy for fine-grained access
Grant access to related services such as Amazon S3, Amazon Redshift, and Amazon CloudWatch
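A minimal Boto3 sketch of setting up such a service role; the role name is hypothetical, and the trust policy lets AWS Glue assume it:

import json
import boto3

iam = boto3.client('iam')

# Create a role that AWS Glue can assume (role name is hypothetical)
iam.create_role(
    RoleName='MyGlueServiceRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'glue.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)

# Attach the AWS managed policy mentioned above
iam.attach_role_policy(
    RoleName='MyGlueServiceRole',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
)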
17. AWS Glue Components
Crawlers: load and maintain the Data Catalog; infer metadata (schema, table structure); support schema evolution
AWS Glue Data Catalog: Apache Hive Metastore compatible; many integrated analytic services
ETL: extract, transform, and load with serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generated ETL code
Workflow management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; reliable execution
18. AWS Glue is used to cleanse, prep, and catalog
AWS Glue Data Catalog
Workflows orchestrate data flows
Process data in stages
Crawlers populate/maintain the catalog
Jobs execute ETL transforms
19. What are crawlers?
Automatically discover new data and extract schema definitions
Detect schema changes and maintain tables
Detect Apache Hive-style partitions on Amazon S3
Built-in classifiers for popular data types
Create your own custom classifiers using Grok expressions
Run on demand, on a schedule, or as part of workflows
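As a sketch of the custom-classifier option, here is a Grok classifier for application logs created through Boto3; the classifier name, classification, and pattern are all hypothetical:

import boto3

glue = boto3.client('glue')

# Custom Grok classifier (name, classification, and pattern are hypothetical)
glue.create_classifier(
    GrokClassifier={
        'Name': 'app-log-classifier',
        'Classification': 'application-logs',
        'GrokPattern': '%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}',
    }
)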
22. Use exclude patterns to remove unnecessary files
To ignore all METADATA.txt files in the year=2017 folders under the location s3://mydatasets, use the exclude pattern:
year=2017/**/METADATA.txt
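A Boto3 sketch of a crawler using this exclude pattern; the crawler, role, and database names are hypothetical:

import boto3

glue = boto3.client('glue')

# Crawler over s3://mydatasets that skips the METADATA.txt files above
glue.create_crawler(
    Name='mydatasets-crawler',          # hypothetical
    Role='MyGlueServiceRole',           # hypothetical
    DatabaseName='mydatasets_db',       # hypothetical
    Targets={'S3Targets': [{
        'Path': 's3://mydatasets',
        'Exclusions': ['year=2017/**/METADATA.txt'],
    }]},
)
glue.start_crawler(Name='mydatasets-crawler')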
23. Improve performance with multiple crawlers
Periodically audit long-running crawlers to balance workloads
Often crawlers are processing multiple datasets / tables
Improve performance by using multiple crawlers
Crawler granularity is a table or dataset
24. What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, and monitored
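For reference, a minimal PySpark job script skeleton of the kind AWS Glue runs; the ETL logic itself is elided:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard AWS Glue job boilerplate
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... extract, transform, and load logic goes here ...

job.commit()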
25. Under the hood: Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support for complex analytics
• AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality
(Stack: Spark SQL pairs with AWS Glue ETL, Spark DataFrames pair with AWS Glue DynamicFrames, all built on Spark Core: RDDs)
26. Apache Spark – What is it?
Spark is the processing framework in a typical big data stack:
Distributed storage layer: HDFS, Cassandra, NoSQL stores
Cluster resource management: YARN, Mesos
Processing framework layer: MapReduce, Spark, Tez
27. Let’s try that again...
Think of a beehive as your distributed storage
A beehive needs to have a Queen
The Queen serves as your Spark driver
The worker bees serve as your worker nodes
28. Putting it together...
The Spark driver (the Queen) runs the main method, generates the Spark context, and has access to the resource manager
The resource manager allocates executors on the worker nodes
Executors (the worker bees) run tasks and hold cached data
29. DataFrames and DynamicFrames
DataFrames
Core data structure for Spark SQL
Like structured tables
Need schema upfront
Each row has the same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames, for ETL
Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
30. DynamicFrame internals
Schema per record; no upfront schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in a single pass
(Figure: dynamic records, each carrying its own schema:
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
Note that "id" is a string in some records and a number in others; the overall DynamicFrame schema tracks both.)
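A sketch of loading such data and resolving the ambiguous "id" field; the catalog database and table names are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load a DynamicFrame from the Data Catalog (names are hypothetical)
events = glueContext.create_dynamic_frame.from_catalog(
    database='githubarchive_db',
    table_name='events',
)
events.printSchema()  # "id" appears as a choice type: string or int

# Resolve the ambiguity by casting every "id" to long
resolved = events.resolveChoice(specs=[('id', 'cast:long')])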
31. AWS Glue execution model: jobs and stages
(Figure: an example script broken into Spark jobs and stages. Job 1 spans Stage 1 — Read, Apply Mapping, and Filter operations — and Stage 2 — Repartition and Write. Job 2 has its own Stage 1 ending in Show. Actions such as Write and Show are what trigger jobs; a shuffle such as Repartition starts a new stage.)
34. AWS Glue execution model: data partitions
• Apache Spark and AWS Glue are data parallel
• Data is divided into partitions that are processed concurrently
• 1 stage x 1 partition = 1 task
• Overall throughput is limited by the number of partitions
(Figure: the driver distributing per-partition tasks to executors)
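A sketch of checking and adjusting the partition count from a Glue script; the database and table names are hypothetical, and 64 is an arbitrary illustrative target:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load a dataset and inspect how many partitions (tasks per stage) it has
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='pos_db', table_name='sales')  # hypothetical names
df = dyf.toDF()
print(df.rdd.getNumPartitions())

# Too few partitions underutilize the cluster; repartition to spread the work
df = df.repartition(64)  # arbitrary illustrative target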
35. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
36. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
(Figure: the jobs-and-stages diagram from slide 31, illustrating where extra jobs and stages arise.)
37. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
38. File formats
• Text – xSV, JSON
  • May or may not be compressed
  • Human readable when uncompressed
  • Not optimized for analytics
• Columnar – Parquet & ORC
  • Compressed in a binary format
  • Integrated indexes and stats
  • Optimized read performance when selecting only a subset of columns
• Row – Avro
  • Compressed in a binary format
  • Optimized read performance when selecting all columns of a subset of rows
39. Partitioning guidance
• Choose columns that have low cardinality (uniqueness)
  • Partitioning on day/month/year gives 365 unique values per year
  • Partitioning on seconds gives millions of values per year
• You can partition on any column, not just dates
  • For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx
• Look at your query patterns – what data do you want to query, and what do you want to filter out?
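Putting the format and partitioning guidance together, a sketch of writing a dataset as Parquet partitioned by the example columns above; the catalog names are hypothetical, and the output path follows the country/state/bu example:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load, then write out as partitioned Parquet (names/paths are hypothetical)
sales = glueContext.create_dynamic_frame.from_catalog(
    database='sales_db', table_name='sales')
glueContext.write_dynamic_frame.from_options(
    frame=sales,
    connection_type='s3',
    connection_options={
        'path': 's3://abc-corp-sales-data/',
        'partitionKeys': ['country', 'state', 'bu'],
    },
    format='parquet',
)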
40. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
41. Worker Types
Standard
• Provide the maximum capacity in DPUs (max. 100)
• 4 vCPUs of compute capacity, 16 GB of memory, 50 GB disk, and 2 executors
G.1X
• Provide the number of workers (max. 299)
• A worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB disk) and 1 executor per worker
• Recommended for memory-intensive jobs
G.2X
• Provide the number of workers (max. 149)
• A worker maps to 2 DPUs (8 vCPUs, 32 GB of memory, 128 GB disk) and 1 executor per worker
• Recommended for memory-intensive jobs that run ML transforms
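A Boto3 sketch of selecting a worker type when defining a job; the job name, script location, and worker count are hypothetical:

import boto3

glue = boto3.client('glue')

# Define a job on G.1X workers (name, script path, and counts are hypothetical)
glue.create_job(
    Name='memory-heavy-etl',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/etl.py',
        'PythonVersion': '3',
    },
    GlueVersion='2.0',
    WorkerType='G.1X',
    NumberOfWorkers=10,
)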
42. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
• Use G.1X and G.2X instances when your jobs need lots of memory
• Executor memory issues happen most often during sort and shuffle operations
• The driver most often runs out of memory when processing a very large number of input partitions
43. What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
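A Boto3 sketch of a conditional trigger that chains two jobs; all job and trigger names are hypothetical:

import boto3

glue = boto3.client('glue')

# Start job-b whenever job-a succeeds (all names are hypothetical)
glue.create_trigger(
    Name='chain-a-to-b',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'job-a',
        'State': 'SUCCEEDED',
    }]},
    Actions=[{
        'JobName': 'job-b',
        'Arguments': {'--source_run': 'job-a'},  # example of passing parameters
    }],
)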
44. Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
47. Example ETL flow
Goal: prepare and analyze POS data
Create and run a job that will
• Consume data in Amazon S3
• Join the data
• Select only the required columns
• Fill null values
• Write the results to a data lake on Amazon Simple Storage Service (Amazon S3)
Monitor the running job
Analyze the resulting dataset
(Flow: Join Data → Select Columns → Fill null values)
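A sketch of the transform steps in that flow using built-in Glue transforms plus a DataFrame null fill; the catalog tables, join keys, column names, and fill values are all hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Join, SelectFields

glueContext = GlueContext(SparkContext.getOrCreate())

# Consume two datasets from the catalog (database/table names are hypothetical)
sales = glueContext.create_dynamic_frame.from_catalog(database='pos_db', table_name='sales')
stores = glueContext.create_dynamic_frame.from_catalog(database='pos_db', table_name='stores')

# Join the data, then select only the required columns (keys/columns are hypothetical)
joined = Join.apply(frame1=sales, frame2=stores, keys1=['store_id'], keys2=['id'])
selected = SelectFields.apply(frame=joined, paths=['store_id', 'sale_date', 'amount', 'region'])

# Fill null values via the underlying DataFrame, then convert back
filled_df = selected.toDF().na.fill({'amount': 0.0, 'region': 'unknown'})
result = DynamicFrame.fromDF(filled_df, glueContext, 'result')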
48. What are workflows and how do they work?
DAGs with triggers, jobs, and crawlers
Graphical canvas for authoring workflows
Run / rerun and monitor workflow executions
Share parameters across entities in the workflow
52. Incremental data processing with job bookmarks
Track previously processed data
Enable | disable | pause bookmarks on sources
Roll back to a previous state if necessary
53. Incremental data processing with job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run.
Example uses:
• Process POS data files daily
• Process log files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
(Figure: bookmark state advancing across run 1, run 2, run 3)
54. Job bookmark options
Option | Behavior
Enable | Pick up from where you left off
Disable | Ignore the bookmark and process the entire dataset every time
Pause | Temporarily stop advancing the bookmark
Examples:
• Enable: process the newest githubarchive partition
• Disable: process the entire githubarchive table
• Pause: process the previous githubarchive partition
(Figure: bookmark position across run 1, run 2, run 3 under each option)
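These options map to the --job-bookmark-option job argument; a Boto3 sketch with a hypothetical job name:

import boto3

glue = boto3.client('glue')

# Run with bookmarks enabled; the other values are
# 'job-bookmark-disable' and 'job-bookmark-pause'
glue.start_job_run(
    JobName='daily-pos-etl',  # hypothetical
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'},
)

Inside the script, sources must pass a transformation_ctx (e.g., create_dynamic_frame.from_catalog(..., transformation_ctx='sales_src')) so Glue can track state per source, and job.commit() is what advances the bookmark.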
55. Job bookmark example
Periodically run a job over an input table partitioned by year/month/day/hour (e.g., year=2017/month=11/day=27). With bookmarks enabled, each run picks up only the partitions that are new since the last run, avoiding reprocessing previous input and avoiding generating duplicate output.
(Figure: input and output tables across run 1 and run 2, each run covering only the newly arrived partitions)
59. Key Concepts
Virtual Private Cloud (VPC): allows you to specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables
Subnet: a range of IP addresses in your VPC
Public subnet: has internet access
Private subnet: no internet access
VPN connection: a Virtual Private Gateway (VGW) on the Amazon side connected to a Customer Gateway (CGW), a physical device on your corporate network
Security groups: control inbound and outbound traffic for your instances
68. Recent AWS Glue innovations
50+ new features and regions, including:
Features: merge/transition/purge, SageMaker notebooks, AWS Glue streaming, vertical scaling, partition indexes, pause and resume workflows, Spark UI, crawler performance, custom JDBC certificates, AWS Glue VPC sharing, AWS Glue 2.0, C-based libraries, MongoDB and Amazon DocumentDB support, self-managed Kafka support, AWS Glue Studio, Spark 2.4.3, Avro support, continuous logging, resource tags, Python shell jobs, AWS Glue workflows, Python 3.7 on Spark, wheel dependencies, job bookmarks, FindMatches ML transforms, AWS Glue ETL binaries
New regions: Bahrain, São Paulo, Milan, Hong Kong, GovCloud, Stockholm, China Regions
69. AWS Glue 2.0: New engine for real-time workloads
Fast and predictable
• New job execution engine with a new scheduler
• 10x faster job start times
• Predictable job latencies
• Enables micro-batching
• Suits latency-sensitive workloads
Cost effective
• 1-minute minimum billing
• 45% cost savings on average
• Suits diverse workloads
70. AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
• Author AWS Glue jobs visually without coding
• Monitor 1000s of jobs through a single pane of glass
• Distributed processing without the learning curve
• Advanced transforms through code snippets
73. AWS Glue DataBrew
Visual data preparation for analytics and machine learning
Generally available!
74. Amazon Managed Workflows for Apache Airflow
Highly available, secure, and managed workflow orchestration for Apache Airflow
Preview
75. AWS Lake Formation
Build a secure data lake in days
Build data lakes quickly
• Move, store, catalog, and clean your data faster
• Transform to open formats like Parquet and ORC
• ML-based deduplication and record matching
Simplify security management
• Centrally define security, governance, and auditing policies
• Enforce policies consistently across multiple services
• Integrates with IAM and KMS
Provide self-service access to data
• Build a data catalog that describes your data
• Enable analysts and data scientists to easily find relevant data
• Analyze with multiple analytics services without moving data
76. AWS API
Boto3 for Python
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html
Examples:
• Upload files to S3
• Download files from S3
• Run a Glue job
• Run a workflow
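A minimal Boto3 sketch covering those four examples; the bucket, keys, job, and workflow names are all hypothetical:

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

# Upload and download files (bucket and keys are hypothetical)
s3.upload_file('pos_data.csv', 'my-bucket', 'raw/pos_data.csv')
s3.download_file('my-bucket', 'curated/report.csv', 'report.csv')

# Run a Glue job (name is hypothetical)
run = glue.start_job_run(JobName='pos-etl-job')
print(run['JobRunId'])

# Run a workflow (name is hypothetical)
wf = glue.start_workflow_run(Name='pos-daily-workflow')
print(wf['RunId'])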