5. Data preparation is hard
Lots of data!
Data grows fast: 10x every 5 years
Data is more diverse
Most jobs are hand-coded
Brittle and error prone
Needs customization
Infrastructure management
Machine / instance sizing
Cluster lifecycle management
Scheduling and monitoring
Managing metastores
6. AWS Glue has evolved
Then: a fully managed extract-transform-load (ETL) service, for developers, built by developers
Now: a serverless data preparation service for ETL developers, data engineers, data scientists, business analysts, and more
8. Building data lakes
Break silos: store data from Amazon RDS, other databases, on-premises sources, and streaming data in Amazon S3 (data lake storage)
AWS Glue jobs and workflows ingest, process, and refine data in stages
AWS Glue crawlers load and maintain the Data Catalog
AWS Lake Formation permissions secure the data lake
Access data lakes via a variety of cloud analytic engines
13. AWS Glue Usage and Pricing
ETL Jobs
No resources to manage
Charged based on Data Processing Units (DPUs) at $0.44 per DPU-hour; each DPU provides 4 vCPUs and 16 GB of memory
Three job types: Apache Spark, Python shell, and Spark Streaming
Data Catalog
Free for the first million objects stored (table, table version, partition, or database)
$1.00 per 100,000 objects stored above 1M, per month
Crawlers
Charged based on DPUs: $0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run
With AWS Glue, you only pay for the time your ETL job takes to run.
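Worked example of the job pricing above: an Apache Spark job that uses 10 DPUs and runs for 15 minutes costs 10 DPUs × 0.25 hours × $0.44 per DPU-hour = $1.10.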
15. Security: IAM Permissions – A refresher
IAM User: an identity with credentials such as a username and password
IAM Group: a collection of users
IAM Role: an identity used to delegate access to AWS resources
IAM Service Role: a role that a service assumes to perform actions in your account on your behalf
IAM Policy: an entity that, when attached to an identity, defines its permissions
16. AWS Glue Permissions
Follow the least-privilege access principle
Requires an IAM role
AWS managed policy: AWSGlueServiceRole
Custom policy for fine-grained access
Grant access to related services such as Amazon S3, Amazon Redshift, and Amazon CloudWatch
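A minimal Boto3 sketch of setting up such a service role; the role name is hypothetical, and the trust policy lets AWS Glue assume it:

import json
import boto3

iam = boto3.client('iam')

# Create a role that AWS Glue can assume (role name is hypothetical)
iam.create_role(
    RoleName='MyGlueServiceRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'glue.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)

# Attach the AWS managed policy mentioned above
iam.attach_role_policy(
    RoleName='MyGlueServiceRole',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
)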
17. AWS Glue Components
Crawlers: load and maintain the Data Catalog; infer metadata (schema, table structure); support schema evolution
AWS Glue Data Catalog: Apache Hive Metastore compatible; many integrated analytic services
ETL: extract, transform, and load with serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generated ETL code
Workflow management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; reliable execution
18. AWS Glue is used to cleanse, prep, and catalog
AWS Glue Data Catalog
Workflows orchestrate data flows
Process data in stages
Crawlers populate/maintain the catalog
Jobs execute ETL transforms
19. What are crawlers?
Automatically discover new data and extract schema definitions
Detect schema changes and maintain tables
Detect Apache Hive-style partitions on Amazon S3
Built-in classifiers for popular data types
Create your own custom classifiers using Grok expressions
Run on demand, on a schedule, or as part of workflows
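As a sketch of the custom-classifier option, here is a Grok classifier for application logs created through Boto3; the classifier name, classification, and pattern are all hypothetical:

import boto3

glue = boto3.client('glue')

# Custom Grok classifier (name, classification, and pattern are hypothetical)
glue.create_classifier(
    GrokClassifier={
        'Name': 'app-log-classifier',
        'Classification': 'application-logs',
        'GrokPattern': '%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}',
    }
)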
22. Use exclude patterns to remove unnecessary files
To ignore all METADATA.txt files in the year=2017 folders under the location s3://mydatasets, use the exclude pattern:
year=2017/**/METADATA.txt
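A Boto3 sketch of a crawler using this exclude pattern; the crawler, role, and database names are hypothetical:

import boto3

glue = boto3.client('glue')

# Crawler over s3://mydatasets that skips the METADATA.txt files above
glue.create_crawler(
    Name='mydatasets-crawler',          # hypothetical
    Role='MyGlueServiceRole',           # hypothetical
    DatabaseName='mydatasets_db',       # hypothetical
    Targets={'S3Targets': [{
        'Path': 's3://mydatasets',
        'Exclusions': ['year=2017/**/METADATA.txt'],
    }]},
)
glue.start_crawler(Name='mydatasets-crawler')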
23. Improve performance with multiple crawlers
Periodically audit long-running crawlers to balance workloads
Often crawlers are processing multiple datasets / tables
Improve performance by using multiple crawlers
Crawler granularity is a table or dataset
24. What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, and monitored
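For reference, a minimal PySpark job script skeleton of the kind AWS Glue runs; the ETL logic itself is elided:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard AWS Glue job boilerplate
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... extract, transform, and load logic goes here ...

job.commit()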
25. Under the hood: Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support for complex analytics
• AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality
(Stack: Spark SQL pairs with AWS Glue ETL, Spark DataFrames pair with AWS Glue DynamicFrames, all built on Spark Core: RDDs)
26. Apache Spark – What is it?
Spark is the processing framework in a typical big data stack:
Distributed storage layer: HDFS, Cassandra, NoSQL stores
Cluster resource management: YARN, Mesos
Processing framework layer: MapReduce, Spark, Tez
27. Let’s try that again...
Think of a beehive as your distributed storage
A beehive needs to have a Queen
The Queen serves as your Spark driver
The worker bees serve as your worker nodes
28. Putting it together...
The Spark driver (the Queen) runs the main method, generates the Spark context, and has access to the resource manager
The resource manager allocates executors on the worker nodes
Executors (the worker bees) run tasks and hold cached data
29. DataFrames and DynamicFrames
DataFrames
Core data structure for Spark SQL
Like structured tables
Need schema upfront
Each row has the same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames, for ETL
Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
30. DynamicFrame internals
Schema per record; no upfront schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in a single pass
(Figure: dynamic records, each carrying its own schema:
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
Note that "id" is a string in some records and a number in others; the overall DynamicFrame schema tracks both.)
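A sketch of loading such data and resolving the ambiguous "id" field; the catalog database and table names are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load a DynamicFrame from the Data Catalog (names are hypothetical)
events = glueContext.create_dynamic_frame.from_catalog(
    database='githubarchive_db',
    table_name='events',
)
events.printSchema()  # "id" appears as a choice type: string or int

# Resolve the ambiguity by casting every "id" to long
resolved = events.resolveChoice(specs=[('id', 'cast:long')])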
31. AWS Glue execution model: jobs and stages
(Figure: an example script broken into Spark jobs and stages. Job 1 spans Stage 1 — Read, Apply Mapping, and Filter operations — and Stage 2 — Repartition and Write. Job 2 has its own Stage 1 ending in Show. Actions such as Write and Show are what trigger jobs; a shuffle such as Repartition starts a new stage.)
34. AWS Glue execution model: data partitions
• Apache Spark and AWS Glue are data parallel
• Data is divided into partitions that are processed concurrently
• 1 stage x 1 partition = 1 task
• Overall throughput is limited by the number of partitions
(Figure: the driver distributing per-partition tasks to executors)
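A sketch of checking and adjusting the partition count from a Glue script; the database and table names are hypothetical, and 64 is an arbitrary illustrative target:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load a dataset and inspect how many partitions (tasks per stage) it has
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='pos_db', table_name='sales')  # hypothetical names
df = dyf.toDF()
print(df.rdd.getNumPartitions())

# Too few partitions underutilize the cluster; repartition to spread the work
df = df.repartition(64)  # arbitrary illustrative target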
35. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
36. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
(Figure: the jobs-and-stages diagram from slide 31, illustrating where extra jobs and stages arise.)
37. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
38. File formats
• Text – xSV, JSON
  • May or may not be compressed
  • Human readable when uncompressed
  • Not optimized for analytics
• Columnar – Parquet & ORC
  • Compressed in a binary format
  • Integrated indexes and stats
  • Optimized read performance when selecting only a subset of columns
• Row – Avro
  • Compressed in a binary format
  • Optimized read performance when selecting all columns of a subset of rows
39. Partitioning guidance
• Choose columns that have low cardinality (uniqueness)
  • Partitioning on day/month/year gives 365 unique values per year
  • Partitioning on seconds gives millions of values per year
• You can partition on any column, not just dates
  • For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx
• Look at your query patterns – what data do you want to query, and what do you want to filter out?
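Putting the format and partitioning guidance together, a sketch of writing a dataset as Parquet partitioned by the example columns above; the catalog names are hypothetical, and the output path follows the country/state/bu example:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load, then write out as partitioned Parquet (names/paths are hypothetical)
sales = glueContext.create_dynamic_frame.from_catalog(
    database='sales_db', table_name='sales')
glueContext.write_dynamic_frame.from_options(
    frame=sales,
    connection_type='s3',
    connection_options={
        'path': 's3://abc-corp-sales-data/',
        'partitionKeys': ['country', 'state', 'bu'],
    },
    format='parquet',
)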
40. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
41. Worker Types
Standard
• Provide the maximum capacity in DPUs (max. 100)
• 4 vCPUs of compute capacity, 16 GB of memory, 50 GB disk, and 2 executors
G.1X
• Provide the number of workers (max. 299)
• A worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB disk) and 1 executor per worker
• Recommended for memory-intensive jobs
G.2X
• Provide the number of workers (max. 149)
• A worker maps to 2 DPUs (8 vCPUs, 32 GB of memory, 128 GB disk) and 1 executor per worker
• Recommended for memory-intensive jobs that run ML transforms
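A Boto3 sketch of selecting a worker type when defining a job; the job name, script location, and worker count are hypothetical:

import boto3

glue = boto3.client('glue')

# Define a job on G.1X workers (name, script path, and counts are hypothetical)
glue.create_job(
    Name='memory-heavy-etl',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/etl.py',
        'PythonVersion': '3',
    },
    GlueVersion='2.0',
    WorkerType='G.1X',
    NumberOfWorkers=10,
)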
42. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
• Use G.1X and G.2X instances when your jobs need lots of memory
• Executor memory issues happen most often during sort and shuffle operations
• The driver most often runs out of memory when processing a very large number of input partitions
43. What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
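A Boto3 sketch of a conditional trigger that chains two jobs; all job and trigger names are hypothetical:

import boto3

glue = boto3.client('glue')

# Start job-b whenever job-a succeeds (all names are hypothetical)
glue.create_trigger(
    Name='chain-a-to-b',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'job-a',
        'State': 'SUCCEEDED',
    }]},
    Actions=[{
        'JobName': 'job-b',
        'Arguments': {'--source_run': 'job-a'},  # example of passing parameters
    }],
)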
44. Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
47. Example ETL flow
Goal: prepare and analyze POS data
Create and run a job that will
• Consume data in Amazon S3
• Join the data
• Select only the required columns
• Fill null values
• Write the results to a data lake on Amazon Simple Storage Service (Amazon S3)
Monitor the running job
Analyze the resulting dataset
(Flow: Join Data → Select Columns → Fill null values)
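A sketch of the transform steps in that flow using built-in Glue transforms plus a DataFrame null fill; the catalog tables, join keys, column names, and fill values are all hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Join, SelectFields

glueContext = GlueContext(SparkContext.getOrCreate())

# Consume two datasets from the catalog (database/table names are hypothetical)
sales = glueContext.create_dynamic_frame.from_catalog(database='pos_db', table_name='sales')
stores = glueContext.create_dynamic_frame.from_catalog(database='pos_db', table_name='stores')

# Join the data, then select only the required columns (keys/columns are hypothetical)
joined = Join.apply(frame1=sales, frame2=stores, keys1=['store_id'], keys2=['id'])
selected = SelectFields.apply(frame=joined, paths=['store_id', 'sale_date', 'amount', 'region'])

# Fill null values via the underlying DataFrame, then convert back
filled_df = selected.toDF().na.fill({'amount': 0.0, 'region': 'unknown'})
result = DynamicFrame.fromDF(filled_df, glueContext, 'result')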
48. What are workflows and how do they work?
DAGs with triggers, jobs, and crawlers
Graphical canvas for authoring workflows
Run / rerun and monitor workflow executions
Share parameters across entities in the workflow
52. Incremental data processing with job bookmarks
Track previously processed data
Enable | disable | pause bookmarks on sources
Roll back to a previous state if necessary
53. Incremental data processing with job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run.
Example uses:
• Process POS data files daily
• Process log files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
(Figure: bookmark state advancing across run 1, run 2, run 3)
54. Job bookmark options
Option | Behavior
Enable | Pick up from where you left off
Disable | Ignore the bookmark and process the entire dataset every time
Pause | Temporarily stop advancing the bookmark
Examples:
• Enable: process the newest githubarchive partition
• Disable: process the entire githubarchive table
• Pause: process the previous githubarchive partition
(Figure: bookmark position across run 1, run 2, run 3 under each option)
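These options map to the --job-bookmark-option job argument; a Boto3 sketch with a hypothetical job name:

import boto3

glue = boto3.client('glue')

# Run with bookmarks enabled; the other values are
# 'job-bookmark-disable' and 'job-bookmark-pause'
glue.start_job_run(
    JobName='daily-pos-etl',  # hypothetical
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'},
)

Inside the script, sources must pass a transformation_ctx (e.g., create_dynamic_frame.from_catalog(..., transformation_ctx='sales_src')) so Glue can track state per source, and job.commit() is what advances the bookmark.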
55. Job bookmark example
Periodically run a job over an input table partitioned by year/month/day/hour (e.g., year=2017/month=11/day=27). With bookmarks enabled, each run picks up only the partitions that are new since the last run, avoiding reprocessing previous input and avoiding generating duplicate output.
(Figure: input and output tables across run 1 and run 2, each run covering only the newly arrived partitions)
59. Key Concepts
Virtual Private Cloud (VPC): allows you to specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables
Subnet: a range of IP addresses in your VPC
Public subnet: has internet access
Private subnet: no internet access
VPN connection: a Virtual Private Gateway (VGW) on the Amazon side connected to a Customer Gateway (CGW), a physical device on your corporate network
Security groups: control inbound and outbound traffic for your instances
68. Recent AWS Glue innovations
50+ new features and regions, including:
Features: merge/transition/purge, SageMaker notebooks, AWS Glue streaming, vertical scaling, partition indexes, pause and resume workflows, Spark UI, crawler performance, custom JDBC certificates, AWS Glue VPC sharing, AWS Glue 2.0, C-based libraries, MongoDB and Amazon DocumentDB support, self-managed Kafka support, AWS Glue Studio, Spark 2.4.3, Avro support, continuous logging, resource tags, Python shell jobs, AWS Glue workflows, Python 3.7 on Spark, wheel dependencies, job bookmarks, FindMatches ML transforms, AWS Glue ETL binaries
New regions: Bahrain, São Paulo, Milan, Hong Kong, GovCloud, Stockholm, China Regions
69. AWS Glue 2.0: New engine for real-time workloads
Fast and predictable
• New job execution engine with a new scheduler
• 10x faster job start times
• Predictable job latencies
• Enables micro-batching
• Suits latency-sensitive workloads
Cost effective
• 1-minute minimum billing
• 45% cost savings on average
• Suits diverse workloads
70. AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
• Author AWS Glue jobs visually without coding
• Monitor 1000s of jobs through a single pane of glass
• Distributed processing without the learning curve
• Advanced transforms through code snippets
73. AWS Glue DataBrew
Visual data preparation for analytics and machine learning
Generally available!
74. Amazon Managed Workflows for Apache Airflow
Highly available, secure, and managed workflow orchestration for Apache Airflow
Preview
75. AWS Lake Formation
Build a secure data lake in days
Build data lakes quickly
• Move, store, catalog, and clean your data faster
• Transform to open formats like Parquet and ORC
• ML-based deduplication and record matching
Simplify security management
• Centrally define security, governance, and auditing policies
• Enforce policies consistently across multiple services
• Integrates with IAM and KMS
Provide self-service access to data
• Build a data catalog that describes your data
• Enable analysts and data scientists to easily find relevant data
• Analyze with multiple analytics services without moving data
76. AWS API
Boto3 for Python
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html
Examples:
• Upload files to S3
• Download files from S3
• Run a Glue job
• Run a workflow
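A minimal Boto3 sketch covering those four examples; the bucket, keys, job, and workflow names are all hypothetical:

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

# Upload and download files (bucket and keys are hypothetical)
s3.upload_file('pos_data.csv', 'my-bucket', 'raw/pos_data.csv')
s3.download_file('my-bucket', 'curated/report.csv', 'report.csv')

# Run a Glue job (name is hypothetical)
run = glue.start_job_run(JobName='pos-etl-job')
print(run['JobRunId'])

# Run a workflow (name is hypothetical)
wf = glue.start_workflow_run(Name='pos-daily-workflow')
print(wf['RunId'])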