[AWS Builders] Effective AWS Glue
How to ask questions during the session (AWS Builders)
Your questions appear in the GoToWebinar "Questions" pane. By default, all questions are answered publicly; if you would like a private answer, prefix your question with "(private)".
Disclaimer
This content was produced and provided separately for online seminar use, for customer convenience, to explain AWS services. If there is any difference or inconsistency between this content and the AWS site, the AWS site (aws.amazon.com) takes precedence. Likewise, if there is any difference or inconsistency between the Korean translation and the English original on the AWS site (including due to translation delays), the English original takes precedence.
AWS assumes no liability for damages of any kind arising from the use of any information, content, materials, products (including software), or services contained in or provided through this content, including but not limited to direct, indirect, incidental, punitive, and consequential damages.
§ Introduction
§ Glue internal
§ Items
§ Item1: Processing lots of small files
§ Item2: Processing a few large files
§ Item3: Optimizing parallelism
§ Item4: JDBC partitions
§ Item5: Python UDF & performance
§ Item6: Scheduler
§ Item7: Python shell
§ QnA
Introduction
Fully-managed, serverless extract-transform-load (ETL) service
for developers, built by developers
1000s of customers and jobs
A year ago …
AWS Glue
Serverless data catalog & ETL service
Data Catalog
ETL Job
authoring
Discover data and
extract schema
Auto-generates
customizable ETL code
in Python and Scala
Automatically discovers data and stores schema
Data searchable, and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless, flexible, and built on open standards
Putting it together - data lake with AWS Glue
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed
data)
AWS Glue Data Catalog
Crawlers Crawlers Crawlers
Select AWS Glue customers
Glue internal
Programming Environment
• ETL in Python
• Python 2.7
• Boto 3
• ETL in Scala
• Scala 2.11
• Spark Cluster
• Spark 2.2.1
Programming Environment
• 1 DPU (Data Processing Unit)
• 1 m4.xlarge node
• 4 vCPU
• 16 GB RAM
• 2 executors
• 1 Executor
• 5 GB RAM
• 4 Tasks
Driver
Executors
Programming Environment
• Glue Job
• Minimum DPU: 2
• Default DPU: 10
• Ex) 10 DPU Job
• 10 node cluster
• 1 Master + 9 Core Nodes
• 18 executors
• 1 driver
• 17 executors
Programming Environment
• Arguments used internally by AWS Glue
• --conf
• --debug
• --mode
• --JOB_NAME
Basics of ETL Job Programming
1. Initialize
2. Read
3. Transform data
4. Write
## Initialize
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir'])
glueContext = GlueContext(SparkContext.getOrCreate())

## Create DynamicFrame and retrieve data from source
ds0 = glueContext.create_dynamic_frame.from_catalog(
    database = "mysql", table_name = "customer",
    transformation_ctx = "ds0")

## Implement data transformation here
ds1 = ds0 ...

## Write DynamicFrame to Catalog
ds2 = glueContext.write_dynamic_frame.from_catalog(
    frame = ds1, database = "redshift",
    table_name = "customer_dim",
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "ds2")
What is Apache Spark?
Parallel, scale-out data processing engine
Fault-tolerance built-in
Flexible interface: Python scripting, SQL
Rich eco-system: ML, Graph, analytics, …
Apache Spark and AWS Glue ETL
Spark core: RDDs
SparkSQL
Dataframes DynamicFrames
AWS Glue ETL
AWS Glue ETL libraries
Integration: Data Catalog, job orchestration,
code-generation, job bookmarks, S3, RDS
ETL transforms, more connectors & formats
New data structure: DynamicFrames
Dataframes and Dynamic Frames
Dataframes
Core data structure for SparkSQL
Like structured tables
Need schema up-front
Each row has same structure
Suited for SQL-like analytics
Dynamic Frames
Like dataframes for ETL
Designed for processing semi-structured data,
e.g. JSON, Avro, Apache logs ...
Public GitHub timeline is …
35+ event types, semi-structured; payload structure and size varies by event type
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
schema per-record, no up-front schema needed
Easy to restructure, tag, modify
Can be more compact than dataframe rows
Many flows can be done in single-pass
Dynamic Frame internals
[Figure: sample GitHub event records such as {"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}, {"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}, and {"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …} stored as Dynamic Records, each carrying its own (id, type) schema; the Dynamic Frame schema is derived from the records.]
Dynamic Frame transforms
[Figure: ResolveChoice() handles an ambiguous column by projecting it, casting it, or separating it into columns; ApplyMapping() maps source columns to target columns.]
15+ transforms out-of-the-box
Relationalize() transform
[Figure: a semi-structured schema with nested struct and array fields is flattened into a relational schema with primary keys, foreign keys, and value/offset tables for arrays.]
Transforms and adds new columns, types, and tables on-the-fly
Tracks keys and foreign keys across runs
SQL on the relational schema is orders of magnitude faster than JSON processing
Useful AWS Glue transforms
toDF(): Convert to a Dataframe
fromDF(): Convert from a Dataframe
Spigot(): Sample data of any Dynamic Frame to S3
Unbox(): Parse string column as given format into Dynamic Frame
Filter(), Map(): Apply Python UDFs to Dynamic Frames
Join(): Join two Dynamic Frames
And more …
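As a minimal sketch, a Filter() call followed by a toDF() conversion might look like the following. This assumes an AWS Glue job environment (the awsglue module only exists inside Glue, so the import is deferred into the function); dyf is assumed to be an existing DynamicFrame of GitHub events.

```python
def keep_pull_events(dyf):
    """Sketch: Filter() keeps Dynamic Records matching a Python predicate;
    toDF() converts the result to a Spark DataFrame for SQL-style work."""
    from awsglue.transforms import Filter  # only available inside a Glue job
    pulls = Filter.apply(frame=dyf, f=lambda rec: rec["type"] == "PullEvent")
    return pulls.toDF()
```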
Performance: AWS Glue ETL
[Chart: GitHub Timeline ETL time in seconds, DynamicFrames vs DataFrames, for data sizes of 24 (Day), 744 (Month), and 8699 (Year) files; lower is better. On average: 2x performance improvement.]
Configuration: 10 DPUs, Apache Spark 2.1.1
Workload: JSON to CSV, filter for Pull events
Lots of small files, e.g. Kinesis Firehose
Vanilla Apache Spark (2.1.1) overheads
Must reconstruct partitions (2-pass)
Too many tasks: task per file
Scheduling & memory overheads
AWS Glue Dynamic Frames
Integration with Data Catalog
Automatically group files per task
Rely on crawler statistics
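Conceptually, grouping packs many small files into fewer, larger read tasks so a job schedules one task per group rather than one task per file. A toy illustration of the idea (plain Python, not Glue's actual implementation; names are hypothetical):

```python
def group_files(file_sizes, group_size):
    """Greedily pack (name, size) pairs into groups of roughly group_size bytes.
    Illustrative only: sketches the idea behind AWS Glue's file grouping."""
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > group_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = [(f"part-{i}", 1_000_000) for i in range(10)]  # ten 1 MB files
print(len(group_files(files, 4_000_000)))  # 3 read tasks instead of 10
```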
Performance: Lots of small files
0
1000
2000
3000
4000
5000
6000
7000
8000
1:2K 20:40K 40:80K 80:160K 160:320K 320:640K 640: 1280K
AWS Glue ETL small file scalability
Spark Glue
1.2 Million Files
Spark
Out-Of-Memory
>= 320: 640K files
Grouping
Time(sec)
# partitions : # files
AWS Glue execution model: data partitions
• Apache Spark and AWS Glue
are data parallel.
• Data is divided into partitions
that are processed
concurrently.
• A stage is a set of parallel tasks – one task per partition.
[Diagram: driver and executors; overall throughput is limited by the number of partitions.]
AWS Glue execution model: jobs and stages
• Actions (such as Write and Show) trigger jobs.
[Diagram: Job 1 — Read → ApplyMapping → Filter → DropNulls → Repartition → Write, split into Stage 1 and Stage 2; Job 2 — Read → ApplyMapping → Show, a single Stage 1.]
AWS Glue performance: key questions
• How is your dataset partitioned?
• How is your application divided into jobs and stages?
• Data is divided into partitions that are processed concurrently.
Enabling job metrics
Item1: Processing lots of small files
Example: Processing lots of small files
• Let's look at a straightforward JSON to Parquet conversion job
• 1.28 million JSON files in 640 partitions.
Example: Processing lots of small files
• First try: use a standard SparkSQL job
Example: Processing lots of small files
• Driver memory use is growing fast and approaching the 5 GB max.
Example: Processing lots of small files
• Case 2: Run using AWS Glue DynamicFrames.
Example: Processing lots of small files
Driver memory remains below 50%
for the entire duration of execution.
Options for grouping files
• groupFiles
• inPartition: within a partition.
• acrossPartition: from different partitions.
• groupSize
• Target size of each group.
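As a sketch, these options are passed when creating the DynamicFrame; the database and table names below are placeholders, and the 128 MB target is just an example value:

```python
grouping_options = {
    "groupFiles": "inPartition",  # group files within each S3 partition
    "groupSize": "134217728",     # target bytes per group (~128 MB), as a string
}
# Inside a Glue job this would be used as, e.g.:
# ds = glueContext.create_dynamic_frame.from_catalog(
#     database="mydb", table_name="mytable",
#     additional_options=grouping_options)
```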
Example: Aggressively grouping files
• Execution is much slower, but hasn't crashed.
"groupFiles": "acrossPartition"
Example: Aggressively grouping files
Executor memory is higher than driver. Only one executor is active.
Item2: Processing a few large files
Example: Processing a few large files
• Let's see how this looks on a sample dataset of 5 large CSV files.
• Each file is
• 12.5 GB uncompressed
• 1.6 GB gzip
• 1.3 GB bzip2
• Script converts data to Parquet.
Example: Processing a few large gzip files
• We only have 5 partitions – one for each file.
• Job fails after 2 hours.
Example: Processing a few large bzip2 files
• Bzip2 files can be split into blocks, so we see up to 104 tasks.
• Job completes in 18 minutes.
Example: Processing a few large bzip2 files
• With 15 DPU, the number of active executors closely tracks the maximum needed
number of executors.
Example: Processing a few large uncompressed files
• Uncompressed files can be split into lines, so we construct 64MB partitions.
• Job completes in 12 minutes.
Example: Processing a few large files
• If you have a choice of compression type, prefer bzip2.
• If you are using gzip, make sure you have enough files to fully utilize your resources.
• Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider leaving files
uncompressed.
Item3: Optimizing parallelism
Example: optimizing parallelism
Processing large, splittable bzip2 files.
With 10 DPU, metric maximum needed executors shows room for scaling.
§ 17 Executors (Maximum Allocated Executors)
§ 10 DPU = 10 node cluster = 1 master + 9 core nodes
§ 9 core nodes = 18 executors = 1 driver + 17 executors
§ 27 Executors (Maximum Needed Executors)
§ 1 driver + 27 executors = 28 executors = 14 core nodes
§ 14 core nodes + 1 master = 15 node cluster = 15 DPU
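The sizing arithmetic above can be sketched as a pair of helper functions following the deck's cluster model (these are hypothetical helpers for back-of-the-envelope sizing, not a Glue API):

```python
def executors_for_dpu(dpu):
    """Executors available to the job: 1 master + (dpu - 1) core nodes,
    2 executors per core node, minus the one slot used by the driver."""
    core_nodes = dpu - 1
    return core_nodes * 2 - 1

def dpu_for_needed_executors(needed):
    """Smallest DPU count whose cluster provides `needed` executors."""
    core_nodes = -(-(needed + 1) // 2)  # ceil((needed + driver) / 2)
    return core_nodes + 1               # plus the master node

print(executors_for_dpu(10))         # 17 executors at 10 DPU
print(dpu_for_needed_executors(27))  # 15 DPU for 27 needed executors
```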
Example: optimizing parallelism
With 15 DPU, active executors closely tracks maximum needed executors.
Item4: JDBC partitions
AWS Glue JDBC partitions
• For JDBC sources, by default each table is read as a single partition.
• AWS Glue automatically partitions datasets with fewer than 10
partitions after the data has been loaded.
Reading JDBC partitions
A single executor is used for the JDBC query.
Data is repartitioned for the rest of the job.
Options for reading database tables in parallel
• hashexpression – Integer expression to use for distribution.
• hashfield – Single column to use for distribution.
• hashpartitions – Number of parallel queries to make. Default is 7.
• Internally, this turns into a collection of parallel queries, one per hash partition.
Options for reading database tables in parallel
• Guidelines for picking distribution keys.
• For hashexpression, choose a column that is evenly distributed across values. A primary key works well.
• If no such field exists, use hashfield to define one.
• Example: The taxi dataset does not have a primary key, so we set hashfield to
partition based on day of the month:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "nyctaxi",
    table_name = "green-mysql-large",
    additional_options = {'hashfield': 'day(lpep_pickup_datetime)',
                          'hashpartitions': 15})
Options for reading database tables in parallel
• Four executors can process 16 partitions concurrently.
Options for reading database tables in parallel
• Make sure to understand the impact on the database engine.
Job Bookmarks for JDBC Queries
• Job bookmarks only work when the source table has an ordered
primary key.
• Updates are not handled today.
Item5: Python performance
Python performance
• Using map and filter in Python is expensive for large data sets.
• All data is serialized and sent between the Spark JVM and the Python VM.
• Alternatives
• Use the AWS Glue Scala SDK.
• Convert to a DataFrame and use Spark SQL expressions.
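A minimal sketch of the second alternative, assuming a Glue job environment (awsglue and pyspark only exist inside Glue, so the imports are deferred into the function; dyf and glue_context are assumed to come from the job):

```python
def filter_in_jvm(dyf, glue_context):
    """Sketch: replace a Python Map/Filter UDF with a Spark SQL expression.
    The filter is evaluated inside the JVM, avoiding JVM<->Python serialization."""
    from awsglue.dynamicframe import DynamicFrame
    import pyspark.sql.functions as F
    df = dyf.toDF()
    df = df.filter(F.col("type") == "PullEvent")  # runs in the JVM
    return DynamicFrame.fromDF(df, glue_context, "filtered")
```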
Item6: Scheduler
Options for scheduling AWS Glue jobs:
• AWS Glue's built-in scheduler
• Boto3
• Jenkins with boto3
• Oozie with boto3
• Airflow with boto3
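Each of the external schedulers above can start a Glue job through the boto3 Glue client's start_job_run call. A sketch (the job name and arguments below are placeholders; boto3 is imported lazily so the snippet stands alone outside AWS):

```python
def start_glue_job(job_name, arguments=None):
    """Start an AWS Glue job run and return its JobRunId."""
    import boto3  # bundled in Glue Python shell and typical scheduler workers
    glue = boto3.client("glue")
    resp = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    return resp["JobRunId"]

# e.g. from a Jenkins/Oozie/Airflow task:
# run_id = start_glue_job("my-etl-job", {"--TempDir": "s3://my-bucket/tmp/"})
```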
Item7: Python shell
Announcing a new job type: Python shell
A new cost-effective ETL primitive for small to medium tasks
[Diagram: a Python shell job interacting with AWS and 3rd-party services.]
AWS Glue Python shell specs
Python 2.7 environment with
boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, …
cold spin-up: < 20 sec, support for VPCs, no runtime limit
sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB)
pricing: $0.44 per DPU-hour, 1-min minimum, per-second billing
Python shell collaborative filtering example
Amazon customer reviews dataset (s3://amazon-reviews-pds)
Video category
Compute low-rank approx of (Customer x Product) ratings using SVD
uses scipy sparse matrix and SVD library
Step                Time (sec)
Redshift COPY       13
Extract ratings     5
Generate matrix     1552
SVD (k=1000)        2575
Total               4145 (≈ 69 min, $0.60)
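The SVD step might look like the following scipy sketch. A tiny random matrix stands in for the real (Customer x Product) ratings, and k is far smaller than the k=1000 above; this is an illustration of the library calls, not the talk's actual script.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy (Customer x Product) ratings matrix; real data would come from Redshift.
rng = np.random.default_rng(0)
ratings = csr_matrix(rng.integers(0, 6, size=(20, 15)).astype(float))

# Low-rank approximation via truncated SVD with k singular values.
u, s, vt = svds(ratings, k=5)
approx = u @ np.diag(s) @ vt
print(approx.shape)  # (20, 15)
```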
To help us run better seminars,
please leave your feedback!
▶ We answer your questions.
▶ Slides and session recordings are provided.
http://bit.ly/awskr-webinar