202201 AWS Black Belt Online Seminar: Apache Spark Performance Tuning for AWS Glue
1. © 2022, Amazon Web Services, Inc. or its Affiliates.
Chie Hayashida
Solution Architect
2022/01
AWS Glue Spark
Performance Tuning
2. © 2022, Amazon Web Services, Inc. or its Affiliates.
Self Introduction
Chie Hayashida
Amazon Web Services Japan
Solution Architect
3. © 2022, Amazon Web Services, Inc. or its Affiliates.
The target audience of these slides
o People who have done the AWS Glue tutorial or have equivalent
knowledge
o People who have written Apache Spark applications
o People who would like to improve their existing AWS Glue jobs
o The code in these slides is all PySpark, because many AWS Glue
users choose PySpark
o These slides cover Glue 2.0 (Spark 2.4.3) and Glue 3.0 (Spark 3.1.1)
4. © 2022, Amazon Web Services, Inc. or its Affiliates.
Agenda
• Architecture of AWS Glue and Apache Spark
• AWS Glue Functions related to performance
• How to proceed with AWS Glue Spark performance tuning
• AWS Glue Spark performance tuning patterns
5. © 2022, Amazon Web Services, Inc. or its Affiliates.
Agenda
• Architecture of Apache Spark
• AWS Glue specific features (performance related)
• How to proceed with performance tuning of AWS Glue Spark
• Basic strategy for tuning AWS Glue Spark jobs
• Tuning Patterns for AWS Glue Spark Jobs
6. © 2022, Amazon Web Services, Inc. or its Affiliates.
AWS Glue and Apache Spark
Data source
crawler data catalog
Serverless
Engine
(1) Crawl data
(2) Manage metadata
AWS Glue
(3) Triggered manually / by schedule / by event
(5) Run transformation job and load data to the target data source
(4) Extract data from the input data source
scheduler
Data source
Other AWS
Services
Managed Service
of Apache Spark
7. © 2022, Amazon Web Services, Inc. or its Affiliates.
Architecture of Apache Spark
8. © 2022, Amazon Web Services, Inc. or its Affiliates.
Architecture of Apache Spark (cluster mode)
• The driver divides a job into one or more tasks and assigns them to
executors (the cluster manager allocates the resources they run on)
• On a single worker node, more than one executor can be started.
• More than one task can be executed on a single executor.
Driver Program
SparkContext Cluster Manager
Worker Node
Executor
Task Task
Cache
Worker Node
Executor
Task Task
Cache
Executor
Executor
9. © 2022, Amazon Web Services, Inc. or its Affiliates.
Architecture of Spark (cluster mode)
1) Request the resources needed by the application.
2) Launch the executors required for the job on each worker.
3) Divide the process into tasks and assign them to the executors.
4) Each executor runs its assigned tasks; when a task is completed, the executor informs the Driver Program.
Data is exchanged between tasks as necessary.
5) After 3) and 4) are repeated several times, the result of the process is returned.
Driver Program
SparkContext Cluster Manager
Worker Node
Executor
Task Task
Cache
Worker Node
Executor
Task Task
Cache
Executor
Executor
10. © 2021, Amazon Web Services, Inc. or its Affiliates.
How data is processed
• The data being processed is defined as a distributed collection called an RDD
• An RDD is made up of one or more "partitions"
• A partition is processed by a single "task"
• In practice, Spark code is often written through interfaces such as
DataFrame, which treat the data as a typed table.
S3
file file file
RDD
Partition
Partition
Partition
RDD
Partition
Partition
Partition
RDD
Partition
Partition
Partition
S3
file file file
11. © 2022, Amazon Web Services, Inc. or its Affiliates.
Components of Apache Spark
Spark Core (RDD)
DataFrame/Catalyst Optimizer
Spark ML
Structured
Streaming
GraphX
Spark SQL
12. © 2021, Amazon Web Services, Inc. or its Affiliates.
RDD and DataFrame
RDD data architecture image
[
[1, Bob, 24],
[2, Alice, 48],
[3, Ken, 10],
…
]
DataFrame data architecture image
col1 col2 col3
1 Bob 24
2 Alice 48
3 Ken 10
… … …
With both interfaces, the code is written as if the data is processed as a
list/table, but the actual data is distributed across multiple servers.
DataFrame is a high-level API for RDDs, and processing described in a
DataFrame is internally executed as an RDD.
13. © 2021, Amazon Web Services, Inc. or its Affiliates.
Lazy evaluation
• There are two types of Spark processing: "actions" and "transformations"
• When an "action" is executed, all the previous processing necessary for the
action is performed
• A series of processes executed by an "action" is called a "job"
• Note that a "job" here is different from a Glue job.
>>> df1 = spark.read.csv(…)
>>> df2 = spark.read.json(…)
>>> df3 = df1.filter(…)
>>> df4 = spark.read.csv(…)
>>> df5 = df2.join(df3, …)
>>> df5.count()
Action
Up to this point, no actual
processing is done.
At this point, the previous process
is executed for the first time.
The df4 process is not a dependency of df5.count(), so it will not be
executed in this action.
14. © 2022, Amazon Web Services, Inc. or its Affiliates.
Examples of transformations and actions
Transform: data generation and
processing
• select()
• Selecting a column
• read
• Loading data
• filter()
• Data Filtering
• groupBy()
• Aggregation by group
• sort()
• Sorting data
Action: Output the processing
result
• count()
• Counting the number of
records
• write
• Exporting to the file system
• collect()
• Collect all data on Driver node
• show()
• View a sample of the data
• describe()
• View data statistics
15. © 2022, Amazon Web Services, Inc. or its Affiliates.
Spark Applications
o A Spark application consists of multiple jobs.
glueContext = GlueContext(SparkContext.getOrCreate(conf))
spark = glueContext.spark_session
df1 = spark.read.json(...)
df1.show() # job1
df1.filter(df1.col1 == 'a').write.parquet(...) # job2
df1.filter(df1.col2 == 'b').write.parquet(...) # job3
An application is a set of processes executed in a
single GlueContext (or SparkContext).
16. © 2020, Amazon Web Services, Inc. or its Affiliates.
Shuffle and Stage
Stage1 Stage2
Partition Partition
Partition Partition
Partition Partition
Partition Partition
Partition Partition
Partition Partition
Partition Partition
df2 = df1.filter(df1.price > 500).groupBy("item").sum().withColumn("bargain", F.col("sum(price)") * 0.8)
• Stages are divided by shuffling
• Multiple tasks are processed concurrently in one stage
task
17. © 2021, Amazon Web Services, Inc. or its Affiliates.
Processing with and without shuffling (exchange of
data between tasks)
Example without shuffling: df2 = df1.filter(df1.price > 500)

df1:                     df2:
item     price           item     price
beef     1300            beef     1300
pork     200             chicken  700
chicken  700
fish     400

Example with shuffling: df2 = df1.groupBy('item').sum()

df1:                     df2:
item     num             item     num
beef     2               beef     6
pork     3               pork     9
beef     4
pork     5
18. © 2022, Amazon Web Services, Inc. or its Affiliates.
Processing with and without shuffling (exchange of data
between tasks)
Processing without shuffling
• read
• filter()
• withColumn()
• UDF
• coalesce()
Processing with shuffling
• join()
• groupBy()
• orderBy()
• repartition()
19. © 2022, Amazon Web Services, Inc. or its Affiliates.
Optimization with Catalyst Query Optimizer
• Processing described in a DataFrame is converted into optimized RDDs
by the optimizer before the process is executed.
df1 = spark.read.csv('s3://path/to/data1')
df2 = spark.read.parquet('s3://path/to/data2')
df3 = df1.join(df2, df1.col1 == df2.col1)
df4 = df3.filter(df3.col2 == 'meat').select(df3.col3, df3.col5)
df4.show()
[Diagram] data1 (10GB) and data2 (50GB):
• Logical plan: scan(data1) + scan(data2) → join (60GB) → filter (20GB)
• Physical plan: the optimizer moves the filter before the join, so the join
only has to process about 20GB
• Physical plan with storage-side optimization: predicate pushdown and
column pruning shrink the scans themselves (to roughly 10GB and 20GB),
further reducing the data volume
20. © 2022, Amazon Web Services, Inc. or its Affiliates.
Architecture of PySpark
Driver
Python
Spark
Context
Worker
Executor
Task
Task
Python Worker
Python Worker
Executor
Worker
Worker
• Processing written with the PySpark DataFrame API is translated via Py4J
and executed on the JVM.
• UDF code written in Python is executed as Python in a Python worker, per
task.
21. © 2022, Amazon Web Services, Inc. or its Affiliates.
Introduction to AWS Glue
specific features
22. © 2021, Amazon Web Services, Inc. or its Affiliates.
Data Catalog
• Data Catalog has metadata(tablenames, column names, S3 paths and so
on) necessary to access data sources such as S3 and databases from
Glue, Athena, Redshift Spectrum, etc.
• There are three ways to create metadata in the data catalog: crawler,
Glue API, and DDL (Athena/EMR/Redshift Spectrum).
• Amazon DynamoDB, Amazon S3, Amazon Redshift, Amazon RDS, JDBC
connectable DB, Kinesis, HDFS, etc. can be specified as data sources.
• No need to manage metastore database
DynamoDB
S3
Redshift
RDS
JDBC-connectable DBs
Data Source
Glue ETL Athena
Redshift
Spectrum
EMR
Connected services
Hive alternative
applications
Save metadata
Data catalog
Crawler
Image of data catalog usage
(1) Metadata access
(2) Data access
23. © 2022, Amazon Web Services, Inc. or its Affiliates.
DynamicFrame
AWS Glue functions to absorb the inherent complexity of ETL for raw
data
• As a component, it is located in the same layer as DataFrame.
The two can be converted to each other (the fromDF and toDF functions).
• Leaves the possibility of multiple types to be determined later (Choice
type)
• A DynamicFrame refers to the entire dataset, while a DynamicRecord
refers to a single row of data.
Spark Core: RDDs
Spark DataFrame/
Catalyst Optimizer
AWS Glue
DynamicFrame
Data structure image
DynamicFrame in Apache Spark architecture
DataFrame DynamicFrame
Similar to semi-structured tables
record
24. © 2022, Amazon Web Services, Inc. or its Affiliates.
Choice Type
DynamicFrame is able to hold both types when multiple types are found in a
column
• The ResolveChoice method can be used to resolve the types
Example of a choice-type data structure:
root
|-- uuid: string
|-- device id: choice
|    |-- long
|    |-- string
The device id column has both long and string data
(e.g. the long value 1234 and the string "1234" are mixed in the same column)
Example of ResolveChoice execution on deviceid (choice of long and string):
• project: keep one type and discard values of the other type
• cast: cast all values to a single type
• make_cols: keep each type in a separate column (deviceid_long, deviceid_string)
• make_struct: convert the column to a struct type holding both types
With DataFrame, if more than one type is present in a column, processing may
be interrupted and have to be reprocessed.
25. © 2022, Amazon Web Services, Inc. or its Affiliates.
ETL processing that takes advantage of the
characteristics of DynamicFrame and DataFrame
• DynamicFrame is good for ETL processing, DataFrame is good for table processing
• Data input/output and associated ETL processing is performed by DynamicFrame,
while table manipulation is performed by DataFrame.
DynamicFrame
toDF function Table Operations
Output result file
ETL job
data loading
input data
By using DynamicFrame when loading, it is
possible to optimize data loading using AWS
Glue catalog, load differential data, and
process semi-structured data using Choice
type.
DynamicFrame
Table operations
such as JOIN are
performed in
DataFrame.
Use the toDF and fromDF functions for
mutual conversion between
DynamicFrame and DataFrame.
(No data copying is done; the conversion
cost is within a few milliseconds.)
Output the result
fromDF function
26. © 2022, Amazon Web Services, Inc. or its Affiliates.
Bookmark function
Function to process only the delta data when performing steady-state ETL
• Use file timestamps to process only the data that has not been
processed in the previous job to prevent duplicate processing and
duplicate data.
Processed data
(Not loaded)
Unprocessed data
(load target)
df = spark.read.parquet('s3://path/to/data')
s3://path/to/data
27. © 2022, Amazon Web Services, Inc. or its Affiliates.
How to proceed with
performance tuning of AWS
Glue ETL
28. © 2022, Amazon Web Services, Inc. or its Affiliates.
The Performance Tuning Cycle
1. Determine performance goals.
2. Measure the metrics
3. Identify bottlenecks.
4. Reduce the impact of bottlenecks
5. Repeat steps 2. to 4. until the goal is achieved.
6. Achieving performance goals
29. © 2022, Amazon Web Services, Inc. or its Affiliates.
Understand the characteristics of AWS Glue/Apache Spark
• Distributed processing
• There are tuning patterns such as shuffling and data skew that
do not occur in single-process applications.
• Lazy evaluation
• Because Spark evaluates lazily, the cause of an error may not be the
API executed just before the error occurred, but an API written
earlier in the script.
• Impact of the optimizer
• Since Spark processing is optimized internally, it can be difficult to
map what you see in the Spark UI back to the part of the script that
actually performs that processing. You need to check multiple metrics
to find the cause.
30. © 2022, Amazon Web Services, Inc. or its Affiliates.
Spark parameters in AWS Glue
• Spark itself can be tuned through parameters at job execution time, but
because AWS Glue is a serverless service, first tune according to AWS
Glue's best practices before looking at Spark's parameters.
• Users should test thoroughly when changing the value of a Spark
parameter.
31. © 2022, Amazon Web Services, Inc. or its Affiliates.
Metrics to check
• Spark UI(Spark History Server)
• It shows the details of Spark processing.
• Executor Log and Driver Log
• It shows stdout/stderr logs of executors and a driver
• AWS Glue Job metrics
• It shows the CPU, memory, and other resource usage status of
each executor and driver node
• Statistics obtained from the Spark API
• It shows samples and statistical values of intermediate data
32. © 2022, Amazon Web Services, Inc. or its Affiliates.
Setting up a job to do the tuning
In order to get the logs
needed for tuning, you need
to check the box to use
Monitoring Options in the
Add job screen.
34. © 2022, Amazon Web Services, Inc. or its Affiliates.
Spark History Server
Spark Event Logs can be visualized by running the Spark History Server
There are several ways to launch the Spark History Server.
• Using Cloud Formation
• Using docker to launch it on a local PC
• Download Apache Spark on your local PC or EC2
and start the Spark History Server.
• Using an EMR cluster
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
35. © 2022, Amazon Web Services, Inc. or its Affiliates.
Example of launching with Docker
o Once the Docker container has started, access http://localhost:18080
in your browser
$ docker build -t glue/sparkui:latest .
$ docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
Specify the S3 path of Spark
history logs
36. © 2022, Amazon Web Services, Inc. or its Affiliates.
Spark History Server
Click and check the details of each application
Check duration of each application
37. © 2022, Amazon Web Services, Inc. or its Affiliates.
List of jobs executed by the application
Completed Jobs
Failed Jobs
Check jobs which
take a long time
38. © 2022, Amazon Web Services, Inc. or its Affiliates.
Checking the contents of a job
Identify Stages that are failing
or taking a long time.
39. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the contents of the Stage.
If there is a difference in the line lengths, it
means that skew occurs and the work isn't
distributed sufficiently.
Check the data size in the Stage.
40. © 2022, Amazon Web Services, Inc. or its Affiliates.
Checking the contents of the Stage (continued)
Check the details of what is taking so long.
Task Time for each Executor
If there is a spill on the disk, select a
worker node with larger memory to
solve it.
In addition to the Event Timeline above, data
skew can be seen in Summary Metrics and
Tasks.
41. © 2022, Amazon Web Services, Inc. or its Affiliates.
View details of tasks that are Failing
Click on details to
learn more.
View the log of the Executor
where the Fail is occurring.
For AWS Glue ETL, check the
Executor ID and check from
CloudWatch Log groups
42. © 2022, Amazon Web Services, Inc. or its Affiliates.
Environmental settings for Spark application runtime
o Spark options, dependencies, and so on.
43. © 2022, Amazon Web Services, Inc. or its Affiliates.
List of Driver and Executor nodes
If the cluster is running and
accessible from the History
Server, the logs of each
Driver/Executor can be seen.
44. © 2022, Amazon Web Services, Inc. or its Affiliates.
Checking the Spark SQL Query Execution Plan
You can see the actual
execution plan, which is more
accurate than the explain API.
45. © 2022, Amazon Web Services, Inc. or its Affiliates.
Executor and Driver logs
46. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check Log groups in CloudWatch
Executor Logs
Driver Log
47. © 2022, Amazon Web Services, Inc. or its Affiliates.
AWS Glue Job Metrics
48. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the resource usage of Executor and Driver
You can also create a Dashboard
for CloudWatch and add other
metrics.
50. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with
commands
o Use the following commands to check the trend of the intermediate
data during processing and use it for tuning strategy.
o Note that processing with actions (red text) will slow the job down
if used too many times.
• count()
• printSchema()
• show()
• describe([cols*]).show()
• explain()
• df.agg(approx_count_distinct(df.col))
51. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.count()
o Check the number of records.
o Use df.groupBy('col_name').count() to check for skew.
52. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.printSchema()
o Check the schema information of the DataFrame.
53. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.show()
o The number of records to be displayed can be specified by using
df.show(5).
54. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.describe([cols*]).show()
o The statistics for each column can be seen.
55. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.explain()
o Check the execution plan which optimizer created.
56. © 2022, Amazon Web Services, Inc. or its Affiliates.
Check the trend of data during processing with APIs
df.agg(approx_count_distinct(df.col))
o Check the cardinality in the columns
o Fast because HyperLogLog is used
57. © 2022, Amazon Web Services, Inc. or its Affiliates.
AWS Glue ETL
Performance Tuning
Pattern
58. © 2022, Amazon Web Services, Inc. or its Affiliates.
Basic strategy for tuning AWS Glue ETL jobs
• Use the new version
• Reduce the data I/O load.
• Minimize shuffling.
• Speed up per-task processing.
• Parallelize
59. © 2022, Amazon Web Services, Inc. or its Affiliates.
Use the new version
60. © 2022, Amazon Web Services, Inc. or its Affiliates.
Use the new version
o When the Spark application is slow, simply replacing the job execution
environment with the latest version may speed up the process.
o Both AWS Glue and Apache Spark are evolving in every way, not just in
performance. Use the newest version possible.
https://aws.amazon.com/jp/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/
61. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using AWS Glue 2.0 and 3.0 to reduce
startup time
Dramatically reduced the time required to launch AWS Glue ETL jobs
• A cold start used to take around 10 minutes in AWS Glue 0.9/1.0.
• In AWS Glue 2.0, it takes less than 1 minute.
time
Start-up time Execution time
Execution time
AWS Glue 1.0
AWS Glue 2.0
62. © 2021, Amazon Web Services, Inc. or its Affiliates.
2. Submit job to
virtual cluster
AWS Glue 2.0+ integrated scheduling and provisioning
1. Run AWS Glue job
3. Spark schedules tasks
to executors
Job
manager
4. Dynamically grow
virtual clusters
5. Spark utilizes
new executors
AZ1
AZ2
Job starts when first executor is ready
Reduced start time
Reduced start variance
Jobs run on reduced capacity
Graceful degradation
63. © 2022, Amazon Web Services, Inc. or its Affiliates.
Minimize data I/O
64. © 2022, Amazon Web Services, Inc. or its Affiliates.
How to minimize the data I/O load
• Read only the data you need.
• Control the amount of data read in one task.
• Choose the right compression format.
65. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using Apache Parquet
• Column-oriented format for
data arrangement suitable for
analytical applications
• Data type is preserved
• Compressed effectively
• Aggregation by skipping
unnecessary data and using
metadata
• The Spark engine can
efficiently use Apache Parquet
Integration is in place
https://parquet.apache.org/documentation/latest/
66. © 2022, Amazon Web Services, Inc. or its Affiliates.
Partition Filtering and Filter Pushdown
Reduce the amount of data to be read
• Partition Filtering
• The ability to read only the files in the partition specified by the
filter or where clause.
• Available in Text/CSV/JSON/ORC/Parquet
• Filter Pushdown
• Ability to read only blocks that hit the filter or where clause for
columns that are not used in the partition column.
• AWS Glue automatically applies this when Parquet is used.
67. © 2022, Amazon Web Services, Inc. or its Affiliates.
Partition Filter and Filter Pushdown
Filter Pushdown
Partition Filtering
68. © 2022, Amazon Web Services, Inc. or its Affiliates.
Partition Filter and Filter Pushdown
A Partition Filter can be used when a partition directory has been created. For
DataFrame and DynamicFrame writes, the partition directory can be created
by using the partitionBy option as follows:
df.write.parquet(path=path, partitionBy='col_name')
It may be more efficient to place columns that are used more
frequently in filter clauses at higher levels of the partition hierarchy.
69. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using push_down_predicate in DynamicFrame
Read only the files included in the partition where the data specified in the
filter or where clause is stored when reading a DynamicFrame from the
AWS Glue data catalog.
partitionPredicate ="(product_category == 'Video')"
datasource = glue_context.create_dynamic_frame.from_catalog(
database = "githubarchive_month",
table_name = "data",
push_down_predicate = partitionPredicate)
70. © 2022, Amazon Web Services, Inc. or its Affiliates.
Choose a compression codec based on your application
• Compression codec can be selected at data writing.
• Trade-off between compression rate and compression/decompression speed
• Files compressed with bzip2, lzo, and snappy can be split and processed
when read, but files compressed with gzip(*) cannot be split.
• Uncompressed files do not require compression/decompression time, but
data transfer cost may become the bottleneck
• If processing speed is important to you, choose snappy or lzo.
ex. df.write.csv("path/to/csv", compression="gzip")
(*) Parquet files remain splittable even when compressed with gzip.

                   gzip    bzip2    lzo              snappy
File extension     .gz     .bz2     .lzo             .snappy
Compression level  High    Highest  Average          Average
Speed              Medium  Slow     Fast             Fast
CPU usage          Medium  High     Low              Low
Splittable         No(*)   No       Yes, if indexed  No
71. © 2022, Amazon Web Services, Inc. or its Affiliates.
Store data in appropriate file sizes.
• Data read/write tasks are basically tied to a single file.
(If the file is splittable, one file can be split across multiple tasks.)
• The recommended file size for AWS Glue is 128MB-512MB.
When the data is too small
• Overhead due to large number of small tasks
When there is large unsplittable data in one file
• The data may not fit into the memory of a single node.
• No distributed processing
task task task
...
...
task
Executor Executor Executor
...
Not used
72. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using Bounded Execution with DynamicFrame
When there is a lot of data to be read, Bounded Execution can be used together
with Job Bookmarking to split the processing, instead of reading
all the unprocessed data at once.
glueContext.create_dynamic_frame.from_catalog(
database = "database",
tableName = "table_name",
redshift_tmp_dir = "",
transformation_ctx = "datasource0",
additional_options = {
"boundedFiles" : "500", # need to be string
# "boundedSize" : "1000000000" unit is byte
}
)
73. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using DynamicFrame's groupFiles and groupSize
• Eliminate overhead by reading small files together in a single
task.
• Useful for processing data that is output every few minutes by
Kinesis Data Firehose.
• Use the groupFiles option to group the data in the S3 partition,
and the groupSize option to specify the size of the group to be
read.
df = glueContext.create_dynamic_frame_from_options(
    's3', {'paths': ['s3://s3path/'],
           'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format='json')
note: groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog,
json, and xml. This option is not supported for avro, parquet, and orc.
74. © 2022, Amazon Web Services, Inc. or its Affiliates.
Number of files and processing time for DataFrame
and DynamicFrame
75. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using DynamicFrame S3ListImplementation
• If there are a lot of small files, a large number of tasks can cause
Driver OOM.
• When S3ListImplementation is True, the results of the S3 list are read and
processed in batches of 1000 at a time, which prevents driver memory
from being overloaded by the S3 listing.
datasource = glue_context.create_dynamic_frame.from_catalog(
database = "my_database",
table_name = "my_table",
push_down_predicate = partitionPredicate,
additional_options = {"useS3ListImplementation":True} )
76. © 2022, Amazon Web Services, Inc. or its Affiliates.
Set the Partition Index
When reading a DataFrame from the AWS Glue catalog against a data source
with many partitions built from multiple partition keys, setting a
Partition Index reduces the time needed to fetch the matching partitions
when the query has filter or where clauses on the partition columns.
https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
77. © 2022, Amazon Web Services, Inc. or its Affiliates.
The difference of query planning time between using
Partition Index and not using Partition Index
https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
78. © 2022, Amazon Web Services, Inc. or its Affiliates.
Parallel Data Reading in DataFrame JDBC Connections
• spark.read.jdbc() only allows one Executor to access the target
database by default.
• For parallel reading, a partition column, lowerBound, upperBound, and
numPartitions must be specified. The partition column must be one of
the following types: numeric, date, or timestamp.
df = spark.read.jdbc(
    url=jdbcUrl, table="sample",
    column="col1",          # partition column (numeric, date, or timestamp)
    lowerBound=1,
    upperBound=100000,
    numPartitions=100,
    properties=connectionProperties)  # options such as fetchsize go in properties
79. © 2022, Amazon Web Services, Inc. or its Affiliates.
Parallel data reading in DynamicFrame JDBC
connection
o If you want to read data from a JDBC connection as a
DynamicFrame, you need to specify hashfield/hashexpression.
o With hashfield, strings and other column types can also be used as
partition columns.
glueContext.create_dynamic_frame.from_catalog(
database = "my_database",
tableName = "my_table_name",
transformation_ctx = "my_transformation_context",
additional_options = { ''hashfield': 'customer_name', 'hashpartitions': '5' } )
https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
80. © 2022, Amazon Web Services, Inc. or its Affiliates.
Minimize shuffling
81. © 2022, Amazon Web Services, Inc. or its Affiliates.
Minimize shuffling
• Make good use of cache
• Perform filter processing in the first stage as much
as possible.
• Devise the order of joins to keep the data small.
• Optimize join strategy
• Remove data skew
• Use Window processing instead of data processing
by self join
82. © 2022, Amazon Web Services, Inc. or its Affiliates.
Minimize shuffling
The processing described in the DataFrame is optimized by Catalyst
Optimizer.
However, it is not perfect in the following respects:
• If there is a cache() in between, optimization across the cache boundary
(including the parts before and after it) will not work.
• Spark 2.4.3, used in AWS Glue 1.0 and 2.0, disables the cost-based
optimizer by default
Shuffling can be reduced by manually changing the order and strategy of
filters and joins.
83. © 2022, Amazon Web Services, Inc. or its Affiliates.
Make good use of cache
• When branching the processing of a single Dataframe to perform
multiple outputs, you can prevent recalculation by inserting cache()
just before the branch.
• Note that it may be faster not to use cache, and that too much use of
cache will use local disk space.
df1 df2
df3
df5
df4
df1 df2
df3
df5
df4
cache()
With cache(), the process up to the creation of df2 is executed only once;
without it, the process to create df2 is executed twice.
84. © 2022, Amazon Web Services, Inc. or its Affiliates.
Make good use of cache
• By default, cached data is stored in the memory initially allocated
for caching, and whatever does not fit in memory is stored on the local disk.
• Users can choose to save to memory only or to disk only.
Example of caching in memory only: df.persist(StorageLevel.MEMORY_ONLY)
85. © 2022, Amazon Web Services, Inc. or its Affiliates.
Delete cache that is no longer in use
• A cached Dataframe will continue to occupy memory and local
disk space.
• Save memory and disk space by deleting the Dataframe cache
when it is no longer needed.
df.unpersist()
86. © 2022, Amazon Web Services, Inc. or its Affiliates.
Perform filter processing in the first stage as much as possible
The filter process can be placed before cache() to reduce the amount of data during
processing.
87. © 2022, Amazon Web Services, Inc. or its Affiliates.
Work out the order of join
• The end result is the same, but the data size of the DataFrame in the
middle is different.
• In Glue 3.0 (Spark 3.1.1), the cost-based optimizer takes into account
the amount of data and optimizes the order of joins.
df1
(4000 rows)
df2
(1000 rows)
df4
( 4000rows)
df3
(10 rows)
df5
(10 rows)
df1
(4000 rows)
df3
(10 rows)
df4
( 10rows)
df2
(1000 rows)
df5
(10 rows)
left join join
left join
join
88. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using join in different ways
Sort Merge Join
• Distribute the two tables to be joined by their respective keys, sort
them, and then join them.
• Suitable for joining large tables together.
Broadcast Join
• Transfer one (small) table to all Executors, and join it there with the
other table, which remains distributed across the Executors.
• Suitable for when one table is much smaller than the other.
Shuffle Hash Join
• Distribute the two tables to be joined and join them without sorting.
• Suitable for joins between tables that are not so large.
89. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using join in different ways
• By default, if the table size is less than or equal to the value specified
in spark.sql.autoBroadcastJoinThreshold (default 10MB), Broadcast
Join will be used.
• The Join strategy in use can be seen in the Spark UI or by using
explain().
• If join performance is the bottleneck, changing the join strategy
manually may improve performance.
df1.join(broadcast(df2), df1['col1'].eqNullSafe(df2['col1'])).explain()

== Physical Plan ==
BroadcastHashJoin [coalesce(col1#6, )], [coalesce(col1#21, )], Inner, BuildRight, (col1#6 <=> col1#21)
:- LocalTableScan [first_name#5, col1#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
   +- LocalTableScan [col1#21, col2#22, population#23]
https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html#broadcast-hint-for-sql-queries
90. © 2022, Amazon Web Services, Inc. or its Affiliates.
coalesce
For the following reasons, data may end up split across a large number of small
partitions during processing:
• loading a large number of small files
• performing groupBy on columns with high cardinality
In such cases, merging the partitions before the next step reduces the overhead
of subsequent processing.
Since repartition involves a shuffle, coalesce is often preferable.
However, since coalesce simply merges adjacent partitions, the data after coalesce
may be skewed.
Glue 3.0 has a new feature called Adaptive Query Execution that automatically
optimizes the number of partitions by coalescing them.
(Diagram: df.repartition(2) shuffles six small partitions evenly into two;
df.coalesce(2) merges them into two without a shuffle, so the resulting
partition sizes can be uneven.)
df.repartition(2) / df.coalesce(2)
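Why coalesce can leave uneven partitions can be sketched without Spark: it only concatenates adjacent partitions and never moves individual rows between buckets. This is a pure-Python illustration of the behavior, not Spark internals:

```python
def naive_coalesce(partitions, n):
    # merge adjacent partitions into n buckets; rows never move
    # individually between buckets (no shuffle), so bucket sizes
    # can end up uneven when the inputs are uneven
    step = -(-len(partitions) // n)  # ceiling division
    return [sum(partitions[i:i + step], [])
            for i in range(0, len(partitions), step)]

parts = [[1, 2, 3, 4], [5], [6], [7], [8], [9]]
print(naive_coalesce(parts, 2))  # → [[1, 2, 3, 4, 5, 6], [7, 8, 9]]
```

Note the 6-row versus 3-row result: a repartition would have shuffled rows into two balanced partitions instead.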
91. © 2022, Amazon Web Services, Inc. or its Affiliates.
Use Window processing instead of self join and data
aggregation
o When aggregate values are computed from a dataset and then joined back to
the same dataset, the join can be eliminated by using Window processing
instead.
df_agg = df.groupBy('gender', 'age').agg(
    F.mean('height').alias('avg_height'),
    F.mean('weight').alias('avg_weight'))
df = df.join(df_agg, on=['gender', 'age'])

w = Window.partitionBy('gender', 'age')
df = df.withColumn(
    'avg_height', F.mean(F.col('height')).over(w)
).withColumn('avg_weight', F.mean(F.col('weight')).over(w))
92. © 2022, Amazon Web Services, Inc. or its Affiliates.
Speed up per-task processing
93. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using Scala
Most DataFrame operations can be written in PySpark; they are internally
converted to JVM code and run on a JVM. The following, however, are slower
when using Python:
• processing written at the RDD level
  • RDD code is not optimized by the optimizer
• processing that uses UDFs
  • discussed later
If these are the bottleneck, rewriting them in Scala will speed up the process.
94. © 2022, Amazon Web Services, Inc. or its Affiliates.
Avoid UDFs in PySpark
Performance issues
• Rows are serialized and piped to a separate Python worker process for each
iterator
• The memory of the Python process is not managed by the JVM
(Diagram: for each batch of rows, the JVM-side PythonRunner serializes the
rows, pipes them to a Python worker that deserializes them, invokes the UDF,
and serializes the results back to the JVM.)
95. © 2022, Amazon Web Services, Inc. or its Affiliates.
Using PandasUDF over PythonUDF
Python UDF
• Serialization/deserialization is done by Pickling
• Data is fetched block by block, but UDF processing is performed
row by row
Pandas UDF
• Serialization/deserialization is done by Apache Arrow
• Both data fetch and UDF processing are performed on multiple
lines at once.
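The row-at-a-time versus batch difference can be illustrated with plain pandas (an analogy only, not actual Spark execution): a Pandas UDF receives whole `pandas.Series` batches, like the vectorized form below.

```python
import pandas as pd

s = pd.Series(range(1, 6))

# row-at-a-time, like a Python UDF: the lambda is invoked once per value
row_at_a_time = s.apply(lambda x: x * 2)

# vectorized, like a Pandas UDF: the operation sees the whole batch at once
vectorized = s * 2

print(list(vectorized))  # → [2, 4, 6, 8, 10]
```

Both produce the same values; the vectorized path avoids one Python function call per row, which is where most of the Pandas UDF speedup comes from.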
96. © 2022, Amazon Web Services, Inc. or its Affiliates.
Performance differences between Python UDF,
Pandas UDF, and Spark API in AWS Glue ETL
(Chart: execution time in seconds for Python UDF, Pandas UDF, and the Spark
API.)
98. © 2022, Amazon Web Services, Inc. or its Affiliates.
Dealing with Skewness in Data
If data volume varies between partitions, the load concentrates on the few
tasks that process the large partitions, delaying the whole job.
Data is biased toward only some partitions, causing a bottleneck in
processing time.
When does it happen?
• when the sizes of the files being read are uneven
• when joining data whose record counts differ greatly per join key
• when the number of records per key varies in df.groupBy()
99. © 2022, Amazon Web Services, Inc. or its Affiliates.
Addressing data skew
• Ensure that file sizes are uniform when creating input files
• Repartition
• Broadcast join
• Salting
100. © 2022, Amazon Web Services, Inc. or its Affiliates.
Dealing with Skewness
Repartition
If the subsequent processing is not key-by-key (such as partitioning and
storing data by date, or window processing per key), repartitioning will
resolve the skew.
df.repartition(200)
(Diagram: three skewed partitions are redistributed evenly across 200
partitions by df.repartition(200).)
101. © 2021, Amazon Web Services, Inc. or its Affiliates.
Dealing with Skewness
broadcast join
If one DataFrame is small enough to fit all the data in one Executor, and the
other DataFrame has huge data with skewed join key columns, you can improve
processing performance by using Broadcast Join as the Join strategy.
(Diagram: with Sort Merge Join, the 1,000 skewed "beef" rows are shuffled
into a single partition while the "pork" rows land in another, so one task
processes 1,000 lines. With Broadcast Join, the small item-price table is
sent to every Executor and the skewed rows stay spread across partitions of
about 500 lines each.)
102. © 2022, Amazon Web Services, Inc. or its Affiliates.
Dealing with Skewness
Salting
In the case of a join between two sufficiently large data sets that have a
skew on one side, "Salt" can be used to eliminate the load bias.
(Diagram: the skewed partition of Table A is split by appending salt values,
and the corresponding partition of Table B is cloned once per salt value so
every salted key still finds its match.)
103. © 2022, Amazon Web Services, Inc. or its Affiliates.
Dealing with Skewness
Salting
o In Glue 2.0 (Spark 2.4.3), users need to write the salting code manually.
o In Glue 3.0 (Spark 3.1.1), a new feature called Adaptive Query Execution
dynamically performs Skew Join.

# Salting the skewed column
df_big = df_big.withColumn(
    'shop_salted',
    F.concat(df_big['shop'], F.lit('_'),
             (F.floor(F.rand() * numPartition) + 1).cast('string')))

# Explode the salt values on the smaller table
df_medium = df_medium.withColumn(
    'salt', F.explode(F.array([F.lit(i) for i in range(1, numPartition + 1)])))
df_medium = df_medium.withColumn(
    'shop_exploded',
    F.concat(df_medium['shop'], F.lit('_'), df_medium['salt'].cast('string')))

# Joining on the salted keys
df_join = df_big.join(df_medium,
                      df_big['shop_salted'] == df_medium['shop_exploded'])
https://spark.apache.org/docs/latest/sql-performance-tuning.html#optimizing-skew-join
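The effect of salting can be sketched without Spark: appending a random salt (1..numPartition, here 4) to a hot key spreads its rows across several join keys. This is a pure-Python illustration of the idea, not Glue code; the key names are made up:

```python
import random
from collections import Counter

random.seed(42)
num_salts = 4

# a skewed big table: "beef" dominates
rows = ["beef"] * 1000 + ["pork"] * 10

# salt each key the way the slide's F.concat(..., floor(rand()*n)+1) does
salted = Counter(f"{k}_{random.randint(1, num_salts)}" for k in rows)

# the 1000 "beef" rows are now spread over beef_1 .. beef_4
beef_buckets = {k: v for k, v in salted.items() if k.startswith("beef_")}
print(sorted(beef_buckets))  # → ['beef_1', 'beef_2', 'beef_3', 'beef_4']
```

Each salted bucket holds roughly a quarter of the hot rows, so no single join task receives all 1,000 of them.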
104. © 2022, Amazon Web Services, Inc. or its Affiliates.
Assigning Incremental IDs with Performance in
Mind
• If the window function row_number() is used to assign contiguous
incremental IDs to all records, processing is slow because all records
must be brought together for the numbering.
• monotonically_increasing_id() assigns increasing IDs without that
aggregation by allowing gaps between partitions.
row_number():                  1 2 3 4 5 6 7 8 9 10 11
monotonically_increasing_id(): 1 2 3 6 7 8 9 10 13 14 15
df.withColumn('id', F.row_number().over(Window.partitionBy('col1').orderBy('col2')))
df.withColumn('id', F.monotonically_increasing_id())
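The gaps follow a documented layout: monotonically_increasing_id() puts the partition ID in the upper 31 bits and a per-partition record counter in the lower 33 bits, which is why the values increase but are not contiguous. A small sketch of that layout (illustrative helper, not a Spark API):

```python
def sketch_monotonic_id(partition_id: int, row_index: int) -> int:
    # mimics monotonically_increasing_id(): partition ID in the upper
    # 31 bits, record number within the partition in the lower 33 bits
    return (partition_id << 33) | row_index

print(sketch_monotonic_id(0, 0))  # → 0
print(sketch_monotonic_id(1, 0))  # → 8589934592 (= 2**33)
```

So each partition can number up to about 8.5 billion records independently, with no cross-partition coordination.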
105. © 2022, Amazon Web Services, Inc. or its Affiliates.
Selecting a Worker Type for AWS Glue
• The processing power allocated at the time of job execution is called
DPU (Data Processing Unit).
• 1DPU = 4vCPU, 16GB memory
• Each Worker Type has different resource capacity and configuration.
Worker Type List:
Worker Type | DPUs/Worker | Executors/Worker | Memory/Worker | Disk/Worker
Standard    | 1           | 2                | 16GB          | 50GB
G.1X        | 1           | 1                | 16GB          | 64GB
G.2X        | 2           | 1                | 32GB          | 128GB
(Diagram: Worker Type configuration — Standard: one DPU per Worker running
two Executors; G.1X: one DPU per Worker with one Executor; G.2X: two DPUs
per Worker with one Executor.)
106. © 2022, Amazon Web Services, Inc. or its Affiliates.
Ideal resource usage
• It is desirable that resources are used evenly and without
waste by all Executors.
• If not, there's likely room for tuning.
• Choose the initial Worker Type by predicting the resource profile from
the processing content.
• Examples:
• CPU usage is likely to be high when there are complex UDFs or other
heavy per-row processing.
• Memory usage is likely to be high when the shuffle size becomes large,
such as when joining large amounts of data.
107. © 2022, Amazon Web Services, Inc. or its Affiliates.
Trade-off between number of workers and job
execution time
As long as there is enough parallelism to keep all resources effectively
utilized, job execution time can be reduced by adding workers without
increasing cost.
(Charts: AWS Glue ETL job execution time versus number of workers; with
sufficient parallelism, going from 5 to 10 workers roughly halves the
execution time.)
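The cost side of the trade-off is simple arithmetic: Glue bills per DPU-hour, so as long as doubling the workers halves the runtime, the bill is unchanged. The rate below is an assumed example, not an official figure; check current AWS Glue pricing for your region:

```python
def glue_job_cost(num_dpus, hours, price_per_dpu_hour=0.44):
    # price_per_dpu_hour is an assumed example rate, not an official figure
    return num_dpus * hours * price_per_dpu_hour

# with sufficient parallelism: 5 workers x 2 hours == 10 workers x 1 hour
print(glue_job_cost(5, 2) == glue_job_cost(10, 1))  # → True
```

The equality breaks once parallelism runs out: if 10 workers still take 0.7 hours instead of 0.5, the larger fleet costs more.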
108. © 2022, Amazon Web Services, Inc. or its Affiliates.
Summary
• Introduced tuning patterns for AWS Glue Spark ETL jobs.
• AWS Glue can process large amounts of data with high
performance as-is, but it can be tuned to achieve even higher
performance and scalability.
• Tuning requires checking metrics to identify bottlenecks and
eliminate their causes.