SlideShare a Scribd company logo
1 of 42
Download to read offline
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
METRICS-DRIVEN PERFORMANCE TUNING FOR
AWS GLUE ETL JOBS
Benjamin Sowell
Principal Engineer
AWS Glue
A N T 3 3 2
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support
for complex analytics.
• AWS Glue builds on the Apache Spark runtime to offer ETL specific
functionality.
Spark Core: RDDs
Spark DataFrames AWS Glue DynamicFrames
SparkSQL AWS Glue ETL
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping
Actions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping Jobs
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: data partitions
• Apache Spark and AWS Glue
are data parallel.
• Data is divided into partitions
that are processed
concurrently.
• 1 stage x 1 partition = 1 task
Driver
Executors
Overall throughput is limited
by the number of partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue performance: key questions
How is your application divided
into jobs and stages?
How is your dataset partitioned?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• One common problem is dealing with large numbers of small files.
• Can lead to memory pressure and performance overhead.
• Let's look at a straightforward JSON to Parquet conversion job
• 1.28 million JSON files in 640 partitions:
• We will use AWS Glue job metrics to understand the performance.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Enabling job metrics
• Metrics can be enabled in the CLI/SDK by passing --enable-metrics
as a job parameter key.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• First try: use a standard SparkSQL job
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• Driver memory use is growing fast and approaching the 5g max.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• Case 2: Run using AWS Glue DynamicFrames.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
Driver memory remains below 50%
for the entire duration of execution.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• 160 max executors: Each partition is assigned 1 task. 4 tasks can be
processed on each executor, so there were 640 partitions.
• Glue automatically grouped all of the files in each partition.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Options for grouping files
• groupFiles
• Set to "inPartition" to enable grouping of files within a partition.
• Set to "acrossPartition" to enable grouping of files from different partitions. The partition
value will not be added to each record!
• Grouping is automatically enabled when there are more than 50,000 files.
• groupSize
• Target size of each group in bytes.
• Default is based on the number of cores in the cluster.
• Let's try increasing the group size.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Aggressively grouping files
• Execution is much slower, but hasn't crashed.
"groupFiles": "acrossPartition"
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Aggressively grouping files
Executor memory is higher than driver. Only one executor is active.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large files
• Processing large text files can also cause problems if they cannot be split.
• How the data is compressed makes a big difference in performance.
• Files that are uncompressed or bzip2 compressed can be split and processed in parallel.
• Files that are gzip compressed cannot be split.
• Let's see how this looks on a sample dataset of 5 large csv files.
• Each file is
• 12.5 GB uncompressed
• 1.6 GB gzip
• 1.3 GB bzip2
• Script converts data to Parquet.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large gzip files
• We only have 5 partitions – one for each file.
• Job fails after 2 hours.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large bzip2 files
• Bzip2 files can be split into blocks, so we see up to 104 tasks.
• Job completes in 18 minutes.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large bzip2 files
• With 15 DPU, the number of active executors closely tracks the
maximum needed number of executors.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large uncompressed files
• Uncompressed files can be split into lines, so we construct 64MB partitions.
• Job completes in 12 minutes.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large files
• If you have a choice of compression type, prefer bzip2.
• If you are using gzip, make sure you have enough files to fully utilize
your resources.
• Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider
leaving files uncompressed.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing a dataset
• Common use case: reduce the number of files (32 GB compressed)
df = spark.read.format("json").load(...)
df2 = df.coalesce(1)
df2.write.format("json").save(...)
Read is parallelized and then
all data is sent to one executor
for writing.
1 hr, 52 min.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing with grouping
• You can control this further with grouping
• Group size 500 GB.
Only a single executor is
used for the entire job
1 hr, 27 min.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing with grouping
• You can control this further with grouping
• Group size 2 GB.
26 min.Four executors were used
to process 16 partitions
of ~2 GB each.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Baseline performance without coalescing
• Job is fully parallelized and completes in 6 minutes.
• Output data files are around 500 MB.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrames and schema computation
• DynamicFrames are Spark
RDDs of self-describing
records.
• An overall schema is not
required for basic ETL
operations.
• I can drop the field "age" without
looking at other records.
• Some operations do require
a complete schema.
• This can force an extra job in your
application.
age
John
Doe
36
Any
town
WA
DropNullFields
Relationalize
ResolveChoice
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
ApplyMapping sets the
schema without an additional
pass.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
• There is only one stage in the application.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimizing DynamicFrames
• Use ApplyMapping where appropriate to set a schema.
• Be thoughtful about where to add transformations like DropNullFields.
• The default generated scripts are designed to be safe, but you may be able to optimize if
you know your data.
• Try your old scripts!
• We've increased data conversion speed by 4x in the past year.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Python performance in AWS Glue
• Using map and filter in
Python is expensive for large
data sets.
• All data is serialized and sent
between the JVM and Python.
• Alternatives
• Use AWS Glue Scala SDK.
• Convert to DataFrame and use Spark
SQL expressions.
Spark JVM
Python VM
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Lake Formation
Build a secure data lake in days
Register existing data or load new data using blueprints. Data stored in Amazon S3.
Secure data access across multiple services using single set of permissions.
No additional charge. Only pay for the underlying services used.
Quickly build data lakes
Move, store, catalog, and clean
your data faster. Use ML
transforms to de-duplicate data
and find matching records.
Easily secure access
Centrally define table and
column-level data access and
enforce it across Amazon
EMR, Amazon Athena,
Amazon Redshift Spectrum,
Amazon SageMaker, and
Amazon QuickSight
Share and collaborate
Use data catalog in Lake
Formation to search and find
relevant data sets and share
them across multiple users
and accounts
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How it works
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Benjamin Sowell
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design Pattern
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design PatternAWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design Pattern
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design PatternAmazon Web Services Japan
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Noritaka Sekiyama
 
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門20180613 AWS Black Belt Online Seminar AWS Cloud9 入門
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門Amazon Web Services Japan
 
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送Google Cloud Platform - Japan
 
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Amazon Web Services
 
20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本Amazon Web Services Japan
 
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)Amazon Web Services Japan
 
20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMR20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMRAmazon Web Services Japan
 
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送Google Cloud Platform - Japan
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
AWS Black Belt Online Seminar 2017 Amazon S3
AWS Black Belt Online Seminar 2017 Amazon S3AWS Black Belt Online Seminar 2017 Amazon S3
AWS Black Belt Online Seminar 2017 Amazon S3Amazon Web Services Japan
 
データ活用を加速するAWS分析サービスのご紹介
データ活用を加速するAWS分析サービスのご紹介データ活用を加速するAWS分析サービスのご紹介
データ活用を加速するAWS分析サービスのご紹介Amazon Web Services Japan
 
AWS Black Belt Techシリーズ AWS Direct Connect
AWS Black Belt Techシリーズ AWS Direct ConnectAWS Black Belt Techシリーズ AWS Direct Connect
AWS Black Belt Techシリーズ AWS Direct ConnectAmazon Web Services Japan
 
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기Amazon Web Services Korea
 
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...Amazon Web Services Japan
 
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAmazon Web Services Japan
 
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-Amazon Web Services Japan
 
20190731 Black Belt Online Seminar Amazon ECS Deep Dive
20190731 Black Belt Online Seminar Amazon ECS Deep Dive20190731 Black Belt Online Seminar Amazon ECS Deep Dive
20190731 Black Belt Online Seminar Amazon ECS Deep DiveAmazon Web Services Japan
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Web Services Japan
 

What's hot (20)

AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
 
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design Pattern
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design PatternAWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design Pattern
AWS Black Belt Online Seminar 2018 Amazon DynamoDB Advanced Design Pattern
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
 
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門20180613 AWS Black Belt Online Seminar AWS Cloud9 入門
20180613 AWS Black Belt Online Seminar AWS Cloud9 入門
 
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送
[Cloud OnAir] Bigtable に迫る!基本機能も含めユースケースまで丸ごと紹介 2018年8月30日 放送
 
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
 
20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本
 
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)
20191029 AWS Black Belt Online Seminar Elastic Load Balancing (ELB)
 
20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMR20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMR
 
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS Black Belt Online Seminar 2017 Amazon S3
AWS Black Belt Online Seminar 2017 Amazon S3AWS Black Belt Online Seminar 2017 Amazon S3
AWS Black Belt Online Seminar 2017 Amazon S3
 
データ活用を加速するAWS分析サービスのご紹介
データ活用を加速するAWS分析サービスのご紹介データ活用を加速するAWS分析サービスのご紹介
データ活用を加速するAWS分析サービスのご紹介
 
AWS Black Belt Techシリーズ AWS Direct Connect
AWS Black Belt Techシリーズ AWS Direct ConnectAWS Black Belt Techシリーズ AWS Direct Connect
AWS Black Belt Techシリーズ AWS Direct Connect
 
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
 
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...
20190402 AWS Black Belt Online Seminar Let's Dive Deep into AWS Lambda Part1 ...
 
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
 
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-
20210330 AWS Black Belt Online Seminar AWS Glue -Glue Studioを使ったデータ変換のベストプラクティス-
 
20190731 Black Belt Online Seminar Amazon ECS Deep Dive
20190731 Black Belt Online Seminar Amazon ECS Deep Dive20190731 Black Belt Online Seminar Amazon ECS Deep Dive
20190731 Black Belt Online Seminar Amazon ECS Deep Dive
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
 

Similar to Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Invent 2018

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...Amazon Web Services
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Amazon Web Services
 
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Amazon Web Services
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Amazon Web Services
 
Open Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFOpen Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFAmazon Web Services
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Amazon Web Services
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...Amazon Web Services LATAM
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Amazon Web Services
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017Amazon Web Services
 
Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practicesKanika Gera
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitAmazon Web Services
 
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Amazon Web Services
 
Oracle & SQL Server on the Cloud: Database Week SF
Oracle & SQL Server on the Cloud: Database Week SFOracle & SQL Server on the Cloud: Database Week SF
Oracle & SQL Server on the Cloud: Database Week SFAmazon Web Services
 

Similar to Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Invent 2018 (20)

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT331) - AWS re:Inv...
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
 
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
Open Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFOpen Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SF
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
MySQL and MariaDB
MySQL and MariaDBMySQL and MariaDB
MySQL and MariaDB
 
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
 
Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practices
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
 
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
 
Oracle & SQL Server on the Cloud: Database Week SF
Oracle & SQL Server on the Cloud: Database Week SFOracle & SQL Server on the Cloud: Database Week SF
Oracle & SQL Server on the Cloud: Database Week SF
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. METRICS-DRIVEN PERFORMANCE TUNING FOR AWS GLUE ETL JOBS Benjamin Sowell Principal Engineer AWS Glue A N T 3 3 2
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Spark and AWS Glue ETL • Apache Spark is a distributed data processing engine with rich support for complex analytics. • AWS Glue builds on the Apache Spark runtime to offer ETL specific functionality. Spark Core: RDDs Spark DataFrames AWS Glue DynamicFrames SparkSQL AWS Glue ETL
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping Actions
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping Jobs
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: data partitions • Apache Spark and AWS Glue are data parallel. • Data is divided into partitions that are processed concurrently. • 1 stage x 1 partition = 1 task Driver Executors Overall throughput is limited by the number of partitions
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue performance: key questions How is your application divided into jobs and stages? How is your dataset partitioned?
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • One common problem is dealing with large numbers of small files. • Can lead to memory pressure and performance overhead. • Let's look at a straightforward JSON to Parquet conversion job • 1.28 million JSON files in 640 partitions: • We will use AWS Glue job metrics to understand the performance.
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Enabling job metrics • Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • First try: use a standard SparkSQL job
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • Driver memory use is growing fast and approaching the 5g max.
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • Case 2: Run using AWS Glue DynamicFrames.
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files Driver memory remains below 50% for the entire duration of execution.
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • 160 max executors: Each partition is assigned 1 task. 4 tasks can be processed on each executor, so there were 640 partitions. • Glue automatically grouped all of the files in each partition.
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Options for grouping files • groupFiles • Set to "inPartition" to enable grouping of files within a partition. • Set to "acrossPartition" to enable grouping of files from different partitions. The partition value will not be added to each record! • Grouping is automatically enabled when there are more than 50,000 files. • groupSize • Target size of each group in bytes. • Default is based on the number of cores in the cluster. • Let's try increasing the group size.
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Aggressively grouping files • Execution is much slower, but hasn't crashed. "groupFiles": "acrossPartition"
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Aggressively grouping files Executor memory is higher than driver. Only one executor is active.
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large files • Processing large text files can also cause problems if they cannot be split. • How the data is compressed makes a big difference in performance. • Files that are uncompressed or bzip2 compressed can be split and processed in parallel. • Files that are gzip compressed cannot be split. • Let's see how this looks on a sample dataset of 5 large csv files. • Each file is • 12.5 GB uncompressed • 1.6 GB gzip • 1.3 GB bzip2 • Script converts data to Parquet.
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large gzip files • We only have 5 partitions – one for each file. • Job fails after 2 hours.
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large bzip2 files • Bzip2 files can be split into blocks, so we see up to 104 tasks. • Job completes in 18 minutes.
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large bzip2 files • With 15 DPU, the number of active executors closely tracks the maximum needed number of executors.
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large uncompressed files • Uncompressed files can be split into lines, so we construct 64MB partitions. • Job completes in 12 minutes.
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large files • If you have a choice of compression type, prefer bzip2. • If you are using gzip, make sure you have enough files to fully utilize your resources. • Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider leaving files uncompressed.
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing a dataset • Common use case: reduce the number of files (32 GB compressed) df = spark.read.format("json").load(...) df2 = df.coalesce(1) df2.write.format("json").save(...) Read is parallelized and then all data is sent to one executor for writing. 1 hr, 52 min.
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing with grouping • You can control this further with grouping • Group size 500 GB. Only a single executor is used for the entire job 1 hr, 27 min.
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing with grouping • You can control this further with grouping • Group size 2 GB. 26 min.Four executors were used to process 16 partitions of ~2 GB each.
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Baseline performance without coalescing • Job is fully parallelized and completes in 6 minutes. • Output data files are around 500 MB.
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrames and schema computation • DynamicFrames are Spark RDDs of self-describing records. • An overall schema is not required for basic ETL operations. • I can drop the field "age" without looking at other records. • Some operations do require a complete schema. • This can force an extra job in your application. age John Doe 36 Any town WA DropNullFields Relationalize ResolveChoice
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields ApplyMapping sets the schema without an additional pass.
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields • There is only one stage in the application.
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Optimizing DynamicFrames • Use ApplyMapping where appropriate to set a schema. • Be thoughtful about where to add transformations like DropNullFields. • The default generated scripts are designed to be safe, but you may be able to optimize if you know your data. • Try your old scripts! • We've increased data conversion speed by 4x in the past year.
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Python performance in AWS Glue • Using map and filter in Python is expensive for large data sets. • All data is serialized and sent between the JVM and Python. • Alternatives • Use AWS Glue Scala SDK. • Convert to DataFrame and use Spark SQL expressions. Spark JVM Python VM
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Lake Formation Build a secure data lake in days Register existing data or load new data using blueprints. Data stored in Amazon S3. Secure data access across multiple services using single set of permissions. No additional charge. Only pay for the underlying services used. Quickly build data lakes Move, store, catalog, and clean your data faster. Use ML transforms to de-duplicate data and find matching records. Easily secure access Centrally define table and column-level data access and enforce it across Amazon EMR, Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, and Amazon QuickSight Share and collaborate Use data catalog in Lake Formation to search and find relevant data sets and share them across multiple users and accounts
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How it works
  • 41. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Benjamin Sowell
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.