
202201 AWS Black Belt Online Seminar Apache Spark Performance Tuning for AWS Glue

Latest AWS Black Belt Online Seminar content: https://aws.amazon.com/jp/aws-jp-introduction/#new  List of content from past online seminars: https://aws.amazon.com/jp/aws-jp-introduction/aws-jp-webinar-service-cut/


  • 1. © 2022, Amazon Web Services, Inc. or its Affiliates. Chie Hayashida, Solutions Architect, 2022/01. AWS Glue Spark Performance Tuning
  • 2. © 2022, Amazon Web Services, Inc. or its Affiliates. Self Introduction: Chie Hayashida, Amazon Web Services Japan, Solutions Architect
  • 3. © 2022, Amazon Web Services, Inc. or its Affiliates. The target audience of this slide deck: o People who have completed the AWS Glue tutorial or have equivalent knowledge. o People who have written Apache Spark applications. o People who would like to improve their existing AWS Glue jobs. o The code in these slides is all PySpark, because many AWS Glue users choose PySpark. o These slides cover Glue 2.0 (Spark 2.4.3) and Glue 3.0 (Spark 3.1.1).
  • 4. © 2022, Amazon Web Services, Inc. or its Affiliates. Agenda • Architecture of AWS Glue and Apache Spark • AWS Glue functions related to performance • How to proceed with AWS Glue Spark performance tuning • AWS Glue Spark performance tuning patterns
  • 5. © 2022, Amazon Web Services, Inc. or its Affiliates. Agenda • Architecture of Apache Spark • AWS Glue specific features (performance related) • How to proceed with performance tuning of AWS Glue Spark • Basic strategy for tuning AWS Glue Spark jobs • Tuning Patterns for AWS Glue Spark Jobs
  • 6. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue and Apache Spark: data sources, crawler, data catalog, serverless engine (a managed service for Apache Spark), scheduler, and other AWS services. (1) Crawl data, (2) manage metadata, (3) triggered manually / by schedule / by event, (4) extract data from the input data source, (5) run the transformation job and load data to the target data source.
  • 7. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Apache Spark
  • 8. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Apache Spark (cluster mode) • Cluster Manager divides a job into one or more tasks and assigns them to executors • On a single worker node, more than one executor can be started. • More than one task can be executed on a single executor. (Diagram: Driver Program with SparkContext, Cluster Manager, and Worker Nodes, each running Executors with Tasks and a Cache.)
  • 9. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Spark (cluster mode) 1) Request the resources needed by the application. 2) Launch the Executors required for the job on each worker. 3) Divide the process into tasks and assign them to the Executors. 4) Each Executor runs its assigned tasks and informs the Driver Program when they are completed, exchanging data between tasks as necessary. 5) After 3) and 4) are repeated several times, the result of the process is returned. (Diagram: Driver Program with SparkContext, Cluster Manager, and Worker Nodes running Executors with Tasks and a Cache.)
  • 10. © 2021, Amazon Web Services, Inc. or its Affiliates. How data is processed • The data being processed is defined as a distributed collection called an RDD • An RDD is made up of one or more "partitions" • A partition is processed by a single "task" • Actual Spark code is often written through an interface called DataFrame, which treats the data as a typed table. (Diagram: files on S3 are loaded into RDDs, each split into partitions, and the results are written back to S3.)
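As a quick illustration of the partition/task relationship above, the following minimal PySpark sketch (the S3 path is hypothetical) loads data and prints how many partitions, and therefore tasks per stage, it will be processed with:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Hypothetical input path; each underlying file/split becomes one or more partitions
df = spark.read.parquet("s3://path/to/data")

# One partition is processed by exactly one task
print(df.rdd.getNumPartitions())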
  • 11. © 2022, Amazon Web Services, Inc. or its Affiliates. Components of Apache Spark: Spark Core (RDD) at the base; DataFrame / Catalyst Optimizer (Spark SQL) on top of it; and Spark ML, Structured Streaming, and GraphX built on those.
  • 12. © 2021, Amazon Web Services, Inc. or its Affiliates. RDD and DataFrame. An RDD looks like a list of rows: [ [1, Bob, 24], [2, Alice, 48], [3, Ken, 10], … ]. A DataFrame looks like a table with columns (col1, col2, col3): (1, Bob, 24), (2, Alice, 48), (3, Ken, 10), … With both interfaces, the code is written as if the data were a list/table, but the actual data is distributed across multiple servers. DataFrame is a high-level API on top of RDDs, and processing described with a DataFrame is internally executed as RDD operations.
  • 13. © 2021, Amazon Web Services, Inc. or its Affiliates. Lazy evaluation • There are two types of Spark processing: "action" and "transformation" • When an "action" is executed, all the previous processing necessary for that action is performed • A series of processes executed by an "action" is called a "job" • Note that a "job" here is different from a Glue job.
>>> df1 = spark.read.csv(…)
>>> df2 = spark.read.json(…)
>>> df3 = df1.filter(…)
>>> df4 = spark.read.csv(…)
>>> df5 = df2.join(df3, …)
Up to this point, no actual processing is done.
>>> df5.count()
This is an action; at this point, the preceding processing is executed for the first time. The df4 processing is not a dependency of df5.count(), so it will not be executed in this action.
  • 14. © 2022, Amazon Web Services, Inc. or its Affiliates. Examples of transformations and actions. Transformations (data generation and processing): select() (selecting columns), read (loading data), filter() (filtering data), groupBy() (aggregation by group), sort() (sorting data). Actions (output the processing result): count() (counting the number of records), write (exporting to the file system), collect() (collecting all data on the driver node), show() (viewing a sample of the data), describe() (viewing data statistics).
  • 15. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark Applications o A Spark application consists of multiple jobs.
glueContext = GlueContext(SparkContext.getOrCreate(conf))
spark = glueContext.spark_session
df1 = spark.read.json(...)
df1.show() # job1
df1.filter(df1.col1 == 'a').write.parquet(...) # job2
df1.filter(df1.col2 == 'b').write.parquet(...) # job3
An application is the set of processes executed in a single GlueContext (or SparkContext).
  • 16. © 2020, Amazon Web Services, Inc. or its Affiliates. Shuffle and Stage. df2 = df1.filter("price">500).groupBy("item").sum().withColumn("bargain", price*0.8) • Stages are divided by shuffling • Multiple tasks are processed concurrently in one stage (Diagram: the partitions of Stage 1 are shuffled into the partitions of Stage 2; each partition is handled by one task.)
  • 17. © 2021, Amazon Web Services, Inc. or its Affiliates. Processing with and without shuffling (exchange of data between tasks).
Example of no shuffling: df2 = df1.filter(price > 500)
df1 (item, price): (beef, 1300), (pork, 200), (chicken, 700), (fish, 400) -> df2: (beef, 1300), (chicken, 700)
Example of shuffling: df2 = df1.groupBy('item').sum()
df1 (item, num): (beef, 2), (pork, 3), (beef, 4), (pork, 5) -> df2: (beef, 6), (pork, 9)
  • 18. © 2022, Amazon Web Services, Inc. or its Affiliates. Processing with and without shuffling (exchange of data between tasks) Processing without shuffling • read • filter() • withColumn() • UDF • coalesce() Processing with shuffling • join() • groupBy() • orderBy() • repartition()
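As a small illustration of the list above (df and its price column are hypothetical), the sketch below contrasts repartition(), which redistributes rows with a shuffle, with coalesce(), which only merges existing partitions and avoids a full shuffle:

# filter() and withColumn() work on each partition locally, so no shuffle is needed
filtered = df.filter(df.price > 500).withColumn("bargain", df.price * 0.8)

# repartition() moves rows between executors, so it triggers a shuffle (new stage)
repartitioned = filtered.repartition(100)

# coalesce() only merges existing partitions, so it avoids a full shuffle
merged = filtered.coalesce(10)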
  • 19. © 2022, Amazon Web Services, Inc. or its Affiliates. Optimization with Catalyst Query Optimizer • Processing described with DataFrames is converted into optimized RDDs by the optimizer before it is executed.
df1 = spark.read.csv('s3://path/to/data1')
df2 = spark.read.parquet('s3://path/to/data2')
df3 = df1.join(df2, df1.col1 == df2.col1)
df4 = df3.filter(df3.col2 == 'meat').select(df3.col3, df3.col5)
df4.show()
(Diagram: the logical plan scans data1 and data2, joins, then filters; the physical plan filters before joining; with storage-side optimization, predicate pushdown and column pruning further reduce the scanned data volume.)
  • 20. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of PySpark • Processing written with the PySpark DataFrame API is converted to Java code (via Py4J). • UDF code written in Python is executed by Python worker processes on each executor, per task. (Diagram: the Python driver talks to the SparkContext via Py4J; each worker runs Executors whose tasks call out to Python workers for UDFs.)
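Because Python UDFs run in separate Python worker processes per task, they are usually slower than built-in DataFrame functions that stay inside the JVM. A minimal sketch (df and its item column are hypothetical):

from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

# Python UDF: each row is serialized to a Python worker and back
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df1 = df.withColumn("item_upper", to_upper(df.item))

# Built-in function: runs entirely inside the JVM and is usually faster
df2 = df.withColumn("item_upper", upper(df.item))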
  • 21. © 2022, Amazon Web Services, Inc. or its Affiliates. Introduction to AWS Glue specific features
  • 22. © 2021, Amazon Web Services, Inc. or its Affiliates. Data Catalog • The Data Catalog holds the metadata (table names, column names, S3 paths, and so on) necessary to access data sources such as S3 and databases from Glue, Athena, Redshift Spectrum, etc. • There are three ways to create metadata in the Data Catalog: crawler, Glue API, and DDL (Athena/EMR/Redshift Spectrum). • Amazon DynamoDB, Amazon S3, Amazon Redshift, Amazon RDS, JDBC-connectable databases, Kinesis, HDFS, etc. can be specified as data sources. • No need to manage a metastore database. (Diagram: the crawler saves metadata about data sources such as DynamoDB, S3, Redshift, RDS, and JDBC-connectable DBs into the Data Catalog; connected services such as Glue ETL, Athena, Redshift Spectrum, EMR, and applications using it as a Hive metastore alternative first access the metadata and then the data.)
  • 23. © 2022, Amazon Web Services, Inc. or its Affiliates. DynamicFrame. AWS Glue functionality that absorbs the inherent complexity of ETL on raw data • As a component, it sits at the same layer as DataFrame, and the two can be converted into each other (with the fromDF and toDF functions). • It can leave the possibility of multiple types to be resolved later (Choice type). • A DynamicFrame refers to the entire dataset, while a DynamicRecord refers to a single row of data. (Diagram: in the Apache Spark architecture, AWS Glue DynamicFrame sits alongside Spark DataFrame / Catalyst Optimizer on top of Spark Core (RDDs); a DynamicFrame resembles a semi-structured table of records.)
  • 24. © 2022, Amazon Web Services, Inc. or its Affiliates. Choice type. A DynamicFrame is able to hold multiple types when they are found in a single column • The ResolveChoice method can be used to resolve the types.
Example of a schema with a choice type:
root
|-- uuid: string
|-- deviceid: choice
    |-- long
    |-- string
The deviceid column has both long and string data (e.g. the long value 1234 and the string "1234" are mixed in the same column).
ResolveChoice options: project (keep a single type and discard the rest), cast (cast to a single type), make_cols (keep each type in a separate column, e.g. deviceid_long and deviceid_string), make_struct (convert to a struct type that holds both).
With DataFrame, if more than one type is present in a column (e.g. values 1, 2, ..., 1,000,000 followed by the strings "1000001", "1000002"), processing may be interrupted and have to be redone.
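A minimal sketch of resolving a choice type with resolveChoice (dyf is a DynamicFrame assumed to have been read earlier; the column name deviceid and the chosen actions are illustrative):

# Cast the ambiguous column to a single type
resolved = dyf.resolveChoice(specs=[("deviceid", "cast:long")])
resolved.printSchema()

# Or keep both representations as separate columns (deviceid_long / deviceid_string)
split_cols = dyf.resolveChoice(specs=[("deviceid", "make_cols")])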
  • 25. © 2022, Amazon Web Services, Inc. or its Affiliates. ETL processing that takes advantage of the characteristics of DynamicFrame and DataFrame • DynamicFrame is good for ETL processing, DataFrame is good for table processing • Data input/output and the associated ETL processing are performed with DynamicFrame, while table manipulation is performed with DataFrame. By using DynamicFrame when loading, you can optimize data loading with the AWS Glue Data Catalog, load only differential data, and handle semi-structured data using the Choice type. Table operations such as JOIN are performed in DataFrame. Use the toDF and fromDF functions for mutual conversion between DynamicFrame and DataFrame (no data is copied; the conversion cost is within a few milliseconds). Typical flow of an ETL job: load the input data as a DynamicFrame, convert with toDF, run the table operations, convert back with fromDF, and output the result file.
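A sketch of this flow, assuming hypothetical Data Catalog tables (example_db.sales, example_db.items) and output path: load with DynamicFrame, convert with toDF for table operations, then convert back with fromDF to write.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Load input data as DynamicFrames (catalog-aware reads, bookmarks, Choice type)
sales_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="sales", transformation_ctx="sales_src")
items_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="items", transformation_ctx="items_src")

# Table operations (join, aggregation) on DataFrames
result_df = (sales_dyf.toDF()
             .join(items_dyf.toDF(), "item_id")
             .groupBy("category").sum("price"))

# Convert back to a DynamicFrame and write the result
result_dyf = DynamicFrame.fromDF(result_df, glueContext, "result")
glueContext.write_dynamic_frame.from_options(
    frame=result_dyf, connection_type="s3",
    connection_options={"path": "s3://path/to/output"}, format="parquet")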
  • 26. © 2022, Amazon Web Services, Inc. or its Affiliates. Bookmark function. A function to process only the delta data when performing steady-state ETL • It uses file timestamps to process only the data that was not processed by previous job runs, preventing duplicate processing and duplicate data. (Diagram: of the files under s3://path/to/data, already-processed data is not loaded; only unprocessed data is the load target.)
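Job bookmarks are enabled on the job definition and then driven from the script through transformation_ctx and job.commit(); a minimal sketch (the database and table names are hypothetical):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx lets the bookmark remember which files were already processed
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="events", transformation_ctx="events_src")

glueContext.write_dynamic_frame.from_options(
    frame=dyf, connection_type="s3",
    connection_options={"path": "s3://path/to/output"}, format="parquet",
    transformation_ctx="events_sink")

job.commit()  # persist the bookmark state for the next run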
  • 27. © 2022, Amazon Web Services, Inc. or its Affiliates. How to proceed with performance tuning of AWS Glue ETL
  • 28. © 2022, Amazon Web Services, Inc. or its Affiliates. The Performance Tuning Cycle 1. Determine performance goals. 2. Measure the metrics. 3. Identify bottlenecks. 4. Reduce the impact of the bottlenecks. 5. Repeat steps 2 to 4 until the goal is achieved. 6. Achieve the performance goals.
  • 29. © 2022, Amazon Web Services, Inc. or its Affiliates. Understand the characteristics of AWS Glue/Apache Spark • Distributed processing • There are tuning patterns, such as shuffling and data skew, that do not appear in single-process applications. • Lazy evaluation • Because Spark processing is lazily evaluated, an error may not be caused by the API executed just before it occurred, but by an API written earlier in the script. • Impact of the optimizer • Since Spark processing is optimized internally, it can be difficult to determine which part of the script corresponds to the processing you see in the Spark UI; you need to check multiple metrics to find the cause.
  • 30. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark parameters in AWS Glue • Spark can essentially be tuned through parameters at job execution time, but because AWS Glue is a serverless service, tune according to AWS Glue's best practices first, before looking at Spark's parameters. • Users should test thoroughly when changing the value of a Spark parameter.
  • 31. © 2022, Amazon Web Services, Inc. or its Affiliates. Metrics to check • Spark UI (Spark History Server) • Shows the details of Spark processing • Executor logs and driver log • Show the stdout/stderr logs of the executors and the driver • AWS Glue job metrics • Show the CPU, memory, and other resource usage of each executor and the driver node • Statistics obtained from the Spark API • Show samples and statistical values of the intermediate data
  • 32. © 2022, Amazon Web Services, Inc. or its Affiliates. Setting up a job to do the tuning In order to get the logs needed for tuning, you need to check the box to use Monitoring Options in the Add job screen.
  • 33. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark UI
  • 34. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server. Spark event logs can be visualized by running the Spark History Server. There are several ways to launch the Spark History Server: • Using AWS CloudFormation • Using Docker to launch it on a local PC • Downloading Apache Spark on your local PC or EC2 and starting the Spark History Server • Using an EMR cluster https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
  • 35. © 2022, Amazon Web Services, Inc. or its Affiliates. Example of launching with Docker o Once the Docker container has started, access http://localhost:18080 in your browser.
$ docker build -t glue/sparkui:latest .
$ docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
Specify the S3 path of the Spark history logs in spark.history.fs.logDirectory.
  • 36. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server. Check the duration of each application, and click an application to check its details.
  • 37. © 2022, Amazon Web Services, Inc. or its Affiliates. List of jobs executed by the application: Completed Jobs and Failed Jobs. Check the jobs that take a long time.
  • 38. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of a job. Identify stages that are failing or taking a long time.
  • 39. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the contents of the Stage. If the task bars in the event timeline differ in length, skew is occurring and the work isn't distributed sufficiently. Also check the data size in the Stage.
  • 40. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of the Stage (continued). Check the details of what is taking so long, such as the Task Time for each Executor. If there is spill to disk, choose a worker type with more memory to solve it. In addition to the Event Timeline above, data skew can also be seen in the Summary Metrics and Tasks sections.
  • 41. © 2022, Amazon Web Services, Inc. or its Affiliates. View details of tasks that are failing. Click on "details" to learn more, and view the log of the Executor where the failure is occurring. For AWS Glue ETL, note the Executor ID and check the corresponding CloudWatch Log group.
  • 42. © 2022, Amazon Web Services, Inc. or its Affiliates. Environmental settings for Spark application runtime o Spark options, dependencies, and so on.
  • 43. © 2022, Amazon Web Services, Inc. or its Affiliates. List of Driver and Executor nodes If the cluster is running and accessible from the History Server, the logs of each Driver/Executor can be seen.
  • 44. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the Spark SQL Query Execution Plan You can see the actual execution plan, which is more accurate than the explain API.
  • 45. © 2022, Amazon Web Services, Inc. or its Affiliates. Executor and Driver logs
  • 46. © 2022, Amazon Web Services, Inc. or its Affiliates. Check Log groups in CloudWatch Executor Logs Driver Log
  • 47. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue Job Metrics
  • 48. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the resource usage of Executor and Driver You can also create a Dashboard for CloudWatch and add other metrics.
  • 49. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark API
  • 50. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with commands o Use the following APIs to check the trend of the intermediate data during processing and use it to decide your tuning strategy. o Note that the ones that are actions will slow the job down if you call them too many times. • count() • printSchema() • show() • describe([cols*]).show() • explain() • df.agg(approx_count_distinct(df.col))
  • 51. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.count() o Check the number of records. o Use df.groupBy('col_name').count() to check for skew.
  • 52. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.printSchema() o Check the schema information of the DataFrame.
  • 53. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.show() o The number of records to be displayed can be specified by using df.show(5).
  • 54. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.describe([cols*]).show() o The statistics for each column can be seen.
  • 55. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.explain() o Check the execution plan created by the optimizer.
  • 56. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.agg(approx_count_distinct(df.col)) o Check the cardinality in the columns o Fast because HyperLogLog is used
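  A minimal usage sketch of the API above (the column name user_id is a hypothetical example):
    from pyspark.sql.functions import approx_count_distinct
    # Estimate the number of distinct values with HyperLogLog instead of an exact distinct count
    df.agg(approx_count_distinct(df.user_id).alias('approx_users')).show()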
  • 57. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue ETL Performance Tuning Pattern
  • 58. © 2022, Amazon Web Services, Inc. or its Affiliates. Basic strategy for tuning AWS Glue ETL jobs • Use the new version • Reduce the data I/O load. • Minimize shuffling. • Speed up per-task processing. • Parallelize
  • 59. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version
  • 60. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version o When the Spark application is slow, simply replacing the job execution environment with the latest version may speed up the process. o Both AWS Glue and Apache Spark are evolving in every way, not just in performance. Use the newest version possible. https://aws.amazon.com/jp/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/
  • 61. © 2022, Amazon Web Services, Inc. or its Affiliates. Using AWS Glue 2.0 and 3.0 to reduce startup time Dramatically reduced the time required to launch AWS Glue ETL jobs • Cold starts used to take roughly 8 to 10 minutes in AWS Glue 0.9/1.0. • In AWS Glue 2.0, startup takes less than 1 minute. (chart: start-up time plus execution time, AWS Glue 1.0 vs AWS Glue 2.0)
  • 62. © 2021, Amazon Web Services, Inc. or its Affiliates. 2. Submit job to virtual cluster AWS Glue 2.0+ integrated scheduling and provisioning 1. Run AWS Glue job 3. Spark schedules tasks to executors Job manager 4. Dynamically grow virtual clusters 5. Spark utilizes new executors AZ1 AZ2 Job starts when first executor is ready Reduced start time Reduced start variance Jobs run on reduced capacity Graceful degradation
  • 63. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize data I/O
  • 64. © 2022, Amazon Web Services, Inc. or its Affiliates. How to minimize the data I/O load • Read only the data you need. • Control the amount of data read in one task. • Choose the right compression format.
  • 65. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Apache Parquet • Column-oriented format with a data arrangement suitable for analytical applications • Data types are preserved • Compresses effectively • Aggregation can skip unnecessary data by using metadata • The Spark engine has built-in integration with Apache Parquet and can read it efficiently https://parquet.apache.org/documentation/latest/
  • 66. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown Reduce the amount of data to be read • Partition Filtering • The ability to read only the files in the partitions specified by the filter or where clause. • Available in Text/CSV/JSON/ORC/Parquet • Filter Pushdown • The ability to read only the blocks that match the filter or where clause for columns that are not used as partition columns. • AWS Glue automatically applies this when Parquet is used.
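  As an illustration of the two mechanisms above, the following sketch assumes a hypothetical S3 path, a partition column year, and a regular column price:
    # Data laid out as s3://my-bucket/sales/year=2020/..., year=2021/...
    df = spark.read.parquet('s3://my-bucket/sales/')
    # 'year' is a partition column  -> partition filtering: only the matching directories are listed
    # 'price' is a regular column   -> filter pushdown: only the matching Parquet row groups are read
    df_filtered = df.filter("year = '2021' AND price > 100")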
  • 67. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown (diagram)
  • 68. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown Partition Filtering can be used when a partition directory structure has been created; for DataFrame and DynamicFrame writes, the partition directories can be created by using the partitionBy option as follows: df.write.parquet(path=path, partitionBy='col_name') It may be more efficient to place the columns that appear most frequently in filter clauses at the higher levels of the partition hierarchy.
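  For the DynamicFrame side, a partitioned layout can be produced with the partitionKeys connection option; a minimal sketch assuming a DynamicFrame dyf and a hypothetical output path:
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type='s3',
        connection_options={
            'path': 's3://my-bucket/output/',
            'partitionKeys': ['col_name']   # creates col_name=<value>/ directories
        },
        format='parquet')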
  • 69. © 2022, Amazon Web Services, Inc. or its Affiliates. Using push_down_predicate in DynamicFrame When reading a DynamicFrame from the AWS Glue Data Catalog, push_down_predicate reads only the files in the partitions that match the specified filter / where condition. partitionPredicate = "(product_category == 'Video')" datasource = glue_context.create_dynamic_frame.from_catalog( database = "githubarchive_month", table_name = "data", push_down_predicate = partitionPredicate)
  • 70. © 2022, Amazon Web Services, Inc. or its Affiliates. Choose a compression codec based on your application • The compression codec can be selected when writing data. • There is a trade-off between compression ratio and compression/decompression speed. • Files compressed with bzip2, lzo, and snappy can be split when read, but files compressed with gzip(*) cannot be split. (*) Parquet files can be split even when compressed with gzip. • Uncompressed files do not require compression/decompression time, but data transfer cost may become a bottleneck. • If processing speed is important to you, choose snappy or lzo. ex. df.write.csv("path/to/csv", compression="gzip")
  Codec            gzip     bzip2    lzo              snappy
  File extension   .gz      .bz2     .lzo             .snappy
  Compression      High     Highest  Average          Average
  Speed            Medium   Slow     Fast             Fast
  CPU usage        Medium   High     Low              Low
  Splittable       No(*)    Yes      Yes, if indexed  No
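  A minimal write sketch (hypothetical output paths): snappy-compressed Parquet keeps files splittable and fast to decompress:
    df.write.parquet('s3://my-bucket/output_parquet/', compression='snappy')
    # For plain CSV, a splittable codec such as bzip2 can be chosen when downstream splitting matters
    df.write.csv('s3://my-bucket/output_csv/', compression='bzip2')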
  • 71. © 2022, Amazon Web Services, Inc. or its Affiliates. Store data in appropriate file sizes. • Data read/write tasks are basically tied to a single file. (If the file is splittable, one file can be split across multiple tasks.) • The recommended file size for AWS Glue is 128MB-512MB. When the data is too small • Overhead due to a large number of small tasks When there is a large unsplittable file • All the data must be processed on a single node and may not fit in its memory • No distributed processing (the other Executors sit idle)
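  One common way to land files in the recommended size range is to control the number of output partitions before writing; a rough sketch with hypothetical numbers (e.g. a ~60GB DataFrame written as ~240 files of roughly 256MB each):
    df.repartition(240).write.parquet('s3://my-bucket/output/')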
  • 72. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Bounded Execution with DynamicFrame When there is a lot of data to be read, Bounded Execution can be used together with Job Bookmarks to split the processing instead of reading all the unprocessed data at once. glueContext.create_dynamic_frame.from_catalog( database = "database", table_name = "table_name", redshift_tmp_dir = "", transformation_ctx = "datasource0", additional_options = { "boundedFiles" : "500", # must be a string # "boundedSize" : "1000000000" # unit is bytes } )
  • 73. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame's groupFiles and groupSize • Eliminate overhead by reading small files together in a single task. • Useful for processing data that is output every few minutes by Kinesis Data Firehose. • Use the groupFiles option to group the data in the S3 partition, and the groupSize option to specify the size of the group to be read. df = glueContext.create_dynamic_frame_from_options( 's3', {'paths': ['s3://s3path/'], 'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format='json') note: groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.
  • 74. © 2022, Amazon Web Services, Inc. or its Affiliates. Number of files and processing time for DataFrame and DynamicFrame
  • 75. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame S3ListImplementation • If there are a lot of small files, the large number of tasks can cause a Driver OOM. • When useS3ListImplementation is True, the S3 list results are read and processed in batches of 1,000 objects at a time, which prevents the driver memory from being overloaded by the S3 listing. datasource = glue_context.create_dynamic_frame.from_catalog( database = "my_database", table_name = "my_table", push_down_predicate = partitionPredicate, additional_options = {"useS3ListImplementation":True} )
  • 76. © 2022, Amazon Web Services, Inc. or its Affiliates. Set the Partition Index When reading a DataFrame from the AWS Glue Data Catalog from a table with many partitions consisting of multiple partition keys, setting a Partition Index reduces the time needed to fetch the matching partitions when there is a filter or where clause on the partition columns. https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
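  Partition indexes can also be added to an existing catalog table through the AWS Glue API; a minimal sketch using boto3, with hypothetical database, table, and key names:
    import boto3
    glue = boto3.client('glue')
    glue.create_partition_index(
        DatabaseName='my_database',
        TableName='my_table',
        PartitionIndex={
            'IndexName': 'year_month_idx',
            'Keys': ['year', 'month']   # must be existing partition keys of the table
        })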
  • 77. © 2022, Amazon Web Services, Inc. or its Affiliates. The difference of query planning time between using Partition Index and not using Partition Index https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
  • 78. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel Data Reading in DataFrame JDBC Connections • spark.read.jdbc() only allows one Executor to access the target database by default. • For parallel reading, the partition column, lowerBound, upperBound, and numPartitions must be specified. The partition column must be one of the following types: numeric, date, or timestamp.
  df = spark.read.jdbc(
      url=jdbcUrl,
      table="sample",
      column="col1",        # partition column (numeric, date, or timestamp)
      lowerBound=1,
      upperBound=100000,
      numPartitions=100,
      properties=connectionProperties)   # options such as fetchsize can be set in connectionProperties
  • 79. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel data reading in DynamicFrame JDBC connection o If you want to read data from a JDBC connection as a DynamicFrame, specify hashfield/hashexpression. o With hashfield, strings and other non-numeric columns can also be used as partition columns. glueContext.create_dynamic_frame.from_catalog( database = "my_database", table_name = "my_table_name", transformation_ctx = "my_transformation_context", additional_options = { 'hashfield': 'customer_name', 'hashpartitions': '5' } ) https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
  • 80. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling
  • 81. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling • Make good use of cache • Perform filter processing in the first stage as much as possible. • Devise the order of joins to keep the data small. • Optimize join strategy • Remove data skew • Use Window processing instead of data processing by self join
  • 82. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling The processing described with the DataFrame API is optimized by the Catalyst Optimizer. However, it is not perfect in the following respects: • If there is a cache() in between, optimization is not applied across the cache boundary. • Spark 2.4.3, used in AWS Glue 1.0 and 2.0, disables the cost-based optimizer by default. Shuffling can be reduced by manually changing the order and strategy of filters and joins.
  • 83. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • When branching the processing of a single DataFrame to produce multiple outputs, recomputation can be prevented by inserting cache() just before the branch. • Note that it may sometimes be faster not to use cache, and that too much caching will consume local disk space. (diagram: with cache() the process up to the creation of df2 is executed only once; without it, it is executed twice)
  • 84. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • By default, cached data is stored in the memory initially allocated for caching, and whatever does not fit in memory is stored on the local disk. • Users can choose memory only or disk only by calling persist() with an explicit StorageLevel. Example of caching in memory only: from pyspark import StorageLevel; df.persist(StorageLevel.MEMORY_ONLY)
  • 85. © 2022, Amazon Web Services, Inc. or its Affiliates. Delete cache that is no longer in use • A cached Dataframe will continue to occupy memory and local disk space. • Save memory and disk space by deleting the Dataframe cache when it is no longer needed. df.unpersist()
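  A minimal sketch of the caching pattern described on the previous slides, with hypothetical column names and output paths:
    # Cache just before the branch so df2 is computed only once
    df2 = df1.filter(df1['status'] == 'active').cache()
    out_a = df2.groupBy('country').count()
    out_b = df2.select('user_id', 'country')
    out_a.write.parquet('s3://my-bucket/out_a/')   # first action materializes and caches df2
    out_b.write.parquet('s3://my-bucket/out_b/')   # second action reuses the cached df2
    df2.unpersist()                                # release memory and local disk when done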
  • 86. © 2022, Amazon Web Services, Inc. or its Affiliates. Perform filter processing in the first stage as much as possible The filter process can be placed before cache() to reduce the amount of data during processing.
  • 87. © 2022, Amazon Web Services, Inc. or its Affiliates. Work out the order of joins • The end result is the same, but the data size of the intermediate DataFrames is different. • In Glue 3.0 (Spark 3.1.1), the cost-based optimizer takes the amount of data into account and optimizes the order of joins. (diagram: in one plan, df1 (4000 rows) left join df2 (1000 rows) produces df4 (4000 rows), then a join with df3 (10 rows) produces df5 (10 rows); in the other, df1 is first joined with df3 to produce df4 (10 rows), then left joined with df2 to produce df5 (10 rows))
  • 88. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways Sort Merge Join • Distribute the two tables to be joined by their respective keys, sort them, and then join them. • Suitable for joining large tables together. Broadcast Join • Send (broadcast) the smaller table to all Executors and join it with each partition of the other table. • Suitable when one table is much smaller than the other. Shuffle Hash Join • Distribute the two tables to be joined and join them without sorting. • Suitable for joins between tables that are not so large.
  • 89. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways • By default, if a table's size is less than or equal to the value specified in spark.sql.autoBroadcastJoinThreshold (default 10MB), Broadcast Join will be used. • The join strategy in use can be seen in the Spark UI or by using explain(). • If join performance is the bottleneck, changing the join strategy manually may improve performance.
  from pyspark.sql.functions import broadcast
  df1.join(
      broadcast(df2),
      df1['col1'].eqNullSafe(df2['col1'])
  ).explain()
  == Physical Plan ==
  BroadcastHashJoin [coalesce(col1#6, )], [coalesce(col1#21, )], Inner, BuildRight, (col1#6 <=> col1#21)
  :- LocalTableScan [first_name#5, col1#6]
  +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
     +- LocalTableScan [col1#21, col2#22, population#23]
  https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html#broadcast-hint-for-sql-queries
  • 90. © 2022, Amazon Web Services, Inc. or its Affiliates. coalesce For the following reasons, partitions may be split into smaller pieces during processing: • a large number of small files is loaded, or • groupBy is performed on columns with high cardinality. In such a case, it is better to merge the partitions before the next process to reduce its overhead. Since repartition involves shuffling, it may be desirable to use coalesce. However, since coalesce is a simple merge, the data after it may be skewed. Glue 3.0 has a new feature called Adaptive Query Execution that automatically optimizes the number of partitions by coalescing them. (diagram: df.repartition(2) vs df.coalesce(2))
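  A minimal sketch of the difference (hypothetical partition counts), plus the Adaptive Query Execution settings that automate this on Glue 3.0 (Spark 3.1.1):
    df_merged = df.coalesce(50)        # merge partitions without a shuffle (may remain uneven)
    df_even   = df.repartition(200)    # full shuffle, evenly sized partitions
    # On Glue 3.0, AQE can coalesce shuffle partitions automatically
    spark.conf.set('spark.sql.adaptive.enabled', 'true')
    spark.conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')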
  • 91. © 2022, Amazon Web Services, Inc. or its Affiliates. Use Window processing instead of self join and data aggregation o When aggregate data is created from a single log DataFrame and joined back to it, the cost of the join can be eliminated by using Window processing instead.
  df_agg = df.groupBy('gender', 'age').agg(
      F.mean('height').alias('avg_height'),
      F.mean('weight').alias('avg_weight'))
  df = df.join(df_agg, on=['gender', 'age'])
  w = Window.partitionBy('gender', 'age')
  df = df.withColumn('avg_height', F.mean(F.col('height')).over(w)) \
         .withColumn('avg_weight', F.mean(F.col('weight')).over(w))
  • 92. © 2022, Amazon Web Services, Inc. or its Affiliates. Speed up per-task processing
  • 93. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Scala Most DataFrame operations can be written in PySpark and are internally converted to Java and run on a JVM, but the following operations are slower when using Python. If they are the bottleneck, Scala will speed up the process. • The parts that describe processing with RDDs • Processing written with RDDs is not optimized by the optimizer. • The parts that use UDFs • Described later
  • 94. © 2022, Amazon Web Services, Inc. or its Affiliates. Avoid UDFs in PySpark Performance issues • Serialization and piping to the Python process occur for each iterator. • The memory of the Python process is not controlled by the JVM. (diagram: the JVM physical operator sends batches of rows through the Python Runner to the Python worker, which deserializes them, invokes the UDF, and serializes the results back)
  • 95. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Pandas UDF over Python UDF Python UDF • Serialization/deserialization is done by pickling • Data is fetched block by block, but UDF processing is performed row by row Pandas UDF • Serialization/deserialization is done by Apache Arrow • Both data fetch and UDF processing are performed on multiple rows at once
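  A minimal sketch of the difference, using a hypothetical temperature column; the type-hinted Pandas UDF style shown here is the Spark 3 (Glue 3.0) syntax:
    import pandas as pd
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import DoubleType

    # Python UDF: rows are pickled and processed one by one in the Python worker
    @udf(DoubleType())
    def to_fahrenheit_py(c):
        return c * 9.0 / 5.0 + 32.0

    # Pandas UDF: batches of rows are transferred via Apache Arrow and processed as pandas Series
    @pandas_udf(DoubleType())
    def to_fahrenheit_pd(c: pd.Series) -> pd.Series:
        return c * 9.0 / 5.0 + 32.0

    df = df.withColumn('temp_f', to_fahrenheit_pd('temp_c'))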
  • 96. © 2022, Amazon Web Services, Inc. or its Affiliates. Performance differences between Python UDF, Pandas UDF, and Spark API in AWS Glue ETL Execution Time(s)
  • 97. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallelize
  • 98. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with Skewness in Data If there is data variation between partitions, the load will be concentrated on the few tasks that process the large partitions, causing processing delays: the data is biased toward only some partitions and becomes a bottleneck in processing time. When does it happen? • When the sizes of the files to be read are uneven • When joining data whose number of records differs greatly per join key • When the number of records per key varies in a df.groupBy()
  • 99. © 2022, Amazon Web Services, Inc. or its Affiliates. Addressing data skew • Ensure that the file sizes are uniform when creating input files. • Repartition • Broadcast join • Salting
  • 100. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with Skewness Repartition If the subsequent process is not a key-by-key process (partitioning and storing data by date, window processing by key, etc.), repartition will resolve the skew. df.repartition(200) Partition Partition Partition Partition Partition Partition ... 200 partitions 3 partitions
  • 101. © 2021, Amazon Web Services, Inc. or its Affiliates. Dealing with Skewness broadcast join If one DataFrame is small enough to fit all its data in one Executor, and the other DataFrame is huge with a skewed join key column, processing performance can be improved by using Broadcast Join as the join strategy. (diagram: with Sort Merge Join the 1000 rows for the skewed key are shuffled into a single partition, while with Broadcast Join each partition of roughly 500 rows is joined locally against the broadcast table)
  • 102. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with Skewness Salting In the case of a join between two sufficiently large data sets where one side has skew, "salt" can be added to the join key to eliminate the load bias. (diagram: the partition of Table B that corresponds to the skewed partition of Table A is cloned for each salt value)
  • 103. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with Skewness Salting o In Glue 2.0 (Spark 2.4.3), the user needs to write the salting code manually. o In Glue 3.0 (Spark 3.1.1), a new feature called Adaptive Query Execution dynamically handles skew joins.
  # Salt the skewed column of the large DataFrame
  df_big = df_big.withColumn(
      'shop_salted',
      F.concat(df_big['shop'], F.lit('_'),
               (F.floor(F.rand() * numPartition) + 1).cast('string')))
  # Explode the smaller DataFrame so that every salt value has a matching row
  df_medium = df_medium.withColumn(
      'salt',
      F.explode(F.array([F.lit(i) for i in range(1, numPartition + 1)])))
  df_medium = df_medium.withColumn(
      'shop_exploded',
      F.concat(df_medium['shop'], F.lit('_'), df_medium['salt'].cast('string')))
  # Join on the salted key
  df_join = df_big.join(
      df_medium, df_big['shop_salted'] == df_medium['shop_exploded'])
  https://spark.apache.org/docs/latest/sql-performance-tuning.html#optimizing-skew-join
  • 104. © 2022, Amazon Web Services, Inc. or its Affiliates. Assigning Incremental IDs with Performance in Mind • If the Window function row_number() is used to assign consecutive incremental IDs to all records, the process will be slow because of the aggregation across all records. • monotonically_increasing_id() can assign incremental IDs without that aggregation by allowing the IDs to be non-contiguous across partitions (e.g. 1, 2, 3, 6, 7, 8, ... instead of 1, 2, 3, 4, 5, 6, ...).
  df.withColumn('id', F.row_number().over(Window.orderBy('col2')))
  df.withColumn('id', F.monotonically_increasing_id())
  • 105. © 2022, Amazon Web Services, Inc. or its Affiliates. Selecting a Worker Type for AWS Glue • The processing power allocated at job execution time is called a DPU (Data Processing Unit). • 1 DPU = 4 vCPU, 16GB memory • Each Worker Type has a different resource capacity and configuration. Worker Type list:
  Worker Type   DPUs/Worker   Executors/Worker   Memory/Worker   Disk/Worker
  Standard      1             2                  16GB            50GB
  G.1X          1             1                  16GB            64GB
  G.2X          2             1                  32GB            128GB
  (diagram: Worker Type configuration image)
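  The Worker Type and number of workers are set when the job is defined; a minimal sketch with boto3, where the job name, role, and script location are hypothetical:
    import boto3
    glue = boto3.client('glue')
    glue.create_job(
        Name='my-etl-job',
        Role='MyGlueServiceRole',
        Command={'Name': 'glueetl', 'ScriptLocation': 's3://my-bucket/scripts/job.py'},
        GlueVersion='3.0',
        WorkerType='G.2X',       # 2 DPU / 32GB memory per worker
        NumberOfWorkers=10)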
  • 106. © 2022, Amazon Web Services, Inc. or its Affiliates. Ideal resource usage • It is desirable that resources are used evenly, without waste, by all Executors. • If not, there is likely room for tuning. • Choose the initial Worker Type by predicting the resource profile from the processing content. • Examples: • CPU usage is likely to be high when there are complex UDFs and similar processing. • Memory usage is likely to be high when the shuffle size becomes large, such as when joining large amounts of data.
  • 107. © 2022, Amazon Web Services, Inc. or its Affiliates. Trade-off between number of workers and job execution time Job execution time can be reduced without increasing cost by increasing the number of workers, as long as there is enough parallelism to keep the resources effectively utilized. (charts: AWS Glue ETL job execution time vs. number of workers)
  • 108. © 2022, Amazon Web Services, Inc. or its Affiliates. Summary • Introduced tuning patterns for AWS Glue Spark ETL jobs. • AWS Glue can process large amounts of data with high performance as-is, but it can be tuned to achieve even higher performance and scalability. • Tuning requires checking metrics to identify bottlenecks and eliminating their causes.