© 2019, Amazon Web Services, Inc. or its Affiliates.
AWS Cloud Support Engineer
ChiaWei Hsu
Athena & Glue
Amazon Athena
AWS Athena
Amazon Athena uses Presto with ANSI SQL support and works with a variety
of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.
Reference :
https://aws.amazon.com/athena/
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
https://docs.aws.amazon.com/general/latest/gr/rande.html
Athena Common Issue
• 503 Slow Down
This error means the S3 bucket is receiving too many requests per second.
S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests
per second per prefix in a bucket. If the request rate on a prefix exceeds
these rates, S3 throttles the requests and returns the 503 Slow Down error.
• Solution: Combine small files into fewer, larger files
Reference:
https://docs.aws.amazon.com/zh_tw/AmazonS3/latest/dev/optimizing-performance.html
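One way to plan the compaction is to group small objects into batches close to a target size and then rewrite each batch as a single larger file. The sketch below only does the grouping. The function name, the (key, size) input, and the 128 MB target are assumptions for illustration and are not from the slides.

```python
from typing import List, Tuple

TARGET_BYTES = 128 * 1024 * 1024  # assumed target size per merged file


def plan_batches(objects: List[Tuple[str, int]],
                 target: int = TARGET_BYTES) -> List[List[str]]:
    """Greedily group object keys so each batch totals at most `target` bytes.

    An object that is larger than `target` ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in objects:
        # Close the current batch when adding this object would exceed the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch is a list of keys to merge into one output object. Fewer objects under a prefix means fewer GET requests per query, which is what relieves the throttling.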
Athena Common Issue
• Athena only queries data stored in Amazon S3
• JSON data must have each record on a single line
• Other errors
Reference :
https://aws.amazon.com/athena/
https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
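For the single-line JSON requirement, a pretty-printed file can be rewritten so that every record sits on its own line before it is uploaded to S3. This is a minimal sketch using only the Python standard library. The function name is hypothetical, and it assumes the input is a JSON array of records or a single JSON object.

```python
import json


def to_json_lines(pretty_json: str) -> str:
    """Convert a JSON array (possibly pretty-printed) to one compact record per line."""
    records = json.loads(pretty_json)
    if isinstance(records, dict):
        # A single object becomes a one-record file.
        records = [records]
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
```

For example, an array of two objects spread over several lines comes back as exactly two lines, one object per line.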
Athena Common Issue – Slow
1. Partition your data
2. Optimize file sizes
....
Partition:
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
Reference :
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
https://docs.aws.amazon.com/zh_tw/athena/latest/ug/partitions.html
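To see why partitioning helps, the sketch below mimics partition pruning on the Hive-style `dt=...` prefixes listed above: a filter on the partition column selects prefixes by name, so data under the other prefixes is never read. It is an illustration only, not how Athena is implemented, and the function name is hypothetical.

```python
from typing import List


def prune_partitions(prefixes: List[str], start: str, end: str) -> List[str]:
    """Return the prefixes whose dt value satisfies start <= dt < end."""
    kept = []
    for prefix in prefixes:
        name = prefix.rstrip("/").split("/")[-1]  # e.g. "dt=2009-04-12-13-05"
        column, _, value = name.partition("=")
        # The dt values are fixed-width, so string comparison follows time order.
        if column == "dt" and start <= value < end:
            kept.append(prefix)
    return kept
```

With the three prefixes from the listing and the range 13:00 to 13:10, only the first two prefixes are kept.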
Create Table Query
CREATE EXTERNAL TABLE impressions (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string,
    number string,
    processId string,
    browserCookie string,
    requestEndTime string,
    timers struct<modelLookup:string, requestTime:string>,
    threadId string,
    hostname string,
    sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
    with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
Load Partition Info and Query
MSCK REPAIR TABLE impressions;
SELECT dt, impressionid
FROM impressions
WHERE dt >= '2009-04-12-13-00'
  AND dt < '2009-04-12-14-00'
ORDER BY dt DESC
LIMIT 100;
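MSCK REPAIR TABLE scans the whole table location to discover partitions. When the new partition values are already known, an alternative is to register them explicitly with ALTER TABLE ADD PARTITION. The sketch below only builds that statement as a string. The helper name and the bucket used in the usage example are hypothetical.

```python
from typing import List


def add_partition_ddl(table: str, location: str, dt_values: List[str]) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for the given dt values."""
    base = location.rstrip("/")
    clauses = [
        f"PARTITION (dt = '{dt}') LOCATION '{base}/dt={dt}/'"
        for dt in dt_values
    ]
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)
```

Calling `add_partition_ddl("impressions", "s3://example-bucket/impressions/", ["2009-04-12-13-00"])` yields a statement that can be pasted into the Athena query editor.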
Athena Common Issue – Other Error
Information to gather: 1. Query ID 2. Region 3. Sample data
Athena Common Issue – UnexpectedResult
Create Support Case
Best Practice for Athena
• Top 10 Performance Tuning Tips for Amazon Athena
• Partitioning: the easiest technique to understand and the most effective. It saves both money and time.
AWS Glue
Glue
https://docs.aws.amazon.com/zh_tw/athena/latest/ug/glue-athena.html
Glue Common Issue
• Crawler is too slow
• ETL job is too slow
• ETL job runs out of memory (OOM)
• Crawler or ETL job fails
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html
Glue Common Issue – Crawler too slow
• The crawler lists every prefix in S3 and decides whether to read each one
• The more data there is, the slower the crawl
• Solution: Use exclude patterns
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html
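Exclude patterns are set per S3 target on the crawler. The sketch below builds the request that boto3's Glue `create_crawler` call accepts, with the patterns under `Exclusions`. Every name, path, and pattern here is a hypothetical placeholder, and the actual API call is left as a comment so the block runs without AWS credentials.

```python
from typing import Any, Dict, List


def crawler_request(name: str, role: str, database: str,
                    s3_path: str, exclusions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for the Glue create_crawler API."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path, "Exclusions": exclusions}]},
    }


request = crawler_request(
    name="example-crawler",                  # hypothetical crawler name
    role="ExampleGlueCrawlerRole",           # hypothetical IAM role
    database="example_db",                   # hypothetical catalog database
    s3_path="s3://example-bucket/data/",     # hypothetical include path
    exclusions=["**.tmp", "_temporary/**"],  # illustrative glob patterns
)
# With boto3 installed and credentials configured (not run here):
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Excluding prefixes that never hold table data shortens the listing step, which is where a slow crawler spends most of its time.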
Glue Common Issue – ETL Job too slow
• A Glue ETL job runs in an Apache Spark environment, so we need to gather more
information before troubleshooting.
• Glue ETL is designed for batch jobs
Glue Common Issue
• ETL Driver OOM
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
Glue Common Issue
• ETL Driver OOM: possible causes
1. Listing too many files
2. rdd.collect(), a Spark action that pulls the whole dataset into the driver
• Solution
Use a pushdown predicate
Split the work into batch jobs
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/aws-glue-programming-etl-partitions.html
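A pushdown predicate lets the job list and read only the partitions it needs, which keeps the driver from tracking every file in the table. The sketch below assumes a table partitioned by a `dt` column, as in the Athena example earlier in the deck. The helper names are hypothetical, while `create_dynamic_frame.from_catalog` and its `push_down_predicate` parameter belong to the AWS Glue `GlueContext`.

```python
def dt_range_predicate(start: str, end: str) -> str:
    """Build a predicate on the dt partition column: start <= dt < end."""
    return f"dt >= '{start}' AND dt < '{end}'"


def load_window(glue_context, database: str, table: str, start: str, end: str):
    """Read only the matching partitions.

    `glue_context` is expected to be an awsglue GlueContext created inside
    a Glue job; nothing is imported here so the sketch runs anywhere.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table,
        push_down_predicate=dt_range_predicate(start, end),
    )
```

Running one time window per job run is also one way to turn a single large job into smaller batches.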
Glue Common Issue
• ETL Executor OOM
Glue Common Issue - ETL Executor OOM
• Possible root causes
1. rdd.repartition(), a Spark transformation that triggers a full shuffle
2. Data skew
• Solution
1. Check the source data
2. Try a different worker type
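When checking the source data for skew, a quick test is to count records per key on a sample and compare the largest key with the average. This is a minimal sketch in plain Python. The function name is hypothetical, and what counts as a worrying ratio is left to the reader.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def skew_ratio(keys: Iterable[Hashable]) -> Tuple[Hashable, float]:
    """Return the most frequent key and its count divided by the mean count per key."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys given")
    top_key, top_count = counts.most_common(1)[0]
    mean = sum(counts.values()) / len(counts)
    return top_key, top_count / mean
```

A ratio close to 1 means the keys are evenly spread. A large ratio means one key dominates, so the executor that receives it after a shuffle holds far more data than the others.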
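Glue Worker Type
Worker type: the following worker types are available:
Standard – When you choose this type, you also provide a value for Maximum
capacity. Maximum capacity is the number of AWS Glue data processing units
(DPUs) that can be allocated when this job runs. A DPU is a relative measure of
processing power that consists of 4 vCPUs of compute capacity and 16 GB of
memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB disk) and
provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 2 DPUs (8 vCPUs, 32 GB of memory, 128 GB disk) and
provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs and jobs that run ML transforms.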
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-job.html
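To compare worker types, it helps to work out how much memory a given choice provides in total. The sketch below records the G.1X and G.2X figures quoted on this slide and builds the `WorkerType` and `NumberOfWorkers` settings that the Glue job APIs accept. The helper names are hypothetical.

```python
from typing import Any, Dict

# Per-worker figures as quoted on the slide.
WORKER_SPECS = {
    "G.1X": {"vcpu": 4, "memory_gb": 16, "disk_gb": 64},
    "G.2X": {"vcpu": 8, "memory_gb": 32, "disk_gb": 128},
}


def job_settings(worker_type: str, number_of_workers: int) -> Dict[str, Any]:
    """Build the worker-related keyword arguments for a Glue job definition."""
    if worker_type not in WORKER_SPECS:
        raise ValueError(f"unknown worker type: {worker_type}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be positive")
    return {"WorkerType": worker_type, "NumberOfWorkers": number_of_workers}


def total_memory_gb(worker_type: str, number_of_workers: int) -> int:
    """Total memory across all workers, in GB."""
    return WORKER_SPECS[worker_type]["memory_gb"] * number_of_workers
```

Ten G.2X workers give 320 GB in total and twenty G.1X workers give the same, but each G.2X executor has twice the memory, which is what matters when a single executor runs out.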
Glue Common Issue - EnableJobMetrics
• Metrics
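Job metrics can be switched on in the console or through the `--enable-metrics` special job parameter. The sketch below adds that key to a job's default arguments. The helper name is hypothetical, and the empty string value is an assumption based on the key's presence being what enables metrics.

```python
from typing import Dict, Optional


def with_job_metrics(default_arguments: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the job's default arguments with metrics enabled."""
    args = dict(default_arguments or {})
    args["--enable-metrics"] = ""  # assumed: presence of the key enables metrics
    return args
```

The returned dictionary can be passed as `DefaultArguments` when the job is created or updated.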
Glue Common Issue – DPU Planning
Reference: https://docs.aws.amazon.com/zh_tw/glue/latest/dg/monitor-debug-capacity.html
Glue Common Issue – Crawler/ETL Job Fail
Glue Common Issue – Crawler Custom Classifier
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-classifier.html
Create Support Case
When a crawler fails, gather the following information:
• Crawler name
• Logs from crawler runs, located in CloudWatch Logs under /aws-glue/crawlers.
When a test connection fails, gather the following information:
• Connection name
• Connection ID
• JDBC connection string in the form jdbc:protocol://host:port/database-name.
• Logs from test connections, located in CloudWatch Logs under /aws-glue/testconnection.
When a job fails, gather the following information:
• Job name
• Job run ID in the form jr_xxxxx.
• Logs from job runs, located in CloudWatch Logs under /aws-glue/jobs.
Reference :
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/troubleshooting-contact-support.html
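The checklist above can be kept as a small lookup so that the right items and log group are collected for each kind of failure. This is a sketch only. The dictionary layout and function name are hypothetical, while the field names and log group paths are the ones listed on the slide.

```python
from typing import List

SUPPORT_INFO = {
    "crawler": {
        "fields": ["Crawler name"],
        "log_group": "/aws-glue/crawlers",
    },
    "test_connection": {
        "fields": ["Connection name", "Connection ID", "JDBC connection string"],
        "log_group": "/aws-glue/testconnection",
    },
    "job": {
        "fields": ["Job name", "Job run ID"],
        "log_group": "/aws-glue/jobs",
    },
}


def checklist(failure_type: str) -> List[str]:
    """Return the items to attach to a support case for the given failure type."""
    info = SUPPORT_INFO[failure_type]
    return info["fields"] + [f"CloudWatch Logs from {info['log_group']}"]
```

For a failed job, `checklist("job")` returns the job name, the job run ID, and the log group to export from.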
Create Support Case - Crawler
Create Support Case - Job
Best Practices for Glue, from My Perspective
Crawler:
Make good use of exclude patterns
If the crawler cannot classify the data correctly, use a custom classifier
ETL:
Enable Glue job metrics
Optimize your code, which requires familiarity with Spark
Thank you!
