Athena & Glue

© 2019, Amazon Web Services, Inc. or its Affiliates.
AWS Cloud Support Engineer
ChiaWei Hsu

Amazon Athena

Amazon Athena uses Presto with ANSI SQL support and works with a variety
of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.
Reference :
https://aws.amazon.com/athena/
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
https://docs.aws.amazon.com/general/latest/gr/rande.html

Athena Common Issue
• 503 Slow Down
Cause: the S3 bucket is receiving too many requests per second. S3 supports
at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second
per prefix in a bucket. If the request rate on a prefix exceeds this rate,
S3 throttles the requests with the 503 Slow Down error.
• Solution: combine small files into fewer, larger files so that each query
issues fewer requests
Reference:
https://docs.aws.amazon.com/zh_tw/AmazonS3/latest/dev/optimizing-performance.html
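As a rough illustration of the "combine small files" advice, the sketch below only plans which small objects could be merged together; it does not read or write S3. The helper name `plan_merges`, the file names, and the 128 MB target are assumptions made for this example, not values from the slides.

```python
def plan_merges(files, target_bytes):
    """Group (name, size) pairs into batches whose total size stays at or below target_bytes.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(files):
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Hypothetical input: 40 objects of 8 MB each under one prefix.
small_files = [(f"logs/part-{i:04d}.json", 8 * MB) for i in range(40)]
batches = plan_merges(small_files, target_bytes=128 * MB)
print([len(batch) for batch in batches])
```

Each batch would then be rewritten as one larger object by whatever tool produces the data (for example a Glue ETL job), so a query touches 3 objects instead of 40.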

Athena Common Issue
• Athena can only query data stored in S3
• JSON data must be in single-line format: one complete record per line
• Other errors
Reference :
https://aws.amazon.com/athena/
https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
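The single-line JSON requirement can be met by rewriting pretty-printed JSON as one record per line before uploading it. The following is a minimal sketch using only the Python standard library; the sample records and the helper name `to_json_lines` are made up for illustration.

```python
import json

# Hypothetical pretty-printed input: a JSON array spread over several lines.
pretty = """[
  {"adId": "a1", "ip": "10.0.0.1"},
  {"adId": "a2", "ip": "10.0.0.2"}
]"""


def to_json_lines(text):
    """Return the records in `text` as newline-delimited JSON, one record per line."""
    records = json.loads(text)
    if isinstance(records, dict):
        # A single top-level object becomes a single output line.
        records = [records]
    return "\n".join(json.dumps(record, separators=(",", ":")) for record in records)


ndjson = to_json_lines(pretty)
print(ndjson)
```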

Athena Common Issue – Slow
1. Partition your data
2. Optimize file sizes
....
Partition:
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
Reference :
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
https://docs.aws.amazon.com/zh_tw/athena/latest/ug/partitions.html
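To show why partitioning helps, the sketch below parses the `aws s3 ls` output from this slide into partition values and keeps only those in a time range, which is roughly what partition pruning does before any data is read. The helper `parse_partitions` and the chosen range are assumptions for this example.

```python
listing = """\
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
"""


def parse_partitions(ls_output):
    """Extract {key: value} dicts from 'PRE key=value/' lines of `aws s3 ls` output."""
    partitions = []
    for line in ls_output.splitlines():
        line = line.strip()
        if not line.startswith("PRE "):
            continue
        name = line[len("PRE "):].rstrip("/")
        key, sep, value = name.partition("=")
        if sep:  # Skip prefixes that are not in key=value form.
            partitions.append({key: value})
    return partitions


partitions = parse_partitions(listing)
# The dt values are fixed-width, so plain string comparison orders them by time.
selected = [p["dt"] for p in partitions
            if "2009-04-12-13-05" <= p["dt"] < "2009-04-12-14-00"]
print(selected)
```

Only the prefixes in `selected` would need to be scanned; the rest are skipped entirely, which saves both time and cost.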

Create Table Query
CREATE EXTERNAL TABLE impressions (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string,
    number string,
    processId string,
    browserCookie string,
    requestEndTime string,
    timers struct<modelLookup:string, requestTime:string>,
    threadId string,
    hostname string,
    sessionId string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip')
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

Create Partition Info and Query
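-- Load the dt=... partitions found under the table location into the catalog
MSCK REPAIR TABLE impressions;
-- Filter on the partition column so that only matching partitions are scanned
SELECT dt, impressionid
FROM impressions
WHERE dt >= '2009-04-12-13-00'
  AND dt < '2009-04-12-14-00'
ORDER BY dt DESC
LIMIT 100;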
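MSCK REPAIR TABLE scans the whole table location. When the new partition values are already known, an alternative is to register them explicitly with ALTER TABLE ADD PARTITION. The sketch below only builds the statement text; the helper name is hypothetical, and it assumes trusted input because it does no quoting or escaping.

```python
def add_partition_statement(table, key, values, base_location):
    """Build an ALTER TABLE ... ADD PARTITION statement for Hive-style key=value prefixes."""
    base = base_location.rstrip("/")
    clauses = [
        f"PARTITION ({key}='{value}') LOCATION '{base}/{key}={value}/'"
        for value in values
    ]
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses) + ";"


statement = add_partition_statement(
    "impressions",
    "dt",
    ["2009-04-12-13-00", "2009-04-12-13-05"],
    "s3://elasticmapreduce/samples/hive-ads/tables/impressions/",
)
print(statement)
```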

Athena Common Issue – Other Error
Information to collect for troubleshooting:
1. Query ID
2. Region
3. Sample data

Athena Common Issue – Unexpected Result

Best Practice for Athena
• Top 10 Performance Tuning Tips for Amazon Athena
• Partitioning is the easiest tip to understand and the most effective. It saves both money and time.

AWS Glue

Glue
https://docs.aws.amazon.com/zh_tw/athena/latest/ug/glue-athena.html

Glue Common Issue
• Crawler is too slow
• ETL job is too slow
• ETL job runs out of memory (OOM)
• Crawler/ETL job fails
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html

Glue Common Issue – Crawler too slow
• The crawler lists every prefix in S3 and then decides whether to read each one
• The more data there is, the slower the crawl
• Solution: use exclude patterns to skip paths the crawler does not need to read
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html
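Crawler exclude patterns are glob patterns relative to the include path, where `*` stays within one folder level and `**` crosses folder boundaries. The sketch below approximates that matching so patterns can be tried out locally. It is not Glue's own matcher: the helpers are hypothetical, bracket and brace expressions are not handled, and the keys are invented.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified glob: '**' crosses '/', while '*' and '?' do not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, exclude_patterns):
    """Return True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in exclude_patterns)


keys = ["data/2019/part-0000.json", "data/2019/_SUCCESS",
        "tmp/scratch.json", "readme.txt"]
patterns = ["tmp/**", "**/_SUCCESS", "*.txt"]
kept = [key for key in keys if not is_excluded(key, patterns)]
print(kept)
```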

Glue Common Issue – ETL Job too slow
• A Glue ETL job runs in an Apache Spark environment, so more information
has to be gathered before it can be troubleshot.
• Glue ETL is designed for batch jobs

Glue Common Issue
• ETL Driver OOM
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html

Glue Common Issue
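• ETL Driver OOM: possible reasons
1. Listing too many files
2. rdd.collect(), a Spark action that pulls the entire dataset into driver memory
• Solution
Use push down predicates so that only the needed partitions are listed and read
Process the data in batches
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/aws-glue-programming-etl-partitions.html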
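A push down predicate is a filter on partition columns that Glue applies before listing and reading files, which reduces the work done on the driver. The sketch below builds such a predicate string; the helper name is hypothetical, and the commented Glue call uses placeholder database and table names and only runs inside a Glue job.

```python
def build_push_down_predicate(column, start, end):
    """Return a predicate selecting partitions with start <= column < end."""
    return f"{column} >= '{start}' and {column} < '{end}'"


predicate = build_push_down_predicate("dt", "2009-04-12-13-00", "2009-04-12-14-00")
print(predicate)

# Inside a Glue job, the string would be passed along these lines
# ("my_database" is a placeholder; glueContext is provided by the job):
#
# frame = glueContext.create_dynamic_frame.from_catalog(
#     database="my_database",
#     table_name="impressions",
#     push_down_predicate=predicate,
# )
```

Running the job once per time window with a different predicate is one way to apply the "batch" advice as well.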

Glue Common Issue - ETL Executor OOM
• Possible root causes
1. rdd.repartition(), a Spark transformation that triggers a full shuffle
2. Data skew
• Solution
1. Check the source data
2. Use a different worker type
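"Check the source data" often means looking for data skew, where one key holds most of the records and so overloads a single executor. The following is a minimal sketch on a small sample of keys; the helper name, the 50% threshold, and the sample values are assumptions for illustration.

```python
from collections import Counter


def skew_report(keys, threshold=0.5):
    """Return (most common key, its share of all records, whether the share exceeds threshold)."""
    counts = Counter(keys)
    if not counts:
        return None, 0.0, False
    total = sum(counts.values())
    key, count = counts.most_common(1)[0]
    share = count / total
    return key, share, share > threshold


# Hypothetical sample of a join or group-by key taken from the source data.
sample_keys = ["ad-1"] * 8 + ["ad-2", "ad-3"]
hot_key, share, skewed = skew_report(sample_keys)
print(hot_key, share, skewed)
```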

Glue Worker Type
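The following worker types are available:
Standard – When you choose this type, you also provide a value for Maximum
capacity. Maximum capacity is the maximum number of AWS Glue data processing
units (DPUs) that can be allocated when this job runs. A DPU is a relative
measure of processing power that consists of 4 vCPUs of compute capacity and
16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB disk) and
provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 2 DPUs (8 vCPUs, 32 GB of memory, 128 GB disk)
and provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs and jobs that run ML transforms.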
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-job.html
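One way to see why a different worker type can help with executor OOM is to divide each worker's memory by its executor count, using the figures on this slide. The result is only a rough upper bound, since Spark and the operating system also need memory; the dictionary layout and function name are assumptions for this sketch.

```python
# Memory and executor counts per worker, taken from the worker type descriptions above.
WORKER_TYPES = {
    "Standard": {"memory_gb": 16, "executors": 2},
    "G.1X": {"memory_gb": 16, "executors": 1},
    "G.2X": {"memory_gb": 32, "executors": 1},
}


def memory_per_executor_gb(worker_type):
    """Rough upper bound on the memory available to one executor, in GB."""
    spec = WORKER_TYPES[worker_type]
    return spec["memory_gb"] / spec["executors"]


for name in WORKER_TYPES:
    print(name, memory_per_executor_gb(name))
```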

Glue Common Issue - EnableJobMetrics
• Metrics

Glue Common Issue – DPU Planning
Reference:
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/monitor-debug-capacity.html
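The capacity guide referenced above estimates DPUs from the job metric for maximum needed executors. The sketch below follows that approach as I understand it for the Standard worker type: one DPU is set aside for the application master, each remaining DPU runs 2 executors, and one executor slot hosts the Spark driver. Treat these constants and the sample inputs as assumptions to verify against the guide and your own job metrics.

```python
import math


def max_allocated_executors(dpus, executors_per_dpu=2):
    """Executors available to a job: one DPU is reserved, one slot goes to the driver."""
    return (dpus - 1) * executors_per_dpu - 1


def recommended_dpus(current_dpus, max_needed_executors, executors_per_dpu=2):
    """Scale the current capacity by the ratio of needed to allocated executors."""
    allocated = max_allocated_executors(current_dpus, executors_per_dpu)
    # Add 1 to both counts so that the Spark driver is included in the ratio.
    factor = (max_needed_executors + 1) / (allocated + 1)
    return math.ceil(factor * (current_dpus - 1)) + 1


# Hypothetical inputs: a 10 DPU job whose metrics show 107 maximum needed executors.
print(max_allocated_executors(10))
print(recommended_dpus(10, 107))
```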

Glue Common Issue – Crawler/ETL Job Fail

Glue Common Issue – Crawler Custom Classifier
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-classifier.html

Create Support Case
When a crawler fails, gather the following information:
• Crawler name
• Logs from crawler runs are located in CloudWatch Logs under /aws-glue/crawlers.
When a test connection fails, gather the following information:
• Connection name
• Connection ID
• JDBC connection string in the form jdbc:protocol://host:port/database-name.
• Logs from test connections are located in CloudWatch Logs under /aws-glue/testconnection.
When a job fails, gather the following information:
• Job name
• Job run ID in the form jr_xxxxx.
• Logs from job runs are located in CloudWatch Logs under /aws-glue/jobs.
Reference :
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/troubleshooting-contact-support.html

Best Practice for Glue from My perspective
Crawler:
Make good use of exclude patterns
If the crawler cannot classify the data correctly -> use a custom classifier
ETL:
Enable Glue job metrics
Optimize the code, which requires familiarity with Spark

Thank you!
