3. AWS Athena
Amazon Athena uses Presto with ANSI SQL support and works with a variety
of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.
Reference :
https://aws.amazon.com/athena/
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
https://docs.aws.amazon.com/general/latest/gr/rande.html
4. Athena Common Issue
• 503 Slow Down
The S3 bucket is receiving too many requests per second. S3 supports at
least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per
prefix in a bucket. If the request rate on a prefix exceeds these limits, S3
throttles the requests with the 503 Slow Down error.
• Solution: Combine small files into fewer, larger files so that each query issues fewer S3 requests
Reference:
https://docs.aws.amazon.com/zh_tw/AmazonS3/latest/dev/optimizing-performance.html
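A minimal sketch of how small files could be grouped for merging, assuming the object sizes have already been listed. The function name `plan_merges`, the 128 MB target, and the sample keys are hypothetical and only illustrate the idea; the actual merge step (for example with an ETL job) is not shown.

```python
from typing import Dict, List

TARGET_BYTES = 128 * 1024 * 1024  # assumed target size per merged file


def plan_merges(sizes: Dict[str, int], target: int = TARGET_BYTES) -> List[List[str]]:
    """Group object keys into batches whose combined size stays within `target`.

    An object that is larger than `target` on its own ends up in a batch by itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key in sorted(sizes):
        size = sizes[key]
        # Close the current batch when adding this object would exceed the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical sizes (in bytes) with a small target to keep the example readable.
example = {"a": 60, "b": 60, "c": 30, "d": 200}
merge_plan = plan_merges(example, target=128)
```

Each inner list is a set of objects that could be rewritten as one larger file, which reduces the number of GET requests a query has to issue against the prefix.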
5. Athena Common Issue
• Athena only supports querying data stored in S3
• JSON data must have each record on a single line
• Other errors
Reference :
https://aws.amazon.com/athena/
https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
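As an illustration of the single-line JSON requirement, the sketch below converts a pretty-printed JSON array into one record per line before it is uploaded to S3. The helper name `to_json_lines` and the sample records are assumptions made for this example.

```python
import json
from typing import Any, Iterable


def to_json_lines(records: Iterable[Any]) -> str:
    """Serialize each record as compact JSON on its own line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


# Hypothetical multi-line input of the kind that causes JSON errors in Athena.
pretty = """[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]"""

json_lines = to_json_lines(json.loads(pretty))
```

The resulting text has exactly one JSON object per line, with no enclosing array and no line breaks inside a record.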
6. Athena Common Issue – Slow
1. Partition your data
2. Optimize file sizes
....
Partition:
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
Reference :
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
https://docs.aws.amazon.com/zh_tw/athena/latest/ug/partitions.html
8. Create Partition Info and Query
MSCK REPAIR TABLE impressions;
SELECT dt, impressionid FROM impressions
WHERE dt >= '2009-04-12-13-00' AND dt < '2009-04-12-14-00'
ORDER BY dt DESC LIMIT 100;
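The effect of the partition filter can be sketched in plain Python: because the dt values are fixed-width strings, an ordinary string comparison selects only the partitions in the time range, and only those prefixes need to be read. The function name `prune_partitions` is hypothetical, and the `2009-04-12-14-00` entry is an assumed extra partition added to show one being skipped.

```python
from typing import List


def prune_partitions(partitions: List[str], start: str, end: str) -> List[str]:
    """Return the dt values with start <= dt < end, compared as strings."""
    return [dt for dt in partitions if start <= dt < end]


partitions = [
    "2009-04-12-13-00",
    "2009-04-12-13-05",
    "2009-04-12-13-10",
    "2009-04-12-14-00",  # assumed extra partition, outside the queried range
]

selected = prune_partitions(partitions, "2009-04-12-13-00", "2009-04-12-14-00")
```

This only mirrors the WHERE clause of the query; the real pruning is done by Athena using the partition metadata loaded by MSCK REPAIR TABLE.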
16. Glue Common Issue
• Crawler is too slow
• ETL job is too slow
• ETL job runs out of memory (OOM)
• Crawler/ETL job fails
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html
17. Glue Common Issue – Crawler too slow
• The crawler lists every prefix in S3 and decides whether to read each one
• The more data there is, the slower the crawl
• Solution: Use exclude patterns to skip prefixes that do not need to be crawled
Reference :
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-glue.html
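A rough sketch of where exclude patterns go when a crawler is defined through the API. The helper `crawler_kwargs` and every value passed to it (names, role ARN, bucket path, patterns) are hypothetical; the dictionary layout follows the `Targets` / `S3Targets` / `Exclusions` structure accepted by the boto3 Glue `create_crawler` call, which is not invoked here.

```python
from typing import Any, Dict, List


def crawler_kwargs(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    exclusions: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for a crawler with S3 exclude patterns."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                # Exclusions are glob patterns relative to the include path.
                {"Path": s3_path, "Exclusions": list(exclusions)}
            ]
        },
    }


kwargs = crawler_kwargs(
    name="example-crawler",                                    # hypothetical
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # hypothetical
    database="example_db",                                      # hypothetical
    s3_path="s3://example-bucket/data/",                        # hypothetical
    exclusions=["**.tmp", "_temporary/**"],                     # hypothetical
)

# With AWS credentials configured, this could then be passed on as:
#   boto3.client("glue").create_crawler(**kwargs)
```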
18. Glue Common Issue – ETL Job too slow
• A Glue ETL job runs in an Apache Spark environment. We need to gather more
information to troubleshoot it.
• Glue ETL is designed for batch jobs
19. Glue Common Issue
• ETL Driver OOM
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
20. Glue Common Issue
• ETL Driver OOM: possible reasons
1. Listing too many files
2. rdd.collect() -> Spark action that pulls the whole dataset into driver memory
• Solution
Pushdown predicate
Process the data in batches
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/aws-glue-programming-etl-partitions.html
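A small sketch of the pushdown predicate approach, assuming a catalog table partitioned by a `dt` column like the impressions example. The helper `dt_range_predicate` is hypothetical and only builds the filter string; the Glue call shown in the trailing comment uses the `push_down_predicate` parameter of `create_dynamic_frame.from_catalog` and runs only inside a Glue job.

```python
def dt_range_predicate(start: str, end: str, column: str = "dt") -> str:
    """Build a partition filter selecting start <= column < end."""
    return f"{column} >= '{start}' and {column} < '{end}'"


predicate = dt_range_predicate("2009-04-12-13-00", "2009-04-12-14-00")

# Inside a Glue job (database and table names are assumptions):
#   frame = glue_context.create_dynamic_frame.from_catalog(
#       database="example_db",
#       table_name="impressions",
#       push_down_predicate=predicate,
#   )
```

With the predicate applied, only the matching partitions are listed and read, so the driver does not have to track every file under the table location.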
22. Glue Common Issue - ETL Executor OOM
• Possible root causes
1. rdd.repartition() -> Spark transformation that triggers a full shuffle
2. Data skew
• Solution
1. Check the source data
2. Use a different worker type
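One way to check the source data for skew is to compare the most frequent key with the average key frequency. The sketch below does this on an in-memory sample; the function name `skew_ratio` and the sample keys are assumptions, and on a real dataset the counts would come from a Spark aggregation rather than a Python list.

```python
from collections import Counter
from typing import Hashable, Iterable


def skew_ratio(keys: Iterable[Hashable]) -> float:
    """Return (largest key count) / (mean key count); 1.0 means perfectly even."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


# Hypothetical sample in which key "a" dominates.
sample_keys = ["a"] * 8 + ["b", "c"]
ratio = skew_ratio(sample_keys)
```

A ratio well above 1 suggests that a few keys carry most of the rows, so the executors that receive those keys need far more memory than the rest.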
23. Glue Worker Type
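Worker type: the following worker types are available:
Standard – When you choose this type, you also provide a value for Maximum
capacity. Maximum capacity is the maximum number of AWS Glue data processing
units (DPUs) that can be allocated when this job runs. A DPU is a relative
measure of processing power that consists of 4 vCPUs of compute capacity and
16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB disk) and
provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of
workers. Each worker maps to 2 DPUs (8 vCPUs, 32 GB of memory, 128 GB disk) and
provides 1 executor per worker. We recommend this worker type for
memory-intensive jobs and jobs that run ML transforms.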
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-job.html
29. Glue Common Issue – Crawler Custom Classifier
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/add-classifier.html
30. Create Support Case
When a crawler fails, gather the following information:
• Crawler name
• Logs from crawler runs are located in CloudWatch Logs under /aws-glue/crawlers.
When a test connection fails, gather the following information:
• Connection name
• Connection ID
• JDBC connection string in the form jdbc:protocol://host:port/database-name.
• Logs from test connections are located in CloudWatch Logs under /aws-glue/testconnection.
When a job fails, gather the following information:
• Job name
• Job run ID in the form jr_xxxxx.
• Logs from job runs are located in CloudWatch Logs under /aws-glue/jobs.
Reference :
https://docs.aws.amazon.com/zh_tw/glue/latest/dg/troubleshooting-contact-support.html
33. Best Practice for Glue from My Perspective
Crawler:
Make good use of exclude patterns
If the crawler cannot classify the data correctly -> use a custom classifier
ETL:
Enable Glue job metrics
Optimize the code, which requires familiarity with Spark