
How to Build a Cost-Effective Big Data Environment with Amazon EMR and Athena



Customers who have run analytics, data processing, and data science workloads on Apache Hadoop, Spark, and data warehouse appliances are migrating from on-premises deployments to Amazon EMR to cut costs, increase availability, and improve performance. Amazon EMR is a managed service that lets you process and analyze extremely large datasets using the latest releases of more than 15 open-source frameworks from the Apache Hadoop and Spark ecosystems. This session focuses on identifying the components and workflows in your existing environment and presents best practices for migrating those workloads to Amazon EMR. We cover how to move from HDFS to Amazon S3 as a durable storage layer, and how to lower costs with Amazon EC2 Spot Instances and Auto Scaling. We also demonstrate how to analyze data directly in Amazon S3 with Athena, a serverless service that requires no infrastructure management, to drive costs down further. We close with common tuning tips that speed up the path to production.



  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dickson Yue, Solutions Architect, June 2nd, 2017. How to Build a Cost-Effective Big Data Environment with Amazon EMR and Athena
  2. Agenda
     • Deconstructing your current big data environment
     • Where Amazon Athena and Amazon EMR fit
     • Migrating components to Amazon Athena: tips and tricks
     • Migrating components to Amazon EMR: tips and tricks
     • Customer examples
  3. Deconstructing your existing big data environment
  4. The evolution of data analytics platforms: data warehouse appliances (1985), Hadoop clusters (2006), decoupled EMR clusters (2009), cloud data warehouses such as Redshift (2012), and today, clusterless services such as Athena and Glue.
  5. Reference architecture (diagram): COLLECT from devices, sensors and IoT platforms, web and mobile apps, data centers, and logging (AWS IoT, AWS Direct Connect, AWS Import/Export Snowball, Amazon CloudWatch, AWS CloudTrail); STORE streams, messages, files, records, and documents (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon SQS, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon ElastiCache, Amazon Elasticsearch Service); PROCESS / ANALYZE with stream, interactive, message, batch, and ML engines (Amazon EMR, Presto, AWS Lambda, Amazon Kinesis Analytics, Amazon Athena, Amazon Redshift, Amazon Machine Learning); CONSUME through apps and services, notebooks, IDEs, APIs, and visualization (Amazon QuickSight).
  6. Use cases and matching tools:
     • Interactive (seconds; e.g. self-service dashboards): Redshift, Athena + S3, Presto, Spark + S3, Amazon Elasticsearch Service, RDS
     • Batch (minutes to hours; e.g. daily/weekly/monthly reports): MapReduce, Hive, Pig, Spark, Glue
     • Stream (milliseconds to seconds; e.g. fraud alerts, 1-minute metrics): Spark Streaming, Kinesis Analytics, KCL, Storm, Lambda
     • Machine learning (milliseconds to minutes; e.g. fraud detection, demand forecasting): Spark ML, Amazon Machine Learning, Deep Learning AMI
  7. Migrating workloads to Amazon Athena
  8. Query data directly from Amazon S3 (see the sketch below)
     • No data loading required
     • Query data in its raw format
     • Athena supports many data formats: text, CSV, TSV, JSON, weblogs, AWS service logs
     • Or convert to an optimized format such as ORC or Parquet for the best performance and lowest cost
     • No ETL required
     • Stream data directly into Amazon S3
     • Inherit the durability and availability of Amazon S3
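To make "query in place" concrete, here is a minimal sketch of running an Athena query over data already sitting in S3 with boto3. The bucket, database, and table names are hypothetical; start_query_execution and its parameters are the real API.

    # Minimal Athena query over data in place on S3 (boto3).
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString=(
            "SELECT requesturl, count(*) AS hits "
            "FROM weblogs GROUP BY requesturl ORDER BY hits DESC LIMIT 10"
        ),
        QueryExecutionContext={"Database": "mydatabase"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://mybucket/athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll get_query_execution() for status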
  9. Examples
  10. Example: ad-hoc access to raw data using SQL
  11. Example: ad-hoc access to data using Athena. Athena can query aggregated datasets as well.
  12. Tips and tricks
  13. Pay per query: $5 per TB scanned
     • You pay for the amount of data each query scans
     • Ways to save: compress your data, convert it to a columnar format, and use partitioning
     • Free: DDL queries and failed queries

     Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
     Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
     Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
     Savings | 87% less storage with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
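The pricing model behind the table reduces to simple arithmetic: cost = terabytes scanned times $5. A quick sketch reproducing the slide's own numbers:

    # Back-of-the-envelope Athena cost: $5 per TB of data scanned.
    PRICE_PER_TB = 5.0

    def athena_cost(bytes_scanned):
        return bytes_scanned / 1024**4 * PRICE_PER_TB

    text_scan = 1.15 * 1024**4      # 1.15 TB scanned over raw text files
    parquet_scan = 2.69 * 1024**3   # 2.69 GB scanned over Parquet

    print(round(athena_cost(text_scan), 2))     # 5.75
    print(round(athena_cost(parquet_scan), 3))  # 0.013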
  14. Converting to ORC and Parquet
     • You can convert data with a Hive CTAS query:

       CREATE TABLE new_key_value_store
       STORED AS PARQUET
       AS
       SELECT col_1, col2, col3 FROM noncolumartable
       SORT BY new_key, key_value_pair;

     • You can also use Spark to convert files to Parquet or ORC (a sketch of the general shape follows below)
     • About 20 lines of PySpark, run on EMR, converted 1 TB of text data into 130 GB of Parquet
     • Total cost of the quick conversion: about $5
     https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
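The linked blog post contains the full conversion job; the sketch below only shows its general shape, assuming hypothetical S3 paths and a CSV layout that carries a year column, not the exact code from the post.

    # PySpark sketch: convert raw text/CSV logs on S3 into partitioned Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

    # Read raw CSV from S3 (hypothetical path; schema inferred for brevity).
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://mybucket/raw-logs/"))

    # Write back as Parquet, partitioned by an assumed 'year' column so the
    # Athena partitioning DDL on the next slides can pick it up directly.
    (df.write
       .mode("overwrite")
       .partitionBy("year")
       .parquet("s3://mybucket/parquet-logs/"))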
  15. How to define your partitions

       CREATE EXTERNAL TABLE Employee (
         Id INT,
         Name STRING,
         Address STRING
       )
       PARTITIONED BY (year INT)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LOCATION 's3://mybucket/athena/inputdata/';

     Note that the partition column appears only in the PARTITIONED BY clause. The variant below, which repeats it in the column list, is invalid: Hive and Athena treat partition columns as virtual columns that must not be part of the table schema.

       CREATE EXTERNAL TABLE Employee (
         Id INT,
         Name STRING,
         Address STRING,
         Year INT
       )
       PARTITIONED BY (year INT)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LOCATION 's3://mybucket/athena/inputdata/';
  16. How to define your partitions (continued): if your S3 prefixes already follow Hive's key=value layout, Athena can load them all at once.

       s3://elasticmapreduce/impressions/
         PRE dt=2009-04-12-13-00/
         PRE dt=2009-04-12-13-05/
         PRE dt=2009-04-12-13-10/
         PRE dt=2009-04-12-13-15/
         PRE dt=2009-04-12-13-20/

       CREATE EXTERNAL TABLE impressions (
         requestBeginTime string, ......)
       PARTITIONED BY (dt string)
       LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

     After a new prefix such as dt=2009-04-12-14-10/ lands, pick up every partition in one statement:

       MSCK REPAIR TABLE impressions;
  17. How to define your partitions (continued): when prefixes do not follow the key=value convention, add each partition explicitly.

       s3://athena-examples/elb/plaintext/
         elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
         elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
         elb/plaintext/2015/01/01_$folder$
         elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
         elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
         elb/plaintext/2015/01/02/_$folder$

       ALTER TABLE elb_logs_raw_native_part
         ADD PARTITION (year='2015', month='01', day='01')
         LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';

       ALTER TABLE elb_logs_raw_native_part
         ADD PARTITION (year='2015', month='01', day='02')
         LOCATION 's3://athena-examples/elb/plaintext/2015/01/02/';
  18. Migrating workloads to Amazon EMR
  19. Challenges
  20. On-premises Hadoop clusters
     • 1U servers, typically 12 cores, 32/64 GB RAM, and 6 to 8 TB of disk (about $3,000 to $4,000 each)
     • Different node roles
     • HDFS uses local disks, so capacity must be sized for 3x data replication
     • Network switches and racks
     • Open-source releases or licensed commercial distributions
     (Diagram: server racks 1 through N with 20 nodes each, connected through a core switch layer)
  21. Workload types running on the same cluster
     • Large-scale ETL: Apache Spark, Apache Hive with Apache Tez or Apache Hadoop MapReduce
     • Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
     • Machine learning and data science: Spark ML, Apache Mahout
     • NoSQL: Apache HBase
     • Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
     • Search: Elasticsearch, Apache Solr
     • Job submission: client edge node, Apache Oozie
     • Data warehouses such as Pivotal Greenplum or Teradata
  22. Production pipelines (utilization chart): fixed clusters swing between over-utilized peaks and under-utilized idle time.
  23. Tips and tricks
  24. Migration keys and TCO considerations
     • Do not lift and shift
     • Separate storage and compute by using S3
     • Deconstruct workloads and map them to open-source tools
     • Use transient clusters and auto scaling
     • Choose the right instance types and EC2 Spot Instances
  25. Decouple compute and storage: use S3 as your data layer. (Diagram: instead of HDFS on local disks, EMR clusters read from and write to Amazon S3, which is designed for 11 9's of durability and is massively scalable; intermediate results stay on local disk or HDFS, with hot data in EC2 instance memory.)
  26. Run HBase on S3 as a scalable NoSQL store (see the configuration sketch below)
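The documented way to root HBase on S3 in EMR is the hbase.emr.storageMode and hbase.rootdir settings passed through the configuration API. A sketch with hypothetical names and sizes:

    # Launch an EMR cluster running HBase with its root directory on S3.
    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="hbase-on-s3",                  # hypothetical cluster name
        ReleaseLabel="emr-5.6.0",
        Applications=[{"Name": "HBase"}],
        Configurations=[
            {"Classification": "hbase",
             "Properties": {"hbase.emr.storageMode": "s3"}},
            {"Classification": "hbase-site",
             "Properties": {"hbase.rootdir": "s3://mybucket/hbase"}},  # hypothetical bucket
        ],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )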
  27. S3 tips: partitioning, compression, and file formats
     • Avoid lexicographically sorted key names; a hashed/random prefix or a reversed date-time improves throughput and S3 LIST performance (see the sketch below)
     • Compress datasets to reduce bandwidth between S3 and EC2; make sure the compression is splittable, or keep each file at an optimal size for parallelism on the cluster
     • Columnar file formats such as Parquet improve read performance
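As an illustration of the prefix advice (which mattered for S3 performance at the time of this talk), here is a hypothetical helper that spreads keys across hash prefixes instead of writing them in lexicographic date order:

    # Prefix S3 keys with a short hash so writes spread across key ranges.
    import hashlib

    def hashed_key(date_str, filename):
        # First 4 hex chars of an MD5 digest act as a pseudo-random prefix.
        prefix = hashlib.md5(f"{date_str}/{filename}".encode()).hexdigest()[:4]
        return f"{prefix}/{date_str}/{filename}"

    # Prints something like 'ab12/2017-06-02/impressions-00001.gz'
    # instead of the lexicographic '2017-06-02/impressions-00001.gz'.
    print(hashed_key("2017-06-02", "impressions-00001.gz"))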
  28. Multiple storage layers to choose from: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, and Amazon S3, all accessible from Amazon EMR
  29. TCO: transient versus long-running clusters
  30. Options for submitting jobs
     • Amazon EMR Step API: submit a Spark application to the cluster (a sketch follows this list)
     • AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or build complex workflows
     • AWS Lambda: submit applications to the EMR Step API, or directly to Spark on your cluster
     • Apache Oozie on your cluster: build DAGs of jobs
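As a concrete example of the first option, a sketch of submitting a Spark application through the EMR Step API with boto3; the cluster ID and S3 path are placeholders:

    # Submit a Spark application to a running EMR cluster via the Step API.
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
        Steps=[{
            "Name": "nightly-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's built-in command runner
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://mybucket/jobs/etl.py"],  # hypothetical script
            },
        }],
    )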
  31. Cluster interfaces for adjusting workloads quickly: management applications (SQL editor, workflow designer, metastore browser) and notebooks for designing and executing queries and workloads
  32. Performance and hardware considerations
     • Transient or long-running clusters
     • Instance types
     • Cluster size
     • Application settings
     • File formats and S3 tuning
     (Example layout: master node r3.2xlarge; core group c4.2xlarge; task group m4.2xlarge on EC2 Spot)
  33. Lower costs with Spot and Reserved Instances
     • On-Demand for core nodes: standard Amazon EC2 pricing for on-demand capacity, meeting your SLA at a predictable cost
     • Spot for task nodes: up to 80% off EC2 On-Demand pricing, exceeding your SLA at lower cost
  34. Use advanced Spot options with instance fleets (see the sketch below)
     • Choose instance types offered as both Spot and On-Demand
     • EMR launches in the optimal Availability Zone based on capacity and price
     • Spot Block support
     (Layout: master node, core instance fleet, task instance fleet)
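A sketch of what such a fleet request can look like with boto3: On-Demand master and core capacity plus a Spot task fleet spread across two instance types. The types mirror the hardware slide above; the counts and weights are illustrative.

    # EMR instance fleets: On-Demand core capacity, Spot task capacity.
    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="fleet-example",
        ReleaseLabel="emr-5.6.0",
        Instances={
            "InstanceFleets": [
                {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
                 "InstanceTypeConfigs": [{"InstanceType": "r3.2xlarge"}]},
                {"InstanceFleetType": "CORE", "TargetOnDemandCapacity": 2,
                 "InstanceTypeConfigs": [{"InstanceType": "c4.2xlarge"}]},
                # Task fleet: EMR fills capacity with the best-priced Spot mix.
                {"InstanceFleetType": "TASK", "TargetSpotCapacity": 4,
                 "InstanceTypeConfigs": [
                     {"InstanceType": "m4.2xlarge", "WeightedCapacity": 1},
                     {"InstanceType": "m4.4xlarge", "WeightedCapacity": 2},
                 ]},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )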
  35. Reduce costs with auto scaling (a policy sketch follows)
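A sketch of an automatic scaling policy attached to a task instance group, scaling out when available YARN memory runs low; the IDs, bounds, and threshold are hypothetical:

    # Attach an auto scaling policy to an EMR task instance group.
    import boto3

    emr = boto3.client("emr")
    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",       # placeholder IDs
        InstanceGroupId="ig-XXXXXXXXXXXX",
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
            "Rules": [{
                "Name": "ScaleOutOnLowYarnMemory",
                "Action": {"SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 2,   # add two instances at a time
                    "CoolDown": 300,
                }},
                "Trigger": {"CloudWatchAlarmDefinition": {
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "ComparisonOperator": "LESS_THAN",
                    "Threshold": 15.0,
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Unit": "PERCENT",
                }},
            }],
        },
    )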
  36. Customer examples
  37. DataXu: 180 TB of log data per day. (Architecture diagram: a CDN, real-time bidding, and a retargeting platform feed Amazon Kinesis; AWS Data Pipeline drives ETL with Spark SQL into S3; attribution and machine learning, reporting, and data visualization run on top, with Amazon Athena among the ecosystem of tools and services.)
  38. FINRA: migrating from on-premises to AWS
     • Petabytes of data generated on-premises, brought to AWS, and stored in S3
     • Thousands of analytical queries performed on EMR and Amazon Redshift
     • Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
     (Diagram: data movement, data registration, and version management into Amazon S3; flexible interactive queries, predefined queries, and surveillance analytics served through web applications to analysts and regulators)
  39. FINRA saved 60% by moving to HBase on EMR
  40. Lower Cost and Higher Scale than On-Premises
  41. Summary: choose the right tool for the use case
     (Stack diagram: S3 (EMRFS) and HDFS storage; YARN cluster resource management; batch (MapReduce), interactive (Tez), and in-memory (Spark) engines; applications including Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, and Presto; with Athena, Glue, and Amazon Redshift alongside the cluster.)
     • Low-latency SQL -> Athena, Presto, or Amazon Redshift
     • Data warehousing and reporting -> Spark, Hive, Glue, or Amazon Redshift
     • Management and monitoring -> EMR console or Ganglia metrics
     • HDFS -> S3
     • Notebooks -> Zeppelin or Jupyter (via a bootstrap action)
     • Query console -> Athena, Redshift Spectrum, or Hue
     • Security -> Ranger (CloudFormation template), HiveServer2, or IAM roles
  42. Summary
     • Athena: compress your data, convert it to a columnar format, and use partitioning
     • Amazon EMR: do not lift and shift; decouple storage and compute with S3; use transient clusters, Spot fleet instances, and auto scaling
  43. Thank you! dyue@amazon.com | aws.amazon.com/emr | blogs.aws.amazon.com/bigdata
