
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks


Learning Objectives:
- Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data across dynamically scalable Amazon EC2 instances.
- Learn how using EC2 Spot can significantly reduce the cost of running your clusters.
- Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters.



  1. 1. Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR). Chad Schmutzer, Solutions Architect, EC2 Spot Instances. September 13, 2017. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  4. 4. Learning Objectives • Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data • Learn how using EC2 Spot Instances can significantly reduce the cost of running your clusters • Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters
  5. 5. What We Will Cover • Introduction to Amazon EMR • Introduction to Amazon EC2 Spot Instances • Walk through provisioning an EMR cluster using EMR instance fleets • Brief introduction to AWS Glue • Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore • Q & A
  6. 6. What is Amazon EMR?
  8. 8. [Diagram: stack layers Infrastructure, Data Layer, Process Layer, Framework, Applications; application labels shown: Pig, SQL]
  11. 11. [Diagram: Amazon EMR with YARN, Pig, and SQL, using Amazon S3 through EMRFS]
  12. 12. Why EMR? Easy to Use: launch a cluster in minutes. Low Cost: pay an hourly rate. Open-Source Variety: latest versions of software. Managed: spend less time monitoring. Decouple Storage and Compute. Flexible: customize the cluster.
  14. 14. Why EMR? Managed, Easy to Use, and Current: EC2 provisioning, cluster setup, Hadoop configuration, installing applications, job submission, monitoring and failure handling
  16. 16. Create a Fully Configured Cluster in Minutes! Use the AWS Management Console, the AWS Command Line Interface (CLI), or an AWS SDK directly with the Amazon EMR API. Latest versions!
  17. 17. Amazon EMR Releases
  18. 18. Amazon EMR Release [diagram]. Amazon EMR service. Storage: S3 (EMRFS), HDFS. Cluster resource management: YARN. Processing: Batch (MapReduce), Interactive (Tez), In Memory (Spark), Streaming (Flink). Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto. Tools: Hue (SQL interface/metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC).
  20. 20. Many Storage Layers to Choose From: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3 (all usable from Amazon EMR)
  21. 21. Why EMR? Decouple Storage and Compute: Amazon S3 as the shared data layer, with an external metastore. Persistent cluster: interactive queries (Spark SQL | Presto). Transient cluster: batch jobs (X hours nightly), adding/removing nodes as needed. Workload-specific clusters (different sizes, different versions).
  22. 22. Decouple Storage and Compute by Using S3 as Your Data Layer: S3 is designed for 11 9's of durability and is massively scalable. Intermediates are stored on local disk or HDFS. [Diagram: Amazon EMR clusters (EC2 instance memory, local disk, HDFS) on top of Amazon S3]
  23. 23. HBase on S3 for Scalable NoSQL
  24. 24. S3 Tips: Partitions, Compression, and File Formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data sets to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
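The key-naming tip on this slide (hashed prefixes, or a reversed date-time) can be sketched as follows. This is a minimal illustration only: the function names, the 4-character prefix length, and the `logs/...` key layout are assumptions made for the example and do not come from the talk.

```python
import hashlib
from datetime import datetime


def hashed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash so consecutive keys no longer sort next to each other."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def reversed_timestamp_key(ts: datetime, name: str) -> str:
    """Reverse the date-time string so the fastest-changing digits lead the key."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return f"{stamp[::-1]}/{name}"


if __name__ == "__main__":
    # Hypothetical hourly log objects; the key layout is illustrative.
    for hour in range(3):
        ts = datetime(2017, 9, 13, hour, 0, 0)
        natural = f"logs/{ts:%Y-%m-%d-%H}/part-0000.gz"
        print(hashed_key(natural), "|", reversed_timestamp_key(ts, "part-0000.gz"))
```

Either scheme spreads keys that would otherwise share a long common prefix; the trade-off is that listing objects by date range becomes harder, so the natural key is usually kept after the prefix as shown.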
  26. 26. Cost & Time [two charts plotting # CPUs against time]: wall clock time 1 hour vs. wall clock time 10 hours
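The point of the two charts can be worked through numerically. The sketch below assumes the job scales linearly with instance count and uses a hypothetical hourly rate of $0.40 per instance; neither assumption comes from the slide.

```python
import math


def cluster_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Cost of running `instances` machines for `hours` at a flat hourly rate."""
    return instances * hours * hourly_rate


if __name__ == "__main__":
    rate = 0.40  # hypothetical price per instance-hour
    slow = cluster_cost(instances=10, hours=10, hourly_rate=rate)
    fast = cluster_cost(instances=100, hours=1, hourly_rate=rate)
    print(f"10 instances for 10 hours: ${slow:.2f}")
    print(f"100 instances for 1 hour:  ${fast:.2f}")
    print("Same cost:", math.isclose(slow, fast))
```

Under these assumptions the instance-hours are identical, so the larger transient cluster finishes ten times sooner for the same spend.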
  27. 27. Why EMR? Low cost: transient clusters, Reserved Instances, Spot Instances
  29. 29. Why EMR? Flexibility. Compute (C4 family, C3 family): machine learning. Memory (X1 family, R3 family): interactive analysis. Storage (D2 family, I2 family): large HDFS. General (M4 family, M3 family): batch processing.
  30. 30. EMR Nodes - Customizable. An EMR cluster has a master instance group, a core instance group (HDFS), and a task instance group. The master node must keep running. Core nodes can be added and removed gracefully. The cluster can tolerate loss of task nodes.
  31. 31. Performance Tuning - Speed and Cost. Considerations: • Transient or long running • Instance types • Cluster size • Application settings • File formats and S3 tuning. Example layout: Master Node r3.2xlarge; Slave Group - Core c4.2xlarge; Slave Group - Task m4.2xlarge (EC2 Spot)
  33. 33. Use Spot and Reserved Instances to Lower Cost. On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity; meet SLA at predictable cost. Spot for task nodes: up to 90% off EC2 On-Demand pricing; exceed SLA at lower cost. Amazon EMR supports most EC2 instance types.
  34. 34. Instance Fleets for Advanced Spot Provisioning (master node, core instance fleet, task instance fleet) • Provision from a list of instance types with Spot and On-Demand • Launch in the optimal Availability Zone based on capacity/price • Spot Block support
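As a rough sketch of what an instance fleet request looks like, the snippet below builds the `InstanceFleets` list in the shape accepted by the EMR `RunJobFlow` API (for example via boto3's `run_job_flow(Instances={"InstanceFleets": [...]})`). It only constructs and prints the structure; it does not call AWS. The helper names, target capacities, timeout, and the particular mix of instance types are assumptions chosen for illustration.

```python
import json
from typing import Dict, List


def instance_fleet(
    name: str,
    fleet_type: str,
    instance_types: List[str],
    on_demand_capacity: int = 0,
    spot_capacity: int = 0,
    bid_percent_of_on_demand: float = 100.0,
    spot_timeout_minutes: int = 20,
) -> Dict:
    """Build one entry of the InstanceFleets list for an EMR RunJobFlow request."""
    type_configs = []
    for instance_type in instance_types:
        config = {"InstanceType": instance_type, "WeightedCapacity": 1}
        if spot_capacity > 0:
            # Spot bid expressed as a percentage of the On-Demand price.
            config["BidPriceAsPercentageOfOnDemandPrice"] = bid_percent_of_on_demand
        type_configs.append(config)

    fleet = {
        "Name": name,
        "InstanceFleetType": fleet_type,  # MASTER, CORE, or TASK
        "TargetOnDemandCapacity": on_demand_capacity,
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": type_configs,
    }
    if spot_capacity > 0:
        # Fall back to On-Demand if Spot capacity is not fulfilled in time.
        fleet["LaunchSpecifications"] = {
            "SpotSpecification": {
                "TimeoutDurationMinutes": spot_timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        }
    return fleet


def example_fleets() -> List[Dict]:
    """Hypothetical layout: On-Demand master and core, diversified Spot task fleet."""
    return [
        instance_fleet("master", "MASTER", ["r3.2xlarge"], on_demand_capacity=1),
        instance_fleet("core", "CORE", ["c4.2xlarge", "c3.2xlarge"], on_demand_capacity=2),
        instance_fleet(
            "task",
            "TASK",
            ["m4.2xlarge", "m3.2xlarge", "c4.2xlarge", "r3.2xlarge"],
            spot_capacity=8,
        ),
    ]


if __name__ == "__main__":
    print(json.dumps(example_fleets(), indent=2))
```

Listing several instance types in the task fleet is what lets EMR pick whichever Spot pools have capacity, which is the diversification idea the talk returns to later.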
  35. 35. What are Amazon EC2 Spot Instances?
  36. 36. AWS EC2 Consumption Models. On-Demand: pay for compute capacity by the hour with no long-term commitments; for spiky workloads, or to define needs. Reserved: make a low, one-time payment and receive a significant discount on the hourly charge; for committed utilization. Spot Market: bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand; for time-insensitive, transient, or stateless workloads.
  37. 37. Spare Capacity at Scale AWS has millions of active customers every month, including more than 2,300 government agencies, 7,000 education institutions and more than 22,000 nonprofit organizations that have used AWS in the last 12 months.
  38. 38. What Are EC2 Spot Instances? EC2 Spot instances are spare EC2 On-Demand capacity with very simple rules…
  42. 42. The Very Simple Rules of Spot Instances: Spot Instances run in markets where the price of compute changes based on supply and demand. You'll never pay more than your bid. When the market exceeds your bid, you get 2 minutes to wrap up your work.
  47. 47. Get the Best Value for EC2 Capacity • Since Spot Instances typically cost 50-90% less than On-Demand, you can: • Increase your compute capacity by 2-10x within the same budget. • Save 50-90% on your existing workload. • Or both! • Either way, you should try it!
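The relationship between the 50-90% discount and the 2-10x capacity figure is simple arithmetic: at a discount d, the same budget buys 1 / (1 - d) times as much capacity. A small sketch, with a hypothetical function name:

```python
def capacity_multiplier(discount: float) -> float:
    """How much more capacity the same budget buys at a given fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return 1.0 / (1.0 - discount)


if __name__ == "__main__":
    for discount in (0.5, 0.75, 0.9):
        print(f"{discount:.0%} off -> {capacity_multiplier(discount):.1f}x capacity")
```

So a 50% discount doubles capacity for the same spend, and a 90% discount gives roughly ten times the capacity.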
  48. 48. Understanding EC2 Capacity [diagram: total capacity in a region (N. California) across AZ1 and AZ2, split into shared and dedicated tenancy, by instance family (P2, C4, M4, I3, R4, D2) and size (x, 2x, 4x)]
  54. 54. Capacity and Spot Markets Recap (C4, us-east-2). On-Demand price vs. Spot price per Availability Zone (2b / 2c / 2a): 8XL $1.76 vs. $0.27 / $0.29 / $0.50; 4XL $0.88 vs. $0.30 / $0.16 / $0.21; 2XL $0.44 vs. $0.07 / $0.08 / $0.08; XL $0.22 vs. $0.05 / $0.04 / $0.04; L $0.11 vs. $0.01 / $0.04 / $0.01. • Each instance family • Each instance size • Each Availability Zone • In every region • Is a separate Spot Market
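The idea that every combination of family, size, Availability Zone, and region is its own Spot market can be illustrated by enumerating the combinations. This is a sketch only: the `Pool` tuple, the function names, and the prices in the usage example are hypothetical and are not taken from the slide's table.

```python
from itertools import product
from typing import Dict, List, Tuple

# (region, availability zone, instance type)
Pool = Tuple[str, str, str]


def spot_pools(region: str, zones: List[str], families: List[str], sizes: List[str]) -> List[Pool]:
    """Enumerate every (zone, family, size) combination as a separate Spot pool."""
    return [
        (region, f"{region}{zone}", f"{family}.{size}")
        for zone, family, size in product(zones, families, sizes)
    ]


def cheapest_pool(prices: Dict[Pool, float]) -> Pool:
    """Return the pool with the lowest current price."""
    return min(prices, key=prices.get)


if __name__ == "__main__":
    pools = spot_pools(
        "us-east-2",
        zones=["a", "b", "c"],
        families=["c4"],
        sizes=["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"],
    )
    print(len(pools), "separate Spot markets for one family in one region")

    # Illustrative prices for three pools only.
    sample_prices = {pools[0]: 0.03, pools[5]: 0.02, pools[10]: 0.04}
    print("Cheapest of the sample:", cheapest_pool(sample_prices))
```

One family in one three-zone region already yields 15 pools, which is why spreading a request across several instance types and zones gives so many independent sources of capacity.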
  58. 58. Bid Price vs. Market Price [chart: market price over time, with 25%, 50%, and 75% bid levels]. You pay the market price. Keep it simple and just bid 100% of the On-Demand price!
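The rules stated in the talk (you pay the market price, never more than your bid, and the instance is interrupted once the market exceeds the bid) can be sketched as a small simulation. The hourly price series and the function name are hypothetical; real Spot billing has more detail than this models.

```python
from typing import List, Tuple


def simulate_spot(bid: float, hourly_market_prices: List[float]) -> Tuple[float, int]:
    """Return (total_cost, hours_run), stopping when the market price exceeds the bid."""
    total_cost = 0.0
    hours_run = 0
    for price in hourly_market_prices:
        if price > bid:
            break  # market exceeded the bid: the instance is interrupted
        total_cost += price  # charged the market price, not the bid
        hours_run += 1
    return total_cost, hours_run


if __name__ == "__main__":
    on_demand = 0.44  # hypothetical On-Demand price per hour
    market = [0.07, 0.08, 0.20, 0.08, 0.30, 0.07]  # hypothetical hourly Spot prices
    for percent in (25, 50, 100):
        cost, hours = simulate_spot(on_demand * percent / 100, market)
        print(f"{percent:>3}% bid: ran {hours} h, paid ${cost:.2f}")
```

In this toy series the higher bid never raises the price paid per hour; it only lets the instance survive the price spikes, which is the argument behind bidding 100% of On-Demand.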
  59. 59. EC2 Spot Instance Best Practices - Diversification • Multiple EC2 instance types selected • Multiple Availability Zones selected • Pick instance types with similar performance characteristics. For example: c3.large, m3.large, r3.large, c4.large, m4.large, r4.large…
  60. 60. Amazon EC2 Spot Bid Advisor • We make this easy using the Spot bid advisor • With deliberate pool selection and bidding, you will keep your Spot instance as long as you need to
  63. 63. EC2 Spot Advisor in Console (New!)
  65. 65. Example Customer Use Case
  66. 66. FINRA: Migrating From On-Prem to AWS. Petabytes of data generated on-premises, brought to AWS, and stored in S3. Thousands of analytical queries performed on EMR and Amazon Redshift. Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing. [Diagram labels: Flexible Interactive Queries, Predefined Queries, Surveillance Analytics; Data Management, Data Movement, Data Registration, Version Management; Amazon S3; Web Applications; Analysts; Regulators]
  67. 67. Lower Cost and Higher Scale Than On-Premises
  68. 68. FINRA Saved 60% by Moving to HBase on EMR
  69. 69. Walk through provisioning an EMR cluster using EMR instance fleets (Console and CLI)
  70. 70. What is AWS Glue?
  71. 71. AWS Glue: Fully Managed Data Catalog & ETL Service. Integrates with AWS/non-AWS data stores. Scalable. No admin. Learn more: https://aws.amazon.com/glue/
  72. 72. AWS Glue - Fully Managed ETL Service. Glue automates data cataloging & preparation • Catalogs data sources • Identifies data formats and data types • Generates Extract, Transform, Load code • Executes ETL jobs, managing dependencies
  74. 74. Use an External Metastore: use the AWS Glue Data Catalog to store external table metadata for Hive and Spark (data in Amazon S3). Set the metastore location in hive-site.
  75. 75. Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore (Console and CLI)
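The slides do not show the configuration used in this walkthrough. As a hedged sketch, the EMR configuration classifications below are the documented way to point Spark SQL (`spark-hive-site`) and Hive (`hive-site`) at the AWS Glue Data Catalog; the helper function name is hypothetical, and the script only prints JSON that could be saved to a file and supplied when creating a cluster.

```python
import json
from typing import Dict, List

# Client factory class that makes Hive/Spark use the Glue Data Catalog.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"


def glue_metastore_configurations(include_hive: bool = True) -> List[Dict]:
    """Build EMR configuration objects that set the metastore client factory."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = [
        {"Classification": "spark-hive-site", "Properties": dict(properties)}
    ]
    if include_hive:
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(properties)}
        )
    return configurations


if __name__ == "__main__":
    # The printed JSON can be written to a file and passed as the cluster's
    # configurations at creation time.
    print(json.dumps(glue_metastore_configurations(), indent=2))
```

Because the table metadata then lives in the Glue Data Catalog rather than on the cluster, transient and workload-specific clusters can share the same table definitions over data in S3.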
  76. 76. Q & A
  77. 77. Thank you!
  78. 78. Appendix
  79. 79. Reference links. EC2 Spot documentation: http://aws.amazon.com/ec2/spot/ ; http://aws.amazon.com/ec2/spot/bid-advisor/ ; http://aws.amazon.com/ec2/spot/getting-started/ ; http://aws.amazon.com/ec2/spot/faqs/ ; http://aws.amazon.com/ec2/spot/testimonials/ . User guides: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html ; http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html ; http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html . Helpful AWS blog posts: https://aws.amazon.com/blogs/aws/focusing-on-spot-instances-lets-talk-about-best-practices/ ; https://aws.amazon.com/blogs/aws/building-price-aware-applications-using-ec2-spot-instances/ ; https://aws.amazon.com/blogs/compute/cost-effective-batch-processing-with-amazon-ec2-spot/ ; https://aws.amazon.com/blogs/compute/dynamic-scaling-with-ec2-spot-fleet/
