Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.


1. Amazon Elastic MapReduce: Deep Dive and Best Practices. Parviz Deyhim. November 13, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
2. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
3. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
4. What is EMR? Hadoop as a service: a massively parallel MapReduce engine, integrated with tools and AWS services; a cost-effective AWS wrapper around Hadoop.
5. [Diagram] Amazon EMR with HDFS
6. [Diagram] Amazon EMR with HDFS, plus Amazon S3 and Amazon DynamoDB as data stores
7. [Diagram] Adds data management and analytics languages on top of Amazon EMR, Amazon S3, and Amazon DynamoDB
8. [Diagram] Adds Amazon RDS alongside Amazon S3 and Amazon DynamoDB
9. [Diagram] Full picture: Amazon EMR with HDFS, analytics languages, and data management across Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline
10. Amazon EMR Introduction • Launch clusters of any size in a matter of minutes • Use a variety of instance sizes that match your workload
11. Amazon EMR Introduction • Don't get stuck with hardware • Don't deal with capacity planning • Run multiple clusters with different sizes, specs, and node types
12. Amazon EMR Introduction • Integration with the Spot market • 70-80% discount
13. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
14. Amazon EMR Design Patterns • Pattern #1: Transient vs. Alive Clusters • Pattern #2: Core Nodes and Task Nodes • Pattern #3: Amazon S3 as HDFS • Pattern #4: Amazon S3 & HDFS • Pattern #5: Elastic Clusters
15. Pattern #1: Transient vs. Alive Clusters
16. Pattern #1: Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Data persists on Amazon S3 • Input & output data on Amazon S3
17. Benefits of Transient Clusters • 1. Control your cost • 2. Minimal maintenance: the cluster goes away when the job is done • 3. Practice cloud architecture: pay for what you use, data processing as a workflow
18. When to use transient clusters? If (data load time + processing time) × number of jobs < 24 hours, use transient clusters; else use alive clusters.
19. When to use transient clusters? (20 min data load + 1 hour processing time) × 10 jobs ≈ 13 hours < 24 hours → use transient clusters (see the sketch below).
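
The rule of thumb from slides 18-19 is easy to script. A minimal sketch (function and parameter names are illustrative, not part of EMR):

    def choose_cluster_type(load_hours, processing_hours, jobs_per_day):
        """Slide rule of thumb: if a day's jobs keep the cluster busy for
        less than 24 hours, use a transient cluster per job; otherwise
        keep the cluster alive."""
        busy_hours = (load_hours + processing_hours) * jobs_per_day
        return "transient" if busy_hours < 24 else "alive"

    # Slide 19: 20 min load + 1 hour processing, 10 jobs/day -> ~13.3 busy hours
    print(choose_cluster_type(20 / 60, 1.0, 10))  # "transient"
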
20. Alive Clusters • Very similar to traditional Hadoop deployments • Cluster stays around after the job is done • Data persistence models: Amazon S3; Amazon S3 copied to HDFS; HDFS with Amazon S3 as backup
21. Alive Clusters • Always keep data safe on Amazon S3, even if you're using HDFS for primary storage • Get in the habit of shutting down your cluster and starting a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow management tools such as AWS Data Pipeline
22. Benefits of Alive Clusters • Ability to share data between multiple jobs [Diagram: transient clusters exchange data through Amazon S3; a long-running cluster shares HDFS across jobs]
23. Benefits of Alive Clusters • Cost effective for repetitive jobs [Diagram: one long-running cluster serves jobs that recur throughout the day]
24. When to use alive clusters? If (data load time + processing time) × number of jobs > 24 hours, use alive clusters; else use transient clusters.
25. When to use alive clusters? (20 min data load + 1 hour processing time) × 20 jobs ≈ 27 hours > 24 hours → use alive clusters.
26. Pattern #2: Core & Task Nodes
27. Core Nodes • Amazon EMR cluster: master instance group + core instance group • Core nodes run TaskTrackers (compute) and DataNodes (HDFS)
28. Core Nodes • You can add core nodes to a running cluster
29. Core Nodes • Adding core nodes = more HDFS space, more CPU/memory
30. Core Nodes • You can't remove core nodes, because they hold HDFS data
31. Amazon EMR Task Nodes • Run TaskTrackers • No HDFS; read from the core nodes' HDFS • Task instance group alongside the core instance group
32. Amazon EMR Task Nodes • You can add task nodes
33. Amazon EMR Task Nodes • Adding task nodes = more CPU power, more memory
34. Amazon EMR Task Nodes • You can remove task nodes
35. Amazon EMR Task Nodes • You can remove task nodes
36. Task Node Use Case #1 • Speed up job processing using the Spot market • Run task nodes on Spot Instances • Get a discount on the hourly price • Nodes can come and go without interruption to your cluster
37. Task Node Use Case #2 • When you need extra horsepower for a short amount of time • Example: need to pull a large amount of data from Amazon S3
38. Example: [Diagram: two HS1 nodes with 48 TB HDFS each, loading from Amazon S3]
39. Example: add Spot task nodes (m1.xlarge) to speed up the data load from Amazon S3
40. Example: remove the Spot task nodes after the data load from Amazon S3 completes
41. Pattern #3: Amazon S3 as HDFS
42. Amazon S3 as HDFS • Use Amazon S3 as your permanent data store • HDFS for temporary storage of data between jobs • No additional step to copy data to HDFS
43. Benefits: Amazon S3 as HDFS • Ability to shut down your cluster (HUGE benefit!) • Use Amazon S3 as your durable storage: 11 nines (99.999999999%) of durability
44. Benefits: Amazon S3 as HDFS • No need to scale HDFS capacity or replication for durability • Amazon S3 scales with your data, both in IOPS and data storage
45. Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters • Hard to do with HDFS [Diagram: two EMR clusters reading the same Amazon S3 bucket]
46. Benefits: Amazon S3 as HDFS • Take advantage of Amazon S3 features • Amazon S3 server-side encryption • Amazon S3 lifecycle policies • Amazon S3 versioning to protect against corruption • Build elastic clusters: add nodes to read from Amazon S3, remove nodes with data safe on Amazon S3
47. What About Data Locality? • Run your job in the same region as your Amazon S3 bucket • Amazon EMR nodes have high-speed connectivity to Amazon S3 • If your job is CPU/memory-bound, data locality doesn't make a difference
48. Anti-Pattern: Amazon S3 as HDFS • Iterative workloads: if you're processing the same dataset more than once • Disk-I/O-intensive workloads
49. Pattern #4: Amazon S3 & HDFS
50. Amazon S3 & HDFS • 1. Data persists on Amazon S3
51. Amazon S3 & HDFS • 2. Launch Amazon EMR and copy data to HDFS with S3DistCp
52. Amazon S3 & HDFS • 3. Start processing the data on HDFS
53. Benefits: Amazon S3 & HDFS • Better pattern for I/O-intensive workloads • The Amazon S3 benefits discussed previously still apply • Durability • Scalability • Cost • Features: lifecycle policies, security
54. Pattern #5: Elastic Clusters
55. Amazon EMR Elastic Cluster (manual) • 1. Start a cluster with a certain number of nodes
56. Amazon EMR Elastic Cluster (manual) • 2. Monitor your cluster with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
57. Amazon EMR Elastic Cluster (manual) • 3. Increase the number of nodes as you need more capacity by manually calling the API (see the monitoring sketch below)
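
As a sketch of step 2, the metrics above can be polled through the CloudWatch API. This assumes the modern boto3 SDK (the talk predates it) and EMR's CloudWatch metrics such as RemainingMapTasks in the AWS/ElasticMapReduce namespace; the cluster ID is borrowed from the S3DistCp example later in the deck, purely for illustration:

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client("cloudwatch")

    # Average map-task backlog over the last 10 minutes for one cluster.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="RemainingMapTasks",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-3GY8JC4179IOK"}],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    backlog = datapoints[-1]["Average"] if datapoints else 0.0
    print("Remaining map tasks:", backlog)
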
58. Amazon EMR Elastic Cluster (automatic) • 1. Start your cluster with a certain number of nodes
59. Amazon EMR Elastic Cluster (automatic) • 2. Monitor cluster capacity with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
60. Amazon EMR Elastic Cluster (automatic) • 3. Get an HTTP Amazon SNS notification to a simple app deployed on Elastic Beanstalk
61. Amazon EMR Elastic Cluster (automatic) • 4. Your app calls the API to add nodes to your cluster (a sketch follows)
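
A minimal sketch of step 4, again assuming boto3 rather than the 2013-era SDK; the instance group ID is hypothetical:

    import boto3

    emr = boto3.client("emr")

    # Grow the task instance group so queued mappers get capacity sooner.
    # "ig-TASKGROUP123" is a placeholder; list your cluster's groups with
    # emr.list_instance_groups(ClusterId=...).
    emr.modify_instance_groups(
        InstanceGroups=[
            {"InstanceGroupId": "ig-TASKGROUP123", "InstanceCount": 20}
        ]
    )
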
62. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
63. Amazon EMR Nodes and Size • Use m1.small, m1.large, c1.medium for functional testing • Use m1.xlarge and larger nodes for production workloads
64. Amazon EMR Nodes and Size • Use CC2 for memory- and CPU-intensive jobs • Use CC2/c1.xlarge for CPU-intensive jobs • HS1 instances for HDFS workloads
65. Amazon EMR Nodes and Size • HI1 and HS1 instances for disk-I/O-intensive workloads • CC2 instances are more cost effective than m2.4xlarge • Prefer a smaller cluster of larger nodes to a larger cluster of smaller nodes
66. Holy Grail Question • How many nodes do I need?
67. How Many Nodes Do I Need? • Depends on how much data you have • And how fast you'd like your data to be processed
68. Introduction to Hadoop Splits • Before we get to Amazon EMR capacity planning, we need to understand the inner workings of Hadoop splits
69. Introduction to Hadoop Splits • Data gets broken up into splits (64 MB or 128 MB)
70. Introduction to Hadoop Splits • Splits get packaged into mappers
71. Introduction to Hadoop Splits • Mappers get assigned to nodes (instances) for processing
72. Introduction to Hadoop Splits • More data = more splits = more mappers
73. Introduction to Hadoop Splits • More data = more splits = more mappers waiting in the queue
74. Introduction to Hadoop Splits • Required mappers > cluster mapper capacity = mappers wait for capacity = processing delay
75. Introduction to Hadoop Splits • More nodes = reduced queue size = faster processing
76. Calculating the Number of Splits for Your Job • Uncompressed files: Hadoop splits a single file into multiple splits. Example: one 128 MB file = 2 splits, based on a 64 MB split size
77. Calculating the Number of Splits for Your Job • Compressed files: 1. Splittable compression (e.g., BZIP): same logic as uncompressed files; a 128 MB BZIP file breaks into 64 MB splits
78. Calculating the Number of Splits for Your Job • Compressed files: 2. Unsplittable compression (e.g., GZIP): the entire file is a single split; a 128 MB GZ file = 1 split
79. Calculating the Number of Splits for Your Job • If data files have unsplittable compression, # of splits = number of files. Example: 10 GZ files = 10 mappers (see the sketch below)
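
A minimal sketch of these split-counting rules (sizes in MB; the 64 MB default split size and the splittable/unsplittable distinction come straight from the slides):

    def count_splits(file_sizes_mb, splittable, split_size_mb=64):
        """Unsplittable compression (e.g., gzip): one split per file.
        Splittable input: ceil(size / split_size) splits per file."""
        if not splittable:
            return len(file_sizes_mb)
        return sum(-(-size // split_size_mb) for size in file_sizes_mb)

    print(count_splits([128], splittable=True))        # 2 splits (slide 76)
    print(count_splits([128] * 10, splittable=False))  # 10 GZ files -> 10 mappers
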
80. Cluster Sizing Calculation • Just tell me how many nodes I need for my job!!
81. Cluster Sizing Calculation • 1. Estimate the number of mappers your job requires
82. Cluster Sizing Calculation • 2. Pick an instance type and note the number of mappers it can run in parallel, e.g., m1.xlarge = 8 mappers in parallel
83. Cluster Sizing Calculation • 3. Pick some sample data files to run a test workload; the number of sample files should be the same as the number from step #2
84. Cluster Sizing Calculation • 4. Run an Amazon EMR cluster with a single core node and process your sample files from step #3; note the amount of time taken to process them
85. Cluster Sizing Calculation • Estimated number of nodes = (total mappers × time to process sample files) / (per-instance mapper capacity × desired processing time)
86. Example: Cluster Sizing Calculation • 1. Estimate the number of mappers your job requires: 150 • 2. Pick an instance type and note the number of mappers it can run in parallel: m1.xlarge, with 8-mapper capacity per instance
87. Example: Cluster Sizing Calculation • 3. Pick sample data files to run a test workload, matching the number from step #2: 8 files selected for our sample test
88. Example: Cluster Sizing Calculation • 4. Run an Amazon EMR cluster with a single core node and process the sample files from step #3, noting the time taken: 3 min to process 8 files
89. Cluster Sizing Calculation • Estimated number of nodes = (total mappers for your job × time to process sample files) / (per-instance mapper capacity × desired processing time) = (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge (a sketch follows)
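
The same calculation as a sketch; note that 150 × 3 / (8 × 5) is 11.25, which the slide rounds to 11 m1.xlarge nodes (rounding up gives a little headroom):

    def estimate_nodes(total_mappers, sample_minutes,
                       mappers_per_node, desired_minutes):
        """(total mappers x time to process sample files) /
        (per-instance mapper capacity x desired processing time)"""
        return (total_mappers * sample_minutes) / (mappers_per_node * desired_minutes)

    print(estimate_nodes(150, 3, 8, 5))  # 11.25 -> ~11 m1.xlarge nodes
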
90. File Size Best Practices • Avoid small files at all costs • Anything smaller than 100 MB • Each mapper is a single JVM • CPU time is required to spawn JVMs/mappers
91. File Size Best Practices • Mappers take 2 sec to spawn up and be ready for processing • 10 TB in 100 MB files = 100,000 mappers × 2 sec = a total of ≈ 55 hours of mapper CPU setup time
92. File Size Best Practices • Mappers take 2 sec to spawn up and be ready for processing • 10 TB in 1,000 MB files = 10,000 mappers × 2 sec = a total of ≈ 5.5 hours of mapper CPU setup time (see the sketch below)
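
The arithmetic behind the two slides above, as a sketch (the 2 sec of JVM setup per mapper is the slides' assumption, and TB is taken as decimal):

    def jvm_setup_hours(total_tb, file_mb, setup_sec=2):
        """Aggregate JVM setup time when one (unsplittable) file = one mapper."""
        n_mappers = total_tb * 1_000_000 / file_mb  # decimal TB -> MB
        return n_mappers * setup_sec / 3600

    print(jvm_setup_hours(10, 100))   # 100,000 mappers -> ~55.6 hours
    print(jvm_setup_hours(10, 1000))  # 10,000 mappers  -> ~5.6 hours
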
93. File Size on Amazon S3: Best Practices • What's the best Amazon S3 file size for Hadoop? About 1-2 GB • Why?
94. File Size on Amazon S3: Best Practices • The life of a mapper should not be less than 60 sec • A single mapper can get 10-15 MB/s throughput to Amazon S3 • 60 sec × 15 MB/s ≈ 1 GB (sketch below)
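
The sweet spot falls out of those two numbers; a trivial sketch using the slides' rules of thumb:

    # Target S3 file size = minimum useful mapper lifetime x per-mapper
    # S3 throughput (both figures are the slides' rules of thumb).
    min_mapper_lifetime_sec = 60
    s3_throughput_mb_per_sec = 15

    print(min_mapper_lifetime_sec * s3_throughput_mb_per_sec)  # 900 MB, i.e., ~1 GB
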
95. Holy Grail Question • What if I have small-file issues?
96. Dealing with Small Files • Use S3DistCp to combine smaller files together • S3DistCp takes a grouping pattern and a target size to combine smaller input files into larger ones
97. Dealing with Small Files • Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
98. Compression • Compress as much as you can • Compress Amazon S3 input data files • Reduces cost • Speeds up Amazon S3-to-mapper data transfer time
99. Compression • Always compress data files on Amazon S3 • Reduces storage cost • Reduces bandwidth between Amazon S3 and Amazon EMR • Speeds up your job
100. Compression • Compress mapper and reducer output • Reduces disk I/O
101. Compression • Compression types: some are fast but offer less space reduction; some are space efficient but slower; some are splittable and some are not

    Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
    GZIP        13%                 21 MB/s          118 MB/s
    LZO         20%                 135 MB/s         410 MB/s
    Snappy      22%                 172 MB/s         409 MB/s
102. Compression • If you are time sensitive, faster compression is a better choice • If you have a large amount of data, use space-efficient compression • If you don't care, pick GZIP
103. Change Compression Type • You may decide to change compression type • Use S3DistCp to change the compression type of your files • Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
104. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
105. Architecting for Cost • AWS pricing models: • On-demand: pay-as-you-go model • Spot: marketplace; bid for instances and get a discount • Reserved Instances: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment
106. Reserved Instances Use Case • For alive and long-running clusters
107. Reserved Instances Use Case • For ad hoc but predictable workloads, use medium-utilization Reserved Instances
108. Reserved Instances Use Case • For unpredictable workloads, use Spot or on-demand pricing
109. Outline • Introduction to Amazon EMR • Amazon EMR Design Patterns • Amazon EMR Best Practices • Controlling Cost with Amazon EMR • Advanced Optimizations
110. Adv. Optimizations (Stage 1) • The best optimization is to structure your data (i.e., smart data partitioning) • Efficient data structuring = limiting the amount of data Hadoop has to process = faster jobs
111. Adv. Optimizations (Stage 1) • Hadoop is a batch processing framework • Data processing time = an hour to days • Not a great use case for shorter jobs • Other frameworks may be a better fit: Twitter Storm, Spark, Amazon Redshift, etc.
112. Adv. Optimizations (Stage 1) • The Amazon EMR team has done a great deal of optimization already • For smaller clusters, Amazon EMR configuration optimization won't buy you much • Remember you're paying for the full hour cost of an instance
113. Adv. Optimizations (Stage 1) • Best optimization??
114. Adv. Optimizations (Stage 1) • Add more nodes
115. Adv. Optimizations (Stage 2) • Monitor your cluster using Ganglia • Amazon EMR has a Ganglia bootstrap action
116. Adv. Optimizations (Stage 2) • Monitor and look for bottlenecks: memory, CPU, disk I/O, network I/O
117. Adv. Optimizations • Run job
118. Adv. Optimizations • Run job → find bottlenecks (Ganglia: CPU, disk, memory)
119. Adv. Optimizations • Run job → find bottlenecks (Ganglia: CPU, disk, memory) → address the bottleneck (fine-tune, change algorithm)
120. Network I/O • The most important metric to watch if you use Amazon S3 for storage • Goal: drive as much network I/O as possible from a single instance
121. Network I/O • Larger instances can drive > 600 Mbps • Cluster Compute instances can drive 1-2 Gbps • Optimize to get more out of your instance throughput: add more mappers?
122. Network I/O • If you're using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput • Your goal is to maximize your NIC throughput by having enough mappers per node
123. Network I/O, Example • Low network utilization: increase the number of mappers, if possible, to drive more traffic
124. CPU • Watch the CPU utilization of your clusters • If > 50% idle, increase the number of mappers/reducers per instance • Reduces the number of nodes, and reduces cost
125. Example Adv. Optimizations (Stage 2) • What potential optimization do you see in this graph?
126. Example Adv. Optimizations (Stage 2) • 40% CPU idle. Maybe add more mappers?
127. Disk I/O • Limit the amount of disk I/O • Can increase mapper/reducer memory • Compress data anywhere you can • Monitor your cluster and pay attention to HDFS-bytes-written metrics • One place to pay attention to is mapper/reducer disk spill
128. Disk I/O • A mapper has an in-memory buffer
129. Disk I/O • When the memory buffer gets full, data spills to disk
130. Disk I/O • If you see mappers/reducers excessively spilling to disk, increase the buffer memory per mapper • Spill is excessive when the ratio of MAPPER_SPILLED_RECORDS to MAPPER_OUTPUT_RECORDS is more than 1
131. Disk I/O • Example: MAPPER_SPILLED_RECORDS = 221,200,123; MAPPER_OUTPUT_RECORDS = 101,200,123; ratio ≈ 2.2 > 1, so these mappers are spilling excessively (see the sketch below)
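
A minimal sketch of the spill check using those two job counters:

    def excessive_spill(spilled_records, output_records):
        """Slide heuristic: a spilled/output ratio above 1 means mappers
        write more spill data than they emit, so grow io.sort.mb."""
        return spilled_records / output_records > 1

    ratio = 221_200_123 / 101_200_123
    print(round(ratio, 2), excessive_spill(221_200_123, 101_200_123))  # 2.19 True
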
132. Disk I/O • Increase mapper buffer memory by increasing io.sort.mb: <property><name>io.sort.mb</name><value>200</value></property> • The same logic applies to reducers
133. Disk I/O • Monitor disk I/O using Ganglia • Look out for disk I/O wait
134. Disk I/O • Monitor disk I/O using Ganglia • Look out for disk I/O wait
135. Remember! Run job → find bottlenecks (Ganglia: CPU, disk, memory) → address the bottleneck (fine-tune, change algorithm)
136. Please give us your feedback on this presentation (BDT404). As a thank you, we will select prize winners daily for completed surveys!
