
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practices (400)


Join this advanced technical session on Amazon Elastic MapReduce (EMR) for an introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. Learn how to scale your cluster up or down dynamically and how to fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.


Transcript

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Amazon Elastic MapReduce: Deep Dive and Best Practices Ian Meyers, AWS (meyersi@) John Telford, Channel 4 (jtelford@) April 30, 2014
  • 2. Outline: Introduction to Amazon EMR • Architecting EMR for Cost • Amazon EMR Design Patterns • Amazon EMR Best Practices
  • 3. What is EMR? Map-Reduce engine • Vibrant ecosystem • Hadoop-as-a-Service • Massively parallel • Cost-effective AWS wrapper • Integrated with AWS services
  • 4. HDFS: Amazon EMR
  • 5. HDFS: Amazon EMR, Amazon S3, Amazon DynamoDB
  • 6. HDFS, analytics languages, data management: Amazon EMR, Amazon S3, Amazon DynamoDB
  • 7. HDFS, analytics languages, data management: Amazon EMR, Amazon RDS, Amazon S3, Amazon DynamoDB
  • 8. HDFS, analytics languages, data management: Amazon Redshift, Amazon EMR, Amazon RDS, Amazon S3, Amazon DynamoDB, AWS Data Pipeline
  • 9. Amazon EMR Introduction • Launch clusters of any size in a matter of minutes • Use a variety of instance sizes that match your workload
  • 10. Amazon EMR Introduction •  Don’t get stuck with hardware •  Don’t deal with capacity planning •  Run multiple clusters with different sizes, specs and node types
  • 11. Outline: Introduction to Amazon EMR • Architecting EMR for Cost • Amazon EMR Design Patterns • Amazon EMR Best Practices
  • 12. Architecting for cost • EC2/EMR pricing models: – On-demand: pay-as-you-go model – Spot: marketplace; bid for instances and get a discount – Reserved Instance: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment
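To make the three models concrete, here is a minimal sketch (not from the deck) of a single cluster request that mixes them, using the modern aws emr CLI: the master and core groups run On-Demand, while the task group bids on the Spot market. The cluster name, AMI version, instance counts, and bid price are illustrative assumptions.

    aws emr create-cluster \
      --name "cost-mixed-cluster" \
      --ami-version 3.1.0 \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.xlarge \
        InstanceGroupType=CORE,InstanceCount=4,InstanceType=m1.xlarge \
        InstanceGroupType=TASK,InstanceCount=8,InstanceType=m1.xlarge,BidPrice=0.10
    # Reserved Instances need no flag here: RI pricing is a billing construct
    # that applies automatically to matching On-Demand capacity.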
  • 13. Architecting for cost • On-demand – Research & Development, Data Science • Spot – Restartable Tasks – Embarrassingly Parallel Workloads • Reserved Instance – Well Understood, Frequent and Predictable Workloads
  • 14. EMR Architecture for Optimal Cost: use Heavy Utilisation RIs for alive and long-running clusters
  • 15. EMR Architecture for Optimal Cost: use Medium Utilisation RIs for ad-hoc and unpredictable workloads
  • 16. EMR Architecture for Optimal Cost: supplement with Spot for unpredictable workloads or a turbo boost
  • 17. Outline: Introduction to Amazon EMR • Architecting EMR for Cost • Amazon EMR Design Patterns • Amazon EMR Best Practices
  • 18. Amazon EMR Design Patterns: Pattern #1: Transient vs. Alive Clusters • Pattern #2: Core Nodes and Task Nodes • Pattern #3: Amazon S3 & HDFS
  • 19. Pattern #1: Transient vs. Alive Clusters
  • 20. Pattern #1: Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Data persists on Amazon S3 • Input & output data on Amazon S3
  • 21. Benefits of Transient Clusters 1.  Control your cost 2.  Minimum maintenance •  Cluster goes away when job is done 3.  Practice cloud architecture •  Pay for what you use •  Data processing as a workflow
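A minimal sketch of the transient pattern (an assumption layered on the deck, not taken from it): the cluster reads its input from Amazon S3, writes output back to S3, and --auto-terminate shuts it down as soon as its steps finish. The bucket, cluster sizing, and AMI version are placeholders; omitting --auto-terminate (or passing --no-auto-terminate) gives you the "alive" cluster of the next slide.

    aws emr create-cluster \
      --name "transient-wordcount" \
      --ami-version 3.1.0 \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.xlarge \
        InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.xlarge \
      --steps Type=STREAMING,Name=wordcount,ActionOnFailure=TERMINATE_CLUSTER,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/wordcount/output] \
      --auto-terminate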
  • 22. Alive Clusters •  Very similar to traditional Hadoop deployments •  Cluster stays around after the job is done •  Data persistence model: •  Amazon S3 •  Amazon S3 Copy To HDFS •  HDFS and Amazon S3 as backup
  • 23. Alive Clusters • Always keep data safe on Amazon S3 even if you're using HDFS for primary storage • Get in the habit of shutting down your cluster and starting a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow management tools such as AWS Data Pipeline
  • 24. Pattern #2: Core & Task nodes
  • 25. Core Nodes: run TaskTrackers (compute) and DataNodes (HDFS). [diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS]
  • 26. Core Nodes: you can add core nodes for more HDFS space and more CPU/memory.
  • 27. Core Nodes: you can't remove core nodes, because of HDFS.
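Resizing is one API call. A hedged sketch (the instance-group id is a placeholder): look up the core group's id in the console or via describe-cluster, then raise its count. As the slide says, the count of a group holding HDFS can only go up.

    # Grow the core instance group of a running cluster to 6 nodes.
    aws emr modify-instance-groups \
      --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=6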
  • 28. Amazon EMR Task Nodes: run TaskTrackers but no HDFS; they read from core-node HDFS. [diagram: a task instance group alongside the master and core instance groups]
  • 29. Amazon EMR Task Nodes: you can add task nodes...
  • 30. ...for more CPU power and more memory.
  • 31. Amazon EMR Task Nodes: you can remove task nodes when processing is completed.
  • 32. [diagram repeats with the task instance group removed; master and core groups remain]
  • 33. Task Node Use-Cases • Speed up job processing using the Spot market – Run task nodes on the Spot market • Get a discount on the hourly price – Nodes can come and go without interruption to your cluster • When you need extra horsepower for a short amount of time – Example: needing to pull a large amount of data from Amazon S3
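A sketch of the task-nodes-on-Spot pattern under the same assumptions (placeholder ids, illustrative bid): bolt a Spot task group onto a running cluster for the heavy phase of a job, then shrink it to zero when done. Because task nodes hold no HDFS blocks, losing the bid costs you capacity, not data.

    # Add 10 Spot task nodes for extra horsepower...
    aws emr add-instance-groups --cluster-id j-XXXXXXXXXXXXX \
      --instance-groups InstanceGroupType=TASK,InstanceType=m1.xlarge,InstanceCount=10,BidPrice=0.08
    # ...and release them when processing is complete.
    aws emr modify-instance-groups \
      --instance-groups InstanceGroupId=ig-YYYYYYYYYYYY,InstanceCount=0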
  • 34. Pattern #3: Amazon S3 & HDFS
  • 35. Option 1: Amazon S3 as HDFS • Use Amazon S3 as your permanent data store • HDFS for temporary storage of data between jobs • No additional step to copy data to HDFS [diagram: Amazon EMR cluster (core and task instance groups) backed by Amazon S3]
  • 36. Benefits: Amazon S3 as HDFS • Ability to shut down your cluster (HUGE benefit!) • Use Amazon S3 as your durable storage (eleven 9s of durability)
  • 37. Benefits: Amazon S3 as HDFS •  No need to scale HDFS •  Capacity •  Replication for durability •  Amazon S3 scales with your data •  Both in IOPs and data storage
  • 38. Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters • Hard to do with HDFS [diagram: two EMR clusters reading the same Amazon S3 bucket]
  • 39. Benefits: Amazon S3 as HDFS •  Take advantage of Amazon S3 features •  Amazon S3 Server Side Encryption •  Amazon S3 Lifecycle Policies •  Amazon S3 versioning to protect against corruption •  Build elastic clusters •  Add nodes to read from Amazon S3 •  Remove nodes with data safe on Amazon S3
  • 40. What About Data Locality? • Run your job in the same region as your Amazon S3 bucket • Amazon EMR nodes have high-speed connectivity to Amazon S3 • If your job is CPU/memory-bound, locality doesn't make a huge difference
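In practice the pattern is just "point your job at s3:// paths". A hedged example (the cluster id, script, and bucket layout are invented for illustration): a Hive step whose input and output locations both live on Amazon S3, so nothing needs to survive on the cluster.

    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
      --steps Type=HIVE,Name=s3-to-s3,ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/aggregate.q,-d,INPUT=s3://mybucket/raw,-d,OUTPUT=s3://mybucket/results]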
  • 41. Anti-Pattern: Amazon S3 as HDFS •  Iterative workloads –  If you’re processing the same dataset more than once •  Disk I/O intensive workloads
  • 42. Option 2: Optimise for Latency with HDFS. 1. Data persisted on Amazon S3.
  • 43. Option 2: Optimise for Latency with HDFS. 2. Launch Amazon EMR and copy data to HDFS with S3DistCp.
  • 44. Option 2: Optimise for Latency with HDFS. 3. Start processing data on HDFS.
  • 45. Benefits: HDFS instead of S3 •  Better pattern for I/O-intensive workloads •  Amazon S3 as system of record •  Durability •  Scalability •  Cost •  Features: lifecycle policy, security
  • 46. Outline: Introduction to Amazon EMR • Architecting EMR for Cost • Amazon EMR Design Patterns • Amazon EMR Best Practices
  • 47. Amazon EMR Nodes and Size •  Use m1 and c1 family for functional testing •  Use m3 and c3 xlarge and larger nodes for production workloads •  Use cc2/c3 for memory and CPU intensive jobs •  hs1, hi1, i2 instances for HDFS workloads •  Prefer a smaller cluster of larger nodes
  • 48. Holy Grail Question How many nodes do I need?
  • 49. Cluster Sizing Calculation 1.  Estimate the number of mappers your job requires.
  • 50. Cluster Sizing Calculation 2. Pick an instance and note down the number of mappers it can run in parallel: m1.xlarge = 8 mappers in parallel
  • 51. Resource Capability / Instance Type
    EC2 Instance Type          Mappers   Reducers
    m1.small                   2         1
    m1.large                   3         1
    m1.xlarge                  8         3
    m2.xlarge                  3         1
    m2.2xlarge                 6         2
    m2.4xlarge                 14        4
    m3.xlarge                  6         1
    m3.2xlarge                 12        3
    cc2.8xlarge                24        6
    c3.4xlarge                 24        6
    hi1.4xlarge                24        6
    hs1.8xlarge                24        6
    cr1.8xlarge & c3.8xlarge   48        12
  • 52. Cluster Sizing Calculation 3.  We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
  • 53. Cluster Sizing Calculation 4.  Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
  • 54. Cluster Sizing Calculation: Estimated Number of Nodes = (Total Mappers * Time To Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
  • 55. Example: Cluster Sizing Calculation 1.  Estimate the number of mappers your job requires 150 2.  Pick an instance and note down the number of mappers it can run in parallel m1.xlarge with 8 mapper capacity per instance
  • 56. Example: Cluster Sizing Calculation 3.  We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 57. Example: Cluster Sizing Calculation 4.  Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
  • 58. Cluster Sizing Calculation: Estimated number of nodes = (Total Mappers For Your Job * Time To Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time) = (150 * 3 min) / (8 * 5 min) = 11 m1.xlarge
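Plugging the example numbers into the formula, as a quick sanity check (plain shell arithmetic, nothing EMR-specific):

    TOTAL_MAPPERS=150; SAMPLE_MINS=3; MAPPER_CAPACITY=8; TARGET_MINS=5
    echo $(( TOTAL_MAPPERS * SAMPLE_MINS / (MAPPER_CAPACITY * TARGET_MINS) ))   # prints 11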
  • 59. File Best Practices •  Avoid small files at all costs (smaller than 100MB) •  Use Compression
  • 60. Holy Grail Question What if I have small file issues?
  • 61. Dealing with Small Files • Use S3DistCp to combine smaller files together • S3DistCp takes a pattern and a target size to combine smaller input files into larger ones:
    ./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
  • 62. Compression •  Always Compress Data Files On Amazon S3 •  Reduces Bandwidth Between Amazon S3 and Amazon EMR •  Speeds Up Your Job •  Compress Mappers and Reducer Output
  • 63. Compression • Compression types: – Some are fast BUT offer less space reduction – Some are space-efficient BUT slower – Some are splittable and some are not
    Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
    GZIP        13%                 21 MB/s          118 MB/s
    LZO         20%                 135 MB/s         410 MB/s
    Snappy      22%                 172 MB/s         409 MB/s
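One way to act on this slide (a sketch using the Hadoop 1.x property names of that era; the streaming jar path is the usual location on EMR AMIs of the time, and the bucket is a placeholder): compress intermediate mapper output with the fast Snappy codec and the final output with the more space-efficient GZIP.

    hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
      -D mapred.compress.map.output=true \
      -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapred.output.compress=true \
      -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
      -input s3://mybucket/input -output s3://mybucket/output \
      -mapper cat -reducer cat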
  • 64. In Summary • Practice Cloud Architecture with Transient Clusters • Utilise Task Nodes on Spot for Increased Performance and Lower Cost • Utilise S3 as the System of Record for Durability bit.ly/1n0hRSr
  • 65. John Telford Enterprise Architect Channel 4 @jtelford1 johntelforduk EMR at C4 1.  Who we are. 2.  What we’re doing with EMR. 3.  Lessons learnt.
  • 66. Channel 4 • State-owned, public service broadcaster • Self-funded mostly by selling advertising (no TV licence fee money!) • Turnover £1B • 800 employees • Programmes supplied by 250 independent production companies
  • 67. 12 Years A Slave
  • 68. C4 Virtuous Circle: Ad Revenue (£s) = Impacts x Rate. [diagram, a loop: Brilliant Programmes → Oodles of Viewers → Massive Ad Revenue → Gigantic Programme Budget → back to Brilliant Programmes]
  • 69. C4 Viewer Insight Database • Clickstream & Ad Server behavioural data • 10M registered viewers • Viewer Panel / Survey & 3rd Party Data • Programme metadata • 60 TB of S3 storage. Google "Channel 4 viewer promise"
  • 70. Expect to pre-process your data. We want our Data Scientists to enjoy a user-friendly, high-performance system containing high-quality data. [diagram: pipeline over AWS S3 storage — Acquire → Ingest (raw DD, smoke test) → Decorate (row by row: drop columns, cleanse data, add flags, lookup values → decorated DD) → Derive (multirow, multipass: dwell, last visit hit → derived DD) → Embellish (segmentations, last activity, summary tables → embellished DD) → Analytical Outputs via Hive HQL query]
  • 71. Data profiling. Big Data requires Big Data Profiling.
    -- Count rows whose fields fail basic format checks (dots in the IP
    -- pattern are escaped so they match literal separators).
    SELECT SUM(IF(visit_num REGEXP '^[0-9]+$', 0, 1)),
           SUM(IF(ip REGEXP '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$', 0, 1)),
           SUM(IF(page_url <> '', 0, 1)),
           COUNT(DISTINCT service)
    FROM raw_clickstream;
  • 72. Partitioning. Help EMR go direct to the data it needs.
    CREATE EXTERNAL TABLE web_log (
      hit_time_gmt BIGINT,
      cookie STRING
      -- and many more columns.
    )
    PARTITIONED BY (month STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://bucket/';

    ALTER TABLE web_log ADD PARTITION (month='2010-06') LOCATION '2010-06';
    ALTER TABLE web_log ADD PARTITION (month='2010-07') LOCATION '2010-07';
    -- etc.
  • 73. Connecting data. [diagram: old approach — every slave node queries a single RDS instance; new approach — each slave node has its own local Redis]
  • 74. Handling large amounts of data •  AWS Import/Export. –  Consumer grade USB drives… sent by courier. •  AWS Direct Connect. –  Dedicated network connection from your premises to AWS. –  We have not completed our implementation. •  Glacier.
  • 75. Choosing instances for EMR. Source: https://aws.amazon.com/ec2/pricing/. Some instance types omitted from the diagram for clarity. Exchange rate: $1 = £0.61.
  • 76. Social engineering •  Make the Data Scientists aware of EMR costs. •  We give them visibility of clusters running, who started them, idle time, etc.
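One possible building block for that visibility (an assumption about tooling, not Channel 4's actual setup): the CLI can list running clusters with their accumulated normalised instance hours; attributing clusters to the people who started them would need tags or CloudTrail on top.

    aws emr list-clusters --active \
      --query 'Clusters[].[Id,Name,NormalizedInstanceHours,Status.State]' \
      --output table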
  • 77. John Telford Enterprise Architect Channel 4 @jtelford1 johntelforduk Thanks! Youtube: “Channel 4 Paralympics Meet the Superhumans”
  • 78. AWS Partner Trail Win a Kindle Fire •  10 in total •  Get a code from our sponsors
  • 79. Please rate this session using the AWS Summits App and help us build better events
  • 80. #AWSSummit @AWScloud @AWS_UKI bit.ly/1n0hRSr