
Deep Dive - Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters.

In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.

Ian Meyers, AWS Solutions Architect
Ian McDonald, IT Director, SwiftKey


  1. Deep Dive – Amazon Elastic MapReduce
     Ian Meyers, Solutions Architect – Amazon Web Services
     Guest speakers: Ian McDonald & James Aley – SwiftKey
  2. Agenda
     • Amazon Elastic MapReduce (EMR)
     • Leverage Amazon Simple Storage Service (S3) with the EMR File System (EMRFS)
     • Design patterns and optimizations
     • SwiftKey
  3. Amazon Elastic MapReduce
  4. Why Amazon EMR?
     • Easy to use – launch a cluster in minutes
     • Low cost – pay an hourly rate
     • Elastic – easily add or remove capacity
     • Reliable – spend less time monitoring
     • Secure – manage firewalls
     • Flexible – control the cluster
  5. Easy to deploy
     • AWS Management Console
     • Command line
     • Or use the Amazon EMR API with your favorite SDK
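Launching from the command line is a single call. A minimal sketch using the AWS CLI of the same era as this deck (AMI versions); the cluster name, key pair, and instance sizing are placeholder values, and configured AWS credentials are assumed:

```shell
# Launch a small Hive cluster (illustrative values, not a production config)
aws emr create-cluster \
  --name "demo-cluster" \
  --ami-version 3.11.0 \
  --applications Name=Hive \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```

The command returns the new cluster's id (`j-…`), which the monitoring and resizing commands later in the deck take as input.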
  6. Easy to monitor and debug
     • Integrated with Amazon CloudWatch
     • Monitor cluster, node, I/O, and Hadoop 1 & 2 processes
  7. Amazon S3 and HDFS Browser
  8. Query Editor
  9. Job Browser
  10. Choose your instance types
      Try different configurations to find the optimal cost/performance balance:
      • CPU – c3 family, cc2.8xlarge
      • Memory – m2 family, r3 family
      • Disk/IO – d2 family, i2 family
      • General – m1 family, m3 family
      Match the instance family to the workload: ETL, ML, Spark, HDFS.
  11. Resizable clusters
      Easy to add and remove compute capacity on your cluster.
      Match compute demands with cluster sizing.
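Resizing works per instance group. A hedged sketch with the AWS CLI; the cluster and instance-group ids are placeholders you would look up first:

```shell
# List the cluster's instance groups to find the one to resize
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.InstanceGroups[*].[Id,InstanceGroupType,RunningInstanceCount]'

# Grow (or shrink) the task group to 10 nodes
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=10
```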
  12. Easy to use Spot Instances
      • Spot Instances for task nodes – up to 90% off Amazon EC2 on-demand pricing
      • On-Demand Instances for core nodes – standard Amazon EC2 on-demand pricing
      • Meet your SLA at predictable cost, or exceed your SLA at lower cost
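The On-Demand-core / Spot-task split above can be expressed directly at cluster creation. A sketch assuming placeholder sizing and an illustrative bid price:

```shell
# Master and core nodes on On-Demand capacity; task nodes bid on the Spot market
aws emr create-cluster --name "spot-tasks" --ami-version 3.11.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=4,BidPrice=0.08
```

If the Spot price rises above the bid, only the task nodes are reclaimed; HDFS on the core nodes is unaffected.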
  13. Use bootstrap actions to install applications…
  14. …or to configure Hadoop
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
        --keyword-config-file   (merge values in the new config file into the existing one)
        --keyword-key-value     (override the values provided)

      Configuration file   Keyword   File-name shortcut   Key-value shortcut
      core-site.xml        core      C                    c
      hdfs-site.xml        hdfs      H                    h
      mapred-site.xml      mapred    M                    m
      yarn-site.xml        yarn      Y                    y
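Putting the shortcuts together, a hedged example: `-M` merges a custom mapred-site.xml from S3, and `-m` overrides a single mapred key. Bucket, file, and key values are illustrative:

```shell
# Merge a config file (-M = mapred-site.xml file shortcut) and override one key (-m)
aws emr create-cluster --name "configured-cluster" --ami-version 3.11.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-M","s3://my-bucket/config/mapred-site.xml","-m","mapred.reduce.tasks=10"]
```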
  15. Amazon EMR integration with Amazon Kinesis
      • Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
      • No intermediate data persistence required
      • A simple way to introduce real-time sources into batch-oriented systems
      • Multi-application support and automatic checkpointing
  16. Leverage Amazon S3
  17. Amazon S3 as your persistent data store
      • Designed for 99.999999999% durability
      • Separate compute and storage
      • Resize and shut down Amazon EMR clusters with no data loss
      • Point multiple Amazon EMR clusters at the same data in Amazon S3
  18. EMRFS makes it easier to leverage Amazon S3
      • Better performance and error-handling options
      • Transparent to applications – just read/write to “s3://”
      • Consistent view – consistent list and read-after-write for new puts
      • Support for Amazon S3 server-side and client-side encryption
      • Faster listing of large prefixes via EMRFS metadata
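Consistent view is switched on at cluster creation. A sketch with the AWS CLI's `--emrfs` option; the retry values are illustrative defaults, not a recommendation:

```shell
# Enable EMRFS consistent view (backed by a DynamoDB metadata table)
aws emr create-cluster --name "emrfs-consistent" --ami-version 3.11.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30
```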
  19. EMRFS support for Amazon S3 client-side encryption
      [Diagram: an EMRFS-enabled cluster reads and writes client-side-encrypted
      objects in Amazon S3, with keys supplied by a key vendor (AWS KMS or your
      custom key vendor)]
  20. Fast listing of Amazon S3 objects using EMRFS metadata
      EMRFS metadata in Amazon DynamoDB provides list and read-after-write
      consistency and faster list operations.

      Number of objects   Without consistent view   With consistent view
      1,000,000           147.72                    29.70
      100,000             12.70                     3.69

      *Tested using a single-node cluster with an m3.xlarge instance.
  21. Optimize to leverage HDFS
      • Iterative workloads – if you’re processing the same dataset more than once
      • Disk-I/O-intensive workloads
      Persist data on Amazon S3 and use S3DistCp to copy it to HDFS for processing.
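The S3-to-HDFS copy can run as a cluster step. A hedged sketch: the cluster id, bucket, and paths are placeholders, and the S3DistCp jar location shown is the one used on 3.x-era AMIs (verify the path on your AMI version):

```shell
# Copy a dataset from S3 into HDFS before running an iterative job on it
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="S3 to HDFS",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=["--src","s3://my-bucket/input/","--dest","hdfs:///data/input/"]
```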
  22. Amazon EMR – Design Patterns
  23. Amazon EMR example #1: Batch processing
      • GBs of logs pushed to Amazon S3 hourly
      • Daily Amazon EMR cluster using Hive to process data
      • Input and output stored in Amazon S3
      • 250 Amazon EMR jobs per day, processing 30 TB of data
  24. Amazon EMR example #2: HBase
      • Data pushed to Amazon S3
      • A daily Amazon EMR cluster Extracts, Transforms, and Loads (ETL) the data into the database
      • A 24/7 Amazon EMR cluster running HBase holds the last 2 years’ worth of data
      • A front-end service uses the HBase cluster to power a dashboard with high concurrency
  25. Amazon EMR example #3: Interactive query
      • TBs of logs sent daily
      • Logs stored in Amazon S3
      • Amazon EMR cluster using Presto for ad hoc analysis of the entire log set
      • Interactive query using Presto on a multi-petabyte warehouse
  26. Optimizations for Storage
  27. File formats
      Row oriented:
      • Text files
      • Sequence files (writable objects)
      • Avro data files (described by a schema)
      Column oriented:
      • Optimized Row Columnar (ORC)
      • Parquet
  28. Choosing the right file format
      • Processing and query tools – Hive, Pig, Impala, Presto, Spark
      • Evolution of schema – Avro for schema evolution, ORC/Parquet for storage
      • File format “splittability” – avoid JSON/XML records with embedded newlines, since the default record split is the newline character
      • Compression – block, file, or internal
  29. File sizes
      • Avoid small files – avoid anything smaller than 100 MB
      • Each process is a single Java Virtual Machine (JVM), and CPU time is required to spawn JVMs
      • Use fewer files, matching closely to the block size
      • Fewer files mean fewer calls to Amazon S3 and fewer network/HDFS requests
  30. Dealing with small files
      You *can* reduce the HDFS block size, e.g. to 1 MB (the default is 128 MB):
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
      Instead, use S3DistCp to combine small files:
      • S3DistCp takes a pattern and a target path to combine smaller input files into larger ones
      • Supply a target size and compression codec
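A hedged S3DistCp sketch, run on the cluster's master node: files whose keys share the same date (the regex capture group) are concatenated into files of roughly the target size. Bucket names and the regex are placeholders:

```shell
# Combine many small log files into ~128 MB gzip files, grouped by date in the key
s3-dist-cp \
  --src  s3://my-bucket/small-logs/ \
  --dest s3://my-bucket/combined-logs/ \
  --groupBy '.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*' \
  --targetSize 128 \
  --outputCodec gzip
```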
  31. Compression
      • Always compress data files on Amazon S3 – it reduces network traffic between Amazon S3 and Amazon EMR and speeds up your job
      • Compress mapper and reducer output
      Amazon EMR compresses internode traffic with LZO on Hadoop 1 and Snappy on Hadoop 2.
  32. Choosing the right compression
      • Time sensitive – faster codecs are a better choice (Snappy)
      • Large amounts of data – use space-efficient codecs (gzip)
      • Combined workloads – use LZO

      Algorithm        Splittable?   Compression ratio   Compress + decompress speed
      gzip (DEFLATE)   No            High                Medium
      bzip2            Yes           Very high           Slow
      LZO              Yes           Low                 Fast
      Snappy           No            Low                 Very fast
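The ratio trade-off in the table is easy to see locally. A quick sketch with the standard gzip and bzip2 tools on synthetic, highly repetitive data (real log data will compress far less):

```shell
# Generate ~5 MB of repetitive sample "log" data and compare codec output sizes
yes "2015-04-01 12:00:00 INFO request served in 12ms" | head -n 100000 > /tmp/sample.txt
gzip  -c /tmp/sample.txt > /tmp/sample.txt.gz
bzip2 -c /tmp/sample.txt > /tmp/sample.txt.bz2
ls -l /tmp/sample.txt /tmp/sample.txt.gz /tmp/sample.txt.bz2
```

On data like this, bzip2 should produce the smallest file but take the longest, matching the table.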
  33. Cost-saving tips
      • Use Amazon S3 as your persistent data store – only pay for compute when you need it!
      • Use Amazon EC2 Spot Instances, especially for task nodes, to save 80 percent or more on the Amazon EC2 cost
      • Use Amazon EC2 Reserved Instances if you have steady workloads
      • Create CloudWatch alerts to notify you when a cluster is underutilized so that you can shut it down (e.g. running mappers == 0 for more than N hours)
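One way to implement the underutilization alert is CloudWatch's `IsIdle` metric for EMR, which is 1 when the cluster is doing no work. A hedged sketch; the cluster id, SNS topic ARN, and thresholds are placeholders:

```shell
# Notify an SNS topic when the cluster has been idle for two consecutive hours
aws cloudwatch put-metric-alarm \
  --alarm-name emr-cluster-idle \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:emr-alerts
```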
  34. Cost-saving tips (continued)
      If you are spending more than $10K per month on Amazon EMR, contact your account manager about custom pricing options.
  35. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
      Next Generation Analytics
      Ian McDonald, Director of IT, SwiftKey – Twitter: @imcdnzl
  36. What is SwiftKey?
  37. Architecture
  38. Data Capture Architecture
  39. ETL Architecture
  40. Cascalog
      • Cascalog is an open source Clojure library implemented using Cascading
      • Used instead of tools like Hive or Pig
      • Write a few lines of Clojure and end up with an EMR job
  41. Parquet (Apache project)
      • Developed by Cloudera and Twitter
      • Efficient compression and encoding on Hadoop/EMR
      • Used for storing and processing our data
  42. Things we’ve learnt
  43. Lessons
      • Get on top of serialisation
      • Don’t just stick with JSON/Gzip
      • Many small files in S3 are painful – rebuild into fewer, bigger files
      • Use Spot Instances for EMR (except the master node)
      • Experiment with different instance types to find the best speed/cost trade-off
  44. What’s Next
  45. Apache Spark
      • Easier and faster than Hadoop MapReduce or database queries
      • Processes in RAM
      • Works directly against S3 data
      • Available on EMR
      • Not necessarily great for big joins
  46. 46. LONDON