
(BDT305) Amazon EMR Deep Dive and Best Practices



Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.

Published in: Technology

(BDT305) Amazon EMR Deep Dive and Best Practices

  1. 1. Amazon EMR Deep Dive & Best Practices (BDT305) • Rahul Pathak, AWS • Scott Donaldson, FINRA • Clayton Kovar, FINRA • October 2015 • © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. What to expect from the session • Update on the latest Amazon EMR release • Information on advanced capabilities of Amazon EMR • Tips for lowering your Amazon EMR costs • Deep dive into how FINRA uses Amazon EMR and Amazon S3 as their multi-petabyte data warehouse
  3. 3. Amazon EMR Deep Dive & Best Practices • Rahul Pathak, Sr. Mgr. Amazon EMR (@rahulpathak) • October 2015
  4. 4. Amazon EMR • Managed clusters for Hadoop, Spark, Presto, or any other applications in the Apache/Hadoop stack • Integrated with the AWS platform via EMRFS – connectors for Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and AWS KMS • Secure, with support for AWS IAM roles, KMS, S3 client-side encryption, Hadoop transparent encryption, and Amazon VPC; HIPAA-eligible • Built-in support for resizing clusters and integration with the Amazon EC2 Spot market to help lower costs
  5. 5. New features • EMR Release 4.1: Hadoop KMS with transparent HDFS encryption support; Spark 1.5, Zeppelin 0.6; Presto 0.119, Airpal; Hive, Oozie, Hue 3.7.1; simple APIs for launch and configuration • Intelligent Resize: incrementally scale up based on available capacity; wait for work to complete before resizing down; can scale core nodes and HDFS as well as task nodes
  6. 6. Leverage Amazon S3 with EMR File System (EMRFS)
  7. 7. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at the same data in Amazon S3 • Easily evolve your analytic infrastructure as technology evolves
  8. 8. EMRFS makes it easier to use Amazon S3 • Read-after-write consistency • Very fast list operations • Error handling options • Support for Amazon S3 encryption • Transparent to applications: s3://
  9. 9. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 'samples/pig-apache/input/'
  10. 10. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
  11. 11. Consistent view and fast listing using the optional EMRFS metadata layer • EMRFS metadata in Amazon DynamoDB • List and read-after-write consistency • Faster list operations
        Number of objects | Without consistent view | With consistent view
        1,000,000         | 147.72                  | 29.70
        100,000           | 12.70                   | 3.69
        *Tested using a single node cluster with an m3.xlarge instance.
  12. 12. EMRFS client-side encryption • EMRFS enabled for Amazon S3 client-side encryption • Amazon S3 encryption clients • Key vendor (AWS KMS or your custom key vendor) • Amazon S3 (client-side encrypted objects)
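The listing comparison on this slide can be reduced to a speedup ratio. The following is a minimal Python sketch that uses only the four figures from the slide. The slide does not state a unit, so the code treats them as unitless timings, and the dictionary and function names are illustrative.

```python
# Listing timings from the slide, keyed by number of objects.
# The slide gives no unit, so only the ratios are meaningful here.
LIST_TIMINGS = {
    1_000_000: {"without_consistent_view": 147.72, "with_consistent_view": 29.70},
    100_000: {"without_consistent_view": 12.70, "with_consistent_view": 3.69},
}


def listing_speedup(num_objects: int) -> float:
    """Return how many times faster listing is with consistent view enabled."""
    row = LIST_TIMINGS[num_objects]
    return row["without_consistent_view"] / row["with_consistent_view"]


if __name__ == "__main__":
    for n in LIST_TIMINGS:
        print(f"{n:>9,} objects: {listing_speedup(n):.2f}x faster listing")
```

On these figures, listing is roughly 5x faster at one million objects and about 3.4x faster at 100,000 objects.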
  13. 13. HDFS is still there if you need it • Iterative workloads • If you’re processing the same dataset more than once • Consider using Spark & RDDs for this too • Disk I/O intensive workloads • Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
  14. 14. Optimizations for storage
  15. 15. File formats • Row oriented: text files; sequence files (Writable objects); Avro data files (described by schema) • Columnar format: Optimized Row Columnar (ORC); Parquet
  16. 16. Factors to consider • Processing and query tools: Hive, Impala, and Presto • Evolution of schema: Avro for schema and Parquet for storage • File format "splittability": avoid whole-file JSON/XML; use them as records • Encryption requirements
  17. 17. File sizes • Avoid small files: anything smaller than 100MB • Each mapper is a single JVM: CPU time is required to spawn JVMs/mappers • Fewer files, matching closely to block size: fewer calls to S3; fewer network/HDFS requests
  18. 18. Dealing with small files • Reduce HDFS block size, e.g. 1MB (default is 128MB): --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576" • Better: use S3DistCp to combine smaller files together: S3DistCp takes a pattern and target path to combine smaller input files into larger ones; supply a target size and compression codec
  19. 19. Compression • Always compress data files on Amazon S3: reduces network traffic between Amazon S3 and Amazon EMR; speeds up your job • Compress mapper and reducer output • Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2
  20. 20. Choosing the right compression • Time sensitive: faster compressions are a better choice • Large amount of data: use space-efficient compressions • Combined workload: use gzip
        Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
        Gzip (DEFLATE) | No          | High              | Medium
        bzip2          | Yes         | Very high         | Slow
        LZO            | Yes         | Low               | Fast
        Snappy         | No          | Low               | Very fast
  21. 21. Cost saving tips for Amazon EMR • Use S3 as your persistent data store and query it using Presto, Hive, Spark, etc. • Only pay for compute when you need it • Use Amazon EC2 Spot instances to save >80% • Use Amazon EC2 Reserved instances for steady workloads
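To make the small-files point concrete, here is a rough Python sketch that estimates how many input splits (and therefore mappers) a dataset produces at different file sizes. It assumes a simplified model that is not stated on the slide: one split per block-sized chunk of each file, at least one split per file, and no combining input format. The function name and defaults are illustrative.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def estimate_splits(total_bytes: int, file_bytes: int, block_bytes: int = 128 * MIB) -> int:
    """Estimate input splits under a simplified one-split-per-block model."""
    num_files = math.ceil(total_bytes / file_bytes)
    # A file smaller than one block still costs a whole split (one mapper JVM).
    splits_per_file = max(1, math.ceil(file_bytes / block_bytes))
    return num_files * splits_per_file


if __name__ == "__main__":
    for size in (1 * MIB, 128 * MIB, 1024 * MIB):
        print(f"{size // MIB:>5} MiB files -> {estimate_splits(TIB, size):,} splits")
```

Under this model, 1 TiB stored as 1 MiB files needs over a million mappers, while the same data in block-sized or larger files needs 8,192.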
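The S3DistCp approach can be sketched as a command line assembled in Python. This is only an illustration: the bucket, destination path, and grouping pattern are hypothetical, and the `s3-dist-cp` command name assumes an EMR 4.x master node. The options used (`--src`, `--dest`, `--groupBy`, `--targetSize` in MiB, `--outputCodec`) are S3DistCp options.

```python
import shlex


def s3distcp_args(src: str, dest: str, group_by: str,
                  target_size_mib: int, output_codec: str) -> list:
    """Build an S3DistCp argument list that combines small files into larger ones."""
    return [
        "s3-dist-cp",
        "--src", src,
        "--dest", dest,
        # Files whose names yield the same regex capture group are concatenated.
        "--groupBy", group_by,
        # Approximate size of each combined output file, in MiB.
        "--targetSize", str(target_size_mib),
        "--outputCodec", output_codec,
    ]


if __name__ == "__main__":
    args = s3distcp_args(
        src="s3://example-bucket/raw-logs/",    # hypothetical bucket
        dest="hdfs:///combined-logs/",          # hypothetical target path
        group_by=r".*(2015-10-\d\d).*",         # hypothetical pattern: group by day
        target_size_mib=128,
        output_codec="gz",
    )
    print(shlex.join(args))
```

The printed command could be run on the master node or submitted as a step; how it is launched is left out of this sketch.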
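The comparison on this slide can be encoded as a small lookup, which makes the trade-off explicit. The following Python sketch copies the four rows from the slide. The ranking scheme and the `pick_codec` helper are assumptions added for illustration, not part of the deck.

```python
from typing import NamedTuple


class Codec(NamedTuple):
    name: str
    splittable: bool
    ratio: str
    speed: str


# Rows copied from the slide's comparison table.
CODECS = [
    Codec("gzip", False, "high", "medium"),
    Codec("bzip2", True, "very high", "slow"),
    Codec("lzo", True, "low", "fast"),
    Codec("snappy", False, "low", "very fast"),
]

# Illustrative orderings of the slide's qualitative labels.
RATIO_RANK = {"low": 0, "high": 1, "very high": 2}
SPEED_RANK = {"slow": 0, "medium": 1, "fast": 2, "very fast": 3}


def pick_codec(priority: str, need_splittable: bool = False) -> str:
    """Pick the best codec for 'speed' or 'space', optionally requiring splittability."""
    pool = [c for c in CODECS if c.splittable or not need_splittable]
    if priority == "speed":
        return max(pool, key=lambda c: SPEED_RANK[c.speed]).name
    if priority == "space":
        return max(pool, key=lambda c: RATIO_RANK[c.ratio]).name
    raise ValueError("priority must be 'speed' or 'space'")


if __name__ == "__main__":
    print(pick_codec("speed"))                        # time-sensitive job
    print(pick_codec("speed", need_splittable=True))  # fast and splittable
    print(pick_codec("space"))                        # large amount of data
```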
  22. 22. EMR & Interactive Analytics • Scott Donaldson, Senior Director • Clayton Kovar, Principal Architect
  23. 23. EMR is ubiquitous in our architecture [Architecture diagram. Components shown: Data Providers; Users; Auto Scaled EC2 Data Ingestion Services; Source of Truth (S3); Normalization ETL Clusters (EMR); Optimization ETL Clusters (EMR); Query Optimized (S3); Batch Analytic Clusters (EMR); Adhoc Query Cluster (EMR); Query Clusters (EMR); Data Marts (Amazon Redshift); Auto Scaled EC2 Analytics Apps; Shared Data Services; Shared Metastore (RDS); Reference Data (RDS); Auto Scaled EC2 Data Catalog & Lineage Services; Auto Scaled EC2 Cluster Mgt & Workflow Services]
  24. 24. It starts with the data • S3 is your durable system of record • Separate your compute and storage • Shut down your cluster when not in use • Share data among multiple clusters • Fault tolerance and disaster recovery • Use EMRFS for consistent view • Partition your data for performance • Optimize for your query use cases and access patterns • Larger files >256MB are more efficient • Compact small files into >100MB • FINRA Data Manager orchestrates data between storage and compute clusters • Unified catalog • Manage EMR clusters • Track usage & lineage • Job orchestration
  25. 25. File formats & compression • Text format for archival copies on S3 & Amazon Glacier • Select compression algorithm for best fit: we wanted high compression for the archive copy • Select a row or columnar format for performance: Sequence or Avro (row); ORC, Parquet, RC File (columnar) • Columnar benefits: predicate pushdown; skip unwanted columns • Serve multiple query engines: Hive, Presto, Spark • Avoid bloated formats with repetitive markup (e.g. XML)
  26. 26. Our partition and query strategy [Diagram. Labels shown: Data received as; Users query by; Symbol Group 1, Symbol Group 2, Symbol Group 3, … Symbol Group 100; Symbol & Firm Query; Symbol Only Query; Firm Only Query; On Time Data (Processing Date = Event Date): 99.97% of all records are on time; Late Data: all late records scanned for all queries]
  27. 27. Example Hive table creation
        create external table if not exists NEW_ORDERS (…)
          partitioned by (EVENT_DT DATE, HASH_PRTN_NB SMALLINT)
          stored as orc
          location 's3://reinvent/new_orders/'
          tblproperties ("orc.compress"="SNAPPY");
        alter table NEW_ORDERS add if not exists
          partition (event_dt='2015-10-08', hash_prtn_nb=0) location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
          …
          partition (event_dt='2015-10-08', hash_prtn_nb=1000) location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=1000/';
        Each record's hash partition number is calculated by ((pmod(hash(symbol), 100) * 10) + pmod(firm, 10))
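The hash partition formula on this slide can be traced through in Python. This is a sketch under one stated assumption: Hive's `hash()` on an ASCII string is modeled with a Java-style 32-bit string hash (`java_string_hash` is a stand-in written for this example, not a Hive API). The point is the structure of the result: the tens-and-above digits encode the symbol group, and the last digit encodes the firm.

```python
def java_string_hash(s: str) -> int:
    """Java-style 32-bit signed string hash; a stand-in for Hive's hash() here."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - (1 << 32) if h >= (1 << 31) else h


def pmod(a: int, n: int) -> int:
    """Positive modulo, like Hive's pmod(); Python's % is already non-negative for n > 0."""
    return a % n


def hash_partition_number(symbol: str, firm: int) -> int:
    """((pmod(hash(symbol), 100) * 10) + pmod(firm, 10)), giving a value in 0..999."""
    return pmod(java_string_hash(symbol), 100) * 10 + pmod(firm, 10)


if __name__ == "__main__":
    # Hypothetical symbol "a" hashes to 97, so symbol group 97; firm 12345 ends in 5.
    print(hash_partition_number("a", 12345))  # 975
```

Because the firm contributes only the last digit, every record for firm 12345 lands in a partition number ending in 5, which is what the later query slides rely on.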
  28. 28. Made hive on EMR/S3 competitive
  29. 29. Partitions are great, but beware… select … from NEW_ORDERS where EVENT_DT between '2015-10-06' and '2015-10-09' and FIRM = 12345 and (pmod(HASH_PRTN_NB, 10) = pmod(12345, 10) or HASH_PRTN_NB = 1000) -- 1000 is always read • Using PMOD around the hash_prtn_nb prevents Hive from using a targeted query on the metastore, resulting in millions of partitions returned for pruning
  30. 30. Optimized query with enumeration select … from NEW_ORDERS where EVENT_DT >= '2015-10-06' and EVENT_DT <= '2015-10-09' and FIRM = 12345 and (HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15 … or HASH_PRTN_NB = 985 or HASH_PRTN_NB = 995 or HASH_PRTN_NB = 1000) -- 1000 is always read • Using an IN clause was insufficient to avoid the pruning issue • Explicitly enumerating all partitions vastly improved query planning time
  31. 31. Data security • Encryption of all data is required, both at rest and in transit • S3 server-side encryption was evaluated and determined to be suitable for purpose • Encrypt ephemeral storage on Master, Core, and Task nodes • Use a custom bootstrap action with LUKS with a random, memory-only key • Task nodes don't have HDFS, but Mapper and Reducer temporary files also need to be encrypted • Lose the server, lose the data: remember S3 is our source of truth • Use security groups to ensure only the client applications connect to the Master node • Hive authentication/authorization was not necessary for our usage scenarios • Evaluating transparent encryption (Hadoop 2.6+) in HDFS
  32. 32. Selection of the fittest • HDFS was cost prohibitive for our use cases • Need 30 D2.8XLs just to store two of our tables: ~$1.5M/yr on HDFS vs. ~$120K/yr on S3 • Need 90 D2.8XLs to store all queryable data: ~$4.5M/yr on HDFS vs. ~$360K/yr on S3 • Data locality is desirable but not practical for our scale • EMR & S3 with partitioned data is a great fit • Tuned queries & data structures on S3 take ~2X as long as they would on HDFS under perfect locality conditions • Localize data into HDFS on Core nodes using S3DistCp if making 3 or more passes • Consider tiered storage • External tables in Hive can have a blend of some partitions in HDFS and others in S3 • Introduces operational complexity for partition maintenance • Doesn't play well with shared metastore for multiple clusters
  33. 33. Darwin rules: Adaptation • Take advantage of new instance types • Find the right instance type(s) for your workload • Prefer a smaller cluster of larger nodes: e.g. 4XL • With millions of partitions, more memory is needed for the Master node (HS2) • Use CLI-based scripts rather than console → infrastructure is code
        Node Type   | Before         | After
        Master      | 1 - R3.4XL     | 1 - R3.2XL
        Core        | 40 - M3.2XL    | 10 - C3.4XL
        Task (peak) | 100 - M3.2XL   | 35 - C3.4XL
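The enumerated predicate on this slide is tedious to write by hand, so it is natural to generate it. A minimal Python sketch follows; the function name is illustrative, and it assumes the partitioning scheme from the earlier table-creation slide (100 symbol groups, the firm's last digit as the final digit, and partition 1000 always read).

```python
def firm_partition_predicate(firm: int, symbol_groups: int = 100,
                             always_read: int = 1000) -> str:
    """Build the explicit OR list of HASH_PRTN_NB values for a firm-only query."""
    firm_digit = firm % 10
    # One partition per symbol group, all sharing the firm's last digit.
    partitions = [group * 10 + firm_digit for group in range(symbol_groups)]
    partitions.append(always_read)  # partition 1000 is always read
    return "(" + " or ".join(f"HASH_PRTN_NB = {p}" for p in partitions) + ")"


if __name__ == "__main__":
    predicate = firm_partition_predicate(12345)
    print(predicate[:60] + " ...")
```

For firm 12345 this yields 101 terms, starting with `HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15` and ending with `HASH_PRTN_NB = 995 or HASH_PRTN_NB = 1000`, matching the shape of the query on the slide.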
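The cost comparison on this slide is simple arithmetic, and working it through shows that both scenarios imply the same per-node cost and the same HDFS-to-S3 ratio. The Python sketch below uses only the approximate figures from the slide; the variable names are illustrative.

```python
# Approximate annual figures from the slide (USD per year).
SCENARIOS = {
    "two tables": {"nodes": 30, "hdfs_cost": 1_500_000, "s3_cost": 120_000},
    "all queryable data": {"nodes": 90, "hdfs_cost": 4_500_000, "s3_cost": 360_000},
}


def cost_per_node(scenario: dict) -> float:
    """Annual HDFS cost divided by the number of D2.8XL nodes required."""
    return scenario["hdfs_cost"] / scenario["nodes"]


def hdfs_to_s3_ratio(scenario: dict) -> float:
    """How many times more expensive HDFS storage is than S3 for the same data."""
    return scenario["hdfs_cost"] / scenario["s3_cost"]


if __name__ == "__main__":
    for name, s in SCENARIOS.items():
        print(f"{name}: ${cost_per_node(s):,.0f} per node per year, "
              f"HDFS is {hdfs_to_s3_ratio(s):.1f}x the S3 cost")
```

Both rows work out to about $50,000 per node per year and a 12.5x cost difference in favor of S3.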
  34. 34. Beat the incumbent
  35. 35. Right size your cluster • Transient use cases (ETL and batch analytics): size the cluster to complete within ten minutes of an hour boundary to optimize $$; use Spot when you have a flexible SLA to save $$; use On Demand or Reserved to meet SLA at predictable cost • Always On use case (interactive analytics): size Core based on HDFS needs (statistics, logging, etc.); reserve Master and Core nodes; resize # of Task nodes as demand changes; use Spot on Task nodes to save $$; keep a ratio of Core to Task of 1:5 to avoid bottlenecks; consider bidding Spot above the On Demand price to ensure greater stability
  36. 36. One metastore to rule them all • Consider creating a shared Hive metastore service • Fault tolerance & DR with Multi-AZ RDS • Offload metastore hydration of tables and partitions • Transient clusters initialize faster • Millions of partitions/day can take >7 min/day per table • Avoid duplicative effort by separate development teams • Separate metastores are needed for Hive 0.13.1, Hive 1.0, and Presto; however, you can locate them all on a single RDS instance • Utilize FINRA Data Management services to orchestrate metastore updates • Register new tables and partitions as the data arrives via notifications
  37. 37. Monitor, learn, and optimize • Utilize workload management: Fair Scheduler • Refactor your code as necessary to remove bottlenecks • Optimize transient clusters: size them so the workload finishes within 10 minutes of an hour boundary • Set hive.mapred.reduce.tasks.speculative.execution = FALSE when writing to external tables in S3 via MapReduce • Use broadcast joins when joining small tables (SET • EMR Step API works for simple job queuing; use Oozie for more complex jobs
  38. 38. The impact • Removed obstacles: “Before, data analysis of this magnitude required intervention from the technology team.” • Lowered the cost of curiosity: “Analysts are able to quickly obtain a full picture of what happens to an order over time, helping to inform decision making as to whether a rule violation has occurred.” • Elasticity allows us to process years of data in days as opposed to months and save money by using the Spot market • Separately optimize batch and interactive workloads without compromise • Increased team delivery velocity
  39. 39. Recap • Use Amazon S3 as your durable system of record • Use transient clusters as much as possible • Resize clusters and use Spot to more efficiently manage capacity, performance, & cost • Move to new instance families to take advantage of performance • Monitor to determine when to resize or change instance types • Share a persistent Hive metastore in RDS among multiple EMR clusters • Be prepared to switch your query engine or execution framework in the future • Budget time to experiment with new tools & engines at a scale that wasn't possible before
  40. 40. Related sessions • BDT208: A Technical Introduction to Amazon Elastic MapReduce (Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B) • BDT303: Running Spark & Presto on the Netflix Big Data Platform (Thursday, Oct 8, 11:00 AM - 12:00 PM, Palazzo F) • BDT309: Best Practices for Apache Spark on Amazon EMR (Thursday, Oct 8, 5:30 PM - 6:30 PM, Palazzo F) • BDT314: Big Data/Analytics on Amazon EMR & Amazon Redshift (Thursday, Oct 8, 1:30 PM - 2:30 PM, Palazzo F)
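The advice to finish within ten minutes of an hour boundary follows from whole-hour billing. The Python sketch below assumes instance time is billed in whole hours, rounded up, as the slide's advice implies for that time; the helper names are illustrative, and the 1:5 check reads the slide's ratio as "at most five Task nodes per Core node", which is an interpretation.

```python
import math


def billed_instance_hours(runtime_minutes: float, node_count: int) -> int:
    """Instance-hours billed when every started hour counts as a full hour."""
    return math.ceil(runtime_minutes / 60) * node_count


def within_core_task_ratio(core_nodes: int, task_nodes: int, max_task_per_core: int = 5) -> bool:
    """Check the slide's 1:5 Core-to-Task guideline (interpreted as an upper bound)."""
    return task_nodes <= core_nodes * max_task_per_core


if __name__ == "__main__":
    # A job that just misses the hour boundary pays for a second hour on every node.
    print(billed_instance_hours(55, node_count=20))  # 20
    print(billed_instance_hours(65, node_count=20))  # 40
    print(within_core_task_ratio(core_nodes=10, task_nodes=35))  # True
```

A 65-minute run on 20 nodes is billed the same as a 120-minute run, so adding nodes to bring the runtime under the hour can cost less overall.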
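The speculative-execution setting on this slide can be expressed as a configuration object. The Python sketch below assumes the classification-based configuration format introduced with the EMR 4.x releases (a list of objects with `Classification` and `Properties`); the function name is illustrative, and how the JSON is passed to the cluster at launch is left out.

```python
import json


def hive_site_configuration(speculative_execution: bool = False) -> list:
    """Build a hive-site configuration that toggles reducer speculative execution."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                # Disabled so duplicate reducer attempts do not both write to S3.
                "hive.mapred.reduce.tasks.speculative.execution":
                    str(speculative_execution).lower(),
            },
        }
    ]


if __name__ == "__main__":
    print(json.dumps(hive_site_configuration(), indent=2))
```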
  41. 41. Remember to complete your evaluations!
  42. 42. Thank you!