
AWS May Webinar Series - Getting Started with Amazon EMR



Many AWS customers store vast amounts of data in Amazon S3, a low-cost, scalable, and durable object store; Amazon DynamoDB, a NoSQL database; or Amazon Kinesis, a real-time data stream processing service. With large datasets in various AWS services, how do you derive value from this information in a cost-effective way? Using Amazon Elastic MapReduce (Amazon EMR) with applications in the Apache Hadoop ecosystem, you can directly interact with data in each of these storage services for scalable analytics workloads or ad hoc queries. You can quickly and easily launch an Amazon EMR cluster from the AWS Management Console, and scale your cluster to match the compute and memory resources needed for your workflow, independently of the storage capacity used in your AWS storage services. The webinar will accelerate your use of Amazon EMR by showing you how to create and monitor Amazon EMR clusters, and provide several use cases and architectures for using Amazon EMR with different AWS data stores.

Learning Objectives:
• Recognize when to use Amazon EMR
• Understand the steps required to set up and monitor an Amazon EMR cluster
• Architect applications that effectively use Amazon EMR
• Understand how to use Hue for ad hoc query of data in Amazon S3

Who Should Attend:
• Developers, LOB owners, and Continuous Integration & Continuous Delivery (CI/CD) practitioners



  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz Senior Product Manager, Amazon EMR May 20, 2015 Getting Started with Amazon EMR Easy, fast, secure, and cost-effective Hadoop on AWS.
  2. 2. Agenda • Is Hadoop the answer? • Amazon EMR 101 • Integration with AWS storage and database services • Common Amazon EMR design patterns • Q+A
  3. 3. When leveraging your data to derive new insights, Big Data problems are everywhere • Data lacks structure • Analyzing streams of information • Processing large datasets • Warehousing large datasets • Flexibility for undefined ad hoc analysis • Speed of queries on large data sets
  4. 4. Hadoop is the right system for Big Data • Massively parallel • Scalable and fault tolerant • Flexibility for multiple languages and data formats • Open source • Ecosystem of tools • Batch and real-time analytics
  5. 5. Hadoop 1 vs. Hadoop 2 (architecture diagram). Hadoop 1: Storage (S3, HDFS); Batch MapReduce; Applications. Hadoop 2: Storage (S3, HDFS); YARN cluster resource management; Batch (MapReduce), Interactive (Tez), and In-Memory (Spark) engines; Applications (Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala).
  6. 6. Customers across many verticals
  7. 7. Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.
  8. 8. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Customize the cluster
  9. 9. Easy to Use Launch a cluster in minutes
  10. 10. Easy to deploy: use the AWS Management Console or the AWS Command Line Interface. You can also use the Amazon EMR API with your favorite SDK, or use AWS Data Pipeline to start your clusters.
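For reference, a minimal launch from the AWS CLI might look like the sketch below; the cluster name, key pair, AMI version, and instance settings are illustrative placeholders, not values from the webinar.

```shell
# Sketch: launch a small EMR cluster from the AWS CLI (all values are placeholders).
aws emr create-cluster \
  --name "GettingStartedCluster" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --applications Name=Hive Name=Pig
# The command prints a cluster ID (j-...), which you can pass to
# "aws emr describe-cluster --cluster-id <id>" to watch provisioning.
```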
  11. 11. Choose your instance types, and try different configurations to find your optimal architecture: General (m1 family, m3 family) for batch processing; CPU (c3 family, cc1.4xlarge, cc2.8xlarge) for machine learning; Memory (m2 family, r3 family) for Spark and interactive workloads; Disk/IO (d2 family, i2 family) for large HDFS.
  12. 12. Low Cost Pay an hourly rate
  13. 13. Mix on-demand and EC2 Spot capacity for low costs: Spot Instances for task nodes (up to 90% off Amazon EC2 on-demand pricing) and on-demand for core nodes (standard Amazon EC2 pricing for on-demand capacity). Meet your SLA at predictable cost, or exceed your SLA at lower cost.
  14. 14. Use multiple EMR instance groups. Example Amazon EMR cluster: Master node (r3.2xlarge); Slave group, Core (c3.2xlarge); Slave group, Task (m3.xlarge, EC2 Spot); Slave group, Task (m3.2xlarge, EC2 Spot). Core nodes run HDFS (DataNode); task nodes do not run HDFS. Core and task nodes each run YARN (NodeManager). The master node runs the NameNode (HDFS) and ResourceManager (YARN), and serves as a gateway.
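As a sketch, an instance-group layout like the one above can be expressed in a single CLI call; the counts and the Spot bid price below are made-up values for illustration, not recommendations.

```shell
# Sketch: one cluster, three instance groups; core on-demand, task on Spot.
# Instance counts and the Spot bid price are illustrative placeholders.
aws emr create-cluster --name "MixedCapacityCluster" --ami-version 3.8.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.2xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.2xlarge \
    InstanceGroupType=TASK,InstanceCount=6,InstanceType=m3.xlarge,BidPrice=0.10
```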
  15. 15. Elastic Easily add or remove capacity
  16. 16. Resizable clusters: it is easy to add and remove compute capacity in your cluster from the console, CLI, or API, so you can match compute demands with cluster sizing.
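Resizing from the CLI is a sketch like the following; the instance-group ID is a placeholder you would look up on your own cluster.

```shell
# Sketch: grow a task instance group on a running cluster to 10 nodes.
# The instance-group ID is a placeholder; find real IDs with
# "aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX".
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=10
```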
  17. 17. Use S3 instead of HDFS for your data layer to decouple your compute capacity and storage Amazon S3 Amazon EMR Shut down your EMR clusters when you are not processing data, and stop paying for them!
  18. 18. Reliable Spend less time monitoring
  19. 19. Easy to monitor and debug: monitor cluster, node, and I/O metrics with Amazon CloudWatch or Ganglia, and debug jobs from the EMR console.
  20. 20. EMR logging to S3 makes logs easily available
  21. 21. Secure Integrates with AWS security features
  22. 22. Use Identity and Access Management (IAM) roles with your Amazon EMR cluster • IAM roles give fine-grained control over delegating permissions to AWS services and access to AWS resources • EMR uses two IAM roles: the EMR service role is for the Amazon EMR control plane, and the EC2 instance profile is for the actual instances in the Amazon EMR cluster • Default IAM roles can be easily created and used from the AWS Console and AWS CLI
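A minimal CLI sketch of the default-role workflow; the role names are the AWS defaults the first command creates, and the other launch flags are placeholders.

```shell
# Sketch: create the default EMR service role and EC2 instance profile
# once per account, then reference them at cluster launch.
aws emr create-default-roles

aws emr create-cluster --name "SecureCluster" --ami-version 3.8.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole
```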
  23. 23. EMR security groups: default and custom. A security group is a virtual firewall that controls access to the EC2 instances in your Amazon EMR cluster • There is a single default master and default slave security group across all of your clusters • The master security group has port 22 open for SSHing to your cluster • You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies
  24. 24. Other Amazon EMR security features EMRFS encryption options • S3 server-side encryption • S3 client-side encryption (use AWS Key Management Service keys or custom keys) CloudTrail integration • Track Amazon EMR API calls for auditing Launch your Amazon EMR clusters in a VPC • Logically isolated portion of the cloud (Amazon Virtual Private Cloud) • Enhanced networking on certain instance types
  25. 25. Flexible Customize the cluster
  26. 26. Hadoop applications available in EMR
  27. 27. Use Hive on EMR to interact with your data in HDFS and Amazon S3 • Batch or ad hoc workloads • Integration with EMRFS for better performance reading and writing to S3 • SQL-like query language to make iterative queries easier • Schema-on-read to query data without needing pre-processing • Use Tez engine for faster queries
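A minimal HiveQL sketch of the schema-on-read pattern, assuming hypothetical tab-delimited logs in a placeholder S3 bucket:

```sql
-- Define a schema over log files already sitting in S3 (bucket/path are placeholders).
CREATE EXTERNAL TABLE access_logs (
  ip STRING,
  request_time STRING,
  url STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-log-bucket/access-logs/';

-- Schema-on-read: query immediately, with no load or pre-processing step.
SELECT status, COUNT(*) FROM access_logs GROUP BY status;
```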
  28. 28. Use Pig to easily create ETL workflows • Uses high-level “Pig Latin” language to easily script data transformations in Hadoop • Strong optimizer for workloads • Options to create robust user defined functions
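A sketch of such an ETL workflow in Pig Latin; the bucket paths and schema are hypothetical.

```pig
-- Hypothetical ETL: filter raw events and write aggregates back to S3.
raw     = LOAD 's3://my-bucket/raw-events/' USING PigStorage('\t')
          AS (ts:chararray, user:chararray, bytes:long);
valid   = FILTER raw BY bytes > 0;
by_user = GROUP valid BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(valid.bytes) AS total_bytes;
STORE totals INTO 's3://my-bucket/processed/user-totals/';
```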
  29. 29. Use HBase on a persistent EMR cluster as a scalable NoSQL database • Billions of rows and millions of columns • Backup to and restore from Amazon S3 • Flexible datatypes • Modify your HBase tables when adding new data to your system
  30. 30. Impala: a fast SQL query engine for EMR clusters • Low-latency SQL query engine for Hadoop • Perfect for fast ad hoc, interactive queries on structured or unstructured data • Can be easily installed on an EMR cluster, and queried from the CLI or a 3rd party BI tool • Perfect for memory-optimized instances • Currently uses HDFS as its data layer
  31. 31. Hadoop User Experience (Hue) Query Editor
  32. 32. Hue Job Browser
  33. 33. Hue File Browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
  34. 34. To install anything else, use Bootstrap Actions
  35. 35. Spark: an alternative engine to Hadoop MapReduce with its own ecosystem of applications • Does not use the MapReduce framework • In-memory for fast queries • Great for machine learning or other iterative workloads • Use Spark SQL to create a low-latency data warehouse • Spark Streaming for real-time workloads
  36. 36. Also use bootstrap actions to configure your applications: --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop with --keyword-config-file (merge values in the new config into the existing one) or --keyword-key-value (override the values provided). Configuration files and their shortcuts: core-site.xml (keyword core, file shortcut C, key-value shortcut c); hdfs-site.xml (hdfs, H, h); mapred-site.xml (mapred, M, m); yarn-site.xml (yarn, Y, y).
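Putting the shortcuts together, a launch that overrides one mapred-site.xml value might look like this sketch; the property and value are illustrative only, and the other flags are placeholders.

```shell
# Sketch: use the configure-hadoop bootstrap action with the key-value
# shortcut (-m) to override a mapred-site.xml property at cluster launch.
aws emr create-cluster --name "TunedCluster" --ami-version 3.8.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-action \
Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=["-m","mapred.tasktracker.map.tasks.maximum=4"]
```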
  37. 37. Submit work via the EMR Step API or SSH to the EMR master node. EMR Step API: a step can be a MapReduce job, Hive program, Pig script, or even an arbitrary script; easily submit a step from the console, CLI, or API; submit multiple steps to use EMR as a sequential workflow engine. Connect to the master node: connect to Hue, interact with application CLIs, or submit work directly to the Hadoop APIs; view the Hadoop UI; useful for long-running clusters and interactive use cases.
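Submitting a step from the CLI is a sketch like the following; the cluster ID and script path are placeholders.

```shell
# Sketch: submit a Hive script as a step to a running cluster.
# Cluster ID and the S3 script path are placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=HIVE,Name="Daily report",ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/scripts/daily-report.q]
```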
  38. 38. Let’s see it! Quick tour of the EMR Console and HUE on an EMR cluster
  39. 39. Diverse set of partners to use with Amazon EMR: BI / visualization, Hadoop distributions, data transfer, data transformation, monitoring, performance tuning, graphical IDEs, and ETL tools, available on AWS Marketplace or as a distribution in Amazon EMR.
  40. 40. Integration with AWS storage and database services
  41. 41. Choose your data stores
  42. 42. Amazon S3 as your persistent data store • Designed for 99.999999999% durability • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at the same data in Amazon S3 using the EMR File System (EMRFS)
  43. 43. EMRFS makes it easier to leverage Amazon S3 Better performance and error handling options Transparent to applications – just read/write to “s3://” Consistent view • For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata
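Consistent view is enabled per cluster at launch; a CLI sketch follows, where the retry settings are illustrative and the other flags are placeholders.

```shell
# Sketch: create a cluster with EMRFS consistent view enabled,
# retrying S3 list inconsistencies 5 times, 30 seconds apart.
aws emr create-cluster --name "ConsistentViewCluster" --ami-version 3.8.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30
```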
  44. 44. Consistent view and fast listing using the optional EMRFS metadata, stored in Amazon DynamoDB • List and read-after-write consistency • Faster list operations. Listing times without vs. with consistent view: 1,000,000 objects: 147.72 vs. 29.70; 100,000 objects: 12.70 vs. 3.69. *Tested using a single-node cluster with an m3.xlarge instance.
  45. 45. EMRFS support for Amazon S3 client-side encryption (diagram): Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; key vendor (AWS KMS or your custom key vendor); client-side encrypted objects in Amazon S3.
  46. 46. Amazon EMR integration with Amazon Kinesis • Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams • No intermediate data persistence required • Simple way to introduce real-time sources into batch-oriented systems • Multi-application support and automatic checkpointing
  47. 47. Use Hive with EMR to query data in DynamoDB • Export data stored in DynamoDB to Amazon S3 • Import data in Amazon S3 to DynamoDB • Query live DynamoDB data using SQL-like statements (HiveQL) • Join data stored in DynamoDB and export it or query against the joined data • Load DynamoDB data into HDFS and use it in your EMR job
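A HiveQL sketch of the DynamoDB integration; the table name, columns, and column mapping are hypothetical, while the storage handler class name follows EMR's DynamoDB integration.

```sql
-- Map a Hive table onto an existing DynamoDB table (names/columns are placeholders).
CREATE EXTERNAL TABLE orders_ddb (
  order_id STRING,
  customer STRING,
  total DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,customer:Customer,total:Total"
);

-- Query live DynamoDB data with HiveQL, or INSERT OVERWRITE to export it.
SELECT customer, SUM(total) FROM orders_ddb GROUP BY customer;
```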
  48. 48. Use AWS Data Pipeline and EMR to transform data and load into Amazon Redshift Unstructured Data Processed Data Pipeline orchestrated and scheduled by AWS Data Pipeline
  49. 49. Amazon EMR design patterns
  50. 50. Amazon EMR example #1: Batch processing GBs of logs pushed to Amazon S3 hourly Daily Amazon EMR cluster using Hive to process data Input and output stored in Amazon S3 250 Amazon EMR jobs per day, processing 30 TB of data
  51. 51. Using Amazon S3 and HDFS data sources: data aggregated and stored in Amazon S3; a transient EMR cluster runs batch MapReduce jobs for daily and weekly reports; a long-running EMR cluster holds data in HDFS for interactive, ad hoc Hive queries.
  52. 52. Multiple EMR workflows using the same S3 dataset: data flows from an input Amazon S3 bucket through S3DistCp to an intermediate Amazon S3 bucket, then through computations (Cascalog, LZO) to final Amazon S3 buckets. Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power dashboards on its Answers platform.
  53. 53. Amazon EMR example #2: Long-running cluster Data pushed to Amazon S3 Daily Amazon EMR cluster Extract, Transform, and Load (ETL) data into database 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data Front-end service uses HBase cluster to power dashboard with high concurrency
  54. 54. Amazon EMR example #3: Interactive query. TBs of logs sent daily; logs stored in Amazon S3; an Amazon EMR cluster using Presto for ad hoc analysis of the entire log set. Interactive query using Presto on a multi-petabyte warehouse.
  55. 55. EMR example #4: EMR for ETL and query engine for investigations which require all raw data TBs of logs sent daily Logs stored in S3 Hourly EMR cluster using Spark for ETL Load subset into Redshift DW Transient EMR cluster using Spark for ad hoc analysis of entire log set
  56. 56. EMR example #5: Streaming data. Pipeline stages: client/sensor, recording service, aggregator/sequencer, continuous processor, data warehouse, analytics and reporting.
  57. 57. The same pipeline built with common tools, such as Kafka: client/sensor, recording service, aggregator/sequencer, continuous processor, data warehouse, analytics and reporting.
  58. 58. Amazon Kinesis as the streaming data repository.
  59. 59. Amazon Kinesis + Amazon EMR = fewer moving parts: client/sensor, recording service with Log4J logging, Amazon Kinesis as the streaming data repository, aggregator/sequencer, Amazon EMR for data processing, continuous processor for the dashboard, data warehouse, analytics and reporting.
  60. 60. Processed output in real-time and batch workflows: input is pushed with Log4J into Amazon Kinesis; Hive, Pig, Cascading, and Spark on Amazon EMR pull from the stream; real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack; results flow to Amazon DynamoDB and the customer application.
  61. 61. AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services. Details • July 1, 2015 • Chicago, Illinois • @ McCormick Place Featuring • New product launches • 36+ sessions, labs, and bootcamps • Executive and partner networking Registration is now open • Come and see what AWS and the cloud can do for you.
  62. 62. CTA script: If you are interested in learning more about how to navigate the cloud to grow your business, then attend the AWS Summit Chicago, July 1st. Register today to learn from technical sessions led by AWS engineers, hear best practices from AWS customers and partners, and participate in some of the 30+ paid sessions and labs. Simply go to amps&trk=Webinar_slide to register today. Registration is FREE. Tracking code: listed above.
  63. 63. Thank you!