Hadoop in the cloud with AWS' EMR

Quick intro to and walkthrough of the AWS Elastic Map Reduce (EMR) service. Part of a larger course at http://bit.ly/get-hadoop

Published in: Technology

  1. Hadoop in the Cloud: AWS Elastic Map Reduce
     • What is EMR?
     • How does EMR compare to Hadoop?
     • Use cases
  2. EMR is an AWS Service
     • A review of AWS is helpful for understanding EMR
     • Infiniteskills offers a course!
       – http://bit.ly/learn-aws
     • AWS is constantly changing and evolving
       – http://aws.amazon.com/documentation/elasticmapreduce/
  3. EMR Overview
     • Abstracts out cluster setup & management
       – Integrated provisioning, tooling, debugging, monitoring
       – AWS constantly tuning and optimizing
       – Failed nodes automatically re-provisioned by AWS
     • Reduced costs
       – Clusters shut down automatically by default
       – Excellent for sporadic MapReduce needs
     • Integration with AWS
       – Leverage cost-effective EC2 instances for processing, S3 for storage
       – Monitoring done via CloudWatch
  4. EMR Architecture
     [Diagram: Master Instance Group (EC2); Core Instance Group (EC2 instances, each with HDFS); Task Instance Group (EC2 instances); S3]
     • Master group controls cluster
     • Core group runs DataNode & TaskTracker daemons
     • Task group runs tasks
     • Can be added & removed
     • S3 can be used for data input / output
     • Master group coordinates core + task activities and manages cluster state
     • Core + task instances read / write to / from S3
  5. EMR AWS Integration
     • Datastore pull / push to
       – RDS
       – DynamoDB
       – S3
     • Derived data can be stored in Redshift
       – Via AWS Data Pipeline
       – Further post-processing
     • Data can be pre-processed with Kinesis
  6. What you give up with EMR
     • Control
       – Always 2-3 months behind Hadoop releases
       – Cannot use CDH or HDP releases (although MapR is supported)
     • Speed (if you’re not an AWS customer)
     • Vendor lock-in
  7. EMR Use Cases
     • Already an AWS customer
       – Lots of data in S3 / DynamoDB / RDS
     • Sporadic MapReduce needs
     • Hadoop proofs of concept
     • Ease of use
       – Seamless, near-infinite scale
       – Simple administration
  8. Hadoop in the Cloud: AWS Elastic Map Reduce
     • What is EMR?
     • How does EMR compare to Hadoop?
     • Benefits & downsides
     • Use cases
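
The architecture and cost points in the slides above (master / core / task instance groups, S3 for input and output, clusters that shut down when the work is done) can be sketched as a cluster request. This is a minimal illustration, not part of the original deck: it only builds a plain dictionary in the shape that boto3's EMR `run_job_flow` call accepts and never contacts AWS. The bucket URIs, instance types, release label, cluster name, and the `cat` / `wc` streaming job are all assumptions chosen for illustration; the node counts follow the diagram on slide 4.

```python
import json
from typing import Any, Dict


def instance_group(role: str, instance_type: str, count: int) -> Dict[str, Any]:
    """Describe one EMR instance group; role is MASTER, CORE, or TASK."""
    if role not in ("MASTER", "CORE", "TASK"):
        raise ValueError(f"unknown instance role: {role}")
    return {
        "Name": f"{role.lower()}-group",
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def streaming_step(input_uri: str, output_uri: str) -> Dict[str, Any]:
    """A Hadoop streaming step that reads from and writes to S3."""
    return {
        "Name": "streaming-wc",  # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-mapper", "cat",   # pass every input line through unchanged
                "-reducer", "wc",   # count lines, words, and bytes
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


def build_cluster_request(input_uri: str, output_uri: str,
                          task_nodes: int = 4) -> Dict[str, Any]:
    """Build a request dict for a transient cluster (no AWS call is made)."""
    groups = [
        instance_group("MASTER", "m5.xlarge", 1),  # coordinates the cluster
        instance_group("CORE", "m5.xlarge", 2),    # HDFS storage plus tasks
    ]
    if task_nodes > 0:
        # Task nodes hold no HDFS data, so this group is optional.
        groups.append(instance_group("TASK", "m5.xlarge", task_nodes))
    return {
        "Name": "sporadic-mapreduce",   # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",   # assumed release; pick a current one
        "Instances": {
            "InstanceGroups": groups,
            # False means the cluster terminates once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [streaming_step(input_uri, output_uri)],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request(
        "s3://example-bucket/input/",    # hypothetical bucket
        "s3://example-bucket/output/",
    )
    print(json.dumps(request, indent=2))
```

With boto3 installed and AWS credentials and the default EMR roles in place, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient, which is the cost argument behind the "sporadic MapReduce needs" use case.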
