Challenges for running Hadoop on AWS - AdvancedAWS Meetup


Nowadays we've got all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, which puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters on AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention, and noisy neighbors), and touch on the future: how we should go about decoupling workload dispatch from cluster lifecycle.



  1. Challenges of running Hadoop on AWS June 12, 2014 @ AdvancedAWS Meetup - Citizen Space Andrei Savu - @andreisavu Software Engineer, Cloud Automation Team
  2. Overview ● Introduction ● Context ● Challenges ● Questions
  3. Andrei Savu Software Engineer Cloud Automation Team @ Cloudera Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)
  4. Cloud Automation Team @ Cloudera Focused on: ● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure ● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies etc.) We are hiring!
  5. Context ● Hadoop ● Types of Deployments ● Cluster Topology ● AWS
  6. Context: Hadoop Hadoop is a broad, coherent stack of products for data storage and processing. “Hadoop” is more than HDFS & MapReduce: it covers multiple storage systems, different query engines, batch and real-time processing etc. It usually runs on bare metal but is now moving towards cloud infrastructure.
  7. Types of Deployments Long running: - store data for analytics jobs with MapReduce, Impala, Spark - online data serving with HBase On-demand: - analytical workloads, fetching data on demand - triggered by workflows (1:1) - disconnected lifecycle (Netflix Genie)
  8. Cluster Topology #1 Simple: ● EC2 Classic (being phased out) ● VPC: single subnet, security group with an optional VPN Complex: ● VPC: multiple subnets & security groups ● DirectConnect ● highly available with disaster recovery ● multiple users & security
  9. Cluster Topology #2 ● Cloudera Reference Architecture for AWS Deployments: http://www.cloudera.com/content/cloudera/en/resources/library/whitepaper/cloudera-enterprise-reference-architecture-for-aws-deployments.html ● Best Practices for Deploying Cloudera Enterprise on AWS: enterprise-on-amazon-web-services/
  10. Amazon Web Services A paradigm shift in how we work with infrastructure. Key concept: software-defined - controlled by APIs. Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs etc.) Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.
  11. Challenges ● Instance Provisioning & Health ● Ensuring Idempotency ● Networking & Performance ● AMIs & Bootstrap Speed ● Data durability ● S3 integration
  12. What makes it more difficult? Compared to a typical stateless web application in an auto-scaling group, monitored via request latency or OS load averages: ● statefulness (think databases) ● each cluster has multiple processes playing different roles ● topology & configuration changes require orchestration ● knowledge of service inter-dependencies is required
  13. Instance Provisioning & Health Questions: ● How do you define your cluster size to deal with lack of capacity? ● How do you define health? Is that stable during setup? ● Is health a binary property? Or a threshold that needs to be continuously evaluated? Potential answers: ● match AWS semantics: define size as a range ● make simplifying assumptions (e.g. healthy during setup)
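The "define size as a range" answer mirrors EC2's own RunInstances semantics, which accept a minimum and maximum instance count. A minimal sketch of the idea in Python, where `launch_instances` is a hypothetical stand-in for the real provisioning API:

```python
# Sketch: treat cluster size as a range instead of a fixed number,
# mirroring EC2 RunInstances MinCount/MaxCount semantics.
# `launch_instances` is a hypothetical stand-in for the real API.

def launch_instances(requested, available_capacity):
    """Return as many instances as capacity allows, up to `requested`."""
    return min(requested, available_capacity)

def provision_cluster(min_size, max_size, available_capacity):
    """Ask for the maximum; succeed if we got at least the minimum."""
    acquired = launch_instances(max_size, available_capacity)
    if acquired < min_size:
        raise RuntimeError(
            f"insufficient capacity: got {acquired}, need at least {min_size}")
    return acquired

# With only 8 instances available, a request for 5..10 nodes still succeeds.
print(provision_cluster(min_size=5, max_size=10, available_capacity=8))  # 8
```

The design choice here is that partial capacity is a success as long as the cluster is still viable, which fits AWS semantics better than retrying until an exact count is reached.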
  14. Ensuring Idempotency Questions: ● How do you safely retry expensive calls? ● How do you build reliable workflows? Potential answers: ● client tokens, as described in the AWS User Guide ● Discuss: convergence vs. single-step retries
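The client-token mechanism can be illustrated with a small simulation. `FakeService` below is a hypothetical stand-in for a service like EC2 that deduplicates requests by token, so retrying an expensive call after a timeout cannot launch a second cluster:

```python
import uuid

# Sketch: client-token idempotency, in the style of EC2's ClientToken
# parameter. The "service" caches the first result per token, so a
# retried call replays the original result instead of acting twice.

class FakeService:
    def __init__(self):
        self._by_token = {}
        self.launches = 0

    def run_instances(self, client_token, count):
        if client_token in self._by_token:        # duplicate request
            return self._by_token[client_token]   # replay cached result
        self.launches += 1
        result = {"reservation_id": f"r-{self.launches}", "count": count}
        self._by_token[client_token] = result
        return result

svc = FakeService()
token = str(uuid.uuid4())                 # one token per logical request
first = svc.run_instances(token, count=3)
retry = svc.run_instances(token, count=3)  # e.g. after a client timeout
assert first == retry and svc.launches == 1  # no duplicate launch
```

The key discipline on the client side is generating the token once per logical operation and reusing it across retries, not generating a fresh token per attempt.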
  15. Networking & Performance Questions: ● What’s the ideal setup that’s both usable and secure? ● How do you get consistent intra-cluster performance? Potential answers: ● VPC with VPN or DirectConnect. Placement groups help. ● Security model: initially it was just perimeter security; now it can do a lot more (disk encryption, SSL, Kerberos)
  16. Images & Bootstrap Speed Questions: ● Do you allow custom AMIs or force your own choices? ● If using custom AMIs, how can you reduce bootstrap time? Potential answers: ● Custom AMIs are common - integrated with existing infra ● Fast bootstrap by baking on top with custom bits
  17. Data durability Questions: ● How do you place replicas? Datacenter topology? ● How are instances distributed across different failure domains? Potential answers: ● ignore, or go with large instances that map 1:1 to hosts ● would be nice to have: a way to influence host-to-instance allocation or to get datacenter topology data
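One practical way to give HDFS some failure-domain awareness on AWS is a topology script that maps each instance to a "rack" derived from its availability zone. A sketch of such a script, where the IP-to-AZ table is a hypothetical stand-in for metadata you would fetch from the EC2 API:

```python
# Sketch: an HDFS topology script that maps instance IPs to AZ-based
# "racks", so block replicas spread across availability zones.
# IP_TO_AZ is a hypothetical stand-in for data from the EC2 API.

IP_TO_AZ = {
    "10.0.1.11": "us-east-1a",
    "10.0.1.12": "us-east-1a",
    "10.0.2.21": "us-east-1b",
}

def resolve(ip):
    # Hadoop expects a /rack-style path; unknown hosts go to a default rack.
    az = IP_TO_AZ.get(ip)
    return f"/{az}" if az else "/default-rack"

if __name__ == "__main__":
    import sys
    # Hadoop invokes the topology script with one or more IPs/hostnames
    # as arguments and reads the rack paths from stdout.
    print(" ".join(resolve(ip) for ip in sys.argv[1:]))
```

This is only an approximation of real datacenter topology, which is the point the slide makes: without host-level placement data, availability zones are the coarsest failure domain you can act on.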
  18. S3 integration Questions: ● How do you reconcile the differences in semantics with HDFS? (strongly consistent vs. eventually consistent) ● How do you get the most out of it in terms of performance? Potential answers: ● we’ve done a fair amount of work improving S3 support in the open source (features, stability improvements, security etc.) ● performance is network-bound
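At the time of this talk, S3 offered only eventual consistency for some operations, so a freshly written key could be temporarily invisible to readers. The coping pattern is a bounded retry on read, sketched here against a toy store that simulates delayed visibility (all names are illustrative, not a real S3 client):

```python
# Sketch: retrying reads to cope with eventual consistency, as S3
# exhibited at the time of this talk. The store below simulates a key
# that only becomes visible after a few reads (standing in for
# replication lag).

class EventuallyConsistentStore:
    def __init__(self, lag_reads=2):
        self._data = {}
        self._lag = {}
        self._lag_reads = lag_reads

    def put(self, key, value):
        self._data[key] = value
        self._lag[key] = self._lag_reads

    def get(self, key):
        if self._lag.get(key, 0) > 0:
            self._lag[key] -= 1
            return None          # write not yet visible
        return self._data.get(key)

def get_with_retry(store, key, attempts=5):
    for _ in range(attempts):
        value = store.get(key)
        if value is not None:
            return value
        # real code would sleep with exponential backoff here
    raise KeyError(f"{key} not visible after {attempts} attempts")

store = EventuallyConsistentStore(lag_reads=2)
store.put("part-00000", b"rows")
assert get_with_retry(store, "part-00000") == b"rows"
```

Since December 2020 S3 provides strong read-after-write consistency, so this workaround is largely historical, but it captures the semantic gap with HDFS that the slide describes.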
  19. Thanks! Questions? Andrei Savu - Twitter: @andreisavu Join us to take Hadoop to the clouds!