Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Nowadays we have all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters in AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention and noisy neighbors), and touch on the future: how we should go about disconnecting workload dispatch from cluster lifecycle.

Published in: Engineering, Technology

Transcript of "Challenges for running Hadoop on AWS - AdvancedAWS Meetup"

  1. Challenges of running Hadoop on AWS
     June 12, 2014 @ AdvancedAWS Meetup - Citizen Space
     Andrei Savu - @andreisavu
     Software Engineer, Cloud Automation Team
  2. Overview
     ● Introduction
     ● Context
     ● Challenges
     ● Questions
  3. Andrei Savu
     Software Engineer, Cloud Automation Team @ Cloudera
     Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)
  4. Cloud Automation Team @ Cloudera
     Focused on:
     ● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure
     ● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies etc.)
     We are hiring!
  5. Context
     ● Hadoop
     ● Types of Deployments
     ● Cluster Topology
     ● AWS
  6. Context: Hadoop
     Hadoop is a broad, coherent stack of products for data storage and processing.
     “Hadoop” is more than HDFS & MapReduce. It covers multiple storage systems, different query engines, batch and real-time processing etc.
     Usually running on bare metal, now moving towards cloud infrastructure.
  7. Types of Deployments
     Long running:
     - store data for analytics jobs with MapReduce, Impala, Spark
     - online data serving with HBase
     On-demand:
     - analytical workloads, fetch data on-demand
     - triggered by workflows (1:1)
     - disconnected lifecycle (Netflix Genie)
  8. Cluster Topology #1
     Simple:
     ● EC2 classic (being phased out)
     ● VPC: single subnet, security group with an optional VPN
     Complex:
     ● VPC: multiple subnets & security groups
     ● DirectConnect
     ● highly available with disaster recovery
     ● multiple users & security
  9. Cluster Topology #2
     ● Cloudera Reference Architecture for AWS Deployments: http://www.cloudera.com/content/cloudera/en/resources/library/whitepaper/cloudera-enterprise-reference-architecture-for-aws-deployments.html
     ● Best Practices for Deploying Cloudera Enterprise on AWS: http://blog.cloudera.com/blog/2014/02/best-practices-for-deploying-cloudera-enterprise-on-amazon-web-services/
  10. Amazon Web Services
      Paradigm shift in how we work with infrastructure.
      Key concept: software defined, controlled by APIs.
      Has most of the things we need for storage and high performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs etc.)
      Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.
  11. Challenges
      ● Instance Provisioning & Health
      ● Ensuring Idempotency
      ● Networking & Performance
      ● AMIs & Bootstrap Speed
      ● Data durability
      ● S3 integration
  12. What makes it more difficult?
      … versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages
      ● statefulness (think databases)
      ● each cluster has multiple processes playing different roles
      ● topology & configuration changes require orchestration
      ● knowledge of service inter-dependencies is required
  13. Instance Provisioning & Health
      Questions:
      ● How do you define your cluster size to deal with a lack of capacity?
      ● How do you define health? Is that stable during setup?
      ● Is health a binary property? Or a threshold that needs to be continuously evaluated?
      Potential answers:
      ● match AWS semantics: define size as a range
      ● make simplifying assumptions (e.g. healthy during setup)
  14. Ensuring Idempotency
      Questions:
      ● How do you safely retry expensive calls?
      ● How do you build reliable workflows?
      Potential answers:
      ● use a client token, as described in the AWS User Guide
      ● Discuss: convergence vs. single-step retries
  15. Networking & Performance
      Questions:
      ● What’s the ideal setup that’s both usable and secure?
      ● How do you get consistent intra-cluster performance?
      Potential answers:
      ● VPC with VPN or DirectConnect. Placement groups help.
      ● Security model: initially it was just perimeter security, now it can do a lot more (disk encryption, SSL, Kerberos)
  16. Images & Bootstrap Speed
      Questions:
      ● Do you allow custom AMIs or force your own choices?
      ● If using custom AMIs, how can you reduce bootstrap time?
      Potential answers:
      ● Custom AMIs are common - integrated with existing infra
      ● Fast bootstrap by baking on top with custom bits
  17. Data durability
      Questions:
      ● How do you place replicas? Datacenter topology?
      ● How are instances distributed in different failure domains?
      Potential answers:
      ● ignore or go with large instances that map 1:1 to hosts
      ● would be nice to have: a way to influence host to instance allocation or to get datacenter topology data
  18. S3 integration
      Questions:
      ● How do you reconcile differences in semantics with HDFS? (strong consistency vs. eventual consistency)
      ● How do you get the most out of it in terms of performance?
      Potential answers:
      ● we’ve done a fair amount of work improving S3 support in open source (features, stability improvements, security etc.)
      ● performance is network bound
  19. Thanks! Questions?
      Andrei Savu - asavu@cloudera.com
      Twitter: @andreisavu
      Join us to take Hadoop to the clouds!
      https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
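
Slide 13 suggests matching AWS semantics by defining cluster size as a range rather than a fixed number (the EC2 RunInstances call accepts a minimum and a maximum count). A minimal sketch of that idea follows. The names `SizeRange`, `InsufficientCapacity`, `resolve_cluster_size` and `is_healthy` are hypothetical and not taken from the talk or from any Cloudera tool, and treating health as "at least the minimum number of healthy nodes" is only one possible reading of the threshold question on the slide.

```python
from dataclasses import dataclass


class InsufficientCapacity(Exception):
    """Raised when fewer instances are available than the minimum requires."""


@dataclass(frozen=True)
class SizeRange:
    # Hypothetical type: smallest acceptable and largest desired cluster size.
    min_count: int
    max_count: int

    def __post_init__(self):
        if not 0 < self.min_count <= self.max_count:
            raise ValueError("expected 0 < min_count <= max_count")


def resolve_cluster_size(requested: SizeRange, available: int) -> int:
    """Launch as many instances as possible within the range, or fail."""
    if available < requested.min_count:
        raise InsufficientCapacity(
            f"need at least {requested.min_count}, only {available} available"
        )
    return min(available, requested.max_count)


def is_healthy(requested: SizeRange, healthy_nodes: int) -> bool:
    """Assumption: health is a threshold, re-evaluated whenever counts change."""
    return healthy_nodes >= requested.min_count


if __name__ == "__main__":
    size = SizeRange(min_count=8, max_count=10)
    print(resolve_cluster_size(size, available=9))
    print(is_healthy(size, healthy_nodes=7))
```

With a range, a partial capacity shortage shrinks the cluster instead of failing the whole deployment, and the same lower bound can double as the health threshold.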
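
Slide 14 points at client tokens as a way to retry expensive calls safely. The sketch below illustrates the pattern against an in-memory stand-in rather than the real EC2 API: `FakeProvisioner` and `launch_with_retries` are hypothetical names, and the simulated failure (the launch succeeds but the response is lost) is an assumption chosen to show why a token is needed.

```python
import uuid


class FakeProvisioner:
    """In-memory stand-in for a cloud API that honors client tokens."""

    def __init__(self, fail_first_n=0):
        self._by_token = {}
        self._failures_left = fail_first_n
        self.launch_count = 0  # how many times instances were really created

    def run_instances(self, count, client_token):
        # A token that was seen before returns the original result
        # instead of launching a second batch of instances.
        if client_token in self._by_token:
            return self._by_token[client_token]
        self.launch_count += 1
        ids = [f"i-{uuid.uuid4().hex[:8]}" for _ in range(count)]
        self._by_token[client_token] = ids
        if self._failures_left > 0:
            self._failures_left -= 1
            raise TimeoutError("response lost after the launch succeeded")
        return ids


def launch_with_retries(api, count, attempts=3):
    """Retry a launch while reusing one client token for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    token = str(uuid.uuid4())  # generated once, never per attempt
    last_error = None
    for _ in range(attempts):
        try:
            return api.run_instances(count, client_token=token)
        except TimeoutError as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    api = FakeProvisioner(fail_first_n=1)
    print(len(launch_with_retries(api, count=3)), api.launch_count)
```

The key design choice is generating the token outside the retry loop: a fresh token per attempt would turn every retry into a new, billable launch.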
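
Slide 17 asks how replicas should be placed when instances are spread over different failure domains. As an illustration only, the sketch below picks replica hosts round-robin across domains so that copies are spread as widely as the topology allows. It is not HDFS's actual block placement policy, and the function name `pick_replica_hosts` and the domain and host labels are hypothetical; it also assumes the topology data the slide wishes for is available.

```python
def pick_replica_hosts(hosts_by_domain, replicas=3):
    """Pick (domain, host) pairs round-robin across failure domains."""
    # Copy the host lists so the caller's topology map is not modified.
    queues = {domain: list(hosts) for domain, hosts in sorted(hosts_by_domain.items())}
    chosen = []
    while len(chosen) < replicas:
        progressed = False
        for domain, hosts in queues.items():
            if hosts and len(chosen) < replicas:
                chosen.append((domain, hosts.pop(0)))
                progressed = True
        if not progressed:
            raise ValueError("not enough hosts for the requested replica count")
    return chosen


if __name__ == "__main__":
    topology = {"domain-a": ["h1", "h2"], "domain-b": ["h3"]}
    print(pick_replica_hosts(topology, replicas=3))
```

When the topology is unknown, every instance effectively lands in a single domain and the function degrades to picking the first hosts in order, which mirrors the "ignore" answer on the slide.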