Automated Hadoop Clusters on EC2
Mark Kerzner
SHMsoft
What is Hadoop? :) :) :)
Everybody knows that

... What is your definition?
What is a cloud?
Everybody knows that, but

1.   Elastic resources
2.   Internet delivery
3.   SaaS
4.   Virtualization
5.   Device-enabled
6.   Only (1) or all of the above
You are the Hadoop programmer
... and you need tools

What are your alternatives?
● IDE
● Local "cluster"
● Pseudo-distributed cluster
● EC2
You are the Hadoop programmer
... and you need tools

What are your alternatives?
● IDE - compile and run the code
● Local "cluster" - local file system
● Pseudo-distributed cluster - test outside
● EC2 - test on the cluster, test for scale
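Of the options above, the pseudo-distributed cluster is the one that needs a config change: one machine, but real HDFS and MapReduce daemons. A minimal sketch, assuming a Hadoop 0.20-era install and its classic property names; `CONF_DIR` is an illustrative path, point it at your real `conf/` directory.

```shell
# Sketch: generate a pseudo-distributed Hadoop 0.20-style config.
# CONF_DIR is a placeholder; use your real Hadoop conf/ directory.
CONF_DIR=${CONF_DIR:-/tmp/hadoop-conf-demo}
mkdir -p "$CONF_DIR"

# Use HDFS instead of the local file system (single NameNode on localhost)
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# One machine, so keep a single copy of each block
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```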
What are your resources
●   Tom White, "Hadoop, the Definitive Guide"
●   www.hadoopilluminated.com
For real play, you need a cluster
Hadoop+ (oh, by the way...)
HBase, Cassandra, MongoDB, NoSQL,
Dynamo, BigTable, Dryad (MS), Azure (MS),
MapReduce, MapR (EMC), Cloudera
distribution, EMC distribution, IBM distribution...
Whirr
Setup

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...


Install
curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1

Generate key

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

Run
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
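When you are done experimenting, Whirr's `destroy-cluster` command (the documented counterpart to `launch-cluster`) tears everything down, run from the same `whirr-0.7.1` directory:

```shell
# Stop the instances and clean up, so EC2 billing stops;
# uses the same recipe you launched with.
bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties
```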
Whirr limitations
● No EBS
● All or nothing
● Generates configuration artifacts
● Takes over your computer, no more local
  development - uses proxy
● Hard to customize
Amazon EMR
EMR limitations
●   No choice of image
●   Fixed architecture
●   Hard to debug
●   Hard to customize
You do it
Repeat the manual procedure, only automate it

Prepare
AMI, Java, Hadoop

On-the-fly
Start AMI, login, configure, start services,
verify, run test jobs
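The "start AMI, login, verify" sequence has to tolerate instances that are still booting. One way to sketch that is a small retry helper; the AMI id, key, and host names in the usage comments are hypothetical placeholders, not real resources.

```shell
# retry N S CMD...: run CMD up to N times, sleeping S seconds between
# attempts; returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Illustrative automation sequence (placeholders, not executed here):
#   ec2-run-instances $MY_AMI -k $MY_KEY                       # start AMI
#   retry 30 10 ssh -i $MY_KEY_FILE hadoop@$HOST true          # wait for login
#   ssh hadoop@$HOST 'bin/start-dfs.sh && bin/start-mapred.sh' # start services
#   retry 12 5 ssh hadoop@$HOST 'hadoop fs -ls /'              # verify HDFS is up
```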
You do it - advanced

On startup

Under-provision, over-provision, progress

On-the-fly

Monitor, run test jobs, watch for cluster
deterioration
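Watching for deterioration can be as simple as periodically counting live datanodes. The parser below assumes the 0.20-era `hadoop dfsadmin -report` summary line ("Datanodes available: N (...)"); treat that exact format as an assumption to check against your Hadoop version.

```shell
# live_nodes: read `hadoop dfsadmin -report` output on stdin and print
# the live datanode count (assumes the 0.20-era summary line format).
live_nodes() {
  sed -n 's/^Datanodes available: \([0-9]*\).*/\1/p'
}

# Illustrative monitoring check (not executed here):
#   EXPECTED=10
#   live=$(hadoop dfsadmin -report | live_nodes)
#   if [ "$live" -lt "$EXPECTED" ]; then
#     echo "cluster degraded: $live of $EXPECTED datanodes live" >&2
#   fi
```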
Cloudera Manager
MapR Manager
On the large scale
Hadoop 0.20 - up to 4,000 nodes
Hadoop 0.23 - up to 20,000 nodes
GridGain - hundreds of thousands
Thank you
Questions?
