Automated Hadoop Clusters on EC2
Mark Kerzner
SHMsoft
What is Hadoop? :) :) :)
Everybody knows that

... What is your definition?
What is a cloud?
Everybody knows that, but

1.   Elastic resources
2.   Internet delivery
3.   SaaS
4.   Virtualization
5.   Device-enabled
6.   Only (1) or all of the above
You are the Hadoop programmer
... and you need tools

What are your alternatives?
● IDE
● Local "cluster"
● Pseudo-distributed cluster
● EC2
You are the Hadoop programmer
... and you need tools

What are your alternatives?
● IDE - compile and run the code
● Local "cluster" - local file system
● Pseudo-distributed cluster - test outside
● EC2 - test on the cluster, test for scale
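Of the options above, the pseudo-distributed cluster is the one that needs a config change: one machine, but real HDFS and MapReduce daemons. A minimal sketch, assuming a Hadoop 0.20-era install and its classic property names; `CONF_DIR` is an illustrative path, point it at your real `conf/` directory.

```shell
# Sketch: generate a pseudo-distributed Hadoop 0.20-style config.
# CONF_DIR is a placeholder; use your real Hadoop conf/ directory.
CONF_DIR=${CONF_DIR:-/tmp/hadoop-conf-demo}
mkdir -p "$CONF_DIR"

# Use HDFS instead of the local file system (single NameNode on localhost)
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# One machine, so keep a single copy of each block
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```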
What are your resources
●   Tom White, "Hadoop, the Definitive Guide"
●   www.hadoopilluminated.com
For real play, you need a cluster
Hadoop+ (oh, by the way...)
HBase, Cassandra, MongoDB, NoSQL,
Dynamo, BigTable, Dryad (MS), Azure (MS),
MapReduce, MapR (EMC), Cloudera
distribution, EMC distribution, IBM distribution...
Whirr
Setup

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...


Install
curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1

Generate key

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

Run
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
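When you are done experimenting, Whirr's `destroy-cluster` command (the documented counterpart to `launch-cluster`) tears everything down, run from the same `whirr-0.7.1` directory:

```shell
# Stop the instances and clean up, so EC2 billing stops;
# uses the same recipe you launched with.
bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties
```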
Whirr limitations
● No EBS
● All or nothing
● Generates configuration artifacts
● Takes over your computer, no more local
  development - uses proxy
● Hard to customize
Amazon EMR
EMR limitations
●   No choice of image
●   Fixed architecture
●   Hard to debug
●   Hard to customize
You do it
Repeat the manual procedure, only automate it

Prepare
AMI, Java, Hadoop

On-the-fly
Start AMI, login, configure, start services,
verify, run test jobs
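The "start AMI, login, verify" sequence has to tolerate instances that are still booting. One way to sketch that is a small retry helper; the AMI id, key, and host names in the usage comments are hypothetical placeholders, not real resources.

```shell
# retry N S CMD...: run CMD up to N times, sleeping S seconds between
# attempts; returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Illustrative automation sequence (placeholders, not executed here):
#   ec2-run-instances $MY_AMI -k $MY_KEY                       # start AMI
#   retry 30 10 ssh -i $MY_KEY_FILE hadoop@$HOST true          # wait for login
#   ssh hadoop@$HOST 'bin/start-dfs.sh && bin/start-mapred.sh' # start services
#   retry 12 5 ssh hadoop@$HOST 'hadoop fs -ls /'              # verify HDFS is up
```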
You do it - advanced

On startup

Under-provision, over-provision, progress

On-the-fly

Monitor, run test jobs, watch for cluster
deterioration
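Watching for deterioration can be as simple as periodically counting live datanodes. The parser below assumes the 0.20-era `hadoop dfsadmin -report` summary line ("Datanodes available: N (...)"); treat that exact format as an assumption to check against your Hadoop version.

```shell
# live_nodes: read `hadoop dfsadmin -report` output on stdin and print
# the live datanode count (assumes the 0.20-era summary line format).
live_nodes() {
  sed -n 's/^Datanodes available: \([0-9]*\).*/\1/p'
}

# Illustrative monitoring check (not executed here):
#   EXPECTED=10
#   live=$(hadoop dfsadmin -report | live_nodes)
#   if [ "$live" -lt "$EXPECTED" ]; then
#     echo "cluster degraded: $live of $EXPECTED datanodes live" >&2
#   fi
```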
Cloudera Manager
MapR Manager
On the large scale
Hadoop 0.20 - up to 4,000 nodes
Hadoop 0.23 - up to 20,000 nodes
GridGain - hundreds of thousands
Thank you
Questions?
