1. Hadoop Cluster Configuration on AWS EC2
-----------------------------------------------------------------------------------------------------------
Launch eleven instances on Amazon AWS EC2: one master and ten slaves.
ec2-50-17-21-209.compute-1.amazonaws.com master
ec2-54-242-251-124.compute-1.amazonaws.com slave1
ec2-23-23-17-15.compute-1.amazonaws.com slave2
ec2-50-19-79-241.compute-1.amazonaws.com slave3
ec2-50-16-49-229.compute-1.amazonaws.com slave4
ec2-174-129-99-84.compute-1.amazonaws.com slave5
ec2-50-16-105-188.compute-1.amazonaws.com slave6
ec2-174-129-92-105.compute-1.amazonaws.com slave7
ec2-54-242-20-144.compute-1.amazonaws.com slave8
ec2-54-243-24-10.compute-1.amazonaws.com slave9
ec2-204-236-205-227.compute-1.amazonaws.com slave10
----------------------------------------------------------------------------------------------------------------------------
• Designate one instance as the master and the other ten as slaves.
----------------------------------------------------------------------------------------------------------------------------
• Make sure SSH works from the master to all slaves; a test loop follows.
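A quick way to test, assuming the EC2 key pair used to launch the instances has been copied to the master as ~/cluster.pem (a hypothetical name) and the slaveN aliases from the /etc/hosts step below are in place:
for i in $(seq 1 10); do
    ssh -i ~/cluster.pem ec2-user@slave$i hostname   # should print each slave's hostname
done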
----------------------------------------------------------------------------------------------------------------------------
• On the master, add each node's private IP, public DNS name, and short alias to /etc/hosts.
----------------------------------------------------------------------------------------------------------------------------
• The master's /etc/hosts file looks like this:
127.0.0.1 localhost localhost.localdomain
10.155.245.153 ec2-50-17-21-209.compute-1.amazonaws.com master
10.155.244.83 ec2-54-242-251-124.compute-1.amazonaws.com slave1
10.155.245.185 ec2-23-23-17-15.compute-1.amazonaws.com slave2
10.155.244.208 ec2-50-19-79-241.compute-1.amazonaws.com slave3
10.155.244.246 ec2-50-16-49-229.compute-1.amazonaws.com slave4
10.155.245.217 ec2-174-129-99-84.compute-1.amazonaws.com slave5
10.155.244.177 ec2-50-16-105-188.compute-1.amazonaws.com slave6
10.155.245.152 ec2-174-129-92-105.compute-1.amazonaws.com slave7
10.155.244.145 ec2-54-242-20-144.compute-1.amazonaws.com slave8
10.155.244.71 ec2-54-243-24-10.compute-1.amazonaws.com slave9
10.155.244.46 ec2-204-236-205-227.compute-1.amazonaws.com slave10
----------------------------------------------------------------------------------------------------------------------------
• The slaves' /etc/hosts files look like the listing below.
• Remove the 127.0.0.1 line on all slaves (a one-liner for this follows the listing).
10.155.245.153 ec2-50-17-21-209.compute-1.amazonaws.com master
10.155.244.83 ec2-54-242-251-124.compute-1.amazonaws.com slave1
10.155.245.185 ec2-23-23-17-15.compute-1.amazonaws.com slave2
10.155.244.208 ec2-50-19-79-241.compute-1.amazonaws.com slave3
10.155.244.246 ec2-50-16-49-229.compute-1.amazonaws.com slave4
10.155.245.217 ec2-174-129-99-84.compute-1.amazonaws.com slave5
10.155.244.177 ec2-50-16-105-188.compute-1.amazonaws.com slave6
10.155.245.152 ec2-174-129-92-105.compute-1.amazonaws.com slave7
10.155.244.145 ec2-54-242-20-144.compute-1.amazonaws.com slave8
10.155.244.71 ec2-54-243-24-10.compute-1.amazonaws.com slave9
10.155.244.46 ec2-204-236-205-227.compute-1.amazonaws.com slave10
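One way to drop the loopback entry on every slave from the master, assuming ec2-user has passwordless sudo (the Amazon Linux default):
for i in $(seq 1 10); do
    ssh slave$i "sudo sed -i '/^127\.0\.0\.1/d' /etc/hosts"   # delete the 127.0.0.1 line
done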
---------------------------------------------------------------------------------------------------------------------------
• Download a Hadoop release from the Apache Hadoop releases page and unpack it on the master
(e.g. /usr/local/hadoop-1.0.4); a download sketch follows.
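A minimal sketch; the archive.apache.org path below is an assumption and should be checked against the current release listing:
cd /usr/local
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz
tar -xzf hadoop-1.0.4.tar.gz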
----------------------------------------------------------------------------------------------------------------------------
• Open the hadoop-env.sh file in the hadoop-1.0.4/conf/ folder.
----------------------------------------------------------------------------------------------------------------------------
• Set the environment variables JAVA_HOME, HADOOP_HOME, LD_LIBRARY_PATH,
HADOOP_OPTS, and HADOOP_HEAPSIZE:
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
export HADOOP_HOME=/usr/local/hadoop-1.0.4/
export LD_LIBRARY_PATH=/usr/local/hadoop-1.0.4/lib/native/Linux-amd64-64
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
export HADOOP_HEAPSIZE=400000   # value is in MB
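To check that the variables are picked up, one can source the file and print the Hadoop version (a quick sanity check, not part of the original steps):
source /usr/local/hadoop-1.0.4/conf/hadoop-env.sh
$HADOOP_HOME/bin/hadoop version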
----------------------------------------------------------------------------------------------------------------------------
• Open the hdfs-site.xml file and set the following parameters:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.log.dir</name>
    <value>/media/ephemeral0/logs</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/media/ephemeral0/tmp-${user.name}</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/media/ephemeral0/data-${user.name}</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/media/ephemeral0/name-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified at create time.
    </description>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>536870912</value>
    <description>Default block size for new files: 536870912 bytes = 512 MB.</description>
  </property>
</configuration>
----------------------------------------------------------------------------------------------------------------------------
• Open the mapred-site.xml file and set the following parameters:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.log.dir</name>
    <value>/media/ephemeral0/logs</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>60000</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>14</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>14</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/media/ephemeral0/system-${user.name}</value>
    <description>System directory in which map and reduce tasks run.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.create.symlink</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>unlimited</value>
  </property>
</configuration>
----------------------------------------------------------------------------------------------------------------------------
• Open the core-site.xml file and set the following parameters:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/media/ephemeral0/data-${user.name}</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/media/ephemeral0/tmp-${user.name}</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/media/ephemeral0/name-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
----------------------------------------------------------------------------------------------------------------------------
• Open the masters file and add the master's alias:
master
----------------------------------------------------------------------------------------------------------------------------
• Open the slaves file and list all slave aliases:
slave1
slave2
slave3
slave4
slave5
slave6
slave7
slave8
slave9
slave10
----------------------------------------------------------------------------------------------------------------------------
• Give ownership to the ec2-user on all slaves for the /media folder (all folders used by
Hadoop); a loop for this follows.
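A minimal sketch run from the master, assuming ec2-user has passwordless sudo on the slaves:
for i in $(seq 1 10); do
    ssh slave$i "sudo chown -R ec2-user:ec2-user /media/ephemeral0"   # hand the Hadoop dirs to ec2-user
done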
----------------------------------------------------------------------------------------------------------------------------
• From the master, copy the full hadoop-1.0.4 folder to each slave, e.g.:
scp -r /usr/local/hadoop-1.0.4 slave1:/usr/local/hadoop-1.0.4
----------------------------------------------------------------------------------------------------------------------------
• Repeat the copy for every slave; a loop version follows.
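One way to do the copy in a single loop, using the slaveN aliases from /etc/hosts:
for i in $(seq 1 10); do
    scp -r /usr/local/hadoop-1.0.4 slave$i:/usr/local/hadoop-1.0.4
done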
----------------------------------------------------------------------------------------------------------------------------
• Open ports 50000-50100 in the security groups in the AWS console.
• Format the NameNode from the master and start the cluster:
hadoop namenode -format
start-all.sh
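To confirm the daemons came up (a sanity check, not part of the original steps): if the JDK's jps tool is available, it should list NameNode, SecondaryNameNode, and JobTracker on the master and DataNode and TaskTracker on each slave; the datanode count can be checked with:
hadoop dfsadmin -report   # should report 10 live datanodes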
----------------------------------------------------------------------------------------------------------------------------