Create your own Hadoop distributed cluster using 3 virtual machines. Linux (CentOS 6 or RHEL 6) can be used, along with Java and Hadoop binary distributions.
Setting up a HADOOP 2.2 Cluster on RHEL / CentOS 6
This article presents the steps to create a HADOOP 2.2 cluster on VMware Workstation 8/9/10. Following
is an outline of the installation process.
1. Clone and configure Virtual Machines for setup
2. Install and configure Java and HADOOP software on Master node
3. Copy Master node VM configuration to slave nodes
Let us start with the cluster configuration. We need at least 3 Virtual Machines: 1 Master node and 2
Slave nodes. All VMs have a similar configuration, as follows.
Processor – 2 CPU (dual core)
RAM – 2 GB
HDD – 100 GB
NIC – Virtual NIC
Virtual Machine (VM) Configuration
Create a virtual machine and install RHEL 6.2 on it. Following is the initial configuration done for this VM.
Hostname node1
IP Address 192.168.1.15
MAC Address 00:0C:29:11:66:D3
Subnet mask 255.255.255.0
Gateway 192.168.1.1
After configuring these settings, make a copy of this VM to be used for the other virtual machines. To
keep each VM unique, change its MAC address when cloning, and after booting, configure the IP
address and hostname as per the following table.
Step 1 – Clone and configure Virtual Machines for setup
Machine Role          MAC Address         IP Address     Hostname
HADOOP Master Node    00:0C:29:11:66:D3   192.168.1.15   master1
HADOOP Slave Node 1   00:50:56:36:EF:D5   192.168.1.16   slave1
HADOOP Slave Node 2   00:50:56:3B:2E:64   192.168.1.17   slave2
After setting up the first virtual machine, we need to configure a few initial settings, as per the
following details (commands for the first two are sketched below):
1. Disabling SELinux
2. Disabling Firewall
3. Host names, IP addresses and MAC addresses
A record of the above is good to keep for ready reference, as given in the table above.
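For example, on RHEL/CentOS 6 the first two settings can be applied as root with the following commands (a minimal sketch; relax or tighten as per your security policy):
# setenforce 0
# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# service iptables stop
# chkconfig iptables off
'setenforce 0' switches SELinux to permissive mode for the current session, the edit to /etc/selinux/config makes the change persist across reboots, and the last two commands stop the firewall now and keep it off after reboot.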
Configure Hosts for IP network communication
Add the following entries to /etc/hosts on all three nodes.
# vim /etc/hosts
192.168.1.15 master1
192.168.1.16 slave1
192.168.1.17 slave2
Create a user hadoop with password-less authentication
A user called hadoop is created on each node, and we log in as "hadoop" for all configuration and
management of the HADOOP cluster. Run the useradd and passwd commands on all three nodes; the key
generation and copying below are done once, on the master.
# useradd hadoop
# passwd hadoop
su - hadoop
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave2
chmod 0600 ~/.ssh/authorized_keys
exit
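To confirm that password-less authentication works, each of the following, run as the hadoop user, should print the remote hostname without prompting for a password:
ssh hadoop@master1 hostname
ssh hadoop@slave1 hostname
ssh hadoop@slave2 hostname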
Download Java binaries
Let us install Java from a tar file obtained from oracle.com, rather than by the rpm method.
# wget http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-i586.tar.gz?AuthParam=1386669648_7d41138392c2fe62a5ad481d4696b647
Java Installation using tarball
Java is a prerequisite for installing HADOOP on any system. The recommended Java versions for
HADOOP are listed on the Apache Foundation website, and we should go with a recommended version.
Following steps explain installation of Java on Linux using a tarball.
cd /opt/
tar xvf JDK_7u45_tar/jdk-7u45-linux-i586.tar.gz
cd jdk1.7.0_45/
alternatives --install /usr/bin/java java /opt/jdk1.7.0_45/bin/java 2
alternatives --config java
Output
[root@master1 opt]# cd jdk1.7.0_45/
[root@master1 jdk1.7.0_45]# alternatives --install /usr/bin/java java /opt/jdk1.7.0_45/bin/java 2
[root@master1 jdk1.7.0_45]# alternatives --config java
There are 3 programs which provide 'java'.
Selection Command
-----------------------------------------------
*+ 1 /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
2 /usr/lib/jvm/jre-1.5.0-gcj/bin/java
3 /opt/jdk1.7.0_45/bin/java
Enter to keep the current selection[+], or type selection number: 3
[root@master1 jdk1.7.0_45]# ll /etc/alternatives/java
lrwxrwxrwx 1 root root 25 Dec 10 16:03 /etc/alternatives/java -> /opt/jdk1.7.0_45/bin/java
[root@master1 jdk1.7.0_45]#
[root@master1 jdk1.7.0_45]# java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Client VM (build 24.45-b08, mixed mode)
[root@master1 jdk1.7.0_45]# export JAVA_HOME=/opt/jdk1.7.0_45/
[root@master1 jdk1.7.0_45]# export JRE_HOME=/opt/jdk1.7.0_45/jre
[root@master1 jdk1.7.0_45]# export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin
[root@master1 jdk1.7.0_45]#
Configure Java PATH
export JAVA_HOME=/opt/jdk1.7.0_45/
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin
After installing Java, its path needs to be persistent across reboots. The above settings can be appended to
/etc/profile so that they are common to all users.
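For example, the exports can be appended and activated as follows (a sketch, assuming the JDK was unpacked to /opt/jdk1.7.0_45 as above):
# cat >> /etc/profile <<'EOF'
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
EOF
# source /etc/profile
# java -version
The last command should report version "1.7.0_45", confirming the setting survives a new login shell.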
Installing HADOOP binaries
The "/opt" directory in Linux is provided for 3rd party applications.
# cd /opt/
[root@master1 hadoop]# wget http://hadoop-2.2.....tar.gz
# tar -xzf hadoop-2.2....tar.gz
# mv hadoop-2.2.0... hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/
[root@master1 ~]# ll /opt/
total 12
drwxr-xr-x 11 hadoop hadoop 4096 Jun 26 02:31 hadoop
[hadoop@master1 ~]$ ll /opt/hadoop/
total 2680
drwxr-xr-x 2 hadoop hadoop 4096 Jun 27 02:14 bin
drwxr-xr-x 3 hadoop hadoop 4096 Oct 6 2013 etc
-rwxrw-rw- 1 hadoop hadoop 2679682 Jun 26 02:29 hadoop-test.jar
drwxr-xr-x 2 hadoop hadoop 4096 Oct 6 2013 include
drwxr-xr-x 3 hadoop hadoop 4096 Oct 6 2013 lib
drwxr-xr-x 2 hadoop hadoop 4096 Jun 12 09:52 libexec
-rw-r--r-- 1 hadoop hadoop 15164 Oct 6 2013 LICENSE.txt
drwxrwxr-x 3 hadoop hadoop 4096 Jun 27 02:38 logs
-rw-r--r-- 1 hadoop hadoop 101 Oct 6 2013 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop 1366 Oct 6 2013 README.txt
drwxr-xr-x 2 hadoop hadoop 4096 May 18 04:55 sbin
drwxr-xr-x 4 hadoop hadoop 4096 Oct 6 2013 share
drwxrwxr-x 4 hadoop hadoop 4096 Jun 26 20:47 tmp
Configure the hadoop cluster using these steps on all nodes:
Log in as user hadoop and edit '~/.bashrc' as follows.
[hadoop@master1 ~]$ pwd
/home/hadoop
[hadoop@master1 ~]$ cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
export JAVA_HOME=/opt/jdk1.7.0_45
export HADOOP_INSTALL=/opt/hadoop
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
[hadoop@master1 ~]$
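After editing '~/.bashrc', reload it and confirm the hadoop binaries are on the PATH:
source ~/.bashrc
hadoop version
The second command should print the HADOOP version (2.2.0) if the paths above are correct.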
Configuring HADOOP, starting, and viewing status
Change folder to /opt/hadoop/etc/hadoop
Edit 'hadoop-env.sh' and set a proper value for JAVA_HOME, such as '/opt/jdk1.7.0_45'.
Do not leave it as ${JAVA_HOME}, as that does not work.
[hadoop@master1 ~]$ cd /opt/hadoop/etc/hadoop/
[hadoop@master1 hadoop]$ cat hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_45
Edit '/opt/hadoop/libexec/hadoop-config.sh' and prepend the following line at the start of the
script:
export JAVA_HOME=/opt/jdk1.7.0_45
Create Hadoop tmp directory
Use 'mkdir /opt/hadoop/tmp'
Edit 'core-site.xml' and add following between <configuration> and </configuration>:
[hadoop@master1 hadoop]$ cat core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
</property>
</configuration>
Edit 'mapred-site.xml' (copying it from 'mapred-site.xml.template' if it does not exist) and add
following between <configuration> and </configuration>:
[hadoop@master1 hadoop]$ cat mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
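Also edit 'hdfs-site.xml'; a minimal sketch, assuming a replication factor of 2 to match the two slave datanodes (dfs.replication defaults to 3 if unset):
[hadoop@master1 hadoop]$ cat hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>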
Edit 'yarn-site.xml' as follows:
[hadoop@master1 hadoop]$ cat yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master1:8040</value>
</property>
</configuration>
Copy Master node VM configuration to slave nodes
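One way to push the installation prepared on the master to the slaves (a sketch, assuming /opt on the slaves is writable over SSH by the copying user):
# scp -r /opt/hadoop slave1:/opt/
# scp -r /opt/hadoop slave2:/opt/
# ssh slave1 chown -R hadoop /opt/hadoop
# ssh slave2 chown -R hadoop /opt/hadoop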
Format the HDFS filesystem from the master node using 'hdfs namenode -format' (only the namenode needs to be formatted).
Do the following only on the master machine:
Edit the 'slaves' file so that it contains:
slave1
slave2
Note: If the master is also expected to serve as a datanode (store HDFS files), then add 'master1' to the
slaves file as well.
Run 'start-dfs.sh' and 'start-yarn.sh' commands
Run 'jps' and verify that 'ResourceManager', 'NameNode' and 'SecondaryNameNode' are running on the
master.
Run 'jps' on the slaves and verify that 'NodeManager' and 'DataNode' are running.
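For example, on the master the output should look roughly like this (process IDs will differ):
[hadoop@master1 ~]$ jps
2720 NameNode
2890 SecondaryNameNode
3035 ResourceManager
3355 Jps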
To stop all HADOOP services, run the 'stop-yarn.sh' and 'stop-dfs.sh' commands.
Web Access URLs for Services
After starting the HADOOP services, you can view and monitor their status using the following URLs.
Access NameNode at http://master1:50070 and ResourceManager at http://master1:8088
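As a quick functional check, the hadoop user on master1 can create and list a directory in HDFS (a minimal sketch; the path /user is just an example):
hdfs dfs -mkdir /user
hdfs dfs -ls /
The new directory should appear in the listing, and the NameNode web page should report both slaves as live datanodes.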