This document discusses the key configuration settings needed to set up a single-node Hadoop cluster and explains Hadoop's default configuration files and properties. The core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml configuration files need to be modified with properties like fs.default.name, dfs.replication, yarn.nodemanager.aux-services, and mapreduce.framework.name. The document provides examples of configuring properties for the namenode and datanode directories, block size, replication factor, and YARN-related settings. It recommends overriding the default properties as needed when setting up a single-node pseudo-distributed cluster.
ACADGILD
INTRODUCTION
Are you a Hadoop developer who wants to know the basics of configuring a Hadoop cluster? If yes, then
this blog will help you set up a single-node cluster on your machine right away!
This blog aims to provide a brief overview of the most important settings that need to be taken care of
for a successful installation.
What Is The Default Configuration In Hadoop?
This blog will guide you through the right settings to set up a single-node cluster step by step. Single-node
mode is usually used by developers to test their sample code.
When you download the Hadoop tar file and install it with the default settings, you get standalone mode.
All of Hadoop's xml files contain properties defined by Apache through which Hadoop understands
its limitations and responsibilities as well as its working behavior.
The links below give us the default property settings for all the configuration files that are needed
for Hadoop:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
The four files that need to be configured explicitly while setting up a single-node Hadoop cluster are:
•core-site.xml
•hdfs-site.xml
•yarn-site.xml
•mapred-site.xml
Overriding The Default xml Properties In site.xml Files
We can override explicit properties by configuring them in the above files.
Example:
https://acadgild.com/blog/key-configurations-in-hadoop-installation/
In Hadoop, the default replication factor is 3, but we can override that by setting the replication factor
to 1, explicitly configuring the property in hdfs-site.xml.
Overriding the default parameters optimizes the cluster, improves performance, and teaches one about
the internal working of the Hadoop ecosystem.
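As a minimal sketch, overriding the replication factor for a single-node cluster could look like this in hdfs-site.xml (dfs.replication is the standard property name from hdfs-default.xml):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <!-- Override the default replication factor of 3; a single node
         cannot hold more than one replica of each block anyway. -->
    <description>Default block replication.</description>
  </property>
</configuration>
```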
The default files listed above can either be overridden with explicit properties or used with their
default values in a Hadoop cluster.
How site.xml Overrides default.xml Settings
Hadoop’s jar files are available in the following path:
$HADOOP_HOME/share/hadoop/
[here HADOOP_HOME indicates path where Hadoop is installed]
Hadoop gets all the default configuration details, such as the default replication factor of 3,
from DFSClient.java inside one of the jar files under
$HADOOP_HOME/share/hadoop/
The default configuration files have a specific classpath from which they are always loaded by a
running Hadoop installation. Similarly, the modified site.xml files provided by the developer are loaded
from the classpath and checked for additional configuration objects, which are deployed into the existing
Hadoop ecosystem, overriding the default.xml files.
We will look through the xml files which we specifically need to alter at the time of a basic
installation of a single-node cluster.
Things Common To All xml Files
We can specify new values with tags like <property>, <name>, <description>, <final>, etc. inside the
predefined <configuration> tag. As Hadoop is an open-source framework, its maintainers have provided
the option to override features by declaring attributes inside the various site.xml files.
Settings To Be Done In core-site.xml
Some of the important properties are:
•Configuring the name node address
•Configuring the rack awareness factor
•Selecting the type of security
Refer to the snippet below for a representation of the above properties:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for
NDFS. </description>
<final>true</final>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
<description>
Set the authentication for the cluster. Valid values are: simple or kerberos.
</description>
</property>
<property>
<name>fs.trash.interval</name>
<value>0</value>
<description>Number of minutes between trash checkpoints.
If zero, the trash feature is disabled.
</description>
</property>
Here is a detailed description of the attribute below, which is mandatory when configuring a
Hadoop single-node cluster:
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
A filesystem path in Hadoop has two main components:
•A URI (Uniform Resource Identifier) that identifies the file system
•A path that specifies the location within that file system
Hadoop tries to find that path on the file system defined by fs.default.name
Syntax:
hdfs://<authority>:<port>
Hadoop tries to find the path on the HDFS instance whose namenode is running at <authority>:<port>.
If a user specifies both the URI and the path in a request, then the URI in the request
overrides fs.default.name, and Hadoop tries to find the path on the filesystem identified by the URI in
the request.
One of the important tasks governed by the fs.default.name filesystem is handling delete operations in
the Hadoop ecosystem.
Some of the commonly overridden properties are hadoop.security.authentication, fs.trash.interval,
and fs.default.name. The attributes we use while setting up a single-node cluster are explained here
with the help of these examples, which help us understand them better when sharing a
customized config.
Settings To Be Done In hdfs-site.xml
The properties inside this xml file deal with the storage procedure inside HDFS. Some of the
important properties:
•Configure port access
•Manage ssl client authentication
•Control network interfaces
•Change file permissions
Some of the commonly overridden properties are dfs.namenode.name.dir, dfs.datanode.data.dir,
dfs.block.size, dfs.replication, etc.
The attributes we use while setting up a single-node cluster are explained below.
Block replication can be configured using the setting below:
<name>dfs.replication</name>
<value>3</value>
If no replication factor is specified at file-creation time, this default (3) is used.
The maximum block replication is 512 and the minimum is 1.
We can change the replication factor on a per-file basis using the Hadoop FS shell:
$ hadoop fs -setrep -w 3 /my_file
To apply it to all files inside a directory:
$ hadoop fs -setrep -w 3 /my_dir
The namenode directory can be configured using:
<name>dfs.namenode.name.dir</name>
<value>/user/tom/Hadoop/namenode</value>
This takes the specified path for the namenode directory on the local filesystem, where the name
table is stored. If this is a comma-delimited list of directories, then the name table
is replicated in all of the directories, for redundancy. In case of any data loss, this redundancy helps in
recovering the lost data. This is where the replication factor comes in, which again defines how many
copies of a file are stored.
<name>dfs.datanode.data.dir</name>
<value>/user/tom/Hadoop/datanode</value>
This takes the specified path for the datanode directory on the local filesystem, where the DFS
datanode stores its blocks. If this is a comma-delimited list of directories, then data will be stored in
all the named directories, typically on different devices.
Block size can be configured using:
<name>dfs.block.size</name>
<value>134217728</value>
This changes the default block size for all files placed into HDFS; in this case, we set dfs.block.size to
128 MB. Changing this setting only affects the block size of files placed into HDFS after the
setting has taken effect.
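Putting the pieces together, a minimal hdfs-site.xml for a single-node cluster might look like the following sketch (the /user/tom/… paths are just the example locations used above; substitute your own):

```xml
<configuration>
  <!-- Local directory where the namenode stores the name table -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/user/tom/Hadoop/namenode</value>
  </property>
  <!-- Local directory where the datanode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/user/tom/Hadoop/datanode</value>
  </property>
  <!-- One replica is enough on a single node -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- 128 MB block size -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>
```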
The fsck command reports the replication factor along with other important details:
$ hdfs fsck /<path of file >/<name of file >
Settings In yarn-site.xml
Understanding yarn-site.xml is easier with some background on YARN and why it came into
existence in Hadoop v2.x.
In Hadoop v1.x, the TaskTracker and JobTracker handled the job of allocating resources to
processes.
YARN has ResourceManager settings which effects resource allocation with node manager and
application manager. Some of the important properties are:
•WebAppProxy Configuration
•MapReduce Configuration
•NodeManager Configuration
•ResourceManager Configuration
•IPC Configuration
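As an illustrative yarn-site.xml fragment for the ResourceManager group (yarn.resourcemanager.hostname is a standard Hadoop 2.x property; localhost is a placeholder suited to a single-node setup):

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>localhost</value>
</property>
```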
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
This tells the NodeManager that an auxiliary service named mapreduce_shuffle needs to be implemented.
After we tell the NodeManager to implement that service, we give it a class name as the means to
implement it. This particular configuration is what lets MapReduce do its shuffle, because
NodeManagers won't shuffle data for a non-MapReduce job, so we need to configure such a service for
MapReduce explicitly.
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
This property gives the NodeManager the class that performs the shuffle from the map
tasks to the reduce tasks.
Previously, the shuffle step was part of the MapReduce TaskTracker.
In YARN the shuffle is an auxiliary service and must be set in the configuration file.
Although it is possible to write your own shuffle handler by extending this class, it is
recommended that the default class be used.
Shuffle handler: a process that runs inside the YARN NodeManager. The REST server and many
third-party applications all use port 8080 by default, which will result in conflicts if you deploy more
than one at a time without reconfiguring the default port.
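Combining the two shuffle-related settings above, the corresponding yarn-site.xml fragment would look like this, wrapped in the <property> elements a real configuration file needs (note that in Hadoop 2.x releases the service name in both keys uses an underscore, mapreduce_shuffle):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```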
Some of the overridden name attributes are yarn.resourcemanager.am.max-attempts,
yarn.resourcemanager.proxy-user-privileges.enabled, yarn.nodemanager.aux-services,
yarn.nodemanager.aux-services.mapreduce.shuffle.class etc.
mapred-site.xml
When Hadoop runs any analysis of a dataset, the MapReduce runtime framework applies a vast
set of rules for assigning jobs to slaves and maintaining job records. In Hadoop 2.x, YARN was
introduced to help this framework work efficiently and to take over the workload of job-related
assignments. It is again a large unit of the Hadoop ecosystem, helping to run map and reduce tasks in
collaboration with YARN. Some of the important features it handles are:
•Node health script variables
•Proxy Configuration
•Job Notification Configuration
<name>mapreduce.framework.name</name>
<value>yarn</value>
The value of this attribute determines whether the MapReduce framework runs in local mode,
classic (MapReduce v1) mode, or YARN (MapReduce v2) mode. Local mode means the job is
run locally using the local JobRunner. If set to yarn, the job is submitted and executed via the YARN
cluster.
Some of the overridden name attributes are yarn.app.mapreduce.client.max-retries,
mapreduce.shuffle.port, mapreduce.job.tags, I/O properties.
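As a sketch, a minimal mapred-site.xml for a YARN-based single node contains just this property (an illustrative fragment; production files usually override more of the properties listed above):

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```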
All of the properties explained above cover the requirements for a single-node Hadoop cluster.
Follow the document at the link below to set up a pseudo-distributed single-node Hadoop cluster for a
deeper understanding.
https://drive.google.com/file/d/0Bxr27gVaXO5scjVxZDBzV3IwRVE/view?usp=sharing