Hadoop installation, Configuration, and Mapreduce program

Outline Introduction Hadoop Installation Hadoop Conﬁguration Starting & Stopping Map Reduce Execution
Big Data & Hadoop
D. Praveen Kumar
Research Scholar (Full-Time)
Department of Computer Science & Engineering
YSREC of Yogi Vemana University, Proddatur
Kadapa Dt., A. P, India
November 8, 2016
Bapatla Engineering College, Bapatla, Guntur
Big Data & Hadoop
November 8, 2016 Slide: 1 / 43

1 Introduction
2 Hadoop Installation
3 Hadoop Conﬁguration
4 Starting & Stopping
5 Map Reduce
6 Execution
Big Data & Hadoop

GUESTS =4
Transportation from railway station to your
home( one Auto/car is suﬃcient)
mom can prepare food or snacks without risk.
Your house is suﬃcient for Accommodation.
Facilities like bed, bathrooms, water and TV are
provided which you use.
You can talk to each other and crack jokes and
you can make them happy
Expenditure is nearly Rs.1000/-
Big Data & Hadoop

GUESTS =100
Transportation = 25 autos/car or two
buses
Food = catering.
Accommodation = Lodge.
Facilities = AC, TV, and all other facilities
Maintenance= somewhat diﬃcult
Expenditure =nearly Rs. 90,000/-
Big Data & Hadoop

GUESTS =10000
Transportation = 2500 autos or 500 buses
Food = catering.
Accommodation = all Lodges, function
halls and cottages in the town.
Facilities = AC, TV, and all other
facilities are somewhat diﬃcult to provide.
Maintenance= more diﬃcult
Expenditure =nearly Rs. 2,00,000/-
Big Data & Hadoop

GUESTS =10000000
Transportation=how many autos=?
Food =?
Accommodation =?
Facilities =?
Maintenance=?
Cost =?
Big Data & Hadoop

Problems
Same we assume in computing environment
Difficult to handle a huge and ever growing amount of data
Processing of data can not be possible with few machines
distributing large data sets is difficult
Construction of online or offline models are very difficult
Big Data & Hadoop

Solution
A single solution to all these problems is
Big Data & Hadoop

What is Big Data?
Big data refers to voluminous amounts of structured or
unstructured data that organizations can potentially mine and
analyze.
Big data is huge amount of large data sets characterized by
Big Data & Hadoop

Big Data Platforms and Analytical Software
Big Data & Hadoop

Hadoop
Here we go with
Why ?
Big Data & Hadoop

Hadoop
Apache Hadoop is an open-source software framework for
distributed storage and distributed processing of very large data
sets on computer clusters built from commodity hardware.
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common contains libraries and utilities needed by
other Hadoop modules
Hadoop Distributed File System (HDFS) a distributed
ﬁle-system that stores data
Hadoop YARN a resource-management platform
Hadoop MapReduce for large scale data processing.
Big Data & Hadoop

Hadoop Components
Big Data & Hadoop

Requirements
Necessary
Java >= 7
ssh
Linux OS (Ubuntu >=
14.04)
Hadoop framework
Optional
Eclipse
Internet connection
Big Data & Hadoop

Java 7 & Installation
Hadoop requires a working Java installation. However, using
java 1.7 or more is recommended.
Following command is used to install java in linux platform
sudo apt-get install openjdk-7-jdk (or)
sudo apt-get install default-jdk
Big Data & Hadoop

Java PATH Setup
We need to set JAVA path
Open the .bashrc ﬁle located in home directory
gedit ~/.bashrc
Add below line at the end:
export JAVA HOME=/usr/lib/jvm/java−7−openjdk−amd64
Big Data & Hadoop

Installation & Conﬁguration of SSH
Hadoop requires SSH(Secure Shell) access to manage its
nodes, i.e. remote machines plus your local machine if you
want to use Hadoop on it.
Install SSH using following command
sudo apt-get install ssh
First, we have to generate DSA an SSH key for user.
ssh-keygen -t dsa -P ’’ -f ~ /.ssh/id dsa
cat ~ /.ssh/id dsa.pub >> ~ /.ssh/authorized keys
Big Data & Hadoop

Download & Extract Hadoop
Download Hadoop from the Apache Download Mirrors
http://mirror.ﬁbergrid.in/apache/hadoop/common/
Extract the contents of the Hadoop package to a location of your
choice. I picked /usr/local/hadoop.
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.2.tar.gz
$ sudo mv hadoop-2.7.2 hadoop
Big Data & Hadoop

Add Hadoop conﬁguration in .bashrc
Add Hadoop conﬁguration in .bashrc in home directory.
export HADOOP INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP INSTALL/bin
export PATH=$PATH:$HADOOP INSTALL/sbin
export HADOOP MAPRED HOME=$HADOOP INSTALL
export HADOOP HDFS HOME=$HADOOP INSTALL
export HADOOP COMMON HOME=$HADOOP INSTALL
export YARN HOME=$HADOOP INSTALL
export HADOOP COMMON LIB NATIVE DIR=$HADOOP INSTALL/lib/native
export HADOOP OPTS="-Djava.library.path=$HADOOP INSTALL/lib"
Big Data & Hadoop

Create temp ﬁle, DataNode & NameNode
Execute below commands to create NameNode
mkdir -p /usr/local/hadoopdata/hdfs/namenode
Execute below commands to create DataNode
mkdir -p /usr/local/hadoopdata/hdfs/datanode
Execute below code to create the tmp directory in hadoop
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Big Data & Hadoop

Files to Configure
The following are the files we need to configure
core-site.xml
hadoop-env.sh
mapred-site.xml
hdfs-site.xml
Big Data & Hadoop

Add properties in /usr/local/hadoop/etc/core-site.xml
Add the following snippets between the
< configuration > ... < /configuration > tags in the core-site.xml
file.
Add below property to specify the location of tmp
< property >
< name > hadoop.tmp.dir < /name >
< value > /app/hadoop/tmp < /value >
< /property >
Add below property to specify the location of default file
system and its port number.
< property >
< name > fs.default.name < /name >
< value > hdfs : //localhost : 54310 < /value >
< /property >
Big Data & Hadoop

Add properties in /usr/local/hadoop/etc/hadoop-env.sh
Un-Comment the JAVA HOME and Give Correct Path For
Java.
export JAVA HOME=/usr/lib/jvm/java-7-openjdk-amd64
Big Data & Hadoop

Add property in
/usr/local/hadoop/etc/hadoop/mapred-site.xml
In ﬁle we add The host name and port that the MapReduce job
tracker runs at. Add following in mapred-site.xml :
< property >
< name > mapred.job.tracker < /name >
< value > localhost : 54311 < /value >
< /property >
Big Data & Hadoop

Add properties in ... etc/hadoop/hdfs-site.xml
In file hdfs-site.xml add following:
Add replication factor
< property >
< name > dfs.replication < /name >
< value > 1 < /value >
< /property >
Specify the NameNode
< property >
< name > dfs.namenode.name.dir < /name >
< value > file : /usr/local/hadoopdata/hdfs/namenode < /value >
< /property >
Specify the DataNode
< property >
< name > dfs.datanode.name.dir < /name >
< value > file : /usr/local/hadoopdata/hdfs/datanode < /value >
< /property >
Big Data & Hadoop

Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is
Formatting the Hadoop file system
We need to do this the first time you set up a Hadoop.
Do not format a running Hadoop filesystem as you will lose all
the data currently in HDFS
To format the filesystem, run the command
hadoop namenode -format
Big Data & Hadoop

Starting single-node cluster
Run the command:
start-all.sh
This will startup a NameNode,SecondaryNameNode,
DataNode, ResourceManager and a NodeManager on your
machine.
A nifty tool for checking whether the expected Hadoop
processes are running is jps
hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager
Big Data & Hadoop

Stopping your single-node cluster
Run the command
stop-all.sh
To stop all the daemons running on your machine output will be
like this.
stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode
Big Data & Hadoop

Map-Reduce Framework
Map Reduce programming paradigm
It relies basically on two functions, Map and Reduce
Map Reduce used to manage many large-scale computations
The framework takes care of scheduling tasks, monitoring
them and re-executes the failed tasks.
The framework to eﬀectively schedule tasks on the nodes
where data is already present
Big Data & Hadoop

Map-Reduce Computation Steps
The key-value pairs from each Map task are collected by a
master controller and sorted by key. The keys are divided
among all the Reduce tasks, so all key-value pairs with the
same key wind up at the same Reduce task.
The Reduce tasks work on one key at a time, and combine
all the values associated with that key in some way. The
manner of combination of values is determined by the code
written by the user for the Reduce function.
Big Data & Hadoop

Hadoop - MapReduce
Big Data & Hadoop

Hadoop - MapReduce (Word Count) Example
Big Data & Hadoop

MapReduce - WordCountMapper
In WordCountMapper class we perform the following operations
Read a line from ﬁle
Split line into Words
Assign Count 1 to each word
Big Data & Hadoop

WordCountMapper source code
public static class WordCountMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context ) throws
IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Big Data & Hadoop

MapReduce - WordCountReducer
In WordCountReducer class we perform the following operations
Sum the list of values
Assign sum to corresponding word
Big Data & Hadoop

WordCountReducer source code
public static class WordCountReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context ) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Big Data & Hadoop

WordCountJob
public class WordCountJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(WordCountJob.class);
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Big Data & Hadoop

Header Files to include
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
Big Data & Hadoop

Execution of Hadoop Program in Eclipse
Step1:
1 Starting Hadoop in terminal using command:
$ Start-all.sh
2 Use JPS command to check all services of Hadoop are started
or not.
Step 2: open Eclipse
Step 3: Go to ﬁle ⇒ New ⇒ Project
Select Java Project and click on Next button
Write project name and click on Finish button
Big Data & Hadoop

Continue...
Step 4: Right side it creates a project
1 Right click on Project ⇒ New ⇒ Class
2 Write Name of Class and then Click Finish
3 Write MapReduce program in that class
Step 5: Write JAVA Program
Big Data & Hadoop

Continue...
Step 6: Importing JAR ﬁles
1 Right click on Project and select properties (Alt+Enter)
2 Select Java Build Path ⇒ Click on Libraries, then click on add
external JARS
3 Select the following jars from Hadoop library.
/usr/local/Hadoop/share/Hadoop/common/libs
/usr/local/Hadoop/share/Hadoop/hdfs/libs
/usr/local/Hadoop/share/Hadoop/httpfs/libs
/usr/local/Hadoop/share/Hadoop/mapreduce/libs
/usr/local/Hadoop/share/Hadoop/yarn/libs
/usr/local/Hadoop/share/Hadoop/tools/
Big Data & Hadoop

Continue ....
Step 7: Set input file path
1 Create folder in home dir
2 copy text files in to that
3 Select path of Input
Step 8: Set input and output path
1 right click on source ⇒ Run As ⇒ Run Configuration ⇒
Argument
2 Enter your input and out put path with a single space
3 click on Run
Big Data & Hadoop

thank You
Big Data & Hadoop

Hadoop installation, Configuration, and Mapreduce program

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Hadoop installation, Configuration, and Mapreduce program

Similar to Hadoop installation, Configuration, and Mapreduce program (20)

Recently uploaded

Recently uploaded (20)

Hadoop installation, Configuration, and Mapreduce program