This presentation gives a brief description of big data, followed by Hadoop installation and configuration, and a MapReduce word-count program with its explanation.
Hadoop Installation, Configuration, and MapReduce Program
Outline: Introduction, Hadoop Installation, Hadoop Configuration, Starting & Stopping, MapReduce, Execution
Big Data & Hadoop
D. Praveen Kumar
Research Scholar (Full-Time)
Department of Computer Science & Engineering
YSREC of Yogi Vemana University, Proddatur
Kadapa Dt., A. P, India
November 8, 2016
Bapatla Engineering College, Bapatla, Guntur
Big Data & Hadoop
November 8, 2016 Slide: 1 / 43
Guests = 4
Transportation from the railway station to your home: one auto or car is sufficient.
Food: Mom can prepare meals or snacks without difficulty.
Accommodation: your house is sufficient.
Facilities: beds, bathrooms, water, and TV are the ones you already use.
You can talk to each other, crack jokes, and keep everyone happy.
Expenditure is nearly Rs. 1,000/-.
Guests = 100
Transportation = 25 autos or cars, or two buses
Food = catering
Accommodation = lodge
Facilities = AC, TV, and all other facilities
Maintenance = somewhat difficult
Expenditure = nearly Rs. 90,000/-
Guests = 10,000
Transportation = 2,500 autos or 500 buses
Food = catering
Accommodation = all lodges, function halls, and cottages in the town
Facilities = AC, TV, and all other facilities are somewhat difficult to provide
Maintenance = more difficult
Expenditure = nearly Rs. 2,00,000/-
Problems
The same situation arises in a computing environment:
It is difficult to handle a huge and ever-growing amount of data.
The data cannot be processed with only a few machines.
Distributing large data sets is difficult.
Constructing online or offline models is very difficult.
Solution
A single solution to all these problems is
What is Big Data?
Big data refers to voluminous amounts of structured or
unstructured data that organizations can potentially mine and
analyze.
Big data is a huge amount of large data sets characterized by
Big Data Platforms and Analytical Software
Hadoop
Here we go with Hadoop.
Why?
Hadoop
Apache Hadoop is an open-source software framework for
distributed storage and distributed processing of very large data
sets on computer clusters built from commodity hardware.
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common: libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that stores data
Hadoop YARN: a resource-management platform
Hadoop MapReduce: large-scale data processing
Requirements
Necessary
Java >= 7
ssh
Linux OS (Ubuntu >= 14.04)
Hadoop framework
Optional
Eclipse
Internet connection
Java 7 & Installation
Hadoop requires a working Java installation; Java 1.7 or later is recommended.
Use either of the following commands to install Java on Linux:
sudo apt-get install openjdk-7-jdk
(or)
sudo apt-get install default-jdk
Java PATH Setup
We need to set the Java path (JAVA_HOME).
Open the .bashrc file located in your home directory:
gedit ~/.bashrc
Add the line below at the end:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
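To confirm that the installed Java meets the Java 7 minimum, a small check such as the following may help. This is only a sketch: the class name JavaCheck and the method majorVersion are made up for illustration and are not part of Hadoop. It reads the standard java.version and java.home system properties.

```java
// Hypothetical helper (not part of Hadoop): reports the running Java
// version and checks it against the Java 7 minimum.
final class JavaCheck {

    // Extracts the major version: "1.7.0_80" -> 7, "1.8.0_292" -> 8,
    // "11.0.2" -> 11, "17" -> 17.
    static int majorVersion(String version) {
        String[] parts = version.split("[._+-]");
        int first = Integer.parseInt(parts[0]);
        // Versions before Java 9 use the legacy "1.x" numbering.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        // java.home is the directory of the Java installation in use.
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println(majorVersion(version) >= 7
                ? "OK: Java 7 or later"
                : "Too old: install Java 7 or later");
    }
}
```

On older JDKs the printed java.home may be the jre subdirectory inside the JDK, in which case JAVA_HOME would be its parent directory.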
Installation & Configuration of SSH
Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
Install SSH using the following command:
sudo apt-get install ssh
Next, generate a DSA SSH key for the user and authorize it for login to the local machine:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
OpenSSH 7.0 and later disable DSA keys by default; on such systems generate an RSA key instead (-t rsa, with id_rsa file names).
Download & Extract Hadoop
Download Hadoop from the Apache Download Mirrors
http://mirror.fibergrid.in/apache/hadoop/common/
Extract the contents of the Hadoop package to a location of your
choice. I picked /usr/local/hadoop.
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.2.tar.gz
$ sudo mv hadoop-2.7.2 hadoop
Add Hadoop configuration in .bashrc
Add the following Hadoop settings to the .bashrc file in your home directory:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Create the tmp, NameNode & DataNode directories
Run the command below to create the NameNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
Run the command below to create the DataNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/datanode
Both directories must be writable by the user that runs Hadoop (hadoop1 here).
Run the commands below to create Hadoop's tmp directory and hand it over to that user:
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Files to Configure
The following are the files we need to configure
core-site.xml
hadoop-env.sh
mapred-site.xml
hdfs-site.xml
Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
Add the property below to specify the location of the tmp directory:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
Add the property below to specify the default file system and its port number:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
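Each setting above is a name/value pair wrapped in a property element. The following sketch shows how such a file can be read into a map with the JDK's built-in XML parser. The class SiteXmlReader is hypothetical and is not Hadoop's own Configuration class; it only illustrates the file layout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not Hadoop's Configuration class: it only
// illustrates the <property><name>/<value> layout of the *-site.xml files.
final class SiteXmlReader {

    // Returns the name -> value pairs of every <property> element, in file order.
    static Map<String, String> read(String xml) {
        Map<String, String> props = new LinkedHashMap<String, String>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList list = doc.getElementsByTagName("property");
            for (int i = 0; i < list.getLength(); i++) {
                Element p = (Element) list.item(i);
                props.put(text(p, "name"), text(p, "value"));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable configuration document", e);
        }
        return props;
    }

    // Text of the first child element with the given tag, without padding.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hadoop.tmp.dir</name><value>/app/hadoop/tmp</value></property>"
                + "<property><name>fs.default.name</name><value>hdfs://localhost:54310</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> e : read(xml).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

A parser like this also fails fast on stray spaces inside tags or an unclosed element, which are common mistakes when editing these files by hand.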
Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Uncomment the JAVA_HOME line and set it to the correct Java path:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml
This file holds the host name and port at which the MapReduce job tracker runs. If the file does not exist in Hadoop 2.7.x, create it by copying mapred-site.xml.template. Add the following to mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
Add properties in ... etc/hadoop/hdfs-site.xml
Add the following to the hdfs-site.xml file.
Set the replication factor:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Specify the NameNode directory:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
Specify the DataNode directory:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>
Formatting the HDFS filesystem via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop file system.
You need to do this only the first time you set up a Hadoop cluster.
Do not format a running Hadoop file system, as you will lose all the data currently in HDFS.
To format the file system, run the command:
hadoop namenode -format
(Hadoop 2.x prints a deprecation notice for this form; hdfs namenode -format is the equivalent command.)
Starting single-node cluster
Run the command:
start-all.sh
This starts a NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager on your machine.
A nifty tool for checking whether the expected Hadoop processes are running is jps:
hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager
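Reading a jps listing by eye is easy to get wrong, so the check can also be scripted. The sketch below compares a listing against the five daemons named above and reports any that are missing. The class DaemonCheck is hypothetical, and the sample text is simply the listing shown on this slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: checks a jps listing for the five daemons that
// start-all.sh is expected to launch on a single-node cluster.
final class DaemonCheck {

    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "SecondaryNameNode", "DataNode",
            "ResourceManager", "NodeManager");

    // Returns the expected daemons that do not appear in the jps output.
    static List<String> missing(String jpsOutput) {
        List<String> running = new ArrayList<String>();
        for (String line : jpsOutput.split("\\r?\\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                running.add(parts[1]);   // parts[0] is the process id
            }
        }
        List<String> missing = new ArrayList<String>(EXPECTED);
        missing.removeAll(running);      // exact name match, so NameNode != SecondaryNameNode
        return missing;
    }

    public static void main(String[] args) {
        String sample = "2598 NameNode\n3112 ResourceManager\n3523 Jps\n"
                + "2917 SecondaryNameNode\n2727 DataNode\n3242 NodeManager\n";
        System.out.println("Missing daemons: " + missing(sample));  // prints "Missing daemons: []"
    }
}
```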
Stopping your single-node cluster
Run the command:
stop-all.sh
This stops all the daemons running on your machine. The output will look like this:
stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode
Map-Reduce Framework
MapReduce is a programming paradigm.
It relies on two basic functions, Map and Reduce.
MapReduce is used to manage many large-scale computations.
The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The framework tries to schedule tasks on the nodes where the data is already present.
Map-Reduce Computation Steps
The key-value pairs from each Map task are collected by a
master controller and sorted by key. The keys are divided
among all the Reduce tasks, so all key-value pairs with the
same key wind up at the same Reduce task.
The Reduce tasks work on one key at a time, and combine
all the values associated with that key in some way. The
manner of combination of values is determined by the code
written by the user for the Reduce function.
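The steps above can be imitated in a few lines of plain Java, without any Hadoop classes. This is an in-memory sketch only: the class MapReduceSketch and its method names are made up for illustration, and a real cluster spreads the same steps over many machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; no Hadoop classes are used.
final class MapReduceSketch {

    // Map step: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: combine all values of one key, here by summing them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every line, then shuffle, then reduce on every key.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(run(Arrays.asList("big data", "big cluster")));
    }
}
```

Because all pairs with the same key end up in one list, each call to reduce sees every value for its key, which is the guarantee the Reduce tasks rely on.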
Hadoop - MapReduce (Word Count) Example
MapReduce - WordCountMapper
In the WordCountMapper class we perform the following operations:
Read a line from the file
Split the line into words
Assign a count of 1 to each word
WordCountMapper source code
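// Imports go at the top of the source file; the class itself is a
// static nested class inside WordCountJob.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Called once per input line; emits (word, 1) for every token.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}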
MapReduce - WordCountReducer
In the WordCountReducer class we perform the following operations:
Sum the list of values for a word
Assign the sum to the corresponding word
WordCountReducer source code
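// Imports go at the top of the source file; the class itself is a
// static nested class inside WordCountJob.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    // Called once per word with all of its counts; emits (word, total).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}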
WordCountJob
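import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountJob {
    // WordCountMapper and WordCountReducer are nested here as static classes.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path, args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}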
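The driver passes WordCountReducer to setCombinerClass as well as to setReducerClass. That reuse is safe for word count because addition is associative and commutative: summing partial sums gives the same total as summing every value at once. The plain-Java sketch below illustrates the point; the class CombinerSketch and the way the values are split between two map tasks are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; the split of values into two "map outputs"
// below is invented for the example.
final class CombinerSketch {

    // Same logic as the reduce step of word count: add up the values.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word coming from two different map tasks.
        List<Integer> fromMapA = Arrays.asList(1, 1, 1);
        List<Integer> fromMapB = Arrays.asList(1, 1);

        // Without a combiner: the reducer receives all five values.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums locally, and the
        // reducer only adds the two partial sums.
        int combined = sum(Arrays.asList(sum(fromMapA), sum(fromMapB)));

        System.out.println(direct + " == " + combined);  // prints "5 == 5"
    }
}
```

A reducer that computed something like an average could not be reused this way, since an average of partial averages is generally not the overall average.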
Execution of Hadoop Program in Eclipse
Step 1:
1 Start Hadoop in a terminal using the command:
$ start-all.sh
2 Use the jps command to check whether all Hadoop services have started.
Step 2: Open Eclipse
Step 3: Go to File ⇒ New ⇒ Project
Select Java Project and click the Next button
Enter a project name and click the Finish button
Step 4: Eclipse creates the project and shows it in the workspace
1 Right-click the project ⇒ New ⇒ Class
2 Enter the class name and click Finish
3 Write the MapReduce program in that class
Step 5: Write the Java program
Step 6: Import the JAR files
1 Right-click the project and select Properties (Alt+Enter)
2 Select Java Build Path ⇒ Libraries, then click Add External JARs
3 Select the JARs from the following Hadoop library directories:
/usr/local/hadoop/share/hadoop/common/lib
/usr/local/hadoop/share/hadoop/hdfs/lib
/usr/local/hadoop/share/hadoop/httpfs/lib
/usr/local/hadoop/share/hadoop/mapreduce/lib
/usr/local/hadoop/share/hadoop/yarn/lib
/usr/local/hadoop/share/hadoop/tools/
Step 7: Set the input file path
1 Create a folder in your home directory
2 Copy the text files into it
3 Note the path of this input folder
Step 8: Set the input and output paths
1 Right-click the source file ⇒ Run As ⇒ Run Configurations ⇒ Arguments
2 Enter your input and output paths, separated by a single space
3 Click Run
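A frequent cause of a failed run at this step is a wrong argument: Hadoop's FileOutputFormat refuses to start a job whose output directory already exists, and the input path must exist. The sketch below checks both before a job is submitted. It assumes local file system paths, as in a run from Eclipse with folders in the home directory; the class ArgsCheck is hypothetical and not part of the program above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical pre-flight check for local paths; not part of WordCountJob.
final class ArgsCheck {

    // Returns null when the two arguments look usable, otherwise a message.
    static String problem(String[] args) {
        if (args.length != 2) {
            return "Usage: <input path> <output path>";
        }
        if (!Files.exists(Paths.get(args[0]))) {
            return "Input path does not exist: " + args[0];
        }
        if (Files.exists(Paths.get(args[1]))) {
            return "Output path already exists: " + args[1];
        }
        return null;
    }

    public static void main(String[] args) {
        String message = problem(args);
        System.out.println(message == null ? "Arguments look fine" : message);
    }
}
```

If the output folder from an earlier run is still present, delete it or choose a new output path before running again.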
Thank you