2. Hadoop
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.
Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
It also provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are handled automatically by the framework.
3. HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
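The block layout described above can be sketched in a few lines of Python (a toy illustration only, not real HDFS code; the 128-byte block size here stands in for HDFS's much larger default, 128 MB in Hadoop 2.x):

```python
# Toy illustration of HDFS-style block splitting (not actual HDFS code).
# A file is stored as a sequence of fixed-size blocks; only the last
# block may be smaller than the block size.

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split data into block_size chunks; the last chunk may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

Note that every block except the last has exactly the configured size, which is what lets HDFS place and replicate blocks independently across datanodes.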
5. Map Reduce
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system.
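The split/map/sort/reduce flow above can be sketched in a single process (a minimal illustration of the data flow only; real Hadoop jobs use the Mapper/Reducer API):

```python
# Minimal single-process sketch of the MapReduce word-count data flow.
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Sort/shuffle by key, then reduce: sum the counts for each word."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

chunks = ["big data big", "data big"]   # independent input splits
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))             # {'big': 3, 'data': 2}
```

In Hadoop the map calls run in parallel on different nodes and the sort happens in the shuffle phase between map and reduce; the data flow, however, is exactly this.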
7. Installation Steps
Check whether Java 1.8.0 is already installed on your system by running "javac -version".
If Java is not installed, first install it under "C:\Java".
Extract hadoop-2.8.0.tar.gz (or hadoop-2.8.0.zip) and place it under "C:\Hadoop-2.8.0".
Set the HADOOP_HOME environment variable on Windows (Variable Name: HADOOP_HOME, Variable Value: C:\Hadoop-2.8.0\bin) and click OK.
Set the JAVA_HOME environment variable on Windows (Variable Name: JAVA_HOME, Variable Value: C:\Java\bin) and click OK.
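Alternatively, both variables can be set from an elevated Command Prompt with the built-in setx command (a sketch; the values mirror the install paths used above):

```shell
:: Windows cmd: persist the environment variables used by this guide.
setx HADOOP_HOME "C:\Hadoop-2.8.0\bin"
setx JAVA_HOME "C:\Java\bin"
```

setx writes the values to the user environment, so open a new Command Prompt afterwards for the changes to take effect.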
8. Configuration
Edit the file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml, paste the XML below into it, and save the file.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Rename "mapred-site.xml.template" to "mapred-site.xml", edit the file C:/Hadoop-2.8.0/etc/hadoop/mapred-site.xml, paste the XML below into it, and save the file.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Create folder "data" under "C:\Hadoop-2.8.0"
Create folder "datanode" under "C:\Hadoop-2.8.0\data"
Create folder "namenode" under "C:\Hadoop-2.8.0\data"
9. Edit the file C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml, paste the XML below into it, and save the file.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-2.8.0\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-2.8.0\data\datanode</value>
  </property>
</configuration>
Edit the file C:/Hadoop-2.8.0/etc/hadoop/yarn-site.xml, paste the XML below into it, and save the file.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Edit the file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd and replace the line set "JAVA_HOME=%JAVA_HOME%" with set "JAVA_HOME=C:\Java" (the path to the JDK 1.8.0 installation).
10. Testing
Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and type "start-all.cmd" to start the Hadoop daemons.
Make sure these processes are running:
Hadoop Namenode
Hadoop Datanode
YARN Resource Manager
YARN Node Manager
Open: http://localhost:8088 (YARN ResourceManager web UI)
Open: http://localhost:50070 (NameNode web UI)
11. Run WordCount Using MapReduce Example
Download MapReduceClient.jar
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)
Download input_file.txt
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/input_file.txt)
Open cmd in administrator mode, move to "C:/Hadoop-2.8.0/sbin", and start the cluster:
start-all.cmd
Create an input directory in HDFS.
hadoop fs -mkdir /input_dir
Copy the input text file input_file.txt into the HDFS input directory (input_dir).
hadoop fs -put C:/input_file.txt /input_dir
Verify that input_file.txt is available in the HDFS input directory (input_dir).
hadoop fs -ls /input_dir/
Verify the content of the copied file.
hadoop fs -cat /input_dir/input_file.txt
Run MapReduceClient.jar, providing the input and output directories.
hadoop jar C:/MapReduceClient.jar wordcount
/input_dir /output_dir
Verify the content of the generated output file.
hadoop fs -cat /output_dir/*
12. File System Commands
Starting HDFS
Initially you have to format the configured HDFS file system. Open the namenode (HDFS server) and execute the following command.
hadoop namenode -format
After formatting HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
start-dfs.sh
Listing Files in HDFS
bin/hadoop fs -ls <args>
13. Inserting Data into HDFS
/bin/hadoop fs -mkdir /user/input (You have to create an input directory.)
/bin/hadoop fs -put /home/file.txt /user/input (Transfer and store a data file from the local system to HDFS.)
/bin/hadoop fs -ls /user/input (You can verify the file using this command.)
Retrieving Data from HDFS
/bin/hadoop fs -cat /user/output/outfile (View the data from HDFS using the cat command.)
/bin/hadoop fs -get /user/output/ /home/hadoop_tp/ (Get the file from HDFS to the local file system using the get command.)
stop-dfs.sh (Shutting Down the HDFS)