A new way to store and analyze data
An elephant can't jump, but it can carry a heavy load.
Overview…
 What is Hadoop?
 Why Hadoop?
 Famous Hadoop users.
 Core-Components of Hadoop.
 Hadoop installation and configuration.
 Starting your Single-node cluster.
 Stopping your Single-node cluster.
 Running a single-node cluster program.
What is Apache Hadoop?
The best-known technology used
for Big Data is Hadoop.
Apache Hadoop is a framework that allows
for the distributed processing of large data
sets across clusters of commodity computers
using a simple programming model.
Why Hadoop?
 Need to process a 100 TB dataset:
 On 1 node:
 Scanning @ 50 MB/s ≈ 23 days
 On a 1000-node cluster:
 Scanning @ 50 MB/s ≈ 33 mins
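The numbers above can be checked with shell arithmetic (decimal units, 1 TB = 1,000,000 MB, are assumed):

```shell
# 100 TB scanned at 50 MB/s per node
TOTAL_MB=$((100 * 1000 * 1000))
echo "1 node: $((TOTAL_MB / 50 / 86400)) days"            # 2,000,000 s ≈ 23 days
echo "1000 nodes: $((TOTAL_MB / 50 / 1000 / 60)) minutes" # 2,000 s ≈ 33 minutes
```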
Famous Hadoop Users..
Two clusters of 8000 and 3000
Nodes
100 Nodes
4500 Nodes
532 Nodes
Core-Components of Hadoop..
 The Cluster is the set of host machines (nodes). Nodes may be
partitioned in racks. This is the hardware part of the infrastructure.
 The YARN Infrastructure (Yet Another Resource Negotiator) is
the framework responsible for providing the computational
resources (e.g., CPUs, memory, etc.) needed for application
executions. Two important elements are:
 Resource Manager
 Node Manager
 The HDFS (Hadoop Distributed File System), inspired by the
Google File System, is the primary distributed storage used by Hadoop.
 MapReduce is the original processing model for Hadoop
clusters. It distributes work across the cluster (the map step), then
organizes and reduces the results from the nodes into a response to the
query.
MapReduce Example
Two input files:
file1: “hello world hello moon”
file2: “goodbye world goodnight moon”
Three operations:
 map
 combine
 reduce
We’ll use WordCount as the example: count the occurrences of each
word across the different files.
What is the output per step?
MAP
First map: < hello, 1 >, < world, 1 >, < hello, 1 >, < moon, 1 >
Second map: < goodbye, 1 >, < world, 1 >, < goodnight, 1 >, < moon, 1 >
COMBINE
First map: < hello, 2 >, < world, 1 >, < moon, 1 >
Second map: < goodbye, 1 >, < world, 1 >, < goodnight, 1 >, < moon, 1 >
REDUCE
< goodbye, 1 >, < goodnight, 1 >, < hello, 2 >, < moon, 2 >, < world, 2 >
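The same map → shuffle → reduce flow can be mimicked on a single machine with standard Unix tools (file names follow the slide's example; this is only an analogy, not how Hadoop actually executes the job):

```shell
# recreate the two example input files
printf 'hello world hello moon\n'       > file1
printf 'goodbye world goodnight moon\n' > file2

# "map": emit one word per line; "shuffle": sort groups identical keys;
# "reduce": uniq -c counts each group
# counts produced: goodbye 1, goodnight 1, hello 2, moon 2, world 2
cat file1 file2 | tr ' ' '\n' | sort | uniq -c
```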
Hadoop Installation and Configuration...
1. Installing Java:- Java is the primary requirement for running
Hadoop on any system. Verify your installation:
$ java -version
Cont...
2. Configuring SSH:- Hadoop requires SSH access to manage its
nodes, i.e. remote machines and your local machine.
$ ssh-keygen -t rsa -P ""
Cont...
3. The next step is to test the SSH setup by connecting to your local
machine with the user.
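A typical continuation (an assumption; the original slide showed this step as a screenshot) is to authorize the new key for password-less logins to localhost:

```shell
# authorize the freshly generated public key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# the first connection will prompt you to accept the host key
ssh localhost
```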
Cont...
4. Disabling IPv6 : To disable IPv6 on Ubuntu, open /etc/sysctl.conf
in the editor of your choice and add the following lines to the end of
the file:
/etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine for the changes to take
effect.
Cont...
You can check whether IPv6 is enabled on your machine with the
following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is
disabled (which is what we want).
Cont...
5. Installing Hadoop:-
Cont...
Let’s unpack the Hadoop tarball:
$ tar xvzf hadoop-2.8.1.tar.gz
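If you still need to download the archive, the Apache archive hosts old releases (the mirror URL and the ~/hadoop install location are assumptions; adjust to your setup):

```shell
# download and unpack Hadoop 2.8.1 (mirror URL is an assumption)
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
tar xvzf hadoop-2.8.1.tar.gz
mv hadoop-2.8.1 ~/hadoop   # install location is a choice, not a requirement
```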
Cont...
6. Configure Hadoop Pseudo-Distributed Mode:-
• Setup Environment Variables - First we need to set the environment variables
used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.
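The exact variables were shown on the original slide as an image; a commonly used set (paths are assumptions, matching a ~/hadoop install) looks like:

```shell
# appended to ~/.bashrc; adjust HADOOP_HOME to your unpack location
export HADOOP_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Remember to run `source ~/.bashrc` (or open a new shell) so the variables take effect.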
Cont...
• Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set
the JAVA_HOME environment variable.
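For example (the JDK path is an assumption for OpenJDK 8 on Ubuntu; use the path of your own Java install):

```shell
# line to set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```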
Cont...
• Edit Configuration Files:- Hadoop has many configuration
files, which need to be configured as per the requirements of your
Hadoop infrastructure.
$ gedit hadoop/etc/hadoop/core-site.xml
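The file's content was shown as an image on the original slide; a typical pseudo-distributed core-site.xml (port 9000 is a common convention, not a requirement) is:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```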
Cont...
$ gedit hadoop/etc/hadoop/hdfs-site.xml
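Again the slide showed the content as an image; for a single-node cluster, a replication factor of 1 is the usual choice in hdfs-site.xml:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```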
Cont...
$ cp hadoop/etc/hadoop/mapred-site.xml.template hadoop/etc/hadoop/mapred-site.xml
$ gedit hadoop/etc/hadoop/mapred-site.xml
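The mapred-site.xml content was also an image; the standard pseudo-distributed setting (assumed to match the original slide) tells MapReduce to run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```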
Cont...
$ gedit hadoop/etc/hadoop/yarn-site.xml
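The yarn-site.xml content was an image as well; the usual single-node setting (assumed to match the original slide) enables the MapReduce shuffle service:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```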
Cont...
7. Formatting the NameNode:- Now format the NameNode using the
following command.
$ hdfs namenode -format
Starting your Single-node cluster
Now run start-dfs.sh script.
$ start-dfs.sh
Now run start-yarn.sh script.
$ start-yarn.sh
This will start up a NameNode, DataNode, and SecondaryNameNode (via
start-dfs.sh), plus a ResourceManager and NodeManager (via start-yarn.sh)
on your machine.
A nifty tool for checking whether the expected Hadoop processes
are running is jps
$ /usr/lib/jvm/java-8-oracle/bin/jps
Go to the browser and open http://127.0.0.1:50070
Stopping your Single-node cluster
$ stop-dfs.sh
$ stop-yarn.sh
Running a single-node cluster program
WordCount Program :
- What is the WordCount program?
- WordCount is a simple application that counts how many
times different words appear in a given input set. Steps for this
session:
1. Upload the input data text file in hdfs.
2. Create your own jar and put the output file.
3. Run a jar on hadoop architecture.
4. View output on command line.
5. View and download output file.
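The steps above can be sketched with the examples jar bundled in the Hadoop distribution (HDFS paths and file names are placeholders; step 2, building your own jar, is replaced here by the bundled jar):

```shell
# 1. upload the input text file into HDFS
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input
# 3. run the WordCount jar on the Hadoop cluster
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar \
    wordcount /user/hduser/input /user/hduser/output
# 4. view the output on the command line
hdfs dfs -cat /user/hduser/output/part-r-00000
# 5. download the output file
hdfs dfs -get /user/hduser/output/part-r-00000 ./wordcount-output.txt
```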
Input File..
Output File..
Download Output File..
 You can download the output file:
References..
•Apache Hadoop!
(http://hadoop.apache.org)
•Hadoop on Wikipedia
(http://en.wikipedia.org/wiki/Hadoop)
•Cloudera - Apache Hadoop for the Enterprise
(http://www.cloudera.com)