This is a presentation on Hadoop basics. Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
2. Big Data
"Big Data" is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
In simple terms, "Big Data" consists of very large volumes of heterogeneous data that is being generated, often, at high speed.
Big Data requires a new set of tools, applications and frameworks to process and manage the data.
3. Characteristics of Big Data
The characteristics of Big Data are popularly known as the Three V's of Big Data:
Volume: the sheer size of the data.
Velocity: the speed at which the data is generated.
Variety: the range of formats in which the data arrives.
Sources of Big Data can be broadly classified into six different categories:
1. Enterprise Data
2. Transactional Data
3. Social Media
4. Activity Generated
5. Public Data
6. Archives
4. Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. It manages data processing and storage for big data applications running on clustered systems.
5. History of Hadoop
The history of Hadoop started in 2002 with the project Apache Nutch. Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library, together with Mike Cafarella.
According to Doug Cutting: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere."
6. Characteristics of Hadoop
- Hadoop provides a reliable shared storage (HDFS) and analysis system (MapReduce).
- Hadoop is highly scalable. Because Hadoop scales linearly, a Hadoop cluster can contain tens, hundreds, or even thousands of servers.
- Hadoop is highly flexible and can process both structured and unstructured data.
- Hadoop has built-in fault tolerance.
- Hadoop works on the principle of write once, read many times.
- Hadoop is optimized for large and very large data sets.
- Hadoop is very cost-effective, as it works with commodity hardware and does not require expensive high-end hardware.
7. Hadoop works in a master-worker / master-slave fashion and has two core components: HDFS and MapReduce.
HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage and ensures reliability by storing the data across multiple nodes.
MapReduce offers an analysis system that can perform complex computations on large datasets. This component is responsible for performing all the computations; it works by breaking a large, complex computation down into multiple tasks and assigning those to individual worker/slave nodes.
The master contains the Namenode and Job Tracker components:
- The Namenode holds the information about all the other nodes in the Hadoop cluster.
- The Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
Each worker/slave contains the Task Tracker and Datanode components:
- The Task Tracker is responsible for running the task/computation assigned to it.
- The Datanode is responsible for holding the data.
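To make the storage side concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API. The fs.defaultFS address (hdfs://localhost:9000) and the /demo/hello.txt path are assumptions for illustration, matching the single-node setup configured later in these slides; treat this as a sketch, not a complete client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Point the client at the Namenode (assumed address of a local
        // single-node cluster; adjust to your fs.defaultFS setting).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt"); // hypothetical path

            // Write: HDFS transparently replicates the file's blocks
            // across Datanodes according to dfs.replication.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello from hdfs");
            }

            // Read the file back through the same FileSystem handle.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}

The client only talks to the Namenode to locate blocks; the actual bytes stream to and from the Datanodes that hold them.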
8. Hadoop Distributions
- Cloudera was the first company to be formed to build enterprise solutions based on Hadoop. Cloudera has a Hadoop distribution known as Cloudera's Distribution for Hadoop (CDH).
- Hortonworks has a Hadoop distribution known as Hortonworks Data Platform (HDP).
- MapR is another major distribution available in the market. MapR is available in the cloud through some of the leading cloud providers: Amazon Web Services (AWS), Google Compute Engine, CenturyLink Technology Solutions, and OpenStack.
- Amazon Web Services (AWS) Elastic MapReduce (EMR) was among the first Hadoop offerings available in the market.
- Azure HDInsight is Microsoft's distribution of Hadoop.
9. Hadoop Ecosystem
HDFS and MapReduce are the two core components of the Hadoop Ecosystem and are at the heart of the Hadoop framework. A number of other Apache projects are built around the Hadoop framework and form the rest of the Hadoop Ecosystem:
1. Apache Pig is a software framework which offers a run-time environment for execution of MapReduce jobs on a Hadoop cluster via a high-level scripting language called Pig Latin.
2. The Apache Hive data warehouse framework facilitates the querying and management of large datasets residing in a distributed store/file system like the Hadoop Distributed File System (HDFS).
3. Apache Mahout is a scalable machine learning and data mining library.
4. Apache HBase is a distributed, versioned, column-oriented, scalable big data store on top of Hadoop/HDFS.
5. Apache Sqoop is a tool designed for efficiently transferring data between Hadoop and relational databases (RDBMS).
6. Apache Oozie is a job workflow scheduling and coordination manager for managing the jobs executed on Hadoop.
7. Apache ZooKeeper is an open source coordination service for distributed applications.
8. Apache Ambari is an open source software framework for provisioning, managing, and monitoring Hadoop clusters.
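As an illustration of how these ecosystem tools are used from application code, here is a minimal sketch of querying Hive from Java over JDBC. The HiveServer2 address (localhost:10000), the default database, and the pokes table are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (shipped with Hive's JDBC client jars).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed address of a local HiveServer2 instance.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Hypothetical table, used purely for illustration.
             ResultSet rs = stmt.executeQuery(
                 "SELECT id, value FROM pokes LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}

Behind this SQL-like interface, Hive compiles the query into jobs that run on the cluster against data stored in HDFS.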
10. Hadoop Core Components
YARN (Yet Another Resource Negotiator) is a resource manager that knows how to allocate distributed compute resources to the various applications running on a cluster.
MapReduce is a framework that enables running MapReduce jobs on the Hadoop cluster, powered by YARN. It provides a high-level API for implementing custom Map and Reduce functions in various languages, as well as the code infrastructure needed to submit, run and monitor MapReduce jobs.
HDFS (Hadoop Distributed File System) is designed for storing large files, of the magnitude of hundreds of megabytes or gigabytes, and provides high-throughput streaming data access to them.
11. Hadoop Versions
- 2.7.x: 2.7.7, released 31 May 2018
- 2.8.x: 2.8.5, released 15 September 2018
- 2.9.x: 2.9.2, released 9 November 2018
- 3.1.x: 3.1.2, released 6 February 2019
- 3.2.x: 3.2.0, released 16 January 2019
12. Hadoop 2.8.0 installation: overview
1. Download Hadoop and Java
2. Install Java
3. Extract the Hadoop file
4. Set environment variables
5. Set path
6. Edit configuration files
7. Replace the bin folder
8. Format the namenode
9. Testing
13. Download Hadoop 2.8.0 (Link: http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz OR http://archive.apache.org/dist/hadoop/core//hadoop-2.8.0/hadoop-2.8.0.tar.gz).
Install the Java JDK 1.8.0 under "C:\Java" (Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).
Use "javac -version" to check the version of Java installed on your system.
14. Extract Hadoop-2.8.0.tar.gz or Hadoop-2.8.0.zip and place it under "C:\Hadoop-2.8.0".
Set the HADOOP_HOME environment variable:
Environment Variables -> New -> Variable name: HADOOP_HOME, Variable value: C:\hadoop-2.8.0\bin -> OK
Set the JAVA_HOME environment variable:
Environment Variables -> New -> Variable name: JAVA_HOME, Variable value: C:\java\bin -> OK
Set the Hadoop bin directory path and the Java bin directory path:
Environment Variables -> System variables -> Path -> Edit -> New -> C:\hadoop-2.8.0\bin -> New -> C:\java\bin -> OK
15. Edit the configuration files: paste the XML snippets below into the corresponding files and save them.

Create folder "data" under "C:\Hadoop-2.8.0", folder "datanode" under "C:\Hadoop-2.8.0\data", and folder "namenode" under "C:\Hadoop-2.8.0\data".

File C:\Hadoop-2.8.0\etc\hadoop\core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

File C:\Hadoop-2.8.0\etc\hadoop\hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-2.8.0\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-2.8.0\data\datanode</value>
  </property>
</configuration>

File C:\Hadoop-2.8.0\etc\hadoop\mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

File C:\Hadoop-2.8.0\etc\hadoop\yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

File C:\Hadoop-2.8.0\etc\hadoop\hadoop-env.cmd: replace the line "set JAVA_HOME=%JAVA_HOME%" with "set JAVA_HOME=C:\Java" (the path under which JDK 1.8.0 is installed).
16. Replace the bin folder
Download Hadoop Configuration.zip (Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/Hadoop%20Configuration.zip).
Delete the bin folder under C:\Hadoop-2.8.0 and replace it with the bin folder from the downloaded Hadoop Configuration.zip.
17. Format the namenode
Open cmd and run the command "hdfs namenode -format".
18. Testing
Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and type "start-all.cmd" to start Hadoop.
Make sure these processes are running:
1. Hadoop Namenode
2. Hadoop Datanode
3. YARN Resource Manager
4. YARN Node Manager
Then open http://localhost:50070 (the Namenode web UI) and http://localhost:8088 (the YARN Resource Manager UI).
19. MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java. It is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
The MapReduce algorithm contains two important tasks, namely Map and Reduce:
- Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
Generally, the MapReduce paradigm is based on sending the computation to where the data resides. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
20. Stages of a MapReduce program
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
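To tie the two stages together, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API. The input and output paths are taken from the command line; treat it as an illustrative sketch of the programming model described above, not production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words, and the mapper
    // emits a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups all counts for a word
    // together, the reducer sums them into a single total per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would be run with something like "hadoop jar wordcount.jar WordCount /input /output", where /input and /output are HDFS paths chosen for this example.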