Hadoop

December 1, 2017 www.snipe.co.in 1
Snipe Team

Introduction to BigData

Rise of bytes
1000 B = 1 Kilobyte
1000² B = 1 Megabyte
1000³ B = 1 Gigabyte
1000⁴ B = 1 Terabyte
1000⁵ B = 1 Petabyte
1000⁶ B = 1 Exabyte
1000⁷ B = 1 Zettabyte
1000⁸ B = 1 Yottabyte
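The ladder of units on this slide can be expressed as a small helper. This is a minimal Python sketch, assuming decimal (power-of-1000) units as listed on the slide; the function name `humanize` is illustrative and not part of Hadoop.

```python
# Decimal byte units: each step up is a factor of 1000, as on the slide.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

print(humanize(300 * 1000**5))  # 300 petabytes expressed in bytes
```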

Types of Data
Data is classified into 3 types:
•Structured Data
•Unstructured Data
•Semi-structured Data
Structured Data
•Fits into the world of RDBMS
•Data is perfectly aligned in rows and columns
•A tabular format is used for representing data
Example: Data stored in a MySQL database

Unstructured Data
•No definite structure can be assigned to this data
•Cannot tabulate the data
•Cannot put in rows and columns
•Cannot be fitted into any schema
Example: Text files, PDF documents, web server logs, photos, voice recordings
Semi-structured Data
•Data that lies between structured and unstructured data
•Unstructured data embedded within some structure, tags or schema
Example: XML file
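To make the idea of semi-structured data concrete, here is a minimal Python sketch. The XML record, its tag names and its values are invented for illustration: the tags give the record a structure that a program can navigate, while the text inside the comment tag remains free-form.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tags are the "structure", the comment body is free text.
xml_doc = """
<order id="42">
  <customer>Asha</customer>
  <comment>Please deliver after 6 pm and ring the bell twice.</comment>
</order>
"""

root = ET.fromstring(xml_doc)
order = {
    "id": int(root.attrib["id"]),            # structured: attribute with a clear type
    "customer": root.findtext("customer"),   # structured: named field
    "comment": root.findtext("comment"),     # unstructured: free text inside a tag
}
print(order)
```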

Big Data Sources:
•The New York Stock Exchange generates 4 to 5 TB of data per day
•The Internet stores 18.5 petabytes of data per day
•Twitter handles 12 TB of tweets every day
•Facebook :
 1.5 billion active users monthly
 300 PB of user data
 10 billion messages per day
Big Data – The buzz word
•80 to 90% of data is unstructured and cannot be fitted into RDBMS-based systems.
•Big Data technology is efficient, economical and quick

Characteristics of BigData:
(The 3 V’s have now become 6 V’s)
Volume
Velocity
Variety
Veracity
Value
Variability

Volume: Sheer bulk of data being generated.
Velocity: Rate at which data is generated
Ex: 5 to 10 TB of data is uploaded on YouTube every 10 mins.
Variety: 80 to 90% of Big Data is unstructured and semi-structured
Ex: Text data, Voice data, Sensor data etc…
Veracity: Uncertainty about the correctness of data
Ex: Data collected from sensors
Variability: Inconsistencies in the rate at which data is generated
Value: What is the value proposition?
What business value does it create?
The output should be worth the investment made to analyze the data.

Use Cases of Big Data
Financial Services
•Fraud detection
•Personalized banking services
Health Care
•Analysis was previously restricted to the major players due to expensive tools and technologies
•Since the advent of big data, the technology is no longer very expensive
•Many more players in the health care segment will start working on big data

Retail Industry
•Oldest consumers of big data
•Were using data warehousing techniques very heavily
•Now slowly shifting to big data technology
Web and Social Media Analytics
•Most recent entrant
•Heavily invested in big data related research for:
 Behavioral analytics
 Social analytics

RDBMS and BIG DATA
Benefits of RDBMS:
•Compatibility
• Flexibility
•Simplicity
•Performance
•Robustness
Well known RDBMS
•Oracle
•Microsoft SQL server
•MySQL
•Teradata
•DB2

RDBMS:
•Normalization
 Data consistency
 Eliminates data duplication
•Relational databases have to be incredibly complex internally
 Example: a simple SELECT statement could have hundreds of possible query execution paths
 The RDBMS determines the execution plan using cost-based algorithms

Drawbacks of RDBMS:
•New demand is scalability
•New apps being launched: Massive load on storage and
scalability
•Supporting large number of concurrent users
•Dynamic support is needed
•Scaling:
 A new application can go viral overnight, with users increasing from zero to a million
 Some users are frequent, others never return
 Seasonal swings can create spikes
 Users need real-time high performance
 Vertical scaling is possible, but an RDBMS is not designed for horizontal scaling
 Typically runs on a single server node

Today’s demand
Increased workload due to the requirement for flexibility
Database structures need to be altered
Example: A store started by selling televisions
•Database schema was defined for televisions
•Refrigerators and music systems were then added to the catalogue, so the schema had to be altered

Introduction to Parallel computing:
Divide a task and conquer
[Diagram: one task divided among Computer 1, Computer 2, Computer 3 and Computer 4]
• Parallel processing reduces the time taken to finish the task
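The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: worker threads stand in for the four computers, and word counting stands in for the task. The function names are hypothetical, and in CPython threads illustrate the pattern rather than deliver a true CPU speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Work done by one 'computer': count the words in its share of the lines."""
    return sum(len(line.split()) for line in chunk)

def parallel_word_count(lines, workers=4):
    # Divide: cut the input into roughly equal chunks, one per worker.
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Conquer: each chunk is processed by a separate worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Combine: add up the partial results.
    return sum(partial_counts)

print(parallel_word_count(["a b", "c", "d e f", "g"]))
```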

Challenges of Super Computing:
•A general-purpose operating system did not exist
•Buyers of super computers were locked to vendors for hardware support
•High initial cost of the hardware
•High cost of software maintenance and upgrades, which had to be taken care of in-house
•Custom software had to be developed for individual use cases
HADOOP: The rescue
• A general-purpose framework that works like an operating system
• Built-in rich feature set of software tools and components
• Not locked to one vendor; can be installed on any commodity hardware
• Affordable for mid-sized organizations
• Free software (open source) with free upgrades
• Brings distributed computing to a wider audience

Hadoop History:
2000-2002
•Project NUTCH
•Open source, scalable and robust internet search engine
•Doug Cutting and Mike Cafarella
2003-2004
•Google File System (GFS) paper
•Google MapReduce paper
•Common features with NUTCH
2006
•Doug Cutting joined Yahoo and created Hadoop
•NUTCH + GFS-style distributed storage + Google MapReduce
•Hadoop MapReduce implemented in Java
2008
•Hadoop becomes a top-level Apache project
•Stable version of Hadoop used at Yahoo

Overview of Hadoop Architecture
Basic components of Hadoop 1.0
•Name node
•Secondary name node
• Job tracker
•Task tracker
•Data nodes
Job in Hadoop Ecosystem
•A job is some task submitted by the user to the Hadoop cluster
•The job is in the form of a program or collection of programs (a JAR file)
which needs to be executed

Attributes:
•Programs
•Input data to the program, i.e. a file or a collection of files in a directory
•Output directory where the results of execution are collected in files
Job
•Java MapReduce jobs.
•Programs are submitted into the cluster in the form of a JAR file.
•The JAR file packages all the classes.
•Programs need to be executed on a particular set of data.

Apache Hadoop Core Features
Hadoop Distributed File System(HDFS)
•When file is submitted into the Hadoop cluster:
It resides on multiple data nodes
Original file divided into smaller pieces
Parallel Processing Framework
•Robust
•Known as a MapReduce Framework
•When you submit the job into the Hadoop cluster:
Program executes on a piece of data
Runs on multiple machines

Master Slave Architecture
Example: a cluster of 1000 nodes
•Three machines act as masters: name node, secondary name node and job tracker
•The remaining 997 machines work in slave mode and act as data nodes

•The name node, secondary name node and job tracker need not have very large hard drive storage space
•Data nodes need very large hard drive storage space
•Data nodes take the load of all the data
•Big data storage accounts for the bulk of data operations

Master Slave Architecture: Host OS
•Hadoop, as a software framework, is installed on top of the native operating system
•It is installed on the master machines as well as on the data nodes
•The differentiating factor is the software configuration applied after installing Hadoop
•Depending on that configuration, a machine takes on the responsibilities of the name node, secondary name node or job tracker
Hadoop Cluster Setup
•Machines within a rack communicate through a switch at a speed of 10 gigabits per second
•Multiple racks communicate through a multi-layer switch or uplink switch, which also acts as a router
•Data transfer between machines within the same rack is faster than data transfer between machines in different racks

General Specifications of Hadoop Cluster
Built using Commodity Hardware
•Makes the hardware easy to procure and maintain
•Reduces dependency on just one vendor
Processor Build
•Most of the data nodes have two hexa-core or two octa-core processors
•Example: 2 CPUs with 8 cores each
•Processor clock speed lies anywhere between 2.4 and 3.5 gigahertz

Memory and Storage
•Amount of RAM varies according to the organizational needs
•Name node and job tracker would have higher RAM
•Most of the data nodes range between 50 and 500 GB of high speed RAM
Rule of Thumb
•Used to decide how much hard drive storage is needed for each data node
•Every single CPU core requires at least 2 terabytes of hard drive
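As a worked example of the rule of thumb, the sketch below multiplies the core count by 2 TB. It is a minimal Python illustration; the function name is hypothetical, and the default of 2 CPUs with 8 cores each is taken from the processor slide.

```python
def min_disk_per_data_node_tb(cpus=2, cores_per_cpu=8, tb_per_core=2):
    """Rule of thumb from the slide: at least 2 TB of hard drive per CPU core."""
    return cpus * cores_per_cpu * tb_per_core

# 2 CPUs x 8 cores x 2 TB per core
print(min_disk_per_data_node_tb())  # 32
```

A data node with two hexa-core processors would, by the same rule, need at least 2 x 6 x 2 = 24 TB.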

Hadoop Services
1.Once Hadoop is installed, certain services are enabled
2.A machine on which the name node processes or services are running acquires the role of the name node
3.Likewise, a machine running the data node services acquires the role of a data node
4.These services are known as Hadoop services or Hadoop daemons
5.Each is a software process or a collection of several processes

Hadoop Deployment Modes
Three different modes in which Hadoop can be installed and deployed:
•Standalone Mode
•Pseudo Distributed Mode
•Fully Distributed Mode (Cluster Mode): Hadoop is installed on a cluster of interconnected machines

Hadoop Components: Standalone Mode
•The name node, secondary name node, job tracker and data node are the software services which actually run as a part of your Hadoop installation
•If all the services share a single JVM, this mode is called the Standalone Mode
•If the JVM crashes, all Hadoop services also crash
•This mode is mainly used for testing purposes
•It is the least preferred mode of Hadoop installation and deployment

Hadoop Components: Pseudo Distributed Mode
•Each of the Hadoop services runs on a separate JVM
•This mode is widely used for learning and development purposes but not for deployment
Pseudo Distributed Mode vs. Standalone Mode
•Similarity: both run on a single machine, and the services run as a part of the Hadoop installation
•Standalone Mode: when the JVM crashes, all the Hadoop services also crash
•Pseudo Distributed Mode: the crash of one JVM does not impact the rest of the Hadoop cluster

Hadoop Deployment Mode
Hadoop Components: Fully Distributed Mode
• Used in real production environments
• If the job tracker is configured on a machine, that dedicated machine runs only the job tracker service of the Hadoop installation
• Likewise, a dedicated machine working as the name node runs only the name node services
Real vs. Pseudo Distributed Mode
• In Real Distributed Mode, the Hadoop services are interconnected, but each runs in a separate JVM on a separate machine
• In Pseudo Distributed Mode, the Hadoop services each run in a separate JVM, but all on a single machine

Functionalities of Hadoop Components
•Name Node
All the file system metadata is available at the name node
It is a centralised file namespace server, or file system server
• Secondary Name Node
Periodically backs up the data present on the name node server
In the event of a name node failure, the backup on the secondary name node is used to recover and restore the name node

Functionalities of Hadoop Components
•Hot standby means that a standby name node works like a UPS (uninterrupted power supply): as soon as the power is down, the UPS starts backing up
•Hadoop 1.0 did not have the provision of a hot standby name node
•When the name node is down, the entire cluster goes down in the case of Hadoop 1.0
•This is not the case with Hadoop 2.0

In Hadoop 2.0
1.When a cluster is completely down, bring up the name node
2.Copy all the backup files from the secondary name node
3.Restore name node operations

Data Node Functionality
• A job is a single program or a collection of programs that operates on a piece of data
•On each of the data nodes, a software service called the task tracker runs continuously
•The data node stores the big data on which a submitted job operates
•The task tracker executes the job's programs against the data held on its data node
•The decision of which program is executed by which data node is taken by the job tracker

Job Submission and Execution in Hadoop cluster
How is a job submitted into the Hadoop cluster?
How exactly would the job get executed?
Imagine you are working as a data engineer or a data scientist in your team and usually work on a desktop or a laptop
•Hadoop is installed in Pseudo Distributed Mode for all testing and development purposes
•The Java MapReduce source files are compiled and packaged into a JAR file
•This JAR file is submitted into the cluster as a job

Job Submission and Execution in Hadoop cluster
[Diagram: Job Client/Gateway Machine, Name node, Job tracker]
Job Client/Gateway Machine
•This job client is not exactly a part of the cluster
•Hadoop services are not running on it
•It is configured to communicate with the name node and the job tracker
•It submits the job (.jar) along with the job configuration details:
 Input file path
 Output file path

Name Node
•Job is picked up by the name node
•Provides information about:
Blocks corresponding to the input files
Data nodes on which those blocks reside
Job Tracker
• Schedules the jobs
• Distributes the job to the multiple data nodes on which the input file resides
•Result of execution is available in the output path
•User can check the status of the job
•The status update can be found using the job tracker:
Percentage of the job completed so far
Information is updated periodically

Basic HDFS
Agenda
• HDFS stands for Hadoop Distributed File System
• File storage component of Hadoop
• Basic architecture of HDFS and Hadoop
• How HDFS stores the file internally
• Failure handling and recovery mechanism
• Rack awareness and block placement strategies
• Role of name node and secondary name node
• When to use HDFS and when not to

HDFS: Storage inside HDFS Cluster
[Diagram: twelve data nodes, DN 1 to DN 12; each data node has an identification number and its own individual hard drive storage space]

HDFS: Storage inside HDFS cluster
[Diagram: an input file (200 MB) split into Block 1, Block 2, Block 3, ... Block N]
• HDFS breaks the user input file (200 MB) into smaller chunks called blocks
• Block size is configured by the administrator
• Default split size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward)
• Split size can be configured depending upon the requirement
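The splitting of a 200 MB input file can be sketched as follows. This is a minimal Python illustration, not HDFS code; the function name is hypothetical and the 64 MB default is the split size quoted on the slide.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full-sized except possibly the last one.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(200))  # [64, 64, 64, 8]
```

Only the last block is smaller than the configured size: three full 64 MB blocks plus one 8 MB block.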

File Storage in HDFS
•Client machine communicates with the name node
•Name node gives out the information about the default split size
•Client machine thus knows how big each input split will be
•Splitting of the file actually happens on the client machine
•The name node gives out the information about:
 The hostnames or IP addresses of the data nodes
 Free space available to actually store the data
•Client machine writes blocks directly on to the data nodes
•The client machine (gateway machine) does this by bypassing the name node
•The decision of which block resides on which data node is not made randomly; it is governed by a specific set of rules

Overview of HDFS: Design and Architecture Overview
•Each data node sends a heartbeat signal to the name node once every 3 seconds to indicate that it is up and running
[Diagram: HDFS client, name node, secondary name node (namespace backup) and data nodes; the data nodes serve data, write to local disk and send a heartbeat every 3 seconds]

Overview of HDFS
• If a data node stops sending the heartbeat signal that is expected once every 3 seconds, and the silence outlasts the configured timeout:
 Name node assumes that particular data node is dead
 Takes action to replicate the data held on it elsewhere
[Diagram: same layout as the previous slide, with the heartbeat from one data node not received]
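The heartbeat bookkeeping described here can be sketched as a small class. This is a simplified Python illustration, not the name node's actual implementation; the class name, the node names and the 9-second timeout are assumptions chosen for the example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Toy model of how a name node could track data node liveness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # hypothetical: silence longer than this means "dead"
        self.last_seen = {}          # data node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)

monitor = HeartbeatMonitor(timeout_s=9)
monitor.heartbeat("DN1", now=0)
monitor.heartbeat("DN2", now=0)
for t in (3, 6, 9, 12):              # DN1 keeps reporting every 3 seconds, DN2 goes silent
    monitor.heartbeat("DN1", now=t)
print(monitor.dead_nodes(now=12))    # ['DN2']
```

Blocks stored on the nodes returned by `dead_nodes` would then be candidates for re-replication.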

Overview of HDFS
•Each data node also sends status information to the name node once every 6 hours
•This value can be configured to a different number by the Hadoop administrator
•This is the block status report of the data node
•It gives complete, detailed information about which blocks exist on that particular data node

Rack View of Hadoop Cluster
[Diagram: rack view of the cluster showing the name node, secondary name node and job tracker]
•In a production environment, a Hadoop cluster is deployed across multiple racks
• The name node, the secondary name node, and the job tracker are never placed in a single rack
• Otherwise, the failure of that one rack would bring the entire cluster down

Failure of a Data Node
[Diagram: an input file (200 MB) is split into blocks N1 (64 MB), N2 (64 MB), N3 (64 MB) and N4 (8 MB); the blocks are spread across the data nodes DN 1 to DN 12, with N4 shown on DN 5]

Replication of Data Blocks
• To avoid loss of data, copies of each data block are stored on multiple data nodes
• Default replication factor is 3
• That is, 3 copies of the same data block on 3 different data nodes
• Can be configured by the administrator
• A replication factor greater than 3 is usually avoided, because every extra copy consumes additional hard drive space

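File Storage in HDFS
•File size = 300 MB, block size = 64 MB, replication factor (RF) = 3, cluster = 5 nodes
•Total blocks = 5, since 5 x 64 MB = 320 MB is the smallest multiple that covers 300 MB
•The first 4 blocks are 64 MB each; the 5th block = 300 - 4 x 64 = 44 MB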
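The arithmetic on this slide can be checked with a short calculation. This is a minimal Python sketch; the function name and the returned field names are illustrative, and the replica and raw-storage totals are derived from the slide's numbers rather than stated on it.

```python
import math

def storage_plan(file_mb, block_mb=64, replication=3):
    """Block count, last block size and replicated totals for one file."""
    if file_mb <= 0 or block_mb <= 0:
        raise ValueError("sizes must be positive")
    num_blocks = math.ceil(file_mb / block_mb)
    last_block_mb = file_mb - (num_blocks - 1) * block_mb
    return {
        "blocks": num_blocks,
        "last_block_mb": last_block_mb,
        "block_replicas": num_blocks * replication,   # block copies spread over the cluster
        "raw_storage_mb": file_mb * replication,      # disk space consumed in total
    }

print(storage_plan(300))
```

For the 300 MB file this gives 5 blocks, a 44 MB last block, 15 block replicas and 900 MB of raw storage.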

Block Placement Strategy/Replica Placement
• Two racks:
 Rack 1 on the left-hand side
 Rack 2 on the right-hand side
• The first replica of a block is placed on one of the data nodes in rack 1
• The two other replicas of the same block are placed on two different data nodes, both in the same rack (rack 2)
Data Replication on Failure
DN 1
DN 2
DN 3
DN 4
DN 5
DN 7
DN 8
DN 9
DN 11
DN 10 N2
N1
N4
N3
N3
N4
N2
N1
System wide replication factor = 2
• In event of data node failure:
 Data node goes down for the
replication count
 Replication factor for block
N2 is reduced to 1
• HDFS replicates the block
N2 into some other data
node
• Example: N2 is replicated
into data node 12
N2DN 12DN 6

Data Replication on Failure
System wide replication factor = 2
What if?
• Data node 10 was only temporarily down and comes back up after some time, say after 2 minutes
• The name node (HDFS) has already replicated the block N2 onto some other data node
• Block N2 now has one copy too many, so HDFS deletes one of the extra copies of N2; the copy can be removed from any of the nodes holding it
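Both situations, too few copies after a failure and too many after the node returns, can be sketched with one function. This is a toy Python model, not HDFS logic; the function name is hypothetical, and the assumption that the surviving copy of N2 sits on DN3 is invented for the example.

```python
def rebalance(block_locations, live_nodes, replication):
    """Bring every block back to exactly `replication` copies on live nodes."""
    result = {}
    for block, nodes in block_locations.items():
        holders = [n for n in nodes if n in live_nodes]           # copies that survived
        spare = [n for n in sorted(live_nodes) if n not in holders]
        while len(holders) < replication and spare:               # under-replicated: copy
            holders.append(spare.pop(0))
        result[block] = holders[:replication]                     # over-replicated: trim
    return result

# DN10 fails: N2 drops to one copy and is re-replicated onto DN12.
after_failure = rebalance({"N2": ["DN3", "DN10"]}, {"DN3", "DN12"}, replication=2)
print(after_failure)    # {'N2': ['DN3', 'DN12']}

# DN10 comes back: N2 briefly has three copies, so one extra copy is dropped.
after_return = rebalance({"N2": ["DN3", "DN12", "DN10"]},
                         {"DN3", "DN10", "DN12"}, replication=2)
print(after_return)     # {'N2': ['DN3', 'DN12']}
```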

When to/not to use HDFS?
Use HDFS
•Storing large files, of the order of gigabytes, terabytes and petabytes
•Input file size is greater than the input split size
Do not use HDFS
•Storing a large number of small files
•Low-latency access is needed: HDFS has high I/O latency when data is written to or read from disk
•Input file size is smaller than the input split size

In HDFS
 WORM: write once, read many times pattern
 Files cannot be edited or changed in place
 To change a file, it is retrieved back into the local file system and deleted from HDFS
 It is edited locally and then put back into HDFS

Architectural Overview of Hadoop 1.0
[Diagram: HDFS with master nodes (name node, secondary name node) and slave data nodes]
Name node (master node)
• One name node per cluster
• Manages the entire file system
• Holds the namespace and the metadata of file blocks
• Controls the read write access to the files
• Manages the block replication
• Single point of failure

Secondary name node
[Diagram: HDFS cluster with master (name node, secondary name node) and slave nodes]
• Secondary name node is the HDFS namespace backup
• One per cluster
• Performs the housekeeping work
• Needs similar hardware to that of the name node machine
• Not used as a hot standby or a highly available name node backup
• Used for file system metadata and namespace recovery

Data Nodes
•The data nodes are the heavy-lifting nodes in the cluster
•Store data
•Aid in data processing
•Serve read and write requests from clients
•Store and retrieve data blocks
•Perform replication tasks upon request by the name node
•Report the block status of the system to the name node
HDFS client
•There can be many HDFS clients
•Act as an interface between the end user and the Hadoop cluster
•Help to communicate with the name node and data nodes
•Help to submit jobs
•Submit read and write requests for a file
•Interface with the name node

HDFS Namespace
•Hierarchy of files and directories
•Represented by name node data structures called inodes
•Inodes record the attributes of a file:
 Permissions
 Access time
 Namespace quota
 Disk space quota
•The metadata file maintains file attributes such as access time and replication factor
•It is stored persistently on a local disk and is called fsImage

• The edit log file records every change that occurs to the file system metadata
• Metadata is kept in RAM for faster access
• Edit logs are merged with the metadata (fsImage) periodically
• This merging operation is known as checkpointing
• After each checkpoint operation:
 Edit logs are cleared
 New entries are added to a fresh edit log
• Merging fsImage with the edit logs is done on the secondary name node
• The fsImage file is not updated for every write operation
• fsImage is loaded into RAM at every name node startup
• Every 1 hour, the contents of RAM are flushed out to disk

Checkpointing Process
• Happens on the secondary name node
• A copy of fsimage is kept in RAM
• HDFS file system changes are captured in the edit logs
• 'fsimage', loaded as metadata, is optimized for read operations and fast searching
• The changes corresponding to the edits are captured in the edit logs
• Edit logs and 'fsimage' need to be merged periodically
• The new copy of the 'fsimage' contents is then reloaded into main memory
HDFS Dataflow: File Read Operation
Understanding the steps involved in reading a file from HDFS, an anatomy of a file read operation
[Diagram: on the client node, the HDFS client inside the client JVM uses the Distributed File System object and the FS_Data_Input_Stream/DFS_Input_Stream; a metadata request goes to the name node to get the block locations, and the data is read from the data nodes; numbered arrows 1 to 7 mark the metadata flow and the data flow]

Steps involved in reading a file from HDFS
1.The client opens the file to be read by calling Open on the Distributed File System object
2.The object connects to the name node using RPC to get the metadata information (the block locations)
3.For each block, the name node returns the addresses of the data nodes having a copy of that block
4.Distributed File System returns an FS_Data_Input_Stream object, which takes care of the data node and name node interactions
5.The client calls Read on the stream, which connects to the first data node for the first block in the file
6.The data is streamed from the data node back to the client, which calls Read repeatedly until it has read the whole file
7.When the client has finished reading, it calls Close on FS_Data_Input_Stream
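The read steps above can be mimicked with in-memory stand-ins. This is a toy Python sketch, not the HDFS client API: the dictionaries playing the name node and data nodes, the file path, the block names and the contents are all invented for illustration.

```python
# Hypothetical stand-in for the name node: file path -> ordered (block, holders) list.
NAME_NODE = {
    "/logs/app.txt": [("blk_1", ["DN1", "DN4"]), ("blk_2", ["DN2", "DN5"])],
}
# Hypothetical stand-in for the data nodes: node -> {block -> bytes}.
DATA_NODES = {
    "DN1": {"blk_1": b"hello "}, "DN4": {"blk_1": b"hello "},
    "DN2": {"blk_2": b"hdfs"},   "DN5": {"blk_2": b"hdfs"},
}

def read_file(path):
    block_locations = NAME_NODE[path]            # steps 2-3: ask for block locations
    data = b""
    for block_id, holders in block_locations:    # steps 5-6: read block after block
        first_holder = holders[0]                # connect to the first listed data node
        data += DATA_NODES[first_holder][block_id]
    return data                                  # step 7: nothing to close in this toy

print(read_file("/logs/app.txt"))  # b'hello hdfs'
```

Note that the name node stand-in only hands out locations; the bytes themselves come from the data nodes, as in the steps above.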

HDFS Dataflow: Anatomy of a File Write Operation
Understanding the steps involved in writing a file to HDFS, an anatomy of a file write operation
[Diagram: on the client node, the HDFS client inside the client JVM uses the Distributed File System object and the FS_Data_Output_Stream/DFS_Output_Stream; the name node receives the Create call; the Data Streamer takes packets from the Data Queue and writes them to a pipeline of data nodes; acknowledgement packets return through the Ack Queue; numbered arrows 1 to 12 mark the steps, ending with 12 Complete]

1. The client calls the Create API on the Distributed File System object to create a file
2. The object connects to the name node using an RPC call, and the name node creates a new file in the file system's namespace with no blocks associated
3. The client calls the Write API on the data stream
4. The DFS_Output_Stream object splits the data into packets and writes them into the internal Data Queue
5. The Data Streamer asks the name node to allocate new blocks by picking the desirable data nodes to store the replicas
6. The list of 3 data nodes forms a pipeline
7. The Data Streamer pours each packet into the first data node in the pipeline
8. Each data node stores the packet and forwards it to the next data node in the pipeline
9. DFS_Output_Stream keeps the Ack Queue to store packets that are waiting to be acknowledged by the data nodes
10. The data nodes send acknowledgement packets back; a packet is removed from the Ack Queue once it is acknowledged
11. When the client finishes writing data, it calls the Close API on the data stream
12. The write is reported to the name node as complete
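The interplay of the Data Queue, the Ack Queue and the data node pipeline can be sketched as a simulation. This is a simplified, single-threaded Python model, not the DFS_Output_Stream implementation; the function name, the packet size and the node names are assumptions, and acknowledgements are modelled as always succeeding.

```python
from collections import deque

def write_packets(data, packet_size, pipeline):
    """Simulate streaming `data` through a pipeline of data nodes."""
    # Split the data into packets and place them on the data queue.
    data_queue = deque(data[i:i + packet_size]
                       for i in range(0, len(data), packet_size))
    ack_queue = deque()
    stored = {node: [] for node in pipeline}

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)          # packet waits here until acknowledged
        for node in pipeline:             # each node stores it and passes it on
            stored[node].append(packet)
        ack_queue.popleft()               # every node acknowledged: drop from ack queue
    return stored

stored = write_packets(b"abcdefgh", packet_size=3, pipeline=["DN1", "DN2", "DN3"])
print(stored["DN3"])  # [b'abc', b'def', b'gh']
```

If a data node failed mid-write, the packets still sitting in the ack queue are exactly the ones that would have to be sent again, which is why the queue is kept.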

Hadoop Architecture
Architecture of MapReduce in Hadoop 1.0
[Diagram: Hadoop 1.X stack. Pig, Hive and Java sit on top of MapReduce, which handles both resource management and job processing, on top of HDFS (storage)]
 Jobs submitted to a Hadoop 1.0 cluster get converted to MapReduce jobs

Hadoop 2.0 YARN
[Diagram: Hadoop 2.0 stack with the following callouts]
•YARN: Resource Manager and Job Scheduler
•A 3rd party framework plugs into YARN; this is critical for Machine Learning algorithms
•This allows Spark to plug into Hadoop
•HDFS: Primary and Secondary NameNodes, plus a Hot Standby/Highly Available NameNode

Hadoop 2.0 YARN advantages over Hadoop 1.0

Hadoop 2.0 Core Components
HADOOP 2.X core components
          Storage (HDFS)                    Processing (YARN)
Master    Name Node, Secondary Name Node    Resource Manager
Slave     Data Node                         Node Manager

[Diagram: the client submits jobs to the Resource Manager (master), which contains the Scheduler and the Applications Manager (AM); each slave runs a Node Manager next to a Data Node and hosts Containers and an App Master]

 One Resource Manager (RM) per cluster
 The ResourceManager is the rack-aware master node in YARN
 Works like an optimised JobTracker (JT)
 In YARN, the work of the JT inside the RM is split between two components:
 Scheduler
 Applications Manager (AM)
 The Scheduler component of the YARN ResourceManager allocates resources to running applications.
 ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.
 It works together with the per-node NodeManagers and the per-application ApplicationMaster.

Application Manager
[Diagram: the Resource Manager contains the Scheduler and the Applications Manager (AM); callouts: job queue, resource list, job scheduling, resource allocation]
Each time a new job is submitted by a client, it first has to pass through the application manager, which:
 Maintains a log of finished jobs
 Validates job application requests and rejects those that violate specifications.
 Eliminates duplicate job applications.

Hadoop 1.0 Vs Hadoop 2.0
Scalability
HADOOP 1.0
 Maximum cluster size: 4,000 nodes
 Maximum # of concurrent tasks (1000+ mappers and reducers running in parallel): 40,000
 JobTracker bottleneck: gets choked up when there’s a lot of traffic (no room for an additional JobTracker)
HADOOP 2.0
 6,000-10,000 machine clusters
 100,000+ concurrent tasks and 10,000 concurrent jobs (1 job = 1000+ tasks)
 Instead of the JobTracker, it has a Resource Manager with a backup, which allows load distribution

Multitenancy
HADOOP 1.0
 No support for non-MapReduce jobs
 Designed for batch processing workloads
 Iterative jobs (e.g. for Machine Learning) not supported
 Can’t accommodate third-party frameworks
 Only MapReduce apps can be run
HADOOP 2.0
 YARN supports both batch processing and non-batch oriented jobs.
 Supports TEZ, a parallel processing engine that supports interactive and iterative jobs, which are useful for Machine Learning algorithms

Availability
HADOOP 1.0
 Single point of failure, i.e. the NameNode
 When the NameNode crashes, the cluster goes down
 Jobs need to be re-submitted by users
 The cluster is not highly available
HADOOP 2.0
 Active/Standby NameNodes: the standby works in Hot Standby Mode, i.e. it kicks in when the active NameNode fails, while the cluster is still running.
 If both the Primary and Hot Standby NameNodes go down (which is rare), you can resort to the Secondary NameNode.
 There is only a small chance of both NameNodes crashing simultaneously.

HADOOP 1.0
 JobTracker: gets choked up by traffic. Responsible for scheduling and centralized resource allocation in master mode.
 TaskTracker: does the heavy lifting on the DataNodes
HADOOP 2.0
 The Resource Manager is like the JobTracker
 It consists of a) a Scheduler that schedules activities and b) an Application Manager (not Master) for resource allocation and monitoring.
 Application Master: equivalent of the TaskTracker in MR v1. Responsible for task execution and status updates.

Hadoop 3.x
•Apache Hadoop 3 is around the corner, with members of the Hadoop community at the Apache Software Foundation still testing it.
•Apache Hadoop 3.0 will bring thousands of bug fixes, new features and enhancements over Hadoop 2.0.
•The major release of Hadoop 3.x is anticipated to be rolled out around the middle of 2017.
Why hadoop 3.x?
•With Java 7 attaining end of life in 2015, there was a need to
revise the minimum runtime version to Java 8 with a new
Hadoop release so that the new release is supported by Oracle
with security fixes and also will allow hadoop to upgrade its
dependencies to modern versions.
Overview of Hadoop 3.0

• With Hadoop 2.0, shell scripts were difficult to understand: Hadoop developers had to read almost all the shell scripts to find the correct environment variable for an option and how to set it, whether it was java.library.path, the Java classpath or GC options.
• With support for only 2 NameNodes, Hadoop 2 did not provide the maximum level of fault tolerance; Hadoop 3.x offers additional fault tolerance because it supports multiple NameNodes.
• Replication is a costly affair in Hadoop 2, as it follows a 3x replication scheme leading to 200% additional storage space and resource overhead. Hadoop 3.0 will incorporate Erasure Coding in place of replication, consuming comparatively less storage space whilst providing the same level of fault tolerance.

What’s New in Hadoop 3.0?
•Minimum Runtime Version for Hadoop 3.0 is JDK 8
•Support for Erasure Coding in HDFS
•Hadoop Shell Script Rewrite
•MapReduce Task Level Native Optimization
•Support for Multiple NameNodes to maximize Fault Tolerance
• Introducing a More Powerful YARN in Hadoop 3.0
•Change in Default Ports for Various Services and Addition of New
Default Ports

Hadoop 2.x vs. Hadoop 3.x
Feature: Minimum Required Java Version
•Hadoop 2.x: JDK 6 and above.
•Hadoop 3.x: JDK 8 is the minimum runtime version of Java required to run Hadoop 3.x, as many dependency library files have been used from JDK 8.
Feature: Fault Tolerance
•Hadoop 2.x: Fault tolerance is handled through replication, leading to storage and network bandwidth overhead.
•Hadoop 3.x: Support for Erasure Coding in HDFS improves fault tolerance.

Feature: Storage Scheme
•Hadoop 2.x: Follows a 3x replication scheme for data recovery, leading to 200% storage overhead. For instance, if there are 8 data blocks, then a total of 24 blocks will occupy the storage space because of the 3x replication scheme.
•Hadoop 3.x: Storage overhead is reduced to 50% with support for Erasure Coding. In this case, if there are 8 data blocks, then a total of only 12 blocks will occupy the storage space.
Feature: Change in Port Numbers
•Hadoop 2.x:
 Hadoop HDFS NameNode: 8020
 Hadoop HDFS DataNode: 50010
 Secondary NameNode HTTPS: 50091
•Hadoop 3.x:
 Hadoop HDFS NameNode: 9820
 Hadoop HDFS DataNode: 9866
 Secondary NameNode HTTPS: 9869

Feature: YARN Timeline Service
•Hadoop 2.x: The YARN timeline service introduced in Hadoop 2.0 has some scalability issues.
•Hadoop 3.x: The YARN Timeline service has been enhanced with ATS v2, which improves scalability and reliability.
Feature: Intra DataNode Balancing
•Hadoop 2.x: The HDFS Balancer in Hadoop 2.0 caused skew within a DataNode because of the addition or replacement of disks.
•Hadoop 3.x: Intra DataNode Balancing has been introduced in Hadoop 3.0 to address the intra-DataNode skews which occur when disks are added or replaced.
Feature: Number of NameNodes
•Hadoop 2.x: Hadoop 2.0 introduced a single standby NameNode.
•Hadoop 3.x: Hadoop 3.0 supports 2 or more NameNodes.
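The storage overhead figures quoted in the comparison (200% for 3x replication, 50% for Erasure Coding) can be checked with a short calculation. This is a minimal Python sketch; the function names are illustrative, and the layout of 6 data units plus 3 parity units is an assumption chosen because it yields the 50% figure quoted above.

```python
def replication_overhead_percent(replication_factor):
    """Extra storage beyond the original data, as a percentage."""
    return (replication_factor - 1) * 100

def erasure_coding_overhead_percent(data_units, parity_units):
    """Parity storage relative to the data it protects, as a percentage."""
    return parity_units * 100 / data_units

def blocks_on_disk(data_blocks, overhead_percent):
    return data_blocks * (100 + overhead_percent) / 100

rep = replication_overhead_percent(3)          # 200
ec = erasure_coding_overhead_percent(6, 3)     # 50.0 (assumed 6 data + 3 parity layout)
print(blocks_on_disk(8, rep))                  # 24.0
print(blocks_on_disk(8, ec))                   # 12.0
```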

Hadoop Installation:
1) Update Ubuntu
$ sudo apt-get update
2) Download and install the JDK
$ sudo apt-get install default-jdk
Reference link:
https://www.digitalocean.com/community/tutorials/how-to-install-ja
3) Check whether Java is installed
$ java -version
4) Install SSH
$ sudo apt-get install openssh-server

5) Configure SSH
$ ssh-keygen -t rsa -P ""
Note: when prompted with "Enter file in which to save the key (/home/manju/.ssh/id_rsa):", press the ENTER key to accept the default
6) Copy id_rsa.pub to the authorized keys
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
7) Disable IPv6
To disable IPv6 on your Linux machine, update /etc/sysctl.conf by adding the following lines at the end of the file

December 1, 2017 84
$ sudo gedit /etc/sysctl.conf
Note: the above command opens the sysctl.conf file in an editor; add the 4 lines below to that file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
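Editing /etc/sysctl.conf does not take effect on its own; the settings are normally loaded with `sudo sysctl -p` or at the next reboot. Before doing that, it can help to confirm that all three keys really made it into the file. The helper below is a hypothetical sketch (ipv6_disabled_in is an illustrative name) that only inspects the file and changes nothing.

```shell
# Return 0 only if all three disable_ipv6 keys are set to 1 in the given file.
ipv6_disabled_in() {
    conf_file="$1"
    for key in all default lo; do
        grep -Eq "^[[:space:]]*net\.ipv6\.conf\.${key}\.disable_ipv6[[:space:]]*=[[:space:]]*1[[:space:]]*\$" "$conf_file" || return 1
    done
    return 0
}

# Usage: check the file, then load the settings.
# ipv6_disabled_in /etc/sysctl.conf && sudo sysctl -p
```

Once the settings are loaded, `cat /proc/sys/net/ipv6/conf/all/disable_ipv6` should print 1.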
8) Download the Hadoop tar.gz file from the link below
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
or use
http://mirror.fibergrid.in/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
9) Create a folder named apache in your home directory, then copy the newly downloaded Hadoop file from the Downloads folder into the apache folder
10) Extract the Hadoop tar.gz file in the same directory
Note: to extract, right-click the Hadoop tar file and choose the "Extract Here" option; the archive is extracted into the same folder. Rename the extracted hadoop-2.6.4 folder to hadoop, because the paths in the following steps assume /home/manju/apache/hadoop.
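Steps 9 and 10 can also be done from the terminal instead of the file manager. The sketch below assumes the archive is named like hadoop-2.6.4.tar.gz and unpacks to a folder of the same name without the .tar.gz suffix; extract_hadoop is a hypothetical helper name, and the rename to hadoop is there because later steps use the path apache/hadoop.

```shell
# Extract a Hadoop tarball into dest_dir and rename the versioned folder to "hadoop".
extract_hadoop() {
    tarball="$1"
    dest_dir="$2"
    if [ -e "$dest_dir/hadoop" ]; then
        echo "$dest_dir/hadoop already exists, nothing extracted"
        return 1
    fi
    mkdir -p "$dest_dir"
    tar -xzf "$tarball" -C "$dest_dir"
    # Assumption: the top-level folder matches the archive name, e.g. hadoop-2.6.4
    top_dir="$(basename "$tarball" .tar.gz)"
    mv "$dest_dir/$top_dir" "$dest_dir/hadoop"
}

# Usage, assuming the file was saved to the Downloads folder:
# extract_hadoop "$HOME/Downloads/hadoop-2.6.4.tar.gz" "$HOME/apache"
```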
11) Create the new folders inside the Hadoop directory
i) Create a folder named yarn; inside yarn create an hdfs directory; inside hdfs create a namenode directory and a datanode directory.
The folder structure looks like this:
/home/manju/apache/hadoop/yarn/hdfs/namenode
/home/manju/apache/hadoop/yarn/hdfs/datanode
12) Set permissions on the newly created directories
$ chmod -R 777 /home/manju/apache/hadoop/yarn
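Steps 11 and 12 can be combined into one small function. This is a sketch under the assumption that the Hadoop folder is passed as the first argument; make_hdfs_dirs is an illustrative name, and the mode defaults to the 777 used in step 12.

```shell
# Create the namenode and datanode directories and set their permissions.
make_hdfs_dirs() {
    hadoop_dir="$1"
    mode="${2:-777}"   # default matches step 12
    # -p creates the intermediate yarn/ and hdfs/ folders as needed
    mkdir -p "$hadoop_dir/yarn/hdfs/namenode" "$hadoop_dir/yarn/hdfs/datanode"
    chmod -R "$mode" "$hadoop_dir/yarn"
}

# Usage with the path from step 11:
# make_hdfs_dirs /home/manju/apache/hadoop
```

On a single-user machine where the same account runs Hadoop, a tighter mode such as 755 is usually enough, for example `make_hdfs_dirs /home/manju/apache/hadoop 755`.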
13) Update the Hadoop configuration files
$ gedit ~/.bashrc
Add the following environment variables at the end of the .bashrc file
# -- HADOOP ENVIRONMENT VARIABLES START -- #
# Change the JAVA_HOME and HADOOP_HOME paths according to your PC configuration
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/manju/apache/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
Note: After configuring the above variables, reload .bashrc:
$ source ~/.bashrc
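After reloading the shell configuration, a quick sanity check can catch a wrong JAVA_HOME or HADOOP_HOME before the later steps fail. The function below is a hypothetical sketch (check_hadoop_env is an illustrative name); it only tests that the two variables point at folders containing the expected executables.

```shell
# Verify that JAVA_HOME and HADOOP_HOME point at usable installations.
check_hadoop_env() {
    if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME does not point to a Java installation: ${JAVA_HOME:-unset}"
        return 1
    fi
    if [ -z "${HADOOP_HOME:-}" ] || [ ! -x "$HADOOP_HOME/bin/hdfs" ]; then
        echo "HADOOP_HOME does not point to a Hadoop installation: ${HADOOP_HOME:-unset}"
        return 1
    fi
    echo "Hadoop environment looks usable"
}

# Usage, in a new terminal or after reloading .bashrc:
# check_hadoop_env && hadoop version
```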
14) Change the setting in hadoop-env.sh
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open hadoop-env.sh and set the Java home path to /usr/lib/jvm/java-8-openjdk-amd64
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
15) Change the setting in the core-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open core-site.xml (for example with gedit) and paste these lines inside the <configuration> tag
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
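To make the placement inside the <configuration> tag concrete, here is a rough sketch that writes a minimal core-site.xml from the shell. write_core_site is a hypothetical helper, not a Hadoop command, and it overwrites any existing core-site.xml in the target folder, so treat it as an illustration of the file layout rather than a required step. The property name is kept as in step 15; newer Hadoop releases document the same setting under the name fs.defaultFS.

```shell
# Write a minimal core-site.xml into the given configuration folder.
# Warning: this replaces an existing core-site.xml in that folder.
write_core_site() {
    conf_dir="$1"
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}

# Usage, assuming the layout from the earlier steps:
# write_core_site "$HADOOP_HOME/etc/hadoop"
```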
16) Change the setting in the hdfs-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open hdfs-site.xml and add these lines inside the <configuration> tag
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/datanode</value>
</property>
Note: change the directory paths to match your PC; they must point to the namenode and datanode directories created earlier
17) Change the setting in the yarn-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open yarn-site.xml and add these lines inside the <configuration> tag
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
18) Change the setting in the mapred-site.xml file
Note: copy the mapred-site.xml.template file, paste it in the same directory, and rename the copy to mapred-site.xml
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open mapred-site.xml and add these lines inside the <configuration> tag
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
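The copy-and-rename from the note in step 18 is a single cp on the command line. The sketch below wraps it in a hypothetical helper (create_mapred_site is an illustrative name) that leaves an existing mapred-site.xml untouched, so running it twice does not discard edits.

```shell
# Create mapred-site.xml from its template unless it already exists.
create_mapred_site() {
    conf_dir="$1"
    if [ ! -f "$conf_dir/mapred-site.xml" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    fi
}

# Usage, assuming the layout from the earlier steps:
# create_mapred_site "$HADOOP_HOME/etc/hadoop"
```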
19) Format the NameNode
Change to the Hadoop home directory:
$ cd apache/hadoop/
Then format HDFS with this command:
manju@ubuntu:~/apache/hadoop$ hdfs namenode -format
20) After the format completes, run these 2 commands to start Hadoop
$ start-dfs.sh
$ start-yarn.sh
Note: if the commands prompt with (yes/no), answer "yes"
21) Finally, check whether Hadoop is running
$ jps
Note: jps should list 6 processes in the terminal: the 5 Hadoop daemons plus Jps itself
manju@ubuntu:~/apache/hadoop$ jps
2337 NameNode
3094 NodeManager
3127 Jps
2986 ResourceManager
2443 DataNode
2845 SecondaryNameNode
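Reading the jps listing by eye is easy to get wrong when a daemon fails to start. As a small sketch, the hypothetical filter below (missing_daemons is an illustrative name) reads jps output on standard input and prints the name of every expected daemon that is absent; it assumes the two-column "PID Name" format shown above.

```shell
# Print each expected Hadoop daemon that does not appear in jps output (read from stdin).
missing_daemons() {
    # Second column of jps output is the process name
    running="$(awk '{ print $2 }')"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -x matches the whole line, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$running" | grep -qx "$daemon" || echo "$daemon"
    done
}

# Usage:
# jps | missing_daemons
```

No output means all five daemons are running; otherwise the names printed are the ones to look for in the Hadoop logs.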
22) To stop Hadoop use
$ stop-dfs.sh
$ stop-yarn.sh