Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides the low latency and high concurrency for BI/analytic, read-mostly queries on Hadoop that batch frameworks such as Hive or Spark do not deliver. Impala is written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred by utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID at a server level).

What is Hadoop? The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB. And then 100GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads where data are randomly accessed on structured data like a relational database. Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
Here is how you can solve this problem using MapReduce and Unix commands:
Map step:
grep -o 'Blue\|Green' input.txt | wc -l > output
This uses grep to search the input file for the strings "Blue" or "Green" and print only the matches. The matches are piped to wc which counts the lines (matches).
Reduce step:
cat output
This isn't really needed as there is only one mapper. cat prints the contents of the output file, which holds the combined count of Blue and Green.
So MapReduce has been simulated using grep for the map and cat for the reduce functionality. The key aspects are: grep extracts the relevant data (map), and cat emits the aggregated result (reduce).
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
The document summarizes Hadoop HDFS, which is a distributed file system designed for storing large datasets across clusters of commodity servers. It discusses that HDFS allows distributed processing of big data using a simple programming model. It then explains the key components of HDFS - the NameNode, DataNodes, and HDFS architecture. Finally, it provides some examples of companies using Hadoop and references for further information.
Data Analysis on Hadoop with Pig and Hive (Hakan Ilter)
My talk at the Özgür Yazılım ve Linux Günleri 2013 event about Pig and Hive, projects that make it easier to write MapReduce programs on Hadoop.
MapR clusters disks into storage pools for data distribution. By default, storage pools contain 3 disks each. The mrconfig command can be used to create, remove, and manage storage pools and disks. Each node supports up to 36 storage pools. Zookeeper should always be started before other services and is critical for high availability. Logs are centrally stored for 30 days by default and can be configured through yarn-site.xml.
This document provides best practices for YARN administrators and application developers. For administrators, it discusses YARN configuration, enabling ResourceManager high availability, configuring schedulers like Capacity Scheduler and Fair Scheduler, sizing containers, configuring NodeManagers, log aggregation, and metrics. For application developers, it discusses whether to use an existing framework or develop a native application, understanding YARN components, writing the client, and writing the ApplicationMaster.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
Hive is a data warehouse infrastructure tool that allows users to query and analyze large datasets stored in Hadoop. It uses a SQL-like language called HiveQL to process structured data stored in HDFS. Hive stores metadata about the schema in a database and processes data into HDFS. It provides a familiar interface for querying large datasets using SQL-like queries and scales easily to large datasets.
Apache Tez - A New Chapter in Hadoop Data Processing (DataWorks Summit)
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
Interested in learning Hadoop, but you’re overwhelmed by the number of components in the Hadoop ecosystem? You’d like to get some hands on experience with Hadoop but you don’t know Linux or Java? This session will focus on giving a high level explanation of Hive and HiveQL and how you can use them to get started with Hadoop without knowing Linux or Java.
Hadoop is a 100% open source framework, written in Java and managed by the Apache Foundation.
Hadoop can store and process large amounts of data efficiently by connecting multiple commodity servers so that they work in parallel.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for distributed storage and fault tolerance, YARN for resource management, and MapReduce for parallel processing of large datasets. It provides details on the architecture of HDFS including the name node, data nodes, and clients. It also explains the MapReduce programming model and job execution involving map and reduce tasks. Finally, it states that as data volumes continue rising, Hadoop provides an affordable solution for large-scale data handling and analysis through its distributed and scalable architecture.
* The file size is 1664MB
* HDFS block size is usually 128MB by default in Hadoop 2.0
* To calculate number of blocks required: File size / Block size
* 1664MB / 128MB = 13 blocks
* 8 blocks have been uploaded successfully
* So remaining blocks = Total blocks - Uploaded blocks = 13 - 8 = 5
If another client tries to access/read the data while the upload is still in progress, it will only be able to access the data from the 8 blocks that have been uploaded so far. The remaining 5 blocks of data will not be available or visible to other clients until the full upload is completed. HDFS follows write-once semantics, so partially written data is only exposed up to the blocks that have already been completely written.
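As a minimal sketch of the block arithmetic above (a hypothetical helper, not part of any Hadoop API), the number of blocks is simply the file size divided by the block size, rounded up:

// Hypothetical helper illustrating the arithmetic above; sizes are in bytes.
public class BlockMath {
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: a partially filled last block still occupies one block.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1664 MB file with a 128 MB block size -> 13 blocks, as in the example.
        System.out.println(blocksNeeded(1664 * mb, 128 * mb));
    }
}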
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
Introduction to Hadoop and Hadoop components (rebeccatho)
This document provides an introduction to Apache Hadoop, which is an open-source software framework for distributed storage and processing of large datasets. It discusses Hadoop's main components of MapReduce and HDFS. MapReduce is a programming model for processing large datasets in a distributed manner, while HDFS provides distributed, fault-tolerant storage. Hadoop runs on commodity computer clusters and can scale to thousands of nodes.
Enroll in a free live demo of Hadoop online training and big data analytics courses, and become a certified data analyst / Hadoop developer. Get online Hadoop training and certification.
In this session you will learn:
1. History of Hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of Hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack awareness
14. MapReduce Components
15. Typical MapReduce Job
This document provides an overview of Hadoop, including:
1. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2. The two main components of Hadoop are HDFS, the distributed file system that stores data reliably across nodes, and MapReduce, which splits tasks across nodes to process data stored in HDFS in parallel.
3. HDFS scales out storage and has a master-slave architecture with a NameNode that manages file system metadata and DataNodes that store data blocks. MapReduce similarly scales out processing via a master JobTracker and slave TaskTrackers.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was created in 2005 and is designed to reliably handle large volumes of data and complex computations in a distributed fashion. The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing data in parallel across large clusters of computers. It is widely adopted by companies handling big data like Yahoo, Facebook, Amazon and Netflix.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for fault-tolerant storage and MapReduce as a programming model for distributed computing. HDFS stores data across clusters of machines and replicates it for reliability. MapReduce allows processing of large datasets in parallel by splitting work into independent tasks. Hadoop provides reliable and scalable storage and analysis of very large amounts of data.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for fault-tolerant storage and MapReduce as a programming model for distributed computing. HDFS stores data across clusters of machines as blocks that are replicated for reliability. The namenode manages filesystem metadata while datanodes store and retrieve blocks. MapReduce allows processing of large datasets in parallel using a map function to distribute work and a reduce function to aggregate results. Hadoop provides reliable and scalable distributed computing on commodity hardware.
This document discusses big data and Hadoop. It defines big data as large amounts of unstructured data that would be too costly to store and analyze in a traditional database. It then describes how Hadoop provides a solution to this challenge through distributed and parallel processing across clusters of commodity hardware. Key aspects of Hadoop covered include HDFS for reliable storage, MapReduce for distributed computing, and how together they allow scalable analysis of very large datasets. Popular users of Hadoop like Amazon, Yahoo and Facebook are also mentioned.
The document provides an overview of distributed systems and the Hadoop framework. It defines distributed systems as collections of interconnected computers that work together to achieve a common goal. Hadoop is introduced as an open-source distributed processing framework for massive datasets. Key components of Hadoop include HDFS for storage, YARN for resource management, MapReduce for processing, and common utilities. The document also explains how Hadoop works and its features such as scalability, fault tolerance, and flexible data processing.
This document provides an overview of Big Data and Hadoop. It defines Big Data as large volumes of structured, semi-structured, and unstructured data that is too large to process using traditional databases and software. It provides examples of the large amounts of data generated daily by organizations. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop including HDFS for distributed storage and fault tolerance, and MapReduce for distributed processing, are described at a high level. Common use cases for Hadoop by large companies are also mentioned.
The document provides an overview of Apache Hadoop and how it addresses challenges related to big data. It discusses how Hadoop uses HDFS to distribute and store large datasets across clusters of commodity servers and uses MapReduce as a programming model to process and analyze the data in parallel. The core components of Hadoop - HDFS for storage and MapReduce for processing - allow it to efficiently handle large volumes and varieties of data across distributed systems in a fault-tolerant manner. Major companies have adopted Hadoop to derive insights from their big data.
The document discusses the Hadoop and MapReduce architecture. It provides an overview of key components of Hadoop including HDFS, YARN, MapReduce, Pig, Hive, and Spark. It describes how HDFS stores and manages large datasets across clusters and how MapReduce allows distributed processing of large datasets through mapping and reducing functions. The document also provides examples of how MapReduce can be used to analyze large datasets like tweets processed by Twitter.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. Since the amount of data collected and analyzed in enterprises has increased several fold in volume, variety and velocity of generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of this new class of systems, called Hadoop, built to handle Big Data.
Hadoop is a framework for distributed processing and storage of large datasets across commodity hardware. It allows for fault-tolerant storage of large files through its Hadoop Distributed File System (HDFS) and fast data processing through MapReduce. HDFS splits files into blocks and stores multiple copies across nodes to prevent data loss from hardware failures. Hadoop is well-suited for large datasets and streaming data access but not for small files or low-latency access.
The document provides an introduction to Hadoop and its distributed file system (HDFS) design and issues. It describes what Hadoop and big data are, and examples of large amounts of data generated every minute on the internet. It then discusses the types of big data and problems with traditional storage. The document outlines how Hadoop provides a solution through its HDFS and MapReduce components. It details the architecture and components of HDFS including the name node, data nodes, block replication, and rack awareness. Some advantages of Hadoop like scalability, flexibility and fault tolerance are also summarized along with some issues like small file handling and security problems.
2. Agenda
• What is Hadoop?
• Who Uses Hadoop?
• Uses of Hadoop
• Sample Application
• Core Hadoop Concepts
• Hadoop Goals
• Hadoop Challenges
• Hadoop Architecture
◦ HDFS (Hadoop Distributed File System)
• How are Files Stored
• Replication Strategy
• HDFS Architecture
• Understand HDFS Architecture
• HDFS Folders & Internal Files
3. Agenda
◦ YARN (Yet Another Resource Negotiator)
• Two important elements (Resource Manager, Node Manager)
• The application startup process
◦ MapReduce (MR)
• Importance of MapReduce
• MapReduce program execution
• Flow of MapReduce
• Inputs and Outputs
• EXAMPLE 1
• EXAMPLE 2
◦ Other Tools
• Why do these tools exist?
• Main Differences between Hadoop 1 & Hadoop 2
• Advantages of Hadoop
• Disadvantages of Hadoop
• Previous Generation
• Question
4. What is Hadoop?
• Open source software framework designed for storage and processing of large-scale data on clusters of commodity hardware.
• Created by Doug Cutting and Mike Cafarella in 2005.
6. Uses of Hadoop
• Data-intensive text processing.
• Assembly of large genomes.
• Graph mining.
• Machine learning and data mining.
• Large-scale social network analysis.
7. Sample Application
• Data analysis is the inner loop of Web 2.0
◦ Data ⇒ Information ⇒ Value
• Log processing: reporting
• Search index
• Machine learning: spam filters
• Competitive intelligence
8. Core Hadoop Concepts
• Applications are written in a high-level programming language.
◦ No network programming.
• Nodes should communicate as little as possible.
• Data is spread among the machines in advance.
◦ Perform computation where the data is already stored as often as possible.
9. Hadoop Goals
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware
• Fault tolerance
• Move computation rather than data
10. Hadoop Challenges
• Hadoop is a cutting-edge technology
◦ Hadoop is a new technology, and as with adopting any new technology, finding people who know the technology is difficult!
• Hadoop in the Enterprise Ecosystem
◦ Hadoop is designed to solve Big Data problems encountered by Web and Social companies. It does not support some government requirements; for example, HDFS does not offer native support for security and authentication.
• Hadoop is still rough around the edges (under development)
◦ The development and admin tools for Hadoop are still pretty new. Companies like Cloudera, Hortonworks, MapR and others are working to improve them.
11. Hadoop Challenges
• Hadoop is NOT cheap
◦ Hardware cost: Hadoop runs on 'commodity' hardware, but these are not cheap, low-end machines; they are server-grade hardware. So standing up a reasonably large Hadoop cluster, say 100 nodes, will cost a significant amount of money. For example, let's say a Hadoop node is $5,000; a 100-node cluster would then be $500,000 for hardware.
◦ IT and operations costs: A large Hadoop cluster will require support from various teams such as network admins, IT, security admins and system admins. One also needs to think about operational costs like data center expenses: cooling, electricity, etc.
12. Hadoop Challenges
• MapReduce is a different programming paradigm
◦ Solving problems using MapReduce requires a different kind of thinking. Engineering teams generally need additional training to take advantage of Hadoop.
• Hadoop and High Availability
◦ Hadoop version 1 had a single point of failure because of the NameNode. There was only one NameNode for the cluster, and if it went down, the whole Hadoop cluster would be unavailable. This has limited the use of Hadoop for applications that require continuous availability.
13. Hadoop Architecture
• Hadoop Common: contains libraries and other modules
• HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
14. Hadoop Architecture
• HDFS Concept
◦ HDFS is a file system written in Java, based on Google's GFS.
◦ Developed using a distributed file system design, it runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
◦ Responsible for storing data on the cluster.
◦ Optimized for streaming reads of large files, not random reads.
15. Hadoop Architecture
• HDFS Concept
◦ How are Files Stored
• Data is organized into files and directories.
• Files are divided into uniform-sized blocks (default 64MB) and distributed across cluster nodes.
• HDFS exposes block placement so that computation can be migrated to the data.
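To make the last point concrete, here is a small sketch (assuming a reachable cluster configured via core-site.xml and a hypothetical path /data/input.txt) that uses the standard HDFS Java client to ask the NameNode where each block of a file lives; schedulers use exactly this kind of information to move computation to the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured NameNode
        Path file = new Path("/data/input.txt");    // hypothetical example path

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}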
16. Hadoop Architecture
• HDFS Concept
◦ How are Files Stored
• Blocks are replicated (default 3) to handle hardware failure.
• Replication for performance and fault tolerance (rack-aware placement).
• HDFS keeps checksums of data for corruption detection and recovery.
17. Hadoop Architecture
• HDFS Concept
◦ Replication Strategy
• One replica on the local node.
• Second replica on a remote rack.
• Third replica on the same remote rack.
• Additional replicas are randomly placed.
◦ Clients read from the nearest replica. (A small client-side sketch of changing a file's replication factor follows below.)
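As a brief illustration (hypothetical path, standard HDFS Java client), a client can change the replication factor of an existing file; where the extra copies land still follows the rack-aware placement policy described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication factor of one file from the default to 5;
        // the NameNode schedules the additional copies using the rack-aware policy.
        fs.setReplication(new Path("/data/input.txt"), (short) 5);
        fs.close();
    }
}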
18. Hadoop Architecture
• HDFS Architecture
◦ HDFS follows a master-slave architecture and has the following elements:
• Namenode
• Datanode
• Block
19. Hadoop Architecture
• HDFS Architecture
◦ Namenode: the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system hosting the namenode acts as the master server and does the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.
20. Hadoop Architecture
• HDFS Architecture
◦ Datanode: commodity hardware having the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system and do the following tasks:
• Datanodes perform read-write operations on the file systems.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
21. Hadoop Architecture
• HDFS Architecture
◦ Block: Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64MB, but it can be increased as needed via the HDFS configuration.
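As a rough illustration of changing those defaults (a sketch, assuming Hadoop 2.x property names and a hypothetical output path), a client can override the block size and replication factor in its Configuration before writing a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x property names; older releases used dfs.block.size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128MB blocks instead of 64MB
        conf.setInt("dfs.replication", 3);                   // keep three copies of each block

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/output.txt");              // hypothetical example path
        try (FSDataOutputStream stream = fs.create(out)) {    // the settings apply to this file
            stream.writeUTF("hello hdfs");
        }
        fs.close();
    }
}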
22. Hadoop Architecture
• Understand HDFS Architecture
◦ The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
◦ Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
23. Hadoop Architecture
• Understand HDFS Architecture
◦ The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
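A brief sketch of those metadata operations (hypothetical paths, standard HDFS Java client): listing, renaming and deleting all go through the NameNode, while the actual file bytes are later streamed from the DataNodes it points the client to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // All three calls below are metadata operations served by the NameNode.
        for (FileStatus entry : fs.listStatus(new Path("/data"))) {         // hypothetical directory
            System.out.println(entry.getPath() + " " + entry.getLen() + " bytes");
        }
        fs.rename(new Path("/data/input.txt"), new Path("/data/renamed.txt"));
        fs.delete(new Path("/data/old.txt"), false);                         // false = non-recursive

        fs.close();
    }
}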
25. Hadoop Architecture
• Understand HDFS Architecture
◦ The NameNode is a Single Point of Failure for the HDFS cluster.
◦ HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline.
◦ There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode.
26. Hadoop Architecture
• Understand HDFS Architecture
◦ How does the NameNode create checkpoints of the namespace?
◦ How does the NameNode respond to client requests?
27. Hadoop Architecture
• Understand HDFS Architecture
◦ How does the NameNode create checkpoints of the namespace?
◦ How does the NameNode respond to client requests?
32. Hadoop Architecture
• YARN (Yet Another Resource Negotiator)
◦ YARN is the framework responsible for providing the computational resources (CPUs, memory, etc.) needed for application execution.
◦ The YARN infrastructure and HDFS are completely decoupled and independent: the first provides resources for running an application, while the second provides storage.
33. Hadoop Architecture
• YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
• Resource Manager (one per cluster): the master. It knows where the slaves are located (rack awareness) and how many resources (containers) they have. It runs several services, the most important being the Resource Scheduler, which decides how to assign resources.
34. Hadoop Architecture
• YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
• Node Manager (many per cluster): the slave. When it starts, it announces itself to the RM and periodically sends a heartbeat to the RM. Each Node Manager offers some resources to the cluster; its resource capacity is the amount of memory and the number of vcores. At run time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the NM capacity, and it is used by the client for running a program.
35. Hadoop Architecture
• YARN (Yet Another Resource Negotiator)
◦ In YARN, there are at least three actors:
• the Job Submitter (the client)
• the Resource Manager (the master)
• the Node Manager (the slave)
36. Hadoop Architecture
• YARN (Yet Another Resource Negotiator)
◦ The application startup process is the following (a concrete client-side sketch follows below):
• a client submits an application to the Resource Manager.
• the Resource Manager allocates a container (via the Resource Scheduler).
• the Resource Manager contacts the related Node Manager.
• the Node Manager launches the container.
• the container executes the Application Master.
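The first step of that sequence, seen from the client side, looks roughly like the following (a sketch using the public YarnClient API; the queue and the ContainerLaunchContext details are left empty here, and a real submitter would also fill in the command that starts the Application Master):

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: the client asks the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // Container in which the Application Master will run; the launch command
        // (how to start the AM) is omitted in this sketch.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

        // Step 2: submit; the RM schedules a container and the chosen Node Manager launches the AM.
        yarnClient.submitApplication(appContext);
        yarnClient.stop();
    }
}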
37. Hadoop Architecture
• Hadoop MapReduce
◦ MapReduce is a processing technique and a programming model for distributed computing based on Java.
◦ The MapReduce algorithm contains two important tasks:
• Map: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• Reduce: takes the output from a map as input and combines those data tuples into a smaller set of tuples.
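As a minimal illustration of those two tasks (the classic word-count shape, written against the standard org.apache.hadoop.mapreduce API rather than taken from these slides), the mapper emits (word, 1) pairs and the reducer sums the counts for each word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: one input line -> a (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}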
39. Hadoop Architecture
• Hadoop MapReduce
◦ A MapReduce program executes in the following stages:
• Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
41. Hadoop Architecture
• Hadoop MapReduce
◦ Flow of MapReduce:
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
• The framework manages all the details of data passing, such as issuing tasks and verifying task completion.
• Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
42. Hadoop Architecture
• Hadoop MapReduce
◦ Inputs and Outputs
• The MapReduce framework operates on <key, value> pairs.
• The key and value classes must be serializable by the framework and hence need to implement the Writable interface.
• The key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
• Input and output types of a MapReduce job (a driver sketch wiring these types together follows below):
(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output)
          Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)
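A sketch of the job driver that declares those input/output types for the word-count classes shown earlier (paths are hypothetical; this follows the standard org.apache.hadoop.mapreduce.Job API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);

        // <k2, v2>: the map output types; <k3, v3>: the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}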
47. Hadoop Architecture
• Other Tools
◦ Hive: Hadoop processing with SQL.
◦ Pig: Hadoop processing with scripting.
◦ Cascading: pipe-and-filter processing model.
◦ HBase: database model built on top of Hadoop.
◦ Flume: designed for large-scale data movement.
48. Hadoop Architecture
• Why do these tools exist?
◦ MapReduce is very powerful; these tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce.
49. Hadoop Architecture
◦ Main Differences between Hadoop 1 & Hadoop 2
• Hadoop 1: (1) YARN does not exist; (2) has only a Namenode.
• Hadoop 2: (1) YARN exists; (2) has a Namenode and a Secondary Namenode for recovery.
50. Advantages of Hadoop
• Scalable:
◦ Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel, unlike traditional relational database systems (RDBMS).
• Cost effective:
◦ Hadoop offers a cost-effective storage solution for businesses' exploding data sets. The problem with a traditional RDBMS is that it is extremely costly to scale to such a degree in order to process such massive volumes of data.
• Flexible:
◦ Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.
51. Advantages of Hadoop
• Fast:
◦ Hadoop's unique storage method is based on a distributed file system that basically 'maps' data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you're dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.
• Resilient to failure:
◦ A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure there is another copy available for use.
52. Disadvantages of Hadoop
• Security Concerns
◦ Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. Hadoop is also missing encryption at the storage and network levels, which is a major concern for government agencies and others that prefer to keep their data private.
• Vulnerable By Nature
◦ The very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches.
• Potential Stability Issues
◦ Like all open source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.
• Not Fit for Small Data
◦ While big data is not exclusively made for big businesses, not all big data