HADOOP
History of Hadoop
 Open source platform for storage and
processing of diverse data types
 that enables data-driven enterprises to rapidly
derive the complete value from all their data
 Created by Doug Cutting & Mike Cafarella
 They were building “Nutch” to create a large web
index
 Google then published the MapReduce paper, which
addressed a similar problem to the large web index
(Nutch)
 The project was initially developed at Yahoo!; Doug
Cutting is now at Cloudera Contd.,
History of Hadoop
Hadoop
 HDFS (Hadoop Distributed File System)
 MapReduce
Contd.,
 Hadoop streaming :
 enables using Map Reduce with any command line
script
 Hadoop Hive
 Evolved to provide DW capability to large datasets
 All queries are expressed in HQL (Hive Query
Language)
 Hive is declarative, like SQL
 Hadoop Pig
 Similar to Hive, but procedural
 Works well with data pipeline scenarios; used in ETL/ELT
 Hadoop HBase
 HBase is called the Hadoop database: a column-
oriented NoSQL database that runs on top of Hadoop
Hadoop- What it has?
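As a concrete illustration of the Hadoop Streaming interface mentioned above, here is a minimal word-count mapper sketch in Python. Streaming mappers read records one per line and emit tab-separated key/value pairs; this is a toy sketch, not Hadoop's own code, and the demo input is invented.

```python
import io

def map_line(line):
    # One streaming-mapper step: turn an input record into (word, 1) pairs.
    return [(word, 1) for word in line.strip().split()]

def run_mapper(stream, out):
    # Streaming contract: records arrive one per line; key/value pairs
    # are written out separated by a tab character.
    for line in stream:
        for key, value in map_line(line):
            out.write(f"{key}\t{value}\n")

# In a real job this would be run_mapper(sys.stdin, sys.stdout),
# invoked by the hadoop-streaming jar; here we feed it a buffer.
out = io.StringIO()
run_mapper(io.StringIO("hadoop streams records\n"), out)
```

Any executable obeying this stdin/stdout contract (shell, Python, etc.) can serve as the mapper or reducer of a streaming job.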
Hadoop- What it does?
 Can store data in its native format
 Handles both structured and unstructured
data
 Supports complex analysis, detailed and
special purpose computation
 Handles variety of work loads like
 search,
 log processing,
 data warehousing,
 audio/video analysis Contd.,
 A wide variety of analyses and transformations of
data can be performed
 Can store terabytes/petabytes/exabytes of data
with ease and inexpensively (at low cost)
Hadoop- What it does?
Contd.,
 Reliable: handles hardware and system
failures automatically without losing data or
interrupting data analysis
 Hadoop runs on clusters of commodity servers
(disposable/easily repairable)
Hadoop- What it does?
MAP REDUCE
Map Reduce
 It is an agent that distributes the work and
collects the results
 Map Reduce handles job failures
 By starting another instance of that task on
another server that has a copy of the data
 It is designed to be fault tolerant, by using
commodity hardware whose reliability is
unknown
Map Reduce Model
 Introduced by Google
 Solves problems with large clusters of
commodity machines
 Map Reduce Model is based on two distinct
steps
 MAP
 REDUCE
Map Reduce Model
 Map Reduce : input can be split into logical
chunks, each chunk is processed
independently by a map task
 MAP
 Ingestion and transformation step
 Individual input records can be processed in
parallel
 The Map task is responsible for transforming the input
records into key/value pairs
 The Mapper class reads the input records and
transforms each record into one key/value pair
Map Reduce Model
 REDUCE
 Aggregation or summarization takes place
 In which all associated records must be
processed together by a single entity
 The Reducer class transforms the key/value pairs
that the reduce method outputs into output
records
Map Reduce Model
Map Reduce Model: Word Count Example
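The word-count example can be sketched end to end in plain Python, simulating the MAP, shuffle, and REDUCE steps from the previous slides (a toy simulation of the model, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    # MAP: transform each input record into key/value pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # SHUFFLE: group all values that share a key, so that each key is
    # processed together by a single reduce call.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # REDUCE: summarize the grouped values (here: sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"], counts["fox"])  # 3 2
```

In a real cluster the map calls run in parallel on many nodes and the shuffle moves data across the network; the data flow is the same.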
HDFS
HDFS
 Hadoop Distributed File System
 No prior schema set-up is required to store the data
 Storage system for a Hadoop Cluster
 Data is broken into pieces (blocks), and those
pieces are distributed among different servers
participating within the cluster
 Each server stores a small fragment of the
complete data set
 Each fragment is replicated on more than one
server
Contd.,
HDFS
 Disk drive failures or damaged data are
monitored and restored by HDFS by
copying good replicas stored elsewhere on
the cluster
 No Blueprint, simply dump data
 HDFS provides
 scalable,
 reliable
 Fault tolerant data services for data storage
and analysis at low cost Contd.,
Components of Hadoop Cluster
 HDFS has master and slave architecture
 Hadoop 1.x cluster has two types of nodes
 Master Node
 Name Node
 Secondary Name Node
 Job Tracker
 Slave Node
 Data Node
 Task Tracker
Contd.,
Components of Hadoop Cluster
 Master Node
 Name Node
 Maintains metadata for each file stored in HDFS
 Meta data contains information about blocks of files
and their locations on Data nodes
 Secondary Name Node
 Not a backup for Name node
 Performs house keeping functions for Name node
 Job Tracker
 Manages the overall execution of a job
 Performs functions like scheduling & rescheduling
child tasks
 Takes care of the health of each task and node
Components of Hadoop Cluster
 Slave Node
 Data Node
 Stores the actual blocks of a file in the HDFS on its
own local disk
 Task Tracker
 Runs on individual datanodes
 Responsible for starting and managing individual
Map/Reduce tasks
 Communicates with the job trackers
HDFS Architecture
 HDFS Write
 HDFS Read
 HDFS Delete
 Ensuring HDFS Reliability
 Secondary Name Node
 Task Tracker
 Job Tracker
Operations on HDFS
HDFS Write
 To write files into HDFS, client needs to
interact with namenode
 Namenode provides address of the slave on
which client will start writing the data
 As soon as the client finishes writing the
block, the slave starts copying the block to
another slave (depending on the no. of
slaves present and the replication factor)
 After the necessary replicas of the data are
created, an acknowledgement is sent to the client
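The write path above can be sketched as follows; the block size, node names, and pipeline policy are illustrative only (real HDFS blocks are typically 64–128 MB, and placement is rack-aware):

```python
BLOCK_SIZE = 4      # bytes per block; tiny for illustration
REPLICATION = 3     # default HDFS replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # The client-side library splits the file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def write_block(block, datanodes):
    # The Namenode nominates a pipeline of datanodes; the client writes
    # to the first, which forwards the block to the next, and so on.
    pipeline = datanodes[:REPLICATION]
    return {node: block for node in pipeline}  # replica placement

datanodes = ["dn1", "dn2", "dn3", "dn4"]       # invented node names
blocks = split_into_blocks(b"hello hdfs!")
placements = [write_block(b, datanodes) for b in blocks]
# The acknowledgement returns to the client only after all replicas exist.
```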
HDFS Write
HDFS Read
 To read files from HDFS, client needs to
interact with Namenode
 Namenode provides the address of the slaves
where it is stored
 The client node interacts with the respective
Datanodes to read the file
 The Namenode also provides a token to the client,
which it shows to the Datanode for authentication
HDFS Read
HDFS Delete
 Namenode renames the file path to indicate
the file is moved to trash.
 The moved file remains in the trash (/trash
directory) for 6 hrs.
 The deleted file can be restored within this
time; otherwise the Namenode deletes the file
from the HDFS namespace
 Once the file is deleted, the system shows
increased available space
Ensuring HDFS Reliability
 Datanodes can fail: each Datanode periodically
sends heartbeat messages to the Namenode
(every 3 sec).
 If heartbeat messages stop arriving, the
Datanode is considered failed.
 At this stage the Namenode actively initiates
replication of the blocks stored on the lost node to a
healthy node
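A toy sketch of the heartbeat-timeout logic just described; the 3-second timeout follows the slide, and all node/block names and timestamps are invented:

```python
HEARTBEAT_TIMEOUT = 3.0   # seconds, per the slide above

def failed_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    # A Datanode whose last heartbeat is older than the timeout is
    # considered dead.
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

def plan_rereplication(block_map, dead, healthy):
    # Re-create every block hosted on a dead node on a healthy node.
    plan = {}
    for node in dead:
        for block in block_map.get(node, []):
            plan[block] = healthy[0]   # simplest possible placement policy
    return plan

last = {"dn1": 10.0, "dn2": 4.0}                 # last heartbeat times
dead = failed_datanodes(last, now=12.0)          # dn2 missed its window
plan = plan_rereplication({"dn2": ["blk_1"]}, dead, healthy=["dn1"])
```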
Ensuring HDFS Reliability
 Data can get corrupted due to a phenomenon
called bit rot.
 This condition is detected during the HDFS Read
operation via a “Checksum” mismatch.
 If the checksum of the block does not match,
the block is considered corrupted and
re-replication is initiated.
 In turn, the Namenode actively tries to restore the
replication count of the block
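The checksum mechanism can be sketched like this, using CRC-32 as a stand-in for the per-block checksums HDFS keeps (a simulation, not the HDFS implementation):

```python
import zlib

def store_block(data):
    # On write, a checksum is stored alongside each block.
    return {"data": bytearray(data), "checksum": zlib.crc32(data)}

def read_block(block):
    # On read, the checksum is recomputed; a mismatch means bit rot,
    # and the caller would trigger re-replication from a good copy.
    if zlib.crc32(bytes(block["data"])) != block["checksum"]:
        return None   # corrupted
    return bytes(block["data"])

blk = store_block(b"replica payload")
assert read_block(blk) == b"replica payload"   # healthy block reads fine
blk["data"][0] ^= 0xFF                         # simulate a flipped bit
assert read_block(blk) is None                 # corruption detected
```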
Secondary Namenode
 Secondary Namenode is not a failover node.
 fsimage: a snapshot of the HDFS metadata,
loaded by the Namenode at system startup.
 edits: accumulates the changes made during
system operation
 Secondary Namenode: periodically merges the
contents of the edits file into the fsimage file.
 It merges the fsimage file and the edits file into a new
fsimage file (so the Namenode can continue operating
without interruption)
 fstime: contains a time stamp of the
last checkpoint
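A minimal sketch of the checkpoint merge described above, treating fsimage as a dict of path-to-block metadata and edits as a list of logged operations (both representations are invented for illustration; real fsimage/edits are binary files):

```python
def checkpoint(fsimage, edits):
    # Secondary Namenode: replay the accumulated edits on top of the
    # previous fsimage to produce a new fsimage, then empty the edit log.
    new_image = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []   # (new fsimage, emptied edits)

fsimage = {"/a.txt": "blk_1"}
edits = [("create", "/b.txt", "blk_2"), ("delete", "/a.txt", None)]
new_image, new_edits = checkpoint(fsimage, edits)
```

This is why the Secondary Namenode is a housekeeping helper, not a failover node: it only compacts the edit log so Namenode restarts stay fast.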
Task Tracker
 Runs on each compute node of the Hadoop
cluster
 It is configured with a set of slots, usually set
to the total number of cores available on the
machine
 When a request is received from the Job
Tracker, the Task Tracker initiates a new JVM.
 The Task Tracker is assigned tasks depending on
how many free slots it has (free slots = total slots
minus tasks currently running)
 The Task Tracker sends heartbeat messages to the
Job Tracker about its free available slots
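The slot arithmetic above can be sketched as follows; the slot counts, tracker names, and assignment policy are illustrative:

```python
def free_slots(total_slots, running_tasks):
    # The heartbeat reports how many more tasks this tracker can take:
    # free slots = total slots - tasks currently running.
    return max(0, total_slots - running_tasks)

def assign_tasks(pending, trackers):
    # The Job Tracker hands out tasks up to each tracker's free capacity.
    assignment = {name: [] for name in trackers}
    queue = list(pending)
    for name, slots in trackers.items():
        while slots > 0 and queue:
            assignment[name].append(queue.pop(0))
            slots -= 1
    return assignment, queue   # leftover tasks wait for the next heartbeat

trackers = {"tt1": free_slots(4, 3), "tt2": free_slots(2, 0)}
assignment, leftover = assign_tasks(["m1", "m2", "m3", "m4"], trackers)
```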
Job Tracker
Job Tracker
 Responsible for launching and monitoring Map
Reduce Jobs
 The Job Tracker requests from the Namenode a list of
Datanodes hosting the blocks of the input files
 The Job Tracker then plans the job execution (Map
tasks, Reduce tasks) and schedules the tasks
close to the data blocks
 The Job Tracker submits tasks to each Task
Tracker node for execution and monitors them
via acknowledgement (heartbeat) messages
 Once the jobs are completed, the Job Status is
updated (Success/Failure of job)
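A toy sketch of the data-locality scheduling described above: the Job Tracker prefers a Task Tracker co-located with a replica of the input block (all node, block, and task names are invented):

```python
def schedule(map_tasks, block_locations, trackers_on):
    # The Job Tracker asks the Namenode where each input block lives,
    # then prefers a Task Tracker running on one of those Datanodes.
    plan = {}
    for task, block in map_tasks.items():
        hosts = block_locations[block]
        local = [h for h in hosts if h in trackers_on]
        plan[task] = local[0] if local else trackers_on[0]
    return plan

block_locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2"]}
plan = schedule({"map_0": "blk_1", "map_1": "blk_2"},
                block_locations, trackers_on=["dn2", "dn3"])
```

Moving the computation to the data, rather than the data to the computation, is what keeps network traffic low on a Hadoop cluster.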
HADOOP FRAMEWORK
Types of Installation
 Stand-Alone Mode
 Simplest mode of operation, most suitable for
debugging
 The Hadoop process runs on a single JVM
 Least efficient for performance, Most efficient for
development
 Pseudo-Distributed Cluster
 Runs on a single node in a pseudo-distributed
manner
 All daemons run in separate Java processes
 Simulates a clustered environment
 Multi-Node Cluster Installation
 Hadoop is set up across a cluster of machines (a
more complex setup)
 Configuration is otherwise identical to the
Pseudo-Distributed Cluster
Components of Map Reduce
 Client Java Program
 Client Mapper Class
 Custom Reducer Class
 Client-Side libraries
 Remote libraries
 Java Application Archive (JAR)
Components of Map Reduce
 Client Java Program
A Java program that is launched from the client
node (edge node) in the cluster.
 This node has access to the Hadoop Cluster
 The client node can sometimes be one of the
Datanodes in the cluster.
Components of Map Reduce
 Client Mapper Class
 It is a custom class.
Instances of this class are executed on remote
task nodes (except in a Pseudo-Distributed cluster)
 These nodes are different from the nodes where the
Client Java Program launches the job.
 Client Reducer Class
 It is a custom class.
 Instances of this Reducer class are executed on
remote task nodes (except in a Pseudo-Distributed
cluster)
 These nodes are different from the nodes where the
Client Java Program launches the job.
Components of MapReduce
 Client-side Libraries
Hadoop libraries needed by the client are
installed and configured into the CLASSPATH by the
Hadoop client command.
 CLASSPATH details are found in
$HADOOP_HOME/bin/ .
 Client-side libraries are configured by setting the
environment variable HADOOP_CLASSPATH.
Components of MapReduce
 Remote Libraries
 Libraries needed for execution of custom Mapper
and Reducer classes
 Remote libraries exclude HADOOP libraries that
are already configured on the DATANODES
 E.g., if the Mapper uses a specialized XML
parser, the libraries including the parser must be
transferred to the remote Datanodes that execute
the Mapper
Components of MapReduce
 Java Application Archive (JAR)
 Java applications are packaged in JAR files
 These JAR files contain
 Client Java Classes
 Custom Mapper Classes
 Custom Reducer Classes
 JAR files also include custom classes depended on
by the Client/Mapper/Reducer classes
