HADOOP
History of Hadoop
 Open source platform for storage and
processing of diverse data types
 that enables data-driven enterprises to rapidly
derive the complete value from all their data
 Created by Doug Cutting & Mike Cafarella
 They were building “Nutch” to create a large web
index
 Google then published the MapReduce paper, which
addressed a similar problem to the large web index
(Nutch)
 The project was initially developed at Yahoo!; Doug
Cutting is now at Cloudera Contd.,
History of Hadoop
Hadoop
 HDFS (Hadoop Distributed File System)
 MapReduce
Contd.,
 Hadoop streaming :
 enables using Map Reduce with any command line
script
 Hadoop Hive
 Evolved to provide DW capability to large datasets
 All queries are expressed in HQL (Hive Query
Language)
 Hive is declarative, like SQL
 Hadoop Pig
 Similar to Hive, but procedural
 Works well with data pipeline scenarios; used in ETL/ELT
 Hadoop HBase
 HBase is called the Hadoop database: a column-
oriented NoSQL database that runs on top of Hadoop
Hadoop- What it has?
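As a concrete illustration of the Hadoop Streaming interface mentioned above, here is a minimal word-count mapper sketch in Python. Streaming mappers read records one per line and emit tab-separated key/value pairs; this is a toy sketch, not Hadoop's own code, and the demo input is invented.

```python
import io

def map_line(line):
    # One streaming-mapper step: turn an input record into (word, 1) pairs.
    return [(word, 1) for word in line.strip().split()]

def run_mapper(stream, out):
    # Streaming contract: records arrive one per line; key/value pairs
    # are written out separated by a tab character.
    for line in stream:
        for key, value in map_line(line):
            out.write(f"{key}\t{value}\n")

# In a real job this would be run_mapper(sys.stdin, sys.stdout),
# invoked by the hadoop-streaming jar; here we feed it a buffer.
out = io.StringIO()
run_mapper(io.StringIO("hadoop streams records\n"), out)
```

Any executable obeying this stdin/stdout contract (shell, Python, etc.) can serve as the mapper or reducer of a streaming job.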
Hadoop- What it does?
 Can store data in its native format
 Handles both structured and unstructured
data
 Supports complex analysis, detailed and
special purpose computation
 Handles variety of work loads like
 search,
 log processing,
 data warehousing,
 audio/video analysis Contd.,
 A wide variety of analyses and transformations of
data can be performed
 Can store terabytes/petabytes/exabytes of data
with ease and inexpensively (at low cost)
Hadoop- What it does?
Contd.,
 Reliable: handles hardware and system
failures automatically without losing data or
interrupting data analysis
 Hadoop runs on clusters of commodity servers
(disposable/easily repairable)
Hadoop- What it does?
MAP REDUCE
Map Reduce
 It is an agent that distributes the work and
collects the results
 Map Reduce handles job failures
 By starting another instance of that task on
another server that has a copy of the data
 It is designed to be fault tolerant, by using
commodity hardware whose reliability is
unknown
Map Reduce Model
 Introduced by Google
 Solves problems with large clusters of
commodity machines
 Map Reduce Model is based on two distinct
steps
 MAP
 REDUCE
Map Reduce Model
 Map Reduce : input can be split into logical
chunks, each chunk is processed
independently by a map task
 MAP
 Ingestion and transformation step
 Individual input records can be processed in
parallel
 The Map task is responsible for transforming the input
records into key/value pairs
 The Mapper class reads the input records and
transforms each record into one key/value pair
Map Reduce Model
 REDUCE
 Aggregation or summarization takes place
 In which all associated records must be
processed together by a single entity
 The Reducer class transforms the key/value pairs
that the reduce method outputs into output
records
Map Reduce Model
Map Reduce Model: Word Count Example
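The word-count example can be sketched end to end in plain Python, simulating the MAP, shuffle, and REDUCE steps from the previous slides (a toy simulation of the model, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    # MAP: transform each input record into key/value pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # SHUFFLE: group all values that share a key, so that each key is
    # processed together by a single reduce call.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # REDUCE: summarize the grouped values (here: sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"], counts["fox"])  # 3 2
```

In a real cluster the map calls run in parallel on many nodes and the shuffle moves data across the network; the data flow is the same.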
HDFS
HDFS
 Hadoop Distributed File System
 No prior schema set-up is required to store the data
 Storage system for a Hadoop Cluster
 Data is broken into pieces (blocks), and those
pieces are distributed among different servers
participating within the cluster
 Each server stores a small fragment of the
complete data set
 Each fragment is replicated on more than one
server
Contd.,
HDFS
 Disk drive failures or damaged data are
monitored and restored by HDFS by
copying good replicas stored elsewhere on
the cluster
 No Blueprint, simply dump data
 HDFS provides
 scalable,
 reliable
 Fault tolerant data services for data storage
and analysis at low cost Contd.,
Components of Hadoop Cluster
 HDFS has master and slave architecture
 Hadoop 1.x cluster has two types of nodes
 Master Node
 Name Node
 Secondary Name Node
 Job Tracker
 Slave Node
 Data Node
 Task Tracker
Contd.,
Components of Hadoop Cluster
 Master Node
 Name Node
 Maintains metadata for each file stored in HDFS
 Meta data contains information about blocks of files
and their locations on Data nodes
 Secondary Name Node
 Not a backup for Name node
 Performs house keeping functions for Name node
 Job Tracker
 Manages the overall execution of a job
 Performs functions like scheduling & rescheduling
child tasks
 Takes care of the health of each task and node
Components of Hadoop Cluster
 Slave Node
 Data Node
 Stores the actual blocks of a file in the HDFS on its
own local disk
 Task Tracker
 Runs on individual datanodes
 Responsible for starting and managing individual
Map/Reduce tasks
 Communicates with the job trackers
HDFS Architecture
 HDFS Write
 HDFS Read
 HDFS Delete
 Ensuring HDFS Reliability
 Secondary Name Node
 Task Tracker
 Job Tracker
Operations on HDFS
HDFS Write
 To write files into HDFS, client needs to
interact with namenode
 Namenode provides address of the slave on
which client will start writing the data
 As soon as the client finishes writing the
block, the slave starts copying the block to
another slave (depending on the no. of
slaves present and the replication factor)
 After the necessary replicas of the data are
created, an acknowledgement is sent to the client
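The write path above can be sketched as follows; the block size, node names, and pipeline policy are illustrative only (real HDFS blocks are typically 64–128 MB, and placement is rack-aware):

```python
BLOCK_SIZE = 4      # bytes per block; tiny for illustration
REPLICATION = 3     # default HDFS replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # The client-side library splits the file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def write_block(block, datanodes):
    # The Namenode nominates a pipeline of datanodes; the client writes
    # to the first, which forwards the block to the next, and so on.
    pipeline = datanodes[:REPLICATION]
    return {node: block for node in pipeline}  # replica placement

datanodes = ["dn1", "dn2", "dn3", "dn4"]       # invented node names
blocks = split_into_blocks(b"hello hdfs!")
placements = [write_block(b, datanodes) for b in blocks]
# The acknowledgement returns to the client only after all replicas exist.
```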
HDFS Write
HDFS Read
 To read files from HDFS, client needs to
interact with Namenode
 Namenode provides the address of the slaves
where it is stored
 The client node interacts with the respective
Datanodes to read the file
 The Namenode also provides a token to the client,
which it shows to the Datanode for authentication
HDFS Read
HDFS Delete
 Namenode renames the file path to indicate
the file is moved to trash.
 The moved file remains in the trash (/trash
directory) for 6 hrs.
 The deleted file can be restored within this
time; otherwise the Namenode deletes the file
from the HDFS namespace
 Once the file is deleted, the system shows
increased available space
Ensuring HDFS Reliability
 Datanodes can fail: each Datanode periodically
sends heartbeat messages to the Namenode
(every 3 sec).
 If heartbeat messages stop arriving, the
Datanode is considered failed.
 At this stage the Namenode actively initiates
replication of the blocks stored on the lost node to a
healthy node
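A toy sketch of the heartbeat-timeout logic just described; the 3-second timeout follows the slide, and all node/block names and timestamps are invented:

```python
HEARTBEAT_TIMEOUT = 3.0   # seconds, per the slide above

def failed_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    # A Datanode whose last heartbeat is older than the timeout is
    # considered dead.
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

def plan_rereplication(block_map, dead, healthy):
    # Re-create every block hosted on a dead node on a healthy node.
    plan = {}
    for node in dead:
        for block in block_map.get(node, []):
            plan[block] = healthy[0]   # simplest possible placement policy
    return plan

last = {"dn1": 10.0, "dn2": 4.0}                 # last heartbeat times
dead = failed_datanodes(last, now=12.0)          # dn2 missed its window
plan = plan_rereplication({"dn2": ["blk_1"]}, dead, healthy=["dn1"])
```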
Ensuring HDFS Reliability
 Data can get corrupted due to a phenomenon
called bit rot.
 This condition is detected during the HDFS Read
operation via a “Checksum” mismatch.
 If the checksum of the block does not match,
the block is considered corrupted and
re-replication is initiated.
 In turn, the Namenode actively tries to restore the
replication count of the block
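The checksum mechanism can be sketched like this, using CRC-32 as a stand-in for the per-block checksums HDFS keeps (a simulation, not the HDFS implementation):

```python
import zlib

def store_block(data):
    # On write, a checksum is stored alongside each block.
    return {"data": bytearray(data), "checksum": zlib.crc32(data)}

def read_block(block):
    # On read, the checksum is recomputed; a mismatch means bit rot,
    # and the caller would trigger re-replication from a good copy.
    if zlib.crc32(bytes(block["data"])) != block["checksum"]:
        return None   # corrupted
    return bytes(block["data"])

blk = store_block(b"replica payload")
assert read_block(blk) == b"replica payload"   # healthy block reads fine
blk["data"][0] ^= 0xFF                         # simulate a flipped bit
assert read_block(blk) is None                 # corruption detected
```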
Secondary Namenode
 Secondary Namenode is not a failover node.
 fsimage: a snapshot of the HDFS metadata,
loaded by the Namenode at system startup.
 edits: accumulates the changes made during
system operation
 Secondary Namenode: periodically merges the
contents of the edits file into the fsimage file.
 It merges the fsimage file and the edits file into a new
fsimage file (so the Namenode can continue operating
without interruption)
 fstime: contains a time stamp of the
last checkpoint
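A minimal sketch of the checkpoint merge described above, treating fsimage as a dict of path-to-block metadata and edits as a list of logged operations (both representations are invented for illustration; real fsimage/edits are binary files):

```python
def checkpoint(fsimage, edits):
    # Secondary Namenode: replay the accumulated edits on top of the
    # previous fsimage to produce a new fsimage, then empty the edit log.
    new_image = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []   # (new fsimage, emptied edits)

fsimage = {"/a.txt": "blk_1"}
edits = [("create", "/b.txt", "blk_2"), ("delete", "/a.txt", None)]
new_image, new_edits = checkpoint(fsimage, edits)
```

This is why the Secondary Namenode is a housekeeping helper, not a failover node: it only compacts the edit log so Namenode restarts stay fast.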
Task Tracker
 Runs on each compute node of the Hadoop
cluster
 It is configured with a set of slots, usually set
to the total number of cores available on the
machine
 When a request is received from the Job
Tracker, the Task Tracker initiates a new JVM.
 The Task Tracker is assigned tasks depending on
how many free slots it has (free slots = total slots
minus tasks currently running)
 The Task Tracker sends heartbeat messages to the
Job Tracker about its free available slots
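The slot arithmetic above can be sketched as follows; the slot counts, tracker names, and assignment policy are illustrative:

```python
def free_slots(total_slots, running_tasks):
    # The heartbeat reports how many more tasks this tracker can take:
    # free slots = total slots - tasks currently running.
    return max(0, total_slots - running_tasks)

def assign_tasks(pending, trackers):
    # The Job Tracker hands out tasks up to each tracker's free capacity.
    assignment = {name: [] for name in trackers}
    queue = list(pending)
    for name, slots in trackers.items():
        while slots > 0 and queue:
            assignment[name].append(queue.pop(0))
            slots -= 1
    return assignment, queue   # leftover tasks wait for the next heartbeat

trackers = {"tt1": free_slots(4, 3), "tt2": free_slots(2, 0)}
assignment, leftover = assign_tasks(["m1", "m2", "m3", "m4"], trackers)
```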
Job Tracker
Job Tracker
 Responsible for launching and monitoring Map
Reduce Jobs
 The Job Tracker requests from the Namenode a list of
Datanodes hosting the blocks of the input files
 The Job Tracker then plans the job execution (Map
tasks, Reduce tasks) and schedules the tasks
close to the data blocks
 The Job Tracker submits tasks to each Task
Tracker node for execution and monitors them
via acknowledgement (heartbeat) messages
 Once the jobs are completed, the Job Status is
updated (Success/Failure of job)
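A toy sketch of the data-locality scheduling described above: the Job Tracker prefers a Task Tracker co-located with a replica of the input block (all node, block, and task names are invented):

```python
def schedule(map_tasks, block_locations, trackers_on):
    # The Job Tracker asks the Namenode where each input block lives,
    # then prefers a Task Tracker running on one of those Datanodes.
    plan = {}
    for task, block in map_tasks.items():
        hosts = block_locations[block]
        local = [h for h in hosts if h in trackers_on]
        plan[task] = local[0] if local else trackers_on[0]
    return plan

block_locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2"]}
plan = schedule({"map_0": "blk_1", "map_1": "blk_2"},
                block_locations, trackers_on=["dn2", "dn3"])
```

Moving the computation to the data, rather than the data to the computation, is what keeps network traffic low on a Hadoop cluster.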
HADOOP FRAMEWORK
Types of Installation
 Stand-Alone Mode
 Simplest mode of operation, most suitable for
debugging
 The Hadoop process runs on a single JVM
 Least efficient for performance, Most efficient for
development
 Pseudo-Distributed Cluster
 Runs on a single node in a pseudo-distributed
manner
 All daemons run in separate Java processes
 Simulates a clustered environment
 Multi-Node Cluster Installation
 Hadoop is set up across a cluster of machines (a
more complex setup)
 Configuration is otherwise identical to the
Pseudo-Distributed Cluster
Components of Map Reduce
 Client Java Program
 Client Mapper Class
 Custom Reducer Class
 Client-Side libraries
 Remote libraries
 Java Application Archive (JAR)
Components of Map Reduce
 Client Java Program
A Java program that is launched from the client
node (edge node) in the cluster.
 This node has access to the Hadoop Cluster
 The client node can sometimes be one of the
Datanodes in the cluster.
Components of Map Reduce
 Client Mapper Class
 It is a custom class.
Instances of this class are executed on remote
task nodes (except in a Pseudo-Distributed cluster)
 These nodes are different from the nodes where the
Client Java Program launches the job.
 Client Reducer Class
 It is a custom class.
 Instances of this Reducer class are executed on
remote task nodes (except in a Pseudo-Distributed
cluster)
 These nodes are different from the nodes where the
Client Java Program launches the job.
Components of MapReduce
 Client-side Libraries
Hadoop libraries needed by the client are
installed and configured into the CLASSPATH by the
Hadoop client command.
 CLASSPATH details are found in
$HADOOP_HOME/bin/ .
 Client-side libraries are configured by setting the
environment variable HADOOP_CLASSPATH.
Components of MapReduce
 Remote Libraries
 Libraries needed for execution of custom Mapper
and Reducer classes
 Remote libraries exclude HADOOP libraries that
are already configured on the DATANODES
 E.g., if the Mapper uses a specialized XML
parser, the libraries including the parser must be
transferred to the remote Datanodes that execute
the Mapper
Components of MapReduce
 Java Application Archive (JAR)
 Java applications are packaged in JAR files
 These JAR files contain
 Client Java Classes
 Custom Mapper Classes
 Custom Reducer Classes
 JAR files also include custom classes depended on
by the Client/Mapper/Reducer classes
