Hadoop

24/08/18
Apache Hadoop

The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.

Created by Doug Cutting and written in Java

Hadoop components:

Hadoop Distributed File System (HDFS) – Storage

MapReduce – Processing (other processing engines can be used instead)

YARN (Yet Another Resource Negotiator) – Resource management

Apache Hadoop - Features

Open Source
– Apache Software Foundation

Distributed Storage & Processing
– HDFS – Hadoop Distributed File System
– MapReduce – Parallel processing

Fault tolerance
– Replication (3 replicas of each block by default; this can be
changed as required)

Reliability
– Data is reliably stored on the cluster of machines despite
machine failures.

Scalability
– New nodes can be added dynamically
– The cluster can grow as the data size increases

Easy to use
– The client does not need to deal with distributed computing;
the framework handles it

Data locality
– Computation is moved to the data
– rather than data to the computation

High availability
– Data remains highly available and accessible despite hardware
failure, because multiple copies of the data are kept.

Apache Hadoop versions
Hadoop 1.0
Storage – HDFS (Replication)
Processing – MapReduce
Hadoop 2.0
Storage – HDFS (Replication)
Processing – MapReduce or other
Resource Management - YARN
Hadoop 3.0
Storage – HDFS (erasure coding, which reduces storage space compared with replication)
Processing – MapReduce or other
Resource Management – YARN v.2
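
The storage saving from erasure coding can be illustrated with simple arithmetic. The sketch below compares the raw disk space needed under 3x replication with a Reed-Solomon style layout of 6 data units and 3 parity units. That particular layout and the function names are assumptions chosen for illustration, and padding of partial stripes is ignored.

```python
def raw_storage_replication(data_bytes: int, replicas: int = 3) -> int:
    """Raw bytes on disk when every block is stored `replicas` times."""
    return data_bytes * replicas


def raw_storage_erasure(data_bytes: int, data_units: int = 6,
                        parity_units: int = 3) -> float:
    """Raw bytes on disk when each group of `data_units` data cells
    is protected by `parity_units` parity cells (assumed layout)."""
    return data_bytes * (data_units + parity_units) / data_units


if __name__ == "__main__":
    one_tb = 10**12
    # Overhead factors relative to the logical data size.
    print(raw_storage_replication(one_tb) / one_tb)
    print(raw_storage_erasure(one_tb) / one_tb)
```

Under these assumptions replication stores three bytes for every logical byte, while the erasure-coded layout stores one and a half.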

Apache Hadoop – Limitations

Issue with small files
With too many small files, the
NameNode becomes overloaded

Slow processing speed
MapReduce tasks take a long time to run

Support for batch processing only
Does not process streamed data

No real-time data processing

No efficient iteration
Iterative work needs a chain of stages in which the output of
each stage is the input to the
next stage

Lengthy code
More lines of code mean more bugs and more time to execute
the program.

Latency
Because it is designed to support different formats,
structures and huge volumes of data, processing takes time

Not easy to use
Developers need to hand-code each and
every operation

Security
Encryption is missing by default

No abstraction
There are no higher-level operators, so each
operation is written by hand

No caching
Intermediate data is not kept in memory for reuse

Uncertainty
There is no guarantee of when a job will
complete.

Hadoop Distributed File System (HDFS)
• When a dataset outgrows the storage capacity of a single physical
machine, it becomes necessary to partition it across a number of
separate machines.
• Filesystems that manage the storage across a network of machines
are called distributed filesystems.

HDFS is designed for
• Very large files
Stores petabytes of data
• Streaming data access
Write once, read many times
• Commodity hardware
HDFS is not a good fit for
• Low-latency data access
Access in the tens-of-milliseconds range
• Lots of small files
The Namenode holds filesystem metadata
in memory
• Multiple writers or arbitrary modification of
files
A file has a single writer, and writes are always made at the end
of the file, in append-only
fashion
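
Why lots of small files strain the Namenode can be estimated roughly. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of Namenode memory per file and per block, together with a 128 MB block size. Both numbers and all function names are illustrative assumptions, not measurements.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024**2      # assumed block size in bytes


def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // BLOCK_SIZE)  # ceiling division
        total += 1 + blocks
    return total


def estimated_heap_bytes(file_sizes):
    """Very rough Namenode memory estimate for the given files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_big_file = [1024**3]             # 1 GiB stored as a single file
    many_small = [1024**2] * 1024        # the same 1 GiB as 1024 files
    print(namenode_objects(one_big_file), namenode_objects(many_small))
```

The same amount of data costs far more metadata objects when it is stored as many small files, which is the effect the slide describes.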

Hadoop Distributed File System (HDFS) – 3 node types
Namenode:

Namespace & metadata
• List of files, the blocks of each
file, the datanodes holding each
block, file attributes
Datanode:

Stores data

Periodically validates
checksums

Sends reports on its existing blocks
to the namenode
Secondary Namenode:

Checkpointing in HDFS

Merges the edit logs with the fsimage from the namenode

Helper node for the namenode

Hadoop Services
[Figure: services running on the Master Node, Slave1 and Slave2: the HDFS service, the YARN service, and job history details]

Hadoop Distributed File System (HDFS) – Namenode
[Figure: Namenode screenshots annotated with its location, the active nodes, and edit log progress]

Hadoop Distributed File System (HDFS) – Datanode
[Figures: screenshots for Master, Slave1 and Slave2 showing the Namenode and where blocks are stored, and the filename-to-blocks metadata]

[Figure: the Namenode holds metadata (name, replicas, ..) for paths such as /user/input/name.txt, ....; files are split into blocks (128 MB) with replication (3) across Datanodes]
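
The block size and replication factor above can be made concrete with a small sketch. It splits a file into 128 MB blocks and assigns each block to three distinct datanodes. The round-robin placement and every name below are assumptions for illustration only; real HDFS placement also takes racks and node load into account.

```python
def split_into_blocks(file_size, block_size=128 * 1024**2):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes.

    Simplified round-robin placement (an assumption of this sketch).
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)]
            for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024**2)   # a 300 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```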

Hadoop Distributed File System (HDFS) – Commands

Return usage

hadoop fs -help

List out directory

hadoop fs -ls /

Create a directory

hadoop fs -mkdir /<dir_name>

Move files

hadoop fs -moveFromLocal <src> <dst>

hadoop fs -moveToLocal <src> <dst> (may only print a "not implemented yet" message, depending on the release)

Hadoop Distributed File System (HDFS) – Commands

Copy files

hadoop fs -copyFromLocal <src> <dst>

hadoop fs -copyToLocal <src> <dst>

Multiple sources

hadoop fs -put <src_1> ... <src_n> <dst>

Delete files

hadoop fs -rm -r <dir> (delete a directory and its contents recursively)

hadoop fs -rmdir <dir> (delete an empty directory)

hadoop fs -expunge (permanently delete files from the trash)

Count directories, files and bytes

hadoop fs -count <path>

Hadoop Distributed File System (HDFS) – Commands

Change permissions

hadoop fs -chmod -R <mode> <dir>

Checksum

hadoop fs -checksum <URI>

Merge files

hadoop fs -getmerge -nl <src> <dst>

Display content

hadoop fs -cat <path>

MapReduce
• Programming model for data processing
• Batch processing
• Processing units:
• Mapper – Map task & Reducer – Reduce task
• Shuffle & sort – in between the map & reduce phases
• Input & output: key-value pairs
• Tasks are scheduled by YARN
• Data locality optimization

[Figure: MapReduce job execution flow]

[Figure: MapReduce job – Mapper]

MapReduce – InputFormat
• FileInputFormat
– Base for file-based formats; takes the path containing the files to
read
• TextInputFormat
– Treats each line of input as a
separate record
• NLineInputFormat
– Sets the no. of lines of input that
each mapper receives
• DBInputFormat
– Reads data from an RDBMS
• KeyValueTextInputFormat
– Similar to TextInputFormat, but splits each line into a key and a value
• SequenceFileInputFormat
– Reads sequence files
• SequenceFileAsTextInputFormat
– Converts sequence file keys and values to text, e.g. as input for streaming
• SequenceFileAsBinaryInputFormat
– Reads sequence file keys and values as binary objects

[Figure: screenshot showing the InputFormat and the no. of input files]

MapReduce – InputSplit
Files loaded from the HDFS store
• Created by the InputFormat
• By default a file is broken into 128 MB splits
• The split size can be changed by setting the mapred.min.split.size parameter in mapred-site.xml, or with a
custom InputFormat
• No. of map tasks = No. of InputSplits

[Figures: screenshots showing that the no. of map tasks run = no. of InputSplits]
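
The rule that the number of map tasks equals the number of InputSplits can be sketched numerically. The code below assumes the split size is computed as max(minSize, min(maxSize, blockSize)) and that each file is split independently. It deliberately ignores details such as the tolerance Hadoop allows on the last split, empty files and non-splittable compressed files, so treat it as an approximation with hypothetical function names.

```python
def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size as max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_map_tasks(file_sizes, block_size=128 * 1024**2, min_size=1):
    """Estimate the number of map tasks: one per input split.

    Each file is split on its own, so a small file still costs one task.
    """
    size = split_size(block_size, min_size)
    return sum(-(-n // size) for n in file_sizes)  # ceiling division


if __name__ == "__main__":
    MB = 1024**2
    files = [300 * MB, 10 * MB]
    print(count_map_tasks(files))                     # default split size
    print(count_map_tasks(files, min_size=256 * MB))  # larger minimum split
```

Raising the minimum split size above the block size gives fewer, larger splits and therefore fewer map tasks.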

MapReduce – RecordReader
• Loads data from its source & converts it into key-value pairs
suitable for reading by the mapper
• By default, the record reader of TextInputFormat converts the data into key-value
pairs.
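
For text input, the key handed to the mapper is the byte offset of a line within the file and the value is the text of that line. A minimal sketch of that conversion follows; the function name is hypothetical, and the sketch ignores the handling of lines that straddle split boundaries.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value drops the line ending.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


if __name__ == "__main__":
    for key, value in read_records(b"hello world\nbye\n"):
        print(key, value)
```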

MapReduce – Partitioner & Combiner
Partitioner:
• Controls the partitioning of the keys of the intermediate map
output.
• A hash function is applied to the key to derive the partition.
Combiner:
• Processes the output data from the mapper before it is passed to the reducer
• Acts as a mini-reducer
• Reduces network congestion
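
Both ideas can be sketched in a few lines. The partitioner below picks a reducer from a hash of the key modulo the number of reducers, and the combiner sums counts locally before anything is sent over the network. The use of zlib.crc32 as the hash is an assumption made so that the result is stable between Python runs; it stands in for whatever hash function the real partitioner uses.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(map_output):
    """Sum the values for each key locally, like a mini-reducer."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


if __name__ == "__main__":
    mapper_output = [("a", 1), ("b", 1), ("a", 1)]
    combined = combine(mapper_output)      # fewer pairs to send
    for key, value in combined:
        print(key, value, "-> reducer", partition(key, 4))
```

Every occurrence of the same key lands on the same reducer, and the combiner shrinks three pairs to two before the shuffle.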

[Figure: MapReduce job – Reducer]

MapReduce – Shuffling & sorting
Shuffling:
• Process of transferring data from the mappers to the reducers
• Necessary for the reducers; otherwise they would not have any
input
• Shuffling can start even before the map phase has finished
Sorting:
• Merging & sorting of map outputs
• Helps the reducer distinguish when a new reduce task should start, that is, when the next key differs from the previous one
• Secondary sorting – sorts the values (in ascending or
descending order) passed to each reducer
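
The merge-and-group step can be sketched with the standard library. Each mapper's output is sorted, the sorted runs are merged, and a change of key marks the start of a new group for the reducer. Sorting whole (key, value) tuples also leaves the values of each key in ascending order, which is a simple stand-in for secondary sorting. All names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge sorted per-mapper outputs and group the values by key."""
    sorted_runs = (sorted(output) for output in map_outputs)
    merged = heapq.merge(*sorted_runs)
    # A new group starts whenever the key changes in the sorted stream.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == "__main__":
    mapper_1 = [("b", 2), ("a", 1)]
    mapper_2 = [("a", 3), ("c", 1)]
    for key, values in shuffle_and_sort([mapper_1, mapper_2]):
        print(key, values)
```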

[Figure: MapReduce – shuffling and sorting]

MapReduce – OutputFormat
• LazyOutputFormat
– Creates output files only when there is output to write
• TextOutputFormat
– Writes each key-value pair as a
separate line
• MultipleOutputs
– Writes data to multiple files
• DBOutputFormat
– Writes output to an SQL table
• MapFileOutputFormat
– Writes map files; keys are emitted in sorted order
• SequenceFileOutputFormat
– Writes sequence files as output
• SequenceFileAsBinaryOutputFormat
– Writes keys and values in binary form to a sequence
file


MapReduce - Data locality optimization
[Figure: racks and nodes, showing HDFS data blocks and the map tasks that read them]

MapReduce data flow with a single reduce task
[Figure: input splits 0–2 are read from HDFS by three map tasks; the map outputs are sorted and merged into one reduce task, whose output (part-0) is written to HDFS with replication]
Mapper and Reducer exchange intermediate key-value pairs:
Map: (K1, V1) --> list(K2, V2)
Reduce: (K2, list(V2)) --> list(K3, V3)
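
These two signatures can be illustrated with the classic word count, run in a single Python process instead of on a cluster. Here K1 is a line offset, V1 a line of text, K2 and K3 a word, and V2 and V3 a count. The function names and the in-memory grouping are assumptions of this sketch, not part of any Hadoop API.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: (K1, V1) -> list(K2, V2). Emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: (K2, list(V2)) -> list(K3, V3). Sum the counts of a word."""
    return [(word, sum(counts))]


def run_job(records):
    """Run map, group by key (the shuffle), then reduce in key order."""
    grouped = defaultdict(list)
    for offset, line in records:
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(reduce_fn(key, grouped[key]))
    return output


if __name__ == "__main__":
    print(run_job([(0, "to be or"), (9, "not to be")]))
```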

MapReduce data flow with multiple reduce tasks
[Figure: input splits 0–2 are read from HDFS by three map tasks; the sorted map outputs are merged into two reduce tasks, whose outputs (part-0 and part-1) are written to HDFS with replication]

MapReduce data flow with no reduce task
[Figure: input splits 0–2 are read from HDFS by three map tasks; each map task writes its own output (part-0, part-1, part-2) directly to HDFS with replication]

MapReduce – Speculative Execution
• A MapReduce job is dominated by its slowest task
• MapReduce attempts to locate slow tasks (stragglers) and run
redundant (speculative) tasks that will optimistically commit before
the corresponding stragglers
• Only one speculative copy of a straggler is allowed
• Whichever of the two copies of a task commits first
becomes the definitive copy, and the other copy is killed.
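
The effect on job time can be shown with a toy timing model. A task commits at the earlier of its own finish time and the finish time of its speculative copy, and the job finishes when its slowest task commits. The tuple layout and the numbers are made up for illustration; this is not how the framework decides which tasks to speculate.

```python
def finish_time(task_duration, speculate_at=None, backup_duration=None):
    """Time at which a task commits, with an optional speculative copy.

    The copy starts at `speculate_at` and needs `backup_duration`;
    whichever attempt finishes first wins and the other is discarded.
    """
    if speculate_at is None:
        return task_duration
    return min(task_duration, speculate_at + backup_duration)


def job_time(tasks):
    """A job finishes when its slowest task commits."""
    return max(finish_time(*task) for task in tasks)


if __name__ == "__main__":
    without = [(10,), (12,), (60,)]          # third task is a straggler
    with_copy = [(10,), (12,), (60, 15, 11)]  # copy launched at t=15
    print(job_time(without), job_time(with_copy))
```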

[Figure: a straggler task and its speculative task]
A task can fail because
1. The task throws a runtime exception
2. The child JVM exits suddenly
3. The task exceeds the mapred.task.timeout timeout

MapReduce – Counters
• A way to measure the progress of, or the no. of operations that occur
within, a map/reduce job
• Name – enum; value – long
• Used to validate:
– The no. of bytes read & written
– The no. of tasks launched and successfully run
– The amount of CPU & memory consumed by the job & the cluster
nodes.

Types of MapReduce – Counters
Two types:
• Built-in counters
• User-defined
counters
User-defined counters
• Enum counters
Defined at compile time; new counters cannot be
created at run time
• Dynamic counters
Created at run time, without
enums
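
The idea of an enum-named counter with an integer value can be imitated in plain Python. The sketch below counts valid and malformed records while processing input, the kind of check a user-defined counter is typically used for. The enum, the three-field record rule and the function name are all hypothetical.

```python
from collections import Counter
from enum import Enum


class RecordQuality(Enum):
    """Hypothetical user-defined counter names, fixed in the code."""
    VALID = "valid"
    MALFORMED = "malformed"


def count_records(lines, expected_fields=3):
    """Increment one counter per record, as a mapper would."""
    counters = Counter()
    for line in lines:
        if len(line.split(",")) == expected_fields:
            counters[RecordQuality.VALID] += 1
        else:
            counters[RecordQuality.MALFORMED] += 1
    return counters


if __name__ == "__main__":
    result = count_records(["a,b,c", "x,y", "1,2,3"])
    for name in RecordQuality:
        print(name.name, result[name])
```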

Types of MapReduce – Counters

Built-in counters

MapReduce Task Counters

no. of records read & written

FileSystem Counters

no. of bytes read & written by the filesystem

FileInputFormat Counters

no. of bytes read by map tasks

FileOutputFormat Counters

no. of bytes written by reduce tasks

MapReduce Job Counters

no. of map tasks launched (including tasks that failed)

MapReduce – Counters
[Figure: job output listing the counters: 10 FileSystem (FS) counters, 15 job counters, 20 MapReduce framework (MRF) counters, and SE – 6, IF – 1, OF – 1]

YARN Functionalities
Resource Manager
• The authority that arbitrates
resources among all
applications
• Scheduler
• Application manager
Node Manager
• Monitors resource usage
• Responsible for containers
• Reports the usage to the
ResourceManager
Application Master
• Negotiates resources from the scheduler, per application
• Tracks their status, per application
• Monitors progress, per application

Resource Manager – running applications
[Figure: screenshot showing the no. of applications submitted, the no. of active data nodes, and the application IDs]

[Figure: Application Master screenshot]

[Figure: map tasks running on the nodes]

[Figure: reduce tasks running on the nodes]

Limitations of MapReduce
• MapReduce is great at one-pass computation, but inefficient for
multi-pass algorithms
• No efficient primitives for data sharing
– State between steps goes to the distributed file system
– Slow due to replication & disk storage
• No control of data partitioning across steps

Iterative MapReduce
[Figure: each iteration (Iter-1, Iter-2, ...) reads its input from the file system (FS read) and writes its output back (FS write)]
Jobs commonly spend 90% of their time doing I/O

Problem
• To find the shortest paths from a source node to all other
nodes in the graph using Dijkstra's algorithm.

Iterative MapReduce
[Figure: Iteration – 1, Iteration – 2, Iteration – 3]
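
The iterative pattern for the shortest-path problem can be sketched in plain Python. Note that this is not Dijkstra's algorithm with a priority queue; it is the round-based edge relaxation usually used in MapReduce settings, where each round plays the role of one map and reduce pass and the loop stops when no distance changes. The graph, the function names and the in-memory state are assumptions; on a cluster, the state of every round would be written to and read back from the file system.

```python
import math


def iterate(graph, dist):
    """One map + reduce round: relax every edge, keep the minimum per node."""
    # Map: each node emits its own distance and a candidate per neighbour.
    emitted = []
    for node, edges in graph.items():
        emitted.append((node, dist[node]))
        for neighbour, weight in edges.items():
            emitted.append((neighbour, dist[node] + weight))
    # Reduce: keep the smallest candidate distance for every node.
    new_dist = {}
    for node, candidate in emitted:
        new_dist[node] = min(candidate, new_dist.get(node, math.inf))
    return new_dist


def shortest_paths(graph, source):
    """Repeat rounds until the distances stop changing."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    while True:
        new_dist = iterate(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist


if __name__ == "__main__":
    # Every node must appear as a key, even if it has no outgoing edges.
    graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5},
             "C": {"D": 1}, "D": {}}
    print(shortest_paths(graph, "A"))
```

Each pass through the loop corresponds to one iteration in the figure, which is why the repeated file system reads and writes dominate the running time on a real cluster.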
