24/08/18
Apache Hadoop

Apache Hadoop is a software library and framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.

Created by Doug Cutting and written in Java

Hadoop Components:-

Hadoop Distributed File System (HDFS) - Storage

MapReduce – Processing (other processing engines can also be used)

YARN (Yet Another Resource Negotiator) – Resource Management
Apache Hadoop - Features

Open Source
– Apache Software Foundation

Distributed Storage & Processing
– HDFS – Hadoop Distributed File System
– MapReduce – Parallel Processing

Fault tolerance
– Replication (3 replicas of each block by default; the number can be
changed as required)

Reliability
– Data is reliably stored on the
cluster of machines despite
machine failures.

Scalability
– Dynamically add new nodes
– increase data size

Easy to use
– Clients do not need to deal with the details of
distributed computing

Data locality
– Moves computation to the data rather than
data to the computation

High availability
– Data remains highly available and
accessible despite hardware
failure, because multiple copies of the
data are kept.
Apache Hadoop versions
Hadoop 1.0
Storage – HDFS (Replication)
Processing – MapReduce
Hadoop 2.0
Storage – HDFS (Replication)
Processing – MapReduce or other
Resource Management - YARN
Hadoop 3.0
Storage – HDFS (Erasure coding – reduces storage space)
Processing – MapReduce or other
Resource Management – YARN v.2
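
To make the storage difference between replication and erasure coding concrete, the sketch below compares the raw disk space each needs for the same logical data. The Reed-Solomon layout of 6 data units plus 3 parity units is an assumption chosen for illustration (the slide does not name a policy), and the function names are hypothetical.

```python
def replicated_storage(data_tb: float, replicas: int = 3) -> float:
    """Raw storage needed when every block is stored `replicas` times."""
    return data_tb * replicas


def erasure_coded_storage(data_tb: float, data_units: int = 6, parity_units: int = 3) -> float:
    """Raw storage needed when each group of data units gets parity units added.

    The 6 + 3 layout is an assumed example, not a value taken from the slides.
    """
    return data_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical_tb = 100.0
    print(replicated_storage(logical_tb))     # 300.0
    print(erasure_coded_storage(logical_tb))  # 150.0
```

Under these assumptions, 100 TB of data needs 300 TB of raw storage with 3 replicas but 150 TB with the 6 + 3 layout.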
Apache Hadoop – Limitations

Issue with small files
With too many small files, the
NameNode becomes overloaded

Slow processing speed
MapReduce tasks take a lot of time to perform

Support for batch processing only
Does not process streamed data

No real-time data processing

No efficient iteration
Iterative work needs a chain of stages in which the output of
each stage is the input to the
next stage

Lengthy code
More lines of code bring more bugs and more time to execute
the program.

Latency
MapReduce is designed to support different formats,
structures and huge volumes of data, which adds latency

Not easy to use
Developers need to hand code each and
every operation

Security
Missing encryption

No abstraction
Developers need to hand code each and
every operation

No caching

Uncertainty
Unable to guarantee when a job will be
complete.
Hadoop Distributed File System (HDFS)
 When a dataset outgrows the storage capacity of a single physical
machine, it becomes necessary to partition it across a number of
separate machines.
 Filesystems that manage the storage across a network of machines
are called distributed filesystems.
Hadoop Distributed File System (HDFS)
Well suited to HDFS
• Very large files
Stores petabytes of data
• Streaming data access
Write-once, read-many-times
• Commodity hardware
Not well suited to HDFS
• Low-latency data access
Access in the tens of milliseconds range
• Lots of small files
The Namenode holds filesystem metadata
in memory
• Multiple writers or modifications of
files
Writes are always made at the end
of the file, in append-only
fashion
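
Because the Namenode keeps filesystem metadata in memory, the number of files matters more than their total size. The sketch below estimates that memory under an assumed rule of thumb of roughly 150 bytes per file or block entry; that figure, the 128 MB block size, and the function name are illustrative assumptions, not measurements.

```python
BYTES_PER_OBJECT = 150          # assumed rule-of-thumb cost per file or block entry
BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MB block size


def namenode_memory_bytes(num_files: int, file_size_bytes: int) -> int:
    """Estimate metadata memory for `num_files` files of equal size."""
    # Ceiling division; even a tiny file occupies one block entry.
    blocks_per_file = max(1, -(-file_size_bytes // BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tib = 2**40
    small = namenode_memory_bytes(one_tib // 2**20, 2**20)  # 1 TiB as 1 MiB files
    large = namenode_memory_bytes(one_tib // 2**30, 2**30)  # 1 TiB as 1 GiB files
    print(small, large)
```

With these assumptions, the same 1 TiB of data costs a few hundred megabytes of metadata as 1 MiB files, but only about a megabyte as 1 GiB files.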
Hadoop Distributed File System (HDFS) – 3 nodes
Namenode:-

Namespace & metadata
• List of files, blocks of each
file, Datanodes for each
block, file attributes
Datanode:-

Stores data

Periodically validates
checksums

Sends reports on existing blocks
to the Namenode
Secondary Namenode:-

Checkpointing in HDFS

Merges edit logs with the fsimage from the Namenode

Helper node for the Namenode
Hadoop Services
[Figure: services running on the master node, Slave1 and Slave2: the HDFS service on Hadoop, the YARN service on Hadoop, and job history details]
Hadoop Distributed File System (HDFS) – Namenode
[Figure: Namenode pages showing the location, the active nodes, and the edit log progress]
Hadoop Distributed File System (HDFS) – Datanode
[Figure: Datanode pages for the master, Slave1 and Slave2, showing the Namenode and where blocks are stored, plus the file name and block metadata]
Hadoop Distributed File System (HDFS)
Namenode
Metadata (Name, replicas,..)
/user/input/name.txt, ....
Blocks (128 MB)
Replication (3)
Datanodes
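
The block size and replication factor above determine how a file is laid out. The following is a minimal sketch, assuming 128 MB blocks, a replication factor of 3, and a simple round-robin placement; real HDFS placement is rack-aware, and the Datanode names and function names here are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB) -> list:
    """Return the size in bytes of each block a file occupies."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks: int, datanodes: list, replication: int = 3) -> dict:
    """Assign each block to distinct Datanodes, round-robin (a simplification)."""
    copies = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(copies)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])  # [128, 128, 44]
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

A 300 MB file therefore occupies three blocks, and with three replicas each the cluster stores nine block copies in total.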
Hadoop Distributed File System (HDFS) – Commands

Return usage

hadoop fs -help

List out directory

hadoop fs -ls /

Create a directory

hadoop fs -mkdir /<dir_name>

Move file

hadoop fs -moveFromLocal <src> <dst>

hadoop fs -moveToLocal <src> <dst> (reports "Not implemented yet" in current releases)
Hadoop Distributed File System (HDFS) – Commands

Copy files

hadoop fs -copyFromLocal <src> <dst>

hadoop fs -copyToLocal <src> <dst>

Multiple src

hadoop fs -put <src_1> ... <src_n> <dst>

Delete file

hadoop fs -rm -r <dir>

hadoop fs -rmdir <dir> (removes an empty directory)

hadoop fs -expunge (permanently deletes files from the trash)

Count no. of directories, files and bytes

hadoop fs -count <path>
Hadoop Distributed File System (HDFS) – Commands

Change permission

hadoop fs -chmod -R <mode> <dir>

Checksum

hadoop fs -checksum <URI>

Merge files

hadoop fs -getmerge -nl <src> <dst>

Display content

hadoop fs -cat <path>
MapReduce
• Programming model for data processing
• Batch processing
• Processing unit:-
  – Mapper – Map task & Reducer – Reduce task
  – Shuffle & sort – in between the map & reduce phases
• Input & output: key-value pairs
• Tasks are scheduled by YARN
• Data locality optimization
MapReduce job execution flow
MapReduce job in - Mapper
MapReduce – InputFormat
• FileInputFormat
  – Specifies the path containing the files to read
• TextInputFormat
  – Treats each line of input as a separate record
• NLineInputFormat
  – Sets the number of lines of input that each mapper receives
• DBInputFormat
  – Reads data from an RDBMS
• KeyValueTextInputFormat
  – Similar to TextInputFormat, but splits each line into a key and a value
• SequenceFileInputFormat
  – Reads sequence files
• SequenceFileAsTextInputFormat
  – Converts sequence file keys and values to text, as input for streaming
• SequenceFileAsBinaryInputFormat
  – Reads sequence file keys and values as binary objects
MapReduce – InputFormat
[Figure: the InputFormat and the number of input files]
MapReduce – InputSplit
Files loaded from the HDFS store
• Created by the InputFormat
• By default, a file is broken into 128 MB splits
• The split size can be changed by setting the mapred.min.split.size parameter in mapred-site.xml, or with a
custom InputFormat
• No. of map tasks = No. of InputSplits
MapReduce – InputSplit
[Figure: the InputSplits and the number of map tasks run; No. of map tasks = No. of InputSplits]
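
The relationship between split size and the number of map tasks can be sketched as follows. The split-size rule used here, max(min size, min(max size, block size)), follows the logic of file-based input formats, but the sketch ignores details such as the small tolerance allowed on the final split; the function names and file sizes are hypothetical.

```python
MB = 1024 * 1024


def compute_split_size(block_size: int, min_size: int = 1, max_size: int = 2**63 - 1) -> int:
    """Split size is the block size, clamped between the min and max settings."""
    return max(min_size, min(max_size, block_size))


def num_map_tasks(file_sizes: list, split_size: int) -> int:
    """One map task per split; each file is split independently."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // split_size) for size in file_sizes if size > 0)


if __name__ == "__main__":
    files = [300 * MB, 50 * MB]
    default_split = compute_split_size(128 * MB)
    print(num_map_tasks(files, default_split))  # 4
    larger_split = compute_split_size(128 * MB, min_size=256 * MB)
    print(num_map_tasks(files, larger_split))   # 3
```

Raising the minimum split size above the block size produces fewer, larger splits and therefore fewer map tasks.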
MapReduce – RecordReader
• Loads data from its source and converts it into key-value pairs
suitable for reading by the mapper
• By default, TextInputFormat is used to convert the data into key-value
pairs (key: byte offset of the line, value: contents of the line).
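
As an illustration of what a line-oriented record reader hands to the mapper, the sketch below turns raw bytes into (byte offset, line) pairs. It is a simplified stand-in written for this example, not Hadoop's own reader; the function name is hypothetical.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its terminator.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


if __name__ == "__main__":
    for key, value in line_records(b"hello world\nfoo\n"):
        print(key, value)
```

For the two-line input above, the pairs are (0, "hello world") and (12, "foo").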
MapReduce – Partitioner & Combiner
Partitioner:-
• Controls the partitioning of the keys of the intermediate map
output.
• A hash function on the key is used to derive the partition.
Combiner:-
• Processes the output data from the mapper before it is passed to the reducer
• Acts as a mini-reducer
• Reduces network congestion
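
A minimal sketch of both ideas, assuming word-count style (key, count) pairs: the partitioner hashes the key to choose a reducer, and the combiner pre-aggregates map output so fewer records cross the network. Here `zlib.crc32` stands in for the hash function and both function names are hypothetical.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Hash the key to pick a reducer; crc32 is a stand-in hash for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(map_output: list) -> list:
    """Mini-reducer: sum the counts per key on the mapper side, before the shuffle."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


if __name__ == "__main__":
    map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
    combined = combine(map_output)
    print(combined)  # [('a', 3), ('b', 1)]
    for key, value in combined:
        print(key, "-> reducer", partition(key, 4))
```

The combiner shrinks four records to two here, and every occurrence of a key is sent to the same reducer because the partition depends only on the key.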
MapReduce job in - Reducer
MapReduce – Shuffling & sorting
Shuffling:-
• Process of transferring data from the mappers to the reducers
• Necessary for the reducers; otherwise, they would not have any
input
• Shuffling can start even before the map phase has finished
Sorting:-
• Merging & sorting of the map outputs
• Lets the reducer distinguish when a new reduce task should start
• Secondary sorting – sorts the values (in ascending or
descending order) passed to each reducer
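
The sketch below shows why sorting matters to the reducer: once the merged map outputs are sorted by key, a change of key in the stream marks the start of a new reduce call. Sorting on the value as well imitates a secondary sort. This is an in-memory illustration with hypothetical function names, not Hadoop's shuffle implementation.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs: list) -> list:
    """Merge the outputs of all mappers and sort by key, then by value."""
    merged = [pair for output in map_outputs for pair in output]
    return sorted(merged, key=itemgetter(0, 1))


def reduce_groups(sorted_pairs: list):
    """Yield (key, values); a change of key starts a new group."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == "__main__":
    mapper_a = [("b", 2), ("a", 5)]
    mapper_b = [("a", 1), ("b", 7)]
    ordered = shuffle_and_sort([mapper_a, mapper_b])
    print(list(reduce_groups(ordered)))  # [('a', [1, 5]), ('b', [2, 7])]
```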
MapReduce – Shuffling & sorting
[Figure: the shuffling and sorting stages]
MapReduce – OutputFormat
• LazyOutputFormat
  – Creates output files only when there is output to write
• TextOutputFormat
  – Writes each key-value pair as a separate line
• MultipleOutputs
  – Writes data to multiple files
• DBOutputFormat
  – Sends output to an SQL table
• MapFileOutputFormat
  – Writes map files; keys must be emitted in sorted order
• SequenceFileOutputFormat
  – Writes sequence files as output
• SequenceFileAsBinaryOutputFormat
  – Writes keys and values in binary form to sequence
files as output
MapReduce – OutputFormat
MapReduce - Data locality optimization
[Figure: racks and nodes, showing the HDFS data blocks and the map task]
MapReduce data flow with a single reduce task
[Figure: splits 0, 1 and 2 of the input in HDFS each feed a map task; the sorted map outputs are merged into one reduce task, which writes part-0 to the output in HDFS with HDFS replication]
Mapper and Reducer exchange intermediate key-value pairs:
Map: (K1, V1) --> list(K2, V2)
Reduce: (K2, list(V2)) --> list(K3, V3)
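
The Map and Reduce signatures above can be illustrated with a word count run entirely in memory. This is a sketch of the programming model only: the function names are hypothetical, the line index stands in for the input key, and a real job would run the phases as distributed tasks.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: (K1, V1) -> list(K2, V2). Emits (word, 1) for each word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: (K2, list(V2)) -> list(K3, V3). Sums the counts for one word."""
    return [(word, sum(counts))]


def run_job(lines):
    # Map phase: one call per input record (the line index is used as the key).
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    # Shuffle and sort: group the values by key, then visit keys in sorted order.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: one call per distinct key.
    result = []
    for key in sorted(grouped):
        result.extend(reduce_fn(key, grouped[key]))
    return result


if __name__ == "__main__":
    print(run_job(["the cat", "the dog the"]))  # [('cat', 1), ('dog', 1), ('the', 3)]
```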
MapReduce data flow with multiple reduce tasks
[Figure: splits 0, 1 and 2 of the input in HDFS each feed a map task; the sorted map outputs are sent to two reduce tasks, which merge their inputs and write part-0 and part-1 to the output in HDFS with HDFS replication]
MapReduce data flow with no reduce task
[Figure: splits 0, 1 and 2 of the input in HDFS each feed a map task, which writes part-0, part-1 and part-2 directly to the output in HDFS with HDFS replication]
MapReduce – Speculative Execution
• A MapReduce job is dominated by its slowest task
• MapReduce attempts to locate slow tasks (stragglers) and run
redundant (speculative) tasks that will optimistically commit before
the corresponding stragglers
• Only one copy of a straggler is allowed to be speculated
• Whichever copy (among the two copies) of a task commits first
becomes the definitive copy, and the other copy is killed.
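
The effect of speculation on job time can be sketched with a toy model. It assumes task durations are known up front and flags a task as a straggler when it exceeds 1.5 times the median; a real cluster estimates progress at run time, so the threshold, the durations, and the function names are all assumptions made for illustration.

```python
from statistics import median


def find_stragglers(durations: dict, slowdown: float = 1.5) -> list:
    """Tasks whose duration exceeds `slowdown` times the median duration."""
    typical = median(durations.values())
    return [task for task, d in durations.items() if d > slowdown * typical]


def finish_time(original: float, backup_start: float, backup_duration: float):
    """Whichever copy commits first wins; the other copy is killed."""
    backup_finish = backup_start + backup_duration
    if backup_finish < original:
        return backup_finish, "speculative"
    return original, "original"


if __name__ == "__main__":
    durations = {"m0": 10, "m1": 11, "m2": 40}  # seconds; m2 is slow
    print(find_stragglers(durations))            # ['m2']
    # A backup copy of m2 starts at t=12 and runs at normal speed (10 s).
    print(finish_time(40, backup_start=12, backup_duration=10))  # (22, 'speculative')
```

In this model the job finishes at 22 seconds instead of 40, because the speculative copy of the slow task commits first.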
MapReduce – Speculative Execution
[Figure: a straggler task and its speculative task]
A task can fail because of:
1. The task throws a runtime exception
2. Sudden exit of the child JVM
3. A timeout exceeding mapred.task.timeout
MapReduce – Counters
• A way to measure progress or the number of operations that occur
within a map/reduce job
Name – enum & value – long
• Used to validate:-
  – No. of bytes read & written
  – No. of tasks launched and successfully run
  – Amount of CPU & memory consumed – by the job & the cluster
nodes.
Types of MapReduce – Counters
Two types:-
• Built-in counters
• User-defined
counters
User-defined counters
• Enum counters
Defined at compile time; new counters cannot be
created at run time
• Dynamic counters
Created at run time, without
enums
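
The difference between counters with names fixed in advance and counters created at run time can be sketched as below. This mimics the idea in plain Python rather than using Hadoop's counter API; the `RecordQuality` enum, the "country,value" record layout, and the counter names are hypothetical.

```python
from collections import Counter
from enum import Enum


class RecordQuality(Enum):  # hypothetical user-defined counter group
    VALID = "valid"
    MALFORMED = "malformed"


def count_records(lines):
    enum_counters = Counter()     # names fixed in advance by the Enum
    dynamic_counters = Counter()  # names created at run time from the data
    for line in lines:
        fields = line.split(",")
        if len(fields) != 2:
            enum_counters[RecordQuality.MALFORMED] += 1
            continue
        enum_counters[RecordQuality.VALID] += 1
        dynamic_counters[f"country.{fields[0]}"] += 1
    return enum_counters, dynamic_counters


if __name__ == "__main__":
    enums, dynamic = count_records(["IN,5", "US,7", "bad", "IN,1"])
    print(enums[RecordQuality.VALID], enums[RecordQuality.MALFORMED])  # 3 1
    print(dict(dynamic))  # {'country.IN': 2, 'country.US': 1}
```

The enum-based counters can only ever be VALID or MALFORMED, while the dynamic ones grow with whatever country codes appear in the input.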
Types of MapReduce – Counters

Built-in counters

MapReduce Task Counters

No. of records read & written

FileSystem Counters

No. of bytes read & written by the FS

FileInputFormat Counters

No. of bytes read by map tasks

FileOutputFormat Counters

No. of bytes written by reduce tasks

MapReduce Job Counters

No. of map tasks launched (including tasks that failed)
MapReduce – Counters
[Figure: counters reported for a job run]
No. of FS counters: 10
No. of job counters: 15
No. of MRF counters: 20
SE – 6, IF – 1, OF – 1
YARN Functionalities
Resource Manager
• The authority that arbitrates
resources among all
applications
• Scheduler
• Application manager
Node Manager
• Monitors resource usage
• Responsible for containers
• Reports the usage to the
Resource Manager
Application Master
• Scheduling per application
• Tracking status per application
• Monitoring progress per application
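
The division of work above can be pictured with a toy model in which a scheduler grants containers from the memory that each Node Manager reports as free. This is not the YARN API and real schedulers are far richer; the first-come, first-served policy, the node names, and the application IDs are assumptions made for illustration.

```python
def schedule(node_capacity_mb: dict, requests: list):
    """Grant containers first-come, first-served.

    node_capacity_mb: free memory per node, e.g. {"node1": 4096}
    requests: list of (application_id, container_memory_mb)
    Returns (granted, pending); granted holds (application_id, node, memory_mb).
    """
    free = dict(node_capacity_mb)
    granted, pending = [], []
    for app_id, memory in requests:
        # Only nodes with enough free memory can host the container.
        candidates = [node for node, mb in free.items() if mb >= memory]
        if not candidates:
            pending.append((app_id, memory))
            continue
        # Pick the node with the most free memory (ties go to the later name).
        node = max(candidates, key=lambda n: (free[n], n))
        free[node] -= memory
        granted.append((app_id, node, memory))
    return granted, pending


if __name__ == "__main__":
    nodes = {"node1": 4096, "node2": 2048}
    asks = [("app1", 2048), ("app1", 2048), ("app2", 2048), ("app2", 1024)]
    granted, pending = schedule(nodes, asks)
    print(granted)
    print(pending)  # [('app2', 1024)]
```

Once both nodes are full, the last request stays pending until capacity is released, which is the kind of decision the Resource Manager's scheduler makes across all applications.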
Resource Manager
[Figure: Resource Manager pages for the running applications, showing the no. of applications submitted, the no. of active data nodes, and the application IDs]
Node Manager
[Figure: Node Manager pages]
Application Master
[Figure: Application Master page]
Map Task
[Figure: map tasks run on the nodes]
Reduce Task
[Figure: reduce tasks run on the nodes]
Outputs
[Figure: job outputs]
Limitations of MapReduce
• MapReduce is great at one-pass computation, but inefficient for
multi-pass algorithms
• No efficient primitives for data sharing
  – State between steps goes to the distributed file system
  – Slow due to replication & disk storage
• No control of data partitioning across steps
Iterative MapReduce
[Figure: each iteration (Iter-1, Iter-2, ...) does an FS read before it runs and an FS write after it]
Commonly spends 90% of the time doing I/O
Problem
• Find the shortest paths from a source node to all other
nodes in a graph using Dijkstra's algorithm.
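
One common way to express this problem as iterative MapReduce is sketched below. Note that it is repeated edge relaxation rather than the priority-queue form of Dijkstra's algorithm: each pass of the loop plays the role of one MapReduce job, and the loop stops when no distance changes. The graph, the function names, and the in-memory grouping are assumptions made for illustration; on a cluster, each iteration would read and write its state through the file system.

```python
INF = float("inf")


def map_fn(node, state):
    """Emit the node's own record, plus a candidate distance for each neighbour."""
    distance, edges = state
    yield node, ("NODE", distance, edges)
    if distance < INF:
        for neighbour, weight in edges.items():
            yield neighbour, ("DIST", distance + weight, None)


def reduce_fn(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    best, edges = INF, {}
    for kind, distance, adjacency in values:
        if kind == "NODE":
            edges = adjacency
        best = min(best, distance)
    return node, (best, edges)


def shortest_paths(graph, source):
    state = {n: (0 if n == source else INF, edges) for n, edges in graph.items()}
    iterations = 0
    while True:
        iterations += 1
        grouped = {}
        for node, node_state in state.items():  # map phase
            for key, value in map_fn(node, node_state):
                grouped.setdefault(key, []).append(value)
        # Reduce phase: one call per node.
        new_state = dict(reduce_fn(n, vals) for n, vals in grouped.items())
        if new_state == state:  # no distance changed, so the result has converged
            break
        state = new_state
    return {n: d for n, (d, _) in state.items()}, iterations


if __name__ == "__main__":
    graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 6}, "C": {"D": 3}, "D": {}}
    print(shortest_paths(graph, "A"))
```

For this small example graph, the distances from A settle at A=0, B=1, C=3 and D=6 after three updating passes plus one final pass that detects no change.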
Input – data
[Figure: input data]
Iterative MapReduce
[Figure: results of iterations 1, 2 and 3]
Thanking you

More Related Content

What's hot

Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
Abhishek Mukherjee
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
Vigen Sahakyan
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Uday Vakalapudi
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
Jazan University
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Nitesh Ghosh
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
AnandMHadoop
 
Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
Vibrant Technologies & Computers
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Delhi/NCR HUG
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
IJRESJOURNAL
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
mcsrivas
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
Hadoop online training
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
Edureka!
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
FARUK BERKSÖZ
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
puneet yadav
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
Chirag Ahuja
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.
 

What's hot (20)

Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Map reducefunnyslide
Map reducefunnyslideMap reducefunnyslide
Map reducefunnyslide
 
Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 

Similar to Hadoop

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIve
Edward Capriolo
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
appaji intelhunt
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
Avinash Pandu
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
Chirag Ahuja
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
Oleksiy Krotov
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
Sreenu Musham
 
HADOOP
HADOOPHADOOP
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
Edureka!
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
Hadoop
HadoopHadoop
Hadoop
Dinakar nk
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
Sunil D Patil
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
DerrekYoungDotCom
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 

Similar to Hadoop (20)

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIve
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Training
TrainingTraining
Training
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 

Recently uploaded

Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
gb193092
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
goswamiyash170123
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
deeptiverma2406
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
chanes7
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 

Recently uploaded (20)

Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 

Hadoop

  • 1. 24/08/181 Apache Hadoop  Software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.  Created by Doug Cutting & Written in java  Hadoop Components:-  Hadoop Distributed File System (HDFS) - Storage  Map-Reduce – Processing or other processing engine  YARN (Yet Another Resource Negotiator) – Resource Mangement
  • 2. 24/08/182 Apache Hadoop - Features  Open Source – Apache Software Foundation  Distributed Storage & Processing – HDFS – Hadoop Distributed File system – MapReduce – Parallel Processing  Fault tolerance – Replication (By default 3 replicas of each block & it can be changed also as per the requirement)  Reliability – data is reliably stored on the cluster of machine despite machine failures.  Scalability – Dynamically add new nodes – increase data size  Easy to use – No need of client to deal with distributed computing  Data locality – Computation to data – Data to computation  High availability – Data is high availabile & accessible despite hardware failure due to multiple copies of data.
  • 3. 24/08/183 Apache Hadoop versions Hadoop 1.0 Storage – HDFS (Replication) Processing – MapReduce Hadoop 2.0 Storage – HDFS (Replication) Processing – MapReduce or other Resource Management - YARN Hadoop 3.0 Storage – HDFS (Erasure code – reduce storage space) Processing – MapReduce or other Resource Management – YARN v.2
  • 4. Apache Hadoop: Limitations
      - Issue with small files: too many small files overload the NameNode
      - Slow processing speed: MapReduce tasks take a lot of time to run
      - Support for batch processing only: streamed data is not processed
      - No real-time data processing
      - No iteration: Hadoop does not efficiently support a chain of stages in which the output of each stage is the input to the next
      - Lengthy code: more lines of code mean more bugs and more time to execute the program
      - Latency: MapReduce is designed to support different formats, structures and huge volumes of data, which makes it slow
      - Not easy to use: developers need to hand-code each and every operation
      - Security: missing encryption
      - No abstraction: developers need to hand-code each and every operation
      - No caching
      - Uncertainty: no guarantee of when a job will complete
  • 5. Hadoop Distributed File System (HDFS)
      - When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
      - Filesystems that manage storage across a network of machines are called distributed filesystems.
  • 6. Hadoop Distributed File System (HDFS)
      Well suited to HDFS:
      - Very large files: stores petabytes of data
      - Streaming data access: write once, read many times
      - Commodity hardware
      Not well suited to HDFS:
      - Low-latency data access (in the tens of milliseconds range)
      - Lots of small files: the NameNode holds filesystem metadata in memory
      - Multiple writers or arbitrary file modifications: writes are always made at the end of the file, in append-only fashion
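Because the NameNode keeps filesystem metadata in memory, the number of files matters more than their total size. A rough sketch of that effect follows. It assumes about 150 bytes of NameNode memory per namespace object (file or block), a commonly quoted rule of thumb rather than a figure from the slides, and a 128 MB block size. The function name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not from the slides
BLOCK_SIZE = 128 * 1024 ** 2    # 128 MB blocks


def namenode_memory(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for `num_files` files of `file_size` bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


one_tib = 1024 ** 4
# The same 1 TiB of data, stored as 1 MiB files and as 1 GiB files.
small_files = namenode_memory(one_tib // 1024 ** 2, 1024 ** 2)
large_files = namenode_memory(one_tib // 1024 ** 3, 1024 ** 3)
print(small_files)  # 314572800 bytes, about 300 MiB
print(large_files)  # 1382400 bytes, about 1.3 MiB
```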
  • 7. Hadoop Distributed File System (HDFS): 3 nodes
      NameNode:
      - Namespace and metadata: list of files, blocks of each file, DataNodes holding each block, file attributes
      DataNode:
      - Stores data
      - Periodically validates checksums
      - Sends reports on its existing blocks to the NameNode
      Secondary NameNode:
      - Checkpointing in HDFS
      - Merges the edit logs with the fsimage from the NameNode
      - Helper node for the NameNode
  • 8. Hadoop Services [screenshot: master node, slave1 and slave2; HDFS and YARN services running on Hadoop; job history details]
  • 9. Hadoop Distributed File System (HDFS): NameNode [screenshot]
  • 10. Hadoop Distributed File System (HDFS): NameNode [screenshot: location, active nodes, edit log progress]
  • 11. Hadoop Distributed File System (HDFS): DataNode [screenshot]
  • 12. Hadoop Distributed File System (HDFS): DataNode [screenshot: master, showing the NameNode and where blocks are stored]
  • 13. Hadoop Distributed File System (HDFS): DataNode [screenshot: slave1, showing the NameNode and where blocks are stored]
  • 14. Hadoop Distributed File System (HDFS): DataNode [screenshot: slave2, showing where blocks are stored]
  • 15. Hadoop Distributed File System (HDFS): DataNode [screenshot: filename, blocks, metadata]
  • 16. Hadoop Distributed File System (HDFS) [diagram]
      - NameNode: metadata (name, replicas, ..), e.g. /user/input/name.txt, ....
      - Blocks (128 MB), replication (3)
      - DataNodes
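The block size and replication factor in the diagram determine how a file is laid out. As a minimal sketch, assuming the 128 MB block size and replication factor of 3 shown above (the helper name is hypothetical), a file is cut into full blocks plus one shorter final block, and each block is stored three times:

```python
BLOCK_SIZE = 128 * 1024 ** 2   # 128 MB, as in the diagram
REPLICATION = 3                # replication factor, as in the diagram


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(300 * 1024 ** 2)   # a 300 MB file
print(len(blocks))                            # 3 blocks: 128 MB, 128 MB, 44 MB
print(len(blocks) * REPLICATION)              # 9 block replicas across the DataNodes
print(sum(blocks) * REPLICATION)              # 943718400 bytes of raw storage
```

The last block only occupies as much space as the data it holds, so the 300 MB file uses 900 MB of raw storage, not 3 x 384 MB.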
  • 17. Hadoop Distributed File System (HDFS): Commands
      - Show usage: hadoop fs -help
      - List a directory: hadoop fs -ls /
      - Create a directory: hadoop fs -mkdir /<dir_name>
      - Move a file: hadoop fs -moveFromLocal <src> <dst>
                     hadoop fs -moveToLocal <src> <dst>
  • 18. Hadoop Distributed File System (HDFS): Commands
      - Copy files: hadoop fs -copyFromLocal <src> <dst>
                    hadoop fs -copyToLocal <src> <dst>
      - Copy multiple sources: hadoop fs -put <src_1> ... <src_n> <dst>
      - Delete: hadoop fs -rm -r <dir>
                hadoop fs -rmdir <dir> (empty directories only)
                hadoop fs -expunge (empties the trash, deleting files permanently)
      - Count files: hadoop fs -count <path>
  • 19. Hadoop Distributed File System (HDFS): Commands
      - Change permissions: hadoop fs -chmod -R <mode> <dir>
      - Checksum: hadoop fs -checksum <URI>
      - Merge files: hadoop fs -getmerge -nl <src> <dst>
      - Display content: hadoop fs -cat <path>
  • 20. MapReduce
      - Programming model for data processing
      - Batch processing
      - Processing units: Mapper (map task) and Reducer (reduce task)
      - Shuffle and sort: between the map and reduce phases
      - Input and output: key-value pairs
      - Tasks are scheduled by YARN
      - Data locality optimization
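The model can be illustrated with a word count. The following is a single-process sketch in plain Python, not Hadoop code: the function names are hypothetical, and the shuffle and sort step that the framework performs across machines is simulated here with a dictionary and `sorted`.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Mapper: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer: sum all counts received for one key."""
    yield word, sum(counts)


def run_job(lines):
    # Map phase: each input record produces intermediate key-value pairs.
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))

    # Shuffle: group the values by key. Sort: present keys to reducers in order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


print(run_job(["big data big cluster", "data node"]))
# [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```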
  • 23. MapReduce: InputFormat
      - FileInputFormat: specifies the path containing the files to read
      - TextInputFormat: treats each line of input as a separate record
      - NLineInputFormat: sets the number of input lines each mapper receives
      - DBInputFormat: reads data from an RDBMS
      - KeyValueTextInputFormat: similar to TextInputFormat
      - SequenceFileInputFormat: reads sequence files
      - SequenceFileAsTextInputFormat: input for streaming
      - SequenceFileAsBinaryInputFormat: binary objects
  • 25. MapReduce: InputSplit [diagram: files loaded from the HDFS store]
      - Created by the InputFormat
      - By default a file is broken into 128 MB splits
      - The split size can be changed by setting the mapred.min.split.size parameter in mapred-site.xml, or with a custom InputFormat
      - Number of map tasks = number of InputSplits
  • 26. MapReduce: InputSplit [diagram: InputSplits; number of map tasks = number of InputSplits]
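The relationship between split size and the number of map tasks can be sketched as follows. This mirrors the max(minimum, min(maximum, block size)) rule used by file-based input formats, but it is a simplification: it handles a single file, ignores any tolerance applied to a small trailing remainder, and uses hypothetical function names. The minimum corresponds to mapred.min.split.size from the slide (named mapreduce.input.fileinputformat.split.minsize in newer releases).

```python
import math

MB = 1024 ** 2


def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2 ** 63 - 1) -> int:
    """Split size is the block size, clamped between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def num_map_tasks(file_size: int, block_size: int = 128 * MB, min_size: int = 1) -> int:
    """One map task per InputSplit (simplified: one file, no trailing-split tolerance)."""
    split_size = compute_split_size(block_size, min_size)
    return math.ceil(file_size / split_size)


print(num_map_tasks(1024 * MB))                     # 8 splits of 128 MB
print(num_map_tasks(1024 * MB, min_size=256 * MB))  # 4 splits of 256 MB
```

Raising the minimum split size above the block size gives fewer, larger map tasks, at the cost of some data locality because a split then spans more than one block.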
  • 28. MapReduce: RecordReader
      - Loads data from its source and converts it into key-value pairs suitable for reading by the mapper
      - By default, the RecordReader of TextInputFormat is used to convert the data into key-value pairs
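For line-oriented text input, the key is the byte offset of the line within the file and the value is the line itself. A minimal sketch of that conversion in plain Python (the function name is hypothetical, and split boundaries are ignored):

```python
def read_records(data: bytes):
    """Yield (byte offset, line text) pairs, one per line of input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The value is the line without its terminator; the key is where it starts.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


print(list(read_records(b"hello world\nhadoop\n")))
# [(0, 'hello world'), (12, 'hadoop')]
```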
  • 29. MapReduce: Partitioner & Combiner
      Partitioner:
      - Controls the partitioning of the keys of the intermediate map output
      - A hash function applied to the key determines the partition
      Combiner:
      - Processes the output data from the mapper before it is passed to the reducer
      - Acts as a mini-reducer
      - Reduces network congestion
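Both ideas fit in a few lines. The sketch below is an illustration in plain Python, not Hadoop's implementation: a Java-String-style hash stands in for the key's hash code (real Hadoop key types define their own), the partition is the non-negative hash modulo the number of reducers, and the combiner pre-aggregates one mapper's output locally. All function names are hypothetical.

```python
from collections import Counter


def java_string_hash(s: str) -> int:
    """32-bit hash in the style of Java's String.hashCode (assumed stand-in for the key hash)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key: str, num_reducers: int) -> int:
    """Hash partitioning: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def combine(pairs):
    """Combiner: sum the counts per key on the map side before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(mapper_output)      # 3 pairs shrink to 2 before crossing the network
routed = {key: partition(key, 4) for key, _ in combined}
print(combined)  # [('a', 2), ('b', 1)]
print(routed)    # {'a': 1, 'b': 2}
```

Because every occurrence of a key hashes to the same partition, all values for that key reach the same reducer.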
  • 31. MapReduce: Shuffling & sorting
      Shuffling:
      - The process of transferring data from the mappers to the reducers
      - Necessary for the reducers, which would otherwise have no input
      - Shuffling can start even before the map phase has finished
      Sorting:
      - Merging and sorting of the map outputs
      - Helps the reducer distinguish when a new reduce task should start
      - Secondary sorting: sorts the values (in ascending or descending order) passed to each reducer
  • 32. MapReduce: Shuffling & sorting [diagram: shuffling, sorting]
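The effect of sorting and secondary sorting can be shown on a small list of map outputs. This is a plain-Python sketch of the result only: in Hadoop, secondary sort is usually achieved with composite keys and custom comparators, which are not modelled here, and the function name is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_output, descending_values=False):
    """Group map output by key, with keys sorted and each key's values sorted too."""
    # Sort by value first, then by key; the second sort is stable,
    # so values stay ordered within each key.
    ordered = sorted(map_output, key=itemgetter(1), reverse=descending_values)
    ordered.sort(key=itemgetter(0))
    # A change of key marks the start of the next reduce call.
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


pairs = [("b", 3), ("a", 2), ("b", 1), ("a", 5)]
print(shuffle_and_sort(pairs))
# [('a', [2, 5]), ('b', [1, 3])]
print(shuffle_and_sort(pairs, descending_values=True))
# [('a', [5, 2]), ('b', [3, 1])]
```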
  • 33. MapReduce: OutputFormat
      - LazyOutputFormat: creates output files
      - TextOutputFormat: writes each key-value pair as a separate line of output
      - MultipleOutputs: writes data to multiple files
      - DBOutputFormat: writes output to an SQL table
      - MapFileOutputFormat: emits keys in sorted order
      - SequenceFileOutputFormat: writes sequence files as output
      - SequenceFileAsBinaryOutputFormat: writes key-value pairs to sequence files in binary form
  • 35. MapReduce: Data locality optimization [diagram: rack, node, HDFS data blocks, map task]
  • 36. MapReduce data flow with a single reduce task [diagram: input from HDFS (split 0, split 1, split 2) -> map tasks -> sort -> merge -> reduce -> output part-0 to HDFS, with HDFS replication]
      Intermediate key-value pairs pass from mapper to reducer:
      Map: (K1, V1) --> list(K2, V2)
      Reduce: (K2, list(V2)) --> list(K3, V3)
  • 37. MapReduce data flow with multiple reduce tasks [diagram: input from HDFS (split 0, split 1, split 2) -> map tasks -> sort -> merge -> two reduce tasks -> output part-0 and part-1 to HDFS, each with HDFS replication]
  • 38. MapReduce data flow with no reduce task [diagram: input from HDFS (split 0, split 1, split 2) -> map tasks -> output part-0, part-1, part-2 to HDFS, each with HDFS replication]
  • 39. MapReduce: Speculative Execution
      - A MapReduce job is dominated by its slowest task
      - MapReduce attempts to locate slow tasks (stragglers) and runs redundant (speculative) tasks that will, optimistically, commit before the corresponding stragglers
      - Only one speculative copy of a straggler is allowed
      - Whichever of the two copies of a task commits first becomes the definitive copy, and the other copy is killed
  • 40. MapReduce: Speculative Execution [diagram: straggler task, speculative task]
      A task can fail because:
      1. The task throws a runtime exception
      2. The child JVM exits suddenly
      3. The task exceeds the timeout set by mapred.task.timeout
  • 41. MapReduce: Speculative Execution [diagram: straggler task, speculative task]
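Why a speculative copy shortens the job can be seen with simple arithmetic. The sketch below is a toy model, not Hadoop's scheduler: it assumes every task has the same amount of work, flags the slowest task as a straggler when its rate is below half the mean rate (an arbitrary threshold chosen for illustration), and starts one backup copy from scratch at a given check time. All names and numbers are hypothetical.

```python
def finish_time(work: float, rate: float, start: float = 0.0) -> float:
    """Time at which a task started at `start` completes `work` units at `rate` units per second."""
    return start + work / rate


def run_with_speculation(work, rates, check_time, backup_rate, slow_factor=0.5):
    """Return (job time without speculation, job time with it, straggler index or None)."""
    originals = [finish_time(work, rate) for rate in rates]
    finals = list(originals)
    mean_rate = sum(rates) / len(rates)
    slowest = min(range(len(rates)), key=lambda i: rates[i])
    straggler = None
    # Only one speculative copy, and only for a task that is clearly lagging.
    if rates[slowest] < slow_factor * mean_rate and originals[slowest] > check_time:
        straggler = slowest
        backup = finish_time(work, backup_rate, start=check_time)
        # Whichever copy finishes first wins; the other is killed.
        finals[slowest] = min(originals[slowest], backup)
    # The job ends when its slowest task ends.
    return max(originals), max(finals), straggler


print(run_with_speculation(work=100, rates=[10, 10, 2], check_time=5, backup_rate=10))
# (50.0, 15.0, 2)
```

In this example the third task would hold the job for 50 seconds; a backup started at 5 seconds on a healthy node finishes at 15 seconds instead.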
  • 42. MapReduce: Counters
      - A way to measure progress or the number of operations that occur within a map/reduce job
      - Name: enum; value: long
      - Used to validate:
        - the number of bytes read and written
        - the number of tasks launched and successfully run
        - the amount of CPU and memory consumed by the job and by cluster nodes
  • 43. Types of MapReduce Counters
      Two types:
      - Built-in counters
      - User-defined counters
      User-defined counters:
      - Enum counters: defined at compile time; new counters cannot be created at run time
      - Dynamic counters
  • 44. Types of MapReduce Counters
      Built-in counters:
      - MapReduce task counters: number of records read and written
      - FileSystem counters: number of bytes read and written by the filesystem
      - FileInputFormat counters: number of bytes read by map tasks
      - FileOutputFormat counters: number of bytes written by reduce tasks
      - MapReduce job counters: number of map tasks launched (including tasks that failed)
  • 45. MapReduce: Counters [screenshot of job output]
      - No. of FS counters run: 10
      - No. of job counters run: 15
      - No. of MRF counters run: 20
      - Other counters run: SE: 6, IF: 1, OF: 1
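A user-defined counter is typically incremented inside the mapper or reducer as records are processed. The following plain-Python sketch imitates that pattern with an enum for the counter names and `collections.Counter` for the values. It is not the Hadoop counter API, and the enum, record format and function name are hypothetical.

```python
from collections import Counter
from enum import Enum


class RecordQuality(Enum):
    """Hypothetical user-defined counter group, fixed when the code is written."""
    VALID = "valid"
    MALFORMED = "malformed"


def map_with_counters(lines, counters):
    """Parse 'name,number' records, counting good and bad input along the way."""
    for line in lines:
        fields = line.split(",")
        if len(fields) == 2 and fields[1].strip().isdigit():
            counters[RecordQuality.VALID] += 1
            yield fields[0].strip(), int(fields[1])
        else:
            counters[RecordQuality.MALFORMED] += 1


counters = Counter()
pairs = list(map_with_counters(["a,1", "bad line", "b,2"], counters))
print(pairs)                              # [('a', 1), ('b', 2)]
print(counters[RecordQuality.VALID])      # 2
print(counters[RecordQuality.MALFORMED])  # 1
```

Counting malformed records this way lets a job report data-quality problems without failing or writing them to the output.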
  • 46. YARN Functionalities
      Resource Manager:
      - The authority that arbitrates resources among all applications
      - Scheduler
      - Application manager
      Node Manager:
      - Monitors resource usage
      - Responsible for containers
      - Reports this usage to the Resource Manager
      Application Master:
      - Scheduling per application
      - Tracking status per application
      - Monitoring progress per application
  • 49. Resource Manager: running applications [screenshot: number of applications submitted, number of active data nodes, application IDs]
  • 54. Map Task [screenshot: map tasks running on nodes]
  • 55. Reduce Task [screenshot: reduce tasks running on nodes]
  • 57. Limitations of MapReduce
      - MapReduce is great at one-pass computation but inefficient for multi-pass algorithms
      - No efficient primitives for data sharing
      - State between steps goes to the distributed file system
      - Slow due to replication and disk storage
      - No control over data partitioning across steps
  • 58. Iterative MapReduce [diagram: Iter-1, Iter-2, ...; each iteration does a filesystem read and a filesystem write]
      - Iterative jobs commonly spend 90% of their time doing I/O
  • 59. Problem
      - Find the shortest paths from a source node to all other nodes in a graph using Dijkstra's algorithm.
  • 61. Iterative MapReduce [diagram: Iteration 1, Iteration 2, Iteration 3]
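The shortest-path problem shows why iteration is costly: every pass is a full map and reduce over the whole graph. The sketch below is a single-process illustration in plain Python with a hypothetical four-node graph. It uses the parallel edge-relaxation formulation that suits MapReduce (each iteration relaxes every edge, in the style of Bellman-Ford) rather than Dijkstra's priority queue, which is inherently sequential. In a real job, the graph passed between iterations would be written to and read from the distributed file system each time.

```python
import math


def map_fn(node, record):
    """Emit the node's own record, plus a candidate distance for each neighbour."""
    dist, edges = record
    yield node, ("node", dist, edges)          # carry the graph structure forward
    if dist < math.inf:
        for neighbour, weight in edges.items():
            yield neighbour, ("dist", dist + weight, None)


def reduce_fn(node, values):
    """Keep the smallest distance seen for the node and restore its edge list."""
    best, edges = math.inf, {}
    for kind, dist, adjacency in values:
        if kind == "node":
            edges = adjacency
        best = min(best, dist)
    return node, (best, edges)


def iterate(graph):
    """One MapReduce pass: map every node, group by key, reduce every group."""
    grouped = {}
    for node, record in graph.items():
        for key, value in map_fn(node, record):
            grouped.setdefault(key, []).append(value)
    return dict(reduce_fn(node, values) for node, values in grouped.items())


def shortest_paths(adjacency, source):
    """Repeat passes until no distance changes; return distances and the pass count."""
    graph = {n: (0 if n == source else math.inf, edges) for n, edges in adjacency.items()}
    iterations = 0
    while True:
        new_graph = iterate(graph)
        iterations += 1
        if new_graph == graph:
            return {n: d for n, (d, _) in new_graph.items()}, iterations
        graph = new_graph


# Hypothetical example graph: node -> {neighbour: edge weight}
adjacency = {"A": {"B": 4, "C": 1}, "B": {"D": 1}, "C": {"B": 2, "D": 5}, "D": {}}
distances, iterations = shortest_paths(adjacency, "A")
print(distances)   # {'A': 0, 'B': 3, 'C': 1, 'D': 4}
print(iterations)  # 4 passes, the last one only to confirm nothing changed
```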