Hadoop Architecture
Agenda
• The different Hadoop daemons & their roles
• What a Hadoop cluster looks like
• Under the hood: how it writes a file
• Under the hood: how it reads a file
• Under the hood: how it replicates a file
• Under the hood: how it runs a job
• How to balance an unbalanced Hadoop cluster
Hadoop – A bit of background
• It’s an open source project
• Based on two technical papers published by Google
• A well-known platform for distributed applications
• Easy to scale out
• Works well with commodity hardware (not entirely true)
• Very good for background applications
Hadoop Architecture
• Two Primary components
 Distributed File System (HDFS): handles file
operations such as read, write, and delete
 Map Reduce Engine: handles parallel computation
Hadoop Distributed File System
• Runs on top of the existing file system
• A file is broken into pre-defined, equal-sized blocks that are
stored individually
• Designed to handle very large files
• Not good for a huge number of small files
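
A minimal sketch of how those blocks surface through the client API (assuming a reachable cluster and a hypothetical /user/demo/File.txt path): the HDFS FileSystem API can report a file's block size and which Data Nodes hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/File.txt");  // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize() + " bytes");

    // One BlockLocation per block; each lists the Data Nodes holding a replica
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Offset " + loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
    }
  }
}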
Map Reduce Engine
• A Map Reduce Program consists of map and reduce functions
• A Map Reduce job is broken into tasks that run in parallel
• Prefers local processing if possible
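
As a sketch of how such a job is wired together (hypothetical class and path names, assuming a Hadoop 2.x-style org.apache.hadoop.mapreduce client; FraudMapper and FraudReducer are sketched later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FraudCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "fraud count");  // the job is split into parallel tasks

    job.setJarByClass(FraudCountDriver.class);
    job.setMapperClass(FraudMapper.class);           // hypothetical, sketched later
    job.setReducerClass(FraudReducer.class);         // hypothetical, sketched later
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));   // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}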
Hadoop Server Roles
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
slaves
masters
Clients
Name NodeJob Tracker
Secondary
Name Node
Map Reduce HDFS
Distributed Data Analytics Distributed Data Storage
Hadoop Cluster
(Diagram) Racks of slave machines (Data Node + Task Tracker), each rack with
its own switch; the Name Node, Job Tracker, Secondary Name Node and Client
run on separate machines. Rack switches uplink to core switches that connect
the cluster to the outside world. (Cluster and network diagrams adapted from
Brad Hedlund, bradhedlund.com.)
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample Scenario: How many times did our customers type the word "Fraud" into
emails sent to customer service?
File.txt: a huge file containing all emails sent to customer service
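Step 1 of this workflow, loading File.txt into the cluster, can be done with the hadoop fs shell or from Java; a minimal sketch with hypothetical local and HDFS paths follows. The remaining steps are what the rest of the deck walks through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadFileTxt {
  public static void main(String[] args) throws Exception {
    // Step 1 of the workflow: load File.txt into the cluster (an HDFS write).
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path("/tmp/File.txt"),          // local source (hypothetical)
                         new Path("/user/demo/File.txt"));   // HDFS destination (hypothetical)
    System.out.println("Loaded: " + fs.exists(new Path("/user/demo/File.txt")));
  }
}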
Writing files to HDFS
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicate the block
• The cycle repeats for the next block
(Diagram) Client to Name Node: "I want to write Blocks A, B, C of File.txt."
Name Node to Client: "OK, write to Data Nodes 1, 5, 6."
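
A minimal client-side sketch of this write path (hypothetical path, reachable cluster assumed): the client only writes to a stream, while the Name Node picks the target Data Nodes and the Data Nodes replicate each block among themselves.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // The client asks for an output stream; the Name Node picks the Data Nodes
    // for each block, and the Data Nodes replicate it among themselves.
    Path target = new Path("/user/demo/File.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(target, true)) {
      out.write("email bodies go here".getBytes(StandardCharsets.UTF_8));
    }

    // Optionally check or change the replication factor for this file
    System.out.println("Replication: " + fs.getFileStatus(target).getReplication());
    fs.setReplication(target, (short) 3);
  }
}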
Preparing HDFS writes
• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for Map Reduce
(Diagram) Client to Name Node: "I want to write Block A of File.txt." Name
Node (rack aware: Rack 1 holds Data Node 1, Rack 5 holds Data Nodes 5 and 6):
"OK, write to Data Nodes 1, 5, 6." The Client asks Data Node 1 "Ready?", Data
Node 1 checks Data Nodes 5 and 6, and each replies "Ready!" before the write
begins.
Pipelined Write
• Data Nodes 1 & 5 pass the data along as it is received
• TCP port 50010
(Diagram) The Client streams Block A to Data Node 1 (Rack 1), which forwards
it to Data Node 5, which forwards it to Data Node 6 (both in Rack 5). The
Name Node's rack-awareness table: Rack 1: Data Node 1; Rack 5: Data Nodes 5
and 6.
Pipelined Write (continued)
(Diagram) Once Block A lands on Data Nodes 1, 2 and 3, each Data Node reports
"Block received" to the Name Node, the Client reports "Success", and the Name
Node records the block's locations in its metadata: File.txt, Blk A: DN1,
DN2, DN3.
Multi-block Replication Pipeline
(Diagram) Each block of File.txt (A, B, C) gets its own replication pipeline
to a different set of three Data Nodes, spread across Racks 1, 4 and 5.
• 1 TB file = 3 TB of storage and 3 TB of network traffic (with the default
replication factor of 3)
Client writes Span the HDFS Cluster
(Diagram) The blocks of File.txt end up scattered across Data Nodes in every
rack of the cluster.
Factors: block size and file size. More blocks = wider spread. For example,
with the Hadoop 1.x default block size of 64 MB, a 1 TB file is split into
16,384 blocks.
Hadoop Rack Awareness – Why?
• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption: in-rack is higher bandwidth, lower latency
(Diagram) Name Node metadata for File.txt: Blk A on DN1, DN5, DN6; Blk B on
DN7, DN1, DN2; Blk C on DN5, DN8, DN9, spread across Racks 1, 5 and 9. The
rack-awareness table tells the Name Node which Data Node lives in which rack
(e.g. Rack 1: DN1, DN2, DN3; Rack 5: DN5, DN6, DN7).
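
The mapping itself comes from the operator, usually as a topology script or a pluggable mapping class. A purely illustrative Java sketch of such a host-to-rack table, with made-up host names and rack labels (not Hadoop's actual plugin interface):

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: maps Data Node host names to rack IDs, the way an
 *  operator-supplied topology script or mapping class would. */
public class RackMap {
  private static final Map<String, String> HOST_TO_RACK = new HashMap<>();
  static {
    HOST_TO_RACK.put("datanode1.example.com", "/rack1");  // made-up hosts and racks
    HOST_TO_RACK.put("datanode5.example.com", "/rack5");
    HOST_TO_RACK.put("datanode6.example.com", "/rack5");
  }

  /** Unknown hosts fall back to a default rack, mirroring Hadoop's behaviour. */
  public static String rackOf(String host) {
    return HOST_TO_RACK.getOrDefault(host, "/default-rack");
  }

  public static void main(String[] args) {
    System.out.println(rackOf("datanode5.example.com"));  // prints /rack5
    System.out.println(rackOf("unknown-host"));           // prints /default-rack
  }
}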
Name Node
• Data Nodes send Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down
(Diagram) Data Nodes 1, 2 and 3 each hold blocks A and C and report: "I'm
alive! I have blocks: A, C." The Name Node builds its metadata from these
reports: File.txt = A, C; DN1: A, C; DN2: A, C; DN3: A, C.
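
As a purely illustrative model (not HDFS's real internal classes), the metadata the Name Node assembles from block reports boils down to two in-memory maps: file to blocks, and block to Data Nodes.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy model of the Name Node's in-memory metadata. */
public class ToyNameNodeMetadata {
  // file name -> ordered list of block IDs
  private final Map<String, List<String>> fileToBlocks = new HashMap<>();
  // block ID -> Data Nodes currently reporting a replica
  private final Map<String, List<String>> blockToNodes = new HashMap<>();

  /** Called when a Data Node's block report arrives (every 10th heartbeat). */
  public void onBlockReport(String dataNode, List<String> blocks) {
    for (String block : blocks) {
      blockToNodes.computeIfAbsent(block, b -> new ArrayList<>()).add(dataNode);
    }
  }

  public void addFile(String file, List<String> blocks) {
    fileToBlocks.put(file, blocks);
  }

  /** Where can each block of this file be read from? */
  public Map<String, List<String>> locate(String file) {
    Map<String, List<String>> result = new HashMap<>();
    for (String block : fileToBlocks.getOrDefault(file, Collections.emptyList())) {
      result.put(block, blockToNodes.getOrDefault(block, Collections.emptyList()));
    }
    return result;
  }
}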
Re-replicating missing replicas
(Diagram) Data Node 3 stops sending heartbeats. The Name Node spots the
missing replicas of blocks A and C ("Uh oh!"), consults its metadata and the
Rack Awareness table (Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8), and tells
a surviving Data Node: "Copy blocks A, C to Node 8."
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate
Secondary Name Node
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
(Diagram) Secondary Name Node to Name Node: "It's been an hour, give me your
metadata."
Client reading files from HDFS
(Diagram) Client to Name Node: "Tell me the block locations of Results.txt."
Name Node, from its metadata: Blk A = Data Nodes 1, 5, 6; Blk B = 8, 1, 2;
Blk C = 5, 8, 9. The blocks sit on Data Nodes spread across Racks 1, 5 and 9.
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially
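
A minimal read sketch under the same assumptions (hypothetical path, reachable cluster): FileSystem.open asks the Name Node for the block locations, and the returned stream then reads the blocks in order from the Data Nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open() consults the Name Node for block locations; the stream then
    // pulls each block, in order, from one of the Data Nodes holding it.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/Results.txt"));  // hypothetical
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}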
Data Processing: Map
• Map: "Run this computation on your local data"
• Job Tracker delivers Java code to Nodes with local data
(Diagram) Client to Job Tracker: "How many times does 'Fraud' appear in
File.txt?" The Job Tracker sends a Map Task ("Count 'Fraud' in your block")
to Data Nodes 1, 5 and 9, which hold blocks A, B and C locally. Local
results: Fraud = 3, Fraud = 0, Fraud = 11.
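
A hedged sketch of such a map function (hypothetical class name, pairing with the driver sketched earlier): it scans the lines of its local block and emits a count for every occurrence of the word "Fraud".

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper: counts occurrences of "Fraud" in the local block. */
public class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Text KEY = new Text("Fraud");
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Emit ("Fraud", 1) for every token that matches the word we care about
    for (String token : line.toString().split("\\s+")) {
      if (token.equalsIgnoreCase("fraud")) {
        context.write(KEY, ONE);
      }
    }
  }
}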
What if data isn't local?
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness
(Diagram) The Data Nodes in Rack 1 that hold Block A have no free task slots
("no Map tasks left"), so the Job Tracker assigns the task to Data Node 2 in
the same rack, which fetches the block: "I need Block A." The Map Tasks for
blocks B and C still run locally (Fraud = 0, Fraud = 11).
Data Node reading files from HDFS
(Diagram) Data Node 2 asks the Name Node: "Tell me the locations of Block A
of File.txt." The Name Node checks its metadata and rack-awareness table and
answers "Block A = 1, 5, 6", listing the rack-local Data Node 1 first.
• Name Node provides rack local Nodes first
• Leverage in-rack bandwidth, single hop
Data Processing: Reduce
• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS
(Diagram) The Map Tasks on Data Nodes 1, 5 and 9 send their intermediate
counts (Fraud = 3, 0 and 11) over the network to a Reduce Task on Data Node
3, which sums them (Fraud = 14) and writes Results.txt to HDFS.
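
And a matching reduce sketch (again a hypothetical class name): it sums the per-block counts the mappers delivered and emits the total that ends up in Results.txt.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Hypothetical reducer: sums the per-block "Fraud" counts from the mappers. */
public class FraudReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable count : counts) {
      total += count.get();        // e.g. 3 + 0 + 11 = 14
    }
    context.write(word, new IntWritable(total));  // becomes a line in Results.txt
  }
}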
Unbalanced Cluster
(Diagram) Racks 1 and 2 hold all the existing Data Nodes and every block of
File.txt; the two newly added racks (marked NEW) hold Data Nodes with no data
on them yet.
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
*New Data Node: "I'm bored!"
**Task Tracker: "I was assigned a Map Task but don't have the block. Guess I
need to get it."
Cluster Balancing
(Diagram) The same cluster as above, including the two NEW racks.
• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default bandwidth limit is 1 MB/s (dfs.balance.bandwidthPerSec)
brad@cloudera-1:~$ hadoop balancer
Quiz
• If you had written a file of size 1 TB into
HDFS with a replication factor of 2, what is
the actual space required by HDFS to
store this file?
• True/False? Even if the Name Node goes
down, I will still be able to read files from
HDFS.
Quiz
• True/False? In a Hadoop cluster, we can
have a secondary Job Tracker to enhance
fault tolerance.
• True/False? If the Job Tracker goes down, you
will not be able to write any file into
HDFS.
Quiz
• True/False? The Name Node stores the actual
data itself.
• True/False? The Name Node can be rebuilt using
the Secondary Name Node.
• True/False? If a Data Node goes down,
Hadoop takes care of re-replicating the
affected data blocks.
Quiz
• In which scenario does one Data Node try to
read data from another Data Node?
• What are the benefits of the Name Node's rack
awareness?
• True/False? HDFS is well suited for
applications which write a huge number of
small files.
Quiz
• True/False? Hadoop takes care of
balancing the cluster automatically.
• True/False? The output of Map tasks is
written to an HDFS file.
• True/False? The output of Reduce tasks is
written to an HDFS file.
Quiz
• True/False? In a production cluster,
commodity hardware can be used to
set up the Name Node.
• Thank You