APACHE HADOOP
Miraj Godha
April 22, 2014
MIRAJ GODHA
LEARN ANYTHING FROM ANYWHERE.
We are one of the fastest-growing online destinations for
instructor-led live online courses.
Every one of our courses is written by experts in their respective
fields and crafted to help you grow and advance your career.
We do our best to connect the material to real-life examples and
real business practices.
Learn and apply it to your work.
We bring you the most cutting-edge and industry-relevant
courses.
COURSE DETAILS
The Motivation for Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Common MapReduce Algorithms
PIG Concepts
Hive Concepts
Working with Sqoop
OOZIE Concepts
HUE Concepts
Data Visualization & Analytics
Final Project
APACHE HADOOP
THE MOTIVATION FOR
HADOOP
Miraj Godha
April 22, 2014
WHERE DOES DATA COME FROM?
MACHINE-GENERATED AND
HISTORICAL DATA
THREE V’S OF BIG DATA
Volume
Velocity
Variety
VOLUME: THE AMOUNT OF DATA
~3 ZB of data exist in the digital universe today.
>300 TB of data in the U.S. Library of Congress.
Facebook has 30+ PB.
~2.5 PB of data in a data warehouse (DWH).
10+ PB DWH size.
VELOCITY: HOW RAPIDLY DATA IS GROWING
48 hours of new video uploaded to YouTube every minute.
571 new websites created every minute.
500+ TB of new data into Facebook every day.
175 million tweets every day.
1+ million customer transactions every hour.
Data production will be 44 times greater in 2020 than it was in 2009.
VARIETY: THE MANY FORMS OF DATA
Structured
•Traditional databases
•Numeric data
Semi-structured
•JSON
•XML
Unstructured
•Text documents
•Email
•Video
•Audio
•Machine-generated data
HOW COMPANIES ARE MINTING
MONEY FROM BIG DATA
Predict exactly what customers want before they ask for it
Marketing Campaign
Improve customer service
Fraud Detection
Get customers excited about their own data
Identify customer pain points and solve them
Reduce health care costs and improve treatment
Social Graph Analysis & Sentiment Analysis
Research and development
HOW SOME BIG COMPANIES USE
DATA FOR DIFFERENT KINDS OF
BUSINESS ANALYSIS
BIG DATA MARKET FORECAST
CAREER OPTIONS
HADOOP & HIVE HISTORY
Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
July 2005 – Nutch uses MapReduce
Feb 2006 – Hadoop becomes a Lucene subproject
Apr 2007 – Yahoo! runs Hadoop on a 1,000-node cluster
Jan 2008 – Hadoop becomes an Apache top-level project
Jul 2008 – A 4,000-node test cluster
Sept 2008 – Hive becomes a Hadoop subproject
PROBLEMS WITH
CURRENT SYSTEMS
1 machine:
• Read 1 TB of data
• 4 I/O channels
• ~100 MB/s per channel
→ ~45 minutes
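The ~45 minutes is simple transfer-time arithmetic (assuming the four channels stream in parallel):

1 TB ÷ (4 × 100 MB/s) = 10^12 B ÷ (4 × 10^8 B/s) = 2,500 s ≈ 42 min ≈ 45 min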
APACHE HADOOP WINS THE TERABYTE SORT BENCHMARK
(JULY 2008)
Yahoo! sorted 1 TB of data in 209 seconds,
beating the previous record of 297 seconds.
The sort used 1,800 map tasks and 1,800 reduce tasks.
Cluster configuration used for the benchmark sort:
 910 nodes
 2 quad-core Xeons @ 2.0 GHz per node
 8 GB RAM per node
WHY HADOOP?
1 machine: read 1 TB over 4 I/O channels at ~100 MB/s each → ~45 minutes
10 machines: the same 1 TB spread across 10 such machines → ~4.5 minutes
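With the data spread evenly, each machine reads only a tenth of it, in parallel with the others, so the time divides by the number of machines: 2,500 s ÷ 10 = 250 s ≈ 4.5 min. This is the scaling argument Hadoop is built on: add machines and move the computation to the data, and read throughput grows linearly.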
DISTRIBUTED FILE SYSTEM
(DFS)
(Diagram: a single namespace, \\MirajGodha.global.in, exposes logical paths that map to physical locations.)
\\MirajGodha\project  → \\MirajGodha.global.in\home\project
\\MirajGodha\images   → \\MirajGodha.global.in\home\images
\\MirajGodha\software → \\MirajGodha.global.in\home\software
\\MirajGodha\websites → \\MirajGodha.global.in\home\websites
WHO USES HADOOP?
(Company logos in the original slide.)
Yahoo!: 42,000 nodes as of July 2011
4,100 nodes
1,400 nodes
WHAT IS HADOOP?
Hadoop is a framework for the distributed processing of large datasets
across large clusters of commodity computers using a simple
programming model.
 Large datasets  terabytes or petabytes of data
 Large clusters  hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce.
Hadoop is based on a simple programming model called MapReduce.
Hadoop is based on a simple data model: any data will fit.
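The programming model is easiest to see in code. Below is a minimal word-count job in Hadoop's Java MapReduce API, the standard introductory example (a sketch; class and path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase emits (word, 1) pairs, Hadoop groups them by key, and the reduce phase sums the counts; the same reducer class doubles as a combiner.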
WHAT MAKES IT ESPECIALLY
USEFUL
 Scalable: it can reliably store and process petabytes of data.
 Economical: it distributes the data and processing across clusters of commonly
available computers (thousands of nodes).
 Efficient: by distributing the data, it can process it in parallel on the nodes where
the data is located.
 Reliable: it automatically maintains multiple copies of the data and automatically
redeploys computing tasks after failures.
HADOOP: ASSUMPTIONS
Hardware will fail.
Applications need a write-once, read-many access model.
Data transfer (I/O) is the bottleneck.
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Move the logic to the data rather than the data to the logic.
HDFS ARCHITECTURE
(Diagram: the client talks to the NameNode for metadata and to the DataNodes for data; a Secondary NameNode sits alongside the NameNode.)
NameNode: contains information about the data (the metadata)
DataNode: contains the physical data
Secondary NameNode: periodically reads the metadata from the NameNode and merges the edit log into the fsimage (it is not a hot standby)
DISTRIBUTED FILE SYSTEM
Single Namespace for entire cluster
Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
Files are broken up into blocks
– Typically 64 MB block size
– Each block replicated on multiple DataNodes
Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
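The "intelligent client" behavior is visible in HDFS's public Java API: the client can ask the NameNode where each block lives and then talk to those DataNodes directly. A minimal sketch (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/bigfile.dat");  // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d at offset %d on hosts %s%n",
          i, blocks[i].getOffset(), String.join(",", blocks[i].getHosts()));
    }
    fs.close();
  }
}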
COMPLEX QUERIES: HADOOP
COMPARED WITH
TRADITIONAL DATABASES
WHICH HADOOP
DISTRIBUTION?
Apache/Open Source
•Hortonworks
 Pros: 100% open-source version; integration/services focused; extensive partner network.
 Cons: slower interactive queries.
•Cloudera
 Pros: widely used distribution; faster interactive queries; extensive tooling.
 Cons: proprietary extensions such as Impala; commercial version only.
•MapR
 Pros: enterprise- and production-ready focus; works with NFS and native Unix commands.
 Cons: less focused on new Hadoop features such as YARN.
Proprietary
•PivotalHD
 Pros: faster interactive query support with Greenplum; integrates with the Cloud Foundry PaaS platform.
 Cons: proprietary extensions; not easy to decouple.
•IBM
 Pros: offers open source without a branched version; integrated with PaaS and IBM tools.
 Cons: limited releases; expensive; may not be easy to decouple.
(Diagram: a file F is divided into 64 MB blocks, numbered 1–5; each block is stored as three replicas spread across the twelve disks of Rack 1, Rack 2, and Rack 3.)
BLOCK PLACEMENT
Current Strategy
-- One replica on the local node
-- Second replica on a remote rack
-- Third replica on the same remote rack as the second
-- Additional replicas are randomly placed
Clients read from nearest replica
MAIN PROPERTIES OF HDFS
Large: an HDFS instance may consist of thousands of server machines, each
storing part of the file system’s data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS
 DataNodes send heartbeats to the NameNode
NAMENODE METADATA
Metadata is kept in memory.
Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
A transaction log
– Records file creations, file deletions, etc.
HA CLUSTER
Two separate machines are configured as NameNodes.
At any point in time, exactly one of the NameNodes is in an Active state, and the other is in
a Standby state.
The Active NameNode is responsible for all client operations in the cluster
In order for the Standby node to keep its state synchronized with the Active node, both nodes
communicate with a group of separate daemons called "JournalNodes" (JNs).
When any namespace modification is performed by the Active node, it durably logs a record of the
modification to a majority of these JNs.
The Standby node is capable of reading the edits from the JNs, and is constantly watching them for
changes to the edit log.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes
before promoting itself to the Active state.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date
information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are
configured with the location of both NameNodes, and send block location information and heartbeats
to both.
During a failover, the NameNode which is to become active will simply take over the role of writing to
the JournalNodes.
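The "majority of these JNs" rule is a quorum write. Below is a small illustrative Java model of that rule, not Hadoop's actual QuorumJournalManager code (RemoteJournal is a made-up stand-in for the JournalNode RPC interface): an edit counts as durable once more than half of the JournalNodes acknowledge it, so a minority of slow or failed JNs can neither block nor lose edits.

import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QuorumEditLog {
  interface RemoteJournal { boolean append(long txId, byte[] edit); } // hypothetical JN stub

  private final List<RemoteJournal> journals;
  private final ExecutorService pool = Executors.newCachedThreadPool();

  public QuorumEditLog(List<RemoteJournal> journals) { this.journals = journals; }

  /** Durably log one edit: returns true once a majority of JNs have acked. */
  public boolean logEdit(long txId, byte[] edit) throws InterruptedException {
    int needed = journals.size() / 2 + 1;              // majority quorum
    CompletionService<Boolean> cs = new ExecutorCompletionService<>(pool);
    for (RemoteJournal jn : journals) {
      cs.submit(() -> jn.append(txId, edit));          // write to every JN in parallel
    }
    int acks = 0, replies = 0;
    while (replies < journals.size() && acks < needed) {
      try {
        if (cs.take().get()) acks++;                   // count successful acks
      } catch (ExecutionException e) {
        // a JN failed or timed out; the quorum tolerates a minority of failures
      }
      replies++;
    }
    return acks >= needed;
  }
}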
DATANODE
A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data to Clients
Block Report
– Periodically sends a report of all existing blocks to the
NameNode
Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
HADOOP MASTER/SLAVE
ARCHITECTURE
Hadoop is designed as a master-slave, shared-nothing architecture:
a single master node and many slave nodes.
JOB SUBMISSION
(Diagram) The user submits the job to the job submitter. The job submitter gets a new job ID from the JobTracker, computes the input splits, copies the job resources (JAR file, configuration file, computed input splits) into a jobID directory in the DFS, and then submits the job to the JobTracker.
JOB TRACKER
(Diagram) The job submitter puts the job files (job.xml, job.jar) into the DFS and submits the job to the JobTracker. The JobTracker reads the job files and split information from the DFS, creates the map and reduce tasks (number of map tasks = number of input splits), and places them in its internal job queue.
TASK ASSIGNMENT
(Diagram) The JobTracker picks a job from the job queue and initializes it. Each TaskTracker sends a periodic heartbeat to the JobTracker, and the JobTracker assigns it tasks from the job in the heartbeat response.
TASK EXECUTION
(Diagram) The JobTracker assigns a task to the TaskTracker via the heartbeat. The TaskTracker reads job.xml and job.jar from the DFS onto its local disk and launches the task in a separate JVM; it can run several JVMs in parallel.
JOBTRACKER
The master node runs a JobTracker instance, which accepts job requests
from clients.
There is only one JobTracker daemon running per Hadoop cluster.
The JobTracker
– determines the execution plan by determining which files to process,
– assigns nodes to the different tasks,
– monitors all tasks as they are running.
TASKTRACKER
Manages the execution of individual tasks on a data node.
There is one TaskTracker per data node.
Each TaskTracker can spawn multiple JVMs to handle many map or
reduce tasks in parallel.
TaskTrackers constantly communicate with the JobTracker.
If the JobTracker fails to receive a heartbeat from a TaskTracker within a
specified amount of time, it assumes the TaskTracker has crashed and
resubmits that TaskTracker’s tasks to another TaskTracker.
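A minimal sketch of that timeout rule in Java (illustrative only; the 10-minute value is an assumption standing in for Hadoop's configurable expiry interval):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed expiry interval
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  /** Called whenever a TaskTracker heartbeat arrives. */
  public void onHeartbeat(String trackerId) {
    lastHeartbeat.put(trackerId, System.currentTimeMillis());
  }

  /** Periodic sweep: any tracker silent for too long is presumed dead. */
  public void sweep() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
      if (now - e.getValue() > TIMEOUT_MS) {
        lastHeartbeat.remove(e.getKey());
        resubmitTasksOf(e.getKey()); // hand the tracker's tasks to another node
      }
    }
  }

  private void resubmitTasksOf(String trackerId) {
    System.out.println("TaskTracker " + trackerId
        + " presumed crashed; resubmitting its tasks");
  }
}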
HEARTBEATS
DataNodes send heartbeats to the NameNode.
The NameNode uses heartbeats to detect DataNode failure.
REPLICATION ENGINE
NameNode detects DataNode failures
 Chooses new DataNodes for new replicas
 Balances disk usage
 Balances communication traffic to DataNodes
DATA PIPELINE & WRITE ANATOMY
(Diagram) The HDFS client asks the NameNode to add a block, then writes the block to the first DataNode; each DataNode forwards the data to the next in the pipeline. Acknowledgements flow back along the pipeline to the client, which reports "complete" to the NameNode.
DATA PIPELINING
Client retrieves a list of DataNodes on which to place replicas of
a block
Client writes block to the first DataNode
The first DataNode forwards the data to the next DataNode in
the Pipeline
When all replicas are written, the client moves on to write the
next block of the file.
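From the client's point of view, all of this pipelining is hidden behind an ordinary output stream. A minimal write sketch against the HDFS Java API (the path is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() obtains blocks from the NameNode; the stream writes
    // through the DataNode pipeline behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
      out.writeBytes("hello, HDFS\n");
    }
    fs.close();
  }
}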
READ ANATOMY
(Diagram) The HDFS client asks the NameNode for a file's block locations ("get block"), then reads the data directly from the DataNodes that hold the replicas.
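The corresponding client-side read, again a minimal sketch against the public API (hypothetical path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() fetches block locations from the NameNode;
    // the actual reads go straight to the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // stream contents to stdout
    }
    fs.close();
  }
}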
DATA CORRECTNESS
Checksums are used to validate the data
– CRC32 is used
File creation
– The client computes a checksum per 512 bytes
– The DataNode stores the checksums
File access
– The client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
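The scheme is easy to reproduce with the JDK's built-in CRC32 (a self-contained illustration of per-512-byte checksumming, not HDFS's actual code):

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksums {
  /** Compute one CRC32 per chunk, as HDFS does per 512-byte unit. */
  public static long[] crcPerChunk(byte[] data, int chunkSize) {
    int chunks = (data.length + chunkSize - 1) / chunkSize;
    long[] crcs = new long[chunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < chunks; i++) {
      crc.reset();
      int off = i * chunkSize;
      crc.update(data, off, Math.min(chunkSize, data.length - off));
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1300]; // ~2.5 chunks of 512 bytes
    long[] stored = crcPerChunk(data, 512);     // computed at write time
    long[] recomputed = crcPerChunk(data, 512); // recomputed at read time
    // A mismatch means the replica is corrupt: the client tries another replica.
    System.out.println(Arrays.equals(stored, recomputed)
        ? "checksums match" : "corruption detected: try another replica");
  }
}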