SlideShare a Scribd company logo
Fault Tolerance in
Apache Hadoop
Ravindra Bandara
CSS 534 - Spring 2014
Outline
• Overview of Apache Hadoop
• Fault Tolerant features in
• HDFS layer
• MapReduce layer
• Summary
• Questions
Apache Hadoop
• Apache Hadoop's MapReduce and HDFS
components originally derived from
• Google File System (GFS)
1
– 2003
• Google's MapReduce2
- 2004
• Data is broken in splits that are processed in
different machines.
• Industry wide standard for processing Big Data.
1 - http://research.google.com/archive/gfs.html
2 - http://research.google.com/archive/mapreduce.html
Fault Tolerance in Hadoop
• One of the primary reasons to use Hadoop is due to its high
degree of fault tolerance.
• Individual nodes or network components may experience high
rates of failure.
• Hadoop can guide jobs toward a successful completion.
Overview of Hadoop
• Basic components of Hadoop are:
• Map Reduce Layer
• Job tracker (master) -which coordinates the execution of jobs;
• Task trackers (slaves)- which control the execution of map and reduce tasks in the
machines that do the processing;
• HDFS Layer- which stores files.
• Name Node (master)- manages the file system, keeps metadata for all the files and
directories in the tree
• Data Nodes (slaves)- work horses of the file system. Store and retrieve blocks when they
are told to ( by clients or name node ) and report back to name node periodically
Overview of Hadoop contd.
Job Tracker - coordinates the
execution of jobs
Task Tracker – control the
execution of map and reduce
tasks in slave machines
Data Node – Follow the
instructions from name node
- stores, retrieves data
Name Node – Manages the
file system, keeps metadata
Hadoop Versions
• MapReduce 2 runtime and HDFS HA was introduced in Hadoop
2.x
Fault Tolerance in HDFS layer
Fault Tolerance in HDFS layer
• Hardware failure is the norm rather than the exception
• Detection of faults and quick, automatic recovery from them is a
core architectural goal of HDFS.
• Master Slave Architecture with NameNode (master) and DataNode
(slave)
• Common types of failures
• NameNode failures
• DataNode failures
Handling Data Node Failure
• Each DataNode sends a Heartbeat message to the NameNode
periodically
• If the namenode does not receive a heartbeat from a particular data
node for 10 minutes, then it considers that data node to be dead/out
of service.
• Name Node initiates replication of blocks which were hosted on that
data node to be hosted on some other data node.
Handling Name Node Failure
• Single Name Node per cluster.
• Prior to Hadoop 2.0.0, the NameNode was a single point of failure
(SPOF) in an HDFS cluster.
• If NameNode becomes unavailable, the cluster as a whole would be
unavailable
• NameNode has to be restarted
• Brought up on a separate machine.
HDFS High Availability
• Provides an option of
running two redundant
NameNodes in the same
cluster
• Active/Passive configuration
with a hot standby.
• Fast failover to a new
NameNode in the case that
a machine crashes
• Graceful administrator-
initiated failover for the
purpose of planned
maintenance.
Fault Tolerance in MapReduce
layer
Classic MapReduce (v1)
• Job Tracker
• Manage Cluster Resources and Job
Scheduling
• Task Tracker
• Per-node agent
• Manage Tasks
• Jobs can fail
• While running the task ( Task Failure )
• Task Tracker failure
• Job Tracker failure
Handling Task Failure
• User code bug in map/reduce
• Throws a RunTimeException
• Child JVM reports a failure back to the parent task tracker before it exits.
• Sudden exit of the child JVM
• Bug that causes the JVM to exit for the conditions exposed by map/reduce
code.
• Task tracker marks the task attempt as failed, makes room available
to another task.
Task Tracker Failure
• Task tracker will stop sending the heartbeat to the Job Tracker
• Job Tracker notices this failure
• Hasn’t received a heart beat from 10 mins
• Can be configured via mapred.tasktracker.expiry.interval
property
• Job Tracker removes this task from the task pool
• Rerun the Job even if map task has ran completely
• Intermediate output resides in the failed task trackers local file system which
is not accessible by the reduce tasks.
Job Tracker Failure
• This is serious than the other two modes of failure.
• Single point of failure.
• In this case all jobs will fail.
• After restarting Job Tracker all the jobs running at the time of the
failure needs to be resubmitted.
• Good News – YARN has eliminated this.
• One of YARN’s design goals is to eliminate single point of failure in Map
Reduce.
YARN - Yet Another Resource Negotiator
• Next version of MapReduce or MapReduce 2.0 (MRv2)
• In 2010 group at Yahoo! Began to design the next generation of MR
YARN architecture
• Resource Manager
• Central Agent – Manages and
allocates cluster resources
• Node Manager
• Per-node agent – Manages and
enforces node resource allocations
• Application Master
• Per Application
• Manages application life cycle and
task scheduling
YARN – Resource Manager Failure
• After a crash a new Resource Manager instance needs to brought up (
by an administrator )
• It recovers from saved state
• State consists of
• node managers in the systems
• running applications
• State to manage is much more manageable than that of Job Tracker.
• Tasks are not part of Resource Managers state.
• They are handled by the application master.
Summary
• Overview of Hadoop MapReduce
• Discussed Fault Tolerant features of
• HDFS
• MapReduce v1
• MapReducev2 - YARN (briefly)
References
• Google MapReduce paper
• MapReduce: Simplified Data Processing on Large Clusters(Jeffrey Dean and Sanjay
Ghemawat Google, Inc. 2004)
• Apache Hadoop Web site (http://hadoop.apache.org/)
• Yahoo Developer Network Tutorial on MapReduce
(https://developer.yahoo.com/hadoop/tutorial/module4.html)
• Hadoop The Definitive Guide( Tom White - O’REILLY )
• Cloudera blog (http://blog.cloudera.com/blog/2011/02/hadoop-availability/)
• HortonWorks YARN Tutorial(http://hortonworks.com/hadoop/yarn/)
Questions?
Backup Slides
More Fault Tolerant features provided in
HDFS
• Rack awareness
• Take a node's physical location into account while scheduling tasks and
allocating storage.
• NameNode tries to place replicas of block on multiple racks for improved fault
tolerance
• Upgrade and rollback
• After a software upgrade, it is possible to rollback to HDFS' state before the
upgrade in case of unexpected problems.
• Secondary NameNode
• Performs periodic checkpoints of the namespace and helps keep the size of
file containing log of HDFS modifications within certain limits at the
NameNode.

More Related Content

What's hot

An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Dr. C.V. Suresh Babu
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
Vigen Sahakyan
 
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...Majid Hajibaba
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
Arvind Kumar
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computing
Niranjana Ambadi
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
Rajesh Ananda Kumar
 
Database , 12 Reliability
Database , 12 ReliabilityDatabase , 12 Reliability
Database , 12 ReliabilityAli Usman
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
Edureka!
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
Gyanmanjari Institute Of Technology
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
Newvewm
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
Yuval Carmel
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
Vigen Sahakyan
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
Anil Gupta
 

What's hot (20)

An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...
Cloud Computing Principles and Paradigms: 6 on the management of virtual mach...
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computing
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
 
Database , 12 Reliability
Database , 12 ReliabilityDatabase , 12 Reliability
Database , 12 Reliability
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 

Viewers also liked

Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
DerrekYoungDotCom
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
Christian Ariza Porras
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
Rohit Agrawal
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
Athanasios Anastasiou
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
Aneesh Pulickal Karunakaran
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
Hortonworks
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
Steve Loughran
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
fvanvollenhoven
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
gethue
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Cloudera, Inc.
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Jonathan Seidman
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
Great Wide Open
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
GetInData
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Rohit Kulkarni
 

Viewers also liked (20)

Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Hadoop admin
Hadoop adminHadoop admin
Hadoop admin
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 

Similar to Hadoop fault-tolerance

Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
Nandan Kumar
 
Hadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data AnalyticsHadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data Analytics
RUHULAMINHAZARIKA
 
Hadoop
HadoopHadoop
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
Amjith Singh
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its Components
Namuk Park
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
Uwe Printz
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
Uwe Printz
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
Uwe Printz
 
Bigdata and Hadoop with Docker
Bigdata and Hadoop with DockerBigdata and Hadoop with Docker
Bigdata and Hadoop with Docker
haridasnss
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
NIKHILGR3
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 

Similar to Hadoop fault-tolerance (20)

Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Hadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data AnalyticsHadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data Analytics
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its Components
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Bigdata and Hadoop with Docker
Bigdata and Hadoop with DockerBigdata and Hadoop with Docker
Bigdata and Hadoop with Docker
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 

Hadoop fault-tolerance

  • 1. Fault Tolerance in Apache Hadoop Ravindra Bandara CSS 534 - Spring 2014
  • 2. Outline • Overview of Apache Hadoop • Fault Tolerant features in • HDFS layer • MapReduce layer • Summary • Questions
  • 3. Apache Hadoop • Apache Hadoop's MapReduce and HDFS components originally derived from • Google File System (GFS) 1 – 2003 • Google's MapReduce2 - 2004 • Data is broken in splits that are processed in different machines. • Industry wide standard for processing Big Data. 1 - http://research.google.com/archive/gfs.html 2 - http://research.google.com/archive/mapreduce.html
  • 4. Fault Tolerance in Hadoop • One of the primary reasons to use Hadoop is due to its high degree of fault tolerance. • Individual nodes or network components may experience high rates of failure. • Hadoop can guide jobs toward a successful completion.
  • 5. Overview of Hadoop • Basic components of Hadoop are: • Map Reduce Layer • Job tracker (master) -which coordinates the execution of jobs; • Task trackers (slaves)- which control the execution of map and reduce tasks in the machines that do the processing; • HDFS Layer- which stores files. • Name Node (master)- manages the file system, keeps metadata for all the files and directories in the tree • Data Nodes (slaves)- work horses of the file system. Store and retrieve blocks when they are told to ( by clients or name node ) and report back to name node periodically
  • 6. Overview of Hadoop contd. Job Tracker - coordinates the execution of jobs Task Tracker – control the execution of map and reduce tasks in slave machines Data Node – Follow the instructions from name node - stores, retrieves data Name Node – Manages the file system, keeps metadata
  • 7. Hadoop Versions • MapReduce 2 runtime and HDFS HA was introduced in Hadoop 2.x
  • 8. Fault Tolerance in HDFS layer
  • 9. Fault Tolerance in HDFS layer • Hardware failure is the norm rather than the exception • Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. • Master Slave Architecture with NameNode (master) and DataNode (slave) • Common types of failures • NameNode failures • DataNode failures
  • 10. Handling Data Node Failure • Each DataNode sends a Heartbeat message to the NameNode periodically • If the namenode does not receive a heartbeat from a particular data node for 10 minutes, then it considers that data node to be dead/out of service. • Name Node initiates replication of blocks which were hosted on that data node to be hosted on some other data node.
  • 11. Handling Name Node Failure • Single Name Node per cluster. • Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. • If NameNode becomes unavailable, the cluster as a whole would be unavailable • NameNode has to be restarted • Brought up on a separate machine.
  • 12. HDFS High Availability • Provides an option of running two redundant NameNodes in the same cluster • Active/Passive configuration with a hot standby. • Fast failover to a new NameNode in the case that a machine crashes • Graceful administrator- initiated failover for the purpose of planned maintenance.
  • 13. Fault Tolerance in MapReduce layer
  • 14. Classic MapReduce (v1) • Job Tracker • Manage Cluster Resources and Job Scheduling • Task Tracker • Per-node agent • Manage Tasks • Jobs can fail • While running the task ( Task Failure ) • Task Tracker failure • Job Tracker failure
  • 15. Handling Task Failure • User code bug in map/reduce • Throws a RunTimeException • Child JVM reports a failure back to the parent task tracker before it exits. • Sudden exit of the child JVM • Bug that causes the JVM to exit for the conditions exposed by map/reduce code. • Task tracker marks the task attempt as failed, makes room available to another task.
  • 16. Task Tracker Failure • Task tracker will stop sending the heartbeat to the Job Tracker • Job Tracker notices this failure • Hasn’t received a heart beat from 10 mins • Can be configured via mapred.tasktracker.expiry.interval property • Job Tracker removes this task from the task pool • Rerun the Job even if map task has ran completely • Intermediate output resides in the failed task trackers local file system which is not accessible by the reduce tasks.
  • 17. Job Tracker Failure • This is serious than the other two modes of failure. • Single point of failure. • In this case all jobs will fail. • After restarting Job Tracker all the jobs running at the time of the failure needs to be resubmitted. • Good News – YARN has eliminated this. • One of YARN’s design goals is to eliminate single point of failure in Map Reduce.
  • 18. YARN - Yet Another Resource Negotiator • Next version of MapReduce or MapReduce 2.0 (MRv2) • In 2010 group at Yahoo! Began to design the next generation of MR
  • 19. YARN architecture • Resource Manager • Central Agent – Manages and allocates cluster resources • Node Manager • Per-node agent – Manages and enforces node resource allocations • Application Master • Per Application • Manages application life cycle and task scheduling
  • 20. YARN – Resource Manager Failure • After a crash a new Resource Manager instance needs to brought up ( by an administrator ) • It recovers from saved state • State consists of • node managers in the systems • running applications • State to manage is much more manageable than that of Job Tracker. • Tasks are not part of Resource Managers state. • They are handled by the application master.
  • 21. Summary • Overview of Hadoop MapReduce • Discussed Fault Tolerant features of • HDFS • MapReduce v1 • MapReducev2 - YARN (briefly)
  • 22. References • Google MapReduce paper • MapReduce: Simplified Data Processing on Large Clusters(Jeffrey Dean and Sanjay Ghemawat Google, Inc. 2004) • Apache Hadoop Web site (http://hadoop.apache.org/) • Yahoo Developer Network Tutorial on MapReduce (https://developer.yahoo.com/hadoop/tutorial/module4.html) • Hadoop The Definitive Guide( Tom White - O’REILLY ) • Cloudera blog (http://blog.cloudera.com/blog/2011/02/hadoop-availability/) • HortonWorks YARN Tutorial(http://hortonworks.com/hadoop/yarn/)
  • 25. More Fault Tolerant features provided in HDFS • Rack awareness • Take a node's physical location into account while scheduling tasks and allocating storage. • NameNode tries to place replicas of block on multiple racks for improved fault tolerance • Upgrade and rollback • After a software upgrade, it is possible to rollback to HDFS' state before the upgrade in case of unexpected problems. • Secondary NameNode • Performs periodic checkpoints of the namespace and helps keep the size of file containing log of HDFS modifications within certain limits at the NameNode.

Editor's Notes

  1. Talk about the articles I read. Apache Hadoop Web site HDFS architecture guide MapReduce original paper from Google
  2. Task tracker crashing or running very slowly Stop or send the heart beat very infrequently Job tracker removes this task from the pool of task trackers
  3. Author mentions this possibility is quite low, because of a particular machine failing is low. Can’t agree with that though.
  4. Scalability bottlenecks in the case of very large number of nodes ( > 4000 ) Splitting the responsibilities of Job Tracker in to separate entities.
  5. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM).
  6. Tasks are not part of resource manager’s state. Since they are managed by the application master. The amount of state to be managed is much more manageable than that of the Job Tracker.
  7. When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive etc.) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in. Availability is the proportion of time a system is functioning [1], which is commonly referred to as “uptime” (vs downtime, when the system is not functioning). Note that availability is a stricter requirement than fault tolerance – the ability for a system to perform as designed and degrade gracefully in the presence of failures. A system that requires an hour to restart (eg for a configuration change or software upgrade) but has no single point of failure is fault tolerant but not highly available (HA). Adding redundancy in all SPOFs is a common way to improve fault tolerance, which helps [2], but is just a part of, improving Hadoop availability. Note also that fault tolerance is distinct from durability, even though the NameNode is a SPOF no single failure results in data loss as copies of NameNode persistent state (the image and edit log) are replicated both within and across hosts. Availability is also often conflated with reliability. Reliability in distributed systems is a more general issue than availability [3]. A truly reliable distributed system must be highly available, fault tolerant, secure, scalable, and perform predictably, etc. I’ll limit this post to Hadoop availability.