HDFS, Map Reduce & Hadoop 1.0 Vs 2.0 Overview
HDFS Architecture
 HDFS stands for Hadoop Distributed File System.
 HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
 HDFS is now an Apache Hadoop subproject.
 A typical file in HDFS is gigabytes to terabytes in size.
 HDFS applications need a write-once-read-many access model for files. This assumption
simplifies data coherency issues and enables high-throughput data access.
 HDFS has a master/slave architecture: NameNode/DataNode.
 An HDFS cluster consists of a single NameNode and a number of DataNodes.
 The NameNode and DataNode are pieces of software designed to run on commodity machines.
These machines typically run a GNU/Linux operating system (OS).
 DataNodes, usually one per node in the cluster, manage the storage attached to the nodes that
they run on. They are responsible for serving read and write requests from the file system’s
clients. They also perform block creation, deletion, and replication upon instruction from the
NameNode.
 HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file
is split into one or more blocks, and these blocks are stored in a set of DataNodes.
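To make the block idea concrete, here is a minimal Python sketch of how a file might be cut into blocks and how replicas might be spread over DataNodes. The 128 MB block size, the replication factor of 3, the function names, and the round-robin placement are all assumptions made for illustration; real HDFS reads these values from configuration and uses its own placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in HDFS)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes is split into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the rest
    return sizes


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement (hypothetical, not the real HDFS policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        # Each block goes to `replication` distinct DataNodes.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    print(sizes)
    print(place_blocks(len(sizes), ["dn1", "dn2", "dn3", "dn4"]))
```

In this sketch the NameNode's role corresponds to remembering the `placement` mapping, while the DataNodes hold the block contents.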
NameNode
The NameNode holds the metadata for HDFS, such as namespace information and block information. While
the NameNode is running, all of this information is held in main memory. It is also stored on disk for
persistence.
The image above shows how the NameNode stores information on disk.
Two different files are used:
 fsimage: the snapshot of the file system taken when the NameNode started
 Edit logs: the sequence of changes made to the file system after the NameNode started
Only when the NameNode restarts are the edit logs applied to fsimage to get the latest snapshot of the file
system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very
large on clusters where the NameNode runs for a long period of time. In this situation we will
encounter the following issues:
 The edit log becomes very large, which makes it challenging to manage
 A NameNode restart takes a long time because a lot of changes have to be merged
 In the case of a crash, we will lose a huge amount of metadata, since fsimage is very old
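The restart-time merge described above can be sketched in a few lines of Python. This is a toy model, not NameNode code: the namespace is reduced to a set of paths, and the `create`/`delete` operations and function names are assumptions chosen for illustration.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edit_log):
    """Rebuild the latest namespace: start from the snapshot, replay every edit."""
    namespace = set(fsimage)      # state as of the last snapshot
    for edit in edit_log:         # work grows with the length of the log
        apply_edit(namespace, edit)
    return namespace


if __name__ == "__main__":
    fsimage = {"/data/a", "/data/b"}
    edit_log = [("create", "/data/c"), ("delete", "/data/a")]
    print(sorted(load_namespace(fsimage, edit_log)))
```

Because every logged edit is replayed in order, the time to rebuild the namespace grows with the length of the edit log, which is why a long-running NameNode with an old fsimage restarts slowly.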
To overcome these issues, we need a mechanism that keeps the edit log at a manageable size and
keeps fsimage up to date, so that the load on the NameNode is reduced. It is very similar to a
Windows restore point, which lets us take a snapshot of the OS so that if something goes wrong,
we can fall back to the last restore point.
So now we understand the NameNode’s functionality and the challenge of keeping the metadata up to date. So what
does all this have to do with the Secondary NameNode?
Secondary NameNode
The Secondary NameNode helps to overcome the above issues by taking over from the NameNode the
responsibility of merging the edit logs with fsimage.
The figure above shows how the Secondary NameNode works:
 It gets the edit logs from the NameNode at regular intervals and applies them to fsimage
 Once it has a new fsimage, it copies it back to the NameNode
 The NameNode will use this fsimage for the next restart, which will reduce the start-up time
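The three steps above can be sketched as one checkpoint cycle. The sketch below is a simplified model under stated assumptions: the `NameNode` class, its methods, and the `checkpoint` function are hypothetical names, the namespace is a plain set of paths, and network transfer between the two nodes is replaced by ordinary function calls.

```python
class NameNode:
    """Toy NameNode holding an on-disk snapshot and the edits made since then."""

    def __init__(self, fsimage):
        self.fsimage = set(fsimage)
        self.edit_log = []

    def record(self, op, path):
        """Log a namespace change ("create" or "delete")."""
        self.edit_log.append((op, path))

    def install_checkpoint(self, new_fsimage, edits_merged):
        """Accept a merged fsimage and drop the edits it already contains."""
        self.fsimage = set(new_fsimage)
        self.edit_log = self.edit_log[edits_merged:]


def checkpoint(namenode):
    """One cycle of the helper node: fetch edits, merge, copy the result back."""
    edits = list(namenode.edit_log)      # step 1: get the edit logs
    merged = set(namenode.fsimage)
    for op, path in edits:               # apply them to a copy of fsimage
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    namenode.install_checkpoint(merged, len(edits))  # step 2: copy it back
    return merged


if __name__ == "__main__":
    nn = NameNode({"/data/a"})
    nn.record("create", "/data/b")
    nn.record("delete", "/data/a")
    checkpoint(nn)
    print(sorted(nn.fsimage), nn.edit_log)
```

After the cycle the NameNode holds a fresh fsimage and a short edit log, so the next restart has little left to replay.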
The Secondary NameNode’s whole purpose is to create checkpoints in HDFS. It is just a helper node for the
NameNode. That is why it is also known as the checkpoint node within the community.
So all the Secondary NameNode does is put a checkpoint in the file system, which helps the
NameNode function better. It is not a replacement or backup for the NameNode. So from now on,
make a habit of calling it a checkpoint node.
MapReduce
MapReduce is a framework with which we can write applications that process huge amounts of data, in
parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). The Reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
Reduce task is always performed after the Map task.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
 Map stage: The mapper’s job is to process the input data. Generally the input data
is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input
file is passed to the mapper function line by line. The mapper processes the data and
creates several small chunks of data.
 Reduce stage: This stage is the combination of the shuffle stage and the reduce stage.
The reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
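The stages can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only imitates the data flow; it does not use the Hadoop Java API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a smaller result."""
    return (key, sum(values))


def run_job(lines):
    """Feed the input line by line to the mapper, then shuffle and reduce."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["the quick fox", "the lazy dog", "The fox"]))
```

On a real cluster the map calls run in parallel on many nodes and the shuffle moves data over the network; here all three stages are ordinary function calls.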
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes. Most of the computing takes place
on nodes that hold the data on their local disks, which reduces network traffic. After the given tasks complete,
the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop
server.
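The preference for running computation where the data lives can be sketched as a small scheduling function. This is a toy model and not Hadoop's scheduler: the function name, the input shapes, and the least-loaded tie-breaking rule are assumptions made for illustration.

```python
def assign_map_tasks(block_locations, nodes):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: dict mapping block id -> list of nodes holding a replica
    nodes: list of nodes that can run tasks
    Returns a dict mapping block id -> (chosen node, ran_on_local_data).
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in load]
        # Fall back to any node only when no replica holder can run the task.
        candidates = local or nodes
        chosen = min(candidates, key=lambda node: load[node])
        load[chosen] += 1
        assignment[block] = (chosen, bool(local))
    return assignment


if __name__ == "__main__":
    locations = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n9"]}
    print(assign_map_tasks(locations, ["n1", "n2", "n3"]))
```

Only the block whose replicas sit on a node outside the task-running set (`b2` here) has to be read over the network; the other tasks read from local disk.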
Timelines
Year  Month     Event
2003  October   Google File System paper released
2006  January   Hadoop is born from Nutch 197
2006  February  Hadoop is named after Cutting's son's yellow plush toy
2006  April     Hadoop 0.1.0 released
2006  April     Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2008  March     First Hadoop Summit
2008  April     Hadoop sets the world record as the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds
2008  May       Hadoop wins TeraByte Sort (world record, sortbenchmark.org)
2008  July      Hadoop wins Terabyte Sort Benchmark
2008  November  Google MapReduce implementation sorted one terabyte in 68 seconds
2009  May       Yahoo! used Hadoop to sort one terabyte in 62 seconds
2011  December  Apache Hadoop 1.0 available
Hadoop1 Vs Hadoop2
S No  Hadoop1 / Hadoop2
2     Hadoop1: MR does both processing and cluster resource management.
      Hadoop2: YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.
3     Hadoop1: Has limited scaling of nodes. Limited to 4000 nodes per cluster.
      Hadoop2: Has better scalability. Scalable up to 10000 nodes per cluster.
4     Hadoop1: Works on the concept of slots. A slot can run either a Map task or a Reduce task only.
      Hadoop2: Works on the concept of containers. Containers can run generic tasks.
5     Hadoop1: A single NameNode manages the entire namespace.
      Hadoop2: Multiple NameNode servers manage multiple namespaces.
6     Hadoop1: Has a Single Point of Failure (SPOF) because of the single NameNode; in the case of NameNode failure, manual intervention is needed to recover.
      Hadoop2: Has a feature to overcome the SPOF with a standby NameNode; in the case of NameNode failure, it is configured for automatic recovery.
7     Hadoop1: MR API is compatible with Hadoop1x. A program written in Hadoop1 executes in Hadoop1x without any additional files.
      Hadoop2: MR API requires additional files for a program written in Hadoop1x to execute in Hadoop2x.
9     Hadoop1: A NameNode failure affects the stack.
      Hadoop2: The Hadoop stack (Hive, Pig, HBase, etc.) is equipped to handle NameNode failure.
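The slots versus containers row can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model: it counts how many tasks can run at once, treats every container as identical, and ignores the memory and CPU sizing that real YARN containers have. The function names and the numbers are hypothetical.

```python
def running_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop1-style model: a slot is typed, so idle reduce slots cannot run maps."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_containers(map_tasks, reduce_tasks, containers):
    """Hadoop2-style model: a container is generic and can run either kind of task."""
    return min(map_tasks + reduce_tasks, containers)


if __name__ == "__main__":
    # 6 map tasks and 1 reduce task waiting; capacity for 8 concurrent tasks.
    print(running_with_slots(6, 1, map_slots=4, reduce_slots=4))
    print(running_with_containers(6, 1, containers=8))
```

With the same total capacity, the typed-slot model leaves reduce slots idle while map tasks wait, whereas the generic-container model can run all seven tasks at once.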
