HDFS, Map Reduce & Hadoop 1.0 Vs 2.0 Overview
HDFS Architecture
• HDFS stands for Hadoop Distributed File System.
• HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
• HDFS is now an Apache Hadoop subproject.
• A typical file in HDFS is gigabytes to terabytes in size.
• HDFS applications need a write-once-read-many access model for files. This assumption simplifies data coherency issues and enables high-throughput data access.
• HDFS has a master/slave architecture: NameNode/DataNode.
• An HDFS cluster consists of a single NameNode and a number of DataNodes.
• The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS).
• DataNodes, usually one per node in the cluster, manage the storage attached to the nodes that they run on. They are responsible for serving read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
• HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
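The block splitting described in the last bullet can be illustrated with a small sketch. It is not HDFS code: the 128 MB block size, the replication factor of 3, the round-robin placement, and the function names `split_into_blocks` and `place_blocks` are all assumptions made for illustration (the slides do not state a block size or a placement policy, and real HDFS placement also considers rack topology).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); not stated in the slides
REPLICATION = 3                 # assumed replication factor; not stated in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical stand-in for the NameNode's placement decision.
    """
    replication = min(replication, len(datanodes))
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
    for block_id, replicas in place_blocks(len(sizes), nodes).items():
        print(block_id, sizes[block_id], replicas)
```

Under these assumptions a 300 MB file becomes two full blocks plus one smaller final block, and each block is held by three different DataNodes.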
NameNode
The NameNode holds the metadata for HDFS, such as namespace information and block information. While in use, all this information is kept in main memory, but it is also stored on disk for persistence.
The above image shows how the NameNode stores information on disk.
Two different files are used:
• fsimage - the snapshot of the file system taken when the NameNode started
• Edit logs - the sequence of changes made to the file system after the NameNode started
Edit logs are applied to the fsimage only when the NameNode restarts, to produce the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the NameNode runs for a long time. In that situation we encounter the following issues:
• The edit log becomes very large and is challenging to manage.
• A NameNode restart takes a long time because a lot of changes have to be merged.
• In case of a crash, we lose a large amount of metadata because the fsimage is very old.
To overcome these issues, we need a mechanism that keeps the edit log at a manageable size and the fsimage up to date, so that the load on the NameNode is reduced. This is very similar to a Windows restore point, which lets us take a snapshot of the OS so that if something goes wrong, we can fall back to the last restore point.
Now we understand the NameNode's functionality and the challenge of keeping the metadata up to date. So what does all this have to do with the Secondary NameNode?
Secondary NameNode
The Secondary NameNode helps overcome the above issues by taking over from the NameNode the responsibility of merging the edit logs with the fsimage.
The above figure shows how the Secondary NameNode works:
• It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
• Once it has a new fsimage, it copies it back to the NameNode.
• The NameNode uses this fsimage for the next restart, which reduces the start-up time.
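The checkpoint cycle in the three steps above can be sketched in a few lines. This is a toy in-memory model, not the real on-disk formats or RPCs: the `NameNode` class, the `("create" | "delete", path, value)` edit tuples, and the `checkpoint` function are hypothetical names chosen for illustration.

```python
def apply_edits(fsimage, edits):
    """Replay edit-log entries on top of a snapshot and return a new snapshot."""
    namespace = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


class NameNode:
    """Toy model: a snapshot plus the changes recorded since that snapshot."""

    def __init__(self):
        self.fsimage = {}   # snapshot: path -> list of block ids
        self.edit_log = []  # changes made after the snapshot was taken

    def record(self, op, path, value=None):
        self.edit_log.append((op, path, value))

    def restart(self):
        """On restart, rebuild the namespace from fsimage plus edit log."""
        return apply_edits(self.fsimage, self.edit_log)


def checkpoint(namenode):
    """Role of the Secondary NameNode: fetch edits, merge, copy the result back."""
    edits = list(namenode.edit_log)                    # 1. get the edit logs
    new_image = apply_edits(namenode.fsimage, edits)   # 2. apply them to the fsimage
    namenode.fsimage = new_image                       # 3. copy the new fsimage back
    # Edits already folded into the fsimage are no longer needed.
    namenode.edit_log = namenode.edit_log[len(edits):]
    return new_image
```

After `checkpoint(nn)` the edit log is empty and `nn.restart()` has nothing left to replay, which is the start-up saving the last bullet refers to.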
The Secondary NameNode's whole purpose is to create checkpoints in HDFS. It is just a helper node for the NameNode, which is why it is also known in the community as the checkpoint node.
So all the Secondary NameNode does is put a checkpoint in the file system, which helps the NameNode function better. It is not a replacement or backup for the NameNode. So from now on, make a habit of calling it a checkpoint node.
MapReduce
MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing, based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
• Map stage: The mapper's job is to process the input data. Generally the input data is a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
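The map, shuffle, and reduce stages can be illustrated with a word count that runs in a single Python process. This is only a sketch of the programming model, not the Hadoop Java API; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (key, value) tuples, here (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a smaller result."""
    return key, sum(values)


def run_job(lines):
    """Feed the input to the mapper line by line, then shuffle and reduce."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster many mappers and reducers run in parallel on different nodes and the shuffle moves data over the network; here the three stages simply run one after another.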
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes that hold the data on local disks, which reduces network traffic. After the given tasks complete, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
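The idea of running tasks on the nodes that already hold the data can be sketched as a simple greedy assignment. This is not Hadoop's actual scheduler: the `assign_tasks` function, the one-task-per-node limit, and the node and block names are assumptions made for illustration.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign one task per input block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica of that block
    free_nodes: nodes with spare capacity (assumed to take one task each)
    Returns: block id -> (node, "local" or "remote")
    """
    available = list(free_nodes)
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in available]
        if local:
            node, kind = local[0], "local"       # data is already on this node's disk
        elif available:
            node, kind = available[0], "remote"  # block must be copied over the network
        else:
            break                                # no capacity left for further tasks
        available.remove(node)
        assignment[block] = (node, kind)
    return assignment


if __name__ == "__main__":
    locations = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n3"]}  # hypothetical layout
    for block, (node, kind) in assign_tasks(locations, ["n1", "n2", "n4"]).items():
        print(block, node, kind)
```

The greedy choice is deliberately simple and can miss a better overall assignment; it only shows why a local task avoids the network copy that a remote task needs.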
Timelines
Year Month Event
2003 October Google File System paper released
2006 January Hadoop is born from Nutch 197
2006 February Hadoop is named after Cutting's son's yellow plush toy
2006 April Hadoop 0.1.0 released
2006 April Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2008 March First Hadoop Summit
2008 April Hadoop world record: fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds
2008 May Hadoop wins TeraByte Sort (World Record sortbenchmark.org)
2008 July Hadoop wins Terabyte Sort Benchmark
2008 November Google MapReduce implementation sorted one terabyte in 68 seconds
2009 May Yahoo! used Hadoop to sort one terabyte in 62 seconds
2011 December Apache Hadoop 1.0 available
Hadoop 1 vs Hadoop 2
S. No. 2
Hadoop 1: MapReduce (MR) does both processing and cluster resource management.
Hadoop 2: YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.
S. No. 3
Hadoop 1: Has limited scaling of nodes. Limited to 4000 nodes per cluster.
Hadoop 2: Has better scalability. Scalable up to 10000 nodes per cluster.
S. No. 4
Hadoop 1: Works on the concept of slots. A slot can run either a Map task or a Reduce task only.
Hadoop 2: Works on the concept of containers. Containers can run generic tasks.
S. No. 5
Hadoop 1: A single NameNode manages the entire namespace.
Hadoop 2: Multiple NameNode servers manage multiple namespaces.
S. No. 6
Hadoop 1: Has a single point of failure (SPOF) because of the single NameNode. A NameNode failure needs manual intervention to overcome.
Hadoop 2: Has a feature to overcome the SPOF with a standby NameNode. In the case of a NameNode failure, it is configured for automatic recovery.
S. No. 7
Hadoop 1: The MR API is compatible with Hadoop 1.x. A program written in Hadoop 1 executes in Hadoop 1.x without any additional files.
Hadoop 2: The MR API requires additional files for a program written in Hadoop 1.x to execute in Hadoop 2.x.
S. No. 9
Hadoop 1: A NameNode failure affects the stack.
Hadoop 2: The Hadoop stack (Hive, Pig, HBase, etc.) is equipped to handle NameNode failure.
