Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0
 


This presentation explains the new features in Hadoop 2.0 in detail and clears up common doubts about them. The four main improvements in Hadoop 2.0 over Hadoop 1.x are:


HDFS Federation – horizontal scalability of NameNode
NameNode High Availability – NameNode is no longer a Single Point of Failure
YARN – ability to process terabytes and petabytes of data in HDFS with non-MapReduce applications such as MPI and Giraph
Resource Manager – splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons: a global ResourceManager and a per-application ApplicationMaster
Additional features, such as the Capacity Scheduler (which enables multi-tenancy in Hadoop), data snapshots, Windows support, and NFS access, make Hadoop easier to adopt in industry for solving Big Data problems.


    Introduction to Hadoop 2.0 and How it overcomes the Limitations of Hadoop 1.0: Presentation Transcript

    • Introduction to Hadoop 2.0 Architecture www.edureka.in/hadoop
    • Objectives of this Session: the Big Data problem; how the Hadoop ecosystem comes to the rescue; Hadoop 1.0 architecture and limitations; how the Hadoop 2.0 architecture overcomes the challenges; a quiz to reinforce your learning. www.edureka.in/hadoop For further queries and class recording: #askedureka. Follow us on Twitter @edurekaIN. Like us on Facebook /edurekaIN.
    • Big Data Use Cases: Tweet Trend Analysis; Telecom – Service Usage Analysis; e-Governance – Social Welfare; Banks and Financial Services.
    • Growing Interest in Hadoop
    • Hadoop Eco-System [diagram]: Apache Oozie (workflow); HDFS (Hadoop Distributed File System); MapReduce framework; Pig Latin (data analysis); Mahout (machine learning); Hive (DW system); HBase; Flume (import of unstructured or semi-structured data); Sqoop (import or export of structured data). Users shown: ETL/DW professionals, developers/programmers, DBAs/administrators.
    • Hadoop 1.0 – In Summary [diagram]: a Client talks to HDFS (a NameNode, a Secondary NameNode, and DataNodes holding data blocks) and to MapReduce (a Job Tracker coordinating Task Trackers); each DataNode is paired with a Task Tracker that runs Map and Reduce tasks.
    • Limitations of Hadoop 1.x: no horizontal scalability of the NameNode; no support for NameNode High Availability; overburdened JobTracker; not possible to run non-MapReduce Big Data applications on HDFS; no support for multi-tenancy.
    • Challenges with Horizontal Scale: metadata is stored in NameNode memory; bottleneck after ~4000 nodes; results in cascading failures. [Diagram: a Client and DataNodes around a single NameNode that holds both the namespace (NS) and block management.]
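Because the metadata lives in NameNode memory, a back-of-the-envelope calculation shows why a single NameNode stops scaling. The sketch below is only illustrative: the 150 bytes per namespace object is a commonly quoted rule of thumb and is an assumption here, not a figure from the slides, and the function name is hypothetical.

```python
# Rough NameNode heap estimate.
# ASSUMPTION: ~150 bytes of heap per namespace object (file or block).
# This is a rule of thumb, not a number taken from the presentation.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, blocks_per_file,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Estimate heap needed to hold one object per file plus one per block."""
    file_objects = num_files
    block_objects = num_files * blocks_per_file
    return (file_objects + block_objects) * bytes_per_object


if __name__ == "__main__":
    for files in (10_000_000, 100_000_000, 1_000_000_000):
        heap = estimate_namenode_heap_bytes(files, blocks_per_file=1)
        print(f"{files:>13,} files -> ~{heap / 1e9:.0f} GB of NameNode heap")
```

Under this assumption the heap requirement grows linearly with the number of files and blocks, which is the pressure that splitting the namespace across several NameNodes is meant to relieve.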
    • NameNode – Single Point of Failure. The Secondary NameNode is “not a hot standby” for the NameNode; it connects to the NameNode regularly; it performs housekeeping and backs up the NameNode metadata; the saved metadata can be used to rebuild a failed NameNode.
    • Job Tracker – Overburdened. CPU: spends a very significant portion of its time and effort managing the life cycle of applications. Network: a single listener thread communicates with thousands of Map and Reduce tasks running on the Task Trackers.
    • Unutilized Data in HDFS. Challenges: only MapReduce processing is possible; alternate data storage is needed for other processing such as real-time or graph analysis; no support for multi-tenancy.
    • Introducing Hadoop 2.0. Most important features: HDFS Federation; support for NameNode High Availability; YARN (Yet Another Resource Negotiator); better processing control; support for non-MapReduce types of processing; support for multi-tenancy.
    • Hadoop 2.0 Cluster Architecture – Federation [diagram]. Hadoop 1.0: a single NameNode holds the namespace (NS) and block management, with DataNodes providing storage. Hadoop 2.0: NameNodes NN-1 … NN-k … NN-n each own a namespace (NS1 … NSk … NSn) and a block pool (Pool 1 … Pool k … Pool n), while DataNodes 1 … m provide common storage for all block pools. Source: http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
    • Hadoop 2.0 – In Summary [diagram]. HDFS with NameNode High Availability: an Active NameNode and a Standby NameNode share edit logs, and the Client reads and writes. YARN (next-generation MapReduce): a Resource Manager containing the Scheduler and the Applications Manager (AsM). Masters: the NameNodes and the Resource Manager. Slaves: each runs a DataNode and a Node Manager hosting Containers and App Masters.
    • Hadoop 2.0 – High Availability (same diagram): each NameNode's read/write logs apply to its own namespace; all namespace edits are logged to shared NFS storage, with a single writer enforced by fencing.
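The "single writer (fencing)" rule on the High Availability slide can be sketched as a toy model. This is not Hadoop code: the class names are hypothetical, and an epoch counter stands in, as an assumption, for whatever fencing mechanism a real cluster is configured with. It only shows the idea that once the standby takes over, edits from the old active NameNode are rejected.

```python
class FencedError(Exception):
    """Raised when a NameNode that is no longer the writer tries to log an edit."""


class SharedEditLog:
    """Stand-in for the shared edit log; accepts appends from one writer only."""

    def __init__(self):
        self.edits = []
        self.writer_epoch = 0

    def claim_writer(self):
        # Bumping the epoch invalidates (fences) the previous writer.
        self.writer_epoch += 1
        return self.writer_epoch

    def append(self, epoch, edit):
        if epoch != self.writer_epoch:
            raise FencedError(f"stale epoch {epoch}, current is {self.writer_epoch}")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.epoch = None        # None until this node becomes active
        self.namespace = set()
        self.applied = 0         # number of shared edits already replayed

    def catch_up(self):
        # Replay edits from the shared log that this node has not applied yet.
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def become_active(self):
        self.catch_up()
        self.epoch = self.log.claim_writer()

    def mkdir(self, path):
        # Log the edit first; it only reaches the namespace if the log accepts it.
        self.log.append(self.epoch, ("mkdir", path))
        self.catch_up()


def failover_demo():
    log = SharedEditLog()
    nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
    nn1.become_active()
    nn1.mkdir("/a")
    nn2.become_active()          # failover: nn1 is now fenced
    try:
        nn1.mkdir("/b")
        rejected = False
    except FencedError:
        rejected = True
    nn2.mkdir("/c")
    return rejected, sorted(nn2.namespace), log.edits


if __name__ == "__main__":
    print(failover_demo())
```

In the demo the old active's write to `/b` is refused, while the new active has replayed `/a` from the shared log before accepting `/c`, so the namespace never diverges between the two NameNodes.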
    • Hadoop 2.0 – Resource Management (same diagram): in YARN, the Resource Manager, with its Scheduler and Applications Manager (AsM), runs on the master; each slave runs a Node Manager that hosts Containers and per-application App Masters alongside the DataNode.
    • YARN – Moving beyond MapReduce: BATCH (MapReduce); INTERACTIVE (Tez); ONLINE (HBase); STREAMING (Storm, S4, …); GRAPH (Giraph); IN-MEMORY (Spark); HPC MPI (OpenMPI); OTHER (Search) (Weave..). Source: http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
    • Multi-Tenancy – Capacity Scheduler. Features: different types of jobs are organized in different queues (Batch, Interactive, Streaming); queue shares are expressed as percentages of the cluster; each queue has an associated priority; FIFO scheduling within each queue; security is ensured between applications.
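Two of the features listed on the Capacity Scheduler slide, percentage shares per queue and FIFO ordering within a queue, can be illustrated with a small model. This is a simplified sketch, not the real scheduler: the class names, the container counts, and the strict "head of queue blocks" behaviour are assumptions, and queue priorities, borrowing of idle capacity, and security are left out.

```python
from collections import deque


class Queue:
    def __init__(self, name, share_percent):
        self.name = name
        self.share_percent = share_percent
        self.jobs = deque()   # FIFO order within the queue
        self.used = 0         # containers currently in use by this queue


class CapacitySchedulerSketch:
    """Toy model: each queue may use at most its percentage of the cluster."""

    def __init__(self, total_containers, shares):
        if sum(shares.values()) != 100:
            raise ValueError("queue shares must add up to 100%")
        self.total = total_containers
        self.queues = {name: Queue(name, pct) for name, pct in shares.items()}

    def capacity(self, queue_name):
        return self.total * self.queues[queue_name].share_percent // 100

    def submit(self, queue_name, job_id, containers):
        self.queues[queue_name].jobs.append((job_id, containers))

    def schedule(self):
        """Start jobs in FIFO order per queue until the queue's share is full."""
        started = []
        for q in self.queues.values():
            while q.jobs:
                job_id, need = q.jobs[0]
                if q.used + need > self.capacity(q.name):
                    break     # strict FIFO: the head job blocks those behind it
                q.jobs.popleft()
                q.used += need
                started.append((q.name, job_id))
        return started


if __name__ == "__main__":
    sched = CapacitySchedulerSketch(
        100, {"batch": 50, "interactive": 30, "streaming": 20})
    sched.submit("batch", "b1", 40)
    sched.submit("batch", "b2", 20)       # does not fit after b1
    sched.submit("interactive", "i1", 10)
    sched.submit("streaming", "s1", 25)   # exceeds the 20% streaming share
    print(sched.schedule())
```

The point of the sketch is isolation between tenants: a large batch backlog cannot consume the containers reserved for the interactive queue.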
    • Annie’s Question: NameNode HA was developed to overcome which disadvantage in Hadoop 1.0? a) Single point of failure of the NameNode b) To run classic MapReduce c) Too much burden on the Job Tracker
    • Annie’s Answer: (a) Single point of failure of the NameNode.
    • Annie’s Question: YARN was developed to overcome which disadvantage in the Hadoop 1.0 MapReduce framework? a) Single point of failure of the NameNode b) Only one version can be run in classic MapReduce c) Too much burden on the Job Tracker
    • Annie’s Answer: (c) Too much burden on the Job Tracker. YARN was also developed to support multi-tenancy.
    • Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions. Annie’s Question: Which of the following is (are) a significant disadvantage in Hadoop 1.0? - ‘Single Point Of Failure’ of the NameNode - It can run only one version in classic MapReduce - Too much burden on the Job Tracker
    • Annie’s Question: A Hadoop 1.x cluster can have multiple HDFS namespaces. - True - False
    • Annie’s Answer: False. Not possible with Hadoop 1.x.
    • Annie’s Question: Can you use Hadoop 2.0 for real-time processing? - Yes - No
    • Annie’s Answer: No. Even though YARN in Hadoop 2.0 supports multiple frameworks for workloads other than batch, you need Storm or S4 for real-time processing.
    • Annie’s Question: How does HDFS Federation help HDFS scale horizontally? A) It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace. B) It provides cross-data-centre (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
    • Annie’s Answer: (A). To scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
    • Annie’s Question: You have configured two NameNodes to manage the /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in the /accounting directory?
    • Annie’s Answer: The ‘put’ will fail. Neither namespace manages the file, and you will get an IOException with a “No such file or directory” error.
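The /marketing, /finance and /accounting quiz can be reproduced with a small model of a client-side mount table that routes each path to the NameNode owning its namespace. This is a sketch of the routing idea only, not the HDFS client API: the `MountTable` class and the NameNode labels `nn-marketing` and `nn-finance` are hypothetical, and Python's `FileNotFoundError` (a subclass of `OSError`) stands in for the IOException mentioned in the answer.

```python
class MountTable:
    """Maps namespace prefixes to the NameNode that manages them."""

    def __init__(self, mounts):
        self.mounts = dict(mounts)   # e.g. {"/marketing": "nn-marketing"}

    def resolve(self, path):
        for prefix, namenode in self.mounts.items():
            # Match the mount point itself or anything below it,
            # but not a sibling such as /marketing2.
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                return namenode
        raise FileNotFoundError(f"No such file or directory: {path}")


def put(table, stores, path, data):
    """Store data under the NameNode that owns the path's namespace."""
    namenode = table.resolve(path)           # raises if no namespace matches
    stores.setdefault(namenode, {})[path] = data
    return namenode


if __name__ == "__main__":
    table = MountTable({"/marketing": "nn-marketing", "/finance": "nn-finance"})
    stores = {}
    print(put(table, stores, "/marketing/q1.csv", b"..."))
    try:
        put(table, stores, "/accounting/ledger.csv", b"...")
    except FileNotFoundError as err:
        print("put failed:", err)
```

Because the two NameNodes are independent, nothing falls back to a default namespace: a path outside every configured prefix simply has no owner.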
    • Annie’s Question: Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x? - Yes - No
    • Annie’s Answer: No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.
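The answer above can be made concrete by comparing fixed Map/Reduce slots with generic containers on the same hardware. The numbers and function names below are made up for illustration, and the model assumes that every task needs exactly one slot or one equally sized container, which is a simplification.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks, other_tasks):
    """Hadoop 1.x style: slots are typed, so non-MapReduce tasks cannot run."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running, running / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks, other_tasks):
    """Hadoop 2.0 style: any task type may occupy any free container."""
    running = min(containers, map_tasks + reduce_tasks + other_tasks)
    return running, running / containers


if __name__ == "__main__":
    # Hypothetical workload: 2 map tasks, 1 reduce task, 9 graph-processing tasks.
    print(slot_utilization(8, 4, 2, 1, 9))       # typed slots sit idle
    print(container_utilization(12, 2, 1, 9))    # same capacity, fully used
```

With typed slots the nine non-MapReduce tasks never start even though most of the cluster is idle, which is the situation the slide on unutilized data in HDFS describes.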
    • Thank You. See you in the next class.