Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 

Presentation Transcript

  • Slide 1 www.edureka.in/hadoop
  • Slide 2 Hello There!! My name is Annie. Let me test your Hadoop 1.x knowledge. Annie’s Introduction
  • Slide 3 Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. Can you store 1 billion files in a Hadoop 1.x cluster? - Yes - No Annie’s Question
  • Slide 4 No. Even though you have hundreds of DataNodes in the cluster, the NameNode keeps all its metadata in memory, so you are limited to a maximum of only 50-100M files in the entire cluster because of a Single NameNode in Hadoop 1.x. Annie’s Answer
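The ceiling in this answer comes from heap arithmetic. As a rough worked example, assume the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block) and one block per file. Both figures are assumptions for illustration, not numbers from the slides.

```python
# Rough estimate of the NameNode heap needed to hold HDFS metadata in memory.
# BYTES_PER_OBJECT is an assumed rule of thumb, not a figure from the slides.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Each file costs one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


for files in (100_000_000, 1_000_000_000):
    gigabytes = namenode_heap_bytes(files) / 1e9
    print(f"{files:,} files -> about {gigabytes:.0f} GB of heap")
```

Under these assumptions, 100 million files need about 30 GB of heap and 1 billion files about 300 GB, which is why a single NameNode cannot hold a billion-file namespace.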
  • Slide 5 A Hadoop 1.x cluster can have multiple HDFS Namespaces. - True - False Annie’s Question
  • Slide 6 False. Not possible with Hadoop 1.x. Annie’s Answer
  • Slide 7 Which of the following are significant disadvantages of Hadoop 1.0? - ‘Single Point Of Failure’ of NameNode - Too much burden on Job Tracker Annie’s Question
  • Slide 8 Single Point of Failure of NameNode and too much burden on Job Tracker. Annie’s Answer
  • Slide 9 Can you use the hundreds of DataNodes in a Hadoop 1.x cluster for any processing other than MapReduce? - Yes - No Annie’s Question
  • Slide 10 No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload. Annie’s Answer
  • Slide 11 Can you use Hadoop for Real-time processing? - Yes - No Annie’s Question
  • Slide 12 No. Hadoop is designed and developed for massively parallel batch processing. Annie’s Answer
  • Slide 13 Limitations of Hadoop 1.x
      - No horizontal scalability of NameNode
      - Does not support NameNode High Availability
      - Overburdened JobTracker
      - Not possible to run Non-MapReduce Big Data Applications on HDFS
      - Does not support Multi-tenancy
  • Slide 14 www.edureka.in/hadoop Hadoop 1.x – In Summary [Diagram: the Client talks to HDFS (NameNode, Secondary NameNode, and DataNodes holding data blocks) and to MapReduce (a Job Tracker, with a Task Tracker running Map and Reduce tasks on each DataNode).]
  • Slide 15 www.edureka.in/hadoop Hadoop 1.x - Challenges
      Problem | Description
      NameNode – No Horizontal Scalability | Single NameNode and single namespace, limited by NameNode RAM
      NameNode – No High Availability (HA) | NameNode is a Single Point of Failure; needs manual recovery using the Secondary NameNode in case of failure
      Job Tracker – Overburdened | Spends a significant portion of time and effort managing the life cycle of applications
      MRv1 – Only Map and Reduce tasks | The huge volume of data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing
  • Slide 16 www.edureka.in/hadoop NameNode – Scale and HA [Diagram: the Client gets block locations from a single NameNode, which holds the namespace (NS) and block management, and reads data from the DataNodes. Callouts: NameNode - No High Availability; NameNode - No Horizontal Scale.]
  • Slide 17 www.edureka.in/hadoop NameNode – Single Point of Failure
      - Secondary NameNode:
          - “Not a hot standby” for the NameNode
          - Connects to the NameNode every hour*
          - Housekeeping, backup of NameNode metadata
          - Saved metadata can be used to rebuild a failed NameNode
      [Diagram: the NameNode, the single point of failure, sends metadata to the Secondary NameNode, which says: “You give me metadata every hour, I will make it secure.”]
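The housekeeping step on this slide can be pictured as a checkpoint: the Secondary NameNode takes the last saved namespace image and replays the edit log on top of it. The sketch below is a toy model in Python. The in-memory dict standing in for the image and the `(op, path)` edit format are hypothetical and are not the HDFS on-disk format.

```python
def checkpoint(fsimage, edits):
    """Return a new image: the old image with every logged edit replayed in order."""
    image = dict(fsimage)  # copy, so the previous checkpoint stays untouched
    for op, path in edits:
        if op == "create":
            image[path] = {"blocks": []}
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return image


# Hypothetical namespace state and edit log, for illustration only.
fsimage = {"/logs/a.txt": {"blocks": []}}
edits = [("create", "/logs/b.txt"), ("delete", "/logs/a.txt")]

new_image = checkpoint(fsimage, edits)
print(sorted(new_image))
```

Edits made after the most recent checkpoint are not in the saved image, which is one reason a rebuild from the Secondary NameNode is a manual recovery rather than a failover.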
  • Slide 18 www.edureka.in/hadoop Job Tracker – Overburdened
      - CPU: Spends a very significant portion of time and effort managing the life cycle of applications
      - Network: Single Listener Thread to communicate with thousands of Map and Reduce Jobs
      [Diagram: one Job Tracker connected to many Task Trackers.]
  • Slide 19 www.edureka.in/hadoop MRv1 – Unpredictability in Large Clusters. As the cluster size grows and reaches 4,000 nodes:
      - Cascading Failures: DataNode failures result in a serious deterioration of overall cluster performance, because the attempts to re-replicate data flood the network and overload the live nodes.
      - Multi-tenancy: As clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, and they cannot be repurposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.
  • Slide 20 www.edureka.in/hadoop Unutilized Data in HDFS: Terabytes and Petabytes of data in HDFS can only be used for MapReduce processing
  • Slide 21 www.edureka.in/hadoop Introducing Hadoop 2.0
      Features | Hadoop 1.x | Hadoop 2.0
      HDFS Federation | One NameNode and one Namespace | Multiple NameNodes and Namespaces
      NameNode High Availability | Not present | Highly Available
      YARN - Processing Control and Multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler
      Other important Hadoop 2.0 features:
      - HDFS Snapshots
      - NFSv3 access to data in HDFS
      - Support for running Hadoop on MS Windows
      - Binary Compatibility for MapReduce applications built on Hadoop 1.0
      - Substantial amount of Integration testing with the rest of the projects (such as PIG, HIVE) in the Hadoop ecosystem
  • Slide 22 www.edureka.in/hadoop Hadoop 2.0 Cluster Architecture - Federation [Diagram: in Hadoop 1.0, a single NameNode holds the namespace (NS) and block management on top of DataNode storage. In Hadoop 2.0, NameNodes NN-1 … NN-k … NN-n each manage their own namespace (NS1 … NSk … NSn) and block pool (Pool 1 … Pool k … Pool n), and DataNodes 1 … m provide common storage for all block pools.] http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
  • Slide 23 www.edureka.in/hadoop Annie’s Question How does HDFS Federation help HDFS scale horizontally? A) Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace. B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
  • Slide 24 www.edureka.in/hadoop Annie’s Answer (A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.
  • Slide 25 Annie’s Question You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory? www.edureka.in/hadoop
  • Slide 26 Annie’s Answer The ‘put’ will fail. Neither namespace manages the /accounting path, so you will get an IOException with a “No such file or directory” error. www.edureka.in/hadoop
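One way to picture this answer is a client-side mount table that maps path prefixes to NameNodes. The Python sketch below is a toy model of that lookup, not the HDFS client API, and the NameNode addresses are hypothetical.

```python
# Hypothetical mount table: each namespace is owned by exactly one NameNode.
MOUNT_TABLE = {
    "/marketing": "hdfs://nn-marketing.example:8020",
    "/finance": "hdfs://nn-finance.example:8020",
}


def resolve(path):
    """Return the NameNode that owns `path`, or fail if no namespace covers it."""
    for prefix, namenode in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    # FileNotFoundError is Python's OSError subclass for this condition;
    # it stands in for the IOException mentioned on the slide.
    raise FileNotFoundError(f"No such file or directory: {path}")


print(resolve("/finance/q1/report.csv"))
try:
    resolve("/accounting/ledger.csv")
except FileNotFoundError as err:
    print("put failed:", err)
```

Because the NameNodes are independent, nothing falls back to a default namespace in this model: a path outside every mounted prefix has no owner.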
  • Slide 27 Hadoop 2.0 Cluster Architecture - HA [Diagram: HDFS HIGH AVAILABILITY. The Client talks to HDFS (NameNode High Availability) and YARN (Next Generation MapReduce). The Active NameNode logs all namespace edits to shared NFS storage, with a single writer enforced by fencing; the Standby NameNode reads the shared edit logs and applies them to its own namespace. The YARN Resource Manager works with the Node Managers, each running next to a Data Node and hosting Containers and App Masters.] http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
  • Slide 28 www.edureka.in/hadoop Hadoop 2.0 Cluster Architecture - HA
      - NameNode recovery in Hadoop 1.0: NameNode and Secondary NameNode (edit logs, FSImage, metadata); manual recovery using the Secondary NameNode
      - High Availability in Hadoop 2.0: Active NameNode and Standby NameNode; automatic failover to the Standby NameNode
  • Slide 29 Annie’s Question Which of the following disadvantages of Hadoop 1.0 was NameNode HA developed to overcome? a) Single Point Of Failure of NameNode b) Too much burden on Job Tracker www.edureka.in/hadoop
  • Slide 30 Annie’s Answer Single Point of Failure of NameNode. www.edureka.in/hadoop
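The HA slides describe a shared edit log with a single writer, enforced by fencing, and a standby that replays the log. A minimal Python simulation of that idea follows. The epoch counter used for fencing is an illustrative assumption (the slides do not say how fencing is implemented), and all class and method names are hypothetical.

```python
class SharedEditLog:
    """Shared storage that accepts writes only from the current writer epoch."""

    def __init__(self):
        self.edits = []
        self.epoch = 0

    def new_epoch(self):
        self.epoch += 1  # any writer holding an older epoch is now fenced
        return self.epoch

    def append(self, epoch, edit):
        if epoch != self.epoch:
            raise PermissionError("fenced: writer epoch is stale")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name, self.log = name, log
        self.namespace = set()
        self.applied = 0    # number of shared edits already applied locally
        self.epoch = None   # None means this node is a standby

    def catch_up(self):
        """Standby behaviour: read new edits and apply them to its own namespace."""
        for op, path in self.log.edits[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def become_active(self):
        self.catch_up()
        self.epoch = self.log.new_epoch()

    def create(self, path):
        self.log.append(self.epoch, ("create", path))  # log first, then apply
        self.namespace.add(path)
        self.applied = len(self.log.edits)


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
nn1.become_active()
nn1.create("/marketing/a.csv")
nn2.become_active()  # failover: nn2 replays the log and takes over as writer
try:
    nn1.create("/marketing/b.csv")  # the old active is fenced off
except PermissionError as err:
    print("nn1 rejected:", err)
nn2.create("/marketing/c.csv")
print(sorted(nn2.namespace))
```

The point of the single-writer rule is visible at the failed write: once the standby takes over, the old active can no longer change the shared log, so the two NameNodes never diverge.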
  • Slide 31 www.edureka.in/hadoop YARN and Hadoop Ecosystem [Diagram: two stacks. Both show Apache Oozie (Workflow), Pig Latin (Data Analysis), Hive (DW System), the MapReduce Framework, and HBase over HDFS (Hadoop Distributed File System); the second adds YARN (Cluster Resource Management) and other YARN frameworks (MPI, GIRAPH).] YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework.
  • Slide 32 www.edureka.in/hadoop YARN – Moving beyond MapReduce
      - BATCH (MapReduce)
      - INTERACTIVE (Tez)
      - ONLINE (HBase)
      - STREAMING (Storm, S4, …)
      - GRAPH (Giraph)
      - IN-MEMORY (Spark)
      - HPC MPI (OpenMPI)
      - OTHER (Search) (Weave..)
      http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
  • Slide 33 www.edureka.in/hadoop Multi-tenancy - Capacity Scheduler
      - Organizes jobs into queues
      - Queue shares as %’s of cluster
      - FIFO scheduling within each queue
      - Data locality-aware scheduling
      - Hierarchical Queues: To manage the resources within an organization.
      - Capacity Guarantees: A fraction of the total available capacity is allocated to each queue.
      - Security: To safeguard applications from other users.
      - Elasticity: Resources are available in a predictable and elastic manner to queues.
      - Multi-tenancy: A set of limits to prevent over-utilization of resources by a single application.
      - Operability: Runtime configuration of queues.
      - Resource-based scheduling: If needed, applications can request more resources than the default.
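The first three points on this slide (queues, percentage shares, FIFO within a queue) can be sketched in a few lines of Python. This is a toy model, not the Capacity Scheduler's implementation or configuration format: the queue names, cluster size, and request sizes are hypothetical, and elasticity (borrowing idle capacity from other queues) is deliberately left out.

```python
from collections import deque

# Hypothetical cluster and queue shares, expressed as percentages of the cluster.
CLUSTER_MEMORY_GB = 1000
QUEUE_SHARES = {"marketing": 50, "finance": 30, "adhoc": 20}


def guaranteed_capacity(cluster_gb, shares):
    """Turn percentage shares into a guaranteed amount of memory per queue."""
    if sum(shares.values()) != 100:
        raise ValueError("queue shares must add up to 100%")
    return {queue: cluster_gb * pct // 100 for queue, pct in shares.items()}


def schedule_fifo(capacity_gb, requests):
    """Admit (app, memory_gb) requests in arrival order until the next one does not fit."""
    pending = deque(requests)
    running, used = [], 0
    while pending and used + pending[0][1] <= capacity_gb:
        app, mem = pending.popleft()
        running.append(app)
        used += mem
    return running, list(pending)


capacity = guaranteed_capacity(CLUSTER_MEMORY_GB, QUEUE_SHARES)
running, waiting = schedule_fifo(
    capacity["finance"], [("etl", 200), ("report", 150), ("probe", 50)]
)
print(capacity)
print("running:", running, "waiting:", waiting)
```

In this strict FIFO model the small `probe` request waits behind `report` even though it would fit, which shows what ordering within a queue means in practice.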
  • Slide 34 Annie’s Question Which of the following disadvantages of the Hadoop 1.0 MapReduce framework was YARN developed to overcome? a) Single Point Of Failure Of NameNode b) Too much burden on Job Tracker www.edureka.in/hadoop
  • Slide 35 Annie’s Answer Too much burden on Job Tracker. www.edureka.in/hadoop
  • Slide 36 www.edureka.in/hadoop Hadoop 2.0 – In Summary [Diagram: the Client talks to HDFS (distributed data storage, NameNode High Availability) and YARN (distributed data processing, Next Generation MapReduce). Masters: the Active NameNode and Standby NameNode, linked by shared edit logs or a Journal Node, and the Resource Manager with its Scheduler and Applications Manager (AsM). Slaves: each node runs a DataNode and a Node Manager hosting Containers and App Masters.]
  • Slide 37 Can you use Hadoop 2.0 for Real-time processing? - Yes - No Annie’s Question
  • Slide 38 No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing. Annie’s Answer
  • Slide 39 www.edureka.in/hadoop What about Real-time Processing? Hadoop is good for batch, but how do I process Big Data in real time?
  • Slide 40 www.edureka.in/hadoop Storm is coming…. APACHE STORM: The Real-time Hadoop
      - Continuous computation system
      - Distributed, reliable, fault-tolerant, scalable, and robust
          - Suitable for Big Data processing
          - Guarantees no data loss
      - Programming language agnostic
          - JSON-based for Ruby, Python, etc.
      - Use cases
          - Stream processing
          - Distributed RPC
          - Continuous computation
  • Slide 41 Hadoop Vs. Storm
      Differences
      Hadoop | Storm
      Fundamentally a batch processing system | Real-time processing: processes unterminated streams of data (e.g. Twitter feeds), handling data as it arrives
      MapReduce jobs run to completion | Topologies (computation graphs) run forever
      Stateful nodes | Stateless nodes
      Similarities
      Hadoop | Storm
      Scalable | Scalable
      Guarantees no data loss | Guarantees no data loss
      Open Source | Open Source
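The first two rows of the comparison (jobs that run to completion versus topologies that run forever) can be illustrated with plain Python. This is a conceptual sketch only and uses neither the Hadoop nor the Storm API; the feed generator is a hypothetical stand-in for an unterminated source such as a tweet stream.

```python
from collections import Counter
from itertools import count, islice


def batch_word_count(records):
    """Batch style: the input is finite, the job runs to completion, one result."""
    return Counter(word for line in records for word in line.split())


def streaming_word_count(stream):
    """Streaming style: update and emit running counts as each record arrives."""
    counts = Counter()
    for line in stream:
        counts.update(line.split())
        yield dict(counts)


def endless_feed():
    """Hypothetical unterminated source; it never signals end of input."""
    for i in count():
        yield "storm hadoop" if i % 2 else "storm"


print(batch_word_count(["storm", "storm hadoop"]))

# The stream never ends, so the consumer decides how many updates to observe.
for snapshot in islice(streaming_word_count(endless_feed()), 3):
    print(snapshot)
```

The batch function cannot return until it has seen all of its input, so it could never be handed `endless_feed()`; the streaming function produces a usable answer after every record.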
  • Slide 42 Storm Use Cases
      - Data Normalization
          - Groupon uses Storm to build real-time data integration systems.
      - Analytics
          - Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners.
          - Flipboard uses Storm across a wide range of services, from content search to real-time analytics to generating custom magazine feeds.
      - Log processing
          - Alibaba uses Storm to process application logs and data changes in databases to supply real-time data stats for data apps.
          - NaviSite uses Storm in its server log monitoring and auditing system.
  • Thank You. See you in the next class.