Hadoop Developer Presentation Transcript

  • 1. Slide 1 www.edureka.in/hadoop
  • 2. Course Topics
     Module 1: Understanding Big Data; Hadoop Architecture
     Module 2: Hadoop Cluster Configuration; Data Loading Techniques; Hadoop Project Environment
     Module 3: Hadoop MapReduce Framework; Programming in MapReduce
     Module 4: Advanced MapReduce; MRUnit Testing Framework
     Module 5: Analytics Using Pig; Understanding Pig Latin
     Module 6: Analytics Using Hive; Understanding HiveQL
     Module 7: Advanced Hive; NoSQL Databases and HBase
     Module 8: Advanced HBase; ZooKeeper Service
     Module 9: Hadoop 2.0 – New Features; Programming in MRv2
     Module 10: Apache Oozie; Real-World Datasets and Analysis; Project Discussion
  • 3. Topics for Today
     Revision
     Hadoop 2.0 New Features
     HDFS High Availability
     HDFS Federation
     YARN or MRv2
     YARN and the Hadoop Ecosystem
     YARN Components
     YARN Application Execution
     Running an Application with YARN
     Writing a YARN Application
  • 4. Let's Revise – HBase and ZooKeeper
    [Diagram: The HBase Master and RegionServers (each holding a MemStore, WAL, and HFiles) coordinate through ZooKeeper znodes (/hbase/region1, /hbase/region2, …) and store their data on HDFS.]
  • 5. Hadoop 1.0 – In Summary
    [Diagram: A client interacts with HDFS and MapReduce. The NameNode (assisted by a Secondary NameNode) manages data blocks stored across DataNodes; the JobTracker hands MapReduce work to TaskTrackers, which run on the same nodes as the DataNodes.]
  • 6. Hadoop 1.0 – Challenges
    Problem | Description
    NameNode – no horizontal scalability | Single NameNode and single namespace, limited by NameNode RAM
    NameNode – no High Availability (HA) | NameNode is a single point of failure; manual recovery using the Secondary NameNode is needed in case of failure
    JobTracker – overburdened | Spends a significant portion of time and effort managing the life cycle of applications
    MRv1 – only Map and Reduce tasks | Humongous data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing
  • 7. NameNode – Scale and HA
    [Diagram: A single NameNode holds the namespace (NS) and block management; clients get block locations from it and read data directly from the DataNodes. There is no horizontal scale and no High Availability.]
  • 8. NameNode – Single Point of Failure
     Secondary NameNode:
       "Not a hot standby" for the NameNode
       Connects to the NameNode every hour*
       Housekeeping, backup of NameNode metadata
       Saved metadata can be used to rebuild a failed NameNode
    [Diagram: "You give me metadata every hour, I will make it secure" – the NameNode, a single point of failure, periodically ships its metadata to the Secondary NameNode.]
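    For context, a minimal sketch (not part of the original deck) of the checkpoint settings that drive the Secondary NameNode's hourly housekeeping, expressed through Hadoop's Java Configuration API; in practice these Hadoop 1.x keys live in core-site.xml/hdfs-site.xml, and the values shown here are illustrative assumptions.

        import org.apache.hadoop.conf.Configuration;

        public class CheckpointSettings {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Seconds between Secondary NameNode checkpoints -- the "every hour" above.
                conf.set("fs.checkpoint.period", "3600");
                // Local directory where the Secondary NameNode keeps the merged metadata
                // (illustrative path, not a required location).
                conf.set("fs.checkpoint.dir", "/data/namesecondary");
                System.out.println("Checkpoint interval: " + conf.get("fs.checkpoint.period") + "s");
            }
        }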
  • 9. Job Tracker – Overburdened
     CPU: spends a very significant portion of time and effort managing the life cycle of applications
     Network: a single listener thread communicates with thousands of Map and Reduce jobs on the TaskTrackers
  • 10. MRv1 – Unpredictability in Large Clusters
    As the cluster size grows and reaches 4,000 nodes:
     Cascading failures: DataNode failures result in a serious deterioration of overall cluster performance, because attempts to re-replicate data overload the live nodes and flood the network.
     Multi-tenancy: as clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, so they cannot be repurposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.
  • 11. Unutilized Data in HDFS
     Terabytes and petabytes of data in HDFS can be used only for MapReduce processing.
  • 12. Hadoop 2.0 New Features
    Property | Hadoop 1.0 | Hadoop 2.0
    Federation | One NameNode and namespace | Multiple NameNodes and namespaces
    High Availability | Not present | Highly available
    YARN – processing control and multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler
    Other important Hadoop 2.0 features:
     HDFS snapshots
     NFSv3 access to data in HDFS
     Support for running Hadoop on MS Windows
     Binary compatibility for MapReduce applications built on Hadoop 1.0
     Substantial integration testing with the rest of the Hadoop ecosystem (such as Pig and Hive)
  • 13. Hadoop 2.0 Cluster Architecture – Federation
    [Diagram: In Hadoop 1.0, one NameNode owns the namespace (NS) and block management, with storage on the DataNodes. In Hadoop 2.0, multiple independent NameNodes (NN-1 … NN-k … NN-n) each manage their own namespace (NS1 … NSk … NSn) and block pool (Pool 1 … Pool k … Pool n) over common storage provided by DataNodes 1 … m.]
    http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
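    To make the federation layout concrete, here is a hedged configuration sketch using Hadoop's Java Configuration API; the nameservice IDs (ns1, ns2) and hostnames are made-up examples, and a real deployment would set these dfs.* properties in hdfs-site.xml.

        import org.apache.hadoop.conf.Configuration;

        public class FederationConfigSketch {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Two independent, federated NameNodes, each owning part of the namespace.
                conf.set("dfs.nameservices", "ns1,ns2");
                conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
                conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
                // Every DataNode registers with both NameNodes and serves both block pools.
                System.out.println("Federated nameservices: " + conf.get("dfs.nameservices"));
            }
        }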
  • 14. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    How does HDFS Federation help HDFS scale horizontally?
    - It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
    - It provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
  • 15. Annie's Answer
    A. In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
  • 16. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
  • 17. Annie's Answer
    The put will fail. Neither namespace will manage the file, and you will get an IOException with a "No such file or directory" error.
  • 18. Hadoop 2.0 Cluster Architecture – HA
    NameNode High Availability:
     All namespace edits are logged to shared NFS storage; only a single writer is allowed (enforced by fencing).
     The Standby NameNode reads the shared edit logs and applies them to its own namespace.
    *It is not necessary to configure a Secondary NameNode.
    [Diagram: The Active and Standby NameNodes share edit logs over NFS; DataNodes host Node Managers running containers and App Masters under the YARN Resource Manager (next-generation MapReduce), with clients submitting to the Resource Manager.]
    http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
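    The shared-edits arrangement above maps onto a handful of HA properties. A hedged sketch, again via the Java Configuration API (these keys normally go in hdfs-site.xml); the nameservice name "mycluster", the hostnames, and the NFS mount path are assumptions for illustration.

        import org.apache.hadoop.conf.Configuration;

        public class HaNfsConfigSketch {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.set("dfs.nameservices", "mycluster");
                // One Active and one Standby NameNode behind a single logical nameservice.
                conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
                conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
                conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
                // Shared NFS directory holding the edit logs; fencing enforces a single writer.
                conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer1/dfs/ha-name-dir-shared");
                conf.set("dfs.ha.fencing.methods", "sshfence");
                System.out.println("HA nameservice: " + conf.get("dfs.nameservices"));
            }
        }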
  • 19. NameNode Recovery vs. Failover
     NameNode recovery in Hadoop 1.0: manually recover the NameNode metadata (FsImage and edit logs) using the Secondary NameNode.
     High Availability in Hadoop 2.0: automatic failover from the Active NameNode to the Standby NameNode.
  • 20. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
    - Single point of failure of the NameNode
    - Only one version can be run in classic MapReduce
    - Too much burden on the JobTracker
  • 21. Annie's Answer
    Single point of failure of the NameNode.
  • 22. Hadoop 2.0 – In Summary
    [Diagram: Masters – the Active NameNode (with shared edit logs to a Standby NameNode for NameNode High Availability, plus a Secondary NameNode) and the YARN Resource Manager (Scheduler and Applications Manager, AsM) for next-generation MapReduce. Slaves – DataNodes for distributed data storage (HDFS), co-located with Node Managers running containers and App Masters for distributed data processing (YARN).]
  • 23. YARN and the Hadoop Ecosystem
    YARN adds a more general interface to run non-MapReduce jobs (such as graph processing) within the Hadoop framework.
    [Diagram: Hadoop 1.0 stack – Apache Oozie (workflow), Pig Latin (data analysis), Hive (DW system), and HBase over the MapReduce framework and HDFS. Hadoop 2.0 stack – the same components, plus other YARN frameworks (MPI, Giraph), over YARN cluster resource management and HDFS.]
  • 24. YARN – Moving Beyond MapReduce
    [Diagram: Workloads running on YARN – batch (MapReduce), interactive (Tez), online (HBase), streaming (Storm, S4, …), graph (Giraph), in-memory (Spark), HPC MPI (OpenMPI), and other (Search, Weave, …).]
    http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
  • 25. Multi-tenancy – Capacity Scheduler
     Organizes jobs into queues
     Queue shares as percentages of the cluster
     FIFO scheduling within each queue
     Data-locality-aware scheduling
    Capacity Scheduler features:
     Hierarchical queues: to manage resources within an organization
     Capacity guarantees: a fraction of the total available capacity is allocated to each queue
     Security: to safeguard applications from other users
     Elasticity: resources are available to queues in a predictable and elastic manner
     Multi-tenancy: a set of limits prevents over-utilization of resources by a single application
     Operability: runtime configuration of queues
     Resource-based scheduling: if needed, applications can request more resources than the default
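    As an illustration of queue shares and elasticity, a hedged sketch of Capacity Scheduler properties (normally set in capacity-scheduler.xml, shown here via the Java Configuration API); the queue names "prod" and "dev" and the 70/30 split are hypothetical.

        import org.apache.hadoop.conf.Configuration;

        public class CapacitySchedulerSketch {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Two hypothetical queues under root, sharing the cluster 70/30.
                conf.set("yarn.scheduler.capacity.root.queues", "prod,dev");
                conf.set("yarn.scheduler.capacity.root.prod.capacity", "70");
                conf.set("yarn.scheduler.capacity.root.dev.capacity", "30");
                // Elasticity: dev may grow up to 50% of the cluster while prod is idle.
                conf.set("yarn.scheduler.capacity.root.dev.maximum-capacity", "50");
                System.out.println("Queues: " + conf.get("yarn.scheduler.capacity.root.queues"));
            }
        }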
  • 26. Executing a MapReduce Application on YARN
  • 27. YARN MR Application Execution Flow
    MapReduce job execution stages:
     Job submission
     Job initialization
     Task assignment
     Task memory
     Status updates
     Failure recovery
    Components involved:
     Client: submits a MapReduce job
     Resource Manager: manages resource utilization across the Hadoop cluster
     Node Manager: runs on each DataNode, creates execution containers, and monitors container usage
     MapReduce Application Master: coordinates and manages MapReduce jobs, negotiating with the Resource Manager to schedule tasks; the tasks are started by the NodeManager(s)
     HDFS: shares resources and the job's artefacts among the YARN components
  • 28. Hadoop 2.0 – YARN
    YARN = Yet Another Resource Negotiator
     Resource Manager: cluster-level resource manager; long-lived, run on high-quality hardware
     Node Manager: one per DataNode; monitors resources on the DataNode
     Application Master: one per application; short-lived; manages tasks and scheduling
     JobHistory Server: keeps the history of completed jobs
    [Diagram: Clients submit jobs to the Resource Manager, which receives node status from the Node Managers and resource requests from App Masters; the Node Managers host the containers.]
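    Since "Writing a YARN Application" is on today's agenda, a minimal client-side sketch using the org.apache.hadoop.yarn.client.api.YarnClient API; the application name and the /bin/date ApplicationMaster command are placeholders (a real AM would register with the Resource Manager and request task containers), so treat this as the shape of the API rather than a complete application.

        import java.util.Collections;

        import org.apache.hadoop.yarn.api.records.ApplicationId;
        import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
        import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
        import org.apache.hadoop.yarn.api.records.Resource;
        import org.apache.hadoop.yarn.client.api.YarnClient;
        import org.apache.hadoop.yarn.client.api.YarnClientApplication;
        import org.apache.hadoop.yarn.conf.YarnConfiguration;
        import org.apache.hadoop.yarn.util.Records;

        public class SimpleYarnApp {
            public static void main(String[] args) throws Exception {
                YarnClient yarnClient = YarnClient.createYarnClient();
                yarnClient.init(new YarnConfiguration());
                yarnClient.start();

                // Ask the Resource Manager for a new application id.
                YarnClientApplication app = yarnClient.createApplication();
                ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
                appContext.setApplicationName("simple-yarn-app");   // placeholder name

                // Describe the container that will run the ApplicationMaster.
                ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
                amContainer.setCommands(Collections.singletonList("/bin/date")); // stand-in AM command
                appContext.setAMContainerSpec(amContainer);
                appContext.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore for the AM

                ApplicationId appId = yarnClient.submitApplication(appContext);
                System.out.println("Submitted application " + appId);
            }
        }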
  • 29. YARN MR Application Execution Flow
    [Diagram, steps 1–4: 1. The client's Job object runs the job → 2. Get a new application from the Resource Manager → 3. Copy job resources to HDFS → 4. Submit the application to the Resource Manager.]
  • 30. YARN MR Application Execution Flow
    [Diagram, steps 5–9: 5. The Resource Manager starts the MR AppMaster container → 6. The Node Manager creates the container → 7. The MR AppMaster gets the input splits from HDFS → 8. It requests resources from the Resource Manager → 9. Containers are started for the tasks.]
  • 31. YARN MR Application Execution Flow
    [Diagram, steps 10–12: 10. The Node Manager creates the task container → 11. The YarnChild acquires the job resources from HDFS → 12. It executes the Map or Reduce task in a task JVM on the DataNode.]
  • 32. YARN MR Application Execution Flow
    [Diagram: The running tasks update their status to the MR AppMaster, and the client polls the MR AppMaster for status.]
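    The "poll for status" step can be seen from the client side with the MRv2 Job API. A small sketch under the assumption of an already-configured Job object; note that job.waitForCompletion(true) performs the same submit-and-poll loop internally.

        import org.apache.hadoop.mapreduce.Job;

        public class StatusPoller {
            // Submit the job without blocking, then poll its progress until it completes.
            static boolean runAndPoll(Job job) throws Exception {
                job.submit();                               // steps 1-4: submit the application
                while (!job.isComplete()) {                 // client polls the MR AppMaster
                    System.out.printf("map %.0f%%  reduce %.0f%%%n",
                            job.mapProgress() * 100, job.reduceProgress() * 100);
                    Thread.sleep(5000);                     // arbitrary 5-second polling interval
                }
                return job.isSuccessful();
            }
        }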
  • 33. Hadoop 2.0: YARN Workflow
    [Diagram: The Resource Manager (Scheduler and Applications Manager, AsM) coordinates many Node Managers. App Master 1 runs Containers 1.1 and 1.2, and App Master 2 runs Containers 2.1, 2.2, and 2.3, each hosted by a Node Manager.]
  • 34. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
    - Single point of failure of the NameNode
    - Only one version can be run in classic MapReduce
    - Too much burden on the JobTracker
  • 35. Annie's Answer
    Too much burden on the JobTracker.
  • 36. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    In YARN, the functionality of the JobTracker has been replaced by which of the following YARN features?
    - Job scheduling
    - Task monitoring
    - Resource management
    - Node management
  • 37. Annie's Answer
    Task monitoring and resource management. The fundamental idea of YARN is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
  • 38. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications?
    - Node Manager
    - Job Tracker
    - Task Tracker
    - Application Master
    - Resource Manager
  • 39. Annie's Answer
    ApplicationMaster.
  • 40. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    Can we run MRv1 jobs in a YARN-enabled Hadoop cluster?
    - Yes
    - No
  • 41. Annie's Answer
    Yes. You need to recompile the jobs against MRv2 after enabling YARN to run them successfully in a YARN-enabled Hadoop cluster.
  • 42. Annie's Question
    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
    - Task Tracker
    - Resource Manager
    - ApplicationMaster
    - ApplicationMasterServer
  • 43. Annie's Answer
    ApplicationMaster.
  • 44. Module-10 Pre-work
     Write the MapReduce programs using MRv2 for your Module-3 assignments (see the WordCount sketch below).
     Assignment programs: WordCount, Patents, Temperature, and Alphabets
     Create a single-node Apache Hadoop cluster using the documents present in the LMS.
     Execute the 'teragen' example to test the cluster.
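    For the WordCount assignment, a minimal MRv2 sketch using the org.apache.hadoop.mapreduce API; this is a generic starting point, not the official assignment solution.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // Emits (word, 1) for every token in the input line.
            public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
                private final static IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, ONE);
                    }
                }
            }

            // Sums the counts for each word; also usable as a combiner.
            public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                private final IntWritable result = new IntWritable();

                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    result.set(sum);
                    context.write(key, result);
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Package the class into a jar and run it with the hadoop jar command, passing HDFS input and output paths as the two arguments.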
  • 45. Thank You See You in Class Next Week
  • 46. NN High Availability – Quorum Journal Manager
     The Active NameNode writes namespace modifications to edit logs on a quorum of JournalNodes; only the Active NameNode is the writer.
     The Standby NameNode reads the edit logs and applies them to its own namespace.
     DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
    [Diagram: Active and Standby NameNodes connected through three JournalNodes; DataNodes holding data blocks report to both NameNodes.]
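    To contrast QJM with the NFS approach shown earlier, a hedged configuration sketch; the JournalNode hostnames and local path are made-up examples, and these dfs.* keys would normally go in hdfs-site.xml.

        import org.apache.hadoop.conf.Configuration;

        public class QjmConfigSketch {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Edit logs go to a quorum of three JournalNodes instead of an NFS filer.
                conf.set("dfs.namenode.shared.edits.dir",
                        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
                // Local directory where each JournalNode stores its copy of the edits.
                conf.set("dfs.journalnode.edits.dir", "/data/jn");
                System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
            }
        }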