Hadoop Developer

Slide 1 www.edureka.in/hadoop
Slide 2 – Course Topics
 Module 1: Understanding Big Data; Hadoop Architecture
 Module 2: Hadoop Cluster Configuration; Data Loading Techniques; Hadoop Project Environment
 Module 3: Hadoop MapReduce Framework; Programming in MapReduce
 Module 4: Advanced MapReduce; MRUnit Testing Framework
 Module 5: Analytics Using Pig; Understanding Pig Latin
 Module 6: Analytics Using Hive; Understanding Hive QL
 Module 7: Advanced Hive; NoSQL Databases and HBase
 Module 8: Advanced HBase; ZooKeeper Service
 Module 9: Hadoop 2.0 – New Features; Programming in MRv2
 Module 10: Apache Oozie; Real-World Datasets and Analysis; Project Discussion
Slide 3 – Topics for Today
 Revision
 Hadoop 2.0 New Features
 HDFS High Availability
 HDFS Federation
 YARN or MRv2
 YARN and the Hadoop Ecosystem
 YARN Components
 YARN Application Execution
 Running an Application with YARN
 Writing a YARN Application
Slide 4 – Let's Revise: HBase and ZooKeeper
 Components: the Master, the RegionServers (each with a memstore, WAL, and HFiles), and ZooKeeper (tracking /hbase/region 1, /hbase/region 2, …), all running on top of HDFS.
Slide 5 – Hadoop 1.0: In Summary
 The client talks to HDFS (Name Node, Secondary Name Node, and Data Nodes holding data blocks) and to MapReduce (Job Tracker, with a Task Tracker running MapReduce tasks on each Data Node).
Slide 6 – Hadoop 1.0: Challenges
 NameNode – no horizontal scalability: a single NameNode and a single namespace, limited by NameNode RAM.
 NameNode – no high availability (HA): the NameNode is a single point of failure; recovery via the Secondary NameNode is manual.
 Job Tracker – overburdened: spends a significant portion of its time and effort managing the life cycle of applications.
 MRv1 – only Map and Reduce tasks: the humongous data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing.
Slide 7 – NameNode: Scale and HA
 NameNode – no high availability; NameNode – no horizontal scale.
 A single NameNode (namespace plus block management) serves block locations to clients, which then read data directly from the Data Nodes.
Slide 8 – Name Node: Single Point of Failure
 Secondary NameNode:
 "Not a hot standby" for the NameNode
 Connects to the NameNode every hour*
 Housekeeping, backup of NameNode metadata
 The saved metadata can rebuild a failed NameNode
("You give me metadata every hour, I will make it secure.")
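The checkpoint idea on this slide can be sketched in a few lines of plain Java. This is an illustration only, not Hadoop code; the class and method names are hypothetical. It shows why the Secondary NameNode is housekeeping rather than a hot standby: it folds the edit log into a saved image, but nothing here serves clients.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of periodic checkpointing: the edit log records namespace changes
// since the last checkpoint; a checkpoint merges them into the saved image
// and truncates the log. The saved image can seed a rebuilt NameNode.
public class CheckpointSketch {
    Set<String> fsImage = new HashSet<>();    // last checkpointed namespace
    List<String> editLog = new ArrayList<>(); // edits since the last checkpoint

    // Record a namespace change (here, simply a created path).
    void create(String path) { editLog.add(path); }

    // Housekeeping: fold the edit log into the image and clear the log.
    void checkpoint() {
        fsImage.addAll(editLog);
        editLog.clear();
    }

    public static void main(String[] args) {
        CheckpointSketch nn = new CheckpointSketch();
        nn.create("/user/reports/q1.csv");
        nn.create("/user/reports/q2.csv");
        nn.checkpoint();
        System.out.println(nn.fsImage.size() + " paths checkpointed"); // 2 paths checkpointed
    }
}
```

If the NameNode dies between checkpoints, any edits made after the last merge are lost, which is why this scheme needs manual recovery rather than failover.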
Slide 9 – Job Tracker: Overburdened
 CPU: spends a very significant portion of time and effort managing the life cycle of applications.
 Network: a single listener thread communicates with the Task Trackers about thousands of Map and Reduce jobs.
Slide 10 – MRv1: Unpredictability in Large Clusters
As the cluster grows and approaches 4,000 nodes:
 Cascading failures: DataNode failures cause a serious deterioration of overall cluster performance, because attempts to re-replicate data overload the live nodes and flood the network.
 Multi-tenancy: as clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop; they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.
Slide 11 – Unutilized Data in HDFS
 Terabytes and petabytes of data in HDFS can be used only for MapReduce processing.
Slide 12 – Hadoop 2.0 New Features
 Federation: Hadoop 1.0 has one NameNode and one namespace; Hadoop 2.0 supports multiple NameNodes and namespaces.
 High Availability: not present in Hadoop 1.0; Hadoop 2.0 is highly available.
 YARN (processing control and multi-tenancy): the JobTracker and Task Tracker of Hadoop 1.0 become the Resource Manager, Node Manager, App Master, and Capacity Scheduler in Hadoop 2.0.
Other important Hadoop 2.0 features:
 HDFS Snapshots
 NFSv3 access to data in HDFS
 Support for running Hadoop on MS Windows
 Binary compatibility for MapReduce applications built on Hadoop 1.0
 Substantial integration testing with the rest of the Hadoop ecosystem projects (such as Pig and Hive)
Slide 13 – Hadoop 2.0 Cluster Architecture: Federation
 Hadoop 1.0: a single Namenode holds the namespace (NS) and block management; Datanodes provide the block storage.
 Hadoop 2.0: multiple Namenodes (NN-1 … NN-k … NN-n) each manage an independent namespace (NS1 … NSk … NSn) with its own block pool (Pool 1 … Pool k … Pool n); all block pools share a common storage layer of Datanodes (Datanode 1 … Datanode m).
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
Slide 14 – Annie's Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
How does HDFS Federation help HDFS scale horizontally?
- It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
- It provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
Slide 15 – Annie's Answer
In order to scale the name service horizontally, HDFS Federation uses multiple independent Namenodes. The Namenodes are federated, that is, they are independent and don't require coordination with each other.
Slide 16 – Annie's Question
You have configured two name nodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
Slide 17 – Annie's Answer
The put will fail. Neither namespace will manage the file, and you will get an IOException with a "No such file or directory" error.
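The behaviour Annie describes can be sketched as a ViewFS-style client-side mount table: each path prefix routes to the NameNode that owns that namespace volume, and paths outside every mounted namespace fail. This is an illustration, not the Hadoop API; the class, method, and host names are hypothetical.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: route each path prefix to the NameNode responsible for that
// namespace volume; unmapped paths fail, as in the federation quiz above.
public class MountTable {
    private final Map<String, String> mounts = new LinkedHashMap<>();

    public void mount(String pathPrefix, String nameNode) {
        mounts.put(pathPrefix, nameNode);
    }

    // Resolve a path to its owning NameNode, or fail like an HDFS put would.
    public String resolve(String path) throws IOException {
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            if (path.equals(e.getKey()) || path.startsWith(e.getKey() + "/")) {
                return e.getValue();
            }
        }
        throw new IOException(path + ": No such file or directory");
    }

    public static void main(String[] args) throws IOException {
        MountTable table = new MountTable();
        table.mount("/marketing", "nn1.example.com");
        table.mount("/finance", "nn2.example.com");
        System.out.println(table.resolve("/marketing/q3/report.csv")); // nn1.example.com
        try {
            table.resolve("/accounting/ledger.csv"); // no namespace owns this prefix
        } catch (IOException e) {
            System.out.println("put fails: " + e.getMessage());
        }
    }
}
```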
Slide 18 – Hadoop 2.0 Cluster Architecture: HA
NameNode High Availability:
 All namespace edits are logged to shared NFS storage; there is a single writer (fencing).
 The Standby NameNode reads the edit logs and applies them to its own namespace.
 Each slave node runs a Data Node and a Node Manager; the Node Managers host containers and App Masters for YARN (next-generation MapReduce) under a single Resource Manager.
 *It is not necessary to configure a Secondary NameNode.
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
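The "single writer, standby reader" arrangement on this slide can be modelled in a few lines. This is a toy sketch, not the Hadoop implementation; the class names and the idea of an edit being just a created path are simplifications. It shows why the standby is "hot": it continuously replays the shared log into its own namespace copy, so takeover needs no lengthy rebuild.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the active NameNode appends every namespace edit to a shared log;
// the standby tails the log from its last applied transaction id and replays
// the new edits into its own in-memory namespace.
public class SharedEditLog {
    private final List<String> edits = new ArrayList<>(); // shared NFS-backed log

    public void append(String edit) { edits.add(edit); }  // single writer: active NN

    public List<String> readFrom(int txId) {              // standby tails new edits
        return new ArrayList<>(edits.subList(txId, edits.size()));
    }

    public static void main(String[] args) {
        SharedEditLog log = new SharedEditLog();
        StandbyNameNode standby = new StandbyNameNode();
        log.append("/user/a");
        log.append("/user/b");
        standby.catchUp(log);
        System.out.println(standby.contains("/user/b")); // true
    }
}

class StandbyNameNode {
    private final Map<String, Boolean> namespace = new HashMap<>();
    private int lastAppliedTxId = 0;

    // Replay new edits into this node's own namespace copy.
    public void catchUp(SharedEditLog log) {
        for (String edit : log.readFrom(lastAppliedTxId)) {
            namespace.put(edit, true); // here an "edit" is just a created path
            lastAppliedTxId++;
        }
    }

    public boolean contains(String path) { return namespace.containsKey(path); }
}
```

Fencing (ensuring the old active really has stopped writing before the standby takes over) is the hard part in practice and is not modelled here.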
Slide 19 – NameNode Recovery vs. Failover
 NameNode recovery in Hadoop 1.0: manually recover the NameNode's metadata (FSImage and edit logs) from the Secondary Name Node.
 High availability in Hadoop 2.0: automatic failover from the Active NameNode to the Standby NameNode.
Slide 20 – Annie's Question
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the Job Tracker
Slide 21 – Annie's Answer
Single point of failure of the NameNode.
Slide 22 – Hadoop 2.0: In Summary
 Masters: the Active NameNode and Standby NameNode (with shared edit logs; a Secondary Name Node is optional) for HDFS distributed data storage, and the YARN Resource Manager (Scheduler and Applications Manager, AsM) for next-generation MapReduce.
 Slaves: each runs a Data Node (distributed data storage) and a Node Manager hosting containers and App Masters (distributed data processing).
Slide 23 – YARN and the Hadoop Ecosystem
 Stack: Apache Oozie (workflow); Pig Latin (data analysis); Hive (DW system); the MapReduce framework; HBase; other YARN frameworks (MPI, Giraph); YARN (cluster resource management); HDFS (Hadoop Distributed File System).
 YARN adds a more general interface to run non-MapReduce jobs (such as graph processing) within the Hadoop framework.
Slide 24 – YARN: Moving Beyond MapReduce
 BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, …)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 25 – Multi-tenancy: the Capacity Scheduler
 Organizes jobs into queues; queue shares are expressed as percentages of the cluster
 FIFO scheduling within each queue
 Data-locality-aware scheduling
 Hierarchical queues: manage resources within an organization
 Capacity guarantees: a fraction of the total available capacity is allocated to each queue
 Security: safeguards applications from other users
 Elasticity: resources are available to queues in a predictable and elastic manner
 Multi-tenancy: limits prevent over-utilization of resources by a single application
 Operability: runtime configuration of queues
 Resource-based scheduling: if needed, applications can request more resources than the default
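The interplay of capacity guarantees and elasticity can be made concrete with some simple arithmetic. This is an illustrative sketch, not the Hadoop scheduler code; the class name, method names, and the slot-based model are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: queue shares are percentages of total cluster capacity; elasticity
// lets a busy queue borrow whatever capacity the other queues leave idle.
public class CapacitySketch {

    // Guaranteed slots for each queue, from its configured share of the cluster.
    public static Map<String, Integer> guarantees(int clusterSlots, Map<String, Integer> sharePct) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sharePct.entrySet()) {
            out.put(e.getKey(), clusterSlots * e.getValue() / 100);
        }
        return out;
    }

    // Elastic limit: a queue may use its guarantee plus any capacity others leave idle.
    public static int elasticLimit(int clusterSlots, int guaranteed, int usedByOthers) {
        return Math.max(guaranteed, clusterSlots - usedByOthers);
    }

    public static void main(String[] args) {
        Map<String, Integer> shares = new LinkedHashMap<>();
        shares.put("dev", 40);
        shares.put("prod", 60);
        System.out.println(guarantees(100, shares));      // {dev=40, prod=60}
        System.out.println(elasticLimit(100, 40, 20));    // 80: dev borrows idle capacity
        System.out.println(elasticLimit(100, 40, 90));    // 40: never below the guarantee
    }
}
```

The second case shows the multi-tenancy guarantee: however busy the cluster is, a queue can always count on its configured fraction.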
Slide 26 – Executing a MapReduce Application on YARN
MapReduce Application Execution
Slide 27 – YARN MR Application Execution Flow
 MapReduce job execution stages: job submission, job initialization, task assignment, tasks' memory, status updates, failure recovery.
 Client: submits a MapReduce job.
 Resource Manager: manages resource utilization across the Hadoop cluster.
 Node Manager: runs on each Data Node, creates execution containers, and monitors their usage.
 MapReduce Application Master: coordinates and manages MapReduce jobs; negotiates with the Resource Manager to schedule tasks; the tasks are started by the NodeManager(s).
 HDFS: shares resources and the job's artefacts among the YARN components.
Slide 28 – YARN = Yet Another Resource Negotiator
 Resource Manager: cluster-level resource manager; long-lived, on high-quality hardware.
 Node Manager: one per Data Node; monitors the resources on its Data Node.
 Application Master: one per application; short-lived; manages tasks and scheduling.
 A JobHistory Server records completed jobs. Message flows: job submission, node status, resource requests, MapReduce status.
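The negotiation between these components can be sketched as a toy model. The class names here are illustrative, not the YARN API: an Application Master asks the Resource Manager for containers, and the RM grants them from Node Managers that still have spare capacity.

```java
import java.util.ArrayList;
import java.util.List;

// A Node Manager reports how many container slots it has free.
class NodeManagerSim {
    final String host;
    int freeContainers;

    NodeManagerSim(String host, int freeContainers) {
        this.host = host;
        this.freeContainers = freeContainers;
    }
}

// Sketch of the Resource Manager's side of resource negotiation: grant up to
// the requested number of containers from nodes with remaining capacity.
public class ResourceManagerSim {
    private final List<NodeManagerSim> nodes = new ArrayList<>();

    public void register(NodeManagerSim nm) { nodes.add(nm); }

    public List<String> allocate(int wanted) {
        List<String> granted = new ArrayList<>();
        for (NodeManagerSim nm : nodes) {
            while (nm.freeContainers > 0 && granted.size() < wanted) {
                nm.freeContainers--;
                granted.add(nm.host);
            }
        }
        return granted; // may be fewer than wanted if the cluster is full
    }

    public static void main(String[] args) {
        ResourceManagerSim rm = new ResourceManagerSim();
        rm.register(new NodeManagerSim("node1", 2));
        rm.register(new NodeManagerSim("node2", 1));
        System.out.println(rm.allocate(4)); // [node1, node1, node2]
    }
}
```

The real scheduler also weighs data locality and queue capacity when choosing nodes; the sketch deliberately leaves that out.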
Slide 29 – YARN MR Application Execution Flow (1)
1. Run job  2. Get new application  3. Copy job resources to HDFS  4. Submit application.
(The client JVM runs on the client machine; the Resource Manager on the management node; the Node Manager, MR AppMaster, and YarnChild task JVMs on the DataNodes.)
Slide 30 – YARN MR Application Execution Flow (2)
5. Start the MR AppMaster container  6. Create the container  7. Get input splits  8. Request resources  9. Start container.
Slide 31 – YARN MR Application Execution Flow (3)
10. Create the task container  11. Acquire job resources from HDFS  12. Execute the Map or Reduce task in the YarnChild task JVM.
Slide 32 – YARN MR Application Execution Flow (4)
The tasks update their status in the MR AppMaster; the client polls the MR AppMaster for job status.
Slide 33 – Hadoop 2.0: YARN Workflow
The Resource Manager (Scheduler plus Applications Manager, AsM) coordinates many Node Managers: App Master 1 runs Containers 1.1 and 1.2, and App Master 2 runs Containers 2.1, 2.2, and 2.3, each container on its own Node Manager.
Slide 34 – Annie's Question
YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the Job Tracker
Slide 35 – Annie's Answer
Too much burden on the Job Tracker.
Slide 36 – Annie's Question
In YARN, the functionality of the JobTracker has been replaced by which of the following YARN features?
- Job scheduling
- Task monitoring
- Resource management
- Node management
Slide 37 – Annie's Answer
Task monitoring and resource management. The fundamental idea of YARN is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Slide 38 – Annie's Question
In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications?
- Node Manager
- Job Tracker
- Task Tracker
- Application Master
- Resource Manager
Slide 39 – Annie's Answer
ApplicationMaster.
Slide 40 – Annie's Question
Can we run MRv1 jobs in a YARN-enabled Hadoop cluster?
- Yes
- No
Slide 41 – Annie's Answer
Yes. Jobs written against the old mapred API are binary compatible and run unchanged; jobs built with the newer mapreduce API should be recompiled against the MRv2 libraries to run successfully in a YARN-enabled Hadoop cluster.
Slide 42 – Annie's Question
Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer
Slide 43 – Annie's Answer
ApplicationMaster.
Slide 44 – Module 10 Pre-work
 Write the MapReduce programs for your Module 3 assignments using MRv2.
 Assignment programs: WordCount, Patents, Temperature, and Alphabets.
 Create a single-node Apache Hadoop cluster using the documents present in the LMS.
 Execute the 'teragen' example to test the cluster.
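For the WordCount assignment, the core logic can be rehearsed in plain Java before wrapping it in MRv2 classes. This sketch has no Hadoop dependencies and is only an illustration of the phases: the map phase emits one record per word, the shuffle groups records by key, and the reduce phase sums each group. The real assignment puts the same logic into MRv2 Mapper and Reducer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java rehearsal of WordCount: map emits words, reduce sums per word.
public class WordCountSketch {

    // Map phase: emit one record per token in the input line.
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) emitted.add(word);
        }
        return emitted;
    }

    // Shuffle + reduce: group the emitted words by key and sum the counts.
    static Map<String, Integer> reduce(List<String> emitted) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : emitted) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        for (String line : new String[]{"to be or not", "to be"}) {
            emitted.addAll(map(line)); // in MRv2, map runs once per input split
        }
        System.out.println(reduce(emitted)); // {be=2, not=1, or=1, to=2}
    }
}
```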
Slide 45 – Thank You
See You in Class Next Week
Slide 46 – NN High Availability: Quorum Journal Manager
 The Active Name Node writes namespace modifications to the edit logs on a group of Journal Nodes; only the Active Name Node is the writer.
 The Standby Name Node reads the edit logs and applies them to its own namespace.
 Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
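The quorum idea behind the Journal Manager reduces to simple majority arithmetic, sketched below. This is an illustration of the principle, not the Hadoop QJM implementation: with 2f+1 Journal Nodes, an edit counts as committed once a strict majority acknowledge it, so up to f journal failures can be tolerated.

```java
// Majority arithmetic behind a quorum journal: commit on majority ack.
public class QuorumSketch {

    // An edit is committed when acks form a strict majority of the journal set.
    static boolean committed(int journalNodes, int acks) {
        return acks > journalNodes / 2;
    }

    // Largest number of Journal Node failures a set of this size tolerates.
    static int tolerated(int journalNodes) {
        return (journalNodes - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(committed(3, 2)); // true: 2 of 3 is a majority
        System.out.println(committed(3, 1)); // false: no majority, edit not durable
        System.out.println(tolerated(5));    // 2: a 5-node journal survives 2 failures
    }
}
```

This is why Journal Node sets are deployed in odd sizes: going from 3 to 4 nodes raises the majority threshold without tolerating any additional failures.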
