2. Slide 2
Hello there! My name is Annie.
Let me test your Hadoop 1.x knowledge.
Annie’s Introduction
3. Slide 3
Hello there! My name is Annie.
I love quizzes and puzzles, and I am here to make you think and answer my questions.
Can you store 1 billion files in a Hadoop 1.x cluster?
- Yes
- No
Annie’s Question
4. Slide 4
No. Even with hundreds of DataNodes in the cluster, the single NameNode
in Hadoop 1.x keeps all of its metadata in memory, so the entire cluster
is limited to roughly 50-100 million files.
Annie’s Answer
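The limit in the answer above comes down to simple arithmetic. Here is a rough sketch, assuming the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and one block per file. Both figures are assumptions for illustration, not numbers from this deck.

```python
# Rough NameNode heap estimate.
# BYTES_PER_OBJECT is a rule-of-thumb assumption (about 150 bytes per
# file or block object held in NameNode memory), not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimate the heap (in GB) needed to hold metadata for num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT / 1e9


if __name__ == "__main__":
    for files in (100_000_000, 1_000_000_000):
        print(f"{files:>13,} files -> ~{namenode_heap_gb(files):.0f} GB of heap")
```

Under these assumptions, 100 million files need about 30 GB of heap, while 1 billion files would need about 300 GB on a single NameNode.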
5. Slide 5
A Hadoop 1.x cluster can have multiple HDFS Namespaces.
- True
- False
Annie’s Question
7. Slide 7
Which of the following are significant disadvantages of Hadoop 1.0?
- ‘Single Point Of Failure’ of NameNode
- Too much burden on Job Tracker
Annie’s Question
8. Slide 8
Both. The NameNode is a single point of failure, and the Job Tracker
carries too much of the burden.
Annie’s Answer
9. Slide 9
Can you use hundreds of Hadoop DataNodes for any processing other
than MapReduce in Hadoop 1.x?
- Yes
- No
Annie’s Question
10. Slide 10
No. Hadoop 1.x dedicates all the DataNode resources to Map and
Reduce slots, with little or no room for processing any other
workload.
Annie’s Answer
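The slot model in the answer above can be sketched as a toy scheduler. The function name, the slot counts and the "graph" task type are hypothetical and only illustrate the idea. This is not MRv1 code.

```python
def schedule_mrv1(map_slots: int, reduce_slots: int, pending: list):
    """Place pending tasks into fixed Map/Reduce slots.

    A task is placed only if a free slot of exactly its type exists, so a
    task type other than "map" or "reduce" can never be placed.
    Returns (placed, rejected, free_slots).
    """
    free = {"map": map_slots, "reduce": reduce_slots}
    placed, rejected = [], []
    for task in pending:
        if free.get(task, 0) > 0:
            free[task] -= 1
            placed.append(task)
        else:
            rejected.append(task)
    return placed, rejected, free


if __name__ == "__main__":
    placed, rejected, free = schedule_mrv1(2, 2, ["map", "map", "map", "graph"])
    print("placed:", placed)
    print("rejected:", rejected)
    print("free slots:", free)
```

In this run the third map task and the graph task are both rejected even though two reduce slots sit idle, which is the rigidity the slide describes.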
11. Slide 11
Can you use Hadoop for Real-time processing?
- Yes
- No
Annie’s Question
12. Slide 12
No. Hadoop is designed and developed for massively parallel batch
processing.
Annie’s Answer
13. Limitations of Hadoop 1.x
No horizontal scalability of NameNode
Does not support NameNode High Availability
Overburdened JobTracker
Not possible to run Non-MapReduce Big Data Applications on HDFS
Does not support Multi-tenancy
15. Slide 15 www.edureka.in/hadoop
Hadoop 1.x - Challenges
Problem | Description
NameNode – No Horizontal Scalability | Single NameNode and single namespace, limited by NameNode RAM
NameNode – No High Availability (HA) | NameNode is a single point of failure; manual recovery using the Secondary NameNode is needed in case of failure
Job Tracker – Overburdened | Spends a significant portion of time and effort managing the life cycle of applications
MRv1 – Only Map and Reduce tasks | Huge volumes of data stored in HDFS remain unutilized and cannot be used for other workloads such as graph processing
16. NameNode - No High Availability
NameNode - No Horizontal Scale
[Figure: NameNode – Scale and HA. A Client gets block locations from the single NameNode (namespace NS plus Block Management) and reads data from the DataNodes.]
17. Slide 17
NameNode – Single Point of Failure
Secondary NameNode:
“Not a hot standby” for the NameNode
Connects to NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
[Figure: the NameNode (single point of failure) sends its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
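The housekeeping role described on this slide can be sketched as a toy checkpoint. The sketch assumes the metadata is just a set of paths and the edit log is a list of (operation, path) pairs. Both are simplifications for illustration, not the real on-disk formats.

```python
def checkpoint(fsimage: set, edit_log: list) -> set:
    """Replay edit-log entries on top of the last image and return the merged image."""
    merged = set(fsimage)  # copy, so the previous image is left untouched
    for op, path in edit_log:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown edit-log operation: {op}")
    return merged


if __name__ == "__main__":
    last_image = {"/data/a.txt"}
    edits_since_then = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
    print(checkpoint(last_image, edits_since_then))
```

Edits made after the most recent checkpoint are absent from the saved image, which is one reason the Secondary NameNode is a backup aid and not a hot standby.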
18. Slide 18
Job Tracker – Overburdened
CPU
Spends a very significant portion of time and effort managing
the life cycle of applications
Network
Single Listener Thread to communicate with thousands of
Map and Reduce Jobs
[Figure: one Job Tracker communicating with many Task Trackers]
19. Slide 19
MRv1 – Unpredictability in Large Clusters
As the cluster size grows and reaches 4,000 nodes
Cascading Failures
DataNode failures cause a serious deterioration of overall cluster
performance, because attempts to re-replicate data overload the
live nodes and flood the network.
Multi-tenancy
As clusters increase in size, you may want to employ them for a
variety of models. MRv1 dedicates its nodes to Hadoop, and they
cannot be repurposed for other applications and workloads in an
organization. With the growing popularity and adoption of cloud
computing among enterprises, this becomes more important.
20. Unutilized Data in HDFS
Terabytes and Petabytes of data in HDFS can only be used for MapReduce processing
21. Introducing Hadoop 2.0
Features | Hadoop 1.x | Hadoop 2.0
HDFS Federation | One NameNode and one Namespace | Multiple NameNodes and Namespaces
NameNode High Availability | Not present | Highly available
YARN - Processing Control and Multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler
Other important Hadoop 2.0 features
- HDFS Snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- A substantial amount of integration testing with the rest of the projects in the Hadoop ecosystem (such as Pig and Hive)
22. Namenode
[Figure: Hadoop 1.0 versus Hadoop 2.0. Hadoop 1.0: a single NameNode (namespace NS plus Block Management) over storage on the DataNodes. Hadoop 2.0: NameNodes NN-1 … NN-k … NN-n, each with its own namespace (NS1 … NSk … NSn) and block pool (Pool 1 … Pool k … Pool n), over common storage on DataNodes 1 … m.]
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
Hadoop 2.0 Cluster Architecture - Federation
23. Slide 23
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
A) Reduces the load on any single NameNode by using the multiple,
independent NameNodes to manage individual parts of the file system
namespace.
B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster
administrator to split the Block Storage outside the local cluster.
24. Slide 24
Annie’s Answer
(A). In order to scale the name service horizontally, HDFS federation uses
multiple independent NameNodes. The NameNodes are federated, that is, the
NameNodes are independent and do not require coordination with each other.
25. Slide 25
Annie’s Question
You have configured two NameNodes to manage /marketing and /finance
namespaces respectively. What will happen if you try to ‘put’ a file in
/accounting directory?
26. Slide 26
Annie’s Answer
The ‘put’ will fail. Neither namespace manages the /accounting path, so you will
get an IOException with a “No such file or directory” error.
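The behaviour in this answer can be sketched as a toy mount-table lookup. The NameNode names and the resolve helper are hypothetical. This is an illustration of the idea, not the HDFS client API.

```python
# Hypothetical mount table: path prefix -> name of the NameNode that owns it.
MOUNT_TABLE = {
    "/marketing": "nn-marketing",
    "/finance": "nn-finance",
}


def resolve(path: str, mount_table: dict) -> str:
    """Return the NameNode responsible for path, or fail if no namespace owns it."""
    for prefix, namenode in mount_table.items():
        # Match the prefix itself or anything below it, but not "/financeX".
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    # FileNotFoundError is a subclass of OSError (alias IOError) in Python.
    raise FileNotFoundError(f"No such file or directory: {path}")


if __name__ == "__main__":
    print(resolve("/finance/q1.csv", MOUNT_TABLE))
    try:
        resolve("/accounting/ledger.csv", MOUNT_TABLE)
    except IOError as err:
        print("put failed:", err)
```

Each NameNode answers only for its own part of the namespace, so a path outside every configured prefix has no owner to accept the write.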
27. Slide 27
[Figure: Hadoop 2.0 Cluster Architecture - HA. HDFS side (NameNode High Availability): a Client, an Active NameNode, a Standby NameNode, shared edit logs and the Data Nodes. YARN side (Next Generation MapReduce): a Resource Manager and Node Managers, each Node Manager hosting Containers and App Masters alongside a Data Node.]
Shared edit logs: all namespace edits are logged to shared NFS storage; single writer (fencing).
Standby NameNode: reads the edit logs and applies them to its own namespace.
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
28. Slide 28
Hadoop 2.0 Cluster Architecture - HA
[Figure: two panels.
NameNode recovery in Hadoop 1.0: NameNode and Secondary NameNode (edit logs, metadata, FSImage); manual recovery using the Secondary NameNode.
High Availability in Hadoop 2.0: Active NameNode and Standby NameNode; automatic failover to the Standby NameNode.]
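The shared edit log with a single writer can be sketched as a toy model. The epoch counter used for fencing here is an illustrative assumption that stands in for whatever fencing mechanism a real deployment configures, and the class and method names are hypothetical.

```python
class SharedEditLog:
    """Toy shared edit log that accepts appends only from the current writer."""

    def __init__(self):
        self.entries = []
        self.epoch = 0  # identifies the current (single) writer

    def become_writer(self) -> int:
        """Start a new writer epoch, which fences off the previous writer."""
        self.epoch += 1
        return self.epoch

    def append(self, epoch: int, path: str) -> None:
        if epoch != self.epoch:
            raise PermissionError("fenced: another NameNode is now the writer")
        self.entries.append(path)


class NameNode:
    def __init__(self, log: SharedEditLog):
        self.log = log
        self.namespace = set()
        self.applied = 0   # number of log entries already applied
        self.epoch = None  # None means this node is a standby

    def catch_up(self) -> None:
        """Standby behaviour: read new edits and apply them to the local namespace."""
        for path in self.log.entries[self.applied:]:
            self.namespace.add(path)
        self.applied = len(self.log.entries)

    def activate(self) -> None:
        """Failover: apply outstanding edits, then take over as the single writer."""
        self.catch_up()
        self.epoch = self.log.become_writer()

    def create(self, path: str) -> None:
        self.log.append(self.epoch, path)  # log the edit first
        self.catch_up()                    # then apply it locally


if __name__ == "__main__":
    log = SharedEditLog()
    active, standby = NameNode(log), NameNode(log)
    active.activate()
    active.create("/data/a.txt")
    standby.activate()  # failover to the standby
    print(standby.namespace)
    try:
        active.create("/data/b.txt")  # the old active is fenced
    except PermissionError as err:
        print(err)
```

Because the standby keeps applying edits as they arrive, it already holds the namespace at failover time, unlike the manual rebuild from a Secondary NameNode checkpoint.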
29. Slide 29
Annie’s Question
Which of the following disadvantages of Hadoop 1.0 was NameNode HA
developed to overcome?
a) Single Point Of Failure of NameNode
b) Too much burden on Job Tracker
31. Apache Oozie (Workflow)
[Figure: two ecosystem stacks. Without YARN, the components shown are Apache Oozie (Workflow), Pig Latin (Data Analysis), Hive (DW System), HBase, the MapReduce Framework and HDFS (Hadoop Distributed File System). With YARN, the same components appear together with other YARN frameworks (MPI, GIRAPH), on top of YARN (Cluster Resource Management) and HDFS.]
YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework
YARN and Hadoop Ecosystem
33. Slide 33
Organizes jobs into queues
Queue shares as percentages of the cluster
FIFO scheduling within each queue
Data locality-aware scheduling
Hierarchical Queues: to manage resources within an organization.
Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.
Security: to safeguard applications from other users.
Elasticity: resources are available to queues in a predictable and elastic manner.
Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.
Operability: runtime configuration of queues.
Resource-based scheduling: if needed, applications can request more resources than the default.
Multi-tenancy - Capacity Scheduler
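The idea of hierarchical queues with percentage shares can be sketched by flattening a queue tree into absolute cluster shares. The queue names and percentages below are made up for illustration, and the function is a plain Python sketch, not the scheduler's own configuration format.

```python
def absolute_capacities(queues: dict, parent_share: float = 100.0,
                        prefix: str = "root") -> dict:
    """Flatten a queue tree into absolute shares of the cluster (in percent).

    queues maps a queue name to (percent_of_parent, children), where
    children is another dict of the same shape (empty for a leaf queue).
    """
    total = sum(share for share, _ in queues.values())
    if queues and abs(total - 100.0) > 1e-9:
        raise ValueError(f"children of {prefix} must add up to 100%, got {total}")
    result = {}
    for name, (share, children) in queues.items():
        path = f"{prefix}.{name}"
        absolute = parent_share * share / 100.0
        if children:
            result.update(absolute_capacities(children, absolute, path))
        else:
            result[path] = absolute
    return result


if __name__ == "__main__":
    # Hypothetical organization: two departments, one of them with sub-queues.
    tree = {
        "marketing": (60, {"adhoc": (25, {}), "reports": (75, {})}),
        "finance": (40, {}),
    }
    for queue, share in absolute_capacities(tree).items():
        print(f"{queue}: {share:.0f}% of the cluster")
```

With this tree, marketing's 60% is split into 15% for adhoc and 45% for reports, and finance keeps its 40% guarantee.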
34. Slide 34
Annie’s Question
Which of the following disadvantages of the Hadoop 1.0 MapReduce
framework was YARN developed to overcome?
a) Single Point Of Failure Of NameNode
b) Too much burden on Job Tracker
36. Slide 36
Hadoop 2.0 – In Summary
[Figure: a Client talks to the Masters. HDFS (Distributed Data Storage, NameNode High Availability): Active NameNode and Standby NameNode, connected through shared edit logs or Journal Nodes. YARN (Distributed Data Processing, Next Generation MapReduce): Resource Manager with its Scheduler and Applications Manager (AsM). Slaves: each runs a DataNode and a Node Manager hosting Containers and App Masters.]
37. Slide 37
Can you use Hadoop 2.0 for Real-time processing?
- Yes
- No
Annie’s Question
38. Slide 38
No. Even though YARN in Hadoop 2.0 supports multiple frameworks
for different workloads other than batch, you need Storm or S4 for
real-time processing.
Annie’s Answer
40. Slide 40
Storm is coming….
APACHE STORM
The Real-time Hadoop
• Continuous computation system
• Distributed, reliable, fault-tolerant, scalable and robust
• Suitable for Big Data processing
• Guarantees no data loss
Programming language agnostic
• JSON-based protocol for Ruby, Python etc.
Use case
• Stream processing
• Distributed RPC
• Continuous Computation
41. Hadoop Vs. Storm
Differences
Hadoop | Storm
Fundamentally a batch processing system | Real-time processing; processes unterminated streams of data (e.g. Twitter feeds) as the data arrives
MapReduce jobs run to completion | Topologies (computation graphs) run forever
Stateful nodes | Stateless nodes
Similarities
Hadoop | Storm
Scalable | Scalable
Guarantees no data loss | Guarantees no data loss
Open Source | Open Source
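The "run to completion" versus "run forever" difference can be sketched in plain Python. This is only an analogy built on generators. It uses neither the Hadoop nor the Storm API, and the function names are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Iterable, Iterator


def batch_word_count(lines: list) -> Counter:
    """Batch style: the whole input exists up front and the job runs to completion."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(stream: Iterable[str]) -> Iterator[Counter]:
    """Stream style: emit an updated result after every record.

    If the source never ends, this generator never ends either.
    """
    counts = Counter()
    for line in stream:
        counts.update(line.split())
        yield Counter(counts)  # snapshot of the running totals


if __name__ == "__main__":
    print(batch_word_count(["storm hadoop", "storm"]))

    endless_feed = cycle(["storm hadoop"])  # stands in for an unterminated stream
    for snapshot in islice(streaming_word_count(endless_feed), 3):
        print(snapshot)
```

The batch function returns once, after the input is exhausted, while the streaming generator has a fresh answer available after every record and only stops here because the caller takes just three snapshots.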
42. Storm Use Cases
Data Normalization
• Groupon uses Storm to build real-time data integration systems.
Analytics
• Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners.
• Flipboard uses Storm across a wide range of services, from content search to real-time analytics to generating custom magazine feeds.
Log processing
• Alibaba uses Storm to process application logs and data changes in databases to supply real-time data stats for data apps.
• NaviSite uses Storm in its server log monitoring and auditing system.