2. Certified Big Data & Hadoop Training – DataFlair
Agenda
Introduction to Hadoop
Hadoop nodes & daemons
Hadoop Architecture
Characteristics
Hadoop Features
3. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
The Technology that empowers Yahoo, Facebook, Twitter, Walmart and others
Hadoop
4. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
An Open Source framework that
allows distributed processing of
large data-sets across the cluster
of commodity hardware
5. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
An Open Source framework that
allows distributed processing of
large data-sets across the cluster
of commodity hardware
Open Source
Source code is freely available
It may be redistributed and
modified
6. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
An open source framework that
allows Distributed Processing of
large data-sets across the cluster
of commodity hardware
Distributed Processing
Data is processed distributedly
on multiple nodes / servers
Multiple machines processes
the data independently
7. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
An open source framework that
allows distributed processing of
large data-sets across the Cluster
of commodity hardware
Cluster
Multiple machines connected
together
Nodes are connected via LAN
8. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
An open source framework that
allows distributed processing of
large data-sets across the cluster
of Commodity Hardware
Commodity Hardware
Economic / affordable
machines
Typically low performance
hardware
9. Certified Big Data & Hadoop Training – DataFlair
What is Hadoop?
• Open source framework written in Java
• Inspired by Google's Map-Reduce programming model as well as its file
system (GFS)
10. Certified Big Data & Hadoop Training – DataFlair
Hadoop defeated
Super computer
Hadoop became
top-level project
launched Hive,
SQL Support for Hadoop
Development of
started as Lucene sub-project
published GFS &
MapReduce papers
2002 2003 2005 2006 2008
Doug Cutting started
working on
Doug Cutting added
DFS & MapReduce
in
converted 4TB of
image archives over
100 EC2 instances
Doug Cutting
joined Cloudera
2009
2004
Hadoop History
2007
11. Certified Big Data & Hadoop Training – DataFlair
Hadoop Components
Hadoop consists of three key parts
12. Certified Big Data & Hadoop Training – DataFlair
Master Node Slave Node
Hadoop Nodes
Nodes
13. Certified Big Data & Hadoop Training – DataFlair
Master Node Slave Node
Hadoop Daemons
Resource
Manager
NameNode
Node
Manager
DataNode
Nodes
14. Certified Big Data & Hadoop Training – DataFlair
Sub Work Sub Work Sub Work Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Work
Sub Work Sub Work Sub Work Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Sub Work
Basic Hadoop Architecture
16. Certified Big Data & Hadoop Training – DataFlair
Open Source
• Source code is freely
available
• Can be redistributed
• Can be modified
Free
Affordable
Community
Transparent
Inter-
operable
No vendor
lock
Open
Source
17. Certified Big Data & Hadoop Training – DataFlair
Distributed Processing
• Data is processed distributedly
on cluster
• Multiple nodes in the cluster
process data independently
Centralized Processing
Distributed Processing
18. Certified Big Data & Hadoop Training – DataFlair
Fault Tolerance
• Failure of nodes are recovered
automatically
• Framework takes care of failure
of hardware as well tasks
19. Certified Big Data & Hadoop Training – DataFlair
Reliability
• Data is reliably stored on the
cluster of machines despite
machine failures
• Failure of nodes doesn’t
cause data loss
20. Certified Big Data & Hadoop Training – DataFlair
High Availability
• Data is highly available and
accessible despite hardware
failure
• There will be no downtime for
end user application due to
data
21. Certified Big Data & Hadoop Training – DataFlair
Scalability
• Vertical Scalability – New
hardware can be added to the
nodes
• Horizontal Scalability – New
nodes can be added on the fly
22. Certified Big Data & Hadoop Training – DataFlair
Economic
• No need to purchase costly license
• No need to purchase costly hardware
Economic
Open Source
Commodity
Hardware =
+
23. Certified Big Data & Hadoop Training – DataFlair
Easy to Use
• Distributed computing challenges
are handled by framework
• Client just need to concentrate on
business logic
24. Certified Big Data & Hadoop Training – DataFlair
Data Locality
• Move computation to data
instead of data to computation
• Data is processed on the nodes
where it is stored Storage Servers App Servers
Data Data
Data
Data
Servers
Data Data
Data
Data
Algorithm
Algo Algo
Algo
Algo
25. Certified Big Data & Hadoop Training – DataFlair
Summary
• Everyday we generate 2.3 trillion GBs of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & Yarn are two main components of Hadoop
• It is highly fault tolerant, reliable & available
26. Certified Big Data & Hadoop Training – DataFlair
Thank You
DataFlair
/c/DataFlairWS /DataFlairWS