This presentation about Hadoop will help you understand what Big Data is, what Hadoop is, how Hadoop came into existence, what the various components of Hadoop are, and how Hadoop is applied in a real use case. Today, a huge amount of data is generated every day, and this massive amount of data cannot be stored, processed, and analyzed using traditional methods. That is why Hadoop came into existence as a solution for Big Data. Hadoop is a framework that manages Big Data storage in a distributed way and processes it in parallel. Now, let us get started and understand the importance of Hadoop and why we actually need it.
The following topics are explained in this Hadoop presentation:
1. The rise of Big Data
2. What is Big Data?
3. Big Data and its challenges
4. Hadoop as a solution
5. What is Hadoop?
6. Components of Hadoop
7. Use case of Hadoop
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
5. Soon there was a high demand for other fruits. So,
he started harvesting apples and oranges as well
6. He then realizes that it is time-consuming and
difficult to harvest all the fruits by himself
7. So, he hires 2 more people to work with him. With
this, harvesting is done simultaneously
8. Now, storing and accessing all the fruits in a single
storage area becomes a bottleneck
9. Jack now decides to distribute the storage area
and give each one of them a separate storage
space
10. "Hello, I want a fruit basket of 3 grapes, 2 apples and 3 oranges"
11. To complete the order on time, all of them work
in parallel, each with their own storage space
12. This solution helps them complete the order on
time without any hassle
13. All of them are happy, and they are prepared
for an increase in demand in the future
14. So, how does this story relate to Big Data?
15. The rise of Big Data
Structured data
Earlier, with limited data, only one processor and one storage unit were needed
16. The rise of Big Data
Structured data
Semi-structured data
Unstructured data
Soon, data generation increased, leading to a high volume of data in
different formats
17. The rise of Big Data
A single processor was not enough to process such a high volume of different kinds
of data, as it was very time-consuming
18. The rise of Big Data
Hence, multiple processors were used to process the high volume of data, and this
saved time
19. The rise of Big Data
The single storage unit became the bottleneck, which generated network overhead
20. The rise of Big Data
The solution was to use distributed storage for each processor. This made it easy
to store and access data
21. The rise of Big Data
This method worked, and the network overhead was avoided
22. The rise of Big Data
This is known as parallel processing with distributed storage
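As a rough illustration of parallel processing with distributed storage, the sketch below gives each worker its own partition of the data and counts items in every partition at the same time before merging the partial results. The worker names and fruit data are hypothetical, and Python threads merely stand in for the separate machines a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each worker owns one local partition (its own "storage space").
partitions = {
    "worker-1": ["grape", "grape", "apple"],
    "worker-2": ["orange", "apple", "grape"],
    "worker-3": ["orange", "orange"],
}


def count_local(items):
    """Count the items held in a single worker's own partition."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def process_in_parallel(parts):
    """Run one counting task per partition concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = list(pool.map(count_local, parts.values()))
    total = {}
    for partial in partial_counts:
        for item, n in partial.items():
            total[item] = total.get(item, 0) + n
    return total


result = process_in_parallel(partitions)
print(result)
```

No worker ever reads another worker's partition, which is the point of pairing distributed storage with parallel processing: the only shared step is the final merge.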
30. What's in it for you?
1. Big Data and its challenges
2. Hadoop as a solution
3. What is Hadoop?
4. Components of Hadoop
5. Use case of Hadoop
32. What is Big Data?
A massive amount of data that cannot be stored, processed, and analyzed using traditional methods
33. What is Big Data?
A massive amount of data that cannot be stored, processed, and analyzed using traditional methods
The five V's of Big Data: Volume, Velocity, Variety, Value, Veracity
43. Big Data challenges and solution
Challenge: Single central storage. Solution: Distributed storage
Challenge: Serial processing (one input is processed at a time). Solution: Parallel processing (multiple inputs are processed at the same time)
Challenge: Lack of ability to process unstructured data. Solution: Ability to process every type of data
44. Hadoop as a solution
Distributed storage instead of single central storage
Parallel processing instead of serial processing
Ability to process every type of data, including unstructured data
46. What is Hadoop?
Hadoop is a framework that manages Big Data storage in a distributed way and processes it in parallel
Big Data: Storing, Processing, Analyzing
51. What is HDFS?
Hadoop Distributed File System (HDFS) is specially designed for storing huge datasets on commodity
hardware
Distributed storage
54. What is HDFS?
Hadoop Distributed File System (HDFS) has two core components: NameNode and DataNode
NameNode: There is only one active NameNode
DataNode: There can be multiple DataNodes
60. What is HDFS?
Master/slave nodes typically form the HDFS cluster
Master/NameNode
Slave/DataNode Slave/DataNode Slave/DataNode
The NameNode maintains and manages the
DataNodes. It also stores the metadata
DataNodes store the actual data and handle
reading, writing and processing. They perform
replication as well
A heartbeat is the signal that each DataNode
periodically sends to the NameNode.
This signal shows the status of the DataNode
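The heartbeat idea can be sketched in a few lines. The class below is a toy model, not HDFS code: `HeartbeatMonitor`, its method names, and the 10-second timeout are all assumptions made for illustration. It records when each DataNode last reported in and treats a node as dead once no heartbeat has arrived within the timeout.

```python
import time


class HeartbeatMonitor:
    """Toy stand-in for the NameNode's view of DataNode liveness."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def receive_heartbeat(self, node_id, now=None):
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def live_nodes(self, now=None):
        """Nodes whose last heartbeat is within the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            n for n, t in self.last_seen.items() if now - t <= self.timeout_seconds
        )

    def dead_nodes(self, now=None):
        """Nodes that have been silent for longer than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            n for n, t in self.last_seen.items() if now - t > self.timeout_seconds
        )


# Hypothetical timeline: both nodes report at t=0, only dn1 reports again at t=8.
monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.receive_heartbeat("dn1", now=0)
monitor.receive_heartbeat("dn2", now=0)
monitor.receive_heartbeat("dn1", now=8)
print(monitor.live_nodes(now=12), monitor.dead_nodes(now=12))
```

At t=12, dn1 was last seen 4 seconds ago and counts as live, while dn2 has been silent for 12 seconds and is reported as dead.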
65. What is HDFS?
In HDFS, data is stored in a distributed manner
30 TB of data is loaded
Data is divided into blocks of 128 MB each
Blocks are then replicated among the DataNodes
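To get a feel for the numbers, the following sketch works out how many 128 MB blocks a 30 TB file produces. The replication factor of 3 and the round-robin placement are assumptions for illustration only: the slides do not state a replication factor, and real HDFS placement is more involved than this.

```python
import math

BLOCK_SIZE_MB = 128        # block size given on the slide
REPLICATION_FACTOR = 3     # assumption: the slides do not state a value


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(block_id, data_nodes, replication=REPLICATION_FACTOR):
    """Simplified round-robin placement of one block's replicas on distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return [data_nodes[(block_id + i) % len(data_nodes)] for i in range(replication)]


file_size_mb = 30 * 1024 * 1024           # 30 TB expressed in MB (binary units)
blocks = block_count(file_size_mb)        # 245,760 blocks
replicas = blocks * REPLICATION_FACTOR    # 737,280 stored block copies

data_nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
print(blocks, replicas, place_replicas(0, data_nodes), place_replicas(3, data_nodes))
```

Under these assumptions, the cluster stores three times the raw data volume, which is the price paid for being able to lose a DataNode without losing data.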
76. What is MapReduce?
Hadoop MapReduce is a programming technique where huge data is processed in a parallel and
distributed fashion
MapReduce is used for parallel processing of the Big
Data, which is stored in HDFS
Diagram labels: Big Data, Processor, Output
79. What is MapReduce?
In the MapReduce approach, processing is done at the slave nodes and the final result is sent to the
master node
Traditional approach: data is processed at the master node
MapReduce approach: data is processed at the slave nodes
81. What is MapReduce?
Input: Ship Ship Train, Bus Ship Car, Bus Car Train
Split: [Ship Ship Train] [Bus Ship Car] [Bus Car Train]
The input dataset is first split into chunks of data
82. What is MapReduce?
Map phase, split 1: (Ship, 1) (Ship, 1) (Train, 1)
Map phase, split 2: (Bus, 1) (Ship, 1) (Car, 1)
Map phase, split 3: (Bus, 1) (Car, 1) (Train, 1)
These chunks of data are then processed by map tasks in parallel
84. What is MapReduce?
Input: Ship Ship Train, Bus Ship Car, Bus Car Train
Split: [Ship Ship Train] [Bus Ship Car] [Bus Car Train]
Map phase: (Ship, 1) (Ship, 1) (Train, 1), (Bus, 1) (Ship, 1) (Car, 1), (Bus, 1) (Car, 1) (Train, 1)
Shuffle and sort: Bus (1, 1), Car (1, 1), Ship (1, 1, 1), Train (1, 1)
Reduce phase: Bus, 2. Car, 2. Ship, 3. Train, 2
At the reduce task, the aggregation takes place and the final output is obtained
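The word count walked through above can be reproduced in plain Python. This is a single-process sketch of the map, shuffle-and-sort, and reduce steps, not Hadoop's MapReduce API. The function names are illustrative, and the map tasks run one after another here rather than in parallel on separate nodes.

```python
from collections import defaultdict

# The three input splits from the slides.
splits = ["Ship Ship Train", "Bus Ship Car", "Bus Car Train"]


def map_phase(split):
    """Emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle_and_sort(mapped):
    """Group the values of all map outputs by key, with keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Aggregate each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


mapped = [map_phase(s) for s in splits]   # one map task per split
grouped = shuffle_and_sort(mapped)
result = reduce_phase(grouped)
print(result)  # {'Bus': 2, 'Car': 2, 'Ship': 3, 'Train': 2}
```

Because each call to `map_phase` depends only on its own split, the map tasks could run on different nodes at once. Only the shuffle step needs to bring values for the same key together.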
85. Components of Hadoop version 2.0
HDFS: Storage unit of Hadoop
MapReduce: Processing unit of Hadoop
YARN: Resource management unit of Hadoop
96. What is YARN?
Client: submits the job request
Resource Manager: responsible for resource allocation and management
Node Manager: manages the nodes and monitors resource usage
Container: a collection of physical resources such as RAM and CPU
App Master: requests containers from the Resource Manager and works with the Node Managers to run them
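The sketch below models how a resource manager might hand out containers, treating a container as a bundle of memory and CPU on one node. It is a toy model and not the YARN API: `ToyResourceManager`, `Node`, `Container`, and the first-fit allocation rule are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A bundle of physical resources reserved on one node."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class Node:
    """A worker node and the resources it still has free."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


class ToyResourceManager:
    """First-fit allocator: a stand-in for a real scheduler, for illustration only."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        """Return a Container on the first node with enough free resources, else None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                container = Container(node.name, memory_mb, vcores)
                node.containers.append(container)
                return container
        return None


# Hypothetical two-node cluster.
nodes = [Node("node-1", 4096, 2), Node("node-2", 8192, 4)]
rm = ToyResourceManager(nodes)
c1 = rm.allocate(2048, 1)    # fits on node-1
c2 = rm.allocate(4096, 2)    # node-1 is now too small, so node-2 is used
c3 = rm.allocate(16384, 1)   # no node has this much memory
print(c1, c2, c3)
```

A request that no node can satisfy returns `None` here. A real scheduler would typically queue such a request until resources free up.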
98. Hadoop use case – Combating fraudulent activities
Fraudulent activities
Detecting fraudulent transactions is one of the many problems that any bank faces
99. Hadoop use case – Combating fraudulent activities
Challenge: Zions' main challenge was to combat the fraudulent activities that were taking place
103. Hadoop use case – Combating fraudulent activities
Approaches used by Zions' security team to combat fraudulent activities
Security information management (SIM) tools. Problem: they were based on RDBMS and were unable to store the huge volume of data that needed to be analyzed
Parallel processing system. Problem: analyzing unstructured data was not possible
107. Hadoop use case – Combating fraudulent activities
How Hadoop solved the problems
Storing: Zions could now store massive amounts of data using Hadoop
Processing: Processing of unstructured data (like server logs, customer data, customer transactions) was now possible
Analyzing: In-depth analysis of different data formats became easy and time-efficient
Detecting: The team could now detect everything from malware and spear phishing attempts to account takeovers