2. Why Hadoop?
Suppose we want to process some data. In the traditional approach, data was stored and processed on local machines. As data volumes grew, local machines were no longer capable of storing these huge data sets, so data began to be stored on remote servers. To process that data, it first had to be fetched from the servers and then processed. If the data is, say, 500 GB, fetching it is very complex and expensive in practice. This approach is also called the Enterprise Approach.
In the Hadoop approach, instead of fetching the data to local machines, we send the query to the data. Obviously, the query is nowhere near as large as the data itself. Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. This is called parallel execution, and it is made possible by MapReduce. Not only is there no need to fetch the data, but the processing also takes less time. The result of the query is then sent back to the user. Thus Hadoop makes data storage, processing, and analysis far easier than the traditional approach.
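To see why fetching the data is expensive, a back-of-the-envelope calculation helps. This sketch reuses the 500 GB figure from above; the 1 Gbps network link is an illustrative assumption, not a figure from the text:

```python
# Rough cost of the traditional "fetch the data" approach.
# Assumption: a 1 Gbps network link (illustrative only).

data_gb = 500                                # data set size in gigabytes
link_gbps = 1                                # network bandwidth in gigabits/second

transfer_seconds = data_gb * 8 / link_gbps   # 500 GB = 4000 gigabits
transfer_minutes = transfer_seconds / 60

print(f"Transferring {data_gb} GB takes about {transfer_minutes:.0f} minutes")
# A query shipped to the data, by contrast, is typically a few kilobytes.
```

Even on this optimistic link, moving the data takes over an hour, while shipping the query to the data is effectively free.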
3. What Is Hadoop?
Companies using Hadoop-based systems:
Facebook
Google
JPMorgan Chase
Goldman Sachs
Yahoo
AWS
Microsoft
IBM
Cloudera
IQVIA
Rackspace Technology
And Many More
Job opportunities with Apache Hadoop:
Data Scientist
Big Data Visualizer
Big Data Research Analyst
Big Data Engineer
Big Data Analyst
Big Data Architect
And Many More
4. Hadoop Architecture
Hadoop Distributed File System: On a local PC, the default block size on the hard disk is typically 4 KB. When we install Hadoop, HDFS changes the default block size to 64 MB, since it is used to store huge data sets; the block size can also be set to 128 MB. HDFS works with a Name Node and Data Nodes. The Name Node is the master service: it keeps the metadata recording which commodity hardware each piece of data resides on, while the Data Nodes store the actual data. Because the block size is 64 MB rather than a few kilobytes, far fewer blocks are needed, so the storage required for metadata is reduced, making HDFS more efficient. Hadoop also stores three copies of every data block at three different locations. This ensures that Hadoop is not prone to a single point of failure.
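The effect of the larger block size on metadata volume can be sketched with a quick calculation. The 4 KB and 64 MB block sizes come from the text; the 500 GB file size reuses the example from section 2:

```python
# How many blocks (and thus NameNode metadata entries) a 500 GB file
# needs at different block sizes.
import math

data_bytes = 500 * 1024**3        # 500 GB

for label, block_bytes in [("4 KB", 4 * 1024), ("64 MB", 64 * 1024**2)]:
    blocks = math.ceil(data_bytes / block_bytes)
    print(f"{label} blocks: {blocks:,}")
# With 64 MB blocks the NameNode tracks 8,000 entries
# instead of roughly 131 million.
```

Fewer blocks means less metadata to hold in the NameNode's memory, which is exactly why HDFS favors large blocks.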
Map Reduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part processes the data concurrently. This parallel execution helps execute a query faster and makes Hadoop a suitable and optimal choice for dealing with Big Data.
YARN: Yet Another Resource Negotiator works like an operating system for Hadoop. Just as operating systems are resource managers, YARN manages Hadoop's resources so that Hadoop can serve big data more effectively.
Hadoop has a master-slave topology. In this topology, we have one master node and multiple slave nodes. The master node's function is to assign tasks to the various slave nodes and manage resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas the master holds the metadata, meaning data about data.
5. Map Reduce
MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express your business logic in the form MapReduce expects; the rest is taken care of by the framework. The work (complete job) submitted by the user to the master is divided into small units (tasks) and assigned to slaves.
MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, we take input from a list and convert it into output that is again a list. It is the heart of Hadoop; Hadoop is so powerful and efficient largely because MapReduce performs the processing in parallel.
There are two key words: Map and Reduce.
1. Map() performs sorting and filtering of the data, organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
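The Map and Reduce steps above can be sketched outside of Hadoop. This is a conceptual word-count example in plain Python, not the actual Hadoop Java API; it only mimics the map, shuffle/group, and reduce phases:

```python
# Conceptual MapReduce word count (plain Python, not the Hadoop API).
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/group: collect all values belonging to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate grouped values into a smaller set of tuples."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In real Hadoop, the map calls run in parallel on the slave nodes holding the data, and the framework performs the shuffle across the network; the structure of the computation, however, is exactly this.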
6. Hadoop YARN
Yet Another Resource Negotiator: as the name implies, YARN is the one that helps manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
a. Resource Manager
b. Node Manager
c. Application Manager
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and negotiates between the two as required.
7. Hadoop HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage layer of Hadoop. HDFS splits each data unit into smaller units called blocks and stores them in a distributed manner. It runs two daemons: one for the master node, the NameNode, and one for the slave nodes, the DataNode.
HDFS has a master-slave architecture. The daemon called NameNode runs on the master server. It is responsible for namespace management and regulates file access by clients. The DataNode daemon runs on the slave nodes and is responsible for storing the actual business data. Internally, a file gets split into a number of data blocks that are stored on a group of slave machines.
The NameNode manages modifications to the file system namespace. These are actions such as opening, closing, and renaming files or directories. The NameNode also keeps track of the mapping of blocks to DataNodes. The DataNodes serve read/write requests from the file system's clients; a DataNode also creates, deletes, and replicates blocks on demand from the NameNode.
Terms related to HDFS:
HeartBeat: the signal that a DataNode continuously sends to the NameNode. If the NameNode does not receive a heartbeat from a DataNode, it considers that node dead.
Balancing: if a DataNode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the remaining blocks. The master node (NameNode) then signals the DataNodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
Replication: it is done by the DataNodes.
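The heartbeat and balancing behavior described above can be sketched as a toy simulation. This is illustrative Python only, not HDFS code; the node names and block layout are hypothetical, while the replication factor of 3 is Hadoop's default as stated in the text:

```python
# Toy simulation of heartbeat loss and re-replication (not real HDFS code).
REPLICATION = 3   # Hadoop stores three copies of every block (from the text)

# Which DataNodes hold a replica of each block (hypothetical layout).
block_map = {
    "block-1": {"dn1", "dn2", "dn3"},
    "block-2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn4"}

def on_heartbeat_lost(dead_node):
    """NameNode logic: mark a node dead and re-replicate its blocks."""
    live_nodes.discard(dead_node)
    for block, holders in block_map.items():
        holders.discard(dead_node)
        while len(holders) < REPLICATION:
            # Pick any live node that does not already hold this block.
            candidates = live_nodes - holders
            if not candidates:
                break          # cluster too small to restore full replication
            holders.add(sorted(candidates)[0])

on_heartbeat_lost("dn2")       # dn2 stops sending heartbeats
print(block_map)
```

After the simulated failure of dn2, every block is back to three replicas, all on live nodes, which is the balancing behavior the NameNode drives in a real cluster.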
8. Features Of Hadoop
Economically Feasible: It is cheaper to store and process data than in the traditional approach, since the machines actually used to store the data are only commodity hardware.
Easy to Use: The projects and tools provided by Apache Hadoop are easy to work with when analyzing complex data sets.
Open Source: Since Hadoop is distributed as open source software under the Apache License, one does not need to pay for it; just download it and use it.
Scalability: Hadoop is highly scalable in nature. To scale the cluster up or down, one only needs to change the number of commodity machines in the cluster.
Fault Tolerance: Since Hadoop stores three copies of the data, even if one copy is lost because of a commodity hardware failure, the data is safe. Moreover, since Hadoop version 3 supports multiple NameNodes, even Hadoop's single point of failure has been removed.
Locality of Data: This is one of the most alluring and promising features of Hadoop. To process a query over a data set, instead of bringing the data to the local computer, we send the query to the server and fetch only the final result from there. This is called data locality.
Distributed Processing: HDFS and MapReduce ensure distributed storage and processing of the data.