Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make working with big data
easier. For those who are not acquainted with this technology, one question arises: what is big data?
Big data is a term for data sets that cannot be processed efficiently with traditional technologies such
as RDBMS. Hadoop has made its place in industries and companies that need to work on large data
sets that are sensitive and need efficient handling.
Hadoop is a framework that enables processing of large data sets that reside across clusters of machines.
Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of
technologies.
Introduction: The Hadoop ecosystem is a platform or suite that provides various services to solve
big data problems. It includes Apache projects as well as various commercial tools and solutions. There
are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the
other tools or solutions are used to supplement or support these major elements. All these tools work
collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too that are part
of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop; everything
revolves around data, which makes working with it easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large data
sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the
form of log files.
HDFS consists of two core components i.e.
• Name node
• Data Node
The Name Node is the prime node; it contains the metadata (data about data) and requires comparatively
fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity
hardware in the distributed environment, which is what makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of
the system.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps manage the resources across the
cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
• Resource Manager
• Node Manager
• Application Manager
The Resource Manager has the authority to allocate resources to the applications in the system, whereas
the Node Managers manage the resources, such as CPU, memory, and bandwidth, of each machine and
report back to the Resource Manager. The Application Manager works as an interface between the
Resource Manager and the Node Managers and negotiates resources as required by the two.
MapReduce:
Using distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps
write applications that transform big data sets into manageable results.
MapReduce uses two functions, Map() and Reduce(), whose tasks are as follows:
Map() performs sorting and filtering of data, thereby organizing it into groups. Map generates
key-value pairs as intermediate results, which are later processed by the Reduce() method.
Reduce(), as the name suggests, performs summarization by aggregating the mapped data. Put simply,
Reduce() takes the output generated by Map() as its input and combines those tuples into a smaller set
of tuples.
PIG:
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
It is a platform for structuring data flows and for processing and analyzing huge data sets.
Pig executes its commands while, in the background, all the MapReduce activity is taken care of. After
processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the
way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop
Ecosystem.
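To give a flavour of the language, below is a minimal Pig Latin sketch; the file paths, field names, and relation names are hypothetical. Pig compiles these statements into MapReduce jobs behind the scenes.

    -- Load a (hypothetical) tab-delimited access log from HDFS
    logs = LOAD '/user/demo/access_log' USING PigStorage('\t')
           AS (user:chararray, url:chararray, bytes:long);
    -- Keep only the large responses
    big = FILTER logs BY bytes > 1024;
    -- Count the remaining requests per user
    grouped = GROUP big BY user;
    counts = FOREACH grouped GENERATE group AS user, COUNT(big) AS requests;
    -- Store the result back into HDFS
    STORE counts INTO '/user/demo/user_request_counts';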
HIVE:
With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets.
Its query language is called HQL (Hive Query Language).
It is highly scalable and supports both interactive query processing and batch processing. All the
common SQL data types are supported by Hive, which makes query processing easier.
Like other query-processing frameworks, Hive comes with two access components: JDBC/ODBC drivers
and the Hive command line.
The JDBC and ODBC drivers establish connections and data-access permissions, whereas the Hive
command line helps in submitting and processing queries.
Mahout:
Mahout provides machine-learning capability to a system or application. Machine learning, as the name
suggests, helps a system to improve itself based on patterns, user/environment interaction, or
algorithms.
It provides libraries for machine-learning techniques such as collaborative filtering, clustering, and
classification, and it allows these algorithms to be invoked as needed through its own libraries.
Apache Spark:
Spark is a platform that handles resource-intensive tasks such as batch processing, interactive or
iterative real-time processing, graph processing, and visualization. It uses in-memory resources and is
therefore faster than MapReduce for many workloads.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and
batch processing, so many companies use the two side by side.
Apache HBase:
HBase is a NoSQL database that supports all kinds of data and is thus capable of handling almost any
workload within a Hadoop database. It provides capabilities similar to Google's Bigtable and so can work on big data sets effectively.
When we need to search for or retrieve a small piece of data from a huge database, the request must be
processed within a very short span of time. In such cases, HBase comes in handy, as it gives us a
fault-tolerant way of storing and looking up sparse data.
Other Components: Apart from all of these, there are some other components too that carry out a huge
task in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These two services perform searching and indexing with the help of Java libraries.
Lucene, in particular, is a Java library that also provides a spell-check mechanism. Solr is built on top of
Lucene.
Zookeeper: Managing coordination and synchronization among the resources and components of Hadoop
used to be a huge problem and often resulted in inconsistency. Zookeeper overcomes these problems by
providing synchronization, inter-component communication, grouping, and maintenance services.
Oozie: Oozie performs the task of a scheduler: it schedules jobs and binds them together as a single unit.
There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs
need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered
when some data or an external stimulus becomes available.
HDFS
HDFS is capable of handling data of large size with high volume, velocity, and variety, which makes
Hadoop work more efficiently and reliably with easy access to all its components. HDFS stores data in
the form of blocks, where the default size of each block is 128 MB. This is configurable, meaning you
can change it according to your requirements in the hdfs-site.xml file in your Hadoop configuration directory.
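As an illustration, the block size and replication factor can be overridden in hdfs-site.xml as sketched below; the values shown are examples, not recommendations.

    <!-- hdfs-site.xml: illustrative override of block size and replication factor -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>268435456</value> <!-- 256 MB in bytes; the default is 128 MB -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of copies kept for each block -->
      </property>
    </configuration>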
Some Important Features of HDFS (Hadoop Distributed File System)
• It is easy to access the files stored in HDFS.
• HDFS provides high availability and fault tolerance.
• It provides scalability to scale nodes up or down as per our requirements.
• Data is stored in a distributed manner, i.e., various DataNodes are responsible for storing the data.
• HDFS provides replication, because of which there is little fear of data loss.
• HDFS provides high reliability, as it can store data in the range of petabytes.
• HDFS has built-in servers in the NameNode and DataNodes that help easily retrieve cluster
information.
• It provides high throughput.
HDFS Architecture
As we all know, Hadoop works on the MapReduce algorithm, which follows a master-slave architecture,
and HDFS has a NameNode and DataNodes that work in a similar pattern.
1. NameNode (Master)
2. DataNode (Slave)
1. NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
The NameNode mainly stores the metadata, i.e. the data about the data. Metadata can be the transaction
logs that keep track of user activity in the Hadoop cluster.
Metadata can also include the file name, size, and information about the location (block number,
block IDs) of blocks on the DataNodes, which the NameNode stores so that it can find the closest DataNode
for faster communication. The NameNode instructs the DataNodes to carry out operations such as create,
delete, replicate, etc.
As the NameNode works as the master, it should have high RAM and processing power in order to
maintain and guide all the slaves in the Hadoop cluster. The NameNode receives heartbeat signals and
block reports from all the slaves, i.e., the DataNodes.
2. DataNode: DataNodes work as slaves and are mainly used for storing data in a Hadoop cluster. The
number of DataNodes can range from one to 500 or even more; the more DataNodes your Hadoop
cluster has, the more data can be stored. It is therefore advised that DataNodes have a high storage
capacity so that they can store a large number of file blocks. A DataNode performs operations such as
block creation, deletion, etc. according to the instructions provided by the NameNode.
Objectives and Assumptions Of HDFS
1. System failure: As a Hadoop cluster consists of lots of nodes that are commodity hardware, node
failure is possible, so a fundamental goal of HDFS is to detect such failures and recover from them.
2. Maintaining large data sets: As HDFS handles files ranging in size from gigabytes to petabytes, it
has to be able to deal with these very large data sets on a single cluster.
3. Moving data is costlier than moving the computation: If the computation is performed near the
location where the data resides, it runs faster, the overall throughput of the system increases, and
network congestion is minimized; HDFS therefore assumes it is better to move computation to the data.
4. Portability across various platforms: HDFS is portable, which allows it to move across diverse
hardware and software platforms.
5. Simple coherency model: HDFS follows a write-once, read-many access model for files. A file that
has been written and then closed should not be changed; data can only be appended. This assumption
helps minimize data-coherency issues, and MapReduce fits perfectly with this kind of file model; a
minimal Java client sketch of this access pattern follows below.
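The following is a minimal Java sketch of the write-once, read-many pattern using the Hadoop FileSystem API; the file path is hypothetical and error handling is omitted for brevity.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/notes.txt"); // hypothetical path

            // Write once: create the file, write to it, and close it
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("first line\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: the closed file can be read any number of times but not modified
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }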
Features of HDFS
Highly scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
Replication - Due to unfavourable conditions, the node containing the data may be lost. To overcome
such problems, HDFS always maintains copies of the data on different machines.
Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure.
HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data
automatically takes over.
Distributed data storage - This is one of the most important features of HDFS that makes Hadoop
very powerful. Here, data is divided into multiple blocks and stored across the nodes.
Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Second is the reduce task, which takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This
simple scalability is what has attracted many programmers to use the MapReduce model.
MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
Map stage − The map or mapper’s job is to process the input data. Generally, the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s
job is to process the data that comes from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster. The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes. Most of the computing takes place
on nodes with data on local disks that reduces the network traffic. After completion of the given tasks,
the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop
server.
Map and Reduce Tasks
A single processing run of the MapReduce processing engine is known as a MapReduce job. Each
MapReduce job is composed of a map task and a reduce task and each task consists of multiple stages.
The individual stages of each task are as follows:
Map task:
• map
• combine (optional)
• partition
Reduce task:
• shuffle and sort
• reduce
Map
The first stage of MapReduce is known as map, during which the dataset file is divided into multiple
smaller splits. Each split is parsed into its constituent records as a key-value pair. The key is usually the
ordinal position of the record, and the value is the actual record.
The parsed key-value pairs for each split are then sent to a map function or mapper, with one mapper
function per split. The map function executes user-defined logic. Each split generally contains multiple
key-value pairs, and the mapper is run once for each key-value pair in the split.
The mapper processes each key-value pair as per the user-defined logic and further generates a key-
value pair as its output. The output key can either be the same as the input key or a substring value from
the input value, or another serializable user-defined object. Similarly, the output value can either be the
same as the input value or a substring value from the input value, or another serializable user-defined
object.
When all records of the split have been processed, the output is a list of key-value pairs where multiple
key-value pairs can exist for the same key. It should be noted that for an input key-value pair, a mapper
may not produce any output key-value pair (filtering) or can generate multiple key-value pairs
(demultiplexing). The map stage can be summarized as map(k1, v1) → list(k2, v2).
Combine
Generally, the output of the map function is handled directly by the reduce function. However, map
tasks and reduce tasks are mostly run over different nodes. This requires moving data between mappers
and reducers. This data movement can consume a lot of valuable bandwidth and directly contributes to
processing latency.
With larger datasets, the time taken to move the data between map and reduce stages can exceed the
actual processing undertaken by the map and reduce tasks. For this reason, the MapReduce engine
provides an optional combine function (combiner) that summarizes a mapper’s output before it gets
processed by the reducer. In effect, the combine stage consolidates and groups the output from the map
stage before it is handed to the reducers.
A combiner is essentially a reducer function that locally groups a mapper’s output on the same node as
the mapper. A reducer function can be used as a combiner function, or a custom user-defined function
can be used.
The MapReduce engine combines all values for a given key from the mapper output, creating multiple
key-value pairs as input to the combiner where the key is not repeated and the value exists as a list of
all corresponding values for that key. The combiner stage is only an optimization stage, and may
therefore not even be called by the MapReduce engine.
For example, a combiner function will work for finding the largest or the smallest number, but will not
work for finding the average of all numbers since it only works with a subset of the data. The combine
stage can be summarized as combine(k2, list(v2)) → list(k2, v2).
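As a sketch of the point above, a reducer that emits the maximum value per key can safely double as a combiner, because a maximum of partial maxima equals the overall maximum; the class and key types below are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits the maximum value seen for each key. Because max is associative and
    // commutative, the same class can be registered as both combiner and reducer.
    public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    // In the driver: job.setCombinerClass(MaxValueReducer.class);
    //                job.setReducerClass(MaxValueReducer.class);
    // A combiner that averaged its inputs would be incorrect, because an average
    // of partial averages is not, in general, the overall average.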
Partition
During the partition stage, if more than one reducer is involved, a partitioner divides the output from
the mapper or combiner (if specified and called by the MapReduce engine) into partitions between
reducer instances. The number of partitions will equal the number of reducers. In other words, the
partition stage assigns the output of the map task (or combine stage, if present) to specific reducers.
Although each partition contains multiple key-value pairs, all records for a particular key are assigned
to the same partition. The MapReduce engine guarantees a random and fair distribution between
reducers while making sure that all of the same keys across multiple mappers end up with the same
reducer instance.
Depending on the nature of the job, certain reducers can sometimes receive a large number of key-value
pairs compared to others. As a result of this uneven workload, some reducers will finish earlier than
others. Overall, this is less efficient and leads to longer job execution times than if the work was evenly
split across reducers. This can be rectified by customizing the partitioning logic in order to guarantee a
fair distribution of key-value pairs.
The partition function is the last stage of the map task. It returns the index of the reducer to which a
particular partition should be sent. The partition stage can be summarized as partition(k2, v2) → reducer index.
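Customizing the partitioning logic is done by subclassing Hadoop's Partitioner; the sketch below routes one known hot key to its own reducer and hashes everything else (the hot-key value is hypothetical).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Custom partitioning logic: reserve reducer 0 for one very frequent key and
    // spread all remaining keys over the other reducers by hash.
    public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
        private static final String HOT_KEY = "the";  // hypothetical hot key

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions == 1 || HOT_KEY.equals(key.toString())) {
                return 0;  // dedicated reducer for the hot key
            }
            // Hash the remaining keys into partitions 1 .. numPartitions-1
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }

    // In the driver: job.setPartitionerClass(HotKeyPartitioner.class);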
Shuffle and Sort
During the first stage of the reduce task, output from all partitioners is copied across the network to the
nodes running the reduce task. This is known as shuffling. The list-based key-value output from each
partitioner can contain the same key multiple times.
Next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys
so that the output contains a sorted list of all input keys and their values with the same keys appearing
together. The way in which keys are grouped and sorted can be customized.
This merge creates a single key-value pair per group, where key is the group key and the value is the
list of all group values. The shuffle and sort stage can be summarized as list(k2, v2) → (k2, list(v2)) for each distinct key.
During the shuffle and sort stage, data is thus copied across the network to the reducer nodes and sorted
by key.
Reduce
Reduce is the final stage of the reduce task. Depending on the user-defined logic specified in the reduce
function (reducer), the reducer will either further summarize its input or will emit the output without
making any changes. In either case, for each key-value pair that a reducer receives, the list of values
stored in the value part of the pair is processed and another key-value pair is written out.
The output key can either be the same as the input key or a substring value from the input value, or
another serializable user-defined object. The output value can either be the same as the input value or a
substring value from the input value, or another serializable user-defined object.
Note that just like the mapper, for the input key-value pair, a reducer may not produce any output key-
value pair (filtering) or can generate multiple key-value pairs (demultiplexing). The output of the
reducer, that is, the key-value pairs, is then written out as a separate file, one file per reducer. To view
the full output from the MapReduce job, all the file parts must be combined.
The number of reducers can be customized. It is also possible to have a MapReduce job without a
reducer, for example when performing filtering.
Note that the output signature (key-value types) of the map function should match that of the input
signature (key-value types) of the reduce/combine function. The reduce stage can be summarized as
reduce(k2, list(v2)) → list(k3, v3).
Let us understand how MapReduce works by taking an example where we have a text file called
example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce, i.e. find the unique
words and the number of occurrences of each unique word.
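A minimal Hadoop WordCount sketch for this example is shown below; it follows the standard mapper/combiner/reducer pattern and assumes the input file path and an output directory are passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString(), " ,");
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // summing is safe to pre-aggregate
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/example.txt
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

For the sample line above, the job would emit counts such as (Car, 3), (Bear, 2), (River, 2), (Dear, 1), (Deer, 1), and (and, 1).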
YARN
Apache Hadoop YARN stands for Yet Another Resource Negotiator. It is an upgrade over the resource
management built into MapReduce in Hadoop version 1.0: a powerful and efficient resource manager
that helps support applications such as HBase, Spark, and Hive. The main idea behind YARN is to add a
layer that splits resource management from the processing components.
YARN can run various applications in parallel, bringing greater efficiency while processing the data.
YARN is responsible for providing compute resources such as CPU and memory and functions as the
cluster's resource manager.
Before the framework received its official name, it was known as MapReduce 2 (MRv2); YARN is used
to manage and allocate resources for various processes and jobs efficiently.
Components Of YARN
Resource Manager
The Resource Manager has the highest authority as it manages the allocation of resources. It runs many
services, including the resource scheduler, which decides how to assign the resources.
The Resource Manager holds metadata about the location and amount of resources available on the data
nodes, which is collectively known as rack awareness. There is one Resource Manager per cluster; it
accepts jobs from users and allocates resources to them.
The resource manager can be broken down into two sub-parts:
Application Manager
The Application Manager is responsible for validating the jobs submitted to the Resource Manager. It
verifies whether the system has enough resources to run a job and rejects the job if it does not. It
also ensures that there is no other application with the same ID that has already been submitted, which
could cause an error. After performing these checks, it finally forwards the application to the scheduler.
Schedulers
The scheduler, as the name suggests, is responsible for scheduling the tasks. The scheduler neither
monitors nor tracks the status of the application, nor does it restart tasks if a failure occurs. There
are three types of schedulers available in YARN: the FIFO (First In, First Out) scheduler, the Capacity
scheduler, and the Fair scheduler. For clusters that need large jobs to be executed promptly, it is better to
use the Capacity or Fair scheduler.
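The scheduler in use is selected in yarn-site.xml; a minimal sketch, assuming the CapacityScheduler implementation that ships with Hadoop, is shown below.

    <!-- yarn-site.xml: choose the pluggable scheduler implementation -->
    <configuration>
      <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        <!-- use ...scheduler.fair.FairScheduler here to switch to the Fair scheduler -->
      </property>
    </configuration>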
Node Manager
The Node Manager works as a slave installed on each node and functions as a monitoring and reporting
agent for the Resource Manager. It transmits the health status of its node to the Resource Manager and
offers that node's resources to the cluster.
It is responsible for monitoring resource usage by individual containers and reporting it to the Resource
manager. The Node Manager can also kill or destroy the container if it receives a command from the
Resource Manager to do so.
It also monitors resource usage, performs log management, and helps create container processes and
execute them at the request of the Application Master.
Now we shall discuss the components of Node Manager:
Containers
Containers are a fraction of a Node Manager's capacity; their responsibility is to provide physical
resources, such as disk, memory, and CPU, on a single node. All of the actual processing takes place
inside containers. An application can use a specific amount of CPU and memory only after permission
has been granted through its container.
Application Master
The Application Master is the framework-specific process that negotiates resources for a single application.
It works along with the Node Manager and monitors the execution of the application's tasks. The
Application Master also sends heartbeats to the Resource Manager, which include a status report once
the application has started.
The Application Master requests containers from the Node Manager by launching a Container Launch
Context (CLC), which specifies all the resources the application needs in order to execute.
HBase
HBase is a top-level Apache project written in Java that fulfills the need to read and write data in real
time. It provides a simple interface to distributed data. It can be accessed by Apache Hive, Apache
Pig, and MapReduce, and it stores its information in HDFS.
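A minimal sketch of a random write and read through the HBase Java client API is shown below; the table, column family, column, and row key are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseQuickPeek {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

                // Write: one row, column family "info", column "city"
                Put put = new Put(Bytes.toBytes("row-42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Random read of the same row (low-latency point lookup)
                Result result = table.get(new Get(Bytes.toBytes("row-42")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }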
HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.
HMaster –
The implementation of the master server in HBase is HMaster. It is the process that assigns regions to
Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances
present in the cluster. In a distributed environment, the master runs several background threads. HMaster
has many features, like controlling load balancing, failover, etc.
Region Server –
HBase tables are divided horizontally by row-key range into regions. Regions are the basic building
blocks of an HBase cluster; they hold a portion of a table's data and are composed of column families.
A Region Server runs on an HDFS DataNode present in the Hadoop cluster. A Region Server is
responsible for several things, such as handling, managing, and executing HBase read and write
operations on its set of regions. The default size of a region is 256 MB.
Zookeeper –
It is like a coordinator in HBase. It provides services like maintaining configuration information,
naming, providing distributed synchronization, server failure notification etc. Clients communicate with
region servers via zookeeper.
Advantages of HBase –
• Can store large data sets
• Database can be shared
• Cost-effective from gigabytes to petabytes
• High availability through failover and replication
Disadvantages of HBase –
• No support for SQL
• No transaction support
• Sorted only on key
• Memory issues on the cluster
Comparison between HBase and HDFS:
• HBase provides low latency access while HDFS provides high latency operations.
• HBase supports random read and writes while HDFS supports Write once Read Many times.
• HBase is accessed through shell commands, Java API, REST, Avro or Thrift API while HDFS
is accessed through MapReduce jobs.
HIVE
Hive is a data warehouse system that is used to analyze structured data. It is built on top of
Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed
storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into
MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce
programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).
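As a sketch of how HQL reaches Hive programmatically, the snippet below submits a few statements over JDBC to a HiveServer2 instance; the host, table, and file path are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveHqlSketch {
        public static void main(String[] args) throws Exception {
            // Older Hive JDBC drivers need explicit registration; newer ones auto-register.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC URL; host, port, and database are assumptions for this sketch
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection con = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = con.createStatement()) {

                // DDL: define a table over comma-delimited text files
                stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                           + "(user_id STRING, url STRING, ts BIGINT) "
                           + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

                // DML: load a (hypothetical) file that already sits in HDFS
                stmt.execute("LOAD DATA INPATH '/user/demo/page_views.csv' INTO TABLE page_views");

                // Query: Hive compiles this into MapReduce (or Tez/Spark) jobs behind the scenes
                ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
                }
            }
        }
    }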
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark
jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), through which users can add their own functionality.
Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries contain high latency.
Sqoop
Sqoop is a command-line interface application for transferring data between relational databases and
Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs
which can be run multiple times to import updates made to a database since the last import. Using
Sqoop, data can be moved into HDFS, Hive, or HBase from MySQL, PostgreSQL, Oracle, SQL Server, or DB2,
and vice versa.
Sqoop Working
Step 1: Sqoop sends a request to the relational database to return the metadata information about the
table (metadata here is the data about the table in the relational database).
Step 2: From the received information, Sqoop generates Java classes (this is why Java must be configured
before Sqoop can work; Sqoop internally uses the JDBC API).
Step 3: Sqoop, being written in Java, compiles these classes and packages them into a JAR file (the
standard Java packaging format), which is then used to carry out the import.
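A typical import invocation looks like the sketch below; the connection string, credentials, table, and target directory are hypothetical.

    # Import one table from a (hypothetical) MySQL database into HDFS with 4 map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shop \
      --username report_user \
      --password-file /user/demo/.db_password \
      --table orders \
      --target-dir /user/demo/orders \
      --num-mappers 4

    # Adding --incremental append --check-column order_id --last-value 100000
    # would pull only the rows added since the previous run (an incremental load).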
Apache Zookeeper
Apache Zookeeper is a distributed, open-source coordination service for distributed systems. It provides
a central place for distributed applications to store data, communicate with one another, and coordinate
activities. Zookeeper is used in distributed systems to coordinate distributed processes and services. It
provides a simple, tree-structured data model, a simple API, and a distributed protocol to ensure data
consistency and availability. Zookeeper is designed to be highly reliable and fault-tolerant, and it can
handle high levels of read and write throughput.
Zookeeper is implemented in Java and is widely used in distributed systems, particularly in the Hadoop
ecosystem. It is an Apache Software Foundation project and is released under the Apache License 2.0.
The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-like
structure. Each znode can store data and has a set of permissions that control access to the znode. The
znodes are organized in a hierarchical namespace, similar to a file system. At the root of the hierarchy
is the root znode, and all other znodes are children of the root znode. The hierarchy is similar to a file
system hierarchy, where each znode can have children and grandchildren, and so on.
Important Components in Zookeeper
ZooKeeper Services
Leader & Follower
Request Processor – Active in Leader Node and is responsible for processing write requests. After
processing, it sends changes to the follower nodes
Atomic Broadcast – Present in both Leader Node and Follower Nodes. It is responsible for sending the
changes to other Nodes.
In-memory database (replicated database) – Responsible for storing the data in ZooKeeper. Every node
contains its own database. Data is also written to the file system, providing recoverability in case of
any problems with the cluster.
Other Components
Client – One of the nodes in our distributed application cluster; it accesses information from the server.
Every client sends a regular message to the server to let the server know that it is alive.
Server – Provides all the services to the client and gives acknowledgments to the client.
Ensemble– Group of Zookeeper servers. The minimum number of nodes that are required to form an
ensemble is 3.
How Does ZooKeeper Work in Hadoop?
ZooKeeper operates as a distributed file system and exposes a simple set of APIs that enable clients to
read and write data to the file system. It stores its data in a tree-like structure called a znode, which can
be thought of as a file or a directory in a traditional file system. ZooKeeper uses a consensus algorithm
to ensure that all of its servers have a consistent view of the data stored in the Znodes. This means that
if a client writes data to a znode, that data will be replicated to all of the other servers in the ZooKeeper
ensemble.
One important feature of ZooKeeper is its ability to support the notion of a “watch.” A watch allows a
client to register for notifications when the data stored in a znode changes. This can be useful for
monitoring changes to the data stored in ZooKeeper and reacting to those changes in a distributed
system.
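A minimal Java sketch of the znode and watch ideas using the ZooKeeper client API is shown below; the ensemble address and znode path are hypothetical and error handling is omitted.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkWatchSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) three-server ensemble with a 3-second session timeout
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
                    event -> System.out.println("session event: " + event.getState()));

            // Create a persistent znode holding a small piece of shared configuration
            if (zk.exists("/demo-config", false) == null) {
                zk.create("/demo-config", "v1".getBytes(StandardCharsets.UTF_8),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Read it and register a watch: we are notified once if the data changes
            byte[] data = zk.getData("/demo-config",
                    event -> System.out.println("znode changed: " + event.getPath()), null);
            System.out.println("config = " + new String(data, StandardCharsets.UTF_8));

            // Trigger the watch by updating the znode (-1 means "any version")
            zk.setData("/demo-config", "v2".getBytes(StandardCharsets.UTF_8), -1);

            Thread.sleep(1000);  // give the watch callback time to fire
            zk.close();
        }
    }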
In Hadoop, ZooKeeper is used for a variety of purposes, including:
Storing configuration information: ZooKeeper is used to store configuration information that is shared
by multiple Hadoop components. For example, it might be used to store the locations of NameNodes in
a Hadoop cluster or the addresses of JobTracker nodes.
Providing distributed synchronization: ZooKeeper is used to coordinate the activities of various Hadoop
components and ensure that they are working together in a consistent manner. For example, it might be
used to ensure that only one NameNode is active at a time in a Hadoop cluster.
Maintaining naming: ZooKeeper is used to maintain a centralized naming service for Hadoop
components. This can be useful for identifying and locating resources in a distributed system.
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is
robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
It is mainly used for real-time data ingestion from different web applications into storage like HDFS,
HBase, etc. Apache Flume reliably collects, aggregates, and transports big data generated from external
sources to the central store.
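Flume agents are wired together in a properties file; the sketch below is illustrative (the agent name, tailed log file, and HDFS path are hypothetical) and lands events from a web-server log in HDFS.

    # flume-conf.properties: one agent (a1) with one source, one channel, and one sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: tail a (hypothetical) web-server log
    a1.sources.r1.type    = exec
    a1.sources.r1.command = tail -F /var/log/webapp/access.log

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type     = memory
    a1.channels.c1.capacity = 10000

    # Sink: write the events into HDFS
    a1.sinks.k1.type          = hdfs
    a1.sinks.k1.hdfs.path     = /flume/webapp/events
    a1.sinks.k1.hdfs.fileType = DataStream

    # Wire source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1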
Apache Oozie
Apache Oozie is a Workflow engine. It is used to run workflow jobs such as Hadoop Map/Reduce and
Pig. It is a Java web application that runs in a Java servlet container. By using Oozie, multiple jobs can
be bound together sequentially into one logical unit of work. The major advantage of the Oozie framework is
that it is fully integrated with the Apache Hadoop stack and supports Hadoop jobs for Apache
MapReduce, Pig, Hive, and Sqoop.
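A workflow is described in an XML file; the sketch below is illustrative only (the application name, paths, and ${...} properties are hypothetical, and element details vary by Oozie version) and chains a single MapReduce action between a start node and an end node.

    <!-- workflow.xml: one MapReduce action between start and end (illustrative only) -->
    <workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wordcount-wf">
        <start to="wordcount"/>
        <action name="wordcount">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- mapper/reducer classes and other job properties would also be set here -->
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/demo/example.txt</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/demo/wordcount-out</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Word count failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>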
The Apache Oozie architecture consists of the following main components.
Oozie Client
An Apache Oozie client is a command-line utility that interacts with the Oozie server using the Oozie
command-line tool, the Oozie Java client API, or the Oozie HTTP REST API. The Oozie command-
line tool and the Oozie Java API eventually use the Oozie HTTP REST API to communicate with the
Oozie server.
Oozie Server
Apache Oozie server is a Java web application that runs in a Java servlet container. Oozie uses Apache
Tomcat by default, which is an open-source Java servlet technology. The Oozie server does not store
any user or job information in memory. Oozie maintains all this information, such as whether a job is
running or completed, in a SQL database. When a user requests that a job be processed, the Oozie server
fetches the corresponding job state from the SQL database, performs the requested operation, and
updates the SQL database with the new state of the job.
Oozie Database
Apache Oozie database stores all of the stateful information such as workflow definitions, running and
completed jobs. Oozie fetches the corresponding job-state from the SQL database while processing a
user request and performs the requested operation, and updates the SQL database with the new state of
the job. Oozie provides support for databases such as Derby, MySQL, Oracle, and PostgreSQL.

More Related Content

Similar to BIGDATA MODULE 3.pdf

Similar to BIGDATA MODULE 3.pdf (20)

Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
paper
paperpaper
paper
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big data
Big dataBig data
Big data
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
BD-zero lecture.pptx
BD-zero lecture.pptxBD-zero lecture.pptx
BD-zero lecture.pptx
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem
 
BD-zero lecture.pptx
BD-zero lecture.pptxBD-zero lecture.pptx
BD-zero lecture.pptx
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 

Recently uploaded

Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 

Recently uploaded (20)

Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 

BIGDATA MODULE 3.pdf

  • 1. 1 Hadoop Ecosystem Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier, However, for those who are not acquainted with this technology, one question arises that what is big data? Big data is a term given to the data sets which can’t be processed in an efficient manner with the help of traditional methodology such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and needs efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies. Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage and maintenance of data etc. Following are the components that collectively form a Hadoop ecosystem: • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming based Data Processing • Spark: In-Memory data processing • PIG, HIVE: Query based processing of data services • HBase: NoSQL Database • Mahout, Spark MLLib: Machine Learning algorithm libraries • Solar, Lucene: Searching and Indexing • Zookeeper: Managing cluster • Oozie: Job Scheduling Note: Apart from the above-mentioned components, there are many other components too that are part of the Hadoop ecosystem. All these toolkits or components revolve around one term i.e. Data. That’s the beauty of Hadoop that it revolves around data and hence making its synthesis easier.
  • 2. 2 HDFS: HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the form of log files. HDFS consists of two core components i.e. • Name node • Data Node Name Node is the prime node which contains metadata (data about data) requiring comparatively fewer resources than the data nodes that stores the actual data. These data nodes are commodity hardware in the distributed environment. Undoubtedly, making Hadoop cost effective. HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system. YARN: Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop System. Consists of three major components i.e. • Resource Manager • Nodes Manager • Application Manager Resource manager has the privilege of allocating resources for the applications in a system whereas Node managers work on the allocation of resources such as CPU, memory, bandwidth per machine and later on acknowledges the resource manager. Application manager works as an interface between the resource manager and node manager and performs negotiations as per the requirement of the two. MapReduce: By making the use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing’s logic and helps to write applications which transform big data sets into a manageable one. MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is: Map() performs sorting and filtering of data and thereby organizing them in the form of group. Map generates a key-value pair based result which is later on processed by the Reduce() method. Reduce(), as the name suggests does the summarization by aggregating the mapped data. In simple, Reduce() takes the output generated by Map() as input and combines those tuples into smaller set of tuples. PIG: Pig was basically developed by Yahoo which works on a pig Latin language, which is Query based language similar to SQL. It is a platform for structuring the data flow, processing and analyzing huge data sets.
  • 3. 3 Pig does the work of executing commands and in the background, all the activities of MapReduce are taken care of. After the processing, pig stores the result in HDFS. Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem. HIVE: With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. However, its query language is called as HQL (Hive Query Language). It is highly scalable as it allows real-time processing and batch processing both. Also, all the SQL datatypes are supported by Hive thus, making the query processing easier. Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC Drivers and HIVE Command Line. JDBC, along with ODBC drivers work on establishing the data storage permissions and connection whereas HIVE Command line helps in the processing of queries. Mahout: Mahout, allows Machine Learnability to a system or application. Machine Learning, as the name suggests helps the system to develop itself based on some patterns, user/environmental interaction or on the basis of algorithms. It provides various libraries or functionalities such as collaborative filtering, clustering, and classification which are nothing but concepts of Machine learning. It allows invoking algorithms as per our need with the help of its own libraries. Apache Spark: It’s a platform that handles all the process consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc. It consumes in memory resources hence, thus being faster than the prior in terms of optimization. Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch processing, hence both are used in most of the companies interchangeably. Apache HBase: It’s a NoSQL database which supports all kinds of data and thus capable of handling anything of Hadoop Database. It provides capabilities of Google’s BigTable, thus able to work on Big Data sets effectively. At times where we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short quick span of time. At such times, HBase comes handy as it gives us a tolerant way of storing limited data Other Components: Apart from all of these, there are some other components too that carry out a huge task in order to make Hadoop capable of processing large datasets. They are as follows: Solr, Lucene: These are the two services that perform the task of searching and indexing with the help of some java libraries, especially Lucene is based on Java which allows spell check mechanism, as well. However, Lucene is driven by Solr.
  • 4. 4 Zookeeper: There was a huge issue of management of coordination and synchronization among the resources or the components of Hadoop which resulted in inconsistency, often. Zookeeper overcame all the problems by performing synchronization, inter-component-based communication, grouping, and maintenance. Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There is two kinds of jobs .i.e Oozie workflow and Oozie coordinator jobs. Oozie workflow is the jobs that need to be executed in a sequentially ordered manner whereas Oozie Coordinator jobs are those that are triggered when some data or external stimulus is given to it. HDFS HDFS is capable of handling larger size data with high volume velocity and variety makes Hadoop work more efficient and reliable with easy access to all its components. HDFS stores the data in the form of the block where the size of each data block is 128MB in size which is configurable means you can change it according to your requirement in hdfs-site.xml file in your Hadoop directory. Some Important Features of HDFS (Hadoop Distributed File System) • It’s easy to access the files stored in HDFS. • HDFS also provides high availability and fault tolerance. • Provides scalability to scaleup or scaledown nodes as per our requirement. • Data is stored in distributed manner i.e., various Datanodes are responsible for storing the data. • HDFS provides Replication because of which no fear of Data Loss. • HDFS Provides High Reliability as it can store data in a large range of Petabytes. • HDFS has in-built servers in Name node and Data Node that helps them to easily retrieve the cluster information. • Provides high throughput. HDFS Architecture As we all know Hadoop works on the MapReduce algorithm which is a master-slave architecture, HDFS has NameNode and DataNode that works in the similar pattern. 1. NameNode (Master) 2. DataNode (Slave) 1. NameNode: NameNode works as a Master in a Hadoop cluster that Guides the Datanode (Slaves). Namenode is mainly used for storing the Metadata i.e. nothing but the data about the data. Meta Data can be the transaction logs that keep track of the user’s activity in a Hadoop cluster. Meta Data can also be the name of the file, size, and the information about the location (Block number, Block ids) of Datanode that Namenode stores to find the closest DataNode for Faster Communication. Namenode instructs the DataNodes with the operation like delete, create, Replicate, etc. As our NameNode is working as a Master it should have a high RAM or Processing power in order to Maintain or Guide all the slaves in a Hadoop cluster. Namenode receives heartbeat signals and block reports from all the slaves i.e., DataNodes. 2. DataNode: DataNodes works as a Slave DataNodes are mainly utilized for storing the data in a Hadoop cluster, the number of DataNodes can be from 1 to 500 or even more than that, the more number of DataNode your Hadoop cluster has More Data can be stored. so it is advised that the DataNode should have High storing capacity to store a large number of file blocks. Datanode performs operations like creation, deletion, etc. according to the instruction provided by the NameNode.
Objectives and Assumptions Of HDFS

1. System Failure: As a Hadoop cluster consists of lots of nodes built from commodity hardware, node failure is possible, so a fundamental goal of HDFS is to detect such failures and recover from them.

2. Maintaining Large Datasets: As HDFS handles files of sizes ranging from gigabytes to petabytes, it has to be able to deal with these very large data sets on a single cluster.

3. Moving Data is Costlier than Moving the Computation: If a computational operation is performed near the location where the data is present, it is much faster, the overall throughput of the system increases, and network congestion is minimized, which is why HDFS makes this assumption.

4. Portable Across Various Platforms: HDFS possesses portability, which allows it to switch across diverse hardware and software platforms.

5. Simple Coherency Model: The Hadoop Distributed File System assumes a write-once, read-many access model for files. A file, once written and closed, should not be changed; data can only be appended. This assumption helps to minimize data-coherency issues, and MapReduce fits perfectly with this kind of file model (a small client-side sketch after the feature list below illustrates the model).

Features of HDFS

Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.

Replication - Due to unfavourable conditions, a node containing data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.

Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.

Distributed data storage - This is one of the most important features of HDFS and one that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across the nodes.

Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
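To make the coherency model above concrete, here is a hedged Java sketch using the HDFS FileSystem API: the file is created and closed once, and afterwards data can only be appended to its end. It assumes the cluster permits appends, and the path /user/demo/events.log is a hypothetical placeholder.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/events.log"); // hypothetical path

            // Write the file once and close it; existing bytes are never rewritten in place.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            }

            // New data can only be appended to the end of the closed file.
            try (FSDataOutputStream out = fs.append(path)) {
                out.write("second record\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```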
MapReduce

MapReduce is a processing technique and a programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from the map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data-processing primitives are called mappers and reducers. Decomposing a data-processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes with data on local disks, which reduces network traffic. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.
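As a hedged illustration of how such a job is wired together (not code from the original text), the sketch below uses the standard org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are hypothetical class names for the word-count example developed later in this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // mapper sketched later
        job.setReducerClass(WordCountReducer.class);   // reducer sketched later
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input lives in HDFS; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```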
Map and Reduce Tasks

A single processing run of the MapReduce processing engine is known as a MapReduce job. Each MapReduce job is composed of a map task and a reduce task, and each task consists of multiple stages:

Map task
• map
• combine (optional)
• partition

Reduce task
• shuffle and sort
• reduce

Map

The first stage of MapReduce is known as map, during which the dataset file is divided into multiple smaller splits. Each split is parsed into its constituent records as key-value pairs. The key is usually the ordinal position of the record, and the value is the actual record. The parsed key-value pairs for each split are then sent to a map function, or mapper, with one mapper function per split. The map function executes user-defined logic. Each split generally contains multiple key-value pairs, and the mapper is run once for each key-value pair in the split.

The mapper processes each key-value pair as per the user-defined logic and in turn generates a key-value pair as its output. The output key can either be the same as the input key, a substring of the input value, or another serializable user-defined object. Similarly, the output value can either be the same as the input value, a substring of the input value, or another serializable user-defined object. When all records of the split have been processed, the output is a list of key-value pairs, where multiple key-value pairs can exist for the same key. It should be noted that, for an input key-value pair, a mapper may not produce any output key-value pair (filtering) or may generate multiple key-value pairs (demultiplexing).

The map stage can be summarized as: map (k1, v1) → list(k2, v2).
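The filtering and demultiplexing behaviour described above can be seen in a small, hedged mapper sketch; the ErrorCodeMapper class and the idea of counting tokens only on lines containing "ERROR" are illustrative assumptions, not part of the original text.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line (its "ordinal position"); input value: the line itself.
public class ErrorCodeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Filtering: lines without "ERROR" produce no output pair at all.
        if (!line.toString().contains("ERROR")) {
            return;
        }
        // Demultiplexing: one output pair per token of the matching line.
        for (String token : line.toString().split("\\s+")) {
            context.write(new Text(token), ONE);
        }
    }
}
```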
Combine

Generally, the output of the map function is handled directly by the reduce function. However, map tasks and reduce tasks mostly run on different nodes, which requires moving data between mappers and reducers. This data movement can consume a lot of valuable bandwidth and directly contributes to processing latency. With larger datasets, the time taken to move the data between the map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks. For this reason, the MapReduce engine provides an optional combine function (combiner) that summarizes a mapper's output before it gets processed by the reducer.

A combiner is essentially a reducer function that locally groups a mapper's output on the same node as the mapper. A reducer function can be used as a combiner function, or a custom user-defined function can be used. The MapReduce engine combines all values for a given key from the mapper output, creating multiple key-value pairs as input to the combiner, where the key is not repeated and the value exists as a list of all corresponding values for that key. The combine stage is only an optimization stage and may therefore not even be called by the MapReduce engine. For example, a combiner function will work for finding the largest or smallest number, but will not work for finding the average of all numbers, since it only works with a subset of the data.

The combine stage can be summarized as: combine (k2, list(v2)) → list(k2, v2).
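In the Java API, reusing a reducer as a combiner is a one-line, hedged illustration of this optimization; the Job object and the WordCountReducer class from the word-count sketches elsewhere in this document are assumptions, not code from the original text.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // 'job' is assumed to be the word-count Job from the driver sketch shown earlier.
    static void enableCombiner(Job job) {
        // Summing partial counts on each mapper node is safe because addition is
        // associative and commutative, so the reducer class doubles as a combiner.
        job.setCombinerClass(WordCountReducer.class);

        // Note: this would NOT be correct for an "average" job, because averaging
        // per-node partial averages does not yield the global average.
    }
}
```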
Partition

During the partition stage, if more than one reducer is involved, a partitioner divides the output from the mapper or combiner (if one was specified and called by the MapReduce engine) into partitions between the reducer instances. The number of partitions equals the number of reducers. The partition stage assigns the output from the map task to the reducers: although each partition contains multiple key-value pairs, all records for a particular key are assigned to the same partition.

The MapReduce engine guarantees a random and fair distribution of keys between reducers while making sure that all occurrences of the same key across multiple mappers end up at the same reducer instance. Depending on the nature of the job, certain reducers can sometimes receive a much larger number of key-value pairs than others. As a result of this uneven workload, some reducers will finish earlier than others. Overall, this is less efficient and leads to longer job execution times than if the work were evenly split across reducers. This can be rectified by customizing the partitioning logic in order to guarantee a fair distribution of key-value pairs (a custom partitioner is sketched below).

The partition function is the last stage of the map task. It returns the index of the reducer to which a particular partition should be sent. The partition stage can be summarized as: partition (k2) → reducer index.
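The partitioning logic can be customized by subclassing the Partitioner class. The following hedged sketch (SeverityPartitioner, an assumed example rather than anything from the original text) returns the reducer index for each key, as described above.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes "ERROR"-prefixed keys to reducer 0 and spreads everything else by hash.
// The key prefix and the routing rule are illustrative assumptions.
public class SeverityPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // only one partition exists, so every record goes there
        }
        if (key.toString().startsWith("ERROR")) {
            return 0; // dedicate the first reducer to error keys
        }
        // Hash the remaining keys across the other reducers.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
// Registered in the driver with: job.setPartitionerClass(SeverityPartitioner.class);
```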
Shuffle and Sort

During the first stage of the reduce task, output from all the partitioners is copied across the network to the nodes running the reduce task. This is known as shuffling. The list-based key-value output from each partitioner can contain the same key multiple times. Next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys, so that the output contains a sorted list of all input keys and their values, with the same keys appearing together. The way in which keys are grouped and sorted can be customized. This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values.

The shuffle and sort stage can be summarized as: shuffle and sort (list(k2, v2)) → (k2, list(v2)). In other words, during this stage data is copied across the network to the reducer nodes and sorted by key.

Reduce

Reduce is the final stage of the reduce task. Depending on the user-defined logic specified in the reduce function (reducer), the reducer will either further summarize its input or will emit the output without making any changes. In either case, for each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out. The output key can either be the same as the input key, a substring of the input value, or another serializable user-defined object. The output value can either be the same as the input value, a substring of the input value, or another serializable user-defined object.

Note that, just like the mapper, for an input key-value pair a reducer may not produce any output key-value pair (filtering) or may generate multiple key-value pairs (demultiplexing). The output of the reducer, that is the key-value pairs, is then written out as a separate file, one file per reducer. To view the full output from the MapReduce job, all of these file parts must be combined.
The reduce stage is the last stage of the reduce task. The number of reducers can be customized, and it is also possible to have a MapReduce job without a reducer, for example when performing filtering. Note that the output signature (key-value types) of the map function must match the input signature (key-value types) of the reduce/combine function.

The reduce stage can be summarized as: reduce (k2, list(v2)) → list(k3, v3).

Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:

Deer, Bear, River, Car, Car, River, Deer, Car and Bear

Now suppose we have to perform a word count on example.txt using MapReduce. We will be finding the unique words and the number of occurrences of those unique words.
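A hedged sketch of the mapper and reducer for this word count, using the standard Hadoop Java API, might look as follows; the class names match the ones assumed in the driver and combiner sketches earlier.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each word in the line, emit (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString(), " ,");
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. (Deer, 1), (Bear, 1), (River, 1), ...
        }
    }
}

// Reduce: after shuffle and sort, each key arrives with the list of its 1s; sum them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // e.g. (Car, 3), (Deer, 2)
    }
}
```

For the sample line above, the output would include pairs such as (Bear, 2), (Car, 3), (Deer, 2), and (River, 2).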
YARN

Apache Hadoop YARN, or Yet Another Resource Negotiator, is an upgrade to the MapReduce engine present in Hadoop version 1.0: it is a powerful and efficient resource manager that helps support applications such as HBase, Spark, and Hive. The main idea behind YARN is to split the resource-management layer from the processing-component layer. YARN can work with various applications in parallel, bringing greater efficiency while processing the data. YARN is responsible for providing resources such as storage space and also functions as a resource manager. Before the framework received its official name, it was known as MapReduce 2 (MRv2), because it took over the job of managing and allocating resources for the various processes and jobs from the original MapReduce.

Components Of YARN

Resource Manager

The Resource Manager has the highest authority, as it manages the allocation of resources. It runs many services, including the resource scheduler, which decides how to assign the resources. The Resource Manager contains metadata regarding the location and the amount of resources available on the data nodes, collectively known as rack awareness. There is one Resource Manager per cluster; it accepts job submissions from users and allocates resources to them. The Resource Manager can be broken down into two sub-parts:

Application Manager

The Application Manager is responsible for validating the jobs submitted to the Resource Manager. It verifies whether the system has enough resources to run a job and rejects the job if the system is out of resources. It also ensures that there is no other application with the same ID that has already been submitted, which could cause an error. After performing these checks, it finally forwards the application to the scheduler.
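Before moving on to the scheduler, here is a hedged Java sketch that uses the YarnClient API to ask the Resource Manager about the cluster and the applications it is tracking; the Resource Manager address is assumed to come from a yarn-site.xml on the classpath.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Reads the Resource Manager address from yarn-site.xml on the classpath.
        Configuration conf = new Configuration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Cluster-wide view kept by the Resource Manager.
        System.out.println("Node managers: "
                + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

        // Applications the Resource Manager / Application Manager is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "\t"
                    + app.getName() + "\t" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```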
Schedulers

The scheduler, as the name suggests, is responsible for scheduling the tasks. The scheduler neither monitors nor tracks the status of an application, nor does it restart the application if a failure occurs. There are three types of schedulers available in YARN: the FIFO (First In, First Out) scheduler, the Capacity scheduler, and the Fair scheduler. For clusters that need large jobs to be executed promptly, it is better to use the Capacity or Fair scheduler.

Node Manager

The Node Manager works as a slave installed on each node and functions as a monitoring and reporting agent for the Resource Manager. It transmits the health status of each node to the Resource Manager and offers that node's resources to the cluster. It is responsible for monitoring resource usage by individual containers and reporting it to the Resource Manager, and it can kill or destroy a container if it receives a command from the Resource Manager to do so. It also performs log management and helps in creating container processes and executing them at the request of the Application Master. Now we shall discuss the components related to the Node Manager:

Containers

Containers are a fraction of the Node Manager's capacity, whose responsibility is to provide physical resources, such as disk space, memory, and CPU, on a single node. All of the actual processing takes place inside containers. An application can use a specific amount of memory and CPU only after that container has been granted to it.

Application Master

The Application Master is the framework-specific process that negotiates resources for a single application. It works along with the Node Manager and monitors the execution of the tasks. The Application Master also sends heartbeats to the Resource Manager, which provide a status report after the application has started. The Application Master requests containers from the Node Manager by launching a CLC (Container Launch Context), which takes care of all the resources an application needs to execute.

HBase

HBase is a top-level Apache project written in Java which fulfills the need to read and write data in real time. It provides a simple interface to distributed data. It can be accessed by Apache Hive, Apache Pig, and MapReduce, and it stores its information in HDFS.

The HBase architecture has three main components: HMaster, Region Server, and Zookeeper.
HMaster – The implementation of the Master Server in HBase is the HMaster. It is the process that assigns regions to Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. The HMaster has many responsibilities, such as controlling load balancing, failover, etc.

Region Server – HBase tables are divided horizontally by row-key range into regions. Regions are the basic building elements of an HBase cluster; they hold the distributed portions of the tables and are composed of column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster. The regions of a Region Server are responsible for several things, like handling, managing, and executing HBase read and write operations on that set of regions. The default size of a region is 256 MB.

Zookeeper – Zookeeper acts as a coordinator in HBase. It provides services like maintaining configuration information, naming, providing distributed synchronization, server-failure notification, etc. Clients use Zookeeper to locate the region servers they communicate with.

Advantages of HBase –
• Can store large data sets
• The database can be shared
• Cost-effective from gigabytes to petabytes
• High availability through failover and replication

Disadvantages of HBase –
• No support for an SQL structure
• No transaction support
• Data is sorted only on the row key
• Memory issues on the cluster

Comparison between HBase and HDFS:
• HBase provides low-latency access, while HDFS provides high-latency operations.
• HBase supports random reads and writes, while HDFS supports write-once, read-many access.
• HBase is accessed through shell commands, the Java API, REST, Avro, or the Thrift API, while HDFS is accessed through MapReduce jobs.
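As an illustration of HBase's simple interface to distributed data, here is a hedged Java sketch using the standard HBase client API; the users table, the cf column family, and the row key are hypothetical, and the Zookeeper quorum is assumed to be configured in hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the Zookeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one cell in the 'cf' column family of row 'user42'.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read of the same row, served with low latency.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] city = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```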
HIVE

Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook. Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs. Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).

Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types, such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs) through which users can provide their own functionality.

Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries have high latency.

Sqoop

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Using Sqoop, data can be moved into HDFS, Hive, or HBase from MySQL, PostgreSQL, Oracle, SQL Server, or DB2, and vice versa.
Sqoop Working

Step 1: Sqoop sends a request to the relational database asking it to return the metadata information about the table (metadata here is the data about the table in the relational database).

Step 2: From the received information, Sqoop generates Java classes (this is why you should have Java configured before getting Sqoop to work; Sqoop internally uses the JDBC API and these generated classes to move the data).

Step 3: Since Sqoop is written in Java, it then compiles these classes and, after compilation, packages them into a jar file (the standard Java packaging format).

Apache Zookeeper

Apache Zookeeper is a distributed, open-source coordination service for distributed systems. It provides a central place for distributed applications to store data, communicate with one another, and coordinate activities. Zookeeper is used in distributed systems to coordinate distributed processes and services. It provides a simple, tree-structured data model, a simple API, and a distributed protocol to ensure data consistency and availability. Zookeeper is designed to be highly reliable and fault-tolerant, and it can handle high levels of read and write throughput.

Zookeeper is implemented in Java and is widely used in distributed systems, particularly in the Hadoop ecosystem. It is an Apache Software Foundation project and is released under the Apache License 2.0.

The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-like structure. Each znode can store data and has a set of permissions that control access to it. The znodes are organized in a hierarchical namespace, similar to a file system.
At the root of the hierarchy is the root znode, and all other znodes are children of the root znode. The hierarchy is similar to a file system hierarchy, where each znode can have children and grandchildren, and so on.

Important Components in Zookeeper

ZooKeeper Services

Leader & Follower

Request Processor – Active on the Leader node, it is responsible for processing write requests. After processing, it sends the changes to the Follower nodes.

Atomic Broadcast – Present on both the Leader node and the Follower nodes. It is responsible for sending the changes to the other nodes.

In-memory Databases (Replicated Databases) – Responsible for storing the data in ZooKeeper. Every node contains its own database. Data is also written to the file system, providing recoverability in case of any problems with the cluster.

Other Components

Client – One of the nodes in our distributed application cluster. It accesses information from the server. Every client sends a message to the server periodically to let the server know that it is alive.

Server – Provides all the services to the client and gives acknowledgements to the client.

Ensemble – A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.

How ZooKeeper in Hadoop Works?

ZooKeeper operates like a distributed file system and exposes a simple set of APIs that enable clients to read and write data to it. It stores its data in a tree-like structure of znodes, each of which can be thought of as a file or a directory in a traditional file system. ZooKeeper uses a consensus algorithm to ensure that all of its servers have a consistent view of the data stored in the znodes. This means that if a client writes data to a znode, that data will be replicated to all of the other servers in the ZooKeeper ensemble.

One important feature of ZooKeeper is its support for the notion of a "watch." A watch allows a client to register for a notification when the data stored in a znode changes. This can be useful for monitoring changes to the data stored in ZooKeeper and reacting to those changes in a distributed system.
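A hedged Java sketch of this znode-plus-watch model, using the standard org.apache.zookeeper client API, might look as follows; the connection string localhost:2181 and the /app/config path are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkWatchExample {
    public static void main(String[] args) throws Exception {
        // The session watcher logs connection-level events (connected, expired, ...).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                event -> System.out.println("Session event: " + event.getState()));

        // Create persistent znodes holding a small piece of configuration, if absent.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists("/app/config", false) == null) {
            zk.create("/app/config", "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the znode and register a watch: the callback fires once when it changes.
        byte[] data = zk.getData("/app/config",
                event -> System.out.println("Watch fired: " + event.getType()), null);
        System.out.println("config = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```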
In Hadoop, ZooKeeper is used for a variety of purposes, including:

Storing configuration information: ZooKeeper is used to store configuration information that is shared by multiple Hadoop components. For example, it might be used to store the locations of NameNodes in a Hadoop cluster or the addresses of JobTracker nodes.

Providing distributed synchronization: ZooKeeper is used to coordinate the activities of various Hadoop components and ensure that they are working together in a consistent manner. For example, it might be used to ensure that only one NameNode is active at a time in a Hadoop cluster.

Maintaining naming: ZooKeeper is used to maintain a centralized naming service for Hadoop components. This can be useful for identifying and locating resources in a distributed system.

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. It is mainly used for real-time data ingestion from different web applications into storage such as HDFS, HBase, etc. Apache Flume reliably collects, aggregates, and transports big data generated from external sources to the central store.

Apache Oozie

Apache Oozie is a workflow engine. It is used to run workflow jobs such as Hadoop MapReduce and Pig jobs. It is a Java web application that runs in a Java servlet container. By using Oozie, multiple jobs can be bound together sequentially into one logical unit of work. The major advantage of the Oozie framework is that it is fully integrated with the Apache Hadoop stack and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop. The architecture of Apache Oozie consists of the following components.
Oozie Client

An Apache Oozie client interacts with the Oozie server using the Oozie command-line tool, the Oozie Java client API, or the Oozie HTTP REST API. The Oozie command-line tool and the Oozie Java API ultimately use the Oozie HTTP REST API to communicate with the Oozie server.

Oozie Server

The Apache Oozie server is a Java web application that runs in a Java servlet container. Oozie uses Apache Tomcat by default, which is an open-source Java servlet technology. The Oozie server does not store any user or job information in memory; Oozie maintains all of this information, such as running or completed jobs, in a SQL database. When a user requests that a job be processed, the Oozie server fetches the corresponding job state from the SQL database, performs the requested operation, and updates the SQL database with the new state of the job.

Oozie Database

The Apache Oozie database stores all of the stateful information, such as workflow definitions and running and completed jobs. Oozie fetches the corresponding job state from the SQL database while processing a user request, performs the requested operation, and updates the SQL database with the new state of the job. Oozie provides support for databases such as Derby, MySQL, Oracle, and PostgreSQL.
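As a closing illustration of the Oozie Java client API mentioned above, here is a hedged sketch that submits a workflow; the Oozie URL, the HDFS application path, and the resourceManager/nameNode property values are hypothetical placeholders for a real cluster and workflow definition.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Points at the Oozie server's REST endpoint (the Java API wraps the HTTP REST API).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties, equivalent to a job.properties file on the command line.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wordcount-wf");
        conf.setProperty("resourceManager", "resourcemanager:8032"); // workflow parameter (assumed)
        conf.setProperty("nameNode", "hdfs://namenode:8020");        // workflow parameter (assumed)

        // Submit and start the workflow; Oozie records its state in the SQL database.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow " + jobId);

        // Poll the server for the job state (RUNNING, SUCCEEDED, KILLED, ...).
        WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
        System.out.println("Current status: " + status);
    }
}
```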