1. CSA15-CLOUD COMPUTING AND BIG DATA ANALYTICS
Dr.J.Praveenchandar
Associate Professor/CSE
HADOOP AND MAPREDUCE
ARCHITECTURE
Dr.J.Praveenchandar/CSE
2. HADOOP AND MAPREDUCE ARCHITECTURE
Big data - Apache Hadoop & Hadoop Ecosystem - Analyzing data
with Hadoop streaming - HDFS concept - Interface to HDFS -
Moving data in and out of Hadoop - Introduction to MapReduce -
MapReduce algorithm and architecture - Understanding inputs
and outputs of MapReduce - Anatomy of a MapReduce job run -
Failures in classical MapReduce and YARN - Job scheduling - Data
serialization.
3. Big data
• Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.
• It is data of such large size and complexity that none of the traditional
data management tools can store or process it efficiently.
• In short, big data is still data, but of enormous size.
Example of Big Data
• The New York Stock Exchange and social media platforms
4. Apache Hadoop & Hadoop Ecosystem
• Apache Hadoop is an open-source framework
intended to make interaction with big data easier.
• Hadoop has made its place in industries and
companies that need to work on large, sensitive data
sets requiring efficient handling.
• Hadoop is a framework that enables processing of
large data sets which reside in the form of clusters.
• Being a framework, Hadoop is made up of several
modules that are supported by a large ecosystem of
technologies.
6. Apache Hadoop & Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various
services to solve big data problems.
There are four major elements of Hadoop:
• HDFS,
• MapReduce,
• YARN, and
• Hadoop Common.
8. Components that collectively form a Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Managing the cluster
• Oozie: Job scheduling
9. HDFS:
• HDFS is the primary or major component of the Hadoop ecosystem and
is responsible for storing large data sets of structured or
unstructured data across various nodes, thereby maintaining the
metadata in the form of log files.
• HDFS consists of two core components:
▪ NameNode
▪ DataNode
10. YARN:
• Yet Another Resource Negotiator: as the name implies, YARN
helps to manage the resources across the clusters. In short,
it performs scheduling and resource allocation for the Hadoop
system.
• It consists of three major components:
▪ Resource Manager
▪ Node Manager
▪ Application Manager
11. MapReduce:
• By making use of distributed and parallel algorithms, MapReduce
carries the processing logic and helps to
write applications which transform big data sets into manageable
ones.
• MapReduce makes use of two functions, Map() and Reduce(),
whose tasks are:
▪ Map() performs sorting and filtering of data, thereby organizing it in
the form of groups. Map generates key-value-pair-based results which are later
processed by the Reduce() method.
▪ Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In short, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples.
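The division of labour between Map() and Reduce() can be sketched in plain Python. This is a simplified, single-process word-count sketch (the names run_job, map_fn and reduce_fn are illustrative, not Hadoop API); a real job distributes these phases across nodes:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map(): organize the record into (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce(): aggregate all values that share the same key.
    return (key, sum(values))

def run_job(lines):
    # Map phase: apply Map() to every input record.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle and sort: bring pairs with the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: combine each group into a smaller set of tuples.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["big data", "big cluster"]))
# → [('big', 2), ('cluster', 1), ('data', 1)]
```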
12. PIG:
• Pig was originally developed by Yahoo. It works on Pig Latin,
a query-based language similar to SQL.
• It is a platform for structuring the data flow, and for processing and
analyzing huge data sets.
• Pig does the work of executing commands, and in the background all
the activities of MapReduce are taken care of.
• After the processing, Pig stores the result in HDFS.
13. HIVE:
• With the help of SQL methodology and an SQL-like interface, HIVE performs
reading and writing of large data sets.
• However, its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch
processing.
• Also, all the SQL data types are supported by Hive, making
query processing easier.
• Similar to other query-processing frameworks, HIVE comes with
two components: JDBC Drivers and the HIVE Command Line.
14. Apache Spark:
• It's a platform that handles all the process-intensive tasks like
batch processing, interactive or iterative real-time processing, graph
conversions, visualization, etc.
• It consumes in-memory resources, thus being faster than the
prior frameworks in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited
for structured data or batch processing; hence both are used in most
companies interchangeably.
16. Hadoop Streaming
• It is a utility or feature that comes with a Hadoop distribution and
allows developers or programmers to write the Map-Reduce
program using different programming languages like Ruby, Perl,
Python, and C++.
• If we are reading image data, then we can generate a key-value pair
for each pixel, where the key is the location of the pixel and the
value is its color value (0-255) for a colored image.
• This list of key-value pairs is fed to the Map phase; the Mapper
works on each key-value pair of each pixel and generates
intermediate key-value pairs, which are fed to the
Reducer after shuffling and sorting. The final output
produced by the Reducer is written to HDFS.
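A minimal streaming-style mapper and reducer in Python might look like the sketch below. In an actual Hadoop Streaming run the two functions would be separate scripts reading sys.stdin and writing tab-separated pairs to stdout, launched via the hadoop-streaming JAR's -mapper and -reducer options; here the map → sort → reduce pipeline is simulated in one process:

```python
def mapper(lines):
    # Streaming mapper: emit one tab-separated "key<TAB>value" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so all counts for
    # one word are adjacent and can be summed in a single pass.
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the shuffle/sort that Hadoop performs between the phases.
pairs = sorted(mapper(["big data", "big cluster"]))
print(list(reducer(pairs)))
# → ['big\t2', 'cluster\t1', 'data\t1']
```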
17. HDFS concept
• The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications.
• HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance
access to data across highly scalable Hadoop clusters.
• HDFS enables the rapid transfer of data between compute nodes.
• At its outset, it was closely coupled with MapReduce, a framework
for data processing that filters and divides up work among the nodes
in a cluster, and it organizes and condenses the results into a
cohesive answer to a query.
19. HDFS
• HDFS uses a primary/secondary architecture.
• The HDFS cluster's NameNode is the primary server that manages
the file system namespace and controls client access to files.
• As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and
provides clients with the right access permissions.
• The system's DataNodes manage the storage that's attached to the
nodes they run on.
20. HDFS
• HDFS exposes a file system namespace and enables user data to be
stored in files.
• A file is split into one or more blocks that are stored in a set of
DataNodes.
• The NameNode performs file system namespace operations,
including opening, closing and renaming files and directories.
• The NameNode also governs the mapping of blocks to the
DataNodes.
• The DataNodes serve read and write requests from the clients of the
file system. In addition, they perform block creation, deletion and
replication when the NameNode instructs them to do so.
21. HDFS
• HDFS supports a traditional hierarchical file organization.
• An application or user can create directories and then store files
inside these directories.
• The file system namespace hierarchy is like most other file systems --
a user can create, remove, rename or move files from one directory
to another.
22. Features of HDFS
There are several features that make HDFS particularly useful,
including
Data replication. This is used to ensure that the data is always
available and prevents data loss.
For example, when a node crashes or there is a hardware failure,
replicated data can be pulled from elsewhere within a cluster, so
processing continues while data is recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks
and store them across nodes in a large cluster ensures fault tolerance
and reliability.
High availability. As mentioned earlier, because of replication across
nodes, data is available even if the NameNode or a DataNode fails.
23. Features of HDFS
• Scalability. Because HDFS stores data on various nodes in the
cluster, as requirements increase, a cluster can scale to hundreds of
nodes.
• High throughput. Because HDFS stores data in a distributed
manner, the data can be processed in parallel on a cluster of nodes.
This, plus data locality (see next bullet), cuts the processing time and
enables high throughput.
• Data locality. With HDFS, computation happens on the DataNodes
where the data resides, rather than having the data move to where
the computational unit is.
24. Benefits of using HDFS
• Cost effectiveness.
• Large data set storage.
• Fast recovery from hardware failure.
• Portability.
• Streaming data access.
25. HDFS use cases and examples
• The Hadoop Distributed File System emerged at Yahoo as a part of
that company's online ad placement and search engine
requirements.
• Like other web-based companies, Yahoo juggled a variety of
applications that were accessed by an increasing number of users,
who were creating more and more data.
• eBay, Facebook, LinkedIn and Twitter are among the companies
that used HDFS to underpin big data analytics to address
requirements similar to Yahoo's.
26. Moving data into and out of Hadoop
• Data movement is one of those things that you aren't likely to think
too much about until you're fully committed to using Hadoop on a
project, at which point it becomes this big scary unknown that has to
be tackled.
• Ingress and egress refer to data movement into and out of a system,
respectively.
Key elements of data movement
• Moving large quantities of data in and out of Hadoop poses logistical
challenges, including consistency guarantees and resource impacts
on data sources and destinations.
28. Key elements of data movement
• Idempotence: An idempotent operation produces the same result no
matter how many times it's executed.
• Aggregation: Performed to acquire the final result of the MapReduce
job, that is, combining the output of the Mapper and displaying the
result.
• Data format transformation: The data format transformation process
converts one data format into another.
• Monitoring: Ensures that functions are performing as expected in
automated systems.
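The value of idempotence during data movement is easy to demonstrate: if a delivery is retried after a partial failure, a keyed (idempotent) write leaves the destination correct, while a blind append duplicates data. A small illustrative sketch (the ingest functions are hypothetical, not part of any Hadoop API):

```python
def ingest_append(store, record):
    # Non-idempotent: every retry adds another copy of the record.
    store.append(record)

def ingest_keyed(store, record_id, record):
    # Idempotent: re-executing the same write changes nothing.
    store[record_id] = record

appended, keyed = [], {}
for _ in range(3):  # the same delivery retried three times
    ingest_append(appended, {"value": 42})
    ingest_keyed(keyed, "r1", {"value": 42})

print(len(appended), len(keyed))
# → 3 1
```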
29. MapReduce
• MapReduce is a programming model for writing applications that
can process big data in parallel on multiple nodes.
• MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.
• Traditional enterprise systems normally have a centralized server to
store and process data.
• The traditional model is certainly not suitable for processing huge volumes
of scalable data, which cannot be accommodated by standard database
servers.
• Moreover, the centralized system creates too much of a bottleneck
while processing multiple files simultaneously.
30. MapReduce Works
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce.
• The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
• The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.
• The reduce task is always performed after the map job.
32. MapReduce
• Input Phase - Here we have a RecordReader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
• Map - Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero or more
key-value pairs.
• Intermediate Keys - The key-value pairs generated by the mapper
are known as intermediate keys.
• Combiner - A combiner is a type of local reducer that groups
similar data from the map phase into identifiable sets.
• Shuffle and Sort - The Reducer task starts with the Shuffle and Sort
step.
33. MapReduce
• Reducer - The Reducer takes the grouped key-value paired data as
input and runs a Reducer function on each one of them.
• Here, the data can be aggregated, filtered, and combined in a number
of ways, and it can require a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to
the final step.
• Output Phase - In the output phase, we have an output formatter
that translates the final key-value pairs from the Reducer function
and writes them onto a file using a record writer.
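The phases above can be sketched end to end in Python, including the optional combiner acting as a local reducer so that less intermediate data crosses the network during Shuffle and Sort (an illustrative single-process sketch, not Hadoop code):

```python
from collections import Counter

def map_phase(record):
    # Map: break the record into (key, value) pairs.
    return [(word, 1) for word in record.split()]

def combiner(pairs):
    # Combiner: local reducer that pre-aggregates one mapper's output.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def shuffle_and_sort(all_pairs):
    # Shuffle and sort: group values by key across all mappers.
    groups = {}
    for key, value in all_pairs:
        groups.setdefault(key, []).append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values.
    return (key, sum(values))

records = ["big big data", "data lake"]          # one record per mapper
combined = [p for r in records for p in combiner(map_phase(r))]
result = [reduce_phase(k, vs) for k, vs in shuffle_and_sort(combined)]
print(result)
# → [('big', 2), ('data', 2), ('lake', 1)]
```

Note how the combiner turns the first mapper's three pairs into two before the shuffle; on a cluster this is what saves network traffic.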
35. MapReduce-Example
• Let us take a real-world example to comprehend the power of
MapReduce.
• Twitter receives around 500 million tweets per day, which is nearly
3000 tweets per second.
• The following illustration shows how Twitter manages its tweets
with the help of MapReduce.
36. MapReduce
• The MapReduce algorithm performs the following actions:
• Tokenize - Tokenizes the tweets into maps of tokens and writes
them as key-value pairs.
• Filter - Filters unwanted words from the maps of tokens and writes
the filtered maps as key-value pairs.
• Count - Generates a token counter per word.
• Aggregate Counters - Prepares an aggregate of similar counter
values into small manageable units.
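The four actions can be mimicked with a short Python sketch (the stop-word list and tweets are made up for illustration):

```python
import re
from collections import Counter

STOP_WORDS = {"is", "a", "the", "and"}  # illustrative "unwanted words"

def tokenize(tweet):
    # Tokenize: split a tweet into lowercase word tokens.
    return re.findall(r"[a-z]+", tweet.lower())

def filter_tokens(tokens):
    # Filter: drop unwanted words.
    return [t for t in tokens if t not in STOP_WORDS]

def count(token_lists):
    # Count + Aggregate Counters: one counter per word, summed overall.
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return totals

tweets = ["Hadoop is a framework", "MapReduce and Hadoop scale well"]
totals = count(filter_tokens(tokenize(t)) for t in tweets)
print(totals["hadoop"])
# → 2
```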
37. MapReduce - Algorithm
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce.
• The map task is done by means of the Mapper class.
• The reduce task is done by means of the Reducer class.
• The Mapper class takes the input, tokenizes it, maps and sorts it. The
output of the Mapper class is used as input by the Reducer class, which in
turn searches matching pairs and reduces them.
38. MapReduce - Algorithm
• MapReduce implements various mathematical algorithms to divide a
task into small parts and assign them to multiple systems. In
technical terms, the MapReduce algorithm helps in sending the Map &
Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following:
• Sorting
• Searching
• Indexing
• TF-IDF
39. MapReduce - Algorithm
Sorting
• Sorting is one of the basic MapReduce algorithms to process and
analyze data.
• MapReduce implements a sorting algorithm to automatically sort the
output key-value pairs from the mapper by their keys.
Searching
• Searching plays an important role in the MapReduce algorithm. It helps
in the combiner phase (optional) and in the Reducer phase.
• Let us try to understand how searching works with the help of an
example.
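As an illustrative sketch of that idea, the following Python code mirrors the classic "highest salary" search: each mapper emits its local best candidate under a common key, and the reducer that receives the group picks the overall winner (the names, figures and function names are made up, not Hadoop API):

```python
def map_search(records):
    # Each mapper emits its local maximum under a shared key, so the
    # reducer only has to compare a handful of candidates.
    best = max(records, key=lambda r: r[1])
    return ("max_salary", best)

def reduce_search(key, candidates):
    # Reducer: pick the overall winner from the per-mapper candidates.
    return max(candidates, key=lambda r: r[1])

# Two input splits, as if processed by two mappers.
split_a = [("gopal", 50000), ("kiran", 45000)]
split_b = [("manisha", 38000), ("satish", 65000)]

candidates = [map_search(split_a)[1], map_search(split_b)[1]]
print(reduce_search("max_salary", candidates))
# → ('satish', 65000)
```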
40. MapReduce - Algorithm
Indexing
• Normally, indexing is used to point to a particular datum and its address.
• It performs batch indexing on the input files for a particular Mapper.
TF-IDF
• TF-IDF is a text-processing algorithm which is short for Term Frequency -
Inverse Document Frequency.
• It is one of the common web analysis algorithms. Here, the term 'frequency'
refers to the number of times a term appears in a document.
Term Frequency (TF)
• It measures how frequently a particular term occurs in a document.
• It is calculated as the number of times a word appears in a document
divided by the total number of words in that document.
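The definition translates directly into code. A short sketch of the TF calculation (the sample document is made up for illustration):

```python
def term_frequency(term, document):
    # TF = (occurrences of the term) / (total words in the document)
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

doc = "hive reads and writes large data sets and hive uses hql"
print(round(term_frequency("hive", doc), 3))
# → 0.182  ("hive" appears 2 times out of 11 words)
```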
41. MapReduce Architecture
• MapReduce and HDFS are the two major components
of Hadoop which make it so powerful and efficient to use.
• MapReduce is a programming model used for efficient processing in
parallel over large data sets in a distributed manner.
• The data is first split and then combined to produce the final
result.
• The purpose of MapReduce in Hadoop is to map each job and
then reduce it to equivalent tasks, providing less
overhead over the cluster network and reducing the processing
power required.
• The MapReduce task is mainly divided into two phases: the Map
phase and the Reduce phase.
43. Components of MapReduce Architecture:
• Client: The MapReduce client is the one who brings the job to
MapReduce for processing.
• Job: The MapReduce job is the actual work that the client wants to
do, comprised of many smaller tasks that the client wants
to process or execute.
• Hadoop MapReduce Master: It divides the particular job into
subsequent job parts.
• Job Parts: The tasks or sub-jobs that are obtained after dividing the
main job.
• Input Data: The data set that is fed to MapReduce for processing.
• Output Data: The final result, obtained after the processing.
44. Job tracker and the task tracker
• Job Tracker: The work of the Job Tracker is to manage all the resources
and all the jobs across the cluster, and also to schedule each map on
the Task Tracker running on the same data node, since there can be
hundreds of data nodes available in the cluster.
• Task Tracker: The Task Tracker can be considered the actual
slave that works on the instructions given by the Job Tracker.
A Task Tracker is deployed on each of the nodes available in the
cluster and executes the Map and Reduce tasks as instructed by the Job
Tracker.
46. INPUTS AND OUTPUTS IN MAPREDUCE
Data input:
• The two classes that support data input in MapReduce are
InputFormat and RecordReader.
• The InputFormat class is consulted to determine how the input data
should be partitioned for the map tasks, and the RecordReader
performs the reading of data from the inputs.
47. INPUTS AND OUTPUTS IN MAPREDUCE
Data output:
• MapReduce uses a similar process for supporting output data as it
does for input data.
• Two classes must exist, an OutputFormat and a RecordWriter.
• The OutputFormat performs some basic validation of the data sink
properties, and the RecordWriter writes each reducer output to the
data sink.
48. Anatomy of a MapReduce Job Run
• You can run a MapReduce job with a single method call: submit() on
a Job object (you can also call waitForCompletion(),
which submits the job if it hasn't been submitted already, then waits
for it to finish).
• This method call conceals a great deal of processing behind the
scenes. This section uncovers the steps Hadoop takes to run a job.
49. Anatomy of a MapReduce Job Run
At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute
resources on the cluster.
• The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running
the MapReduce job. The application master and the MapReduce tasks run
in containers that are scheduled by the resource manager and managed
by the node managers.
• The distributed filesystem (normally HDFS), which is used for sharing job
files between the other entities.
51. Failures in Classic MapReduce
• In the MapReduce 1 runtime there are three failure modes to
consider: failure of the running task, failure of the tasktracker, and
failure of the jobtracker. Let's look at each in turn.
Task Failure
• Consider first the case of the child task failing.
• The most common way that this happens is when user code in the
map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
tasktracker before it exits. The error ultimately makes it into the
user logs.
• The tasktracker marks the task attempt as failed, freeing up a slot to
run another task.
52. Failures in Classic MapReduce
Tasktracker Failure
• Failure of a tasktracker is another failure mode.
• If a tasktracker fails by crashing, or runs very slowly, it will stop
sending heartbeats to the jobtracker (or send them very
infrequently).
• The jobtracker will notice a tasktracker that has stopped sending
heartbeats.
• A tasktracker can also be blacklisted by the jobtracker, even if the
tasktracker has not failed.
53. Failures in Classic MapReduce
Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode.
• Hadoop has no mechanism for dealing with failure of the
jobtracker (it is a single point of failure), so in this case the job fails.
• However, this failure mode has a low chance of occurring, since the
chance of a particular machine failing is low.
• After restarting a jobtracker, any jobs that were running at the time it
was stopped will need to be re-submitted.