CSA15-CLOUD COMPUTING AND BIG DATA ANALYTICS
Dr. J. Praveenchandar
Associate Professor/CSE
HADOOP AND MAPREDUCE ARCHITECTURE
Big data – Apache Hadoop & Hadoop EcoSystem – Analyzing data
with Hadoop streaming – HDFS concept – Interface to HDFS -
Moving Data in and out of Hadoop –Introduction to MapReduce –
MapReduce Algorithm and Architecture – Understanding inputs
and outputs of MapReduce - Anatomy of MapReduce Job run –
Failures in classical MapReduce and YARN – Job scheduling - Data
Serialization.
Big data
• Big Data is a collection of data that is huge in volume, yet grows
exponentially with time.
• Its size and complexity are so great that no traditional data
management tool can store or process it efficiently.
• In short, big data is still data, just at a huge scale.
Examples of Big Data
• The New York Stock Exchange
• Social media
Apache Hadoop & Hadoop Ecosystem
• Apache Hadoop is an open-source framework
intended to make working with big data easier.
• Hadoop has found its place in industries and
companies that need to work on large, sensitive
data sets that require efficient handling.
• Hadoop is a framework that enables processing of
large data sets stored across clusters of machines.
• Being a framework, Hadoop is made up of several
modules that are supported by a large ecosystem of
technologies.
A multi-node Hadoop cluster
Apache Hadoop & Hadoop Ecosystem
• The Hadoop Ecosystem is a platform, or a suite, that provides various
services to solve big data problems.
There are four major elements of Hadoop:
• HDFS,
• MapReduce,
• YARN, and
• Hadoop Common.
Components that collectively form a Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• ZooKeeper: Managing the cluster
• Oozie: Job Scheduling
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and
is responsible for storing large data sets of structured or
unstructured data across various nodes and thereby maintaining the
metadata in the form of log files.
• HDFS consists of two core components:
 Name node
 Data Node
YARN:
• Yet Another Resource Negotiator: as the name implies, YARN is the
one that helps manage the resources across the cluster. In short,
it performs scheduling and resource allocation for the Hadoop
system.
• It consists of three major components:
 Resource Manager
 Node Manager
 Application Master
MapReduce:
• By making use of distributed and parallel algorithms, MapReduce
carries the processing logic over the data and helps to write
applications that transform big data sets into manageable ones.
• MapReduce makes use of two functions, Map() and Reduce(), whose
tasks are:
 Map() performs sorting and filtering of the data, organizing it into groups.
Map generates key-value pair based results, which are later processed by
the Reduce() method.
 Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In short, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples.
PIG:
• Pig was developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
• It is a platform for structuring the data flow and for processing and
analyzing huge data sets.
• Pig does the work of executing the commands, and in the background all
the activities of MapReduce are taken care of.
• After the processing, Pig stores the result in HDFS.
HIVE:
• With the help of an SQL methodology and interface, HIVE performs
reading and writing of large data sets.
• Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch
processing.
• All the SQL data types are supported by Hive, making query
processing easier.
• Like other query-processing frameworks, HIVE comes with
two components: JDBC Drivers and the HIVE Command Line.
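As a minimal sketch of the JDBC side (assuming HiveServer2 is reachable at the hypothetical host hiveserver, the Hive JDBC driver is on the classpath, and a visits table exists), a Java client runs HQL like any other SQL-over-JDBC query:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (optional with JDBC 4 auto-loading).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // 10000 is HiveServer2's default port; host, user, and table are assumptions.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM visits GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```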
Apache Spark:
• It’s a platform that handles all the processing-heavy tasks, such as
batch processing, interactive or iterative real-time processing, graph
conversions, and visualization.
• It processes data in memory, which makes it faster than the prior
approaches in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited
for structured data and batch processing; hence, most companies use
the two side by side.
Hadoop Streaming
• Hadoop Streaming is a utility that comes with the Hadoop distribution
and allows developers or programmers to write Map-Reduce programs
in different programming languages like Ruby, Perl, Python, and C++.
• For example, if we are reading image data, we can generate a key-value
pair for each pixel, where the key is the location of the pixel and the
value is its color value (0-255 for a colored image).
• This list of key-value pairs is fed to the Map phase. The Mapper works
on each key-value pair and generates intermediate key-value pairs,
which are fed to the Reducer after shuffling and sorting; the final
output produced by the Reducer is written to HDFS.
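Streaming treats any executable that reads lines from standard input and writes tab-separated key/value lines to standard output as a mapper or reducer. As a minimal sketch (kept in Java for consistency with the other examples here, though Python or Ruby are more typical for streaming jobs), a word-count streaming mapper could look like this:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper: reads raw text lines from stdin and emits
// tab-separated (word, 1) pairs on stdout for the framework to shuffle.
public class StreamingWordMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```

Packaged as an executable, a program like this would be passed to the streaming jar through its -mapper option, with a similar stdin/stdout program as the -reducer.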
HDFS concept
• The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications.
• HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance
access to data across highly scalable Hadoop clusters.
• HDFS enables the rapid transfer of data between compute nodes.
• At the outset, HDFS was closely coupled with MapReduce, a framework
for data processing that filters and divides up work among the nodes
in a cluster and then organizes and condenses the results into a
cohesive answer to a query.
HDFS
• HDFS uses a primary/secondary architecture.
• The HDFS cluster's NameNode is the primary server that manages
the file system namespace and controls client access to files.
• As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and
provides clients with the right access permissions.
• The system's DataNodes manage the storage that's attached to the
nodes they run on.
HDFS
• HDFS exposes a file system namespace and enables user data to be
stored in files.
• A file is split into one or more blocks, which are stored in a set of
DataNodes.
• The NameNode performs file system namespace operations,
including opening, closing and renaming files and directories.
• The NameNode also governs the mapping of blocks to the
DataNodes.
• The DataNodes serve read and write requests from the clients of the
file system. In addition, they perform block creation, deletion and
replication when the NameNode instructs them to do so.
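The most direct programmatic interface to HDFS is the Java FileSystem API, which wraps exactly the operations described above. A minimal sketch follows; the NameNode URI and file path are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the cluster's NameNode; this URI is an assumption.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");
        // The NameNode resolves the path; the data itself goes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
            out.writeUTF("Hello, HDFS!");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```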
HDFS
• HDFS supports a traditional hierarchical file organization.
• An application or user can create directories and then store files
inside these directories.
• The file system namespace hierarchy is like most other file systems --
a user can create, remove, rename or move files from one directory
to another.
Features of HDFS
There are several features that make HDFS particularly useful,
including:
• Data replication. This is used to ensure that the data is always
available and prevents data loss. For example, when a node crashes or
there is a hardware failure, replicated data can be pulled from
elsewhere within the cluster, so processing continues while the data
is recovered.
• Fault tolerance and reliability. HDFS' ability to replicate file blocks
and store them across nodes in a large cluster ensures fault tolerance
and reliability.
• High availability. As mentioned earlier, because of replication across
nodes, data is available even if the NameNode or a DataNode fails.
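Replication is a per-file setting (the default factor is 3). As a small sketch, assuming the HDFS Java API and a made-up file path, the factor can be inspected and changed programmatically:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt"); // path is an assumption

        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Ask HDFS to keep three copies of each block of this file.
        fs.setReplication(file, (short) 3);
    }
}
```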
Features of HDFS
• Scalability. Because HDFS stores data on various nodes in the
cluster, as requirements increase, a cluster can scale to hundreds of
nodes.
• High throughput. Because HDFS stores data in a distributed
manner, the data can be processed in parallel on a cluster of nodes.
This, plus data locality (see the next bullet), cuts the processing time
and enables high throughput.
• Data locality. With HDFS, computation happens on the DataNodes
where the data resides, rather than having the data move to where
the computational unit is.
Benefits of using HDFS
• Cost effectiveness.
• Large data set storage.
• Fast recovery from hardware failure.
• Portability.
• Streaming data access.
HDFS use cases and examples
• The Hadoop Distributed File System emerged at Yahoo as a part of
that company's online ad placement and search engine
requirements.
• Like other web-based companies, Yahoo juggled a variety of
applications that were accessed by an increasing number of users,
who were creating more and more data.
• EBay, Facebook, LinkedIn and Twitter are among the companies
that used HDFS to underpin big data analytics to address
requirements similar to Yahoo's.
Moving data into and out of Hadoop
• Data movement is one of those things that you aren’t likely to think
too much about until you’re fully committed to using Hadoop on a
project, at which point it becomes this big scary unknown that has to
be tackled.
• Ingress and egress refer to data movement into and out of a system,
respectively.
Key elements of data movement
• Moving large quantities of data in and out of Hadoop poses logistical
challenges, including consistency guarantees and resource impacts
on data sources and destinations.
Key elements of data movement
• Idempotence: An idempotent operation produces the same result no
matter how many times it's executed (a small sketch follows this list).
• Aggregation: Aggregation combines output, such as combining the
output of the mappers to produce the final result of a MapReduce job.
• Data format transformation: The data format transformation process
converts one data format into another.
• Monitoring: Monitoring ensures that functions are performing as
expected in automated systems.
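As a sketch of idempotent ingest (assuming the HDFS Java API; the staging and destination paths are made up), writing to a temporary file and then renaming it into place means that re-running the whole operation leaves exactly one copy of the data, never duplicates:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IdempotentIngest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp = new Path("/staging/orders.tmp"); // hypothetical staging path
        Path dst = new Path("/data/orders.csv");    // hypothetical final path

        // Write to a temporary file first; overwrite any stale attempt.
        try (FSDataOutputStream out = fs.create(tmp, true)) {
            out.writeBytes("id,amount\n1,100\n");
        }
        // Remove the old copy (if any), then rename. Re-running the whole
        // block yields the same end state: exactly one copy of the data at dst.
        fs.delete(dst, false);
        fs.rename(tmp, dst); // rename is atomic in HDFS
    }
}
```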
MapReduce
• MapReduce is a programming model for writing applications that
can process Big Data in parallel on multiple nodes.
• MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.
• Traditional enterprise systems normally have a centralized server to
store and process data.
• The traditional model is certainly not suitable for processing huge
volumes of scalable data, which cannot be accommodated by standard
database servers.
• Moreover, the centralized system creates too much of a bottleneck
while processing multiple files simultaneously.
How MapReduce Works
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce.
• The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
• The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.
• The reduce task is always performed after the map job.
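As a concrete sketch of these two tasks, here is the classic word count written against the Hadoop Java MapReduce API (org.apache.hadoop.mapreduce); the class and variable names are our own:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into tokens and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the 1s for each word into one (word, total) pair.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

The mapper emits a (word, 1) pair per token; the framework groups the pairs by key, and the reducer collapses each group into a single (word, total) pair.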
MapReduce
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero or more
key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper
are known as intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups
similar data from the map phase into identifiable sets.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort
step.
MapReduce
• Reducer − The Reducer takes the grouped key-value paired data as
input and runs a Reducer function on each one of them.
• Here, the data can be aggregated, filtered, and combined in a number
of ways, which can require a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to
the final step.
• Output Phase − In the output phase, we have an output formatter
that translates the final key-value pairs from the Reducer function
and writes them onto a file using a record writer.
Map & Reduce
MapReduce-Example
• Let us take a real-world example to comprehend the power of
MapReduce.
• Twitter receives around 500 million tweets per day, which is nearly
6,000 tweets per second on average.
• The following illustration shows how Twitter manages its tweets
with the help of MapReduce.
MapReduce
• MapReduce algorithm performs the following actions −
• Tokenize − Tokenizes the tweets into maps of tokens and writes
them as key-value pairs.
• Filter − Filters unwanted words from the maps of tokens and writes
the filtered maps as key-value pairs.
• Count − Generates a token counter per word.
• Aggregate Counters − Prepares an aggregate of similar counter
values into small manageable units.
MapReduce - Algorithm
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce.
• The map task is done by means of the Mapper class.
• The reduce task is done by means of the Reducer class.
• The Mapper class takes the input, tokenizes it, and maps and sorts it.
The output of the Mapper class is used as input by the Reducer class,
which in turn searches for matching pairs and reduces them.
MapReduce - Algorithm
• MapReduce implements various mathematical algorithms to divide a
task into small parts and assign them to multiple systems. In
technical terms, the MapReduce algorithm helps send the Map and
Reduce tasks to the appropriate servers in a cluster.
These mathematical algorithms may include the following −
• Sorting
• Searching
• Indexing
• TF-IDF
MapReduce - Algorithm
Sorting
• Sorting is one of the basic MapReduce algorithms used to process and
analyze data.
• MapReduce implements a sorting algorithm to automatically sort the
output key-value pairs from the mapper by their keys.
Searching
• Searching plays an important role in the MapReduce algorithm. It
helps in the (optional) combiner phase and in the Reducer phase.
• Let us try to understand how searching works with the help of the
example sketched below.
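As a small illustration, a grep-style map-side search emits only the records that contain a search term; a combiner or reducer can then narrow the matches further. The configuration key "search.term" is made up for this sketch:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side search: emit only the input records that contain the term.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String term;

    @Override
    protected void setup(Context context) {
        // "search.term" is a hypothetical key the driver would set.
        term = context.getConfiguration().get("search.term", "");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(term)) {
            context.write(value, NullWritable.get()); // matched record as the key
        }
    }
}
```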
MapReduce - Algorithm
Indexing
• Normally, indexing is used to point to particular data and its address.
• It performs batch indexing on the input files for a particular Mapper.
TF-IDF
• TF-IDF is a text processing algorithm which is short for Term Frequency −
Inverse Document Frequency.
• It is one of the common web analysis algorithms. Here, the term 'frequency'
refers to the number of times a term appears in a document.
Term Frequency (TF)
• It measures how frequently a particular term occurs in a document.
• It is calculated by the number of times a word appears in a document
divided by the total number of words in that document.
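As a worked sketch using the standard formulas (where f_{t,d} is the count of term t in document d, N is the number of documents, and n_t is the number of documents containing t; the numbers below are invented for illustration):

```latex
\mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{TF\text{-}IDF}(t,d) = \mathrm{TF}(t,d)\cdot\mathrm{IDF}(t)
```

For example, if "hadoop" appears 3 times in a 100-word document, TF = 3/100 = 0.03; if 10 of 10,000 documents contain "hadoop", IDF = log10(10,000/10) = 3, giving a TF-IDF weight of 0.03 × 3 = 0.09.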
MapReduce Architecture
• MapReduce and HDFS are the two major components
of Hadoop that make it so powerful and efficient to use.
• MapReduce is a programming model used for efficient parallel
processing over large data sets in a distributed manner.
• The data is first split and then combined to produce the final
result.
• The purpose of MapReduce in Hadoop is to map each job and then
reduce it to equivalent tasks, which provides less overhead over the
cluster network and reduces the processing power required.
• The MapReduce task is mainly divided into two phases, the Map
phase and the Reduce phase.
Components of MapReduce Architecture:
• Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing.
• Job: The MapReduce Job is the actual work that the client wants
done, which comprises the many smaller tasks that the client wants
to process or execute.
• Hadoop MapReduce Master: It divides the particular job into
subsequent job-parts.
• Job-Parts: The tasks or sub-jobs that are obtained after dividing the
main job.
• Input Data: The data set that is fed to MapReduce for processing.
• Output Data: The final result obtained after the processing.
Job tracker and the task tracker
• Job Tracker: The work of the Job Tracker is to manage all the
resources and all the jobs across the cluster, and to schedule each map
on a Task Tracker running on the same data node, since there can be
hundreds of data nodes available in the cluster.
• Task Tracker: The Task Trackers can be considered the actual
slaves that work on the instructions given by the Job Tracker.
A Task Tracker is deployed on each of the nodes available in the
cluster and executes the Map and Reduce tasks as instructed by the
Job Tracker.
UNDERSTANDING INPUTS AND OUTPUTS IN MAPREDUCE
INPUTS AND OUTPUTS IN MAPREDUCE
• Data input :-
• The two classes that support data input in MapReduce are
InputFormat and RecordReader.
• The InputFormat class is consulted to determine how the input data
should be partitioned for the map tasks, and the RecordReader
performs the reading of data from the inputs.
INPUTS AND OUTPUTS IN MAPREDUCE
Data output :-
• MapReduce uses a similar process for supporting output data as it
does for input data.
• Two classes must exist, an OutputFormat and a RecordWriter.
• The OutputFormat performs some basic validation of the data sink
properties, and the RecordWriter writes each reducer output to the
data sink.
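As a small sketch of how these classes are wired into a job (TextInputFormat and TextOutputFormat are the line-oriented formats in the Hadoop Java API; the helper class name is ours):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfig {
    public static void configure(Job job) {
        // TextInputFormat partitions input files into line-based splits; its
        // RecordReader hands each line to the mapper as (byte offset, line text).
        job.setInputFormatClass(TextInputFormat.class);
        // TextOutputFormat's RecordWriter writes "key<TAB>value" lines to the sink.
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}
```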
Anatomy of a MapReduce Job Run
• You can run a MapReduce job with a single method call: submit() on
a Job object. (You can also call waitForCompletion(), which submits
the job if it hasn't been submitted already and then waits for it to
finish.)
• This method call conceals a great deal of processing behind the
scenes. This section uncovers the steps Hadoop takes to run a job.
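As a sketch of such a driver (reusing the hypothetical WordCount mapper and reducer from the earlier example; the input and output paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job (if needed) and blocks until it ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```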
Anatomy of a MapReduce Job Run
At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute
resources on the cluster.
• The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running
the MapReduce job. The application master and the MapReduce tasks run
in containers that are scheduled by the resource manager and managed
by the node managers.
• The distributed filesystem (normally HDFS), which is used for sharing job
files between the other entities.
Failures in Classic MapReduce
• In the MapReduce 1 runtime there are three failure modes to
consider: failure of the running task, failure of the tasktracker, and
failure of the jobtracker. Let's look at each in turn.
Task Failure
• Consider first the case of the child task failing.
• The most common way that this happens is when user code in the
map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
tasktracker, before it exits. The error ultimately makes it into the
user logs.
• The tasktracker marks the task attempt as failed, freeing up a slot to
run another task.
Failures in Classic MapReduce
Tasktracker Failure
• Failure of a tasktracker is another failure mode.
• If a tasktracker fails by crashing, or running very slowly, it will stop
sending heartbeats to the jobtracker (or send them very
infrequently).
• The jobtracker will notice a tasktracker that has stopped sending
heartbeats.
• A tasktracker can also be blacklisted by the jobtracker, even if the
tasktracker has not failed.
Failures in Classic MapReduce
Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode.
• Hadoop has no mechanism for dealing with failure of the
jobtracker—it is a single point of failure—so in this case the job fails.
• However, this failure mode has a low chance of occurring, since the
chance of a particular machine failing is low.
• After restarting a jobtracker, any jobs that were running at the time it
was stopped will need to be re-submitted.