MAPREDUCE: SIMPLIFIED
DATA PROCESSING ON
LARGE CLUSTERS
AUTHORS: JEFFREY DEAN AND SANJAY GHEMAWAT
PRESENTED BY: DAKSH GOTI
DATE: 9/11/2024
ID: 032172125
GUIDED BY: DR. HAILU XU
INTRODUCTION
o Many tasks in large-scale data processing consist of:
o Computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is so massive, the computation must be distributed across hundreds or thousands of
machines to finish in a reasonable amount of time.
o At Google, such computations (over crawled documents, web request logs, and similar data sets) were originally
written as special-purpose programs, each handling its own parallelization, data distribution, and failure recovery.
o That approach leads to complex code that obscures the simple underlying computation.
o Jeffrey Dean and Sanjay Ghemawat proposed MapReduce, which simplifies data processing by
hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
INTRODUCTION
TO HADOOP
INTRODUCTION
 Hadoop is an Apache open-source framework, written in Java, that
allows distributed processing of large datasets across clusters of
computers using simple programming models. A Hadoop
application runs in an environment that provides distributed
storage and computation across those clusters.
HADOOP
ARCHITECTURE
At its core, Hadoop has two major
layers:
1. Processing/Computation layer
(MapReduce)
2. Storage layer (Hadoop Distributed
File System, HDFS)
Beyond these two core components,
the Hadoop framework also includes
the following two modules.
1.Hadoop Common - These are Java
libraries and utilities required by other
Hadoop modules.
2.Hadoop YARN - This is a framework
for job scheduling and cluster
resource management.
HOW DOES HADOOP WORK?
Hadoop runs user code across a cluster of computers. This process includes the following core tasks that Hadoop
performs:
• Data is initially organized into directories and files. Files are divided into uniformly sized blocks of 128 MB or
64 MB (128 MB is preferred), and these blocks are distributed across cluster nodes for further processing.
• HDFS, sitting on top of each node's local file system, supervises the processing.
• Blocks are replicated to tolerate hardware failure.
• Hadoop checks that the code executed successfully.
• It performs the sort that takes place between the map and reduce stages.
• It sends the sorted data to the appropriate machines.
• It writes debugging logs for each job.
ADVANTAGES OF HADOOP
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it
automatically distributes the data and work across the machines, thereby utilizing the underlying parallelism of the
CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop
library itself has been designed to detect and handle failures at the application layer.
Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without
interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms
since it is Java based.
HADOOP ECOSYSTEM
1. HDFS:
HDFS is the primary component
of the Hadoop ecosystem and is responsible
for storing large data sets of structured or
unstructured data across various nodes,
while maintaining the metadata in
the form of log files.
HDFS consists of two core components:
• Name Node
• Data Node
2. YARN:
Yet Another Resource Negotiator: as the
name implies, YARN helps
manage the resources across the
cluster.
In short, it performs scheduling and
resource allocation for the Hadoop
system. It consists of three major
components:
• Resource Manager
• Node Manager
• Application Master
Advantages of HDFS:
It is inexpensive, immutable in nature,
stores data reliably, tolerates
faults, scales well, is block-structured,
and can process a large amount of data
in parallel, among other benefits.
Disadvantages of HDFS:
Its biggest disadvantage is that it is
not a good fit for small quantities of data.
It also has some stability concerns and
can be restrictive in nature.
Hadoop also supports a wide range of
software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache
Sqoop, Apache Spark, Apache Storm,
Apache Pig, Apache Hive, Apache Phoenix,
and Cloudera Impala.
INTRODUCTION
• Overview:
MapReduce is a programming model and an associated implementation
designed to simplify the processing of large data sets on clusters of
commodity hardware.
• Purpose:
It enables automatic parallelization, fault tolerance, data distribution, and
load balancing.
• Context:
Extensively used at Google for various large-scale computations, such as
processing web crawls, logs, and generating derived data sets.
• Provides:
 User-defined functions
 Automatic parallelization and distribution
 Fault-tolerance
 I/O scheduling
 Status and monitoring
PROGRAMMING MODEL
Map Function:
Takes an input key/value pair and
produces a set of intermediate
key/value pairs.
Reduce Function:
Merges all intermediate values
associated with the same
intermediate key to produce the
final output.
Key Features:
• Simplicity in defining
computations.
• Automatic parallelization
and fault-tolerance.
• High scalability.
EXAMPLE OF MAPREDUCE
Word Count Example:
•Map Function: For each word in the document, emit (word, 1).
•Reduce Function: Sum up all values for each word to get the total count.
Pseudo-code:
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
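The pseudo-code above can be exercised end to end with a minimal, single-process Python sketch (illustrative only; a real MapReduce or Hadoop job runs the map and reduce functions on many worker machines):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def word_count(documents):
    intermediate = defaultdict(list)
    # Map phase, with the shuffle folded in: group intermediate values by key.
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            intermediate[word].append(count)
    # Reduce phase: apply reduce_fn once per intermediate key.
    return dict(reduce_fn(w, c) for w, c in intermediate.items())

if __name__ == "__main__":
    docs = {"file1": "the quick brown fox", "file2": "the lazy dog jumps over the fox"}
    print(word_count(docs))   # e.g. {'the': 3, 'fox': 2, ...}
```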
[Diagram: word count data flow — input files (input file 1, input file 2) are split into lines; each line is passed to an
individual mapper instance, which emits (key, value) pairs; the pairs are sorted and shuffled by key; reducers
aggregate the values for each key; and the final result is written to an output file.]
EXAMPLE:
 Distributed Grep
 The map function emits a line if it matches a supplied pattern
 Count of URL access frequency.
 The map function processes logs of web page requests and outputs <URL, 1>
 Reverse web-link graph
 The map function outputs <target, source> pairs for each link to a target URL found in a page named source
 Term-Vector per Host
 A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs
 Inverted Index
 The map function parses each document, and emits a sequence of (word, document ID) pairs
 Distributed Sort
 The map function extracts the key from each record, and emits a (key, record) pair
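As a concrete illustration of two of these patterns, here is a minimal Python sketch of the map (and, for URL frequency, reduce) functions; the log-line format is an assumption made only for this example:

```python
import re
from collections import defaultdict

# Distributed grep: the map function emits a line only if it matches the pattern;
# the reduce function is simply the identity.
def grep_map(filename, line, pattern):
    if re.search(pattern, line):
        yield (filename, line)

# Count of URL access frequency: map emits <URL, 1>, reduce sums the counts.
def url_map(log_name, log_line):
    url = log_line.split()[1]        # assumes "METHOD URL ..." log lines
    yield (url, 1)

def url_reduce(url, counts):
    return (url, sum(counts))

if __name__ == "__main__":
    log = ["GET /index.html", "GET /about.html", "GET /index.html"]
    hits = [m for line in log for m in grep_map("access_log", line, r"about")]
    print(hits)                          # [('access_log', 'GET /about.html')]
    grouped = defaultdict(list)
    for line in log:
        for url, one in url_map("access_log", line):
            grouped[url].append(one)
    print(dict(url_reduce(u, c) for u, c in grouped.items()))
    # {'/index.html': 2, '/about.html': 1}
```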
SYSTEM
ARCHITECTURE
•Components:
•Master Node: Coordinates the execution of map and reduce tasks,
manages task assignments.
•Worker Nodes: Perform map and reduce operations.
•Data Flow:
1.Input data is split into chunks.
2.Map tasks process these chunks and produce intermediate data.
3.Reduce tasks aggregate the intermediate data and produce the
final output.
•Diagram: Include a diagram of the execution overview.
[Figure: execution overview — the user program calls mapreduce(spec, &result); the input is divided into M splits
of 16-64 MB each; intermediate keys are partitioned into R regions by a partitioning function such as
hash(intermediate_key) mod R; each reduce worker reads all intermediate data for its region and sorts it by
intermediate key.]
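The default partitioning function is simply a hash of the intermediate key modulo R. A minimal sketch (a stable hash is used here so the example is reproducible; the hash function used by the actual library may differ):

```python
import hashlib

def partition(intermediate_key: str, R: int) -> int:
    """Route an intermediate key to one of R reduce regions: hash(key) mod R."""
    digest = hashlib.md5(intermediate_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % R

if __name__ == "__main__":
    R = 4
    for key in ["apple", "banana", "cherry", "apple"]:
        print(key, "-> region", partition(key, R))   # identical keys land in the same region
```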
EXECUTION
OVERVIEW
•Execution Steps:
1.Input files are split into M pieces.
2.The master assigns map and reduce tasks to
worker nodes.
3.Map Phase: Workers read input splits, process
data, and write intermediate results to local disks.
4.Reduce Phase: Workers fetch intermediate
results, sort them by key, and execute the reduce
function.
5.Results are written to output files.
•Visualization: Include Figure 1 from the paper
(execution overview).
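These five steps can be traced with a toy, single-process Python sketch (illustrative only, not the Google or Hadoop implementation; the nested lists below stand in for the R partitioned files on each map worker's local disk):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    """Toy, single-process walk-through of the five execution steps."""
    # 1. Split the input records into M pieces.
    splits = [records[i::M] for i in range(M)]
    # 2-3. Map phase: each map "worker" processes one split and writes its
    #      intermediate pairs into R local partitions.
    local = [[defaultdict(list) for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                local[m][hash(ikey) % R][ikey].append(ivalue)
    # 4. Reduce phase: each reduce "worker" fetches its partition from every map
    #    worker, merges and sorts by intermediate key, and applies the reduce function.
    output = []
    for r in range(R):
        merged = defaultdict(list)
        for m in range(M):
            for ikey, ivalues in local[m][r].items():
                merged[ikey].extend(ivalues)
        # 5. Append this reduce worker's results to the final output.
        output.extend(reduce_fn(k, vs) for k, vs in sorted(merged.items()))
    return output

if __name__ == "__main__":
    lines = [("line%d" % i, text) for i, text in enumerate(["a b a", "b c", "a c c"])]
    wc_map = lambda key, value: [(w, 1) for w in value.split()]
    wc_reduce = lambda key, values: (key, sum(values))
    print(run_mapreduce(lines, wc_map, wc_reduce))   # counts: a=3, b=2, c=3 (order depends on partitioning)
```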
FAULT
TOLERANCE
•Worker Failures:
•The master periodically pings workers. If a worker stops
responding, its in-progress tasks are reset and reassigned.
•Completed map tasks on failed workers are re-executed, because
their output lives on the failed machine's local disk; completed
reduce tasks are not, because their output is already in the global
file system.
•Master Failure:
•Rare; the job is aborted and the client can restart the MapReduce
computation.
•Atomic Operations:
•Consistency is ensured through atomic renaming of temporary
output files.
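A minimal sketch of the heartbeat logic (all names and the timeout value are hypothetical; this is not the paper's or Hadoop's actual code):

```python
PING_TIMEOUT = 10.0   # seconds of silence before a worker is presumed dead (assumed value)

workers = {"w1": 14.0, "w3": 2.0}   # worker -> time of last successful ping
tasks = {
    "map-07":    {"kind": "map",    "state": "in_progress", "worker": "w3"},
    "map-02":    {"kind": "map",    "state": "completed",   "worker": "w3"},
    "reduce-01": {"kind": "reduce", "state": "completed",   "worker": "w1"},
}

def check_workers(now: float) -> None:
    """Reschedule tasks whose worker has stopped answering pings."""
    dead = {w for w, last_ping in workers.items() if now - last_ping > PING_TIMEOUT}
    for task_id, task in tasks.items():
        if task["worker"] in dead:
            # In-progress tasks are always reset; completed map tasks are reset too,
            # since their output sits on the dead worker's local disk.
            if task["state"] == "in_progress" or task["kind"] == "map":
                task["state"], task["worker"] = "idle", None
                print(task_id, "will be re-executed")

check_workers(now=15.0)   # resets map-07 and map-02; reduce-01 is left alone
```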
PERFORMANCE OPTIMIZATION
•Locality Optimization:
•Map tasks are scheduled on nodes that store the input data locally.
•Backup Tasks:
•When a job is close to completion, the master schedules backup (duplicate)
executions of the remaining in-progress tasks to mitigate "stragglers" and reduce
completion time.
•Combiner Function:
•Reduces data size during the map phase by merging intermediate
data locally.
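For word count, the combiner is essentially the reduce function applied locally on each map worker; a minimal sketch of the effect (illustrative only):

```python
from collections import defaultdict

def word_count_map(doc: str):
    return [(word, 1) for word in doc.split()]

def combine(pairs):
    """Partially merge (word, count) pairs on the map worker before they cross the network."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

if __name__ == "__main__":
    raw = word_count_map("the the the quick fox the")
    print(len(raw), "pairs before combining,", len(combine(raw)), "after")   # 6 -> 3
```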
PERFORMANCE
MEASURE
Measure the performance of
MapReduce on two computations
running on a large cluster of machines.
 Grep
 searches through approximately
one terabyte of data looking for a
particular pattern
 Sort
 sorts approximately one terabyte
of data
GREP COMPUTATION
 Scans 10 billion 100-byte records (roughly 1 TB), searching for a
rare 3-character pattern (the pattern occurs in 92,337
records).
 The input is split into approximately 64 MB pieces, so M = 15,000
(1 TB / 64 MB ≈ 15,000); the entire output is placed in one file, so R = 1.
 Startup overhead is significant for such short jobs.
[Figure: data transfer rate over time]
SORT COMPUTATION
 Backup tasks noticeably improve completion time.
 The system handles machine failures and recovers relatively quickly.
[Figure: data transfer rates over time for different executions of the sort program]
USE CASES AND APPLICATIONS
Large-Scale Indexing:
Used in Google’s production indexing
system for search.
Data Mining and Machine
Learning:
Examples include clustering and
classification tasks.
Log Analysis and Web Graph
Processing:
Efficiently processes large-scale logs and
builds web graphs.
PERFORMANCE
RESULTS
•Metrics:
•Execution time, input/output size, and intermediate
data size.
•Examples:
•Sorting 1 TB of data using MapReduce.
•Data transfer rates during different phases of
execution.
CONCLUSION
•Summary:
•MapReduce simplifies parallel
data processing on large clusters.
•Advantages:
•Easy to use, scalable, fault-
tolerant, and efficient.
•Impact:
•Widely adopted at Google and
beyond for large-scale data
processing tasks.
Q&A
OPEN THE FLOOR FOR
QUESTIONS.
