MAPREDUCE: SIMPLIFIED
DATA PROCESSING ON
LARGE CLUSTERS
AUTHORS: JEFFREY DEAN AND SANJAY GHEMAWAT
PRESENTED BY: DAKSH GOTI
DATE: 9/11/2024
ID: 032172125
GUIDED BY: DR. HAILU XU
INTRODUCTION
o Many tasks in large-scale data processing consist of:
o Computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is so massive, the computation must be distributed across hundreds or thousands of
machines to finish in a reasonable amount of time.
o At Google, such computations (over crawled documents, web request logs, and similar data sets) were originally
written as special-purpose programs, each handling its own parallelization, data distribution, and failure recovery.
o That approach leads to complex code that obscures the simple underlying computation.
o Jeffrey Dean and Sanjay Ghemawat proposed MapReduce, which simplifies data processing by
hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
INTRODUCTION
TO HADOOP
INTRODUCTION
 Hadoop is an Apache open-source framework, written in Java, that
allows distributed processing of large datasets across clusters of
computers using simple programming models. A Hadoop
application runs in an environment that provides distributed
storage and computation across those clusters.
HADOOP
ARCHITECTURE
At its core, Hadoop has two major
layers:
1. Processing/Computation layer
(MapReduce)
2. Storage layer (Hadoop Distributed
File System, HDFS)
Beyond these two core components,
the Hadoop framework also includes
the following two modules.
1.Hadoop Common - These are Java
libraries and utilities required by other
Hadoop modules.
2.Hadoop YARN - This is a framework
for job scheduling and cluster
resource management.
HOW DOES HADOOP WORK?
Hadoop runs user code across a cluster of computers. This process includes the following core tasks that Hadoop
performs:
• Data is initially organized into directories and files. Files are divided into uniformly sized blocks of 128 MB or
64 MB (128 MB is preferred), and these blocks are distributed across cluster nodes for further processing.
• HDFS, sitting on top of each node's local file system, supervises the processing.
• Blocks are replicated to tolerate hardware failure.
• Hadoop checks that the code executed successfully.
• It performs the sort that takes place between the map and reduce stages.
• It sends the sorted data to the appropriate machines.
• It writes debugging logs for each job.
ADVANTAGES OF HADOOP
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it
automatically distributes the data and work across the machines, thereby utilizing the underlying parallelism of the
CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop
library itself has been designed to detect and handle failures at the application layer.
Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without
interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms
since it is Java based.
HADOOP ECOSYSTEM
1. HDFS:
HDFS is the primary component
of the Hadoop ecosystem and is responsible
for storing large data sets of structured or
unstructured data across various nodes,
while maintaining the metadata in
the form of log files.
HDFS consists of two core components:
• Name Node
• Data Node
2. YARN:
Yet Another Resource Negotiator: as the
name implies, YARN helps
manage the resources across the
cluster.
In short, it performs scheduling and
resource allocation for the Hadoop
system. It consists of three major
components:
• Resource Manager
• Node Manager
• Application Master
Advantages of HDFS:
It is inexpensive, immutable in nature,
stores data reliably, tolerates
faults, scales well, is block-structured,
and can process a large amount of data
in parallel, among other benefits.
Disadvantages of HDFS:
Its biggest disadvantage is that it is
not a good fit for small quantities of data.
It also has some stability concerns and
can be restrictive in nature.
Hadoop also supports a wide range of
software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache
Sqoop, Apache Spark, Apache Storm,
Apache Pig, Apache Hive, Apache Phoenix,
and Cloudera Impala.
INTRODUCTION
• Overview:
MapReduce is a programming model and an associated implementation
designed to simplify the processing of large data sets on clusters of
commodity hardware.
• Purpose:
It enables automatic parallelization, fault tolerance, data distribution, and
load balancing.
• Context:
Extensively used at Google for various large-scale computations, such as
processing web crawls, logs, and generating derived data sets.
• Provides:
 User-defined functions
 Automatic parallelization and distribution
 Fault-tolerance
 I/O scheduling
 Status and monitoring
PROGRAMMING MODEL
Map Function:
Takes an input key/value pair and
produces a set of intermediate
key/value pairs.
Reduce Function:
Merges all intermediate values
associated with the same
intermediate key to produce the
final output.
Key Features:
• Simplicity in defining
computations.
• Automatic parallelization
and fault-tolerance.
• High scalability.
EXAMPLE OF MAPREDUCE
Word Count Example:
•Map Function: For each word in the document, emit (word, 1).
•Reduce Function: Sum up all values for each word to get the total count.
Pseudo-code:
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
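The pseudo-code above can be exercised end to end with a minimal, single-process Python sketch (illustrative only; a real MapReduce or Hadoop job runs the map and reduce functions on many worker machines):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def word_count(documents):
    intermediate = defaultdict(list)
    # Map phase, with the shuffle folded in: group intermediate values by key.
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            intermediate[word].append(count)
    # Reduce phase: apply reduce_fn once per intermediate key.
    return dict(reduce_fn(w, c) for w, c in intermediate.items())

if __name__ == "__main__":
    docs = {"file1": "the quick brown fox", "file2": "the lazy dog jumps over the fox"}
    print(word_count(docs))   # e.g. {'the': 3, 'fox': 2, ...}
```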
[Diagram: word count data flow — input files (input file 1, input file 2) are split into lines; each line is passed to an
individual mapper instance, which emits (key, value) pairs; the pairs are sorted and shuffled by key; reducers
aggregate the values for each key; and the final result is written to an output file.]
EXAMPLE:
 Distributed Grep
 The map function emits a line if it matches a supplied pattern
 Count of URL access frequency.
 The map function processes logs of web page requests and outputs <URL, 1>
 Reverse web-link graph
 The map function outputs <target, source> pairs for each link to a target URL found in a page named source
 Term-Vector per Host
 A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs
 Inverted Index
 The map function parses each document, and emits a sequence of (word, document ID) pairs
 Distributed Sort
 The map function extracts the key from each record, and emits a (key, record) pair
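As a concrete illustration of two of these patterns, here is a minimal Python sketch of the map (and, for URL frequency, reduce) functions; the log-line format is an assumption made only for this example:

```python
import re
from collections import defaultdict

# Distributed grep: the map function emits a line only if it matches the pattern;
# the reduce function is simply the identity.
def grep_map(filename, line, pattern):
    if re.search(pattern, line):
        yield (filename, line)

# Count of URL access frequency: map emits <URL, 1>, reduce sums the counts.
def url_map(log_name, log_line):
    url = log_line.split()[1]        # assumes "METHOD URL ..." log lines
    yield (url, 1)

def url_reduce(url, counts):
    return (url, sum(counts))

if __name__ == "__main__":
    log = ["GET /index.html", "GET /about.html", "GET /index.html"]
    hits = [m for line in log for m in grep_map("access_log", line, r"about")]
    print(hits)                          # [('access_log', 'GET /about.html')]
    grouped = defaultdict(list)
    for line in log:
        for url, one in url_map("access_log", line):
            grouped[url].append(one)
    print(dict(url_reduce(u, c) for u, c in grouped.items()))
    # {'/index.html': 2, '/about.html': 1}
```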
SYSTEM
ARCHITECTURE
•Components:
•Master Node: Coordinates the execution of map and reduce tasks,
manages task assignments.
•Worker Nodes: Perform map and reduce operations.
•Data Flow:
1.Input data is split into chunks.
2.Map tasks process these chunks and produce intermediate data.
3.Reduce tasks aggregate the intermediate data and produce the
final output.
•Diagram: Include a diagram of the execution overview.
[Figure: execution overview — the user program calls mapreduce(spec, &result); the input is divided into M splits
of 16-64 MB each; intermediate keys are partitioned into R regions by a partitioning function such as
hash(intermediate_key) mod R; each reduce worker reads all intermediate data for its region and sorts it by
intermediate key.]
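The default partitioning function is simply a hash of the intermediate key modulo R. A minimal sketch (a stable hash is used here so the example is reproducible; the hash function used by the actual library may differ):

```python
import hashlib

def partition(intermediate_key: str, R: int) -> int:
    """Route an intermediate key to one of R reduce regions: hash(key) mod R."""
    digest = hashlib.md5(intermediate_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % R

if __name__ == "__main__":
    R = 4
    for key in ["apple", "banana", "cherry", "apple"]:
        print(key, "-> region", partition(key, R))   # identical keys land in the same region
```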
EXECUTION
OVERVIEW
•Execution Steps:
1.Input files are split into M pieces.
2.The master assigns map and reduce tasks to
worker nodes.
3.Map Phase: Workers read input splits, process
data, and write intermediate results to local disks.
4.Reduce Phase: Workers fetch intermediate
results, sort them by key, and execute the reduce
function.
5.Results are written to output files.
•Visualization: Include Figure 1 from the paper
(execution overview).
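These five steps can be traced with a toy, single-process Python sketch (illustrative only, not the Google or Hadoop implementation; the nested lists below stand in for the R partitioned files on each map worker's local disk):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    """Toy, single-process walk-through of the five execution steps."""
    # 1. Split the input records into M pieces.
    splits = [records[i::M] for i in range(M)]
    # 2-3. Map phase: each map "worker" processes one split and writes its
    #      intermediate pairs into R local partitions.
    local = [[defaultdict(list) for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                local[m][hash(ikey) % R][ikey].append(ivalue)
    # 4. Reduce phase: each reduce "worker" fetches its partition from every map
    #    worker, merges and sorts by intermediate key, and applies the reduce function.
    output = []
    for r in range(R):
        merged = defaultdict(list)
        for m in range(M):
            for ikey, ivalues in local[m][r].items():
                merged[ikey].extend(ivalues)
        # 5. Append this reduce worker's results to the final output.
        output.extend(reduce_fn(k, vs) for k, vs in sorted(merged.items()))
    return output

if __name__ == "__main__":
    lines = [("line%d" % i, text) for i, text in enumerate(["a b a", "b c", "a c c"])]
    wc_map = lambda key, value: [(w, 1) for w in value.split()]
    wc_reduce = lambda key, values: (key, sum(values))
    print(run_mapreduce(lines, wc_map, wc_reduce))   # counts: a=3, b=2, c=3 (order depends on partitioning)
```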
FAULT
TOLERANCE
•Worker Failures:
•The master periodically pings workers. If a worker stops
responding, its in-progress tasks are reset and reassigned.
•Completed map tasks on failed workers are re-executed, because
their output lives on the failed machine's local disk; completed
reduce tasks are not, because their output is already in the global
file system.
•Master Failure:
•Rare; the job is aborted and the client can restart the MapReduce
computation.
•Atomic Operations:
•Consistency is ensured through atomic renaming of temporary
output files.
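A minimal sketch of the heartbeat logic (all names and the timeout value are hypothetical; this is not the paper's or Hadoop's actual code):

```python
PING_TIMEOUT = 10.0   # seconds of silence before a worker is presumed dead (assumed value)

workers = {"w1": 14.0, "w3": 2.0}   # worker -> time of last successful ping
tasks = {
    "map-07":    {"kind": "map",    "state": "in_progress", "worker": "w3"},
    "map-02":    {"kind": "map",    "state": "completed",   "worker": "w3"},
    "reduce-01": {"kind": "reduce", "state": "completed",   "worker": "w1"},
}

def check_workers(now: float) -> None:
    """Reschedule tasks whose worker has stopped answering pings."""
    dead = {w for w, last_ping in workers.items() if now - last_ping > PING_TIMEOUT}
    for task_id, task in tasks.items():
        if task["worker"] in dead:
            # In-progress tasks are always reset; completed map tasks are reset too,
            # since their output sits on the dead worker's local disk.
            if task["state"] == "in_progress" or task["kind"] == "map":
                task["state"], task["worker"] = "idle", None
                print(task_id, "will be re-executed")

check_workers(now=15.0)   # resets map-07 and map-02; reduce-01 is left alone
```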
PERFORMANCE OPTIMIZATION
•Locality Optimization:
•Map tasks are scheduled on nodes that store the input data locally.
•Backup Tasks:
•When a job is close to completion, the master schedules backup (duplicate)
executions of the remaining in-progress tasks to mitigate "stragglers" and reduce
completion time.
•Combiner Function:
•Reduces data size during the map phase by merging intermediate
data locally.
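For word count, the combiner is essentially the reduce function applied locally on each map worker; a minimal sketch of the effect (illustrative only):

```python
from collections import defaultdict

def word_count_map(doc: str):
    return [(word, 1) for word in doc.split()]

def combine(pairs):
    """Partially merge (word, count) pairs on the map worker before they cross the network."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

if __name__ == "__main__":
    raw = word_count_map("the the the quick fox the")
    print(len(raw), "pairs before combining,", len(combine(raw)), "after")   # 6 -> 3
```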
PERFORMANCE
MEASURE
Measure the performance of
MapReduce on two computations
running on a large cluster of machines.
 Grep
 searches through approximately
one terabyte of data looking for a
particular pattern
 Sort
 sorts approximately one terabyte
of data
GREP COMPUTATION
 Scans 10 billion 100-byte records (roughly 1 TB), searching for a
rare 3-character pattern (the pattern occurs in 92,337
records).
 The input is split into approximately 64 MB pieces, so M = 15,000
(1 TB / 64 MB ≈ 15,000); the entire output is placed in one file, so R = 1.
 Startup overhead is significant for such short jobs.
[Figure: data transfer rate over time]
SORT COMPUTATION
 Backup tasks noticeably improve completion time.
 The system handles machine failures and recovers relatively quickly.
[Figure: data transfer rates over time for different executions of the sort program]
USE CASES AND APPLICATIONS
Large-Scale Indexing:
Used in Google’s production indexing
system for search.
Data Mining and Machine
Learning:
Examples include clustering and
classification tasks.
Log Analysis and Web Graph
Processing:
Efficiently processes large-scale logs and
builds web graphs.
PERFORMANCE
RESULTS
•Metrics:
•Execution time, input/output size, and intermediate
data size.
•Examples:
•Sorting 1 TB of data using MapReduce.
•Data transfer rates during different phases of
execution.
CONCLUSION
•Summary:
•MapReduce simplifies parallel
data processing on large clusters.
•Advantages:
•Easy to use, scalable, fault-
tolerant, and efficient.
•Impact:
•Widely adopted at Google and
beyond for large-scale data
processing tasks.
Q&A
OPEN THE FLOOR FOR
QUESTIONS.
