MapReduce:
Simplified Data Processing on Large Clusters
Presented by Cleverence Kombe
By Jeffrey Dean and Sanjay Ghemawat
OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion
1. INTRODUCTION
o Many large-scale data processing tasks consist of:
o Computations that process large amounts of raw data and produce lots of derived data.
o Because the input data is so massive, the computation is distributed across hundreds or thousands of machines to complete the task in a
reasonable period of time.
o Google has written many special-purpose computations over raw data such as crawled documents and web request logs; each must parallelize the
computation, distribute the data, and handle failures.
o Handling these issues by hand obscures the simple computation with large amounts of complex code.
o Jeffrey Dean and Sanjay Ghemawat designed MapReduce, which simplifies data
processing by hiding the messy details of parallelization, fault tolerance, data distribution, and
load balancing in a library.
o What is MapReduce?
A programming model and approach for processing large data sets.
Contains Map and Reduce functions.
Runs on a large cluster of commodity machines.
Many real-world tasks are expressible in this model.
o MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
1. INTRODUCTION CONT…
o Input & output are sets of key/value pairs
o The programmer specifies two functions:
1. map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair
Produces a set of intermediate pairs
2. reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (in most cases just one)
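For word count, a minimal Python sketch of these two functions might look like the following (the paper's actual library is C++; the names map_fn and reduce_fn and the list-returning style are illustrative only):

```python
# Word-count map/reduce, a minimal Python sketch (illustrative names; the
# real library is C++ and emits pairs through library calls rather than
# returning lists).

def map_fn(in_key, in_value):
    """in_key: document name; in_value: document contents."""
    # Emit an intermediate (word, 1) pair for every word occurrence.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """out_key: a word; intermediate_values: all counts emitted for it."""
    # Merge all counts for the word into a single output value.
    return [sum(intermediate_values)]
```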
2. PROGRAMMING MODEL
o Word Count Example (diagram): input files (file1, file2) are split so that each line is passed to an individual mapper instance; Map emits (key, value) pairs; Sort and Shuffle groups them; Reduce merges the key/value pairs; the final output is written to an output file.
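The stages sketched above can be simulated end to end on one machine. This is only a single-process illustration, assuming the map_fn and reduce_fn from the previous sketch:

```python
from collections import defaultdict

def run_word_count(input_files):
    """Single-machine simulation of split -> map -> shuffle/sort -> reduce."""
    # Splitting: here, each line of each input file becomes one map input.
    map_inputs = [(f"{name}:{i}", line)
                  for name, text in input_files.items()
                  for i, line in enumerate(text.splitlines())]

    # Map phase: run the user's map function on every input.
    intermediate = []
    for key, value in map_inputs:
        intermediate.extend(map_fn(key, value))

    # Sort and shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce phase, then the final output.
    return {word: reduce_fn(word, counts)[0]
            for word, counts in sorted(groups.items())}

print(run_word_count({"file1": "the quick brown fox",
                      "file2": "the lazy dog jumps over the fox"}))
```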
2. PROGRAMMING MODEL
…More Examples
Distributed Grep
 The map function emits a line if it matches a supplied pattern
Count of URL access frequency.
 The map function processes logs of web page requests and outputs <URL, 1>
Reverse web-link graph
 The map function outputs <target, source> pairs for each link to a target URL found in a page named source
Term-Vector per Host
 A term vector summarizes the most important words that occur in a document or a set of documents as a list
of (word, frequency) pairs
Inverted Index
 The map function parses each document, and emits a sequence of (word, document ID) pairs
Distributed Sort
 The map function extracts the key from each record, and emits a (key, record) pair
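As an illustration, the distributed grep and inverted index examples above each reduce to a few lines. This is a hedged Python sketch; the functions return lists of emitted pairs rather than using the library's emit calls:

```python
import re

def grep_map(filename, contents, pattern):
    """Distributed grep: emit each line of the file that matches the pattern."""
    return [(line, "") for line in contents.splitlines() if re.search(pattern, line)]

def inverted_index_map(doc_id, contents):
    """Inverted index: emit a (word, document ID) pair for every word."""
    return [(word, doc_id) for word in contents.split()]

def inverted_index_reduce(word, doc_ids):
    """Combine into (word, sorted list of document IDs)."""
    return [(word, sorted(set(doc_ids)))]
```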
 Many different implementations are possible
 The right choice depends on the environment.
 Typical cluster (in wide use at Google): large clusters of PCs connected via switched networks
• Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of
memory per machine
• Commodity networking hardware with limited bisection bandwidth
• Storage on inexpensive local IDE disks
• GFS, a distributed file system, manages the data
• A scheduling system lets users submit jobs (a job is a set of tasks
mapped by the scheduler onto the available machines in the cluster)
The MapReduce library is implemented in C++ and linked into user programs
3. IMPLEMENTATION
Execution Overview
Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB
Reduce
• Partition the intermediate key space into R pieces
• hash(intermediate_key) mod R
Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
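The default partitioning function above, hash(intermediate_key) mod R, can be sketched as follows. zlib.crc32 stands in for the library's hash function, since Python's built-in hash() is randomized per process:

```python
import zlib

R = 5000  # number of reduce tasks, as in the typical setting above

def partition(intermediate_key: str, num_reduce_tasks: int = R) -> int:
    """Assign an intermediate key to one of R reduce partitions."""
    # A stable hash, so every occurrence of a key lands in the same partition.
    return zlib.crc32(intermediate_key.encode("utf-8")) % num_reduce_tasks

print(partition("the"))  # same partition for every map worker that emits "the"
```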
3. IMPLEMENTATION…
Execution Overview (diagram): the user program calls mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; intermediate keys are assigned to R regions by the partitioning function hash(intermediate_key) mod R; each reduce worker reads all intermediate data for its region and sorts it by intermediate key.
Fault Tolerance
Worker failure: handled through re-execution (sketched after this slide)
• Failures are detected via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Why re-execute even completed map tasks? Their output is stored on the failed worker's local disk, so it becomes inaccessible.
• Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
• Task completion is committed through the master
Master failure:
• Could be handled, but the current implementation does not (failure of the single master is unlikely)
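A rough sketch of the re-execution rule described above; the timeout value and the worker/task bookkeeping are illustrative, not the paper's actual data structures:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value

def handle_worker_failures(workers, tasks, now=None):
    """Reset tasks to idle when their worker stops sending heartbeats.

    workers: dict of worker_id -> timestamp of the last heartbeat
    tasks:   list of dicts with keys 'worker', 'kind' ('map'/'reduce'), 'state'
    """
    now = time.time() if now is None else now
    for worker_id, last_seen in workers.items():
        if now - last_seen <= HEARTBEAT_TIMEOUT:
            continue  # worker is considered healthy
        for task in tasks:
            if task["worker"] != worker_id:
                continue
            if task["kind"] == "map" and task["state"] in ("in_progress", "completed"):
                # Completed map output lives on the failed worker's local disk,
                # so it is lost and the task must be rescheduled.
                task["state"] = "idle"
            elif task["kind"] == "reduce" and task["state"] == "in_progress":
                # Completed reduce output is already in the global file system.
                task["state"] = "idle"
    return tasks
```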
3. IMPLEMENTATION…
Locality
Master scheduling policy:
• Asks GFS for the locations of the replicas of the input file blocks
• Input is typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the
same rack (see the sketch below)
As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
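A simplified illustration of that scheduling preference; the replica map, rack lookup, and worker names are hypothetical:

```python
def pick_worker_for_split(split_replicas, idle_workers, rack_of):
    """Prefer a worker that stores a replica of the split, then one on the same rack.

    split_replicas: machines holding a GFS replica of the input split
    idle_workers:   machines currently free to run a map task
    rack_of:        dict mapping machine name -> rack identifier
    """
    for worker in idle_workers:
        if worker in split_replicas:
            return worker                      # data-local: no network transfer
    replica_racks = {rack_of[m] for m in split_replicas}
    for worker in idle_workers:
        if rack_of.get(worker) in replica_racks:
            return worker                      # rack-local: cheap transfer
    return idle_workers[0] if idle_workers else None  # otherwise, any idle worker
```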
3. IMPLEMENTATION…
Backup Tasks
One of the common causes that lengthens the total time taken by a MapReduce
operation is a straggler: a machine that takes an unusually long time to complete one of the last few tasks.
MapReduce has a mechanism to alleviate the problem of stragglers:
when the operation is close to completion, the master schedules backup executions of the remaining in-
progress tasks (sketched below).
This significantly reduces the time to complete large MapReduce
operations (by up to 40%).
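A minimal sketch of that mechanism, assuming the master tracks task state in simple dicts and using an illustrative 95% completion threshold:

```python
def schedule_backup_tasks(tasks):
    """Return backup copies of in-progress tasks once the job is nearly done.

    A task counts as completed as soon as either its primary or its backup
    execution finishes; the 0.95 threshold below is illustrative.
    """
    completed = sum(1 for t in tasks if t["state"] == "completed")
    if completed < 0.95 * len(tasks):
        return []  # not yet in the straggler-prone tail of the operation
    return [dict(t, backup=True) for t in tasks if t["state"] == "in_progress"]
```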
3. IMPLEMENTATION…
• Different partitioning functions.
• Users specify the number of reduce tasks/output files they desire (R).
• Combiner function.
• Partially merges intermediate data on the map worker; useful for saving network bandwidth (see the sketch after this list).
• Different input/output types.
• Skipping bad records.
• When the master sees repeated failures on a particular record, the next worker to re-execute the task is told to skip that record.
• Local execution.
• An alternative implementation of the MapReduce library sequentially executes all of the work for
a MapReduce operation on the local machine (useful for debugging and testing).
• Status info.
• Progress of the computation and other information.
• Counters.
• Count occurrences of various events (e.g., the total number of words processed).
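A sketch of a word-count combiner: for this example the combiner is the same logic as the reducer, and the input format assumes the (word, 1) pairs from the word-count sketch earlier:

```python
from collections import defaultdict

def word_count_combiner(map_output):
    """Partially merge (word, 1) pairs on the map worker before they cross the network."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

print(word_count_combiner([("the", 1), ("fox", 1), ("the", 1)]))  # [('the', 2), ('fox', 1)]
```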
4. REFINEMENTS
The performance of MapReduce is measured on two
computations running on a large cluster of machines.
Grep
• searches through approximately one terabyte of
data looking for a particular pattern
Sort
• sorts approximately one terabyte of data
5. PERFORMANCE
Cluster Configuration
• Cluster: approximately 1,800 machines
• Memory: 4 GB per machine
• Processors: dual-processor 2 GHz Xeons with Hyper-Threading
• Hard disk: dual 160 GB IDE disks
• Network: Gigabit Ethernet per machine; aggregate bandwidth approximately 100 Gbps
5. PERFORMANCE…
Grep Computation
• Scans 10 billion 100-byte records, searching for a rare three-character pattern (which occurs in 92,337 records).
• Input is split into approximately 64 MB pieces (M = 15,000); the entire output is placed in one file (R = 1).
• Startup overhead is significant for short jobs.
(Figure: data transfer rate over time)
5. PERFORMANCE…
Sort Computation
 Backup tasks improve completion time considerably.
 The system handles machine failures relatively quickly.
5. PERFORMANCE…
(Figure: data transfer rates over time for different executions of the sort program; without backup tasks the sort takes 44% longer, and with induced machine failures only 5% longer.)
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Fun to use: focus on problem, let library deal with messy details
Little parallel-programming knowledge is needed
• (relieves the user from dealing with low-level parallelization details)
6. Experience & Conclusions
Thank you!