Advanced Topics in
Computer Science
MapReduce
Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters
 Outline
1. Abstract
2. Introduction
3. Programming Model
4. Implementation
5. Refinements
6. Performance
7. Experience
8. Related Work
Introduction
Introduction
Problem
Problems
• large input data
• computation must be distributed across
hundreds or thousands of machines
• must finish in a reasonable
amount of time
 how to parallelize the computation,
distribute the data, and handle failures
without these issues conspiring to obscure
the simple computation?
Solution
Solution
 build a library that allows users
• to express their simple computations
• while hiding the messy details of parallelization,
fault tolerance, data distribution,
• and load balancing inside the library
 the library is inspired by the map and
reduce primitives present in Lisp and other
functional languages
 user-specified map and reduce operations
achieve this goal
Solution
 provides a simple and powerful interface
 enables automatic parallelization and
distribution of large-scale computations
 combined with an implementation of this
interface that
 achieves high performance on large
clusters of commodity PCs
Programming Model
Programming Model
 Input & Output: each a set of key/value pairs
 Programmer specifies two functions: map, reduce
Job = a Map phase followed by a Reduce phase
Programming Model
 Processes an input key/value pair
 Produces a set of intermediate key/value pairs
Map: (in_key, in_value) → list(out_key, intermediate_value)
Programming Model
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)
Reduce: (out_key, list(intermediate_value)) → list(out_value)
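In Python type-hint form, the two signatures can be sketched roughly as follows (illustrative names, not part of the C++ library):

# Rough sketch of the two signatures using Python type hints.
# MapFn / ReduceFn are illustrative names, not from the MapReduce library.
from typing import Callable, Iterator, List, Tuple

# map:    (in_key, in_value)             -> list of (out_key, intermediate_value)
MapFn = Callable[[str, str], List[Tuple[str, str]]]

# reduce: (out_key, intermediate_values) -> list of out_value (usually just one)
ReduceFn = Callable[[str, Iterator[str]], List[str]]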
Programming Model
 Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
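A minimal executable sketch of the same two functions in Python, with Emit modelled by returning lists (a local illustration, not the library's C++ API):

# Word count expressed as plain Python functions (local sketch, not the C++ API).
def word_count_map(input_key, input_value):
    # input_key: document name, input_value: document contents
    return [(word, "1") for word in input_value.split()]

def word_count_reduce(output_key, intermediate_values):
    # output_key: a word, intermediate_values: the counts emitted for that word
    return [str(sum(int(v) for v in intermediate_values))]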
Programming Model
Other example uses:
 Distributed Grep
 Distributed Sort
 Count of URL Access Frequency
 Reverse Web-Link Graph
 Term-Vector per Host
 Inverted Index (sketched below)
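As an illustration of one of these, a hypothetical inverted-index pair in the same local Python style (the function names are assumptions, not from the paper):

# Inverted index: word -> sorted list of documents containing it (local sketch).
def inverted_index_map(document_id, contents):
    # Emit (word, document_id) once per distinct word in the document.
    return [(word, document_id) for word in set(contents.split())]

def inverted_index_reduce(word, document_ids):
    # Emit the sorted list of documents that contain the word.
    return [sorted(set(document_ids))]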
Implementation
Overview
 Typical cluster
• 100s/1000s of 2-CPU x86 machines, 2-4 GB
of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data
(SOSP'03)
• Job scheduling system: jobs made up of
tasks, scheduler assigns tasks to machines
 The implementation is a C++ library linked
into user programs
Execution
Execution
Execution Flow
(The original slides step through the paper's execution figure one stage at a
time; the figure shows the input splits 0-4, the user program, the master, the
map and reduce workers, the intermediate files on local disks, and the output
files 0 and 1.)
 (1) fork: the user program forks many copies of itself; one copy becomes
the master, the rest become workers
 (2) assign map / assign reduce: the master picks idle workers and assigns
each one a map task or a reduce task
 (3) read: each map worker reads its input split and applies the user's
map function
 (4) local write: intermediate key/value pairs are written to the map
worker's local disk, partitioned into R regions
 (5) remote read: reduce workers read the intermediate data from the map
workers' local disks and sort it by key
 (6) write: each reduce worker applies the user's reduce function and
appends the results to its output file
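A single-process Python sketch of this flow, simulating splits, map, partitioning into R buckets, and reduce into output "files"; it only models the data movement, whereas the real library distributes it across machines:

# Single-process simulation of the execution flow (illustrative only).
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs, num_reduce_tasks=2):
    # (3)/(4): each "map task" processes one input split; intermediate pairs
    # are partitioned into num_reduce_tasks buckets by hash(key) % R.
    partitions = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for split_name, split_data in inputs.items():
        for key, value in map_fn(split_name, split_data):
            partitions[hash(key) % num_reduce_tasks][key].append(value)
    # (5)/(6): each "reduce task" processes its bucket in sorted key order and
    # produces one output "file" (a dict here).
    return [{k: reduce_fn(k, bucket[k]) for k in sorted(bucket)}
            for bucket in partitions]

# Example: word count over two input splits.
splits = {"split0": "the quick brown fox", "split1": "the lazy dog the"}
wc_map = lambda key, text: [(word, 1) for word in text.split()]
wc_reduce = lambda word, counts: sum(counts)
print(run_mapreduce(wc_map, wc_reduce, splits, num_reduce_tasks=2))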
Master
Master
 The master is a special copy of the program
 It assigns work to the worker copies
Master data structure
• The master keeps several data structures
• It stores the state of each map task and
reduce task:
• idle
• in-progress
• completed
• It also stores the identity of the worker
machine for each non-idle task (sketched below)
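A hypothetical sketch of this bookkeeping in Python (the real master uses C++ data structures; the field names here are assumptions):

# Hypothetical sketch of the master's bookkeeping (not the real C++ structures).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskState:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in-progress", or "completed"
    worker: Optional[str] = None  # identity of the worker machine (non-idle tasks)
    output: Optional[str] = None  # for completed map tasks: where the
                                  # intermediate files live

# One entry per map task and per reduce task (M = 200,000, R = 5,000 on the slides).
tasks = {("map", i): TaskState("map") for i in range(200_000)}
tasks.update({("reduce", j): TaskState("reduce") for j in range(5_000)})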
Fault Tolerance
Handled via Re-execution
 On worker failure:
 Detect failure via periodic heartbeats
 Re-execute completed and in-progress map
tasks
 Re-execute in-progress reduce tasks
 Task completion is committed through the master
 Master failure:
 Could be handled, but isn't yet (master failure
is unlikely)
 Robust: lost 1600 of 1800 machines once,
but finished fine
Worker Failure
• A worker is marked "failed" if it does not respond
within a certain amount of time
• The master resets any map tasks completed by the
failed worker back to idle, since their intermediate
output lives on that machine's local disk
• Any map or reduce task in progress on the failed
worker is also reset to idle
• MapReduce is resilient to large-scale worker
failures (a sketch of the rule follows)
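A hypothetical sketch of the re-scheduling rule (illustrative Python, not the real code):

# Sketch of the worker-failure rule (hypothetical structures, not the real code).
def handle_worker_failure(tasks, failed_worker):
    # tasks: list of dicts with "kind", "state", "worker" (assumed shape).
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            # Map output lives on the failed machine's local disk, so both
            # in-progress and completed map tasks go back to idle.
            task["state"], task["worker"] = "idle", None
        elif task["state"] == "in-progress":
            # Completed reduce output is already in the global file system,
            # so only in-progress reduce tasks are rescheduled.
            task["state"], task["worker"] = "idle", None

tasks = [{"kind": "map", "state": "completed", "worker": "w7"},
         {"kind": "reduce", "state": "in-progress", "worker": "w7"},
         {"kind": "reduce", "state": "completed", "worker": "w7"}]
handle_worker_failure(tasks, "w7")
print(tasks)  # the first two are reset to idle; the completed reduce is kept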
Task Granularity
Task Granularity
 Fine-grained tasks: many more map
tasks than machines
 Minimizes time for fault recovery
 Can pipeline shuffling with map execution
 Better dynamic load balancing
 Often use 200,000 map tasks and 5,000 reduce
tasks with 2,000 machines
Backup
Backup Tasks
 "Straggler" lengthens the total time in
computation
 Straggler can arise for a whole host
of reasons
 a machine with a bad disk may have slow
read perfomance from 30MB/s – 1MB/s
 processor caches to be disabled causes
bug in machine initialization code, leading
slowed computation ~1/10
Backup Tasks
 Idea
• When a MapReduce operation is close to
completion, the master schedules backup
executions of the remaining in-progress tasks
• A task is marked completed whenever either the
primary or the backup execution completes
• This increases the computational resources used
by the operation by only a few percent, while
significantly reducing completion time
• Example: the sort benchmark takes 44% longer to
complete when backup tasks are disabled
(a sketch of the rule follows)
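A hypothetical sketch of this scheduling rule in Python; the "close to completion" threshold is an assumption, since the paper does not give an exact value:

# Sketch of backup-task scheduling: near the end of an operation, duplicate the
# remaining in-progress tasks; a task is done when any copy of it finishes.
def maybe_schedule_backups(tasks, idle_workers, done_fraction=0.95):
    # done_fraction is an assumed threshold; the paper only says the operation
    # must be "close to completion" before backups are scheduled.
    completed = sum(t["state"] == "completed" for t in tasks)
    if completed / len(tasks) < done_fraction:
        return []
    backups = []
    for t in tasks:
        if t["state"] == "in-progress" and idle_workers:
            # Duplicate execution: the task finishes when either copy finishes.
            backups.append((t["id"], idle_workers.pop()))
    return backups

tasks = ([{"id": i, "state": "completed"} for i in range(98)]
         + [{"id": 98, "state": "in-progress"}, {"id": 99, "state": "in-progress"}])
print(maybe_schedule_backups(tasks, idle_workers=["w1", "w2"]))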
Refinements
Refinements
 Partitioning Function: intermediate data is
partitioned across the R reduce tasks by a
function of the intermediate key (default:
hash(key) mod R); users can supply their own
 Ordering Guarantees: within each partition, the
intermediate key/value pairs are processed in
increasing key order, so each output file is
sorted by key
 Combiner Function: does partial merging of the
intermediate data on the map worker before it is
sent over the network (a sketch follows)
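A small sketch of the default partitioning choice and a word-count combiner (the hash(key) mod R default and the hostname example come from the paper; the Python form is illustrative):

def default_partition(key, R):
    # Default partitioning function: hash(intermediate key) mod R.  Users can
    # supply their own, e.g. hash(Hostname(url_key)) mod R to keep all URLs
    # from one host in the same output file.
    return hash(key) % R

def word_count_combiner(word, counts):
    # Partially sums the counts on the map worker before they cross the
    # network; for word count it is the same code as the reduce function.
    return [str(sum(int(c) for c in counts))]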
Refinement
 Redundant Execution
 Slow workers significantly lengthen
completion time
• Other jobs consuming resources on machine
• Bad disks with soft errors transfer data very
slowly
• Weird things: processor caches disabled
 Solution: Near end of phase, spawn backup
copies of tasks
• Whichever one finishes first "wins"
 Effect: Dramatically shortens job completion
time
Refinement
 Locality Optimization
 Master scheduling policy
• Asks GFS for the locations of the replicas of the
input file blocks
• Input is typically split into 64 MB pieces
(= GFS block size)
• Map tasks are scheduled so that a replica of their
GFS input block is on the same machine or the same rack
 Effect: thousands of machines read input at
local disk speed
• Without this, rack switches would limit the read
rate (a sketch of the preference order follows)
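A hypothetical sketch of the preference order the master might apply (illustrative Python; the real policy uses GFS replica and rack information):

# Sketch of locality-aware assignment: prefer a worker holding a replica of the
# input block, then a worker on the same rack, then any idle worker (hypothetical).
def pick_worker(idle_workers, replica_hosts, rack_of):
    for w in idle_workers:
        if w in replica_hosts:                 # data-local
            return w
    replica_racks = {rack_of[h] for h in replica_hosts}
    for w in idle_workers:
        if rack_of[w] in replica_racks:        # rack-local
            return w
    return idle_workers[0] if idle_workers else None  # remote read

rack_of = {"w1": "r1", "w2": "r2", "w3": "r1"}
print(pick_worker(["w2", "w3"], replica_hosts={"w1"}, rack_of=rack_of))  # -> w3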
Refinement
Skipping Bad Records
 Map/Reduce functions sometimes fail
deterministically on particular inputs
• Best solution is to debug & fix, but that is not
always possible (e.g. third-party libraries)
• On a segmentation fault:
• the worker's signal handler sends a UDP packet to the master
• it includes the sequence number of the record being processed
• If the master sees two failures for the same record:
• the next worker to execute that task is told to skip the record
 Effect: can work around bugs in third-party
libraries (a bookkeeping sketch follows)
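A hypothetical sketch of the master-side bookkeeping for this (the real mechanism reports crashes via UDP from the worker's signal handler; this Python form is an illustration):

# Sketch of skip-bad-records bookkeeping on the master (hypothetical Python).
from collections import Counter

failure_counts = Counter()  # (task_id, record_sequence_number) -> crashes seen

def report_crash(task_id, record_seq):
    # Called when a worker reports a crash while processing this record.
    failure_counts[(task_id, record_seq)] += 1

def records_to_skip(task_id):
    # After more than one failure on the same record, the next re-execution
    # of the task is told to skip it.
    return {seq for (tid, seq), n in failure_counts.items()
            if tid == task_id and n > 1}

report_crash("map-17", 12345)
report_crash("map-17", 12345)
print(records_to_skip("map-17"))  # -> {12345}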
Refinement
Other Refinements
 Sorting guarantees within each reduce
partition
 Compression of intermediate data
 Combiner: useful for saving network
bandwidth
 Local execution for debugging/testing
 User-defined counters
Performance
Tests were run on a cluster of 1800 machines:
 4 GB of memory
 Dual-processor 2 GHz Xeons with
Hyper-Threading
 Dual 160 GB IDE disks
 Gigabit Ethernet per machine
 Bisection bandwidth approximately
100 Gbps
Performance
Two benchmarks:
Grep: scan 10^10 100-byte records (~1 TB) to extract records matching a rare
pattern (92K matching records)
Sort: sort 10^10 100-byte records (~1 TB)
Performance: Grep
 Locality optimization helps:
◦ 1800 machines read 1 TB of data at a peak of ~31
GB/s
◦ Without it, rack switches would limit this to 10 GB/s
 Startup overhead is significant for short jobs
Performance: Sort
 Backup tasks reduce job completion
time significantly
 System deals well with failures
(figure panels compared: normal execution, no backup tasks, 200 processes killed)
Experience
 large-scale machine learning
problems
 clustering problems for the Google
News product
 extraction of data used to produce
reports of popular queries
 extraction of properties of web pages
for new experiments and products
 large-scale graph computations.
Experience
Usage: MapReduce jobs run in
August 2004
Experience
Rewrote Google's production
indexing system using MapReduce
 New code is simpler, easier to understand
 MapReduce takes care of failures, slow
machines
 Easy to make indexing faster by adding more
machines
Our Experience
 Set up Apache Hadoop on:
◦ Single Node
◦ Cluster (on virtual machines)
 Run Word Count example
Related Work
 Programming model inspired by functional language
primitives
 Partitioning/shuffling similar to many large-scale sorting
systems
◦ NOW-Sort ['97]
 Re-execution for fault tolerance
◦ BAD-FS ['04] and TACC ['97]
 Locality optimization has parallels with Active Disks/Diamond
work
◦ Active Disks ['01], Diamond ['04]
 Backup tasks similar to Eager Scheduling in Charlotte system
◦ Charlotte ['96]
 Dynamic load balancing solves similar problem as River's
distributed queues
◦ River ['99]
Conclusions
 MapReduce has proven to be a useful
abstraction
 Greatly simplifies large-scale
computations at Google
 Fun to use: focus on problem, let
library deal with messy details
