MapReduce basics

Harisankar H,
PhD student, DOS lab,
Dept. CSE, IIT Madras

6-Feb-2013

http://harisankarh.wordpress.com

Distributed processing?
• Processing distributed across multiple
machines/servers

Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg

Why distributed processing?
– Reduce execution time of large jobs
• E.g., extracting URLs from terabytes of data
• Ideally, 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
• Other nodes take over the work if some of the nodes fail
– Typically, if you have 10,000 servers, one will fail per day on average

Issues in distributed processing
• Realized traditionally using special-purpose
implementations
– E.g., indexer, log processor
• Implementation really hard at socket programming level
– Fault-tolerance
• Keep track of failure, reassignment of tasks
– Hand-coded parallelization
– Scheduling across heterogeneous nodes
– Locality
• Minimise movement of data for computation
– How to distribute data?
• Results in:
– Complex, brittle, non-generic code
– Reimplementation of common features like fault-tolerance,
distribution

Need for a generic abstraction for distributed processing

App programmer | abstraction | systems developer

Separation of concerns
– App programmer: express app logic
– Systems developer: performance, fault handling etc.

• Tradeoff between genericity and performance
– More generic => usually less performance
• MapReduce probably a sweet spot where you
have both to some extent

MapReduce abstraction (app programmer’s view)
• Model input and output as <key,value> pairs
• Provide map() and reduce() functions which act on <k,v> pairs
• Input: set of <k,v> pairs: {<k,v>}
– For each input <k,v>:
map(k1, v1) → list(k2, v2)
– For each unique output key from map:
reduce(k2, combined list(v2)) → list(v3)

System will take care of distributing the tasks across thousands of machines,
handling locality, fault-tolerance etc.
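
As a rough illustration of the programmer's view above, the following single-machine sketch runs map() once per input pair, groups the intermediate pairs by key, and runs reduce() once per unique key. The names MiniMapReduce, Mapper, Reducer and run are hypothetical and are not Hadoop's API; a real platform would spread the same two phases across many machines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Single-machine sketch of the MapReduce programming model (hypothetical names, not Hadoop's API).
class MiniMapReduce {

    // map(k1, v1) -> list(k2, v2): each output pair is handed to the emit callback.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    // reduce(k2, list(v2)) -> list(v3)
    interface Reducer<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            Map<K1, V1> input, Mapper<K1, V1, K2, V2> mapper, Reducer<K2, V2, V3> reducer) {
        // Map phase: call map() once per input pair and group its output by key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> e : input.entrySet()) {
            mapper.map(e.getKey(), e.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: call reduce() once per unique intermediate key.
        Map<K2, List<V3>> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    // Example job: count how often each word occurs across all documents.
    static Map<String, List<Integer>> wordCount(Map<String, String> docs) {
        Mapper<String, String, String, Integer> mapper = (name, contents, emit) -> {
            for (String w : contents.split("\\s+")) {
                if (!w.isEmpty()) {
                    emit.accept(w, 1);
                }
            }
        };
        Reducer<String, Integer, Integer> reducer = (word, counts) -> {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            return Collections.singletonList(sum);
        };
        return run(docs, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "to be or not to be");
        docs.put("b.txt", "to do");
        System.out.println(wordCount(docs));
    }
}
```

The application code is only the two lambdas inside wordCount(); everything in run() stands in for what the platform would provide.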

Example: word count
• Problem:
– Count the number of occurrences of each unique
word in a big collection of documents
• Input <k,v> set:
– <document name, document contents>
• Organize the files in this format
• Output:
– <word, count>
• Get it in output files
• Next step:
– Define the map() and reduce() functions

Word count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Program in Java

// word (a Text) and one (an IntWritable holding 1) are fields of the mapper class
public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);   // emit <word, 1>
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));   // emit <word, count>
}

Implementing MapReduce abstraction

App programmer | abstraction | systems developer

• Looked at the application programmer’s view
• Need a platform which implements the
MapReduce abstraction
• Hadoop is a popular open-source implementation of the MapReduce abstraction
• Questions for the platform developer
– How to
• parallelize ?
• handle faults ?
• provide locality ?
• distribute the data ?

Basics of platform implementation
• parallelize?
– Each map() can be executed independently in parallel
– After all maps have finished execution, all reduce() calls can be executed in parallel
• handle faults?
– map() and reduce() have no internal state
• Simply re-execute in case of a failure
• distribute the data?
– Have a distributed file system (HDFS)
• provide locality?
– Prefer to execute map() on the nodes that hold the input <k,v> pairs
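
A minimal sketch of the first two answers above, assuming a thread pool stands in for the cluster: all map tasks run in parallel, the reduce step starts only after every map has finished, and a failed task is simply re-executed because it keeps no internal state. The class and method names (ParallelPhases, runWithRetry, runPhase, countWords) and the retry limit of 3 are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelPhases {

    // Re-run a stateless task until it succeeds or maxAttempts is used up.
    static <T> T runWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // no internal state, so re-executing the task is safe
            }
        }
        throw last;
    }

    // Run all tasks of one phase in parallel and wait until every one has finished.
    static <T> List<T> runPhase(ExecutorService pool, List<Callable<T>> tasks) throws Exception {
        List<Callable<T>> retried = new ArrayList<>();
        for (Callable<T> t : tasks) {
            retried.add(() -> runWithRetry(t, 3)); // retry limit is an assumption
        }
        List<T> results = new ArrayList<>();
        for (Future<T> f : pool.invokeAll(retried)) { // invokeAll blocks until all tasks are done
            results.add(f.get());
        }
        return results;
    }

    // Demo job: each map task counts the words in one block; the first attempt made fails once.
    static int countWords(List<String> blocks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            AtomicInteger attempts = new AtomicInteger();
            List<Callable<Integer>> mapTasks = new ArrayList<>();
            for (String block : blocks) {
                mapTasks.add(() -> {
                    if (attempts.getAndIncrement() == 0) {
                        throw new IllegalStateException("simulated worker failure");
                    }
                    String trimmed = block.trim();
                    return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
                });
            }
            List<Integer> partial = runPhase(pool, mapTasks); // map phase completes first
            List<Callable<Integer>> reduceTasks = new ArrayList<>();
            reduceTasks.add(() -> partial.stream().mapToInt(Integer::intValue).sum());
            return runPhase(pool, reduceTasks).get(0);        // reduce phase runs afterwards
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The simulated failure does not change the result: the failed attempt is repeated and the final count is the same as in a failure-free run.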

MapReduce implementation
• Distributed File System (DFS) + MapReduce (MR) engine
– Specifically, the MR engine uses a DFS
• Distributed file system
– Files are split into large blocks and stored in the distributed file system (e.g., HDFS)
– Large blocks: typically 64 MB per block
– Can have a master-slave architecture
• Master assigns and manages replicated blocks on the slaves
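
The block arithmetic above can be sketched as follows. The 64 MB block size comes from the slide; the round-robin placement rule and the replication factor of 3 used in main() are assumptions made for illustration and are not how HDFS actually chooses nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockPlacement {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB per block, as on the slide

    // Number of blocks needed for a file; the last block may be only partly filled.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Hypothetical placement rule: replica r of block b is stored on node (b + r) % nodes.
    static List<int[]> placeReplicas(long fileSizeBytes, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<int[]> placement = new ArrayList<>();
        long blocks = blockCount(fileSizeBytes);
        for (long b = 0; b < blocks; b++) {
            int[] replicas = new int[replication];
            for (int r = 0; r < replication; r++) {
                replicas[r] = (int) ((b + r) % nodes);
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        List<int[]> placement = placeReplicas(oneGiB, 5, 3);
        System.out.println("blocks: " + placement.size());
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + b + " -> nodes " + Arrays.toString(placement.get(b)));
        }
    }
}
```

With this rule every block lives on several distinct nodes, so the loss of one node leaves at least one copy of each block available.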

MapReduce engine
• Has a master-slave architecture
– Master coordinates the task execution across workers
– Workers perform the map() and reduce() functions
• Read blocks from and write blocks to the DFS
– Master keeps track of worker failures and reassigns tasks if necessary
• Failure detection is usually done through timeouts
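
Timeout-based failure detection with task reassignment could look roughly like the sketch below. It assumes workers report to the master periodically; the class HeartbeatMaster, its method names, and passing the current time explicitly (to keep the example deterministic) are all hypothetical and do not mirror any particular platform.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatMaster {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>(); // worker -> time last heard from
    private final Map<String, String> taskOwner = new HashMap<>();   // task -> worker running it

    HeartbeatMaster(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // A worker reports that it is alive at time 'now' (milliseconds).
    void heartbeat(String worker, long now) {
        lastHeartbeat.put(worker, now);
    }

    void assign(String task, String worker) {
        taskOwner.put(task, worker);
    }

    String ownerOf(String task) {
        return taskOwner.get(task);
    }

    // Declare silent workers failed and hand their tasks to the remaining live workers.
    List<String> checkFailures(long now) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        for (String worker : failed) {
            lastHeartbeat.remove(worker);
        }
        if (failed.isEmpty() || lastHeartbeat.isEmpty()) {
            return failed; // nothing to do, or nobody left to take over
        }
        String[] live = lastHeartbeat.keySet().toArray(new String[0]);
        int next = 0;
        for (Map.Entry<String, String> e : taskOwner.entrySet()) {
            if (failed.contains(e.getValue())) {
                e.setValue(live[next++ % live.length]); // reassign; the task is simply re-executed
            }
        }
        return failed;
    }
}
```

A worker that is merely slow would also be declared failed under this scheme, which is acceptable here because re-executing a stateless task is harmless.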

Some tips for designing MR jobs
• Reduce network traffic between map and reduce
– Model map() and reduce() jobs appropriately
– Use combine() functions
• combine(<k,[v]>) → <k,[v]>
• combine() executes after all map()s finish in each block
– map() → [same node] → combine() → [network] → reduce()

• Make map jobs of roughly equal expected
execution times
• Try to make reduce() jobs less skewed
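
To make the combine() tip above concrete, the sketch below counts how many <k,v> pairs would cross the network for a word count job with and without a combiner. It runs on one machine and treats each input string as one block on one node; the class CombinerDemo and its methods are hypothetical names used only for this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerDemo {

    // map(): emit <word, 1> for every word in one block.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // combine(): pre-aggregate the map output of one node before it is sent anywhere.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    // Pairs that must be shipped from the map nodes to the reducers.
    static List<Map.Entry<String, Integer>> shuffle(List<String> blocks, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (String block : blocks) {
            List<Map.Entry<String, Integer>> out = map(block);
            sent.addAll(useCombiner ? combine(out) : out);
        }
        return sent;
    }

    static int shuffledPairs(List<String> blocks, boolean useCombiner) {
        return shuffle(blocks, useCombiner).size();
    }

    // reduce(): sum the values received for one word.
    static int countFor(String word, List<String> blocks, boolean useCombiner) {
        int sum = 0;
        for (Map.Entry<String, Integer> p : shuffle(blocks, useCombiner)) {
            if (p.getKey().equals(word)) {
                sum += p.getValue();
            }
        }
        return sum;
    }
}
```

This works for word count because addition is associative and commutative, so summing partial sums gives the same answer as summing the individual 1s; a combiner is only safe for reduce logic with that property.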

Pros and cons of MapReduce
• Advantages
– Simple, easy to use distributed processing system
– Reasonably generic
– Exploits locality for performance
– Simple and less buggy implementation
• Issues
– Not a silver bullet that fits all problems
• Difficult to model iterative and recursive computations
– E.g.: k-means clustering
– Generate-Map-Reduce
• Difficult to model streaming computations
• Centralized entities like the master become bottlenecks
• Most real-world problems require large chains of MR jobs
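
The k-means example above shows why iterative computations fit awkwardly: every iteration becomes a separate MR job, and a driver outside the model has to chain the jobs and test for convergence. The sketch below assumes 1-D points and runs on one machine; the class IterativeKMeans and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IterativeKMeans {

    // One MapReduce job corresponds to one k-means iteration over 1-D points.
    static double[] iterate(double[] points, double[] centroids) {
        // map(): emit <index of nearest centroid, point>
        Map<Integer, List<Double>> grouped = new TreeMap<>();
        for (double p : points) {
            int nearest = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                    nearest = c;
                }
            }
            grouped.computeIfAbsent(nearest, k -> new ArrayList<>()).add(p);
        }
        // reduce(): the new centroid is the mean of the points assigned to it
        double[] next = centroids.clone(); // centroids with no points stay where they are
        for (Map.Entry<Integer, List<Double>> e : grouped.entrySet()) {
            double sum = 0;
            for (double p : e.getValue()) {
                sum += p;
            }
            next[e.getKey()] = sum / e.getValue().size();
        }
        return next;
    }

    // Driver: chains jobs until the centroids stop moving; updates 'centroids' in place
    // and returns the number of jobs that were run.
    static int jobsUntilConverged(double[] points, double[] centroids, int maxJobs) {
        int jobs = 0;
        while (jobs < maxJobs) {
            double[] next = iterate(points, centroids);
            jobs++;
            if (Arrays.equals(next, centroids)) {
                break;
            }
            System.arraycopy(next, 0, centroids, 0, centroids.length);
        }
        return jobs;
    }
}
```

On a real cluster each pass through the loop would reread the full input from the DFS and pay the job start-up cost again, which is where much of the overhead of iterative jobs comes from.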

Summary
• Today
– Distributed processing issues, MR programming model
– Sample MR job
– How MR can be implemented
– Pros and cons of MR, tips for better performance
• Tomorrow
– Details specific to Hadoop
– Downloading and setting up of Hadoop on a cluster

Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data
processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

Hadoop components
• HDFS
– Master: NameNode
– Slave: DataNode
• MapReduce engine
– Master: JobTracker
– Slave: TaskTracker
