In this session you will learn:
Meet MapReduce
Word Count Algorithm – Traditional approach
Traditional approach on a Distributed System
Traditional approach – Drawbacks
MapReduce Approach
Input & Output Forms of a MR program
Map, Shuffle & Sort, Reduce Phase
WordCount Code walkthrough
Workflow & Transformation of Data
Input Split & HDFS Block
Relation between Split & Block
Data locality Optimization
Speculative Execution
MR Flow with Single Reduce Task
MR flow with multiple Reducers
Input Format & Hierarchy
Output Format & Hierarchy
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
Meet MapReduce
• MapReduce is a programming model for distributed processing.
• Its main advantage is easy scaling of data processing over multiple computing nodes.
• The basic entities in this model are mappers & reducers.
• Decomposing a data processing application into mappers and reducers is the developer's task.
• Once you write an application in the MapReduce form, scaling it to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
5. Page 5Classification: Restricted
WordCount – Traditional Approach
define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
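A minimal runnable Java rendering of this pseudocode, modeling the Multiset as a HashMap; the sample documents and the whitespace tokenizer are illustrative assumptions:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SerialWordCount {
  public static void main(String[] args) {
    // documentSet stands in for the real document collection
    List<String> documentSet = List.of("the quick brown fox", "the lazy dog");

    Map<String, Integer> wordCount = new HashMap<>();  // define wordCount as Multiset
    for (String document : documentSet) {
      for (String token : document.split("\\s+")) {    // T = tokenize(document)
        wordCount.merge(token, 1, Integer::sum);       // wordCount[token]++
      }
    }
    System.out.println(wordCount);                     // display(wordCount)
  }
}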
Traditional Approach – Distributed Processing
// Phase 1 (runs on every machine): count words in that machine's subset
define wordCount as Multiset;
for each document in documentSubset {
    <same code as in prev. slide>
}
sendToSecondPhase(wordCount);

// Phase 2 (runs on one machine): merge the partial counts
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
Traditional Approach – Drawbacks
• Central storage: the storage server's bandwidth becomes a bottleneck.
• Multiple storage locations: the splits become harder to manage.
• The program runs in memory: when processing large document sets, the number of unique
words can exceed the RAM of a single machine.
• Can phase 2 be handled by one machine?
• If multiple machines are used for phase 2, how do we partition the data?
MapReduce Approach
• Has two execution phases: mapping & reducing.
• These phases are defined by data processing functions called the mapper & the reducer.
• Mapping phase: MR takes the input data and feeds each data element to the mapper.
• Reducing phase: the reducer processes all the outputs from the mapper and arrives at a final result.
Input & Output forms:
• In order for mapping, reducing, partitioning, and shuffling (and a few other phases not
mentioned here) to work together seamlessly, we need to agree on a common structure for the data
being processed.
• The InputFormat class is responsible for creating input splits and dividing them into records.
            Input             Output
map()       <k1, v1>          list(<k2, v2>)
reduce()    <k2, list(v2)>    list(<k3, v3>)
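To make these forms concrete, here is a minimal WordCount sketch against the standard org.apache.hadoop.mapreduce API (class names are illustrative): k1 is the byte offset of a line, v1 the line text, k2 a word, and v2/v3 counts.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): <k1 = byte offset, v1 = line> -> list(<k2 = word, v2 = 1>)
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);             // emit <word, 1>
      }
    }
  }
}

// reduce(): <k2 = word, list(v2)> -> <k3 = word, v3 = total count>
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    context.write(key, new IntWritable(sum)); // emit <word, total>
  }
}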
MR – Workflow & Transformation of Data
Data flows through four stages:
• From input files to the mapper
• From the mapper to the intermediate results
• From the intermediate results to the reducer
• From the reducer to the output files
Relation Between Input Split & HDFS Block
[Diagram: a file of lines 1–10 laid across HDFS block boundaries, with three input splits overlaid]
• Logical records (lines) do not fit neatly into HDFS blocks.
• A logical record may cross a block boundary.
• The first split therefore contains line 5, even though that line spans two blocks.
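For reference, FileInputFormat derives the split size from the HDFS block size and two configurable bounds; the formula below is the standard computation, and the setter calls are a driver-side sketch using the standard FileInputFormat helpers:

    splitSize = max(minimumSplitSize, min(maximumSplitSize, blockSize))

By default minimumSplitSize < blockSize < maximumSplitSize, so a split defaults to exactly one HDFS block.

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// force splits smaller than one block, if the job needs finer-grained tasks
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // cap splits at 64 MB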
Data locality Optimization
• An MR job is split into various map & reduce tasks.
• Map tasks run on the input splits.
• Ideally, the task JVM is launched on the node where the split/block of data resides.
• In some scenarios, however, that node has no free slots to accept another task.
• In that case, the task is scheduled on a Task Tracker at a different location.
• Scenario a) Same-node execution
• Scenario b) Off-node execution
• Scenario c) Off-rack execution
Speculative execution
• An MR job is split into various map & reduce tasks, and they execute in parallel.
• The overall job execution time is therefore limited by the slowest task.
• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is
running slower than expected and launches another, equivalent task as a backup. This is
termed speculative execution of tasks.
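Speculative execution can be toggled per job; a minimal driver-side sketch using the standard Hadoop 2.x property names (both default to true):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// disable backup tasks when duplicate work is undesirable (e.g. non-idempotent output)
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);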
Combiner
• A combiner is a mini-reducer.
• It executes on the mapper output, at the mapper side.
• The combiner's output is fed to the reducer.
• Because the mapper output is pre-aggregated by the combiner, the data that has to be shuffled
across the cluster is minimized.
• Because the combiner function is an optimization, Hadoop does not guarantee how many times it
will call it for a particular map output record, if at all.
• So calling the combiner function zero, one, or many times should produce the same output from the
reducer.
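In WordCount, the sum performed by the reducer obeys this contract, so the reducer class itself can double as the combiner. A driver-side sketch, reusing the IntSumReducer shown earlier via the standard Job.setCombinerClass call:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // mini-reduce on the map side
job.setReducerClass(IntSumReducer.class);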
Combiner’s Contract
• Only functions that are commutative & associative can safely be used as combiners, because
  max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
  whereas
  mean(0, 20, 10, 25, 15) = 14, but
  mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
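The standard workaround for a non-associative function like mean is to change the intermediate value so that combining becomes associative: emit (sum, count) pairs and divide only in the reducer. Worked through on the same numbers:

mapper output:          (0,1) (20,1) (10,1) (25,1) (15,1)
combiner (per mapper):  (30, 3) and (40, 2)        // element-wise addition is associative
reducer:                (70, 5)  ->  70 / 5 = 14   // correct mean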
Partitioner
• We know that a given key always goes to a single reducer.
• The partitioner is responsible for sending key/value pairs to a reducer based on the key's content.
• The default partitioner is the HashPartitioner: it takes each mapper output key, computes a hash
value for it, and takes that hash modulo the number of reducers. The result of this calculation
determines the reducer that this particular key will go to.
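That default behavior corresponds to (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A custom partitioner subclasses the standard org.apache.hadoop.mapreduce.Partitioner in the same shape; a minimal sketch (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// routes each key to a reducer by hash, mirroring the default HashPartitioner
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // mask the sign bit so the partition index is never negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
// driver: job.setPartitionerClass(WordPartitioner.class);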
Counters
• Counters are a useful channel for gathering statistics about a job: for quality control or for
application-level statistics.
• They are often used for debugging purposes.
• e.g. counting the number of good records and bad records in the input
• Two types: built-in & custom counters
• Examples of Built-in Counters:
• Map input records
• Map output records
• Filesystem bytes read
• Launched map tasks
• Failed map tasks
• Killed reduce tasks
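A minimal sketch of a custom counter counting good vs. bad records inside a mapper; the class name, the enum name, and the field-count check are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QualityCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
  // each enum constant becomes a custom counter in the job's counter output
  enum RecordQuality { GOOD, BAD }
  private static final int EXPECTED_FIELDS = 3;  // assumed record layout

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().split(",").length == EXPECTED_FIELDS) {
      context.getCounter(RecordQuality.GOOD).increment(1);
      // ... process the good record ...
    } else {
      context.getCounter(RecordQuality.BAD).increment(1);  // skip the bad record
    }
  }
}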
Joins
• Map-side join (replication): a map-side join that works when one of the datasets is small enough
to cache in memory (see the sketch after this list).
• Reduce-side join (repartition join): a reduce-side join for situations where you're joining two or
more large datasets together.
• Semi-join (a map-side join): another map-side join, where one dataset is initially too large to fit
in memory but can be filtered down to a size that fits.
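A minimal sketch of the replicated (map-side) join: the small dataset is loaded into memory in setup() and probed per record. The file name lookup.txt, the tab-separated layout, and the class name are illustrative assumptions; the file itself would arrive via the Distributed Cache described on the next slide.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // load the cached side file (symlinked into the task's working directory)
    try (BufferedReader r = new BufferedReader(new FileReader("lookup.txt"))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] parts = line.split("\t", 2);      // assumed layout: key \t value
        if (parts.length == 2) smallTable.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    String match = smallTable.get(fields[0]);
    if (match != null) {                           // inner join on the first field
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
    }
  }
}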
Distributed Cache
• Side data is extra read-only data needed by a job to process the main dataset.
• To make side data available to all map or reduce tasks, we distribute those datasets using Hadoop's
Distributed Cache mechanism, as shown below.
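A driver-side sketch using the standard Job.addCacheFile API; the HDFS path is an illustrative assumption, and the "#lookup.txt" fragment sets the local symlink name that the join mapper above reads:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "replicated join");
// ship the small read-only dataset to every task node before the job starts
job.addCacheFile(new URI("/data/lookup.txt#lookup.txt"));  // URISyntaxException handling omitted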
pavan.hadoop@outlook.com