1
Distributed and Parallel Processing Technology
Chapter 2.
MapReduce
Sun Jo
Introduction
 MapReduce is a programming model for data processing.
 Hadoop can run MapReduce programs written in various languages.
 We shall look at the same program expressed in Java, Ruby, Python, and
C++.
2
A Weather Dataset
 Program that mines weather data
 Weather sensors collect data every
hour at many locations across the
globe
 They gather a large volume of log data,
which is a good candidate for analysis
with MapReduce
 Data Format
 Data from the National Climatic Data Center (NCDC)
 Stored using a line-oriented ASCII
format, in which each line is a record
3
A Weather Dataset
 Data Format
 Data files are organized by date and weather station.
 There is a directory for each year from 1901 to 2001, each containing a gzipped file
for each weather station with its readings for that year.
 The whole dataset is made up of a large number of relatively small files since there
are tens of thousands of weather stations.
 The data was preprocessed so that each year’s readings were concatenated into a
single file.
4
Analyzing the Data with Unix Tools
 What’s the highest recorded global temperature for each year in the dataset?
 A Unix shell script using awk, the classic tool for processing line-oriented data
 Beginning of a run
 The complete run for the century took 42 minutes on a single EC2 High-CPU
Extra Large instance.
5
The script loops through the compressed year files, printing the year and then
processing each file using awk.
Awk extracts the air temperature and the quality code from the data.
Temperature value 9999 signifies a missing value in the NCDC dataset.
Maximum temperature is 31.7℃ for 1901.
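For comparison, here is a minimal sequential sketch of the same per-year maximum computation in Java. It assumes the NCDC fixed-width layout used later in the Java examples (year in columns 16-19, temperature in columns 88-92, quality code in column 93); the class name and single-file input handling are illustrative only.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Sequential baseline: scan every record and keep the maximum valid
// temperature seen for each year.
public class SequentialMaxTemperature {
  private static final int MISSING = 9999;

  public static void main(String[] args) throws Exception {
    Map<String, Integer> maxByYear = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        String year = line.substring(15, 19);
        int temp = Integer.parseInt(
            line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temp != MISSING && quality.matches("[01459]")) {
          Integer max = maxByYear.get(year);
          if (max == null || temp > max) {
            maxByYear.put(year, temp);
          }
        }
      }
    }
    // Temperatures are recorded in tenths of a degree Celsius.
    for (Map.Entry<String, Integer> e : maxByYear.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}

Like the shell script, this processes the records one after another on a single machine; the rest of the chapter shows how MapReduce spreads the same logic across a cluster.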
Analyzing the Data with Unix Tools
 To speed up the processing, run parts of the program in parallel
 Problems for parallel processing
 Dividing the work into equal-size pieces isn’t always easy or obvious.
• The file size for different years varies
• The whole run is dominated by the longest file
• A better approach is to split the input into fixed-size chunks and assign each chunk to a process
 Combining the results from independent processes may need further processing.
 Even so, processing is still limited by the capacity of a single machine, and using
multiple machines raises problems of coordination and reliability.
 It’s feasible to parallelize the processing; in practice, though, it’s messy.
6
Analyzing the Data with Hadoop – Map and Reduce
 Map and Reduce
 MapReduce works by breaking the processing into 2 phases: the map and the reduce.
 Both map and reduce phases have key-value pairs as input and output.
 Programmers have to specify two functions: map and reduce function.
 The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line within the file, and the value is the line itself.
 The map function pulls out the year and the air temperature from each input value.
 The reduce function takes <year, temperature> pairs as input and produces the
maximum temperature for each year as the result.
7
Analyzing the Data with Hadoop – Map and Reduce
 Original NCDC Format
 Input file for the map function, stored in HDFS
 Output of the map function, running in parallel for each block
 Input for the reduce function & Output of the reduce function
8
Analyzing the Data with Hadoop – Map and Reduce
 The whole data flow
9
Input File → Map()
Map output (from one map task):
  <1950, 0>, <1950, 22>, <1949, 111>, <1950, -11>, <1949, 78>,
  <1951, 25>, <1951, 10>, <1952, 22>, <1954, 0>, <1954, 22>
Shuffling (group values by key) → Reduce()
Reduce input (the grouped lists also include values from map tasks not shown):
  <1949, [111, 78]>, <1950, [0, 22, -11]>, <1951, [10, 76, 34]>,
  <1952, [22, 34]>, <1953, [45]>, <1955, [23]>
Reduce output (maximum per year):
  <1949, 111>, <1950, 22>, <1951, 76>, <1952, 34>, <1953, 45>, <1955, 23>
Analyzing the Data with Hadoop – Java MapReduce
 Having run through how the MapReduce program works, the next step is to express it in code
 A map function, a reduce function, and some code to run the job are needed.
 Map function
10
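The slide’s code figure is not reproduced here; a sketch of the map function along the lines of the book’s old-API example looks roughly like this (field offsets as described above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old-API mapper: pulls the year and air temperature out of each line and
// emits <year, temperature> for every valid reading.
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The mapper emits a <year, temperature> pair only when the reading is present (not 9999) and its quality code indicates a trusted measurement.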
Analyzing the Data with Hadoop – Java MapReduce
 Reduce function
11
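Again as a sketch of the old-API code, the reduce function simply scans the temperatures for each year and keeps the maximum:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Old-API reducer: iterates over all temperatures for a year and emits the maximum.
public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}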
Analyzing the Data with Hadoop – Java MapReduce
 Main function for running the MapReduce job
12
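A sketch of the old-API driver, which wires the mapper and reducer together and submits the job:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Old-API driver: configures the job and runs it to completion.
public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

The input and output paths are taken from the command line; the output directory must not already exist when the job is submitted.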
Analyzing the Data with Hadoop – Java MapReduce
 A test run
 The output is written to the output directory, which contains one output file
13
Analyzing the Data with Hadoop – Java MapReduce
 The new Java MapReduce API
 The new API, referred to as “Context Objects”, is type-incompatible with the old, so
applications need to be rewritten to take advantage of it.
 Notable differences
• Favors abstract classes over interfaces: Mapper and Reducer, which were interfaces in the old API, are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package and subpackages.
• The old API can still be found in org.apache.hadoop.mapred
• Makes extensive use of context objects that allow the user code to communicate with the MapReduce system
• e.g., the MapContext essentially unifies the role of the JobConf, the OutputCollector, and the Reporter
• Supports both a ‘push’ and a ‘pull’ style of iteration
• Basically key-value record pairs are pushed to the mapper, but in addition, the new API allows a
mapper to pull records from within the map() method.
• The same goes for the reducer
• Configuration has been unified.
• The old API has a JobConf object for job configuration, which is an extension of Hadoop’s vanilla
Configuration object.
• In the new API, job configuration is done through a Configuration.
• Job control is performed through the Job class rather than JobClient.
• Output files are named slightly differently
• part-m-nnnnn for map outputs, part-r-nnnnn for reduce outputs
• (nnnnn is an integer designating the part number, starting from 0)
14
Analyzing the Data with Hadoop – Java MapReduce
 The new Java MapReduce API
 Example 2-6 shows the MaxTemperature application rewritten to use the new API.
15
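Example 2-6 itself is not reproduced on the slide; a sketch in the same spirit, rewritten against the new org.apache.hadoop.mapreduce API (class names illustrative), might look like this:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// New-API version: Mapper and Reducer are abstract classes, user code talks to
// the framework through Context objects, and the job is driven by the Job class.
public class NewMaxTemperature {

  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature = Integer.parseInt(
          line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature)); // emit via Context
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) { // values arrive as an Iterable, not an Iterator
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job();                      // Job replaces JobConf/JobClient
    job.setJarByClass(NewMaxTemperature.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}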
Scaling Out
 To scale out, we need to store the data in a distributed filesystem, HDFS.
 Hadoop moves the MapReduce computation to each machine hosting a part
of the data.
 Data Flow
 A MapReduce job consists of the input data, the MapReduce program, and
configuration information.
 Hadoop runs the job by dividing it into two types of tasks: map tasks and reduce tasks.
 Two types of nodes: one jobtracker and several tasktrackers
• Jobtracker: coordinates the job by scheduling tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
 Hadoop divides the input into fixed-size pieces, called input splits, or just splits.
 Hadoop creates one map task for each split, which runs the user-defined map function
for each record in the split.
 The quality of the load balancing increases as the splits become more fine-grained.
• Default split size: one HDFS block (64 MB by default)
 Map tasks write their output to the local disk, not to HDFS.
 If the node running a map task fails, Hadoop will automatically rerun the map task on
another node to re-create the map output.
16
Scaling Out
 Data Flow – single reduce task
 Reduce tasks don’t have the advantage of data locality – the input to a single reduce task
is normally the output from all mappers.
 All map outputs are merged across the network and passed to the user-defined reduce
function.
 The output of the reduce is normally stored in HDFS.
17
Scaling Out
 Data Flow – multiple reduce tasks
 The number of reduce tasks is specified independently; it is not governed by the size of the input.
 The map tasks partition their output by keys, each creating one partition for each reduce
task.
 There can be many keys and their associated values in each partition, but the records for
any key are all in a single partition.
18
Scaling Out
 Data Flow – zero reduce task
19
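The number of reduce tasks is not computed by Hadoop; it is set on the job configuration. Sticking with the old API used in the driver sketch above, the relevant (illustrative) fragment is:

// In the driver, after creating the JobConf:
conf.setNumReduceTasks(2);    // two partitions, hence two reduce output files
// conf.setNumReduceTasks(0); // zero reduce tasks: map output is written directly to HDFS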
Scaling Out
 Combiner Functions
 Many MapReduce jobs are limited by the bandwidth available on the cluster.
 It pays to minimize the data transferred between map and reduce tasks.
 Hadoop allows the user to specify a combiner function to be run on the map
output – the combiner function’s output forms the input to the reduce function.
 The contract for the combiner function constrains the type of function that may be used.
 Example without a combiner function
 Example with a combiner function, finding maximum temperature for a map
20
Without a combiner:
  First map output:  <1950, 0>, <1950, 20>, <1950, 10>
  Second map output: <1950, 25>, <1950, 15>
  Shuffling → reduce input: <1950, [0, 20, 10, 25, 15]>
  Reduce output: <1950, 25>

With a combiner (maximum temperature per map output):
  First map output:  <1950, 0>, <1950, 20>, <1950, 10>  → combiner → <1950, 20>
  Second map output: <1950, 25>, <1950, 15>             → combiner → <1950, 25>
  Shuffling → reduce input: <1950, [20, 25]>
  Reduce output: <1950, 25>
Scaling Out
 Combiner Functions
 The function calls on the temperature values can be expressed as follows:
• Max(0, 20, 10, 25, 15) = max( max(0, 20, 10), max(25, 15) ) = max(20, 25) = 25
 Calculating mean temperatures could not use the mean as the combiner function, since:
• mean(0, 20, 10, 25, 15) = 14, but
• mean( mean(0, 20, 10), mean(25, 15) ) = mean(10, 20) = 15
 The combiner function doesn’t replace the reduce function.
 It can, however, help cut down the amount of data shuffled between the map and reduce tasks.
21
Scaling Out
 Combiner Functions
 Specifying a combiner function
• The combiner function is defined using the Reducer interface
• For this application, it has the same implementation as the reducer in MaxTemperatureReducer.
• The only change is to set the combiner class on the JobConf.
22
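With the old API used in the driver sketch above, enabling the combiner is a single extra call on the JobConf:

conf.setMapperClass(MaxTemperatureMapper.class);
conf.setCombinerClass(MaxTemperatureReducer.class); // rerun the reducer's max logic on each map's output
conf.setReducerClass(MaxTemperatureReducer.class);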
Hadoop Streaming
 Hadoop provides an API to MapReduce that allows you to write the map and reduce
functions in languages other than Java.
 Virtually any language can be used to write MapReduce programs.
 Hadoop Streaming
 Map input data is passed over standard input to your map function.
 The map function processes the data line by line and writes lines to standard output.
 A map output key-value pair is written as a single tab-delimited line.
 The reduce function reads lines from standard input (sorted by key) and writes its
results to standard output.
23
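The Streaming contract is language-agnostic: any executable that reads records from standard input and writes tab-separated key-value lines to standard output can act as a mapper or reducer. Purely to illustrate that contract (the deck’s own Streaming examples use Ruby and Python), a minimal map task could even be written in Java:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A Streaming map task: read NCDC lines from stdin and write "year<TAB>temperature"
// lines to stdout for every valid reading.
public class StreamingMaxTemperatureMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      String year = line.substring(15, 19);
      String temp = line.substring(87, 92);     // includes the sign character
      String quality = line.substring(92, 93);
      if (!temp.equals("+9999") && quality.matches("[01459]")) {
        System.out.println(year + "\t" + temp);
      }
    }
  }
}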
Hadoop Streaming
 Ruby
 The map function can be expressed in Ruby.
 Simulating the map function in Ruby with a Unix pipeline
 The reduce function for maximum temperature in Ruby
24
Hadoop Streaming
 Ruby
 Simulating the whole MapReduce pipeline with a Unix pipeline
 Hadoop command to run the whole MapReduce job
 Running the job with a combiner, which can be written in any Streaming language
25
Hadoop Streaming
 Python
 Streaming supports any programming language that can read from standard input and
write to standard output.
 The map and reduce scripts in Python
 Test the programs and run the job in the same way we did in Ruby.
26
Hadoop Pipes
 Hadoop Pipes
 The name of the C++ interface to Hadoop MapReduce.
 Pipes uses sockets as the channel over which the tasktracker communicates with the
process running the C++ map or reduce function.
 The source code for the map and reduce functions in C++
27
Hadoop Pipes
 The source code for the map and reduce functions in C++
28
Hadoop Pipes
 Compiling and Running
 A Makefile for the C++ MapReduce program
 It defines PLATFORM, which specifies the operating system, architecture, and data model
(e.g., 32- or 64-bit).
 To run a Pipes job, we need the Hadoop daemons running in pseudo-distributed mode.
 The next step is to copy the executable (program) to HDFS.
 Next, the sample data is copied from the local filesystem to HDFS.
29
Hadoop Pipes
 Compiling and Running
 Now we can run the job. For this, we use the hadoop pipes command, passing the URI of
the executable in HDFS using the -program argument:
30
