MapReduce
Farzad Nozarian
4/11/15 @AUT
A tutorial presentation based on the hadoop.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a Teaching Assistant for the Cloud Computing course of Dr. Amir H. Payberah in the spring semester of 2015.
Purpose

This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce.
Supported Platforms

• GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Windows is also a supported platform, but the following steps are for Linux only.
Required Software

• Java™ must be installed. Recommended Java versions are described at http://wiki.apache.org/hadoop/HadoopJavaVersions
• ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
• To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors.

  $ sudo apt-get install ssh
  $ sudo apt-get install rsync
Prepare to Start the Hadoop Cluster

• Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows:

  # set to the root of your Java installation
  export JAVA_HOME=/usr/lib/jvm/jdk1.7.0

  # Assuming your installation directory is /usr/local/hadoop
  export HADOOP_PREFIX=/usr/local/hadoop

• Try the following command:

  $ bin/hadoop

  This will display the usage documentation for the hadoop script.
Prepare to Start the Hadoop Cluster (Cont.)

• Now you are ready to start your Hadoop cluster in one of the three supported modes:
  • Local (Standalone) Mode: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
  • Pseudo-Distributed Mode: Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
  • Fully-Distributed Mode
Pseudo-Distributed Configuration

• etc/hadoop/core-site.xml:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

• etc/hadoop/hdfs-site.xml:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>
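If you also want to run the MapReduce job on YARN in this pseudo-distributed setup, the single-node setup guide configures two further files. A minimal sketch (the property names are from the Hadoop documentation; the values shown are the usual single-node defaults):

```xml
<!-- etc/hadoop/mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml: enable the shuffle service for MapReduce -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```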
MapReduce Execution Pipeline

[Figure: diagram of the MapReduce execution pipeline]
Main components of the MapReduce execution pipeline

• Driver:
  • The main program that initializes a MapReduce job.
  • It defines job-specific configuration and specifies all of its components:
    • input and output formats
    • mapper and reducer
    • use of a combiner
    • use of a custom partitioner
  • The driver can also get back the status of the job execution.
• Context:
  • The driver, mappers, and reducers are executed in different processes, typically on multiple machines.
  • A context object is available at any point of MapReduce execution.
  • It provides a convenient mechanism for exchanging required system and job-wide information.
• Input data:
  • This is where the data for a MapReduce task is initially stored.
  • This data can reside in HDFS, HBase, or other storage.
• InputFormat:
  • This defines how input data is read and split.
  • InputFormat is a class that defines the InputSplits that break input data into tasks.
  • It provides a factory for RecordReader objects that read the file.
  • Several InputFormats are provided by Hadoop.
• InputSplit:
  • An InputSplit defines a unit of work for a single map task in a MapReduce program.
  • The InputFormat (invoked directly by a job driver) defines the number of map tasks that make up the mapping phase.
  • Each map task is given a single InputSplit to work on.
• RecordReader:
  • Although the InputSplit defines a data subset for a map task, it does not describe how to access the data.
  • The RecordReader class actually reads the data from its source, converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method.
  • The RecordReader class is defined by the InputFormat.
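As an illustration of what a line-oriented RecordReader produces, the sketch below turns the raw text of a split into (byte offset, line) pairs, which is the shape of the (k1, v1) input a text-file mapper receives. This is plain Java written for this tutorial, not the actual Hadoop LineRecordReader class:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of what a line-oriented RecordReader does:
// turn the bytes of a split into (byte offset, line) key/value
// pairs, one pair per call to the mapper's map() method.
public class LineRecordReaderSketch {
    public static List<Map.Entry<Long, String>> readRecords(String split) throws IOException {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader(split));
        long offset = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            records.add(new SimpleEntry<>(offset, line));   // key = offset, value = line
            offset += line.length() + 1;                    // +1 for the newline separator
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<Long, String> r : readRecords("hello world\nhadoop mapreduce\n")) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
        // prints: 0 -> hello world
        //         12 -> hadoop mapreduce
    }
}
```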
• Mapper:
  • Performs the user-defined work of the first phase of the MapReduce program.
  • It takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map execution.
  • The map typically transforms the input pair into an output pair (k2, v2), which is used as an input for shuffle and sort.
• Partition:
  • A subset of the intermediate key space (k2, v2) produced by each individual mapper is assigned to each reducer.
  • These subsets (or partitions) are the inputs to the reduce tasks.
  • Each map task may emit key/value pairs to any partition.
  • The Partitioner class determines which reducer a given key/value pair will go to.
  • The default Partitioner computes a hash value for the key, and assigns the partition based on this result.
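The hash-then-modulo logic of the default partitioner can be sketched in a few lines of plain Java (the class name here is ours; Hadoop's own implementation is its HashPartitioner):

```java
// Sketch of the default hash-partitioning logic: mask off the sign
// bit of the key's hash, then take it modulo the number of reduce
// tasks. Every pair with the same key therefore lands in the same
// partition, and so is processed by the same reducer.
public class HashPartitionerSketch {
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(getPartition("hadoop", 4) == getPartition("hadoop", 4)); // prints true
    }
}
```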
• Shuffle:
  • Once at least one map function for a given node is completed, and the keys' space is partitioned, the run time begins moving the intermediate outputs from the map tasks to where they are required by the reducers.
  • This process of moving map outputs to the reducers is known as shuffling.
• Sort:
  • The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop to form keys/values (k2, {v2, v2, …}) before they are presented to the reducer.
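The grouping-and-sorting step can be rehearsed in memory: collect the (k2, v2) pairs emitted by the mappers, and build a key-sorted map from each key to the list of its values, which is exactly the (k2, {v2, v2, …}) shape the reducer sees. A plain-Java sketch (no Hadoop classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch of the sort/group step: intermediate (k2, v2)
// pairs from all mappers are grouped per key and iterated in key
// order, as (k2, {v2, v2, ...}). A TreeMap provides the sorted-by-key
// iteration order that Hadoop guarantees to the reducer.
public class ShuffleSortSketch {
    public static TreeMap<String, List<Integer>> group(List<String[]> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(kv[1]));
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = List.of(
            new String[]{"world", "1"},
            new String[]{"hello", "1"},
            new String[]{"hello", "1"});
        System.out.println(group(mapOutput)); // prints {hello=[1, 1], world=[1]}
    }
}
```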
• Reducer:
  • A reducer is responsible for an execution of user-provided code for the second phase of job-specific work.
  • For each key assigned to a given reducer, the reducer's reduce() method is called once.
  • This method receives a key, along with an iterator over all the values associated with the key.
  • The reducer typically transforms the input key/value pairs into output pairs (k3, v3).
• OutputFormat:
  • The responsibility of the OutputFormat is to define a location of the output data and the RecordWriter used for storing the resulting data.
• RecordWriter:
  • A RecordWriter defines how individual output records are written.
Let's try it with a simple example!

Word Count (the Hello World! of MapReduce, available in the Hadoop sources)

We want to count the occurrences of every word of a text file.
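Before the Hadoop code, the same computation can be sketched as a shell pipeline. This is only a conceptual analogue of the three phases, not how Hadoop actually runs the job:

```shell
# map:          emit one word per line
# shuffle/sort: bring equal words next to each other
# reduce:       count each run of equal words
printf 'hello world\nhello hadoop\n' | tr -s ' ' '\n' | sort | uniq -c
```

The output lists each distinct word with its count, e.g. "hello" appears with a count of 2.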
Driver

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    …
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(
        otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Mapper class

//inside WordCount class
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Reducer class

//inside WordCount class
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
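The tokenize-then-sum logic of the mapper and reducer above can be checked without a cluster. The sketch below (a hypothetical class written for this tutorial, with no Hadoop dependencies) runs the same computation in memory:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory rehearsal of the WordCount pipeline: tokenize each line
// (the mapper's job), group counts per word (shuffle/sort), and sum
// them (the reducer's job). Plain Java only; no Hadoop classes.
public class WordCountSketch {
    public static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();       // key-sorted, like reducer input
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line); // same tokenizer as the mapper
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum); // the reducer's sum
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello world", "hello hadoop"));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```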
References:

• hadoop.apache.org
• Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, Professional Hadoop Solutions, Wiley.