Hadoop
Program walkthrough
Ms. Sarwan Singh
Assistant Director (Systems)
Training Division, NIELIT Chandigarh
Email: sarwan@nielit.gov.in
Mobile: 7087235355
MapReduce Program
Example: WordCount
• Let's walk through an example MapReduce application to get a feel for how these programs work.
• WordCount is a simple application that counts the number of occurrences of each word in a given input set.
• It works with a local (standalone), pseudo-distributed, or fully-distributed Hadoop installation (Single Node Setup).
package org.myorg;   // user-defined package
// imports of system-defined packages go here (listed on the next slide)

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException { … }
        // map method ends
    } // Map class ends

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException { … }
        // reduce method ends
    } // Reduce class ends

    public static void main(String[] args) throws Exception { … }

} // class WordCount ends
Importing packages
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
Map class
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);  // the constant count 1
    private Text word = new Text();                             // reused output key

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();                         // the input line
        StringTokenizer tokenizer = new StringTokenizer(line);  // split on whitespace
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);                          // emit <word, 1>
        } // while ends
    } // map method ends
} // Map class ends
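To see what the map method does to one line, the tokenization can be tried outside Hadoop. A minimal plain-Java sketch (TokenizeDemo is an illustrative name; the sample line is the contents of file01 from the run shown later):

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        String line = "Hello World Bye World";                  // contents of file01
        StringTokenizer tokenizer = new StringTokenizer(line);  // splits on whitespace
        while (tokenizer.hasMoreTokens()) {
            // the real Mapper emits each token as the pair <word, 1>
            System.out.println("< " + tokenizer.nextToken() + ", 1>");
        }
    }
}

Running it prints < Hello, 1>, < World, 1>, < Bye, 1>, < World, 1> - the same pairs the first map emits in the walk-through later.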
Reduce class
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the 1s emitted for this key
        }
        output.collect(key, new IntWritable(sum));  // emit <word, total count>
    } // reduce method ends
} // Reduce class ends
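The summing loop can likewise be exercised in isolation. A minimal sketch, assuming the Hadoop core jar is on the classpath for IntWritable and feeding in the two grouped values for the key Hadoop (ReduceDemo is an illustrative name):

import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;

public class ReduceDemo {
    public static void main(String[] args) {
        // the grouped values for key "Hadoop": two 1s from the second map
        Iterator<IntWritable> values =
                Arrays.asList(new IntWritable(1), new IntWritable(1)).iterator();
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // same loop as the reduce method above
        }
        System.out.println("< Hadoop, " + sum + ">");   // prints < Hadoop, 2>
    }
}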
Main method
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // output key type: Text
    conf.setOutputValueClass(IntWritable.class);  // output value type: IntWritable

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);          // Reduce doubles as the combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input path from CLI
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output path from CLI

    JobClient.runJob(conf);                       // submit the job and wait
}
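The driver above uses the older org.apache.hadoop.mapred API. On Hadoop 2.x and later the same job is typically set up through org.apache.hadoop.mapreduce.Job; a hedged sketch of an equivalent driver, assuming TokenizerMapper and IntSumReducer are new-API rewrites of the Map and Reduce classes above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);   // assumed new-API Mapper
    job.setCombinerClass(IntSumReducer.class);   // combiner, as in the old driver
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from CLI
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from CLI
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
}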
Running the program
Usage
• Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the installed Hadoop version, compile WordCount.java and create a jar.
• Assuming that:
  /usr/joe/wordcount/input  - input directory in HDFS
  /usr/joe/wordcount/output - output directory in HDFS
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Running the program (contd.)
• Sample text files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
• Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
/usr/joe/wordcount/input /usr/joe/wordcount/output
• Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Walk-through Program
• The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It splits the line into whitespace-separated tokens via StringTokenizer and emits a key-value pair of <word, 1> for each token.
• For the given sample input, the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
• The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
Walk-through Program
• WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner (which, per the job configuration, is the same class as the Reducer) for local aggregation after being sorted on the keys; a sketch of this combining step follows the sample outputs below.
• The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
• The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
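The sort-then-combine step can be illustrated in plain Java, with no Hadoop API. A minimal sketch that takes the first map's four pairs, keeps the keys sorted, and sums runs of equal keys (CombinerDemo is an illustrative name):

import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class CombinerDemo {
    public static void main(String[] args) {
        // the four <word, 1> pairs emitted by the first map, in emission order
        List<String> words = Arrays.asList("Hello", "World", "Bye", "World");
        // a TreeMap keeps keys sorted; merging the 1s mimics local aggregation
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        counts.forEach((k, v) -> System.out.println("< " + k + ", " + v + ">"));
        // prints < Bye, 1>, < Hello, 1>, < World, 2> - matching the output above
    }
}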
Walk-through Program
• The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
• Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
• The main method specifies various facets of the job in the JobConf, such as the input/output paths (passed via the command line), key/value types, and input/output formats. It then calls JobClient.runJob to submit the job and monitor its progress.
Cluster of machines running Hadoop at Yahoo (Source: Yahoo)