Big Data Analysis using Hadoop
Map-Reduce – An Introduction
Lecture 2

Brendan Tierney
www.oralytics.com | t : @brendantierney | e : brendan.tierney@oralytics.com
HDFS Architecture [figure from Hadoop in Practice, Alex Holmes]
MapReduce
•  A batch-based, distributed computing framework modelled on Google's MapReduce paper [http://research.google.com/archive/mapreduce.html]
•  MapReduce decomposes work into small parallelised map and reduce tasks, which are scheduled for remote execution on slave nodes
•  Terminology
   •  A job is a full programme
   •  A task is the execution of a single map or reduce over a slice of data called a split
   •  A Mapper is a map task
   •  A Reducer is a reduce task
•  MapReduce works by manipulating key/value pairs in the general format
   map(key1, value1) ➝ list(key2, value2)
   reduce(key2, list(value2)) ➝ (key3, value3)
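For instance, in the word-count job built up later in this lecture, these signatures become (informally):
   map(byte offset of line, line of text) ➝ list((word, 1))
   reduce(word, list(1, 1, ...)) ➝ (word, total count)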
A MapReduce Job [figures from Hadoop in Practice, Alex Holmes]
•  The input is divided into fixed-size pieces called input splits; a map task is created for each split
•  The role of the programmer is to define the Map and Reduce functions
•  The Shuffle & Sort phases between the Map and Reduce phases combine the map outputs and sort them for the Reducers
•  The Reduce phase merges the data, as defined by the programmer, to produce the outputs
Map
•  The Map function
   •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file)
   •  It produces zero or more output key/value pairs for each input pair
      •  e.g. a filtering function may only produce output if a certain condition is met
      •  e.g. a counting function may produce multiple key/value pairs, one per element being counted
   map(in_key, in_value) ➝ list(temp_key, temp_value)
Reduce
•  The Reducer(s)
   •  A single Reducer handles all the map output for a unique map output key
   •  A Reducer outputs zero to many key/value pairs
   •  The output is written to HDFS files, to external DBs, or to any data sink...
   reduce(temp_key, list(temp_values)) ➝ list(out_key, out_value)
MapReduce
•  JobTracker (Master)
   •  Controls MapReduce jobs
   •  Assigns Map & Reduce tasks to the other nodes on the cluster
   •  Monitors the tasks as they are running
   •  Relaunches failed tasks on other nodes in the cluster
•  TaskTracker (Slave)
   •  A single TaskTracker per slave node
   •  Manages the execution of the individual tasks on the node
   •  Can instantiate many JVMs to handle tasks in parallel
   •  Communicates back to the JobTracker (via a heartbeat)
[figure from Hadoop in Practice, Alex Holmes]
A MapReduce Job [figure from Hadoop: The Definitive Guide, Tom White]
Monitoring progress [figure from Hadoop: The Definitive Guide, Tom White]
YARN (Yet Another Resource Negotiator) Framework
Data Locality
"This is a local node for local Data"
•  Whenever possible, Hadoop will attempt to ensure that a Mapper on a node is working on a block of data stored locally on that node via HDFS
•  If this is not possible, the Mapper will have to transfer the data across the network as it accesses the data
•  Once all the Map tasks are finished, the map output data is transferred across the network to the Reducers
•  Although Reducers may run on the same node (physical machine) as the Mappers, there is no concept of data locality for Reducers
Bottlenecks?
•  Reducers cannot start until all Mappers are finished and the output has been transferred to the Reducers and sorted
•  To alleviate bottlenecks in Shuffle & Sort, Hadoop starts to transfer data to the Reducers as the Mappers finish
   •  The percentage of Mappers which should finish before the Reducers start retrieving data is configurable (see the sketch below)
•  To alleviate bottlenecks caused by slow Mappers, Hadoop uses speculative execution
   •  If a Mapper appears to be running significantly slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data (remember replication)
   •  The results of the first Mapper to finish will be used
   •  The Mapper which is still running will be terminated by Hadoop
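As a minimal sketch, both behaviours are controlled by standard MapReduce configuration properties, which can be set in the driver before the Job is created (the values shown are illustrative, not recommendations):

// (import statements omitted, as elsewhere in these slides)
Configuration conf = new Configuration();
// Fraction of Map tasks that must complete before Reducers start
// fetching map output (the "slow start" threshold)
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
// Enable or disable speculative execution for map and reduce tasks
conf.setBoolean("mapreduce.map.speculative", true);
conf.setBoolean("mapreduce.reduce.speculative", true);
Job job = Job.getInstance(conf);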
The MapReduce Job
Let us build up an example
The Scenario
•  Build a Word Counter
   •  Using the Shakespeare Poems
   •  Count the number of times a word appears in the data set
•  Use Map-Reduce to do this work
•  Step-by-step creation of the MR process
The three components of the job: the Driver class, the Mapper, and the Reducer
Setting up the MapReduce Job
•  A Job object forms the specification for the job
•  Job needs to know:
   •  the jar file that the code is in, which will be distributed around the cluster; setJarByClass()
   •  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath()
   •  the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath()
   •  the Mapper and Reducer classes; setMapperClass() setReducerClass()
   •  the output key and value classes; setOutputKeyClass() setOutputValueClass()
   •  the Mapper output key and value classes, if they are different from those of the Reducer; setMapOutputKeyClass() setMapOutputValueClass()
   •  the name of the job (the default is the name of the jar file); setJobName()
•  The default input considers the file as lines of text
   •  The default key input is LongWritable (the byte offset into the file)
   •  The default value input is Text (the contents of the line read from the file)
Driver Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance();
    job.setJarByClass(WordCount.class);
    job.setJobName("WordCount");
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
You will typically import these classes into every MapReduce job you write. The import statements are omitted from the remaining code examples for brevity.
Walking through the driver code above:
•  The main method accepts two command-line arguments: the input and output directories. The first step is to ensure that we have been given two command-line arguments; if not, print a help message and exit.
•  Create a new job, specify the class which will be called to run the job, and give it a Job Name.
•  Give the Job information about the classes for the Mapper and the Reducer.
•  Specify the format of the intermediate output key and value produced by the Mapper.
•  Specify the types for the Reducer output key and value.
•  Specify the input directory (where the data will be read from) and the output directory (where the data will be written).
•  Submit the Job and wait for completion.
File formats - Inputs
•  The default InputFormat (TextInputFormat) will be used unless you specify otherwise
•  To use an InputFormat other than the default, use e.g.
   conf.setInputFormat(KeyValueTextInputFormat.class)
•  By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
   •  Exceptions: items whose names begin with a period (.) or underscore (_)
•  Globs can be specified to restrict input
   •  For example, /2010/*/01/*
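Note: setInputFormat() belongs to the older "mapred" API; with the newer API used in the driver code in this lecture, the equivalent would be a sketch like:

job.setInputFormatClass(KeyValueTextInputFormat.class);
// Input paths may contain globs; this path is illustrative
FileInputFormat.addInputPath(job, new Path("/2010/*/01/*"));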
File formats - Outputs
•  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
•  The driver can also specify the format of the output data
   •  The default is a plain text file
   •  This could be explicitly written as
      conf.setOutputFormat(TextOutputFormat.class);
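Again, setOutputFormat() is the older API call; with the newer API the equivalent would be:

job.setOutputFormatClass(TextOutputFormat.class);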
Mapper
•  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file)
•  The Mapper may use or ignore the input key
   •  e.g. a standard pattern is to read a file one line at a time
      •  Key = byte offset into the file where the line starts
      •  Value = contents of the line in the file
   •  Typically the key can be considered irrelevant
•  It produces zero or more output key/value pairs for each input pair
   •  e.g. a filtering function may only produce output if a certain condition is met
   •  e.g. a counting function may produce multiple key/value pairs, one per element being counted
Mapper Class
•  extends the Mapper<K1, V1, K2, V2> class
   •  the key and value classes implement the WritableComparable and Writable interfaces
•  most Mappers override the map method, which is called once for every key/value pair in the input:
   void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
•  the default map method is the identity mapper, which maps the inputs directly to the outputs
•  in general, the map input types K1, V1 are different from the map output types K2, V2
Mapper Class
•  Hadoop provides a number of Mapper implementations:
   •  InverseMapper - swaps the keys and values
   •  TokenCounterMapper - tokenises the input and outputs each token with a count of 1
   •  RegexMapper - extracts text matching a regular expression
•  Example:
   job.setMapperClass(TokenCounterMapper.class);
Mapper Code
// import statements omitted for brevity
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Inputs: LongWritable key, Text value; outputs: Text key, IntWritable value
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Process the input text: split the line into words on non-word characters
    String s = value.toString();
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        // Write the outputs: one (word, 1) pair per word
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
What the mapper does
•  Input to the Mapper:
   ("this one I think is called a yink")
   ("he likes to wink, he likes to drink")
   ("he likes to drink and drink and drink")
•  Output from the Mapper:
   (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
   (he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
   (he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
Shuffle and sort
•  Shuffle
   •  Integrates the data (key/value pairs) from the outputs of each Mapper
   •  For now, integrates into one file
•  Sort
   •  The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer
   •  Sorted within key
•  Determines what subset of data goes to which Reducer (see the partitioner sketch below)
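Which Reducer a given key goes to is decided by the job's Partitioner. As a sketch, the default behaviour (essentially what Hadoop's HashPartitioner does) is:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then
    // bucket keys by their hash code across the reduce tasks
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

All values for a given key therefore land on the same Reducer.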
Mapper output:
(this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
(he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
(he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)

Shuffle (Group):
(this, [1]) (one, [1]) (I, [1]) (think, [1]) (called, [1]) (is, [1]) (a, [1]) (yink, [1])
(he, [1,1,1]) (likes, [1,1,1]) (to, [1,1,1]) (wink, [1]) (drink, [1,1,1,1]) (and, [1,1])

Sort:
(a, [1]) (and, [1,1]) (called, [1]) (drink, [1,1,1,1]) (he, [1,1,1]) (I, [1]) (is, [1])
(likes, [1,1,1]) (one, [1]) (think, [1]) (this, [1]) (to, [1,1,1]) (wink, [1]) (yink, [1])
Reducer Class
•  extends the Reducer<K2, V2, K3, V3> class
   •  the key and value classes implement the WritableComparable and Writable interfaces
•  void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException
   •  called once for each input key
   •  generates a list of output key/value pairs by iterating over the values associated with the input key
   •  the reduce input types K2, V2 must be the same types as the map output types
   •  the reduce output types K3, V3 can be different from the reduce input types
   •  the default reduce method is the identity reducer, which outputs each key/value pair directly
•  getConfiguration() - access the Configuration for a Job
•  void setup(Context context) - called once at the beginning of the reduce task
•  void cleanup(Context context) - called at the end of the task to wrap up any loose ends: close files, DB connections, etc. (a sketch of these overrides follows below)
•  Default number of Reducers = 1
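A minimal sketch of a Reducer that overrides setup() and cleanup() as well as reduce() (the class name here is illustrative):

public class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Called once before any keys are processed, e.g. to open a DB
    // connection or read settings via context.getConfiguration()
  }
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per key, with all of the values for that key
  }
  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Called once at the end of the task, e.g. to close resources
  }
}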
Reducer Class
•  Hadoop provides some Reducer implementations:
   •  IntSumReducer - sums the values (integers) for a given key
   •  LongSumReducer - sums the values (longs) for a given key
•  Example:
   job.setReducerClass(IntSumReducer.class);
•  See:
   http://hadooptutorial.info/predefined-mapper-and-reducer-classes/
   http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.mapreduce.lib.map.InverseMapper
Reducer Code
// import statements omitted for brevity
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Inputs: Text key, Iterable<IntWritable> values; outputs: Text key, IntWritable value
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Process the input values: sum the counts for this word
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    // Write the outputs: one (word, total) pair per key
    context.write(key, new IntWritable(wordCount));
  }
}
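Putting it all together: once the Driver, Mapper and Reducer are compiled and packaged into a jar file, the job is typically submitted from the command line along these lines (the jar file name is illustrative):

hadoop jar WordCount.jar WordCount <input path> <output path>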

More Related Content

What's hot

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Simplilearn
 

What's hot (20)

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
MapReduce
MapReduceMapReduce
MapReduce
 
Sqoop
SqoopSqoop
Sqoop
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
BIGDATA ANALYTICS LAB MANUAL final.pdf
BIGDATA  ANALYTICS LAB MANUAL final.pdfBIGDATA  ANALYTICS LAB MANUAL final.pdf
BIGDATA ANALYTICS LAB MANUAL final.pdf
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 

Viewers also liked

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseBrendan Tierney
 
SQL: The one language to rule all your data
SQL: The one language to rule all your dataSQL: The one language to rule all your data
SQL: The one language to rule all your dataBrendan Tierney
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Predictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productPredictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productBrendan Tierney
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBRainforest QA
 
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016Brendan Tierney
 
OUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryOUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryBrendan Tierney
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using DiscoJim Roepcke
 

Viewers also liked (20)

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
SQL: The one language to rule all your data
SQL: The one language to rule all your dataSQL: The one language to rule all your data
SQL: The one language to rule all your data
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Predictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productPredictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable product
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDB
 
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016
 
OUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryOUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th January
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using Disco
 

Similar to MapReduce Hadoop Word Count

writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data Jay Nagar
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptxSakthiVinoth78
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to MapReduce Hadoop Word Count (20)

Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop
HadoopHadoop
Hadoop
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 

Recently uploaded

办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

MapReduce Hadoop Word Count

  • 1. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Big Data Analysis using Hadoop! ! Map-Reduce – An Introduction! ! Lecture 2! ! ! Brendan Tierney
  • 2. [from Hadoop in Practice, Alex Holmes] HDFS Architecture
  • 3. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com MapReduce •  A batch based, distributed computing framework modelled on Google’s paper on MapReduce [http://research.google.com/archive/mapreduce.html] •  MapReduce decomposes work into small parallelised map and reduce tasks which are scheduled for remote execution on slave nodes •  Terminology •  A job is a full programme •  A task is the execution of a single map or reduce task over a slice of data called a split •  A Mapper is a map task •  A Reducer is a reduce task •  MapReduce works by manipulating key/value pairs in the general format map(key1,value1)➝ list(key2,value2) reduce(key2,list(value2)) ➝ (key3, value3)
  • 4. [from Hadoop in Practice, Alex Holmes] A MapReduce Job
  • 5. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The input is divided into fixed-size pieces called input splits A map task is created for each split
  • 6. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The role of the programmer is to define the Map and Reduce functions
  • 7. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers...
  • 8. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers... The Reduce phase merges the data, as defined by the programmer to produce the outputs.
  • 9. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Map •  The Map function •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file) •  It produces zero or more outputs key/value pairs for each input pair •  e.g. a filtering function may only produce output if a certain condition is met •  e.g. a counting function may produce multiple key/value pairs, one per element being counted map(in_key, in_value) ➝ list(temp_key, temp_value)
  • 10. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Reduce •  The Reducer(s) •  A single Reducer handles all the map output for a unique map output key •  A Reducer outputs zero to many key/value pairs •  The output is written to HDFS files, to external DBs, or to any data sink... reduce(temp_key,list(temp_values) ➝ list(out_key, out_value)
  • 11. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com MapReduce •  JobTracker - (Master) •  Controls MapReduce jobs •  Assigns Map & Reduce tasks to the other nodes on the cluster •  Monitors the tasks as they are running •  Relaunches failed tasks on other nodes in the cluster •  TaskTracker - (Slave) •  A single TaskTracker per slave node •  Manage the execution of the individual tasks on the node •  Can instantiate many JVMs to handle tasks in parallel •  Communicates back to the JobTracker (via a heartbeat)
  • 12. [from Hadoop in Practice, Alex Holmes]
  • 13. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com [from Hadoop the Definitive Guide,Tom White] A MapReduce Job
  • 14. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com [from Hadoop the Definitive Guide,Tom White] Monitoring progress
  • 15. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com YARN (Yet Another Resource Negotiator) Framework
  • 16. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Data Locality ! “This is a local node for local Data” •  Whenever possible Hadoop will attempt to ensure that a Mapper on a node is working on a block of data stored locally on that node vis HDFS •  If this is not possible, the Mapper will have to transfer the data across the network as it accesses the data •  Once all the Map tasks are finished, the map output data is transferred across the network to the Reducers •  Although Reducers may run on the same node (physical machine) as the Mappers there is no concept of data locality for Reducers
  • 17. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Bottlenecks? •  Reducers cannot start until all Mappers are finished and the output has been transferred to the Reducers and sorted •  To alleviate bottlenecks in Shuffle & Sort - Hadoop starts to transfer data to the Reducers as the Mappers finish •  The percentage of Mappers which should finish before the Reducers start retrieving data is configurable •  To alleviate bottlenecks caused by slow Mappers - Hadoop uses speculative execution •  If a Mapper appears to be running significantly slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data (remember replication) •  The results of the first Mapper to finish will be used •  The Mapper which is still running will be terminated by Hadoop
  • 18.
  • 19. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com The MapReduce Job! ! Let us build up an example
  • 20. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com The Scenario •  Build a Word Counter •  Using the Shakespeare Poems •  Count the number of times a word appears in the data set •  Use Map-Reduce to do this work •  Step-by-Step of creating the MR process
  • 21. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Driving Class Mapper Reducer
  • 22. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Setting up the MapReduce Job •  A Job object forms the specification for the job •  Job needs to know: •  the jar file that the code is in which will be distributed around the cluster; setJarByClass() •  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath() •  the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath() •  the Mapper and Reducer classes; setMapperClass() setReducerClass() •  the output key and value classes; setOutputKeyClass() setOutputValueClass() •  the Mapper output key and value classes if they are different from the Reducer; setMapOutputKeyClass() setMapOutputValueClass() •  the Mapper output key and value classes •  the name of the job, default is the name of the jar file; setJobName() •  The default input considers the file as lines of text •  The default key input is LongWriteable (the byte offset into the file) •  The default value input is Text (the contents of the line read from the file)
  • 23. Driver Code

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.println("Usage: WordCount <input path> <output path>");
          System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(WordCount.class);
        job.setJobName("WordCount");
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

  • 24. You will typically import these classes into every MapReduce job you write. The import statements are omitted from the remaining slides for brevity.
  • 25. Driver Code (the same code, shown without the import statements).
  • 26. The main method accepts two command-line arguments: the input and output directories. The first step is to ensure that we have been given exactly two command-line arguments; if not, print a usage message and exit.
  • 27. Create a new Job, specify the class which will be called to run the job, and give the job a name.
  • 28. Give the Job the classes for the Mapper and the Reducer.
  • 29. Specify the types of the intermediate output key and value produced by the Mapper.
  • 30. Specify the types of the Reducer output key and value.
  • 31. Specify the input directory (where the data will be read from) and the output directory (where the data will be written).
  • 32. File formats - Inputs
    •  The default InputFormat (TextInputFormat) will be used unless you specify otherwise
    •  To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class) (with the org.apache.hadoop.mapreduce API used in the driver above)
    •  By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
      •  Exceptions: items whose names begin with a period (.) or underscore (_)
    •  Globs can be specified to restrict the input, for example /2010/*/01/*
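  A short sketch of both points together, placed inside the driver's main method; the input path is illustrative, and the new-API package location of KeyValueTextInputFormat is assumed:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    // Replace the default TextInputFormat with key/value text input
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // A glob restricting the input to the 01 directories of every month
    FileInputFormat.addInputPath(job, new Path("/2010/*/01/*"));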
  • 33. File formats - Outputs
    •  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
    •  The driver can also specify the format of the output data
      •  Default is a plain text file
      •  Could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class);
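  Similarly, a minimal sketch of switching the output to a binary format, assuming the standard org.apache.hadoop.mapreduce.lib.output package:

    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Write (key, value) pairs as a binary SequenceFile instead of plain text
    job.setOutputFormatClass(SequenceFileOutputFormat.class);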
  • 34. The final statement, System.exit(job.waitForCompletion(true) ? 0 : 1);, submits the Job and waits for completion (the true argument asks Hadoop to print progress), exiting with 0 on success and 1 on failure.
  • 35. Driver Code (the complete driver, as shown above).
  • 36. Mapper
    •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file)
    •  The Mapper may use or ignore the input key
      •  e.g. a standard pattern is to read a file one line at a time
        •  Key = byte offset into the file where the line starts
        •  Value = contents of the line in the file
        •  Typically the key is considered irrelevant
    •  It produces zero or more output key/value pairs for each input pair
      •  e.g. a filtering function may only produce output if a certain condition is met
      •  e.g. a counting function may produce multiple key/value pairs, one per element being counted
  • 37. Mapper Class
    •  extends the Mapper<K1, V1, K2, V2> class
      •  key and value classes implement the WritableComparable and Writable interfaces
    •  most Mappers override the map method, which is called once for every key/value pair in the input:
       void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
    •  the default map method is the Identity mapper, which maps the inputs directly to the outputs
    •  in general, the map input types K1, V1 are different from the map output types K2, V2
  • 38. Mapper Class
    •  Hadoop provides a number of Mapper implementations:
      •  InverseMapper - swaps the keys and values
      •  TokenCounterMapper - tokenises the input and outputs each token with a count of 1
      •  RegexMapper - extracts text matching a regular expression
    •  Example: job.setMapperClass(TokenCounterMapper.class); (see the sketch below)
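  Combined with one of the predefined Reducers (covered on slide 44), these classes can implement the whole word count without writing any custom classes. A minimal sketch, inside a driver like the one above:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    // Word count with no custom Mapper or Reducer:
    // TokenCounterMapper emits (token, 1) for every token in the input,
    // and IntSumReducer sums the 1s for each token
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);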
  • 39. Mapper Code

    ...  // import statements omitted
    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      // Inputs: (LongWritable key, Text value); outputs: (Text, IntWritable)
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Process the input text: split the line on runs of
        // non-word characters (note the escaped regex \\W+)
        String s = value.toString();
        for (String word : s.split("\\W+")) {
          if (word.length() > 0) {
            // Write the outputs: one (word, 1) pair per token
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }
  • 40. What the mapper does
    •  Input to the Mapper:
       ("this one I think is called a yink")
       ("he likes to wink, he likes to drink")
       ("he likes to drink and drink and drink")
    •  Output from the Mapper:
       (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
       (he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
       (he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
  • 41. Shuffle and sort
    •  Shuffle
      •  Integrates the data (key/value pairs) from the outputs of each Mapper
      •  For now, assume the outputs are integrated into one file
      •  Determines, via the Partitioner, what subset of the data goes to which Reducer (see the sketch after the example below)
    •  Sort
      •  The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer
      •  The pairs are sorted by key
  • 42. Mapper output:
       (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
       (he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
       (he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
     Shuffle (Group):
       (this, [1]) (one, [1]) (I, [1]) (think, [1]) (called, [1]) (is, [1]) (a, [1])
       (yink, [1]) (he, [1,1,1]) (likes, [1,1,1]) (to, [1,1,1]) (wink, [1])
       (drink, [1,1,1,1]) (and, [1,1])
     Sort:
       (a, [1]) (and, [1,1]) (called, [1]) (drink, [1,1,1,1]) (he, [1,1,1]) (I, [1])
       (is, [1]) (likes, [1,1,1]) (one, [1]) (think, [1]) (this, [1]) (to, [1,1,1])
       (wink, [1]) (yink, [1])
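  The assignment of keys to Reducers in the shuffle is pluggable: the default HashPartitioner hashes the key. A minimal sketch of a custom Partitioner for the word count types; the class name and the a-m split are illustrative, not from the slides:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative: words starting a-m go to Reducer 0, the rest to Reducer 1
    public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
          return 0; // only one Reducer: everything goes to it
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
      }
    }

  In the driver this would be wired up with job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2), both standard Job methods.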
  • 43. Reducer Class
    •  extends the Reducer<K2, V2, K3, V3> class
      •  key and value classes implement the WritableComparable and Writable interfaces
    •  void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException
      •  called once for each input key
      •  generates a list of output key/value pairs by iterating over the values associated with the input key
    •  the reduce input types K2, V2 must be the same as the map output types
    •  the reduce output types K3, V3 can be different from the reduce input types
    •  the default reduce method is the Identity reducer, which outputs each key/value pair directly
    •  getConfiguration() - access the Configuration for the Job
    •  void setup(Context context) - called once at the beginning of the reduce task
    •  void cleanup(Context context) - called once at the end of the task to wrap up any loose ends: close files, DB connections, etc. (see the sketch below)
    •  Default number of Reducers = 1
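  A minimal sketch of where setup() and cleanup() fit in a Reducer's lifecycle, assuming the word count types; the class name and the logging are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative reducer showing the task lifecycle hooks
    public class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void setup(Context context) {
        // Called once per reduce task, before the first reduce() call:
        // open files or DB connections, read the job Configuration, etc.
        String jobName = context.getConfiguration().get("mapreduce.job.name");
        System.err.println("Reduce task starting for job: " + jobName);
      }

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }

      @Override
      protected void cleanup(Context context) {
        // Called once per reduce task, after the last reduce() call:
        // close anything opened in setup()
        System.err.println("Reduce task finished");
      }
    }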
  • 44. Reducer Class
    •  Hadoop provides some Reducer implementations:
      •  IntSumReducer - sums the values (integers) for a given key
      •  LongSumReducer - sums the values (longs) for a given key
    •  Example: job.setReducerClass(IntSumReducer.class);
    •  http://hadooptutorial.info/predefined-mapper-and-reducer-classes/
    •  http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.mapreduce.lib.map.InverseMapper
  • 45. Reducer Code

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
          wordCount += value.get();
        }
        context.write(key, new IntWritable(wordCount));
      }
    }

  • 46. Inputs and outputs: the class signature Reducer<Text, IntWritable, Text, IntWritable> declares the input key/value types and the output key/value types.
  • 47. Processing the input: the reduce method iterates over all the values associated with the key and sums them.
  • 48. Writing the outputs: context.write() emits one (word, total count) pair per key.