2. Combiner
• A Combiner is also known as a mini-reducer or map-side reducer.
• The Combiner receives as input all data emitted by the Mapper
instances on a given node, and its output is then sent to the
Reducers.
• The Combiner runs between the Map class and the Reduce class
to reduce the volume of data transferred between Map and
Reduce.
• Usage of the Combiner is optional.
3. When?
• If the reduce function is both commutative and associative, no
additional code is needed to take advantage of a Combiner; the
Reducer class itself can be reused:
job.setCombinerClass(Reduce.class);
• The Combiner must be an instance of the Reducer interface; a
Combiner does not have a predefined interface of its own.
• If your Reducer itself cannot be used directly as a Combiner
because its operation is not commutative or associative, you might
still be able to write a third class to use as a Combiner for your job.
• Note – Hadoop does not guarantee how many times the combiner
function will run for a given map output; it may run zero, one, or
multiple times.
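The local aggregation a combiner performs can be sketched in plain Java, outside any Hadoop job. The sketch below (class and method names are hypothetical, not Hadoop API) sums the `(word, 1)` pairs one mapper would emit, which is safe precisely because addition is commutative and associative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Locally aggregate the (word, 1) pairs emitted by one mapper,
    // mimicking what a combiner does before data is shuffled to reducers.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> partialSums = new HashMap<>();
        for (String word : mapperOutputKeys) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> emitted = List.of("the", "cat", "the");
        // TreeMap only to get a deterministic print order
        System.out.println(new TreeMap<>(combine(emitted))); // {cat=1, the=2}
    }
}
```

Instead of three pairs crossing the network, only two partial sums do; the reducers then add the partial sums from all nodes.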
8. Speculative execution
• One problem with the Hadoop system is that by dividing the tasks
across many nodes, it is possible for a few slow nodes to rate-limit
the rest of the program.
• The Hadoop platform schedules redundant copies of the remaining
tasks across several nodes that do not have other work to perform.
This process is known as speculative execution.
• When tasks complete, they announce this fact to the JobTracker.
Whichever copy of a task finishes first becomes the definitive copy.
9. • Speculative execution is enabled by default. You can disable
speculative execution for the mappers and reducers via
configuration:
• mapred.map.tasks.speculative.execution
• mapred.reduce.tasks.speculative.execution
• There is a hard limit of 10% of slots used for speculation across all
Hadoop jobs. This is not configurable right now. However, there is a
per-job option to cap the ratio of speculated tasks to total tasks:
mapreduce.job.speculative.speculativecap=0.1
10. Locating Stragglers
• Hadoop monitors each task’s progress using a progress score
between 0 and 1.
• If a task’s progress score is less than (average – 0.2) and the task
has run for at least 1 minute, it is marked as a straggler.
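The straggler rule above is a simple threshold test; a minimal sketch in plain Java (class and method names are hypothetical, not Hadoop internals):

```java
public class StragglerCheck {
    // A task is a straggler if its progress score falls more than 0.2
    // below the average score AND it has run for at least one minute.
    static boolean isStraggler(double progress, double averageProgress,
                               long runtimeMillis) {
        return progress < averageProgress - 0.2 && runtimeMillis >= 60_000;
    }

    public static void main(String[] args) {
        // 0.3 is more than 0.2 below the 0.6 average, and 2 minutes have passed
        System.out.println(isStraggler(0.3, 0.6, 120_000)); // true
        // Same gap, but the task is too young to be judged
        System.out.println(isStraggler(0.3, 0.6, 30_000));  // false
    }
}
```

The one-minute floor avoids speculating on tasks that simply have not had time to report progress yet.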
11. COUNTERS
• Counters are used to determine if and how often a
particular event occurred during a job execution.
• 4 categories of counters in Hadoop
• File system,
• Job
• Map Reduce Framework,
• Custom counter
16. Custom Counters
• MapReduce allows you to define your own custom counters.
Custom counters are useful for counting specific records, such as
bad records, since the framework counts only total records. Custom
counters can also be used to count outliers, such as maximum and
minimum values, and for summations.
17. Steps to write a custom counter
• Define an enum (in the mapper or reducer, wherever required):
public static enum MATCH_COUNTER {
Score_above_400,
Score_below_20,
Temp_abv_55;
}
• Increment the counter:
context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
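`context.getCounter(...)` only works inside a running job, but the counter semantics can be sketched in plain Java with an `EnumMap`. Everything here (`CounterSketch`, `increment`, `value`) is a hypothetical stand-in, not Hadoop API:

```java
import java.util.EnumMap;

public class CounterSketch {
    // The enum plays the same role as a custom counter group in Hadoop.
    enum MATCH_COUNTER { SCORE_ABOVE_400, SCORE_BELOW_20 }

    private final EnumMap<MATCH_COUNTER, Long> counters =
        new EnumMap<>(MATCH_COUNTER.class);

    // Mirrors context.getCounter(c).increment(amount)
    void increment(MATCH_COUNTER c, long amount) {
        counters.merge(c, amount, Long::sum);
    }

    long value(MATCH_COUNTER c) {
        return counters.getOrDefault(c, 0L);
    }

    public static void main(String[] args) {
        CounterSketch ctx = new CounterSketch();
        for (int score : new int[] {450, 15, 500}) {
            if (score > 400) ctx.increment(MATCH_COUNTER.SCORE_ABOVE_400, 1);
            if (score < 20)  ctx.increment(MATCH_COUNTER.SCORE_BELOW_20, 1);
        }
        System.out.println(ctx.value(MATCH_COUNTER.SCORE_ABOVE_400)); // 2
    }
}
```

In a real job the framework aggregates these per-task counts into job-wide totals visible in the job history.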
18. Data Types
• Hadoop MapReduce uses typed data at all times when it
interacts with user-provided Mappers and Reducers.
• In WordCount, you must have seen LongWritable, IntWritable and
Text. It is fairly easy to understand the relation between them and
Java’s primitive types: LongWritable is equivalent to long,
IntWritable to int and Text to String.
19. Hadoop Writable classes (data
types) vs Java data types
Java      Hadoop
byte      ByteWritable
int       IntWritable / VIntWritable
float     FloatWritable
long      LongWritable / VLongWritable
double    DoubleWritable
String    Text
(none)    NullWritable (placeholder for an absent key or value)
20. • What is a Writable in Hadoop?
• Why does Hadoop use Writable(s)?
• Limitation of primitive Hadoop Writable classes
• Custom Writable
21. Writable in Hadoop
• It is fairly easy to understand the relation between them and Java’s
primitive types. LongWritable is equivalent to long, IntWritable to int
and Text to String.
• Writable is an interface in Hadoop, and types in Hadoop must
implement this interface. Hadoop provides these Writable wrappers
for almost all Java primitive types and some other types.
• To implement the Writable interface we must provide two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
22. Why does Hadoop use
Writable(s)
• As we already know, data needs to be transmitted between different
nodes in a distributed computing environment.
• This requires serialization and deserialization of data to convert the
data that is in structured format to byte stream and vice-versa.
• Hadoop therefore uses a simple and efficient serialization protocol
to serialize data between the map and reduce phases; these types
are called Writable(s).
23. WritableComparable
• This interface is just a subinterface of the Writable and
java.lang.Comparable interfaces.
• To implement a WritableComparable we must provide a compareTo
method in addition to the readFields and write methods.
• Comparison of types is crucial for MapReduce, where there is a
sorting phase during which keys are compared with one another.
public interface WritableComparable<T> extends Writable,
Comparable<T> {
}
24. • public interface WritableComparable extends Writable, Comparable
{
void readFields(DataInput in);
void write(DataOutput out);
int compareTo(WritableComparable o);
}
• WritableComparables can be compared to each other, typically via
Comparators. Any type which is to be used as a key in the Hadoop
Map-Reduce framework should implement this interface.
• Any type which is to be used as a value in the Hadoop Map-Reduce
framework should implement the Writable interface.
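The role compareTo plays in the sort phase can be sketched in plain Java. `YearKey` is a hypothetical key type showing only the Comparable half; the Writable half would be the same readFields/write pair shown above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A stand-in for a WritableComparable key: during the shuffle's sort
// phase, the framework orders keys using exactly this compareTo logic.
public class YearKey implements Comparable<YearKey> {
    final int year;

    YearKey(int year) { this.year = year; }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(this.year, other.year);
    }

    public static void main(String[] args) {
        List<YearKey> keys = new ArrayList<>(
            List.of(new YearKey(2010), new YearKey(1999), new YearKey(2005)));
        Collections.sort(keys); // what the sort phase does with map output keys
        System.out.println(keys.get(0).year); // 1999
    }
}
```

Because the reducer sees keys in this sorted order, a consistent compareTo is what makes grouping by key possible at all.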
25. Limitation of primitive
Hadoop Writable classes
• Primitive Writables can be used in simple applications like
WordCount, but clearly they cannot serve our purpose all the time.
• If you still want to use the primitive Hadoop Writable(s), you would
have to convert each value into a string and transmit it. However, it
gets very messy when you have to deal with string manipulations.
27. INPUT Format
• The InputFormat class is one of the fundamental classes in the
Hadoop Map Reduce framework. This class is responsible for
defining two main things:
Data splits
Record reader
• Data split is a fundamental concept in Hadoop Map Reduce
framework which defines both the size of individual Map tasks and
its potential execution server.
• The RecordReader is responsible for actually reading records from
the input file and submitting them (as key/value pairs) to the
mapper.
28. • public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException;
public abstract RecordReader<K, V>
createRecordReader(InputSplit split, TaskAttemptContext
context) throws IOException, InterruptedException;
}
30. MultiInputs
• We use MultipleInputs class which supports MapReduce
jobs that have multiple input paths with a different
InputFormat and Mapper for each path.
• MultipleInputs is a feature that supports different input
formats in the MapReduce.
32. • Step 1: Add configuration in the driver class:
MultipleInputs.addInputPath(job, new
Path(args[0]), TextInputFormat.class, MyMapper1.class);
MultipleInputs.addInputPath(job, new
Path(args[1]), TextInputFormat.class, MyMapper2.class);
• Step 2: Write a different Mapper for each file path:
class MyMapper1 extends Mapper<Ki, Vi, Ko, Vo> {
}
class MyMapper2 extends Mapper<Ki, Vi, Ko, Vo> {
}
33. MultipleOutputFormat
• FileOutputFormat and its subclasses generate a set of
files in the output directory.
• There is one file per reducer, and files are named by the
partition number: part-00000, part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files or to produce multiple files per
reducer.
34. • Step 1:
MultipleOutputs.addNamedOutput(job, "NAMED_OUTPUT",
TextOutputFormat.class, Text.class, DoubleWritable.class);
• Step 2:
Override the setup() method in the reducer class and create an
instance of MultipleOutputs:
public void setup(Context context) throws IOException,
InterruptedException {
mos = new MultipleOutputs<Text, DoubleWritable>(context);
}
• Step 3:
Use the MultipleOutputs instance in the reduce() method to write
data to the output:
mos.write("NAMED_OUTPUT", outputKey, outputValue);
35. DISTRIBUTED CACHE
• When writing MapReduce applications, you may want some files to
be shared across all nodes in the Hadoop cluster. They can be
simple properties files or executable jar files.
• The DistributedCache is configured in the job configuration and
provides read-only data to all machines on the cluster.
• The framework will copy the necessary files onto the slave node
before any tasks for the job are executed on that node.
37. • Step 1: Put the file into HDFS:
hdfs dfs -put /rakesh/someFolder /user/rakesh/cachefile1
• Step 2: Add the cache file in the job configuration:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/rakesh/cachefile1"),
job.getConfiguration());
• Step 3: Access the cached file:
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new
FileInputStream(cacheFiles[0].toString());
38. Mapreduce 1.0 vs Mapreduce 2.0
• One easy way to differentiate between the Hadoop old API and the
new API is the packages:
• old API – org.apache.hadoop.mapred package
• new API – org.apache.hadoop.mapreduce package
40. Joins
• Joins are one of the interesting features available in MapReduce.
• When processing large data sets, the need to join data by a
common key arises frequently.
• By joining data you can gain further insight, such as joining with
timestamps to correlate events with a time of day.
• MapReduce can perform joins between very large datasets. The
implementation of a join depends on how large the datasets are
and how they are partitioned. If the join is performed by the
mapper, it is called a map-side join, whereas if it is performed by
the reducer it is called a reduce-side join.
41. Map-Side Join
• A map-side join between large inputs works by performing the join
before the data reaches the map function.
• For this to work, though, the inputs to each map must be partitioned
and sorted in a particular way.
• Each input data set must be divided into the same number of
partitions, and it must be sorted by the same key (the join key) in
each source.
• All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits
the description of the output of a MapReduce job.
42. Reduce side Join
• Reduce-side joins are simpler than map-side joins since the input
datasets need not be structured. But they are less efficient, as both
datasets have to go through the MapReduce shuffle phase, where
the records with the same key are brought together in the reducer.
We can also use the secondary sort technique to control the order
of the records.
• How it is done?
The key of the map output, of datasets being joined, has to be the
join key - so they reach the same reducer.
• Each dataset has to be tagged with its identity, in the mapper- to
help differentiate between the datasets in the reducer, so they can
be processed accordingly.
43. • In each reducer, the data values from both datasets, for
keys assigned to the reducer, are available, to be
processed as required.
• A secondary sort needs to be done to ensure the
ordering of the values sent to the reducer.
• If the input files are of different formats, we would need
separate mappers, and we would need to use
MultipleInputs class in the driver to add the inputs and
associate the specific mapper to the same.
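The tag-then-group mechanics of a reduce-side join can be simulated in plain Java. All names below are hypothetical; the "U:"/"A:" prefixes stand in for the dataset tags the mappers would attach:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideJoinSketch {
    // Mapper side: emit (joinKey, taggedValue); the tag records which
    // dataset the value came from, since the reducer sees them mixed.
    static void emit(Map<String, List<String>> shuffle, String key, String value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // One reducer call: all tagged values for a single key, split by
    // dataset and cross-joined.
    static List<String> reduce(List<String> taggedValues) {
        List<String> users = new ArrayList<>();
        List<String> activities = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("U:")) users.add(v.substring(2));
            else activities.add(v.substring(2));
        }
        List<String> joined = new ArrayList<>();
        for (String u : users)
            for (String a : activities)
                joined.add(u + "," + a);
        return joined;
    }

    public static void main(String[] args) {
        Map<String, List<String>> shuffle = new HashMap<>();
        emit(shuffle, "u1", "U:alice");     // from the users dataset
        emit(shuffle, "u1", "A:login");     // from the activity dataset
        emit(shuffle, "u1", "A:purchase");
        System.out.println(reduce(shuffle.get("u1")));
    }
}
```

In a real job the shuffle, not a HashMap, routes equal keys to the same reducer, and a secondary sort could guarantee that the user record arrives before its activities so it need not be buffered.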
44. Improving MapReduce
Performance
• Use Compression technique (LZO,GZIP,Snappy….)
• Tune the number of map and reduce tasks appropriately
• Write a Combiner
• Use the most appropriate and compact Writable type for your data
• Reuse Writables
• Reference: http://blog.cloudera.com/blog/2009/12/7-tips-for-
improving-mapreduce-performance/
45. Yet Another Resource Negotiator
(YARN)
• YARN (Yet Another Resource Negotiator) is the resource
management layer for the Apache Hadoop ecosystem.
In a YARN cluster, there are two types of hosts:
• The ResourceManager is the master daemon that communicates
with the client, tracks resources on the cluster, and orchestrates
work by assigning tasks to NodeManagers.
• A NodeManager is a worker daemon that launches and tracks
processes spawned on worker hosts.
46. • Containers are an important YARN concept. You can think of a
container as a request to hold resources on the YARN cluster.
• Use of a YARN cluster begins with a request from a client consisting
of an application. The ResourceManager negotiates the necessary
resources for a container and launches an ApplicationMaster to
represent the submitted application.
• Using a resource-request protocol, the ApplicationMaster negotiates
resource containers for the application at each node. Upon
execution of the application, the ApplicationMaster monitors the
container until completion. When the application is complete, the
ApplicationMaster unregisters its container with the
ResourceManager, and the cycle is complete.
When a MapReduce job is run on a large dataset, the Hadoop Mapper generates large chunks of intermediate data that are passed on to the Hadoop Reducer for further processing, which leads to massive network congestion.
To reduce this network congestion, the MapReduce framework offers the Combiner.
In MapReduce a job is broken into several tasks which execute in parallel. This model of execution is sensitive to slow tasks (even if they are very few in number) as they slow down the overall execution of a job. Therefore, Hadoop detects such slow tasks and runs duplicate (backup) tasks for them. This is called speculative execution. Speculating more tasks can help jobs finish faster but can also waste CPU cycles. Conversely, speculating fewer tasks can save CPU cycles but cause jobs to finish slower. The options documented here allow users to control the aggressiveness of the speculation algorithms and choose the right balance between efficiency and latency.
The FILE_BYTES_WRITTEN counter is incremented for each byte written to the local file system. These writes occur during the map phase when the mappers write their intermediate results to the local file system. They also occur during the shuffle phase when the reducers spill intermediate results to their local disks while sorting.
The off-the-shelf Hadoop counters that correspond to MAPRFS_BYTES_READ and MAPRFS_BYTES_WRITTEN are HDFS_BYTES_READ and HDFS_BYTES_WRITTEN.
The amount of data read and written will depend on the compression algorithm you use, if any.
The table above describes the counters that apply to Hadoop jobs.
The DATA_LOCAL_MAPS counter indicates how many map tasks executed on the node where their input data resided. Optimally, all the map tasks will execute on local data to exploit locality of reference, but this isn’t always possible.
The FALLOW_SLOTS_MILLIS_MAPS indicates how much time map tasks wait in the queue after the slots are reserved but before the map tasks execute. A high number indicates a possible mismatch between the number of slots configured for a task tracker and how many resources are actually available.
The SLOTS_MILLIS_* counters show how much time in milliseconds expired for the tasks. This value indicates wall clock time for the map and reduce tasks.
The TOTAL_LAUNCHED_MAPS counter defines how many map tasks were launched for the job, including failed tasks. Optimally, this number is the same as the number of splits for the job.
The COMBINE_* counters show how many records were read and written by the optional combiner. If you don’t specify a combiner, these counters will be 0.
The CPU statistics are gathered from /proc/cpuinfo and indicate how much total time was spent executing map and reduce tasks for a job.
The garbage collection counter is reported from GarbageCollectorMXBean.getCollectionTime().
The MAP*RECORDS are incremented for every successful record read and written by the mappers. Records that the map tasks failed to read or write are not included in these counters.
The PHYSICAL_MEMORY_BYTES statistics are gathered from /proc/meminfo and indicate how much RAM (not including swap space) was consumed by all the tasks.
All the counters, whether custom or framework, are stored in the JobTracker JVM memory, so there’s a practical limit to the number of counters you should use. The rule of thumb is to use less than 100, but this will vary based on physical memory capacity.
Serialization: it is the mechanism of writing the state of an object into a byte stream.
A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or one of its subinterfaces.
More technically, to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object.
The reverse operation of serialization is called deserialization.
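A minimal round trip with plain java.io.Serializable illustrates both directions (the `Point` class is an invented example, not from the source):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // A plain Java object made serializable simply by implementing
    // the java.io.Serializable marker interface.
    static class Point implements Serializable {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serialize the object to a byte stream, then deserialize a copy back.
    static Point roundTrip(Point p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p); // serialization: object state -> bytes
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Point) in.readObject(); // deserialization: bytes -> copy
        }
    }

    public static void main(String[] args) throws Exception {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // 3,4
    }
}
```

Hadoop avoids this default mechanism for its own wire format; Writable serializes only the field values, with none of the class metadata ObjectOutputStream adds.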
Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. Hadoop provides several stock classes which implement Writable: Text (which stores String data), IntWritable, LongWritable, FloatWritable, BooleanWritable, and several others. The entire list is in the org.apache.hadoop.io package of the Hadoop source (see the API reference - http://hadoop.apache.org/docs/current/api/index.html).
Custom writable :
public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
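The write/readFields pair above can be exercised end to end without a cluster. In this sketch, `WritableLike` is a local stand-in declared with the same two method signatures as org.apache.hadoop.io.Writable, so the round trip compiles and runs without Hadoop on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in for org.apache.hadoop.io.Writable (same signatures).
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public class MyWritableRoundTrip implements WritableLike {
    int counter;
    long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);    // fields are written in a fixed order...
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();   // ...and must be read back in the same order
        timestamp = in.readLong();
    }

    // Serialize one instance and read its state back into a fresh one.
    static MyWritableRoundTrip roundTrip(MyWritableRoundTrip original)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        MyWritableRoundTrip copy = new MyWritableRoundTrip();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        MyWritableRoundTrip w = new MyWritableRoundTrip();
        w.counter = 7;
        w.timestamp = 123L;
        MyWritableRoundTrip copy = roundTrip(w);
        System.out.println(copy.counter + " " + copy.timestamp); // 7 123
    }
}
```

Note how compact the wire format is: exactly 12 bytes (one int plus one long), with no class names or type tags.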
public interface Comparable{
public int compareTo(Object obj);
}
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Any split implementation extends the Apache base abstract class - InputSplit, defining a split length and locations. A split length is the size of the split data (in bytes), while locations is the list of node names where the data for the split would be local. Split locations are a way for a scheduler to decide on which particular machine to execute this split. A very simple job tracker works as follows:
Receive a heartbeat from one of the task trackers, reporting map slot availability.
Find a queued-up split for which the available node is "local".
Submit the split to the task tracker for execution.
Locality can mean different things depending on storage mechanisms and the overall execution strategy. In the case of HDFS, for example, a split typically corresponds to a physical data block size and locations is a set of machines (with the set size defined by a replication factor) where this block is physically located. This is how FileInputFormat calculates splits.
HIPI is a framework for image processing with MapReduce.
Code example : http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/
Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives which are un-archived on the slaves.
How big is the DistributedCache?
The local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB.
Where does the DistributedCache store data?
/tmp/hadoop-<user.name>/mapred/local/taskTracker/archive
If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes. Before diving into the implementation let us understand the problem thoroughly.
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable which means the ouput files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this.
If we have two datasets, for example, one having user ids and names and the other having the user activity over the application, then in order to find out which user has performed what activity we might need to join these two datasets so that user names and user activity come together. The join strategy can be chosen based on dataset size: if one dataset is small enough to be distributed across the cluster, we can use the side data distribution technique instead.
Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall.
Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3 GB of disk writes. In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.
The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
Conclusion
Summarizing the important concepts presented in this section:
A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total.
A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the size of the hold for the container. Once the container is allocated, those resources are usable by the container.
An application in YARN comprises three parts:
The application client, which is how a program is run on the cluster.
An ApplicationMaster which provides YARN with the ability to perform allocation on behalf of the application.
One or more tasks that do the actual work (runs in a process) in the container allocated by YARN.
A MapReduce application consists of map tasks and reduce tasks.
A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.