2. Combiner
• A Combiner is also known as a mini-reducer or map-side reducer.
• The Combiner receives as input all data emitted by the Mapper
instances on a given node, and its output is then sent to the
Reducers.
• The Combiner runs between the Map class and the Reduce class
to reduce the volume of data transferred between Map and
Reduce.
• Usage of the Combiner is optional.
3. When?
• If the reduce function is both commutative and associative, no
additional code is needed to take advantage of a Combiner; the
Reducer class itself can be reused:
job.setCombinerClass(Reduce.class);
• The Combiner must be an instance of the Reducer interface; a
Combiner does not have a predefined interface of its own.
• If your Reducer itself cannot be used directly as a Combiner
because its operation is not commutative or associative, you might
still be able to write a third class to use as a Combiner for your job.
• Note – Hadoop does not guarantee how many times the combiner
function will run for a given map output; it may run zero, one, or
multiple times.
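The local aggregation a combiner performs can be sketched in plain Java, outside any Hadoop job. The sketch below (class and method names are hypothetical, not Hadoop API) sums the `(word, 1)` pairs one mapper would emit, which is safe precisely because addition is commutative and associative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Locally aggregate the (word, 1) pairs emitted by one mapper,
    // mimicking what a combiner does before data is shuffled to reducers.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> partialSums = new HashMap<>();
        for (String word : mapperOutputKeys) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> emitted = List.of("the", "cat", "the");
        // TreeMap only to get a deterministic print order
        System.out.println(new TreeMap<>(combine(emitted))); // {cat=1, the=2}
    }
}
```

Instead of three pairs crossing the network, only two partial sums do; the reducers then add the partial sums from all nodes.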
8. Speculative execution
• One problem with the Hadoop system is that by dividing the tasks
across many nodes, it is possible for a few slow nodes to rate-limit
the rest of the program.
• The Hadoop platform schedules redundant copies of the remaining
tasks across several nodes that do not have other work to perform.
This process is known as speculative execution.
• When tasks complete, they announce this fact to the JobTracker.
Whichever copy of a task finishes first becomes the definitive copy.
9. • Speculative execution is enabled by default. You can disable
speculative execution for the mappers and reducers via
configuration:
• mapred.map.tasks.speculative.execution
• mapred.reduce.tasks.speculative.execution
• There is a hard limit of 10% of slots used for speculation across all
Hadoop jobs. This is not configurable right now. However, there is a
per-job option to cap the ratio of speculated tasks to total tasks:
mapreduce.job.speculative.speculativecap=0.1
10. Locating Stragglers
• Hadoop monitors each task’s progress using a progress score
between 0 and 1.
• If a task’s progress score is less than (average – 0.2) and the task
has run for at least 1 minute, it is marked as a straggler.
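The straggler rule above is a simple threshold test; a minimal sketch in plain Java (class and method names are hypothetical, not Hadoop internals):

```java
public class StragglerCheck {
    // A task is a straggler if its progress score falls more than 0.2
    // below the average score AND it has run for at least one minute.
    static boolean isStraggler(double progress, double averageProgress,
                               long runtimeMillis) {
        return progress < averageProgress - 0.2 && runtimeMillis >= 60_000;
    }

    public static void main(String[] args) {
        // 0.3 is more than 0.2 below the 0.6 average, and 2 minutes have passed
        System.out.println(isStraggler(0.3, 0.6, 120_000)); // true
        // Same gap, but the task is too young to be judged
        System.out.println(isStraggler(0.3, 0.6, 30_000));  // false
    }
}
```

The one-minute floor avoids speculating on tasks that simply have not had time to report progress yet.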
11. COUNTERS
• Counters are used to determine if and how often a
particular event occurred during a job execution.
• 4 categories of counters in Hadoop
• File system,
• Job
• Map Reduce Framework,
• Custom counter
16. Custom Counters
• MapReduce allows you to define your own custom counters.
Custom counters are useful for counting specific records, such as
bad records, since the framework counts only total records. Custom
counters can also be used to count outliers, such as maximum and
minimum values, and for summations.
17. Steps to write a custom counter
• Define an enum (in the mapper or reducer, wherever required):
public static enum MATCH_COUNTER {
Score_above_400,
Score_below_20,
Temp_abv_55;
}
• Increment the counter:
context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
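`context.getCounter(...)` only works inside a running job, but the counter semantics can be sketched in plain Java with an `EnumMap`. Everything here (`CounterSketch`, `increment`, `value`) is a hypothetical stand-in, not Hadoop API:

```java
import java.util.EnumMap;

public class CounterSketch {
    // The enum plays the same role as a custom counter group in Hadoop.
    enum MATCH_COUNTER { SCORE_ABOVE_400, SCORE_BELOW_20 }

    private final EnumMap<MATCH_COUNTER, Long> counters =
        new EnumMap<>(MATCH_COUNTER.class);

    // Mirrors context.getCounter(c).increment(amount)
    void increment(MATCH_COUNTER c, long amount) {
        counters.merge(c, amount, Long::sum);
    }

    long value(MATCH_COUNTER c) {
        return counters.getOrDefault(c, 0L);
    }

    public static void main(String[] args) {
        CounterSketch ctx = new CounterSketch();
        for (int score : new int[] {450, 15, 500}) {
            if (score > 400) ctx.increment(MATCH_COUNTER.SCORE_ABOVE_400, 1);
            if (score < 20)  ctx.increment(MATCH_COUNTER.SCORE_BELOW_20, 1);
        }
        System.out.println(ctx.value(MATCH_COUNTER.SCORE_ABOVE_400)); // 2
    }
}
```

In a real job the framework aggregates these per-task counts into job-wide totals visible in the job history.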
18. Data Types
• Hadoop MapReduce uses typed data at all times when it
interacts with user-provided Mappers and Reducers.
• In WordCount, you must have seen LongWritable, IntWritable and
Text. It is fairly easy to understand the relation between them and
Java’s primitive types: LongWritable is equivalent to long,
IntWritable to int and Text to String.
19. Hadoop Writable classes (data
types) vs Java data types
Java      Hadoop
byte      ByteWritable
int       IntWritable / VIntWritable
float     FloatWritable
long      LongWritable / VLongWritable
double    DoubleWritable
String    Text
(none)    NullWritable (placeholder for an absent key or value)
20. • What is a Writable in Hadoop?
• Why does Hadoop use Writable(s)?
• Limitation of primitive Hadoop Writable classes
• Custom Writable
21. Writable in Hadoop
• It is fairly easy to understand the relation between them and Java’s
primitive types. LongWritable is equivalent to long, IntWritable to int
and Text to String.
• Writable is an interface in Hadoop, and types in Hadoop must
implement this interface. Hadoop provides these Writable wrappers
for almost all Java primitive types and some other types.
• To implement the Writable interface we must provide two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
22. Why does Hadoop use
Writable(s)
• As we already know, data needs to be transmitted between different
nodes in a distributed computing environment.
• This requires serialization and deserialization of data to convert the
data that is in structured format to byte stream and vice-versa.
• Hadoop therefore uses a simple and efficient serialization protocol
to serialize data between the map and reduce phases; these types
are called Writable(s).
23. WritableComparable
• This interface is just a subinterface of the Writable and
java.lang.Comparable interfaces.
• To implement a WritableComparable we must provide a compareTo
method in addition to the readFields and write methods.
• Comparison of types is crucial for MapReduce, where there is a
sorting phase during which keys are compared with one another.
public interface WritableComparable<T> extends Writable,
Comparable<T> {
}
24. • public interface WritableComparable extends Writable, Comparable
{
void readFields(DataInput in);
void write(DataOutput out);
int compareTo(WritableComparable o);
}
• WritableComparables can be compared to each other, typically via
Comparators. Any type which is to be used as a key in the Hadoop
Map-Reduce framework should implement this interface.
• Any type which is to be used as a value in the Hadoop Map-Reduce
framework should implement the Writable interface.
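The role compareTo plays in the sort phase can be sketched in plain Java. `YearKey` is a hypothetical key type showing only the Comparable half; the Writable half would be the same readFields/write pair shown above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A stand-in for a WritableComparable key: during the shuffle's sort
// phase, the framework orders keys using exactly this compareTo logic.
public class YearKey implements Comparable<YearKey> {
    final int year;

    YearKey(int year) { this.year = year; }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(this.year, other.year);
    }

    public static void main(String[] args) {
        List<YearKey> keys = new ArrayList<>(
            List.of(new YearKey(2010), new YearKey(1999), new YearKey(2005)));
        Collections.sort(keys); // what the sort phase does with map output keys
        System.out.println(keys.get(0).year); // 1999
    }
}
```

Because the reducer sees keys in this sorted order, a consistent compareTo is what makes grouping by key possible at all.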
25. Limitation of primitive
Hadoop Writable classes
• Primitive Writables can be used in simple applications like
WordCount, but clearly they cannot serve our purpose all the time.
• If you still want to use the primitive Hadoop Writable(s), you would
have to convert each value into a string and transmit it. However, it
gets very messy when you have to deal with string manipulations.
27. INPUT Format
• The InputFormat class is one of the fundamental classes in the
Hadoop Map Reduce framework. This class is responsible for
defining two main things:
Data splits
Record reader
• Data split is a fundamental concept in Hadoop Map Reduce
framework which defines both the size of individual Map tasks and
its potential execution server.
• The RecordReader is responsible for actually reading records from
the input file and submitting them (as key/value pairs) to the
mapper.
28. • public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException;
public abstract RecordReader<K, V>
createRecordReader(InputSplit split, TaskAttemptContext
context) throws IOException, InterruptedException;
}
30. MultiInputs
• We use MultipleInputs class which supports MapReduce
jobs that have multiple input paths with a different
InputFormat and Mapper for each path.
• MultipleInputs is a feature that supports different input
formats in the MapReduce.
32. • Step 1: Add configuration in the driver class:
MultipleInputs.addInputPath(job, new
Path(args[0]), TextInputFormat.class, MyMapper1.class);
MultipleInputs.addInputPath(job, new
Path(args[1]), TextInputFormat.class, MyMapper2.class);
• Step 2: Write a different Mapper for each file path:
class MyMapper1 extends Mapper<Ki, Vi, Ko, Vo> {
}
class MyMapper2 extends Mapper<Ki, Vi, Ko, Vo> {
}
33. MultipleOutputFormat
• FileOutputFormat and its subclasses generate a set of
files in the output directory.
• There is one file per reducer, and files are named by the
partition number: part-00000, part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files or to produce multiple files per
reducer.
34. • Step 1:
MultipleOutputs.addNamedOutput(job, "NAMED_OUTPUT",
TextOutputFormat.class, Text.class, DoubleWritable.class);
• Step 2:
Override the setup() method in the reducer class and create an
instance of MultipleOutputs:
public void setup(Context context) throws IOException,
InterruptedException {
mos = new MultipleOutputs<Text, DoubleWritable>(context);
}
• Step 3:
Use the MultipleOutputs instance in the reduce() method to write
data to the output:
mos.write("NAMED_OUTPUT", outputKey, outputValue);
35. DISTRIBUTED CACHE
• When writing MapReduce applications, you may want some files to
be shared across all nodes in the Hadoop cluster. They can be
simple properties files or executable jar files.
• The DistributedCache is configured in the job configuration and
provides read-only data to all machines on the cluster.
• The framework will copy the necessary files onto the slave node
before any tasks for the job are executed on that node.
37. • Step 1: Put the file into HDFS:
hdfs dfs -put /rakesh/someFolder /user/rakesh/cachefile1
• Step 2: Add the cache file in the job configuration:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/rakesh/cachefile1"),
job.getConfiguration());
• Step 3: Access the cached file:
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new
FileInputStream(cacheFiles[0].toString());
38. Mapreduce 1.0 vs Mapreduce 2.0
• One easy way to differentiate between the Hadoop old API and the
new API is the packages:
• old API – org.apache.hadoop.mapred package
• new API – org.apache.hadoop.mapreduce package
40. Joins
• Joins are one of the interesting features available in MapReduce.
• When processing large data sets, the need to join data by a
common key arises frequently.
• By joining data you can gain further insight, such as joining with
timestamps to correlate events with a time of day.
• MapReduce can perform joins between very large datasets. The
implementation of a join depends on how large the datasets are
and how they are partitioned. If the join is performed by the
mapper, it is called a map-side join, whereas if it is performed by
the reducer it is called a reduce-side join.
41. Map-Side Join
• A map-side join between large inputs works by performing the join
before the data reaches the map function.
• For this to work, though, the inputs to each map must be partitioned
and sorted in a particular way.
• Each input data set must be divided into the same number of
partitions, and it must be sorted by the same key (the join key) in
each source.
• All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits
the description of the output of a MapReduce job.
42. Reduce side Join
• Reduce-side joins are simpler than map-side joins since the input
datasets need not be structured. But they are less efficient, as both
datasets have to go through the MapReduce shuffle phase, where
the records with the same key are brought together in the reducer.
We can also use the secondary sort technique to control the order
of the records.
• How it is done?
The key of the map output, of datasets being joined, has to be the
join key - so they reach the same reducer.
• Each dataset has to be tagged with its identity, in the mapper- to
help differentiate between the datasets in the reducer, so they can
be processed accordingly.
43. • In each reducer, the data values from both datasets, for
keys assigned to the reducer, are available, to be
processed as required.
• A secondary sort needs to be done to ensure the
ordering of the values sent to the reducer.
• If the input files are of different formats, we would need
separate mappers, and we would need to use
MultipleInputs class in the driver to add the inputs and
associate the specific mapper to the same.
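The tag-then-group mechanics of a reduce-side join can be simulated in plain Java. All names below are hypothetical; the "U:"/"A:" prefixes stand in for the dataset tags the mappers would attach:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideJoinSketch {
    // Mapper side: emit (joinKey, taggedValue); the tag records which
    // dataset the value came from, since the reducer sees them mixed.
    static void emit(Map<String, List<String>> shuffle, String key, String value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // One reducer call: all tagged values for a single key, split by
    // dataset and cross-joined.
    static List<String> reduce(List<String> taggedValues) {
        List<String> users = new ArrayList<>();
        List<String> activities = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("U:")) users.add(v.substring(2));
            else activities.add(v.substring(2));
        }
        List<String> joined = new ArrayList<>();
        for (String u : users)
            for (String a : activities)
                joined.add(u + "," + a);
        return joined;
    }

    public static void main(String[] args) {
        Map<String, List<String>> shuffle = new HashMap<>();
        emit(shuffle, "u1", "U:alice");     // from the users dataset
        emit(shuffle, "u1", "A:login");     // from the activity dataset
        emit(shuffle, "u1", "A:purchase");
        System.out.println(reduce(shuffle.get("u1")));
    }
}
```

In a real job the shuffle, not a HashMap, routes equal keys to the same reducer, and a secondary sort could guarantee that the user record arrives before its activities so it need not be buffered.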
44. Improving MapReduce
Performance
• Use Compression technique (LZO,GZIP,Snappy….)
• Tune the number of map and reduce tasks appropriately
• Write a Combiner
• Use the most appropriate and compact Writable type for your data
• Reuse Writables
• Reference: http://blog.cloudera.com/blog/2009/12/7-tips-for-
improving-mapreduce-performance/
45. Yet Another Resource Negotiator
(YARN)
• YARN (Yet Another Resource Negotiator) is the resource
management layer for the Apache Hadoop ecosystem.
In a YARN cluster, there are two types of hosts:
• The ResourceManager is the master daemon that communicates
with the client, tracks resources on the cluster, and orchestrates
work by assigning tasks to NodeManagers.
• A NodeManager is a worker daemon that launches and tracks
processes spawned on worker hosts.
46. • Containers are an important YARN concept. You can think of a
container as a request to hold resources on the YARN cluster.
• Use of a YARN cluster begins with a request from a client consisting
of an application. The ResourceManager negotiates the necessary
resources for a container and launches an ApplicationMaster to
represent the submitted application.
• Using a resource-request protocol, the ApplicationMaster negotiates
resource containers for the application at each node. Upon
execution of the application, the ApplicationMaster monitors the
container until completion. When the application is complete, the
ApplicationMaster unregisters its container with the
ResourceManager, and the cycle is complete.
When a MapReduce job is run on a large dataset, the Hadoop Mapper generates large chunks of intermediate data that are passed on to the Hadoop Reducer for further processing, which leads to massive network congestion.
To reduce this network congestion, the MapReduce framework offers the Combiner.
In MapReduce a job is broken into several tasks which execute in parallel. This model of execution is sensitive to slow tasks (even if they are very few in number) as they slow down the overall execution of a job. Therefore, Hadoop detects such slow tasks and runs duplicate (backup) tasks for them. This is called speculative execution. Speculating more tasks can help jobs finish faster but can also waste CPU cycles. Conversely, speculating fewer tasks can save CPU cycles but cause jobs to finish slower. The options documented here allow users to control the aggressiveness of the speculation algorithms and choose the right balance between efficiency and latency.
The FILE_BYTES_WRITTEN counter is incremented for each byte written to the local file system. These writes occur during the map phase when the mappers write their intermediate results to the local file system. They also occur during the shuffle phase when the reducers spill intermediate results to their local disks while sorting.
The off-the-shelf Hadoop counters that correspond to MAPRFS_BYTES_READ and MAPRFS_BYTES_WRITTEN are HDFS_BYTES_READ and HDFS_BYTES_WRITTEN.
The amount of data read and written will depend on the compression algorithm you use, if any.
The table above describes the counters that apply to Hadoop jobs.
The DATA_LOCAL_MAPS counter indicates how many map tasks executed on the node where their input data resided. Optimally, all the map tasks will execute on local data to exploit locality of reference, but this isn’t always possible.
The FALLOW_SLOTS_MILLIS_MAPS indicates how much time map tasks wait in the queue after the slots are reserved but before the map tasks execute. A high number indicates a possible mismatch between the number of slots configured for a task tracker and how many resources are actually available.
The SLOTS_MILLIS_* counters show how much time in milliseconds expired for the tasks. This value indicates wall clock time for the map and reduce tasks.
The TOTAL_LAUNCHED_MAPS counter defines how many map tasks were launched for the job, including failed tasks. Optimally, this number is the same as the number of splits for the job.
The COMBINE_* counters show how many records were read and written by the optional combiner. If you don’t specify a combiner, these counters will be 0.
The CPU statistics are gathered from /proc/cpuinfo and indicate how much total time was spent executing map and reduce tasks for a job.
The garbage collection counter is reported from GarbageCollectorMXBean.getCollectionTime().
The MAP*RECORDS are incremented for every successful record read and written by the mappers. Records that the map tasks failed to read or write are not included in these counters.
The PHYSICAL_MEMORY_BYTES statistics are gathered from /proc/meminfo and indicate how much RAM (not including swap space) was consumed by all the tasks.
All the counters, whether custom or framework, are stored in the JobTracker JVM memory, so there’s a practical limit to the number of counters you should use. The rule of thumb is to use less than 100, but this will vary based on physical memory capacity.
Serialization: it is the mechanism of writing the state of an object into a byte stream.
A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or one of its subinterfaces.
More technically, to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object.
The reverse operation of serialization is called deserialization.
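A minimal round trip with plain java.io.Serializable illustrates both directions (the `Point` class is an invented example, not from the source):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // A plain Java object made serializable simply by implementing
    // the java.io.Serializable marker interface.
    static class Point implements Serializable {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serialize the object to a byte stream, then deserialize a copy back.
    static Point roundTrip(Point p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p); // serialization: object state -> bytes
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Point) in.readObject(); // deserialization: bytes -> copy
        }
    }

    public static void main(String[] args) throws Exception {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // 3,4
    }
}
```

Hadoop avoids this default mechanism for its own wire format; Writable serializes only the field values, with none of the class metadata ObjectOutputStream adds.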
Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. Hadoop provides several stock classes which implement Writable: Text (which stores String data), IntWritable, LongWritable, FloatWritable, BooleanWritable, and several others. The entire list is in the org.apache.hadoop.io package of the Hadoop source (see the API reference - http://hadoop.apache.org/docs/current/api/index.html).
Custom writable :
public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
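The write/readFields pair above can be exercised end to end without a cluster. In this sketch, `WritableLike` is a local stand-in declared with the same two method signatures as org.apache.hadoop.io.Writable, so the round trip compiles and runs without Hadoop on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in for org.apache.hadoop.io.Writable (same signatures).
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public class MyWritableRoundTrip implements WritableLike {
    int counter;
    long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);    // fields are written in a fixed order...
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();   // ...and must be read back in the same order
        timestamp = in.readLong();
    }

    // Serialize one instance and read its state back into a fresh one.
    static MyWritableRoundTrip roundTrip(MyWritableRoundTrip original)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        MyWritableRoundTrip copy = new MyWritableRoundTrip();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        MyWritableRoundTrip w = new MyWritableRoundTrip();
        w.counter = 7;
        w.timestamp = 123L;
        MyWritableRoundTrip copy = roundTrip(w);
        System.out.println(copy.counter + " " + copy.timestamp); // 7 123
    }
}
```

Note how compact the wire format is: exactly 12 bytes (one int plus one long), with no class names or type tags.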
public interface Comparable{
public int compareTo(Object obj);
}
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Any split implementation extends the Apache base abstract class - InputSplit, defining a split length and locations. A split length is the size of the split data (in bytes), while locations is the list of node names where the data for the split would be local. Split locations are a way for a scheduler to decide on which particular machine to execute this split. A very simple job tracker works as follows:
Receive a heartbeat from one of the task trackers, reporting map slot availability.
Find a queued-up split for which the available node is "local".
Submit the split to the task tracker for execution.
Locality can mean different things depending on storage mechanisms and the overall execution strategy. In the case of HDFS, for example, a split typically corresponds to a physical data block size and locations is a set of machines (with the set size defined by a replication factor) where this block is physically located. This is how FileInputFormat calculates splits.
HIPI is a framework for image processing with MapReduce.
Code example : http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/
Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives which are un-archived on the slaves.
How big is the DistributedCache?
The local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB.
Where does the DistributedCache store data?
/tmp/hadoop-<user.name>/mapred/local/taskTracker/archive
If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes. Before diving into the implementation let us understand the problem thoroughly.
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable which means the ouput files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this.
If we have two datasets, for example, one having user ids and names and the other having the user activity over the application, then in order to find out which user has performed what activity we might need to join these two datasets so that user names and user activity come together. The join strategy can be chosen based on dataset size: if one dataset is small enough to be distributed across the cluster, we can use the side data distribution technique instead.
Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall.
Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3 GB of disk writes. In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.
The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
Conclusion
Summarizing the important concepts presented in this section:
A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total.
A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the size of the hold for the container. Once the container is allocated, those resources are usable by the container.
An application in YARN comprises three parts:
The application client, which is how a program is run on the cluster.
An ApplicationMaster which provides YARN with the ability to perform allocation on behalf of the application.
One or more tasks that do the actual work (runs in a process) in the container allocated by YARN.
A MapReduce application consists of map tasks and reduce tasks.
A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.