HADOOP
Map-Reduce
Prashant Gupta
Combiner
• A Combiner, also known as a mini-reducer or mapper-side reducer.
• The Combiner receives as input all data emitted by the Mapper
instances on a given node; the Combiner's output is then sent to the
Reducers.
• The Combiner runs between the Map class and the Reduce class to
reduce the volume of data transferred between Map and Reduce.
• Usage of the Combiner is optional.
When?
• If the reduce function is both commutative and associative, we do
not need to write any additional code to take advantage of a Combiner:
job.setCombinerClass(Reduce.class);
• The Combiner should be an instance of the Reducer interface; a
Combiner does not have a predefined interface of its own.
• If your Reducer itself cannot be used directly as a Combiner
because of commutativity or associativity, you might
still be able to write a third class to use as a Combiner for your job.
• Note – Hadoop does not guarantee how many times the combiner
function will run for a given map output; it may run zero, one, or
several times. A driver sketch is shown below.
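The wiring is just one extra line in the driver. Below is a minimal sketch of a WordCount-style driver that reuses its sum reducer as the combiner; the class names WordCountDriver, WordMapper and WordReducer are hypothetical, not from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordMapper.class);        // hypothetical mapper
    // Summing counts is commutative and associative, so the reducer doubles as the combiner.
    job.setCombinerClass(WordReducer.class);     // hypothetical reducer reused as combiner
    job.setReducerClass(WordReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}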
Hadoop Reducer is used without
a Combiner
Speculative execution
• One problem with the Hadoop system is that by dividing the tasks
across many nodes, it is possible for a few slow nodes to rate-limit
the rest of the program.
• The Hadoop platform will schedule redundant copies of the remaining
tasks across several nodes which do not have other work to
perform. This process is known as speculative execution.
• When tasks complete, they announce this fact to the JobTracker.
Whichever copy of a task finishes first becomes the definitive copy,
and the remaining duplicate attempts are killed.
• Speculative execution is enabled by default. You can disable
speculative execution for the mappers and reducers by setting the
following configuration properties to false (see the sketch below):
• mapred.map.tasks.speculative.execution
• mapred.reduce.tasks.speculative.execution
• There is a hard limit of 10% of slots used for speculation across all
Hadoop jobs. This is not configurable right now; however, there is a
per-job option to cap the ratio of speculated tasks to total tasks:
mapreduce.job.speculative.speculativecap=0.1
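As a sketch, speculative execution can be switched off per job in the driver using the properties listed above (these are the old-style names; recent releases also accept mapreduce.map.speculative and mapreduce.reduce.speculative). The class name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Turn off redundant (speculative) task attempts for this job only.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = Job.getInstance(conf, "job-without-speculation");
    // ... set mapper, reducer, input and output paths as usual, then submit.
  }
}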
Locating Stragglers
 Hadoop monitors each task’s progress using a progress score
between 0 and 1.
 If a task’s progress score is less than (average – 0.2) and the
task has run for at least 1 minute, it is marked as a straggler.
COUNTERS
• Counters are used to determine if, and how often, a
particular event occurred during a job execution.
• There are 4 categories of counters in Hadoop:
• File system
• Job
• MapReduce Framework
• Custom counters
Counters (continued)
Custom Counters
• MapReduce allows you to define your own custom
counters. Custom counters are useful for counting
specific records, such as bad records, since the framework
counts only total records. Custom counters can also be
used to count outliers, such as maximum and
minimum values, and for summations.
Steps to write a custom counter
• Define an enum (in the mapper or reducer, or anywhere else, based upon requirement):
public static enum MATCH_COUNTER {
  Score_above_400,
  Score_below_20,
  Temp_abv_55;
}
• Increment the counter wherever the event of interest occurs (a fuller sketch follows):
context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
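Putting the two steps together, here is a minimal sketch of a mapper that increments an enum counter. The input format (tab-separated team and score) and the class name ScoreMapper are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  public static enum MATCH_COUNTER { SCORE_ABOVE_400, SCORE_BELOW_20 }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");   // assumed input: team \t score
    int score = Integer.parseInt(fields[1]);
    if (score > 400) {
      context.getCounter(MATCH_COUNTER.SCORE_ABOVE_400).increment(1);
    } else if (score < 20) {
      context.getCounter(MATCH_COUNTER.SCORE_BELOW_20).increment(1);
    }
    context.write(new Text(fields[0]), new IntWritable(score));
  }
}

After job.waitForCompletion(true) returns, the driver can read the aggregated value with job.getCounters().findCounter(ScoreMapper.MATCH_COUNTER.SCORE_ABOVE_400).getValue().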
Data Types
• Hadoop MapReduce uses typed data at all times when it
interacts with user-provided Mappers and Reducers.
• In WordCount, you must have seen LongWritable,
IntWritable and Text. It is fairly easy to understand the
relation between them and Java’s primitive types:
LongWritable is equivalent to long, IntWritable to int and
Text to String.
Hadoop writable classes (data
types) vs Java Data types
Java                 Hadoop
byte                 ByteWritable
int                  IntWritable / VIntWritable
float                FloatWritable
long                 LongWritable / VLongWritable
double               DoubleWritable
String               Text
(null placeholder)   NullWritable
• What is a Writable in Hadoop?
• Why does Hadoop use Writable(s)?
• Limitation of primitive Hadoop Writable classes
• Custom Writable
Writable in Hadoop
• It is fairly easy to understand the relation between them and Java’s
primitive types: LongWritable is equivalent to long, IntWritable to int
and Text to String.
• Writable is an interface in Hadoop, and types used in Hadoop must
implement this interface. Hadoop provides these Writable wrappers
for almost all Java primitive types and some other types.
• To implement the Writable interface we must provide two methods (a small round-trip example follows):
public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
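As a small illustration (not from the slides), the round trip below serializes an IntWritable to a byte stream with write() and rebuilds it with readFields(), which is what Hadoop does when it ships keys and values between nodes.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws Exception {
    IntWritable original = new IntWritable(42);

    // Serialize: write() turns the object's state into a byte stream.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    original.write(new DataOutputStream(bytes));

    // Deserialize: readFields() rebuilds the state from the byte stream.
    IntWritable copy = new IntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

    System.out.println(copy.get());   // prints 42
  }
}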
Why does Hadoop use
Writable(s)
• As we already know, data needs to be transmitted between different
nodes in a distributed computing environment.
• This requires serialization and deserialization of data to convert
data in structured format to a byte stream and vice versa.
• Hadoop therefore uses a simple and efficient serialization protocol to
serialize data between the map and reduce phases, and these types are
called Writable(s).
WritableComparable
• The WritableComparable interface is just a subinterface of the Writable and
java.lang.Comparable interfaces.
• To implement a WritableComparable we must provide a compareTo
method in addition to the readFields and write methods.
• Comparison of types is crucial for MapReduce, where there is a
sorting phase during which keys are compared with one another.
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
• In effect, a class implementing WritableComparable must therefore provide:
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
int compareTo(T o);
• WritableComparables can be compared to each other, typically via
Comparators. Any type which is to be used as a key in the Hadoop
Map-Reduce framework should implement this interface.
• Any type which is to be used only as a value in the Hadoop Map-Reduce
framework needs to implement just the Writable interface. A minimal
custom key is sketched below.
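A minimal sketch of a custom key, assuming a hypothetical YearMonthKey that sorts records by year and then by month; the no-argument constructor is required so Hadoop can instantiate the key during deserialization.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: records are grouped by year and ordered by month.
public class YearMonthKey implements WritableComparable<YearMonthKey> {
  private int year;
  private int month;

  public YearMonthKey() { }                        // required no-arg constructor

  public YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(month);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    month = in.readInt();
  }

  @Override
  public int compareTo(YearMonthKey other) {
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(month, other.month);
  }

  @Override
  public int hashCode() { return 31 * year + month; }   // keeps HashPartitioner consistent

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) return false;
    YearMonthKey k = (YearMonthKey) o;
    return year == k.year && month == k.month;
  }
}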
Limitation of primitive
Hadoop Writable classes
• Primitive Writables can be used in simple applications like WordCount, but
clearly they cannot serve our purpose all the time.
• If you still wanted to use only the primitive Hadoop Writable(s), you
would have to convert a composite value into a string and transmit it; however,
that gets very messy when you have to deal with string manipulations.
Writing a custom Writable (like the key sketched above) avoids this.
InputFormat
• The InputFormat class is one of the fundamental classes in the
Hadoop MapReduce framework. This class is responsible for
defining two main things:
 Data splits
 Record reader
• A data split is a fundamental concept in the Hadoop MapReduce
framework which defines both the size of an individual map task and
its potential execution server.
• The record reader is responsible for actually reading records from
the input file and submitting them (as key/value pairs) to the
mapper. A driver-side usage sketch follows the abstract class below.
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
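For the common case you only pick an existing InputFormat in the driver. The sketch below (the class name InputFormatConfig is hypothetical) selects TextInputFormat, whose splits roughly follow HDFS blocks and whose record reader hands the mapper (byte offset, line) pairs as LongWritable/Text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inputformat-demo");
    // TextInputFormat supplies the splits and the line-oriented record reader.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... mapper, reducer and output settings go here.
  }
}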
MultipleInputs
• The MultipleInputs class supports MapReduce jobs that have
multiple input paths, with a different InputFormat and Mapper
for each path.
• MultipleInputs is the feature that supports different input
formats within a single MapReduce job.
Multiple Input Files
• Step 1: Add the configuration in the driver class:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MyMapper1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MyMapper2.class);
• Step 2: Write a different Mapper for each file path:
public class MyMapper1 extends Mapper<Ki, Vi, Ko, Vo> {
}
public class MyMapper2 extends Mapper<Ki, Vi, Ko, Vo> {
}
MultipleOutputFormat
• FileOutputFormat and its subclasses generate a set of
files in the output directory.
• There is one file per reducer, and files are named by the
partition number: part-00000, part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files or to produce multiple files per
reducer.
• Step 1: Register a named output in the driver:
MultipleOutputs.addNamedOutput(job, "NAMED_OUTPUT",
TextOutputFormat.class, Text.class, DoubleWritable.class);
• Step 2: Override the setup() method in the reducer class and create a
MultipleOutputs instance:
public void setup(Context context) throws IOException,
InterruptedException {
  mos = new MultipleOutputs<Text, DoubleWritable>(context);
}
• Step 3: Use the MultipleOutputs instance in the reduce() method to write data
to the output (a complete reducer sketch follows):
mos.write("NAMED_OUTPUT", outputKey, outputValue);
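Assembled into one reducer, a minimal sketch could look like the following (the averaging logic and the class name AverageReducer are assumptions). Note that MultipleOutputs should be closed in cleanup(), otherwise the extra output files may be left incomplete.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private MultipleOutputs<Text, DoubleWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, DoubleWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    int count = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
      count++;
    }
    // Write to the named output registered in the driver instead of context.write().
    mos.write("NAMED_OUTPUT", key, new DoubleWritable(sum / count));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();   // flushes and closes the extra output files
  }
}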
DISTRIBUTED CACHE
• When writing MapReduce applications, you may want some
files to be shared across all nodes in the Hadoop cluster. These can be
simple properties files or executable jar files.
• The distributed cache is configured through the job configuration; what it
does is provide read-only data to every machine in the cluster.
• The framework will copy the necessary files onto a slave node
before any tasks for the job are executed on that node.
• Step 1: Put the file into HDFS:
hdfs dfs -put /rakesh/someFolder /user/rakesh/cachefile1
• Step 2: Add the cache file in the job configuration:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/rakesh/cachefile1"), job.getConfiguration());
• Step 3: Access the cached file (a mapper sketch follows):
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new
FileInputStream(cacheFiles[0].toString());
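As a sketch of step 3 inside a mapper (the file format and the class name LookupMapper are assumptions), the cached file is opened from its local path in setup() and used as an in-memory lookup during map(). The getLocalCacheFiles() call matches the older API shown above; newer code would use context.getCacheFiles().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> lookup = new HashSet<String>();

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cacheFiles = context.getLocalCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
      BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          lookup.add(line.trim());    // assumed format: one lookup key per line
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Hypothetical use: keep only records whose first field appears in the cached list.
    String firstField = value.toString().split("\t")[0];
    if (lookup.contains(firstField)) {
      context.write(new Text(firstField), value);
    }
  }
}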
Mapreduce 1.0 vs Mapreduce 2.0
• One easy way to differentiate between the Hadoop old API and the
new API is the packages.
• Old API package – org.apache.hadoop.mapred
• New API package – org.apache.hadoop.mapreduce
Joins
• Joins are one of the interesting features available in MapReduce.
• When processing large data sets, the need to join data by a
common key can be very useful.
• By joining data you can gain further insight, for example joining with
timestamps to correlate events with a time of day.
• MapReduce can perform joins between very large datasets. The
implementation of a join depends on how large the datasets
are and how they are partitioned. If the join is performed by the
mapper, it is called a map-side join, whereas if it is performed by
the reducer it is called a reduce-side join.
Map-Side Join
• A map-side join between large inputs works by performing the join
before the data reaches the map function.
• For this to work, though, the inputs to each map must be partitioned
and sorted in a particular way.
• Each input data set must be divided into the same number of
partitions, and it must be sorted by the same key (the join key) in
each source.
• All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits
the description of the output of a MapReduce job.
Reduce-Side Join
• Reduce-side joins are simpler than map-side joins since the
input datasets need not be structured. But they are less efficient, as
both datasets have to go through the MapReduce shuffle phase; the
records with the same key are brought together in the reducer. We
can also use the secondary sort technique to control the order of
the records.
• How is it done?
The key of the map output, for the datasets being joined, has to be the
join key, so that matching records reach the same reducer.
• Each dataset has to be tagged with its identity in the mapper, to
help differentiate between the datasets in the reducer, so they can
be processed accordingly.
• In each reducer, the data values from both datasets, for the
keys assigned to that reducer, are available to be
processed as required.
• A secondary sort needs to be done to ensure the
ordering of the values sent to the reducer.
• If the input files are in different formats, we would need
separate mappers, and we would need to use the
MultipleInputs class in the driver to add the inputs and
associate the specific mapper with each. A sketch of the tagging
approach follows.
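A minimal sketch of the tagging idea (without the secondary sort), assuming a hypothetical users file of "userId \t name" and an activity file tagged by a second mapper with the prefix "A|". Class names and the record formats are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Tags each user record with "U|" so the reducer can tell the datasets apart.
// A second mapper (not shown) would tag activity records with "A|".
class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");   // assumed: userId \t name
    context.write(new Text(fields[0]), new Text("U|" + fields[1]));
  }
}

class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text userId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String userName = null;
    List<String> activities = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("U|")) {
        userName = s.substring(2);
      } else if (s.startsWith("A|")) {
        activities.add(s.substring(2));
      }
    }
    // Emit one joined record per activity (inner join on userId).
    if (userName != null) {
      for (String activity : activities) {
        context.write(userId, new Text(userName + "\t" + activity));
      }
    }
  }
}

In the driver, MultipleInputs.addInputPath would associate each input file with its tagging mapper, as described above.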
Improving MapReduce
Performance
• Use compression (LZO, GZIP, Snappy, …) – see the sketch below
• Tune the number of map and reduce tasks appropriately
• Write a Combiner
• Use the most appropriate and compact Writable type for your data
• Reuse Writables
• Reference: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
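For the first tip, a hedged sketch of enabling intermediate (map-output) compression; the property names below are the Hadoop 2.x ones (older releases use mapred.compress.map.output), and Snappy is assumed to be available on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output to cut shuffle I/O.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "compressed-shuffle");
    // ... rest of the job setup as usual.
  }
}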
Yet Another Resource Negotiator
(YARN)
• YARN (Yet Another Resource Negotiator) is the resource
management layer for the Apache Hadoop ecosystem.
In a YARN cluster, there are two types of hosts:
• The ResourceManager is the master daemon that communicates
with the client, tracks resources on the cluster, and orchestrates
work by assigning tasks to NodeManagers.
• A NodeManager is a worker daemon that launches and tracks
processes spawned on worker hosts.
• Containers are an important YARN concept. You can think of a
container as a request to hold resources on the YARN cluster.
• Use of a YARN cluster begins with a request from a client consisting
of an application. The ResourceManager negotiates the necessary
resources for a container and launches an ApplicationMaster to
represent the submitted application.
• Using a resource-request protocol, the ApplicationMaster negotiates
resource containers for the application at each node. Upon
execution of the application, the ApplicationMaster monitors the
container until completion. When the application is complete, the
ApplicationMaster unregisters its container with the
ResourceManager, and the cycle is complete.
Thank You

Editor's Notes

  1. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
  2. Commutativity: a * b = b * a, e.g. 3 + 4 = 4 + 3 or 2 × 5 = 5 × 2. Associativity: (x ∗ y) ∗ z = x ∗ (y ∗ z), e.g. (2 + 3) + 4 = 2 + (3 + 4).
  3. When a MapReduce job is run on a large dataset, the Hadoop Mapper generates large chunks of intermediate data that are passed on to the Hadoop Reducer for further processing, which leads to massive network congestion. To reduce this network congestion, the MapReduce framework offers the ‘Combiner’.
  4. In MapReduce a job is broken into several tasks which execute in parallel. This model of execution is sensitive to slow tasks (even if they are very few in number) as they will slow down the overall execution of a job. Therefore, Hadoop detects such slow tasks and runs duplicate (backup) tasks for them. This is called speculative execution. Speculating more tasks can help jobs finish faster - but can also waste CPU cycles. Conversely, speculating fewer tasks can save CPU cycles - but cause jobs to finish slower. The options documented here allow the users to control the aggressiveness of the speculation algorithms and choose the right balance between efficiency and latency.
  5. The FILE_BYTES_WRITTEN counter is incremented for each byte written to the local file system. These writes occur during the map phase when the mappers write their intermediate results to the local file system. They also occur during the shuffle phase when the reducers spill intermediate results to their local disks while sorting. The off-the-shelf Hadoop counters that correspond to MAPRFS_BYTES_READ and MAPRFS_BYTES_WRITTEN are HDFS_BYTES_READ and HDFS_BYTES_WRITTEN. The amount of data read and written will depend on the compression algorithm you use, if any.
  6. The table above describes the counters that apply to Hadoop jobs. The DATA_LOCAL_MAPS indicates how many map tasks executed on local file systems. Optimally, all the map tasks will execute on local data to exploit locality of reference, but this isn’t always possible. The FALLOW_SLOTS_MILLIS_MAPS indicates how much time map tasks wait in the queue after the slots are reserved but before the map tasks execute. A high number indicates a possible mismatch between the number of slots configured for a task tracker and how many resources are actually available. The SLOTS_MILLIS_* counters show how much time in milliseconds expired for the tasks. This value indicates wall clock time for the map and reduce tasks. The TOTAL_LAUNCHED_MAPS counter defines how many map tasks were launched for the job, including failed tasks. Optimally, this number is the same as the number of splits for the job.
  7. The COMBINE_* counters show how many records were read and written by the optional combiner. If you don’t specify a combiner, these counters will be 0. The CPU statistics are gathered from /proc/cpuinfo and indicate how much total time was spent executing map and reduce tasks for a job. The garbage collection counter is reported from GarbageCollectorMXBean.getCollectionTime(). The MAP*RECORDS are incremented for every successful record read and written by the mappers. Records that the map tasks failed to read or write are not included in these counters. The PHYSICAL_MEMORY_BYTES statistics are gathered from /proc/meminfo and indicate how much RAM (not including swap space) was consumed by all the tasks.
  8. All the counters, whether custom or framework, are stored in the JobTracker JVM memory, so there’s a practical limit to the number of counters you should use. The rule of thumb is to use less than 100, but this will vary based on physical memory capacity.
  9. Serialization is a mechanism for writing the state of an object into a byte stream. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or one of its subinterfaces. More technically, to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object. The reverse operation of serialization is called deserialization.
  10. Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. Hadoop provides several stock classes which implement Writable: Text (which stores String data), IntWritable, LongWritable, FloatWritable, BooleanWritable, and several others. The entire list is in the org.apache.hadoop.io package of the Hadoop source (see the API reference - http://hadoop.apache.org/docs/current/api/index.html).
  11. Custom Writable example:
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
  12. public interface Comparable{ public int compareTo(Object obj); }
  13. WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
  14. Any split implementation extends the Apache base abstract class - InputSplit, defining a split length and locations. A split length is the size of the split data (in bytes), while locations is the list of node names where the data for the split would be local. Split locations are a way for a scheduler to decide on which particular machine to execute this split. A very simple job tracker works as follows: receive a heartbeat from one of the task trackers, reporting map slot availability; find a queued-up split for which the available node is "local"; submit the split to that task tracker for execution. Locality can mean different things depending on storage mechanisms and the overall execution strategy. In the case of HDFS, for example, a split typically corresponds to a physical data block size and locations is a set of machines (with the set size defined by a replication factor) where this block is physically located. This is how FileInputFormat calculates splits.
  15. HIPI is a framework for processing image files with MapReduce.
  16. Code example : http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/
  17. MultipleInputs.addInputPath(job,new Path(args[0]),TextInputFormat.class,CounterMapper.class); MultipleInputs.addInputPath(job,new Path(args[1]),TextInputFormat.class,CountertwoMapper.class);
  18. Its efficiency stems from the fact that the files are only copied once per job and from the ability to cache archives which are un-archived on the slaves. How big is the DistributedCache? The local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB. Where does the DistributedCache store data? /tmp/hadoop-<user.name>/mapred/local/taskTracker/archive
  19. If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes. Before diving into the implementation let us understand the problem thoroughly.
  20. A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, which means the output files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this. If we have two datasets, for example, one dataset having user ids and names and the other having the user activity over the application, then in order to find out which user has performed what activity on the application we might need to join these two datasets so that the user names and the user activity are joined together. The join strategy can be chosen based on the dataset size: if one dataset is small enough to be distributed across the cluster, then we can use the side-data distribution technique.
  21. Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall. Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3GB of disk writes. In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.
  22. The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
  23. Conclusion. Summarizing the important concepts presented in this section: A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster. In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers. The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total. A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the resources requested for the container. Once the container is allocated, those resources are usable by the container. An application in YARN comprises three parts: the application client, which is how a program is run on the cluster; an ApplicationMaster, which provides YARN with the ability to perform allocation on behalf of the application; and one or more tasks that do the actual work (running in a process) in the container allocated by YARN. A MapReduce application consists of map tasks and reduce tasks. A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.