CS 6301: Special Topics in Computer Science
CLOUD COMPUTING
Project #1 Report
Prabhakar Ganesamurthy (pxg130030)
Abstract
A MapReduce program was developed to compute the number of crimes of each crime type per region
from a large dataset. Regions are identified by a 6-digit number, and the region definitions used
in the program take the formats (1XXXXX,1XXXXX), (12XXXX,12XXXX), and (123XXX,123XXX).
Hadoop's behavior was studied under different settings: varying numbers of mapper and reducer
tasks, different region definitions, input as a single large file versus many small files,
inconsistent input, etc. File distribution, mapper and reducer distribution, the performance of
mapper and reducer tasks, and memory usage under these settings are studied and discussed in
this report.
Cluster used: cluster04@master
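
For context, the following is a minimal sketch of what such a mapper and reducer could look like. The class names, the CSV column positions, and the prefix length are assumptions for illustration, not the actual project code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical mapper: emits ((region prefix, crime type), 1) per input record.
    public class CrimeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        // Prefix lengths 1, 2 and 3 correspond to region definitions 1, 2 and 3.
        private static final int PREFIX_LEN = 1;
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String region = fields[0];    // assumed: 6-digit region code in column 0
            String crimeType = fields[1]; // assumed: crime type in column 1
            outKey.set(region.substring(0, PREFIX_LEN) + "," + crimeType);
            context.write(outKey, ONE);
        }
    }

    // Hypothetical reducer: sums the counts for each (region prefix, crime type) key.
    class CrimeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }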
File Distribution in Hadoop:
The default block size in Hadoop is 64 MB (the dfs.block.size setting). To study how files are
distributed among the data nodes, a file smaller than 64 MB (15MB.csv) and a file larger than
64 MB (137MB.csv) were uploaded on 2/21/2014 at 18:03.
This snapshot shows the commands used to upload the files to Hadoop.

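The snapshot's exact commands are not reproduced here; as an alternative illustration, the same upload can be done programmatically through the HDFS FileSystem API. A sketch, assuming the two files sit in the local working directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies the two test files into the user's HDFS home directory,
    // the programmatic equivalent of uploading them from the shell.
    public class UploadFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up the cluster's site configuration
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("15MB.csv"), new Path("/user/pxg130030/"));
            fs.copyFromLocalFile(new Path("137MB.csv"), new Path("/user/pxg130030/"));
            fs.close();
        }
    }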
This snapshot shows the files (137MB.csv and 15MB.csv) listed under /user/pxg130030/ along
with their timestamps.

In cluster04 the data nodes are slave02 and slave03.
The logs of slave02 and slave03 were accessed and the following snapshots show how the two
files are distributed.
hadoop-hadoop-datanode-slave2.log.2014-02-21(15MB.csv):

The 15MB.csv file is sent from 192.168.0.120 (master) to 192.168.0.122 (slave02). As
15 MB < 64 MB, the file is not split; it is stored whole in a single block.
hadoop-hadoop-datanode-slave2.log.2014-02-21(137MB.csv):

The 137MB.csv file is sent from 192.168.0.120 (master) to 192.168.0.122 (slave02).
As 137 MB > 64 MB, the file is split into three blocks (64 MB + 64 MB + 9.88 MB) and stored.
hadoop-hadoop-datanode-slave3.log.2014-02-21(15MB.csv and 137MB.csv)

The 15MB.csv file is sent from 192.168.0.122 (slave02) to 192.168.0.123 (slave03). As
15 MB < 64 MB, the file is not split; it is stored whole in a single block.
The 137MB.csv file is sent from 192.168.0.122 (slave02) to 192.168.0.123 (slave03).
As 137 MB > 64 MB, the file is split into three blocks (64 MB + 64 MB + 9.88 MB) and stored.
Note that this replica is copied from slave02, not from the master.

Inference:
• Files larger than the block size are split up and stored as multiple blocks.
• Files smaller than the block size are not split.
• Files are replicated among the data nodes, which enables parallel processing in Hadoop (replica
placement can be verified as in the sketch below).
• The files are copied in a pipeline from the master to slave02, and then from slave02 to slave03.
• Judging from the timestamps in the log files, the blocks finish being stored on slave03 first
and then on slave02.
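
Block placement can also be checked without reading the data-node logs: the FileSystem API reports each block's byte range and the hosts holding its replicas. A minimal sketch, reusing the uploaded path from above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints one line per block of 137MB.csv: its byte range and replica hosts.
    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/pxg130030/137MB.csv"));
            for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }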
Distribution & Performance of Mapper and Reducer Tasks:
The various settings under which the performance of mapper and reducer tasks were studied are as
follows:
Input files:
1. Single large Input file
2. Many small Input files (the large file split into 1341 files)
Mapper and Reducer Numbers: 1, 2, or 5.
Region definitions considered: (1XXXXX,1XXXXX), (12XXXX,12XXXX), (123XXX,123XXX)
Parameters studied: Execution time per task and job, memory usage.
Distribution of Mapper and Reducer Tasks:
As there are only two task nodes in the cluster, the mapper tasks are split between them, i.e.,
slave02 and slave03, as shown in the following log.

The settings are:
For Many Small Input Files:
Map   Reduce   Definition   Output Folder Name
1     1        1            manyMap1Red1Def1
1     1        2            manyMap1Red1Def2
1     1        3            manyMap1Red1Def3
5     5        1            manyMap5Red5Def1
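
Such settings are typically applied in the job driver, along the lines of the hypothetical sketch below (the class names and paths are illustrative, taken from the sketch earlier in this report). Note that the map-task count is only a hint to Hadoop, since the actual number of map tasks follows the input splits, while the reduce-task count is honored exactly.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CrimeCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapred.map.tasks", 5);           // hint only (Hadoop 1.x property name)
            Job job = Job.getInstance(conf, "crime-count");
            job.setJarByClass(CrimeCountDriver.class);
            job.setMapperClass(CrimeCountMapper.class);   // hypothetical classes from the
            job.setReducerClass(CrimeCountReducer.class); // sketch earlier in this report
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(5);                     // reducer count is honored exactly
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path("manyMap5Red5Def1"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }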
Execution time:
The execution times were calculated from the corresponding log files.
The observations are visualized below:
In all cases, mapping takes more time than reducing.
There are 1341 input files in total, and they are distributed among the mappers.
From the chart it is observed that when the number of reducers is 5, reducing takes longer. This
is because there are only 2 slave nodes, so at most 2 reducers can run at a time; the remaining
reducers must wait for the running reducers to complete, which increases the total time taken
for reduction.

There is a slight increase in map and reduce time when using region definitions 2 and 3 compared
to definition 1. This is because the numbers of map and reduce records are in the order
Def1 < Def2 < Def3, i.e., (1XXXXX,1XXXXX) < (12XXXX,12XXXX) < (123XXX,123XXX); hence the
slight increase for Def2 and Def3.
The execution time per task for manyMap5Red5Def1 is as follows:
Map Tasks

Reduce Tasks:

The last reduce task has a very small execution time. This is because the output of the map
tasks is shuffled, sorted, and divided among the reducers. The first 4 reducers received roughly
equal numbers of records, whereas the small remainder of records was sent to the last reducer;
hence its small execution time.
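
Which reducer a record is sent to is decided by the partitioner. By default, Hadoop uses HashPartitioner, whose logic is essentially the following: keys are spread across reducers by hash value, not balanced by record count, which is why the split among reducers can be uneven.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent of Hadoop's default HashPartitioner: a given key always
    // goes to the same reducer, chosen by hash modulo the reducer count.
    public class DefaultStylePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }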
Map vs Reduce:

A similar per-task and per-job execution time analysis is done.
Memory Usage:
The memory usage of each setting is obtained from the corresponding log files.
It is visualized below.

In the above chart, the first 3 bars represent the region definitions Def1, Def2, and Def3, run
with the same numbers of map and reduce tasks. As the numbers of records are in the order
Def3 > Def2 > Def1, memory usage follows the same order.
Having 5 reducers, more than the actual number of slave nodes, results in the use of more
memory.

For a Single Large Input File:
Map   Reduce   Definition   Output Folder Name
1     1        1            singleMap1Red1Def1
1     1        2            singleMap1Red1Def2
1     1        3            singleMap1Red1Def3
1     2        1            singleMap1Red2Def1
1     2        2            singleMap1Red2Def2
1     2        3            singleMap1Red2Def3
2     1        1            singleMap2Red1Def1
2     1        2            singleMap2Red1Def2
2     1        3            singleMap2Red1Def3
2     2        1            singleMap2Red2Def1
2     2        2            singleMap2Red2Def2
2     2        3            singleMap2Red2Def3
5     1        1            singleMap5Red1Def1
5     5        1            singleMap5Red5Def1
Execution Time:
The execution times were calculated from the corresponding log files.
The observations are visualized below:

From the above chart, it is seen that the more records there are to process (for example, with
region definition 3), the longer the execution times of the mappers and reducers. Also, as the
total number of slave nodes is 2, when the number of reducers is 2 or more, reducers have to
wait for earlier reducers to complete, thereby increasing the total time taken for reduce tasks.
In the Map 5, Red 5, Def 1 setting, the number of reducers is so high that the total time taken
by reducer tasks exceeds that of the mapper tasks. This bad setting is corrected in the
Map 5 Red 1 Def 1 case, where the total time taken for reduce tasks is significantly lower.

The per-task execution time of the Map 5 Red 1 Def 1 setting is shown below:
The single large file is split into 33 parts and sent to the mappers. (With the default 64 MB
block size, 33 splits suggest an input file of roughly 2 GB.)
Map vs Reduce:

Memory Usage:
The memory usage of each setting is obtained from the corresponding log files.
It is visualized below.

As observed before for many small inputs, memory usage is higher when there are more records to
process (Def3) and when there are more reducers than slave nodes (Map5Red5Def1).
Single Large Input File Vs Many Small Input Files:
The contents of the single large file and of the many small input files are identical, but the
execution time and memory usage differ.
Execution Time:

From the above chart, it is clear that the time taken to process the many small inputs is much
higher than the time taken to process the single large file. This is because the single large
file is split into 33 parts, whereas with many small inputs each file goes to a mapper on its
own (as long as it is not too big for one mapper to process), making 1341 parts for 1341 files.
Hence the difference.
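
This behavior follows from how Hadoop's FileInputFormat computes input splits: a split never crosses a file boundary, so every small file becomes at least one split of its own. The following is a simplified model of that logic, not the real implementation (which adds details such as a slack factor and host locality); the 2100 MB size is an assumption chosen to be consistent with the report's 33 splits.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of FileInputFormat's split computation: splits never
    // cross file boundaries, so N small files yield at least N splits, while
    // one large file yields ceil(fileSize / splitSize) splits.
    public class SplitModel {
        static List<long[]> computeSplits(long[] fileSizes, long splitSize) {
            List<long[]> splits = new ArrayList<>();  // each entry: {fileIndex, offset, length}
            for (int i = 0; i < fileSizes.length; i++) {
                long remaining = fileSizes[i];
                do {
                    long length = Math.min(splitSize, remaining);
                    splits.add(new long[] { i, fileSizes[i] - remaining, length });
                    remaining -= length;
                } while (remaining > 0);
            }
            return splits;
        }

        public static void main(String[] args) {
            long MB = 1L << 20;
            // One ~2 GB file with 64 MB splits -> 33 splits, as in the report.
            System.out.println(computeSplits(new long[] { 2100 * MB }, 64 * MB).size());
        }
    }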
Memory usage:

From the above chart, it is evident that the memory usage when processing many small files is
very high compared to that of the single large file. This is because, with small input files,
the mappers and reducers do not get to work at full capacity: a small input file may contain
far less data than a mapper can handle, yet it still occupies a mapper of its own. When
processing a single large file, the file is split so that a mapper gets close to the maximum
amount of data it can process; hence memory usage is minimized.
Shuffling and Sorting:
Shuffling and sorting occur after the map tasks: they shuffle and sort the output records of the
mappers and send them to the reducers, so the reducer input is always sorted by key. For example
(with hypothetical crime types), if the mappers emit ("1,THEFT",1), ("1,ASSAULT",1), and
("1,THEFT",1), a reducer receives ("1,ASSAULT",[1]) and then ("1,THEFT",[1,1]).

Error handling:
A runtime error was introduced while Hadoop was processing a set of input files, and how Hadoop
handled the error was observed.
Error introduced: After starting the processing of input files, one of the input files was removed.
Hadoop generated the following log:

Hadoop throws a FileNotFoundException during the m_000008 (map) task because the corresponding
file was deleted from HDFS at runtime.
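
For reference, the experiment can be reproduced by deleting one input file from HDFS while the job is running; a sketch, with a hypothetical input path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Deletes a single input file (non-recursively) while the job runs, which
    // makes the map task reading it fail with a FileNotFoundException.
    public class RemoveInputFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.delete(new Path("/user/pxg130030/input/part-0008.csv"), false);
            fs.close();
        }
    }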
Conclusion:
In the above analysis, Hadoop's behavior under different settings was studied.
The following can be inferred from this analysis:
• Mappers and reducers should be configured sensibly: the number of mappers according to the
input size, and the number of reducers according to the number of available slave nodes (less
than or equal to it).
• Hadoop performs better on a single large file than on an equivalent set of many small files.
• How data is distributed in HDFS was studied.
• How Hadoop handles errors was observed.
