International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.5, October 2014 
AN EXPERIMENTAL EVALUATION OF PERFORMANCE 
OF A HADOOP CLUSTER ON REPLICA MANAGEMENT 
Muralikrishnan Ramane, SharmilaKrishnamoorthy and Sasikala Gowtham 
Department of Information Technology, University College of Engineering Villupuram, 
Tamilnadu, India. 
ABSTRACT 
Hadoop is an open source implementation of the MapReduce framework in the realm of distributed processing. 
A Hadoop cluster is a specialized type of computational cluster designed for storing and analyzing large datasets 
across a cluster of workstations. To handle data at massive scale, Hadoop relies on the Hadoop Distributed File 
System (HDFS). Like most distributed file systems, HDFS faces a common problem of data sharing and 
availability among compute nodes, which often leads to a decrease in performance. This paper presents an 
experimental evaluation of Hadoop's computing performance, carried out by designing a rack-aware cluster 
that utilizes Hadoop's default block placement policy to improve data availability. Additionally, an adaptive 
data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit 
the scenario. Experiments conducted on the rack-aware cluster setup show that replication significantly reduces 
the task completion time, but once the volume of the data being processed increases there is a considerable 
cutback in computational speed due to update cost. Further, the threshold level for the balance between the update 
cost and the replication factor is identified and presented graphically. 
KEYWORDS 
Replica Management; MapReduce Framework; Hadoop Cluster; Big Data. 
1.INTRODUCTION 
A distributed system is a pool of autonomous compute nodes [1] connected by high-speed networks that 
appears as a single system. In practice, solving a complex problem involves dividing it into 
subtasks, each of which is solved by one or more compute nodes that communicate with each 
other by message passing. The current inclination towards Big Data analytics has led to such compute 
intensive tasks. 
Big Data [2] is the term for collections of data sets so large and complex that they are difficult to 
process using traditional data processing tools. The need for Big Data management is to ensure high 
levels of data accessibility for business intelligence and big data analytics. This requires 
applications capable of distributed processing involving terabytes of information saved in a variety of 
file formats. 
DOI:10.5121/ijcsa.2014.4507
Hadoop [3] is a well-known and successful open source implementation of the MapReduce 
programming model in the realm of distributed processing. The Hadoop runtime system, coupled with 
HDFS, provides parallelism and concurrency to achieve system reliability. The major categories of 
machine roles in a Hadoop deployment are client machines, master nodes and slave nodes. The 
master nodes supervise the storing of data and the running of parallel computations on that data using 
MapReduce: the NameNode supervises and coordinates the data storage function in HDFS, while the 
JobTracker supervises and coordinates the parallel processing of data using MapReduce. Slave nodes 
make up the vast majority of machines and do all the heavy work of storing the data and running the 
computations. Each slave runs a DataNode daemon and a TaskTracker daemon that communicate with and 
receive instructions from their master nodes. The TaskTracker daemon is a slave to the JobTracker; 
likewise, the DataNode daemon is a slave to the NameNode. 
The HDFS [4] file system is designed for storing huge files with streaming data access patterns, running on 
clusters of commodity hardware. An HDFS cluster has two types of nodes operating in a master-slave 
pattern: a NameNode (master) managing the file system namespace, the file system tree and the metadata 
for all the files and directories in the tree, and some number of DataNodes (workers) managing the data 
blocks of files. HDFS installations are so large that replicas of files are constantly created to meet performance 
and availability requirements. 
A replica [5] is usually created so that the new storage location offers better performance and 
availability for accesses to or from a particular location. In the Hadoop architecture the replica location is 
commonly selected based on storage and network feasibility, which makes the system fault tolerant and able to 
recover from a failing DataNode. Hadoop replicates files based on a rack-aware cluster setup: by 
default it stores each block at three principal locations within the cluster, where the first copy is stored on the local 
node and the other two copies are stored on a remote rack. Additional replicas are stored randomly on 
any rack, and the policy can be configured and overridden using scripts. 
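For instance, the cluster-wide default replication factor is governed by the dfs.replication property in hdfs-site.xml, and the stock HDFS shell can override it per file. A minimal sketch (the file path is a placeholder, not from the paper): 

# Raise the replication factor of one file to 5; -w waits until 
# HDFS has actually created the requested replicas. 
hadoop fs -setrep -w 5 /user/hadoop/input/dataset.txt 

# List the file; the replication factor appears in the second column. 
hadoop fs -ls /user/hadoop/input/dataset.txt 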
The rest of this paper is organized as follows. Section 2 discusses related studies on data replication 
schemes in clusters; Section 3 describes the proposed system model of a Hadoop cluster and the data 
locality problem; Section 4 evaluates the performance of the system by conducting experiments on 
varying data replication levels. Finally, Section 5 concludes and discusses the future scope of this 
work. 
2.RELATED STUDIES 
The purpose of data replication in HDFS is primarily to improve the availability of data. Replication of 
a data file preserves system reliability when one or more nodes fail in a cluster. 
Recently, several studies have sought to improve the fault tolerance of data in the presence of failures, 
and a few of them are discussed below. 
Abad, Lu and Campbell [6] proposed a data replication and placement algorithm (DARE) that 
adapts to fluctuations in workload. It assumes the scheduler is unaware of the data replication 
policy, and it was implemented and evaluated using the Hadoop framework. When local data is not 
available, a node retrieves data from a remote node, processes the assigned task and discards the 
data after completion. DARE takes advantage of these existing remote data retrievals: it selects a subset of the 
data and creates a replica without consuming additional network and computation resources. Nodes run
the algorithm independently to create replicas that are likely to be heavily accessed. The authors 
designed a probabilistic dynamic replication algorithm with the following features: 
1. Nodes sample their assigned tasks and replicate popular files in a distributed manner. 
2. Correlated data accesses are spread over diverse nodes as old replicas are deleted and new 
replicas are created. 
Experiments showed a 7-fold improvement in data locality and a 70% improvement in cluster scheduling, 
with job turnaround time reduced by 16% in dedicated clusters and 19% in virtualized public clouds. 
Seo, Jang et al. [7] proposed optimization schemes, namely prefetching and pre-shuffling, 
to address shared-environment problems. Both schemes were implemented in the High 
Performance MapReduce Engine (HPMR). In intra-block prefetching an input split or an intermediate 
output is prefetched, whereas in inter-block prefetching the whole candidate data block is prefetched. 
The pre-shuffling scheme reduces the amount of intermediate output to shuffle: at 
pre-shuffling time HPMR looks over an input split before the map phase begins and predicts the 
target reducer where the key-value pairs are to be segregated. A new task scheduler was designed for 
pre-shuffling and is used only for the reduce phase. The prefetching schemes improve data locality, and the 
pre-shuffling scheme significantly reduces the shuffling overhead during the reduce phase. The schemes 
provide the following contributions: 
1. Performance degradation analysis of Hadoop in a shared MapReduce computation environment. 
2. Prefetching and pre-shuffling schemes to improve MapReduce performance when physical 
nodes are shared by multiple users. 
3. HPMR reduces network overhead and exploits data locality, and is compatible with both dedicated 
and shared environments. 
Khanli and Isazadeh [8] proposed an algorithm to decrease access latency by predicting the future 
usage of files. Predictive Hierarchical Fast Spread (PHFS) pre-replicates data in a hierarchical data grid 
using two phases: collecting data access statistics and applying data mining techniques such as clustering 
and association rule mining over the whole system. Files are assigned a value between 0 and 1 
representing the relationships between them, and files are grouped according to this value into what is 
called the predictive working set (PWS). PHFS 
utilizes the PWS of a file and replicates all members of the PWS, that is, the file and all files on the path 
from source to client. PHFS tries to improve data locality by predicting users' future demands and 
pre-replicating files in advance, thereby achieving higher availability with optimized usage of 
resources. 
Lee, Lim et al. [9] proposed an adaptive data replication scheme (ADRAP) that adapts to the 
overhead associated with the data locality problem. The algorithm works on access count 
prediction to reduce data transfer time and improve data locality, thereby reducing total processing 
time. The scheme adaptively determines the required replication factor by evaluating data access 
patterns and the current replication factor of a particular data file. The paper contributes the following:
1. Optimizes the replication factor and effectively avoids the overhead caused by data replication. 
2. Dynamically determines data replication requirements. 
3. Minimizes the processing time of MapReduce jobs by improving data locality. 
Zaharia, Borthakur et al. [10] proposed delay scheduling, a method that addresses the conflict between 
fairness in scheduling and data locality, demonstrated by designing a fair scheduler for a 600-node Hadoop cluster at 
Facebook. Delay scheduling schedules jobs according to fairness but waits for a small amount of time, 
letting other jobs launch tasks. It achieves nearly optimal data locality in a variety of workloads and 
increases throughput by up to 2x while preserving fairness. 
The algorithm is applicable under a wide variety of scheduling policies beyond fair sharing, such as the 
Hadoop Fair Scheduler (HFS). HFS has two main goals: fair sharing and data locality. To achieve these goals 
the scheduler reallocates resources between jobs when the number of jobs changes, either by killing running 
tasks to make room for the new job or by waiting for running tasks to finish. 
Delay scheduling performs well in typical Hadoop workloads and is applicable beyond fair sharing. 
Delay scheduling in HFS is generalized to implement a hierarchical scheduling policy motivated by 
the needs of Facebook's users: the scheduler divides slots between users based on weighted fair 
sharing at the top level and allows users to schedule their own jobs using either FIFO or fair sharing. 
3.PROPOSED WORK 
The purpose of this research is to evaluate the performance of a Hadoop cluster by designing a rack-aware 
Hadoop cluster. To this end, a data replication scheme is adapted to fit the system and 
implemented on a rack-aware Hadoop cluster in which tasks are run manually with varying 
levels of data replication. The setup and the shell scripts required for the implementation are presented in 
detail in the following paragraphs. This research work makes a modest contribution toward: 
• Minimizing processing time and data transfer load between racks by improving data locality. 
3.1.Rack Aware Hadoop Clusters 
Data availability and locality are interrelated concerns in the realm of distributed processing which, 
when not handled appropriately, lead to performance issues. When a scheduled task does not have the 
data required for its processing on its node, it has to load the data from another node, 
causing poor throughput. This paper deals with exactly this type of situation, where a cluster of nodes is 
involved and each node belongs to the same or a different rack. 
For the purpose of experimental evaluation this paper utilizes a Hadoop cluster setup with one master 
node and seven slave nodes, each configured manually to define the rack number it belongs to. This paper 
utilizes an improved data placement policy to prevent data loss and improve network performance.
Data blocks are replicated to multiple machines to prevent data loss due to machine failures. To cope with 
events such as a switch or power failure, the NameNode stores the locality information of DataNodes in the 
network topology and utilizes this information to decide where to place data replicas within the cluster. It is 
assumed that two machines in the same rack have more bandwidth and lower latency between each 
other than two machines in two different racks, and that cross-rack latency is 
higher than in-rack latency most of the time. The system designed for evaluation is shown in Figure 
1. 
[Figure 1. Rack aware Hadoop Cluster Setup. The figure shows a master node (MN) running the NameNode (NN) and slave nodes (SN) running DataNode (DN) daemons, arranged in four racks connected through switches, with data blocks A, B, C and D replicated across the racks.]
3.2.Access Count Prediction 
Each data file maintains its own replication factor, and a higher replication factor for a 
file with a higher access count does not always guarantee better data locality. Data is replicated 
cautiously, for if the replication factor is higher than the access count for a particular file then the 
probability of the file being processed with node locality is higher than in the opposite case. To normalize 
the replication factor, a method that predicts the next access count for a data file is required. 
The method initializes its variables, then calculates the average time interval between 
data accesses, and finally predicts the next access, thereby calculating the number of future accesses. 
The predicted access count is compared with the current replication factor to determine 
the optimal replication factor. In addition, the number of rack-off locality nodes is effectively reduced 
by the replica placement policy. To predict the access count of individual files this work utilizes 
Lagrange's interpolation using a polynomial expression. The mathematical formula is given below: 
f(x) = \sum_{i=1}^{N} f_i \prod_{j=1,\, j \neq i}^{N} \frac{x - x_j}{x_i - x_j} …….. (1) 
In the equation, 
let N be the number of points, 
let x_i be the i-th point, and 
let f_i be the value of the function at x_i. 
To calculate the predicted access count, 
• substitute x by the time t, where t is the time of access, and 
• f(x) by the access count y at t. 
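As a worked illustration (with hypothetical numbers, not taken from the paper's experiments), suppose a file's cumulative access counts at access times t = 1, 2, 3 were y = 2, 5, 10. Substituting these three points into (1) gives 

y(t) = 2\frac{(t-2)(t-3)}{(1-2)(1-3)} + 5\frac{(t-1)(t-3)}{(2-1)(2-3)} + 10\frac{(t-1)(t-2)}{(3-1)(3-2)} = t^2 + 1 

so the predicted access count at the next access time t = 4 is y(4) = 17, which would then be weighed against the file's current replication factor. 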
3.3. Data Placement Policy 
In the process of evaluating the data placement policy, this work utilized Hadoop clusters that are 
arranged in racks. Network traffic between in-rack nodes is far cheaper than between off-rack nodes. The 
cluster administrator decides which rack each node belongs to through the configuration variable 
net.topology.script.file.name, which points to a script used to determine the rack id of each node. 
In a default installation all nodes are assumed to belong to the same rack, with the same rack id. 
The data placement policy is designed with a Hadoop cluster in mind, which usually comes in varying sizes, 
and so the policy varies accordingly. In the case of a small cluster, servers are connected by a single switch, 
giving two levels of locality: on-machine and off-machine. For larger installations it must be kept in mind 
that data replicas exist on multiple machines and span multiple racks. A rack-aware Hadoop file system
is created with the use of scripts, which allow the master node to map the network topology of the 
cluster. The script is an executable that returns the rack address of each node. 
The script writes to stdout a list of rack names, one for each input, where the inputs are the 
IP addresses of the nodes in the cluster, provided as arguments in a consistent order. The mapping script is 
specified by the key topology.script.file.name in conf/hadoop-site.xml. By default Hadoop passes a set of 
IP addresses as command line arguments and the script returns the corresponding rack ids. Rack ids are 
represented as hierarchical path names: every node has the default rack id /default-rack, and for large 
installations rack ids are denoted using the entire topology starting from the top-level switch down to the 
rack name, as in /top-switch-name/rack-name. 
3.3.1. Configuring Rack Awareness 
To configure a Hadoop cluster into a rack-aware system, data must be divided into multiple file blocks 
that are stored on different machines across the cluster. When a Hadoop file 
system is not rack-aware, there is a possibility that Hadoop will place all copies of a data 
block in the same rack. Failing to configure Hadoop as a rack-aware system may therefore result in loss of 
data when a rack fails; rack failure is not frequent, and the risk can be avoided through the Hadoop 
configuration files. Hadoop is configured using the topology property, where the actual setup is provided as 
an input file to the script for rack identification. The following script performs rack identification based on IP 
addresses, given a hierarchical IP addressing scheme enforced by the network administrator. 
Configuring rack awareness in Hadoop involves two steps: 
• Configure the "topology.script.file.name" property in core-site.xml: 

<property> 
<name>topology.node.switch.mapping.impl</name> 
<value>org.apache.hadoop.net.ScriptBasedMapping</value> 
</property> 
<property> 
<name>topology.script.file.name</name> 
<value>core/rack-awareness.sh</value> 
</property> 

• Provide the mapping script (rack-awareness.sh) and its data file (topology.data): 
Rack-awareness.sh
#!/bin/bash 
# Maps each host name argument to its rack id using topology.data. 
HADOOP_CONF=/usr/local/hadoop/conf 
while [ $# -gt 0 ] ; do 
  nodeArg=$1 
  exec< ${HADOOP_CONF}/topology.data 
  result="" 
  while read line ; do 
    ar=( $line ) 
    if [ "${ar[0]}" = "$nodeArg" ] ; then 
      result="${ar[1]}" 
    fi 
  done 
  shift 
  if [ -z "$result" ] ; then 
    echo -n "/default-rack " 
  else 
    echo -n "$result " 
  fi 
done 
Topology.data (the first column is the host name exactly as it will be passed to the script; Machine1 is the master node, the rest are slaves): 

Machine1.pc /dc1/rack1 
Machine2.pc /dc1/rack1 
Machine3.pc /dc2/rack2 
Machine4.pc /dc2/rack2 
Machine5.pc /dc3/rack3 
Machine6.pc /dc3/rack3 
Machine7.pc /dc4/rack4 
Machine8.pc /dc4/rack4 
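As a quick sanity check (a hypothetical invocation, not part of the paper's procedure), the script can be run by hand with a few host names; it should echo one rack id per argument, falling back to the default for unknown hosts: 

$ ./rack-awareness.sh Machine3.pc Machine8.pc unknown-host.pc 
/dc2/rack2 /dc4/rack4 /default-rack 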
4.EXPERIMENTAL EVALUATION 
Experiments were conducted on the rack-aware Hadoop cluster to evaluate its performance in terms of 
data availability. Two scenarios are involved: one with many data files, and one with no data 
files but complex computations. The former is the WordCount application and the latter is Pi value 
calculation; both experiments were conducted on Hadoop 2.2.0. 
Machine Configuration: 8 nodes (identical) 
Processor: Intel Core i5, 3.3 GHz 
RAM: 8 GB 
HDD: 400 GB 
Nodes within a rack are connected by one Ethernet switch, and one Fast Ethernet switch is used 
between racks. The size of a data block is set to 64 MB, with an increasing replication factor starting from
1. Specifically, 8 jobs are run at replication levels ranging from 1 to 8, which is not greater than the number 
of nodes available in the cluster. Only the map phase is measured, i.e. the completion 
time and data locality of the map phase, averaged over 8 runs. As observed, in terms of throughput, 
tasks with node locality do better than tasks with rack-off locality. 
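Such a run can be scripted directly against the stock Hadoop examples jar (a sketch under assumed paths; hadoop-mapreduce-examples-2.2.0.jar ships with Hadoop 2.2.0, while the input and output paths are placeholders for this setup): 

# Run WordCount and the Pi estimator at replication levels 1..8, timing each job. 
EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar 
for r in 1 2 3 4 5 6 7 8 ; do 
  # Re-replicate the WordCount input at level $r and wait for HDFS to finish. 
  hadoop fs -setrep -w $r /user/hadoop/wordcount/input 
  time hadoop jar $EXAMPLES wordcount /user/hadoop/wordcount/input /user/hadoop/wordcount/output-$r 
  # Pi estimation reads no input files: 8 map tasks, 10^6 samples each. 
  time hadoop jar $EXAMPLES pi 8 1000000 
done 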
4.1. Results and Graph 
Performance evaluations show that as replication levels increase, the task completion time is 
significantly reduced for computations involving no data files. For computations that involve data 
files, however, the completion time first decreases and then shoots up again due to update cost. Both 
experiments were conducted at replication levels ranging from one to eight, which is not higher than the 
number of nodes in the cluster. 
4.1.1. Experiment for Pi Value 
Figure 2. Replication Levels for Pi Value 
Figure 2 shows that the data replication scheme reduces the task completion time of the Pi value 
calculation. With increasing replication factor there is a steady gain in performance: at replication 
level 3 the completion time is 337 s, and it reduces considerably to 8.12 s at replication level 8. 
This shows that as the replication factor increases, multiple map phases are introduced and thus the 
computation speeds up.
4.1.2. Experiment for WordCount 
Figure 3. Replication Levels for Word Count 
Figure 3 shows the task completion time for the WordCount application, where the completion time 
reduces linearly as the replication factor increases, but once it reaches the threshold level the performance 
starts to deteriorate. This shows that computations involving data files do not improve linearly in 
performance as replication increases. By default the replication level in HDFS is set to 3, which 
limits the performance, and the completion time there is 132220 seconds. At higher 
replication levels the computation speeds up, but once the threshold is reached the completion time 
shoots up from 1300430 s to 1608600 s. 
CONCLUSIONS 
Data replication in the Hadoop framework was experimented with to investigate the data locality problem, 
and it was shown that replication improves performance but also degrades it once the threshold limit is crossed. 
Further, the replication factor was supported by an access count prediction algorithm for data files 
using Lagrange's interpolation, which optimizes the replication factor per data file. Performance 
evaluation showed that our data replication scheme reduces the task completion time. The task 
completion time for the Pi value calculation started at 680 s and came down to 8.12 s; similarly, the 
WordCount application started with a completion time of 1384300 s that came down to 1300430 s. 
But once the threshold limit is reached, the completion time shoots up to 1608600 s.
ACKNOWLEDGEMENTS 
The authors would like to thank all those who have shared their valuable inputs, especially 
Ganeshpandi and Lakshmi of the Department of Information Technology, University College of 
Engineering Villupuram, Tamilnadu, India, for their insights, suggestions and time throughout the 
course of this work. 
REFERENCES 
[1] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., ... Gruber, R. E. (2008). 
Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 
(TOCS), 26(2), 4. 
[2] Agrawal, Divyakant, Sudipto Das, and Amr El Abbadi. Big data and cloud computing: current state and 
future opportunities. In Proceedings of the 14th International Conference on Extending Database 
Technology, pp. 530-533. ACM, 2011. 
[3] Hadoop : http://hadoop.apache.org/ 
[4] Hadoop DFS User Guide. http://hadoop.apache.org/. 
[5] Fadika, Zacharia, Madhusudhan Govindaraju, Richard Canon, and Lavanya Ramakrishnan. Evaluating 
Hadoop for data-intensive scientific operations. In Cloud Computing (CLOUD), 2012 IEEE 5th International 
Conference on, pp. 67-74. IEEE, 2012. 
[6] Abad, Cristina L., Yi Lu, and Roy H. Campbell. DARE: Adaptive data replication for efficient 
cluster scheduling. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pp. 159- 
168. IEEE, 2011. 
[7] Seo, Sangwon, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng. HPMR: 
Prefetching and pre-shuffling in shared MapReduce computation environment. In Cluster Computing and 
Workshops, 2009. CLUSTER'09. IEEE International Conference on, pp. 1-8. IEEE, 2009. 
[8] Khanli, Leyli Mohammad, Ayaz Isazadeh, and Tahmuras N. Shishavan. PHFS: A dynamic replication 
method to decrease access latency in the multi-tier data grid. Future Generation Computer Systems 27, no. 
3 (2011): 233-244. 
[9] Lee, Jungha, Jong Beom Lim, Heonchang Yu, Daeyong Jung, KwangSik Chung, and JoonMin Gil. Adaptive 
Data Replication Scheme Based on Access Count Prediction in Hadoop. The World Congress in Computer 
Science, Computer Engineering, and Applied Computing, 2013. 
[10] Zaharia, Matei, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. 
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In 
Proceedings of the 5th European conference on Computer systems, pp. 265-278. ACM, 2010. 
Author 
Muralikrishnan Ramane completed a post graduate degree in the field of Computer Science and 
Engineering, specializing in Distributed Computing Systems. He is currently working as a Lecturer 
at University College of Engineering Villupuram, Villupuram, India. To his credit he has three 
International Journal and one International Conference publications. His research interests include 
Data Management in Distributed Environments such as Grid, P2P and Cloud.

More Related Content

What's hot

Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEijdpsjournal
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET Journal
 
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGREPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGcsandit
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction EMC
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 

What's hot (17)

Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
hadoop
hadoophadoop
hadoop
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Cppt
CpptCppt
Cppt
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Hadoop
HadoopHadoop
Hadoop
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
 
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGREPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
SiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica OptimizationSiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica Optimization
 
paper
paperpaper
paper
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 

Viewers also liked

Motion detection applied to
Motion detection applied toMotion detection applied to
Motion detection applied toijcsa
 
Assessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-ComputingAssessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-Computingijcsa
 
Image Inpainting System Model Based on Evaluation
Image Inpainting System Model Based on  EvaluationImage Inpainting System Model Based on  Evaluation
Image Inpainting System Model Based on Evaluationijcsa
 
A N A UTOMATION S YSTEM F OR T RACKING C OMMUNITY C ERTIFICATE
A N  A UTOMATION  S YSTEM  F OR  T RACKING  C OMMUNITY  C ERTIFICATEA N  A UTOMATION  S YSTEM  F OR  T RACKING  C OMMUNITY  C ERTIFICATE
A N A UTOMATION S YSTEM F OR T RACKING C OMMUNITY C ERTIFICATEijcsa
 
A novelty approach on forgery digital image
A novelty approach on forgery digital imageA novelty approach on forgery digital image
A novelty approach on forgery digital imageijcsa
 
State of the art parallel approaches for
State of the art parallel approaches forState of the art parallel approaches for
State of the art parallel approaches forijcsa
 
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...ijcsa
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
 
Application of hidden markov model in question answering systems
Application of hidden markov model in question answering systemsApplication of hidden markov model in question answering systems
Application of hidden markov model in question answering systemsijcsa
 
TEST CASE PRIORITIZATION USING FUZZY LOGIC BASED ON REQUIREMENT PRIORITIZING
TEST CASE PRIORITIZATION USING FUZZY LOGIC  BASED ON REQUIREMENT PRIORITIZINGTEST CASE PRIORITIZATION USING FUZZY LOGIC  BASED ON REQUIREMENT PRIORITIZING
TEST CASE PRIORITIZATION USING FUZZY LOGIC BASED ON REQUIREMENT PRIORITIZINGijcsa
 
A new approach to the solution of economic dispatch using particle Swarm opt...
A new approach to the solution of economic dispatch using particle Swarm  opt...A new approach to the solution of economic dispatch using particle Swarm  opt...
A new approach to the solution of economic dispatch using particle Swarm opt...ijcsa
 
EVENT DRIVEN ROUTING PROTOCOLS FOR WIRELESS SENSOR NETWORK- A SURVEY
EVENT DRIVEN ROUTING PROTOCOLS FOR  WIRELESS SENSOR NETWORK- A SURVEYEVENT DRIVEN ROUTING PROTOCOLS FOR  WIRELESS SENSOR NETWORK- A SURVEY
EVENT DRIVEN ROUTING PROTOCOLS FOR WIRELESS SENSOR NETWORK- A SURVEYijcsa
 
Classification of retinal vessels into
Classification of retinal vessels intoClassification of retinal vessels into
Classification of retinal vessels intoijcsa
 
On the 1-2-3-edge weighting and Vertex coloring of complete graph
On the 1-2-3-edge weighting and Vertex coloring of complete graphOn the 1-2-3-edge weighting and Vertex coloring of complete graph
On the 1-2-3-edge weighting and Vertex coloring of complete graphijcsa
 
A coustic pseudo spectrum based fault Localization in motorcycles
A coustic pseudo spectrum based fault Localization in motorcyclesA coustic pseudo spectrum based fault Localization in motorcycles
A coustic pseudo spectrum based fault Localization in motorcyclesijcsa
 
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...ijcsa
 
Automated traffic sign board
Automated traffic sign boardAutomated traffic sign board
Automated traffic sign boardijcsa
 
A S T A S42 O P E N E S P
A S T  A S42 O P E N  E S PA S T  A S42 O P E N  E S P
A S T A S42 O P E N E S PMarina Estrella
 
I See Jesus In Your Eyes
I See Jesus In Your EyesI See Jesus In Your Eyes
I See Jesus In Your EyesSHINE Fest
 

Viewers also liked (20)

Motion detection applied to
Motion detection applied toMotion detection applied to
Motion detection applied to
 
Assessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-ComputingAssessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-Computing
 
Image Inpainting System Model Based on Evaluation
Image Inpainting System Model Based on  EvaluationImage Inpainting System Model Based on  Evaluation
Image Inpainting System Model Based on Evaluation
 
A N A UTOMATION S YSTEM F OR T RACKING C OMMUNITY C ERTIFICATE
A N  A UTOMATION  S YSTEM  F OR  T RACKING  C OMMUNITY  C ERTIFICATEA N  A UTOMATION  S YSTEM  F OR  T RACKING  C OMMUNITY  C ERTIFICATE
A N A UTOMATION S YSTEM F OR T RACKING C OMMUNITY C ERTIFICATE
 
A novelty approach on forgery digital image
A novelty approach on forgery digital imageA novelty approach on forgery digital image
A novelty approach on forgery digital image
 
State of the art parallel approaches for
State of the art parallel approaches forState of the art parallel approaches for
State of the art parallel approaches for
 
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...
EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN PREDICTION OF THE CAR ...
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
 
Application of hidden markov model in question answering systems
Application of hidden markov model in question answering systemsApplication of hidden markov model in question answering systems
Application of hidden markov model in question answering systems
 
TEST CASE PRIORITIZATION USING FUZZY LOGIC BASED ON REQUIREMENT PRIORITIZING
TEST CASE PRIORITIZATION USING FUZZY LOGIC  BASED ON REQUIREMENT PRIORITIZINGTEST CASE PRIORITIZATION USING FUZZY LOGIC  BASED ON REQUIREMENT PRIORITIZING
TEST CASE PRIORITIZATION USING FUZZY LOGIC BASED ON REQUIREMENT PRIORITIZING
 
A new approach to the solution of economic dispatch using particle Swarm opt...
A new approach to the solution of economic dispatch using particle Swarm  opt...A new approach to the solution of economic dispatch using particle Swarm  opt...
A new approach to the solution of economic dispatch using particle Swarm opt...
 
EVENT DRIVEN ROUTING PROTOCOLS FOR WIRELESS SENSOR NETWORK- A SURVEY
EVENT DRIVEN ROUTING PROTOCOLS FOR  WIRELESS SENSOR NETWORK- A SURVEYEVENT DRIVEN ROUTING PROTOCOLS FOR  WIRELESS SENSOR NETWORK- A SURVEY
EVENT DRIVEN ROUTING PROTOCOLS FOR WIRELESS SENSOR NETWORK- A SURVEY
 
Classification of retinal vessels into
Classification of retinal vessels intoClassification of retinal vessels into
Classification of retinal vessels into
 
On the 1-2-3-edge weighting and Vertex coloring of complete graph
On the 1-2-3-edge weighting and Vertex coloring of complete graphOn the 1-2-3-edge weighting and Vertex coloring of complete graph
On the 1-2-3-edge weighting and Vertex coloring of complete graph
 
A coustic pseudo spectrum based fault Localization in motorcycles
A coustic pseudo spectrum based fault Localization in motorcyclesA coustic pseudo spectrum based fault Localization in motorcycles
A coustic pseudo spectrum based fault Localization in motorcycles
 
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...
EVALUATION THE EFFICIENCY OF ARTIFICIAL BEE COLONY AND THE FIREFLY ALGORITHM ...
 
Automated traffic sign board
Automated traffic sign boardAutomated traffic sign board
Automated traffic sign board
 
Xing Fu
Xing  FuXing  Fu
Xing Fu
 
A S T A S42 O P E N E S P
A S T  A S42 O P E N  E S PA S T  A S42 O P E N  E S P
A S T A S42 O P E N E S P
 
I See Jesus In Your Eyes
I See Jesus In Your EyesI See Jesus In Your Eyes
I See Jesus In Your Eyes
 

Similar to An experimental evaluation of performance

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)ijdpsjournal
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on webcsandit
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...cscpconf
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfKUMARRISHAV37
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTERLOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTERijdpsjournal
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterShivraj Raj
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 

Similar to An experimental evaluation of performance (20)

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTERLOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Hadoop
HadoopHadoop
Hadoop
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop cluster
 
hadoop
hadoophadoop
hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 

Recently uploaded

Test of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptxTest of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptxHome
 
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....santhyamuthu1
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfodunowoeminence2019
 
Multicomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfMulticomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfGiovanaGhasary1
 
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptxVertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptxLMW Machine Tool Division
 
Basic Principle of Electrochemical Sensor
Basic Principle of  Electrochemical SensorBasic Principle of  Electrochemical Sensor
Basic Principle of Electrochemical SensorTanvir Moin
 
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxUNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxrealme6igamerr
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS Bahzad5
 
solar wireless electric vechicle charging system
solar wireless electric vechicle charging systemsolar wireless electric vechicle charging system
solar wireless electric vechicle charging systemgokuldongala
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxjasonsedano2
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationMohsinKhanA
 
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfsdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfJulia Kaye
 
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchrohitcse52
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxYogeshKumarKJMIT
 
Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvementVijayMuni2
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxwendy cai
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...amrabdallah9
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Akarthi keyan
 

Recently uploaded (20)

Test of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptxTest of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptx
 
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....
SATELITE COMMUNICATION UNIT 1 CEC352 REGULATION 2021 PPT BASICS OF SATELITE ....
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
 
Multicomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfMulticomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdf
 
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptxVertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
 
Lecture 2 .pptx
Lecture 2                            .pptxLecture 2                            .pptx
Lecture 2 .pptx
 
Basic Principle of Electrochemical Sensor
Basic Principle of  Electrochemical SensorBasic Principle of  Electrochemical Sensor
Basic Principle of Electrochemical Sensor
 
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxUNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
 
solar wireless electric vechicle charging system
solar wireless electric vechicle charging systemsolar wireless electric vechicle charging system
solar wireless electric vechicle charging system
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptx
 
Présentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdfPrésentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdf
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software Simulation
 
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfsdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
 
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptx
 
Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvement
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptx
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part A
 

An experimental evaluation of performance

  • 1. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.5, October 2014 AN EXPERIMENTAL EVALUATION OF PERFORMANCE OF A HADOOP CLUSTER ON REPLICA MANAGEMENT Muralikrishnan Ramane, SharmilaKrishnamoorthy and Sasikala Gowtham Department of Information Technology, University College of Engineering Villupuram, Tamilnadu, India. ABSTRACT Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large datasets across cluster of workstations. To handle massive scale data, Hadoop exploits the Hadoop Distributed File System termed as HDFS. The HDFS similar to most distributed file systems share a familiar problem on data sharing and availability among compute nodes, often which leads to decrease in performance. This paper is an experimental evaluation of Hadoop's computing performance which is made by designing a rack aware cluster that utilizes the Hadoop’s default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Langrange’s interpolation is adapted to fit the scenario. To prove, experiments were conducted on a rack aware cluster setup which significantly reduced the task completion time, but once the volume of the data being processed increases there is a considerable cutback in computational speeds due to update cost. Further the threshold level for balance between the update cost and replication factor is identified and presented graphically. KEYWORDS Replica Management; MapReduce Framework; Hadoop Cluster; Big Data. 1.INTRODUCTION A distributed system is a pool of autonomous compute nodes [1] connected by swift networks that appear as a single workstation. In reality, solving complex problems involves division of problem into sub tasks and each of which is solved by one or more compute nodes which communicate with each other by message passing. The current inclination towards Big Data analytics has lead to such compute intensive tasks. Big Data, [2] is termed for a collection of data sets which are large and complex and difficult to process using traditional data processing tools. The need for Big Data management is to ensure high levels of data accessibility for business intelligence and big data analytics. This condition needs applications capable of distributed processing involving terabytes of information saved in a variety of file formats. DOI:10.5121/ijcsa.2014.4507 87
receives instructions from their master nodes. The TaskTracker daemon is a slave to the JobTracker, just as the DataNode daemon is a slave to the NameNode.

HDFS [4] is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. An HDFS cluster has two types of nodes operating in a master-slave pattern: a NameNode (master) managing the file system namespace, the file system tree and the metadata for all files and directories in the tree, and some number of DataNodes (workers) managing the data blocks of files. Because HDFS holds so much data, replicas of files are constantly created to meet performance and availability requirements.

A replica [5] is usually created so that the new storage location offers better performance and availability for accesses to or from a particular location. In the Hadoop architecture the replica location is commonly selected based on storage and network feasibility, which makes the system fault tolerant and able to recover from a failing DataNode. Hadoop replicates files based on a rack aware cluster setup: by default it replicates each file block at three principal locations within the cluster, where the first copy is stored on the local node and the other two copies are stored on a remote rack. Additional replicas are stored randomly on any rack, and this behaviour can be configured and overridden using scripts.
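This default placement can be observed directly on a running cluster. As a brief illustration (the file path below is hypothetical, and this is standard HDFS tooling rather than part of the authors' procedure), the fsck utility reports each block of a file together with the replica locations and rack ids chosen by the placement policy:

# Show each block of a file with its replica locations and rack ids
hdfs fsck /user/hadoop/sample.txt -files -blocks -locations -racks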
The rest of this paper is organized as follows. Section II discusses related studies on data replication schemes in clusters; Section III describes the proposed system model of a Hadoop cluster and the data locality problem; Section IV evaluates the performance of the system by conducting experiments on varying data replication levels. Finally, Section V concludes and discusses the future scope of this work.

2. RELATED STUDIES

The purpose of data replication in HDFS is primarily to improve the availability of data. Replication of a data file preserves system reliability when one or more nodes in a cluster fail. Recently, several studies have aimed to improve the fault tolerance of data in the presence of failure, and a few of those are discussed below.

Abad, Lu and Campbell [6] proposed a data replication and placement algorithm (DARE) that adapts to fluctuations in workload. It assumes the scheduler is oblivious of the data replication policy, and it was implemented and evaluated using the Hadoop framework. When local data is not available, a node retrieves data from a remote node, processes the assigned task and discards the data after completion. DARE benefits from these existing remote data retrievals: it selects a subset of the retrieved data and creates replicas without consuming additional network and computation resources. Nodes run the algorithm independently to create replicas of data that are likely to be heavily accessed. The authors designed a probabilistic dynamic replication algorithm with the following features:

1. Nodes sample assigned tasks and replicate popular files in a distributed manner.
2. Correlated data accesses are distributed over diverse nodes as old replicas are deleted and new replicas are created.

Experiments showed a 7-times improvement in data locality and a 70% improvement in cluster scheduling, and the algorithm reduces job turnaround time by 16% in dedicated clusters and 19% in virtualized public clouds.

Seo and Jang [7] proposed two optimization schemes, prefetching and pre-shuffling, to solve shared-environment problems. Both schemes were implemented in a High Performance MapReduce Engine (HPMR). In intra-block prefetching an input split or an intermediate output is prefetched, whereas in inter-block prefetching the whole candidate data block is prefetched. The pre-shuffling scheme reduces the amount of intermediate output to shuffle: at pre-shuffling time HPMR looks over an input split before the map phase begins and predicts the target reducer where the key-value pairs will be segregated. A new task scheduler was designed for pre-shuffling and is used only for the reduce phase. Prefetching improves data locality, and pre-shuffling significantly reduces the shuffling overhead during the reduce phase. The schemes provided the following contributions:

1. Performance degradation analysis of Hadoop in a shared MapReduce computation environment.
2. Prefetching and pre-shuffling schemes to improve MapReduce performance when physical nodes are shared by multiple users.
3. HPMR reduces network overhead and exploits data locality, and is compatible with both dedicated and shared environments.

Khanli and Isazadeh [8] proposed an algorithm to decrease access latency by predicting the future usage of files. Predictive Hierarchal Fast Spread (PHFS) pre-replicates data in a hierarchical data grid using two phases: collecting data access statistics and applying data mining techniques, such as clustering and association rule mining, over the whole system. Files are assigned a value between 0 and 1 representing the relationships between files, and are arranged according to this value into what is called the PWS (predictive working set). PHFS utilizes the PWS of a file and replicates all members of the PWS, including the file itself and all files on the path from source to client. PHFS thus tries to improve data locality by predicting the user's future demands and pre-replicating them in advance, thereby achieving higher availability with optimized usage of resources.

Lee and Lim [9] proposed a data replication scheme (ADRAP) that is adaptive to the overhead associated with the data locality problem. The algorithm works based on access count prediction to reduce data transfer time and improve data locality, thereby reducing total processing time. The scheme adaptively determines the required replication factor by evaluating data access patterns and the recent replication factor for a particular data file. The paper contributes the following:
1. Optimizes the replication factor and effectively avoids the overhead caused by data replication.
2. Dynamically determines data replication requirements.
3. Minimizes the processing time of MapReduce jobs by improving data locality.

Zaharia and Borthakur [10] proposed a delay scheduling method that addresses the conflict between fairness in scheduling and data locality, demonstrated by designing a fair scheduler for a 600-node Hadoop cluster at Facebook. Delay scheduling schedules jobs according to fairness and waits for a small amount of time, letting other jobs launch tasks. It achieves nearly optimal data locality in a variety of workloads and increases throughput by up to 2x while preserving fairness. The algorithm is applicable under a wide variety of scheduling policies beyond fair sharing, such as the Hadoop Fair Scheduler (HFS). HFS has two main goals: fair sharing and data locality. To achieve these goals the scheduler reallocates resources between jobs when the number of jobs changes, by killing running tasks to make room for the new job and by waiting for running tasks to finish. Delay scheduling performs well in typical Hadoop workloads and is applicable beyond fair sharing. Delay scheduling in HFS is generalized to implement a hierarchical scheduling policy motivated by the needs of Facebook's users: the scheduler divides slots between users based on weighted fair sharing at the top level and allows users to schedule their own jobs using either FIFO or fair sharing.

3. PROPOSED WORK

The purpose of this research is to evaluate the performance of the Hadoop cluster and to design a rack aware Hadoop cluster. To achieve this, a data replication scheme is adapted to fit the system and implemented on a rack aware Hadoop cluster in which tasks are run manually with varying levels of data replication. The setup and the shell scripts required for implementation are presented in detail in the following paragraphs. This research work makes a modest contribution on:

• Minimizing processing time and data transfer load between racks by improving data locality.

3.1. Rack Aware Hadoop Clusters

Data availability and locality are interrelated concerns in the realm of distributed processing which, when not handled appropriately, lead to performance issues. When a scheduled task does not have the data required for its processing on the node where it runs, it has to load the data from another node, causing poor throughput. This paper deals with exactly this type of situation, where a cluster of nodes is involved and each node belongs to the same or a different rack. For the purpose of experimental evaluation this paper utilizes a Hadoop cluster setup with one master node and seven slave nodes, each configured manually to define the rack number it belongs to. This paper utilizes an improved data placement policy to prevent data loss and improve network performance.
Data blocks are replicated to multiple machines to prevent data loss due to machine failures, including the case of a switch or power failure. The NameNode stores the locality information of DataNodes in the network topology and utilizes this information to decide where to place data replicas within the cluster. It is assumed that two machines in the same rack have more bandwidth and lower latency between each other than two machines in two different racks, and that cross-rack latency is higher than in-rack latency most of the time. The system designed for evaluation is shown in Figure 1.

[Figure 1. Rack aware Hadoop cluster setup: one master node (MN) running the NameNode (NN) and seven slave nodes (SN) running DataNodes (DN), arranged in four racks connected by switches, with data blocks A, B, C and D replicated across the racks.]
3.2. Access Count Prediction

The scheme maintains a different replication factor per data file, since simply assigning a higher replication factor to a file with a higher access count does not always guarantee better data locality. Data is replicated cautiously: if the replication factor is higher than the access count for a particular file, then the probability of the file being processed with node locality is higher than in the opposite case. To normalize the replication factor, a method that predicts the next access count for a data file is required. The method initializes its variables, then calculates the average time interval between data accesses, and finally predicts the next access, thereby calculating the number of future accesses. The predicted access count is evaluated against the current replication factor to determine the optimal replication factor. In addition, the number of rack-off locality nodes is effectively reduced by the replica placement policy.

To predict the access count of individual files this work utilizes Lagrange's interpolation using a polynomial expression. The mathematical formula is given below:

f(x) = \sum_{i=1}^{N} f_i \prod_{j=1, j \neq i}^{N} \frac{x - x_j}{x_i - x_j}    (1)

In the equation, let N be the number of points, let x_i be the i-th point, and let f_i be the value of the function at x_i. To calculate the predicted access count,

• substitute x by time t, where t is the time of access, and
• take y, the interpolated value, as the access count at t.
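To make the prediction concrete, here is a small worked example with invented numbers (three observed samples, not data from the experiments in this paper). Suppose a file has been accessed 2, 5 and 9 times in total by times t = 10, 20 and 30. Interpolating through the points (10, 2), (20, 5), (30, 9) and evaluating equation (1) at t = 40 gives the predicted access count:

\hat{y}(40) = 2\cdot\frac{(40-20)(40-30)}{(10-20)(10-30)} + 5\cdot\frac{(40-10)(40-30)}{(20-10)(20-30)} + 9\cdot\frac{(40-10)(40-20)}{(30-10)(30-20)} = 2 - 15 + 27 = 14

so the scheme would plan replication for roughly 14 accesses by t = 40. The same computation is easy to script; the following is a minimal sketch in the spirit of the cluster's other shell tooling (our illustration, not the authors' implementation), taking a query time and (time:count) pairs as arguments:

predict-access.sh

#!/bin/bash
# Sketch: predict the next access count of a file from (time:count) samples
# using Lagrange interpolation. Usage: predict-access.sh <t> <t1:y1> <t2:y2> ...
t=$1; shift
printf '%s\n' "$@" | awk -F: -v t="$t" '
  { x[NR] = $1; y[NR] = $2 }              # collect the sample points
  END {
    p = 0
    for (i = 1; i <= NR; i++) {           # build Lagrange basis term L_i(t)
      term = y[i]
      for (j = 1; j <= NR; j++)
        if (j != i) term *= (t - x[j]) / (x[i] - x[j])
      p += term                           # accumulate f_i * L_i(t)
    }
    printf "predicted access count at t=%s: %.2f\n", t, p
  }'

Running predict-access.sh 40 10:2 20:5 30:9 prints 14.00, matching the hand calculation above.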
3.3. Data Placement Policy

In the process of evaluating the data placement policy this work utilized Hadoop clusters that are arranged in racks, since in-rack nodes enjoy much more desirable network traffic than off-rack nodes. The cluster administrator uses the configuration variable net.topology.script.file.name to decide which rack each node belongs to, and this script is configured so that each node runs it to determine its rack id. In a default installation all nodes are assumed to belong to the same rack, with the same rack id. The data placement policy is designed with a Hadoop cluster in mind, which usually is of varying size, and so varies accordingly. In the case of a small cluster, servers are connected by a single switch with two levels of locality, on-machine and off-machine; for larger installations it must be kept in mind that data replicas exist on multiple machines and span multiple racks.

A rack aware Hadoop file system is created with the use of scripts, which allows the master node to map the network topology of the cluster. The script is an executable that returns the rack address of each node: it writes to stdout a list of rack names, one for each input, where the inputs are provided as arguments, such as the IP addresses of the nodes in the cluster, ordered consistently. Mapping scripts are specified with the key topology.script.file.name in conf/hadoop-site.xml, and by default Hadoop will send a set of IP addresses as command line arguments for the script to resolve into rack ids. Rack ids are represented as hierarchical path names: every node has the default rack id /default-rack, and for large installations rack ids are denoted using the entire topology starting from the top-level switch down to the rack name, as in /top-switch-name/rack-name.

3.3.1. Configuring Rack Awareness

To configure a Hadoop cluster into a rack-aware system, data must be divided into multiple file blocks that are stored on different machines across the cluster. In the opposite case, when a Hadoop file system is not rack-aware, there is a possibility that Hadoop will place all copies of a data block in the same rack. Failing to configure Hadoop as a rack-aware system may therefore result in loss of data; rack failure is not frequent, but the risk can be avoided by utilizing the Hadoop configuration files. Hadoop is configured using the topology property, where the actual setup is provided as an input file to the rack identification script. The script below performs rack identification based on node names, given a hierarchical addressing scheme enforced by the network administrator. Configuring rack awareness in Hadoop involves two steps:

• Configure "topology.script.file.name" in core-site.xml:

<property>
  <name>topology.node.switch.mapping.impl</name>
  <value>org.apache.hadoop.net.ScriptBasedMapping</value>
</property>
<property>
  <name>topology.script.file.name</name>
  <value>core/rack-awareness.sh</value>
</property>

• Provide the rack mapping script and its data file:

Rack-awareness.sh
#!/bin/bash
# Map each node name given as an argument to its rack id, using the
# lookup table in topology.data; unknown nodes fall back to the default rack.
HADOOP_CONF=/usr/local/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done

Topology.data

(master)Machine1.pc /dc1/rack1
(slave)Machine2.pc  /dc1/rack1
(slave)Machine3.pc  /dc2/rack2
(slave)Machine4.pc  /dc2/rack2
(slave)Machine5.pc  /dc3/rack3
(slave)Machine6.pc  /dc3/rack3
(slave)Machine7.pc  /dc4/rack4
(slave)Machine8.pc  /dc4/rack4
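With the script and mapping file in place, the configuration can be sanity-checked before running any jobs. As a brief illustration (the node name is taken from the sample Topology.data above, and the admin command is standard Hadoop 2.x tooling rather than part of the authors' procedure):

# Ask the mapping script directly which rack a node resolves to
bash core/rack-awareness.sh "(slave)Machine3.pc"
# prints: /dc2/rack2

# After HDFS restarts, print the rack topology as seen by the NameNode
hdfs dfsadmin -printTopology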
4. EXPERIMENTAL EVALUATION

Experiments were conducted on the rack-aware Hadoop cluster to evaluate its performance in terms of data availability. They involve two scenarios: one involving many data files and the other involving no data files but complex computations. The former is the WordCount application and the latter is Pi value calculation; both experiments were conducted on Hadoop 2.2.0.

Machine configuration: 8 nodes (identical)
Processor: Intel Core i5, 3.3 GHz
RAM: 8 GB
HDD: 400 GB

Nodes within a rack are connected by one Ethernet switch, and one Fast Ethernet switch is used between racks. The size of a data block is set to 64 MB, with the replication factor increasing from 1. Specifically, 8 jobs are run at replication levels ranging from 1 to 8, which is not greater than the number of nodes available in the cluster. Only the result of the map phase is examined, i.e., the completion time and data locality of the map phase, averaged over 8 runs. As expected, in terms of throughput, tasks with node locality perform better than tasks with rack-off locality.
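For reference, runs of this kind can be reproduced with the examples jar that ships with Hadoop 2.2.0. The invocations below are a sketch under assumed paths and parameters (the jar location, input/output directories and replication level are illustrative, not the authors' exact setup):

# Raise the replication factor of the WordCount input to the level under test
hdfs dfs -setrep -w 4 /user/hadoop/wordcount/input

# WordCount: reads many input blocks, so data locality dominates
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
  wordcount /user/hadoop/wordcount/input /user/hadoop/wordcount/output

# Pi estimation: compute-bound with no input data files (16 maps, 100000 samples each)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
  pi 16 100000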
4.1. Results and Graph

Performance evaluations show that as replication levels increase, the task completion time is significantly reduced for computations involving no data files. For computations that do involve data files, the completion time first falls and then shoots up again due to update cost. Both experiments were conducted at replication levels ranging from one to eight, which is not higher than the number of nodes in the cluster.

4.1.1. Experiment for Pi Value

[Figure 2. Replication Levels for Pi Value]

Figure 2 shows that the data replication scheme reduces the task completion time for Pi value calculation. With increasing replication factors there is a steady improvement in performance: when the replication level is increased to 3 the completion time is 337 s, and it reduces considerably further to 8.12 s at replication level 8. This shows that as the replication factor increases, multiple map phases are introduced and the computation speeds up.

4.1.2. Experiment for WordCount

[Figure 3. Replication Levels for WordCount]

Figure 3 shows the task completion time for the WordCount application, where the completion time falls linearly as the replication factor increases, but once the threshold level is reached the performance starts to deteriorate. This shows that computations involving data files do not improve linearly in performance as replication increases. By default the replication level in HDFS is set to 3, which limits the performance, and the completion time there is 132220 s. At higher replication levels the computation speeds up, but once the threshold is reached the completion time shoots up from 1300430 s to 1608600 s.

CONCLUSIONS

Data replication in the Hadoop framework was experimentally investigated with respect to the data locality problem, and the experiments showed that replication improves performance but also degrades it once the threshold limit is crossed. Further, the replication factor was supported by an access count prediction algorithm for data files using Lagrange's interpolation, which optimizes the replication factor per data file. Performance evaluation showed that our data replication scheme reduces the task completion time: the completion time for Pi value calculation started at 680 s and came down to 8.12 s, and similarly the WordCount application started with a completion time of 1384300 s and came down to 1300430 s. But once the threshold limit is reached, the completion time shoots up to 1608600 s.
ACKNOWLEDGEMENTS

The authors would like to thank all those who have shared their valuable inputs, especially Ganeshpandi and Lakshmi of the Department of Information Technology, University College of Engineering Villupuram, Tamilnadu, India, for their insights, suggestions and time throughout the course of this work.

REFERENCES

[1] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., ... Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2), 4.
[2] Agrawal, Divyakant, Sudipto Das, and Amr El Abbadi. Big data and cloud computing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology, pp. 530-533. ACM, 2011.
[3] Hadoop: http://hadoop.apache.org/
[4] Hadoop DFS User Guide: http://hadoop.apache.org/
[5] Fadika, Zacharia, Madhusudhan Govindaraju, Richard Canon, and Lavanya Ramakrishnan. Evaluating Hadoop for data-intensive scientific operations. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pp. 67-74. IEEE, 2012.
[6] Abad, Cristina L., Yi Lu, and Roy H. Campbell. DARE: Adaptive data replication for efficient cluster scheduling. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pp. 159-168. IEEE, 2011.
[7] Seo, Sangwon, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pp. 1-8. IEEE, 2009.
[8] Khanli, Leyli Mohammad, Ayaz Isazadeh, and Tahmuras N. Shishavan. PHFS: A dynamic replication method to decrease access latency in the multi-tier data grid. Future Generation Computer Systems 27, no. 3 (2011): 233-244.
[9] Lee, Jungha, Jong Beom Lim, Heonchang Yu, Daeyong Jung, KwangSik Chung, and JoonMin Gil. Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop. The World Congress in Computer Science, Computer Engineering, and Applied Computing, 2013.
[10] Zaharia, Matei, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on Computer systems, pp. 265-278. ACM, 2010.

Author

Muralikrishnan Ramane has completed a post graduate degree in the field of Computer Science and Engineering, specialized in Distributed Computing Systems. He is currently working as a Lecturer at University College of Engineering Villupuram, Villupuram, India. To his credit he has three International Journal and one International Conference publications. His research interests include Data Management in Distributed Environments such as Grid, P2P and Cloud.