LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER

International Journal of Distributed and Parallel systems (IJDPS) Vol 14, No. 1/2/3/4/5/6, November 2023
DOI: 10.5121/ijdps.2023.14601
Andriavelonera Anselme A.1, Rivosoaniaina Alain N.1, Rakotomalala Francis1, Mahatody Thomas2 and Manantsoa Victor2

1 Laboratory for Mathematical and Computer Applied to the Development Systems, University of Fianarantsoa, Madagascar
2 Professor at the University of Fianarantsoa, Madagascar
ABSTRACT
As more and more computers are interconnected online, the data shared by users multiplies daily. As a result, the amount of data to be processed by dedicated servers rises very quickly. This instantaneous increase in the volume of data to be processed by the server, however, runs into latency during processing, which calls for a model to manage the distribution of tasks across several machines. This article presents a study of load balancing for large data sets on a cluster of Hadoop nodes. We use MapReduce to implement parallel programming and YARN to monitor task execution and submission in the node cluster.
KEYWORDS
Load balancing, Server, Cluster, Hadoop, MapReduce & YARN.
1. INTRODUCTION
Data is considered large when it exceeds the capacity of human analysis and of the usual IT tools for database management.
Currently, most research focuses on load balancing independent tasks [4] [5]. A potential constraint is therefore the management of resources across multiple computers to optimize processing. Because machines are networked, tasks can be redistributed from an overloaded server to an underloaded machine. Load balancing is the set of techniques that achieves a balanced distribution of load over a system's available resources [3].
1.1. Objective and Problem Statement of the Research
In terms of performance, it is still difficult to achieve the best response time when large amounts of data arrive on the server instantaneously. Database performance must therefore be improved in order to process large quantities of data efficiently. The aim of this research is thus to ensure a fair distribution of loads across a cluster of Hadoop nodes. To prevent server saturation, the state of the nodes needs to be monitored in real time while large amounts of data are being loaded. This requires a performance study of the Hadoop database to determine its processing capacity. Implementing parallel programming with MapReduce offers a way to measure the performance of this NoSQL database. For security reasons, it is essential to monitor server status in real time; this is provided by cluster node administration tools such as Ambari and YARN, which are designed to manage and monitor resource
utilization during job execution. Some research has developed a MapReduce-based model to predict task execution times in a heterogeneous cluster [2].
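As an illustration of this kind of real-time monitoring, the hedged Java sketch below polls the ResourceManager for per-node load through the standard org.apache.hadoop.yarn.client.api.YarnClient API (Hadoop 2.8+ signatures assumed). It is a minimal example of the technique, not the monitoring setup used in this study.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterLoadMonitor {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // One report per RUNNING NodeManager: used versus total memory and vcores.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("%s: %d/%d MB, %d/%d vcores, %d containers%n",
                node.getNodeId(),
                node.getUsed().getMemorySize(), node.getCapability().getMemorySize(),
                node.getUsed().getVirtualCores(), node.getCapability().getVirtualCores(),
                node.getNumContainers());
        }
        yarn.stop();
    }
}

Watching these reports while a job is loading data makes it possible to spot an overloaded node before it saturates, which is precisely the condition a load-balancing strategy is meant to prevent.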
Given the various potential threats to today's IT infrastructure, it is essential that the system administrator implement a server load-balancing strategy to prevent overloading. This raises the question: how can loads be managed effectively within a cluster of nodes when processing massive data?
1.2. Contribution
As part of this improvement, the first step is to measure the time it takes to load data onto the cluster, and then to determine the block-size settings required before the data replicates across the nodes. The system and network administrator should master these parameters in order to predict a server's load limit. The study of load balancing then focuses on implementing MapReduce parallel programming to test and simulate the load limit of a cluster of nodes as a function of the number of connected users and the volume of shared data. In short, we propose a new approach that improves the load balancing of a Hadoop node cluster during instantaneous loading of large data sets by implementing MapReduce parallel programming with the necessary parameter settings, as illustrated in the sketch below.
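To make the parameter settings concrete, the minimal Java sketch below sets the HDFS block size and replication factor on the client before copying a data set into the cluster. It is an illustration under assumed values, not the exact configuration used in our experiments: the 256 MB block size, replication factor of 3 and file paths are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWithBlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side settings, applied to files created by this process.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB blocks (placeholder value)
        conf.setInt("dfs.replication", 3);                 // 3 replicas per block (placeholder value)

        FileSystem fs = FileSystem.get(conf);
        // Copy a local data set into HDFS; blocks are cut and replicated
        // according to the settings above (paths are placeholders).
        fs.copyFromLocalFile(new Path("/tmp/dataset.csv"),
                             new Path("/user/hadoop/dataset.csv"));
        fs.close();
    }
}

Timing this copy for growing data volumes yields the loading-time measurements mentioned above.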
2. STATE OF THE ART
In this section, we cover the key elements: the overall architecture of MapReduce, file read and write operations in HDFS, and previous work related to load balancing in a Hadoop cluster.
2.1. MapReduce paradigm
MapReduce is a programming model proposed by Google. It enables parallel processing of large
quantities of data within a cluster of nodes, and automates parallelism and distribution of
computation. Since the system manages parallel execution, the programmer's role is to implement
the two functions Map and Reduce. The Map function transforms input data into a series of
key/value pairs. It assembles the data into key/value pairs according to context. Each node
executing the Map function works on one or more pieces of the initial data. This data must be split into several fragments so that it can be processed in parallel, with the Map operation running on each cluster node. In his article, the author implemented the MapReduce Apriori (MRA) algorithm on an Apache Hadoop cluster. This algorithm uses the two distinct functions, Map and Reduce, with the aim of detecting frequent sets of k items [6].
Map (key1, value1) → List (key2, value2)
The Reduce function processes the values associated with each distinct key generated by the Map operation. The intermediate key/value pairs are forwarded across the nodes so that all pairs sharing the same key end up on the same Reduce node. Finally, each machine performs the Reduce operation for its keys. The Reduce function is written as:
Reduce (key2, List (value2)) → List (value2)
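To make the two functions concrete, the minimal Java word-count sketch below shows how Map and Reduce are implemented for Hadoop. It is an illustration of the paradigm only, not the MRA algorithm cited above, and the class names are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (offset, line) -> list of (word, 1), i.e. (key1, value1) -> List(key2, value2)
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE); // emit intermediate (key2, value2)
      }
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, count), i.e. (key2, List(value2)) -> List(value2)
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    ctx.write(key, new IntWritable(sum));
  }
}
```

The framework handles splitting the input, routing all pairs with the same key to the same reducer, and running both functions in parallel across the cluster, as described above.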
The operation of MapReduce is summarized in Figure 1 below:
Figure 1: Overall MapReduce architecture
The use of MapReduce ensures efficient use of machine resources and is suitable for solving a
wide range of computational problems. A number of optimizations aim to reduce the amount of
data sent over the network. Big data is categorized according to its volume, variety and velocity.
Most of this data is unstructured, quasi-structured or semi-structured, and heterogeneous in
nature.
The sheer volume and heterogeneity of data, coupled with the speed at which it is generated, make it difficult for today's IT infrastructure to manage. Algorithms have been classified according to the various quality measures that affect MapReduce performance [1].
Traditional systems for managing, storing and analyzing data lack the tools to handle such volumes; the data must instead be stored in a distributed file system architecture such as Hadoop, the most widely used system for storing and managing high-volume data. In his article, Gaykar proposes a method for identifying faulty nodes within a large distributed Hadoop environment using machine learning techniques. The method involves collecting the history of each data node, then applying several statistical and machine-learning algorithms to determine the state of the nodes [9].
2.2. MapReduce Technique
MapReduce is a powerful large-scale data processing platform for sorting and merging huge
datasets in parallel, with research underway since the 1980s [22]. Several new large-scale data
processing platforms have emerged, inspired by MapReduce, to improve their scalability,
including Dryad, Hyracks and Spark. These platforms all integrate MapReduce components,
while expanding the options available for query execution. Some have sought to improve MapReduce itself, as described in the article by Chen He, who developed a new scheduling method aimed at improving data locality for map tasks. This technique builds on a FIFO scheduler developed for Hadoop. Experimental results show that this approach significantly improves data locality, thus reducing the response time of map tasks. Moreover, unlike the delay scheduling algorithm, this approach does not require tedious parameter tuning [23]. The study conducted by Vishal examined various
existing models for optimizing Map-Reduce job scheduling with regard to their use in the context
of cloud computing [24].
Mohammad Reza Ahmadi's method is based on the use of a genetic algorithm with a parallel execution structure to improve the processing of relational data in MapReduce clusters. This approach proves particularly advantageous as data volume increases [25]. The MapReduce paradigm has generated a great deal of interest in both academic and industrial circles [26]. Various MapReduce-related tools are available, three of which stand out as particularly significant: Hadoop, Apache Hive and Sqoop. Each of these platforms has its own MapReduce
mechanism [27]. Research into techniques for evolving MapReduce remains active, adapting to varied data. An in-depth analysis of MapReduce and its application in optimization algorithms can be found in Khezr's article [28].
2.3. Reading and writing a file in HDFS
The results of experiments carried out by Nathier Milhem et al. show that HDFS with a distributed cache mechanism offers superior learning performance compared with conventional HDFS, which has no cache, and with HDFS with a centralized cache mechanism [6].
To read a block in HDFS, the client first queries the NameNode to retrieve the list of DataNodes hosting the block's replicas. The client then reads the block from one of these DataNodes, preferring the nearest replica; if a problem occurs with that copy, it falls back to another DataNode.
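As an illustration (a sketch, not the paper's own code), reading a file through the Hadoop FileSystem API in Java lets the client library handle this NameNode/DataNode dialogue transparently; the path /data/input.txt is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    // The client library asks the NameNode for block locations and reads
    // each block from the nearest available DataNode replica.
    try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```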
Figure 2 shows the read/write process for a file in the HDFS system:
Figure 2: Read and write a file in HDFS.
Recently, Vijay's article identifies the factors that contribute to solving performance problems, while exploring other strategies for improving the efficiency of MapReduce queries within a cloud-based environment [7].
Writing a file to an HDFS system proceeds as follows: (a) the client issues a storage request, for example via the main Hadoop management command, hadoop, with the fs option. (b) The client library divides the file into blocks (64 MB by default in early Hadoop versions). (c) The NameNode tells the client which DataNodes to contact. (d) The client directly contacts the DataNode concerned and asks it to store the block. (e) The DataNodes inform the NameNode and replicate the data between them to avoid any data loss. (f) The cycle is repeated for the next block.
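The same API covers the write path sketched in steps (a) to (f): block splitting, DataNode selection and replication all happen behind the create() call. This is a minimal sketch with a hypothetical path, not the paper's code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // create() contacts the NameNode; blocks are then streamed to the
    // DataNodes it designates, which replicate them among themselves.
    try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
      out.writeUTF("hello HDFS");
    }
  }
}
```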
The inherent constraint of HDFS is that it is only extensible at the level of the data nodes, with no possibility of scaling the name nodes or metadata management. The proposed Dynamic Federated Metadata Management (DFMM) approach implements metadata management at the name node level. This DFMM method configures multiple name
nodes using the HDFS federation technique, thus distributing metadata across multiple name
nodes thanks to the dispatching technique. The results obtained highlight the superiority of the
proposed DFMM architecture, combined with the sharding technique, in terms of high scalability
and metadata management efficiency. A comparison between the existing HDFS architecture and
the DFMM architecture suggests that the latter, combined with sharding, improves performance
and ensures more efficient metadata management [8].
2.4. Related work on load balancing
To optimize the processing of massive data in a cluster of Hadoop nodes, researchers have
developed an improved approach to job scheduling. This method takes into account both task size
and expected execution time. In addition, they have improved Hadoop performance by
integrating an aggregation node into the default architecture of the HDFS distributed file system
[11]. Other research has exploited AEML, an acceleration engine specially designed to balance the workload across multiple GPUs; its authors developed an execution model dedicated to heterogeneous tasks [12]. In order to overcome the limitations of the fair scheduling algorithm,
Bawankule proposes two strategies to improve the resource utilization efficiency and
performance of Hadoop [13].
In his article, Weiyu Fu improved Hadoop's resource utilization and performance by optimizing the allocation of reduce tasks. This optimization is based on algorithms inspired by the behavior of ant colonies and bee hives, to ensure more efficient load balancing management [14]. In another paper, load balancing between available fog nodes was investigated to reduce task processing response time, with results indicating a significant improvement in response time [15]. Some authors have improved a load balancing architecture using a hybrid load balancing algorithm.
The proposed model was designed to improve resource utilization by implementing load balancing at the fog layer [16]. Improving load balancing can thus have a significant impact on service quality, which translates positively into the user experience. In his work, the author explored different task scheduling and load balancing algorithms, introducing a new seven-category classification for the Hadoop MapReduce case and providing a detailed analysis of each of these categories [17]. Mostafa's study methodically analyzes load balancing algorithms in fog computing, classifying them into four categories, and discusses the main outstanding challenges and future trends for these algorithms [18].
Load balancing ensures the balanced distribution of workloads generated by a number of data-
connected devices. In-depth investigations have been carried out to integrate this practice into
cloud computing. Ashish presents a study of load balancing algorithms in fog computing,
including an analysis based on various computational metrics as well as a simulation tool for its
deployment [19]. Another paper presents a load balancing method that takes energy efficiency
into account in the context of fog computing, by establishing an energy consumption model
related to the workload on the nodes. In addition, it prioritizes tasks according to the
manufacturing cluster and introduces a multi-agent system to perform distributed scheduling
within the manufacturing cluster [20]. Pradeep's paper aims to explain the importance of critical data analysis in big data and to show how load balancing in cloud computing allows different offload factors to be compared using parameters such as task execution time, distribution of tasks across virtual machines, and energy consumption [21].
2.5. Synthesis
After reviewing the most relevant work on load balancing using the Hadoop cluster and
MapReduce, we can see that most research has focused on two main areas:
- Proposed optimization algorithms: researchers have focused their work on developing
specific algorithms to improve load balancing in a Hadoop environment. These
algorithms are designed to efficiently distribute computing tasks between cluster nodes,
reducing load variations and optimizing system performance.
- Load balancing in the context of big data: other areas of research have focused on load
balancing in big data. In this environment, they ensure that data is distributed evenly
between cluster nodes.
Both approaches aim to improve the efficiency and performance of Hadoop clusters by ensuring
a balanced distribution of resources, whether at processing or data management level.
Researchers continue to focus on these areas to solve load balancing problems in distributed data processing environments. In fact, there are many different methods and techniques for exploiting Hadoop and MapReduce in data processing. From a technical and operational point of view, the implementation of these tools offers various opportunities for companies. Indeed, to continue exploiting the strengths of Hadoop and MapReduce, it is essential to supervise the load on a cluster of nodes during data processing.
As a result, we propose a new architecture, illustrated in Figure 3, that leverages Logstash to improve data loading into the Hadoop cluster; we use this tool to manage log files efficiently. During processing, the performance metrics at the cluster level are evaluated via the MapReduce program implementation, which simultaneously yields the load limit. We also intend to evaluate the performance of the Hadoop system in combination with other big data management tools.
3. EXPERIMENTATION
3.1. Working Environment
The hardware architecture during the experiment is characterized by:
- Master node: Intel Core 2 Duo processor (2 × 2.5 GHz), RAM: 2 GB, operating system: CentOS 7, 80 GB hard disk.
- Data node 1: single-core 2.5 GHz processor, RAM: 1 GB, operating system: CentOS 7, 80 GB hard disk.
- Data node 2: single-core 2.5 GHz processor, RAM: 1 GB, operating system: CentOS 7, 80 GB hard disk.
3.2. Performance Metrics
Table 1 below shows the execution time of MapReduce tasks as a function of the number of nodes and the file size:
Table 1: Experimental data (3 Map and 1 Reduce)

File size (MB) | Number of nodes | Execution time (s)
121.30         | 2               | 54
121.30         | 3               | 37
216.35         | 2               | 93
216.35         | 3               | 64
During the execution of a job, the master node automatically distributes data to the client machines or data nodes. During processing, the system log files generate a performance report. The metric NodeM reflects the query execution time in relation to the number of nodes in the Hadoop cluster and is calculated with the following formula:

NodeM = T / N

where:
- T represents the test execution time;
- N is the number of nodes on which the query is executed.
a) Variance: V(x) = (1/p) · Σ (x_i − x̄)²
b) Covariance: cov(x, y) = σ = (1/n) · Σ (x_i − x̄)(y_i − ȳ)
c) Standard error: σ(x) = √V(x)
d) Correlation coefficient: [x̄ − σ, x̄ + σ]
e) Change of variable: for X ~ N(m, σ), T = (x − m) / σ
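To make these definitions concrete, the following minimal Java sketch (not the authors' code) computes NodeM, the mean, the variance and the standard deviation from the execution times in Table 1. Note that the paper's estimates m = 59.7 and σ = 19.7, used in Section 3.3, presumably come from a larger set of measurements, so the values obtained on these four runs alone differ slightly:

```java
public class ClusterMetrics {
  public static void main(String[] args) {
    double[] times = {54, 37, 93, 64}; // execution times in seconds (Table 1)
    int[] nodes    = {2, 3, 2, 3};     // number of nodes for each run

    // NodeM = T / N for each run
    for (int i = 0; i < times.length; i++) {
      System.out.printf("run %d: NodeM = %.2f s/node%n", i + 1, times[i] / nodes[i]);
    }

    // Mean, variance V(x) = (1/p) * sum((x_i - mean)^2), standard deviation
    double mean = 0;
    for (double t : times) mean += t;
    mean /= times.length;

    double variance = 0;
    for (double t : times) variance += (t - mean) * (t - mean);
    variance /= times.length;

    double sigma = Math.sqrt(variance);
    System.out.printf("mean = %.1f s, V(x) = %.1f, sigma = %.1f s%n", mean, variance, sigma);
  }
}
```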
3.3. Learning
Let X be a random variable with normal distribution N(m, σ). Then T = (X − m)/σ follows the standard normal distribution N(0, 1); hence:
- P(m − σ < X < m + σ) = P(59.7 − 19.7 < X < 59.7 + 19.7) = 0.64 = 64%: cluster under normal load.
- P(m − 2σ < X < m + 2σ) = P(59.7 − 2 × 19.7 < X < 59.7 + 2 × 19.7) = 0.96 = 96%: overloaded cluster.
- P(m − 3σ < X < m + 3σ) = P(59.7 − 3 × 19.7 < X < 59.7 + 3 × 19.7) = 0.998 = 99.8%: saturated cluster.
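A hypothetical sketch (not the authors' code) of how these three thresholds could be applied to classify a measured execution time x, using the paper's estimates m = 59.7 s and σ = 19.7 s:

```java
public class LoadClassifier {
  static final double M = 59.7, SIGMA = 19.7; // estimates reported in the paper

  static String classify(double x) {
    double deviation = Math.abs(x - M);
    if (deviation <= SIGMA)     return "normal load"; // within [m - s, m + s]
    if (deviation <= 2 * SIGMA) return "overloaded";  // within [m - 2s, m + 2s]
    if (deviation <= 3 * SIGMA) return "saturated";   // within [m - 3s, m + 3s]
    return "outlier";                                 // beyond 3 sigma
  }

  public static void main(String[] args) {
    for (double x : new double[] {54, 93, 130}) { // illustrative times in seconds
      System.out.printf("x = %.0f s -> %s%n", x, classify(x));
    }
  }
}
```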
To improve data processing at the level of a cluster of Hadoop nodes, we propose the architecture
described in Figure 3 below as a new approach:
Figure 3: Architecture of the new approach
(1) Transfer of input data to Logstash.
(2) Receipt of data in the Hadoop database.
(3) Computation and data processing using MapReduce parallel programming.
(4) Data storage in the Hadoop Distributed File System (HDFS).
(5) YARN, which manages and monitors the cluster of nodes, as sketched below.
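For step (5), a minimal sketch (not the paper's code) of how the YARN client API can be used to list running applications and their progress, which is the kind of resource monitoring that Ambari and YARN provide:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterMonitor {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration()); // picks up yarn-site.xml
    yarn.start();
    // List every application known to the ResourceManager with its state
    // and progress, e.g. to spot jobs stuck on an overloaded node.
    for (ApplicationReport app : yarn.getApplications()) {
      System.out.printf("%s %s progress=%.0f%%%n",
          app.getApplicationId(), app.getYarnApplicationState(),
          app.getProgress() * 100);
    }
    yarn.stop();
  }
}
```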
4. RESULTS
During the experimentation, we were able to draw the following points:
- Cluster stability varies according to the number of nodes.
- The block size configuration is essential in relation to the input data size, and should be automatic (see the configuration sketch after this list).
- Data replication ensures the reliability of an HDFS system in the event of failure of one of its nodes, which ensures system availability.
- Migrating input data through the Logstash pipeline ensures automatic logging of all incoming data.
- Cluster monitoring is carried out via the Ambari server administration interface, which also alerts the system administrator when the cluster is overloaded.
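As referenced in the list above, block size and replication can be tuned from the client side through the standard HDFS properties dfs.blocksize and dfs.replication; the sketch below uses illustrative values and hypothetical file paths, not the settings of our experiment:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side overrides applied to files created with this configuration.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // 3 replicas per block

    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("access.log"), new Path("/logs/access.log"));
    System.out.println("File loaded with tuned block size and replication.");
  }
}
```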
After MapReduce job execution, we can summarize the following results:
Figure 4: MapReduce algorithm execution times
The graph shown above illustrates a decrease in execution time as the number of nodes in the
cluster is increased. During the experiment, we evaluated the execution of MapReduce jobs using
both 2 and 3 nodes, on two different file sizes, one of 121.30 MB and the other of 216.35 MB.
The results show that the performance criterion of response time favors the multi-node
architecture.
5. CONCLUSIONS
In this paper, we studied the performance of a cluster of nodes during the loading of large volumes of data.
Many experiments were carried out to understand cluster management and the steps involved in
setting up a data pipeline. Using Logstash simplifies the process of transferring data to the
Hadoop database. We carried out our experiments on virtual machines. Since managing and
simulating nodes at cluster level consumes a lot of resources, the choice of tool and block size
settings are very important before controlling the flow of data through the nodes.
This work is limited by the need to reconfigure and rewrite the MapReduce program each time greater precision is required in measuring the performance of a cluster of nodes. The system administrator will need to make these adjustments to the node cluster based on the MapReduce program in order to achieve better results. In terms of prospects, the idea
will be to model an intelligent system for systematically predicting the real-time load of a cluster
of nodes; with this in mind, experiments should be carried out in a physical architecture
specifically adapted to the hardware requirements.
REFERENCES
[1] Khushboo Kalia, Neeraj Gupta, 2020, "Analysis of Hadoop MapReduce scheduling in heterogeneous environment", Ain Shams Engineering Journal.
[2] Abolfazl Gandomi, Ali Movaghar, Midia Reshadi & Ahmad Khademzadeh, 2020, "Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach", The Journal of Supercomputing.
[3] F. Dong and G. Akl, 2006, "Scheduling Algorithms for Grid Computing: State of the Art and Open Problems", Technical Report No. 2006-504.
[4] H. Johansson, J. Steensland, 2006, "A performance characterization of load balancing algorithms for parallel SAMR applications", Technical Report 2006-047, Uppsala University.
[5] H. Shan, L. Oliker, R. Biswas, W. Smith, 2004, "Scheduling in heterogeneous grid environments: The effects of data migration", In Proc. of ADCOM 2004, International Conference on Advanced Computing and Communication.
[6] Nathier Milhem, Laith Abualigah, Mohammad H. Nadimi-Shahraki, Heming Jia, Absalom E. Ezugwu & Abdelazim G. Hussien, 2022, "Enhanced MapReduce Performance for the Distributed Parallel Computing: Application of the Big Data", Studies in Computational Intelligence book series (SCI), volume 1071.
[7] Vijay, V., Nanda, R., 2023, "A Priori Study on Factors Affecting MapReduce Performance in Cloud-Based Environment", Proceedings of Seventh International Congress on Information and Communication Technology.
[8] Praveen M Dhulavvagol, S G Totad, 2023, "Performance Enhancement of Distributed System Using HDFS Federation and Sharding", International Conference on Machine Learning and Data Engineering.
[9] Gaykar, Reshma S.; Khanaa, Velu; Joshi, Shashank D, 2022, "Faulty Node Detection in HDFS Using Machine Learning Techniques", Revue d'Intelligence Artificielle.
[10] Y. Gao, S. Feng and Z. Li, 2020, "A Distributed Cache Mechanism of HDFS to Improve Learning Performance for Deep Reinforcement Learning", IEEE Intl Conf on Parallel & Distributed Processing with Applications.
[11] R. Alanazi, F. Alhazmi, H. Chung & Y. Nah, 2020, "A multi-optimization technique for improvement of Hadoop performance with a dynamic job execution method based on artificial neural network", SN Computer Science, vol. 1, no. 3, pp. 184–211.
[12] Zhuo Tang, Lifan Du, Xuedong Zhang, Li Yang & Ken Li, 2021, "AEML: an acceleration engine for multi-GPU load-balancing in distributed heterogeneous environment", IEEE Transactions on Computers, vol. 71, no. 6.
[13] K. L. Bawankule, R. K. Dewang, and A. K. Singh, 2021, "Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster", International Conference on Distributed Computing and Internet Technology, pp. 289–298, Springer, Cham.
[14] Weiyu Fu, Lixia Wang, 2022, "Load Balancing Algorithms for Hadoop Cluster in Unbalanced Environment", Hindawi Computational Intelligence and Neuroscience.
[15] A.B. Manju, S. Sumathy, 2018, "Efficient Load Balancing Algorithm for Task Preprocessing in Fog Computing Environment", Smart Intelligent Computing and Applications, pp. 291–298.
[16] Mandeep Kaur & Rajni Aron, 2021, "FOCALB: Fog Computing Architecture of Load Balancing for Scientific Workflow Applications", Journal of Grid Computing, volume 19, Article number 40.
[17] Einollah Jafarnejad Ghomi & Amir, 2017, "Load-balancing algorithms in cloud computing", Journal of Network and Computer Applications, Volume 88, Pages 50–71.
[18] Mostafa Haghi Kashani & Ebrahim Mahdipour, 2022, "Load Balancing Algorithms in Fog Computing", IEEE Xplore.
[19] Ashish Chandak & Niranjan Kumar Ray, 2019, "A Review of Load Balancing in Fog Computing", International Conference on Information Technology (ICIT).
[20] Jiafu Wan, Baotong Chen, Shiyong Wang, Min Xia, Di Li & Chengliang Liu, 2018, "Fog Computing for Energy-Aware Load Balancing and Scheduling in Smart Factory", IEEE Transactions on Industrial Informatics.
[21] Pradeep Kumar Shriwas, 2022, "OPJU International Technology Conference on Emerging Technologies for Sustainable Development", IEEE Xplore.
AUTHOR
ANDRIAVELONERA Anselme Alexandre is a PhD student at EDMI (Ecole Doctorale de Modélisation Informatique), University of Fianarantsoa, Madagascar. He has 10 years of experience in system and network administration and is an IT instructor attached to the direction of studies at INSCAE, Madagascar.