Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations

Ms. Mamatha M R¹, Dr. K Thippeswamy²
Dept. of Computer Science
¹ MTech Student, VTU PG Center, Mysuru, India
² Guide & Head of Department, VTU PG Center, Mysuru, India

IDL - International Digital Library of Technology & Research
Volume 1, Issue 2, Mar 2017. Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
1 ABSTRACT: MapReduce is a popular parallel computing paradigm for large-scale data processing in clusters and data centers. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. Because 1) map tasks can only run in map slots and reduce tasks can only run in reduce slots, and 2) map tasks are generally executed before reduce tasks, different job execution orders and map/reduce slot configurations for a MapReduce workload yield significantly different performance and system utilization. This survey proposes two classes of algorithms to minimize the makespan and the total completion time for an offline MapReduce workload. The first class of algorithms focuses on job-ordering optimization for a MapReduce workload under a given map/reduce slot configuration. In contrast, the second class of algorithms considers the scenario in which the map/reduce slot configuration itself can be optimized for a MapReduce workload. We perform simulations as well as experiments on Amazon EC2 and show that the proposed algorithms produce results that are up to 15-80 percent better than currently unoptimized Hadoop, leading to significant reductions in running time in practice.
2 INTRODUCTION: A MapReduce job consists of a set of map and reduce tasks, where reduce tasks are performed after the map tasks. Hadoop, an open-source implementation of MapReduce, has been deployed in large clusters containing thousands of machines by companies such as Amazon and Facebook. In those cluster and data center environments, MapReduce and Hadoop are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). Despite the many research efforts devoted to improving the performance of a single MapReduce job, relatively little attention has been paid to the system performance of MapReduce workloads. This survey therefore tries to improve the performance of MapReduce workloads.

In this survey, we target one subset of production MapReduce workloads consisting of a set of independent jobs (e.g., each job processes distinct data sets with no dependency between each other) with different approaches. For dependent jobs (i.e., a MapReduce workflow), one MapReduce job can only start when its previous dependent jobs have finished their computation, subject to the input-output data dependency. In contrast, for independent jobs, computation can overlap between two jobs: when the current job completes its map-phase computation and starts its reduce-phase computation, the next job can begin its map-phase computation in a pipelined processing mode by taking over the map slots released by the previous job. In this survey we plan to implement the following:

1. We plan to propose a bi-criteria heuristic algorithm to optimize makespan and total completion time simultaneously.

2. We propose slot configuration algorithms for makespan and total completion time.

We also show that there is a proportional feature for them, which is very important and can be used to address the time-efficiency problem of the proposed enumeration algorithms for a large number of total slots.
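To give a flavor of what job-ordering optimization means for two-stage (map-then-reduce) jobs, the sketch below applies the classic Johnson's rule for two-machine flow shops, ordering jobs by their estimated map-phase and reduce-phase durations. This is only an illustrative baseline under assumed job names and durations, not the algorithm proposed in this survey:

    # Illustration only: jobs are (name, est_map_time, est_reduce_time),
    # with made-up duration estimates.
    jobs = [("J1", 4, 2), ("J2", 1, 5), ("J3", 3, 3), ("J4", 6, 1)]

    def johnson_order(jobs):
        """Johnson's rule for a two-machine flow shop: jobs whose map time
        is no longer than their reduce time go first, in increasing map
        time; the rest go last, in decreasing reduce time."""
        front = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
        back = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: -j[2])
        return front + back

    print([j[0] for j in johnson_order(jobs)])  # ['J2', 'J3', 'J1', 'J4']

Johnson's rule minimizes makespan only in the idealized two-machine setting; the difficulty addressed by this survey is that map and reduce phases run on pools of slots and overlap across jobs, which is why more refined ordering and slot-configuration algorithms are needed.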
3 SURVEY

3.1 Dynamic Job Ordering and Slot Configurations for MapReduce Workloads

MapReduce is a widely used computing model for large-scale data processing in cloud computing. A MapReduce job consists of a set of map and reduce tasks, where reduce tasks are performed after the map tasks. Hadoop, an open-source implementation of MapReduce, has been deployed in large clusters containing thousands of machines by companies such as Amazon and Facebook. In those cluster and data center environments, MapReduce and Hadoop are used to support batch processing for jobs submitted from multiple users (i.e., MapReduce workloads). Despite the many research efforts devoted to improving the performance of a single MapReduce job, relatively little attention has been paid to the system performance of MapReduce workloads. Therefore, this survey tries to improve the performance of MapReduce workloads. Makespan and total completion time (TCT) are two key performance metrics. Generally, makespan is defined as the time period from the start of the first job until the completion of the last job in a set of jobs. It captures the computation time of jobs and is often used to measure the performance and utilization efficiency of a system. In contrast, total completion time is the sum of the completion time periods of all jobs, measured from the start of the first job. It is a generalized makespan with queuing time (i.e., waiting time) included. Dividing the total completion time by the number of jobs (i.e., the average completion time) measures the satisfaction with the system from a single job's perspective. Therefore, in this survey, we aim to optimize these two metrics.
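To make these two metrics concrete, the following minimal sketch computes the makespan and the total/average completion time of a finished schedule; the job names and times are made-up values for illustration:

    # Illustration only: each job maps to (arrival_time, finish_time).
    schedule = {"J1": (0, 20), "J2": (0, 35), "J3": (0, 60)}

    start = min(a for a, _ in schedule.values())          # start of first job
    makespan = max(f for _, f in schedule.values()) - start
    tct = sum(f - start for _, f in schedule.values())    # total completion time
    avg_completion = tct / len(schedule)                  # per-job view

    print(makespan, tct, avg_completion)                  # 60 115 38.33...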
We consider production MapReduce workloads whose jobs run periodically to process new data. The default FIFO scheduler is often adopted in order to minimize the overall execution time. The analysis is generally performed offline to optimize the execution of such production workloads. There has been a surge of optimization approaches for this. For example, Rasmussen et al. and Jiang et al. consider low-level I/O-efficient optimization. Agrawal et al. and Nykiel et al. share work across operations by eliminating redundant data access and computation. We evaluate our algorithms using both testbed workloads and synthetic Facebook workloads.

Experiments show that:

1) For the makespan, the job-ordering optimization algorithm achieves approximately a 14-36 percent improvement for testbed workloads, and a 10-20 percent improvement for Facebook workloads. In contrast, with the map/reduce slot configuration algorithm, there is about a 50-60 percent improvement for testbed workloads and a 54-80 percent makespan improvement for Facebook workloads.

2) For the total completion time, there is nearly a 5x improvement with the bi-criteria job-ordering algorithms and a 4x improvement with the bi-criteria map/reduce slot configuration algorithm, for Facebook workloads that contain many small-size jobs.
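The slot-configuration side of the problem can be pictured with a brute-force baseline: given a fixed total number of slots, enumerate every map/reduce split, estimate the makespan under each split, and keep the best. The sketch below uses a deliberately naive makespan model (jobs run one after another, each phase perfectly parallel over its slot pool, no map/reduce overlap); it is an assumed toy model for intuition, not the survey's enumeration algorithm:

    # Toy model (assumption): a job needs M slot-seconds of map work and
    # R slot-seconds of reduce work; makespan ~= sum(M/m + R/r) over jobs.
    jobs = [(40, 10), (25, 30), (60, 15)]   # made-up (map_work, reduce_work)

    def makespan(jobs, m, r):
        return sum(M / m + R / r for M, R in jobs)

    def best_split(jobs, total_slots):
        """Enumerate every map/reduce split of the slot pool, keep the best."""
        cost, m = min((makespan(jobs, m, total_slots - m), m)
                      for m in range(1, total_slots))
        return m, total_slots - m, cost

    m, r, cost = best_split(jobs, 20)
    print(f"map slots={m}, reduce slots={r}, estimated makespan={cost:.1f}")

Even in this toy model, the best split tracks the ratio of total map work to total reduce work, which is the intuition behind the proportional feature mentioned in the introduction for avoiding full enumeration when the total number of slots is large.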
3.2 A Workload Model for MapReduce

MapReduce is a programming model for parallel computing developed by Google. The need to perform analyses on large amounts of data is not specific to cloud service providers, but the MapReduce programming model was developed with this specific audience in mind; apart from Google, it is known to be used by, for example, Yahoo!, MySpace, Facebook, and Twitter. Facebook uses MapReduce for, among other things, business intelligence, spam detection, and advertisement optimization. The name MapReduce originates from the higher-order map and reduce functions originally found in functional programming languages. A MapReduce program is in fact the combination of a map and a reduce function: the map function is applied on the input data and the reduce function is applied on the output of the map function.
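As a reminder of that functional-programming origin, the higher-order map and reduce pair is available in plain Python (reduce lives in functools); this standard-library example, computing a sum of squares, is independent of the survey:

    from functools import reduce

    squares = map(lambda x: x * x, [1, 2, 3, 4])       # higher-order map
    total = reduce(lambda acc, x: acc + x, squares)    # higher-order reduce
    print(total)                                       # 30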
1. A MapReduce job consists of a map function, a reduce function, and input data (on a distributed file system).

2. First, the input data are partitioned into smaller chunks of data.

3. Then, for each chunk of input data, a "map task" runs which applies the map function to the chunk of input data. The resulting output of each map task is a collection of key-value pairs.

4. The output of all map tasks is shuffled; that is, for each distinct key in the map output, a collection is created containing all corresponding values from the map output.
5. Then, for each key-collection resulting from the shuffle phase, a "reduce task" runs which applies the reduce function to the collection of values. The resulting output is a single key-value pair.

6. The collection of all key-value pairs resulting from the reduce step is the output of the MapReduce job. (In Hadoop the reduce outputs are merged after the job has finished, when the user uses the "getmerge" command to get the output from the distributed file system.)
The main place where parallelization is exploited in MapReduce is during the running of the map and reduce tasks, depicted as steps three and five in the above description, respectively. Although in the above overview steps two to six seem to be distinct phases, the more advanced MapReduce implementations run these phases in parallel; a reduce task can, for example, start working as soon as the first key-value pair is emitted by a map task.
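The six steps above condense into a few lines of plain Python. The word-count example below is a common illustration rather than code from the survey; map_fn and reduce_fn are assumed user-defined functions, and the shuffle is simulated with an in-memory dictionary instead of a distributed file system:

    from collections import defaultdict

    # Step 1: a job = map function + reduce function + input data.
    def map_fn(chunk):                        # step 3: chunk -> key-value pairs
        return [(word, 1) for word in chunk.split()]

    def reduce_fn(key, values):               # step 5: key + values -> one pair
        return (key, sum(values))

    chunks = ["deer bear river", "car car river", "deer car bear"]  # step 2

    shuffled = defaultdict(list)              # step 4: group values by key
    for chunk in chunks:
        for key, value in map_fn(chunk):
            shuffled[key].append(value)

    output = [reduce_fn(k, vs) for k, vs in shuffled.items()]        # step 6
    print(sorted(output))  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]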
3.3 Energy Efficiency for Large-Scale MapReduce
Workloads with Significant Interactive Analysis ―
Massive computing clusters are increasingly being
used for data analysis. The sheer scale and cost of
these clusters make it critical to improve their
operating efficiency, including energy. Energy costs
are a large fraction of the total cost of ownership of
datacenters. Consequently, there is a concerted effort
to improve energy efficiency for Internet datacenters,
encompassing government reports, standardization
efforts, and research projects in both industry and
academia. Approaches to increasing datacenter energy
efficiency depend on the workload in question. One
option is to increase machine utilization, i.e., increase
the amount of work done per unit energy. This
approach is favored by large web search companies
such as Google, whose machines have persistently low
utilization and waste considerable energy. Clusters
implementing this approach would service a mix of
interactive and batch workloads, with the interactive
services handling the external customer queries, and
batch processing building the data structures that
support the interactive services. This strategy relies on
predictable diurnal patterns in web query workloads,
using latency-insensitive batch processing drawn from
an "infinite queue of low-priority work" to smooth out
diurnal variations, to keep machines at high utilization.
This survey focuses on an alternative use case: what we call MapReduce with Interactive Analysis (MIA) workloads. MIA workloads contain interactive
services, traditional batch processing, and large-scale,
latency-sensitive processing. The last component
arises from human data analysts interactively
exploring large data sets via ad-hoc queries, and
subsequently issuing large-scale processing requests
once they find a good way to extract value from the
data. Such human-initiated requests have flexible but
not indefinite execution deadlines.
MIA workloads require a very different approach to energy efficiency, one that focuses on decreasing the
amount of energy used to service the workload. As we
will show by analyzing traces of a front-line MIA
cluster at Facebook, such workloads have arrival
patterns beyond the system’s control. This makes MIA
workloads unpredictable: new data sets, new types of
processing, and new hardware are added rapidly over
time, as analysts collect new data and discover new
ways to analyze existing data. Thus, increasing
utilization is insufficient: First, the workload is
dominated by human-initiated jobs. Hence, the cluster
must be provisioned for peak load to maintain good
SLOs, and low-priority batch jobs only partially
smooth out the workload variation. Second, the
workload has unpredictable high spikes compared with
regular diurnal patterns for web queries, resulting in
wasted work from batch jobs being preempted upon
sudden spikes in the workload."
3.4 A Big Data Model with Map Reduce Concept in
Storage System “In recent years there has been an
extraordinary growth of large-scale data processing
and related technologies in both, industry and
academic communities. This trend is mostly driven by
the need to explore the increasingly large amounts of
information that global companies and communities
are able to gather, and has led to the introduction of new tools and models, most of which are designed around
the idea of handling huge amounts of data. Alongside
the need to manage ever larger amounts of
information, other developments such as cloud
computing have also contributed significantly to the
growth of large-scale technologies. Cloud computing
has dramatically transformed the way many critical
services are delivered to customers, posing new
challenges to data centers. As a result, there is a
complete new generation of large-scale infrastructures,
bringing an unprecedented level of workload and
server consolidation, that demand not only new
programming models, but also new management
techniques and hardware platforms. These
infrastructures provide excellent opportunities to build
data processing solutions that require vast computing
resources. However, despite the availability of these
infrastructures and new models that deal with massive
amounts of data, there is still room for improvement, especially with regard to the integration of the two
sides of these environments: the systems running on
these infrastructures, and the applications executed on
top of these systems. A good example of this trend
towards improved large-scale data processing is
MapReduce, a programming model intended to ease
the development of massively parallel applications,
and which has been widely adopted to process large
datasets thanks to its simplicity. While the MapReduce
model was originally used primarily for batch data
processing in large static clusters, nowadays it is
mostly deployed along with other kinds of workloads
in shared environments in which multiple users may
be submitting concurrent jobs with completely
different priorities and needs: from small, almost
interactive, executions, to very long applications that
take hours to complete. Scheduling and selecting tasks
for execution is extremely relevant in MapReduce
environments since it governs a job’s opportunity to
make progress and determines its performance.
However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under- or over-provisioning, as the amount of resources needed to complete a particular job is not obvious a priori."
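To make the provisioning difficulty concrete, a naive back-of-the-envelope estimate of the map/reduce slots a job would need to meet a time budget can be computed from profiled per-task runtimes. The wave model, names, and numbers below are illustrative assumptions, not any actual scheduler's logic:

import math

def slots_needed(num_tasks, avg_task_time, time_budget):
    # Naive wave model: with s slots, num_tasks tasks finish in roughly
    # ceil(num_tasks / s) * avg_task_time. Find the smallest s that
    # fits the budget by bounding the number of task waves allowed.
    waves_allowed = max(1, int(time_budget // avg_task_time))
    return math.ceil(num_tasks / waves_allowed)

# Hypothetical profiled job: 200 map tasks of ~30 s each, 50 reduce
# tasks of ~60 s each, and a 10-minute budget split evenly per phase.
budget = 600.0
map_slots = slots_needed(200, 30.0, budget / 2)     # -> 20
reduce_slots = slots_needed(50, 60.0, budget / 2)   # -> 10
print(map_slots, reduce_slots)

Even this toy estimate depends on profiled task times being stable, which is exactly the information that is not obvious a priori in shared clusters.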
3.5 Interactive Analytical Processing in Big Data
Systems: A Cross-Industry Study of MapReduce
Workloads “Many organizations depend on
MapReduce to handle their large-scale data processing
needs. As companies across diverse industries adopt
MapReduce alongside parallel databases, new
MapReduce workloads have emerged that feature
many small, short, and increasingly interactive jobs.
These workloads depart from the original MapReduce
use case targeting purely batch computations, and
share semantic similarities with large-scale
interactive query processing, an area of expertise of
the RDBMS community. Consequently, recent studies
on query-like programming extensions for MapReduce
and applying query optimization techniques to
MapReduce are likely to bring considerable benefit.
However, integrating these ideas into business-critical
systems requires configuration tuning and
performance benchmarking against real-life production
MapReduce workloads. Knowledge of such workloads
is currently limited to a handful of technology
companies. A cross-workload comparison is thus far
absent, and use cases beyond the technology industry
have not been described. The increasing diversity of
MapReduce operators creates a pressing need to
characterize industrial MapReduce workloads across
multiple companies and industries. Arguably, each
commercial company is rightly advocating for its particular use cases, or the particular problems that
their products address. Therefore, it falls to neutral
researchers in academia to facilitate cross-company
collaboration, and mediate the release of cross-
industry data. In this survey, we present an empirical
analysis of seven industrial MapReduce workload
traces over long durations. They come from
production clusters at Facebook, an early adopter of
the Hadoop implementation of MapReduce, and at
ecommerce, telecommunications, media, and retail
customers of Cloudera, a leading enterprise Hadoop
vendor. Cumulatively, these traces comprise over a
year’s worth of data, covering over two million jobs
that moved approximately 1.6 exabytes spread over
5000 machines. Combined, the traces offer an
opportunity to survey emerging Hadoop use cases
across several industries (Cloudera customers), and
track the growth over time of a leading Hadoop
deployment (Facebook). We believe this survey is the
first study that looks at MapReduce use cases beyond
the technology industry, and the first comparison of
multiple large-scale industrial MapReduce
workloads. Our methodology breaks down each MapReduce workload into three conceptual
components: data, temporal, and compute patterns.
The key findings of our analysis are as follows:
• There is a new class of MapReduce workloads for
interactive, semi-streaming analysis that differs
considerably from the original MapReduce use case
targeting purely batch computations.
• There is a wide range of behavior within this
workload class, such that we must exercise caution in
regarding any aspect of workload dynamics as
"typical".
• Query-like programmatic frameworks on top of
MapReduce such as Hive and Pig make up a
considerable fraction of activity in all workloads we
analyzed.
• Some prior assumptions about MapReduce, such as uniform data access, regular diurnal patterns, and the prevalence of large jobs, no longer hold.
Subsets of these observations have emerged in several studies, each of which looks at only one MapReduce workload. Identifying these characteristics across a rich and diverse set of workloads shows that the observations are applicable to a range of use cases."
3.6 A Storage-Centric Analysis of MapReduce
Workloads: File Popularity, Temporal Locality
and Arrival Patterns “Due to an explosive growth of
data in the scientific and Internet services communities
and a strong desire for storing and processing the data,
next generation storage systems are being designed to
handle peta- and exascale storage requirements. As Big
Data storage systems continue to grow, a better
understanding of the workloads present in these
systems becomes critical for proper design and tuning.
We analyze one type of Big Data storage cluster:
clusters dedicated to supporting a mix of MapReduce
jobs. Specifically, we study the file access patterns of
two multi-petabyte Hadoop clusters at Yahoo! across
several dimensions, with a focus on popularity,
temporal locality and arrival patterns. We analyze two
6-month traces, which together contain more than 940
million creates and 12 billion file open events. We
identify unique properties of the workloads and make
the following key observations:
• Workloads are dominated by high file churn (a high rate of creates/deletes), which leads to 80%–90% of files being accessed at most 10 times during a 6-month period.
• There is a small percentage of highly popular files: less than 3% of the files account for 34%–39% of the accesses (opens).
• Young files account for a high percentage of accesses, but a small percentage of bytes stored. For example, 79%–85% of accesses target files that are at most one day old, yet these add up to only 1.87%–2.21% of the bytes stored.
• The observed request interarrivals (opens, creates
and deletes) are bursty and exhibit self-similar
behavior.
• The files are very short-lived: 90% of file deletions target files that are between 22.27 minutes and 1.25 hours old.
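A sketch of how such popularity and lifetime statistics might be computed from an access trace follows; the log format and field names are assumptions for illustration, not the trace format used in the study:

from collections import Counter

# Hypothetical trace rows: (event, path, timestamp-in-seconds).
trace = [
    ("create", "/logs/a", 0), ("open", "/logs/a", 60),
    ("open", "/logs/a", 120), ("open", "/tmp/x", 90),
    ("create", "/tmp/x", 30), ("delete", "/tmp/x", 4000),
]

opens = Counter(path for event, path, _ in trace if event == "open")
created = {path: ts for event, path, ts in trace if event == "create"}

# Popularity: what fraction of opens hit the single most-opened file?
total_opens = sum(opens.values())
top_share = opens.most_common(1)[0][1] / total_opens

# Churn: lifetime of deleted files (delete time minus create time).
lifetimes = [ts - created[path]
             for event, path, ts in trace if event == "delete" and path in created]

print(f"top-file share of opens: {top_share:.0%}")  # 67%
print("deleted-file lifetimes (s):", lifetimes)     # [3970]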
Derived from these key observations and knowledge
of the domain and application-level workloads running
on the clusters, we highlight the following insights and
implications to storage system design and tuning:
• The peculiarities observed are mostly derived from
short-lived files and high file churn.
• File churn is a result of typical MapReduce
workflows: a high-level job is decomposed into
multiple MapReduce jobs, arranged in a directed
acyclic graph (DAG). Each of these jobs writes its
final output to the storage system, but the output that
interests the user is the output of the last job in the
graph. The output of the intermediate jobs is deleted soon after it is consumed.
• High rate of change in file population prompts
research on appropriate storage media and tiered
storage approaches.
• Caching young files or placing them on a fast storage
tier could lead to performance improvement at a low
cost.
• "Inactive storage" (due to data retention policies and
dead projects) constitutes a significant percentage of
stored bytes and files; timely recovery of files and
appropriate choice of replication mechanisms and
media for passive data can lead to improved storage
utilization.
• Our findings call for a model of file popularity that accounts for a very dynamic population.
To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer."
4 CONCLUSION
This survey focuses on the job ordering and
map/reduce slot configuration issues for MapReduce
production workloads that run periodically in a data
warehouse, where the average execution time of
map/reduce tasks for a MapReduce job can be
profiled from past runs, under FIFO scheduling in a Hadoop cluster. Two performance
metrics are considered, i.e., makespan and total
completion time. We first focus on the makespan, proposing a job ordering optimization algorithm and a map/reduce slot configuration optimization algorithm. We observe that the total completion time can be poor when only the makespan is optimized; therefore, we further propose a new greedy job ordering algorithm and a map/reduce slot configuration algorithm that minimize the makespan and total completion time together. Theoretical analysis is also given for the proposed heuristic algorithms, including approximation ratios and upper and lower bounds on
the makespan. Finally, the survey conducts extensive experiments to validate the effectiveness of the proposed algorithms and their theoretical results.
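As a concrete illustration of makespan-oriented job ordering, the classical Johnson's Rule for two-stage flow shops can be sketched as below, treating each job's profiled map-phase and reduce-phase times as the two stages. This is a simplification for illustration, not the exact algorithm proposed in the surveyed work:

def johnsons_rule(jobs):
    """Order (map_time, reduce_time) jobs to reduce two-stage makespan.

    Jobs whose map time is no longer than their reduce time go first,
    in increasing map time; the rest go last, in decreasing reduce time.
    """
    front = sorted((j for j in jobs if j[0] <= j[1]), key=lambda j: j[0])
    back = sorted((j for j in jobs if j[0] > j[1]), key=lambda j: -j[1])
    return front + back

# Hypothetical profiled (map, reduce) phase times for four jobs.
print(johnsons_rule([(5, 2), (1, 6), (9, 7), (3, 8)]))
# [(1, 6), (3, 8), (9, 7), (5, 2)]

The intuition is to front-load jobs with short map phases so reduce slots start filling early, and to end with jobs whose reduce phases are short so the schedule does not trail off with idle map slots.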