To increase availability in a distributed system, some or all of the data items are replicated and
stored at separate sites. This is an issue of key concern, especially given the proliferation of
wireless technologies and mobile users. However, concurrent processing of transactions at separate
sites can introduce inconsistencies into the stored information. We have built a distributed service that
manages updates to widely deployed counter-like replicas. Many heavy-weight distributed systems target
large, information-critical applications; our system is intentionally lightweight and aimed at more modest
information-critical applications. The service is built on our distributed concurrency control scheme,
which combines optimism and pessimism in the processing of transactions. The service allows a transaction
to be processed immediately (optimistically) at any individual replica as long as the transaction satisfies
a cost bound. All transactions are also processed in a concurrent, pessimistic manner to ensure mutual
consistency.
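Below is a minimal sketch, under assumed names such as `CounterReplica` and `cost_bound`, of how a counter-like replica might apply an update optimistically only while it stays within a per-replica cost bound, deferring anything larger to the pessimistic, globally ordered path; it illustrates the idea, not the authors' implementation.

```python
# Sketch only: a counter replica that applies updates locally while they fit a
# cost bound, and defers the rest to a (not shown) pessimistic global protocol.
class CounterReplica:
    def __init__(self, replica_id, cost_bound):
        self.replica_id = replica_id
        self.cost_bound = cost_bound      # how far this replica may drift locally (assumption)
        self.local_value = 0              # value including optimistic updates
        self.optimistic_debt = 0          # magnitude of not-yet-globally-ordered updates
        self.pending = []                 # updates deferred to the pessimistic path

    def apply(self, delta):
        """Apply `delta` optimistically if it fits the cost bound, else defer it."""
        if self.optimistic_debt + abs(delta) <= self.cost_bound:
            self.local_value += delta
            self.optimistic_debt += abs(delta)
            return "applied-optimistically"
        self.pending.append(delta)
        return "deferred-to-pessimistic-path"

    def on_global_commit(self, delta):
        """Called once the pessimistic protocol has globally ordered `delta`."""
        self.optimistic_debt = max(0, self.optimistic_debt - abs(delta))


replica = CounterReplica("site-A", cost_bound=10)
print(replica.apply(4))    # applied-optimistically
print(replica.apply(20))   # deferred-to-pessimistic-path
```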
Simplified Data Processing on Large Clusters (Harsh Kevadia)
A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. The machines are connected through a fast local area network and are deployed to improve performance over that of a single computer. On the web, large amounts of data must be stored, processed, and retrieved within milliseconds; doing so with a single machine is very difficult, so a cluster of machines is required.
Using a cluster alone is not enough, however; we also need a technique that can perform this processing easily and efficiently. The MapReduce programming model is used for this type of processing. In this model, users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
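As a concrete illustration of the model, the following word-count sketch uses a tiny local driver to stand in for the MapReduce runtime's shuffle and scheduling: the map function emits intermediate key/value pairs and the reduce function merges all values that share a key.

```python
# Word count in the map/reduce style; the driver simulates the shuffle locally.
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)           # intermediate (key, value) pairs

def reduce_fn(word, counts):
    yield (word, sum(counts))             # merge all values for one key

def run_local(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():            # map phase
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)                  # simulated shuffle
    results = {}
    for key, values in groups.items():                 # reduce phase
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

print(run_local({"d1": "big data on a big cluster"}))
# {'big': 2, 'data': 1, 'on': 1, 'a': 1, 'cluster': 1}
```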
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types (anet18)
The Dominant Resource Fairness (DRF) algorithm provides fair resource allocation in a system containing different resource types. DRF is a generalization of max-min fairness to multiple resource types and is used for resource allocation in Hadoop clusters.
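The sketch below illustrates the DRF allocation loop: each user's dominant share is its largest share across resource types, and the next task goes to the user with the smallest dominant share. The capacities and per-task demands are made-up example numbers.

```python
# DRF allocation loop: repeatedly serve the user with the lowest dominant share.
def drf_allocate(capacity, demands, rounds):
    users = list(demands)
    used = {r: 0.0 for r in capacity}
    alloc = {u: {r: 0.0 for r in capacity} for u in users}
    tasks = {u: 0 for u in users}

    def dominant_share(u):
        return max(alloc[u][r] / capacity[r] for r in capacity)

    for _ in range(rounds):
        u = min(users, key=dominant_share)             # lowest dominant share
        if any(used[r] + demands[u][r] > capacity[r] for r in capacity):
            break                                      # no room for another task
        for r in capacity:
            used[r] += demands[u][r]
            alloc[u][r] += demands[u][r]
        tasks[u] += 1
    return tasks

capacity = {"cpu": 9, "mem": 18}
demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
print(drf_allocate(capacity, demands, rounds=20))      # {'A': 3, 'B': 2}
```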
PERFORMANCE EVALUATION OF SQL AND NOSQL DATABASE MANAGEMENT SYSTEMS IN A CLUSTER (ijdms)
In this study, we evaluate the performance of SQL and NoSQL database management systems, namely
Cassandra, CouchDB, MongoDB, PostgreSQL, and RethinkDB. We use a cluster of four nodes to run the
database systems, with external load generators. The evaluation is conducted using data from Telenor
Sverige, a telecommunication company that operates in Sweden. The experiments are conducted using
three datasets of different sizes. The write throughput and latency as well as the read throughput and
latency are evaluated for four queries, namely the distance query, k-nearest-neighbour query, range query, and
region query. For write operations, Cassandra has the highest throughput when multiple nodes are used,
whereas PostgreSQL has the lowest latency and the highest throughput for a single node. For read
operations, MongoDB has the lowest latency for all queries; however, Cassandra has the highest
throughput for reads. The throughput decreases as the dataset size increases for both writes and reads, for
both sequential and random order access, and this decrease is more significant for random
reads and writes. We also report our experience with setting up and configuring these database
management systems.
With the ever increasing number of documents on the web and in other repositories, organizing and
categorizing these documents to meet the diverse needs of users by manual means is a complicated job; hence
a machine learning technique named clustering is very useful. Text documents are clustered by pairwise
similarity, using similarity measures such as Cosine, Jaccard, or Pearson. The best clustering results
are seen when the overlap of terms between documents is small, that is, when clusters are distinguishable. Hence,
to find document similarity for this problem, we apply the link and neighbor notions introduced in ROCK. The link
of a pair of documents specifies the number of neighbors they share; significantly similar documents are called
neighbors. This work applies links and neighbors to Bisecting K-means clustering for identifying seed
documents in the dataset, as a heuristic measure for choosing a cluster to be partitioned, and as a means to
find the number of partitions possible in the dataset. Our experiments on real-world datasets showed a
significant improvement in accuracy with minimal additional time.
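The following small sketch illustrates the neighbor and link notions referred to above: two documents are neighbors when their cosine similarity exceeds a threshold, and the link of a pair is the number of neighbors they share. The tiny corpus and the 0.3 threshold are illustrative assumptions.

```python
# Neighbors via a cosine-similarity threshold; link = number of shared neighbors.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def neighbors(docs, threshold=0.3):
    ids = list(docs)
    return {
        i: {j for j in ids if j != i and cosine(docs[i], docs[j]) >= threshold}
        for i in ids
    }

def link(nbrs, a, b):
    return len(nbrs[a] & nbrs[b])          # shared neighbors of the pair

docs = {
    "d1": "cluster analysis of text documents",
    "d2": "text document clustering with cosine similarity",
    "d3": "genetic algorithms for query optimization",
    "d4": "text clustering of documents",
}
nbrs = neighbors(docs)
print(link(nbrs, "d1", "d2"))   # 1: they share d4 as a neighbor
```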
This document summarizes a research paper that proposes a method for mining interesting medical knowledge from big uncertain data using MapReduce. The key points are:
1) Mining patterns from large amounts of uncertain medical data generated by hospitals can provide useful knowledge but is computationally intensive.
2) The proposed method uses MapReduce to efficiently mine the data for frequent patterns related to user-specified interests or constraints.
3) Focusing the mining in this way makes it more effective and produces personalized results that depend on individual users' needs and interests in the medical data.
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
Optimized Access Strategies for a Distributed Database Design (Waqas Tariq)
Abstract: Distributed database query optimization has been an active area of research for the database research community in this decade. Research work mostly involves mathematical programming and new algorithm design techniques aimed at minimizing the combined cost of storing the database, processing transactions, and communication among the various storage sites. The complete problem, and most of its subsets as well, is NP-hard. Most solutions proposed to date are based on enumerative techniques or heuristics. In this paper we show the benefits of using genetic algorithms (GA) to optimize the sequence of sub-query operations, compared with enumerative methods and heuristics. A stochastic simulator has been designed, and experimental results show encouraging improvements in decreasing the total cost of a query. An exhaustive enumerative method is also applied, and its solutions are compared with those of the GA on various parameters of a distributed query, with up to 12 joins and 10 sites. Keywords: Distributed Query Optimization, Database Statistics, Query Execution Plan, Genetic Algorithms, Operation Allocation.
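The sketch below conveys the encoding idea: a candidate execution order of sub-query operations is a permutation, and a genetic algorithm with order crossover and swap mutation searches for a low-cost order. The cost function is a toy placeholder, not the paper's cost model.

```python
# GA over permutations of sub-query operations; toy_cost stands in for the real
# storage/processing/communication cost model used in the paper.
import random

def toy_cost(order):
    # placeholder: adjacent operations with distant ids "communicate" more
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

def crossover(p1, p2):
    # order crossover: keep a slice of p1, fill the rest in p2's order
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    rest = [g for g in p2 if g not in child]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = rest.pop(0)
    return child

def mutate(order, rate=0.2):
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def ga_optimize(n_ops=12, pop_size=30, generations=200):
    pop = [random.sample(range(n_ops), n_ops) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_cost)
        survivors = pop[: pop_size // 2]               # truncation selection
        children = [
            mutate(crossover(*random.sample(survivors, 2)))
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return min(pop, key=toy_cost)

best = ga_optimize()
print(best, toy_cost(best))
```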
Overlapped clustering approach for maximizing the service reliability of heterogeneous distributed computing systems (IAEME Publication)
This document discusses an overlapped clustering approach for maximizing the reliability of heterogeneous distributed computing systems. It proposes assigning tasks to nodes based on their requirements in order to reduce network bandwidth usage and enable local communication. It calculates the reliability of each node and assigns more resource-intensive tasks to more reliable nodes. When nodes fail, it uses load balancing techniques like redistributing tasks from overloaded or failed nodes to idle nodes in the same cluster. The goal is to improve system reliability through approaches like minimizing network communication, assigning tasks based on node reliability, and handling failures through load balancing at the cluster level.
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm (IRJET Journal)
This document proposes a peer-to-peer data sharing and deduplication system using genetic algorithms. The system would allow organizations in a corporate network to share data by registering with a P2P service provider and launching peer instances. It addresses challenges of scalability, performance, and security for inter-organizational data sharing. The system integrates cloud computing, databases, and P2P technologies. It uses genetic algorithms for deduplication to reduce redundant data storage. The system is intended to provide flexible, scalable, and cost-effective data sharing services for corporate networks based on a pay-as-you-go model.
This document summarizes various techniques for transaction reordering that have been proposed in previous research. It discusses approaches that reorder transactions based on reducing resource conflicts as well as those that aim to increase resource sharing. Specific techniques covered include steal-on-abort, reordering in replicated databases, reordering for continuous data loading, and reordering to enable synchronized scans. The document provides comparisons of these different reordering strategies based on factors such as overhead, throughput, response time and whether they perform lock conflict analysis or enable resource sharing.
The document discusses using a genetic algorithm to schedule tasks in a cloud computing environment. It aims to minimize task execution time and reduce computational costs compared to the traditional Round Robin scheduling algorithm. The proposed genetic algorithm mimics natural selection and genetics to evolve optimal task schedules. It was tested using the CloudSim simulation toolkit and results showed the genetic algorithm provided better performance than Round Robin scheduling.
TAXONOMY OF OPTIMIZATION APPROACHES OF RESOURCE BROKERS IN DATA GRIDS (ijcsit)
We propose a novel taxonomy of replica selection techniques. We studied several data grid approaches
whose data management selection strategies differ. The aim of the study is to determine the common
concepts, observe the approaches' performance, and compare their performance with our strategy.
Survey on load balancing and data skew mitigation in MapReduce applications (IAEME Publication)
This document summarizes a research paper that studied techniques for mitigating data skew and partition skew in MapReduce applications. It describes how skew can occur from unevenly distributed data or straggler nodes. It then summarizes a technique called LIBRA that uses sample map tasks to estimate data distribution, partitions the data accordingly, and allows reduce tasks to start earlier.
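A minimal sketch of the sampling idea follows: estimate the key distribution from sampled map output, then pick partition boundaries so each reducer gets a roughly equal share of records. The function names are illustrative; LIBRA's actual estimator and re-balancing are more involved.

```python
# Range partitioning from a sample of map-output keys (skew-aware sketch).
from bisect import bisect_right

def choose_boundaries(sampled_keys, num_reducers):
    keys = sorted(sampled_keys)
    # pick (num_reducers - 1) cut points at equal quantiles of the sample
    return [keys[(i * len(keys)) // num_reducers] for i in range(1, num_reducers)]

def partition(key, boundaries):
    return bisect_right(boundaries, key)      # reducer index for this key

sample = ["b", "a", "a", "c", "a", "d", "a", "b"]     # skewed toward "a"
bounds = choose_boundaries(sample, num_reducers=3)
for k in ["a", "b", "c", "d"]:
    print(k, "->", partition(k, bounds))
```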
Driven by the need to provision resources on demand, scientists are turning to commercial and research
test-bed cloud computing resources to run their scientific experiments. Job scheduling on cloud computing
resources, unlike on earlier platforms, is a balance between throughput and the cost of execution. Within
this context, we posit that usage patterns can improve job execution, because these patterns allow a system
to plan, stage, and optimize scheduling decisions. This paper introduces a novel approach that uses user
patterns drawn from knowledge-based techniques to improve execution across a series of active workflows
and jobs in cloud computing environments. Using empirical analysis, we establish the accuracy of our
prediction approach for two different workloads and demonstrate how this knowledge can be used to improve
job executions.
In distributed database systems, data does not reside in a single location; it may be stored on multiple computers located at the same physical site, or disseminated over a network of interlinked computers. Distributed databases can improve performance at end-user worksites by allowing transactions to be processed on many machines instead of being limited to one. These transactions may give rise to problems such as deadlock. This paper attempts to detect deadlock in homogeneous distributed database systems, i.e., for local transactions, using a process termination method.
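A minimal sketch in the spirit of that approach: build a wait-for graph over transactions, detect a cycle with a depth-first search, and terminate one transaction on the cycle. The victim-selection rule here is a placeholder.

```python
# Deadlock detection via a cycle in the wait-for graph (txn -> txn it waits on).
def find_cycle(wait_for):
    visited, on_stack = set(), []

    def dfs(node):
        if node in on_stack:
            return on_stack[on_stack.index(node):]        # cycle found
        if node in visited or node not in wait_for:
            return None
        visited.add(node)
        on_stack.append(node)
        cycle = dfs(wait_for[node])
        on_stack.pop()
        return cycle

    for start in list(wait_for):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None

# T1 waits on T2, T2 waits on T3, T3 waits on T1: a deadlock.
wait_for = {"T1": "T2", "T2": "T3", "T3": "T1"}
cycle = find_cycle(wait_for)
if cycle:
    victim = cycle[-1]                 # pick a victim; real systems use cost heuristics
    print("deadlock:", cycle, "-> terminate", victim)
```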
Clustering technology has been applied in numerous applications. It can enhance the performance of
information retrieval systems, and it can group Internet users to help improve the click-through rate of
on-line advertising, among other uses. Over the past few decades, a great many data clustering algorithms
have been developed, including K-Means, DBSCAN, Bi-Clustering, and spectral clustering. In recent years,
two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak
based clustering (DP, 2014). In this work, we empirically compare the performance of these two recent
data clustering algorithms with the state of the art, using 6 external and 2 internal clustering validation
metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP
and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific
dataset, all of AP, DP, and DBSCAN should be considered. Moreover, we find that the comparison of different
clustering algorithms depends strongly on the clustering evaluation metrics adopted. For instance, when
using the Silhouette clustering validation metric, the overall performance of K-Means is as good as that of
AP and DP. This work provides an important reference for researchers and engineers who need to select
appropriate clustering algorithms for their specific applications.
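The snippet below shows the general shape of such a comparison using scikit-learn: the same dataset is clustered with K-Means, DBSCAN, and Affinity Propagation (density peak clustering is not in scikit-learn) and each result is scored with the Silhouette metric. Parameter values are illustrative, not those used in the paper.

```python
# Compare several clustering algorithms with the Silhouette internal metric.
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

algorithms = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "AffinityPropagation": AffinityPropagation(random_state=0),
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:                    # Silhouette needs >= 2 clusters
        print(name, round(silhouette_score(X, labels), 3))
    else:
        print(name, "produced a single cluster")
```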
A framework for clustering time evolving data (iaemedu)
The document proposes a framework for clustering time-evolving categorical data using a sliding window technique. It uses an existing clustering algorithm (Node Importance Representative) and a Drifting Concept Detection algorithm to detect changes in cluster distributions between the current and previous data windows. If a threshold difference in clusters is exceeded, reclustering is performed on the new window. Otherwise, the new clusters are added to the previous results. The framework aims to improve on prior work by handling drifting concepts in categorical time-series data.
On the fly n-wMVD identification for reducing data redundancy (ijdms)
Data cleaning helps reduce redundancy by determining whether two or more records defined differently in a
database represent the same real-world object. If they do, the values of the records are nested, generating
a new n-wMVD database. The limitation of this approach is that addition or deletion of data is not done
dynamically. To overcome this limitation, we propose a dynamic approach that helps reduce redundancy by
comparing every incoming record with existing data and allowing modifications to be made where necessary.
The operations are performed on the fly on the existing n-wMVD databases. When a record is to be inserted,
the approach decides whether to insert it or nest it with an existing record. If a record needs to be
updated, it is checked to see whether a new nesting has to be done or the existing record can be updated.
If a record to be deleted is unique, it is deleted from the database; otherwise it is deleted by removing
it from its nesting. The applicability of our approach was tested, and the results, discussed in this
paper, are promising.
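A minimal sketch of the insert-or-nest decision described above, under the simplifying assumption of a single nested, multi-valued attribute: an incoming record is nested into an existing row when the two agree on every other attribute, and inserted as a new row otherwise.

```python
# On-the-fly insert-or-nest decision for one multi-valued (nested) attribute.
def insert_or_nest(table, record, nested_attr):
    flat = {k: v for k, v in record.items() if k != nested_attr}
    for row in table:
        if {k: v for k, v in row.items() if k != nested_attr} == flat:
            row[nested_attr] |= {record[nested_attr]}     # nest into existing row
            return "nested"
    table.append({**flat, nested_attr: {record[nested_attr]}})
    return "inserted"

table = []
print(insert_or_nest(table, {"name": "Ana", "dept": "CS", "phone": "111"}, "phone"))
print(insert_or_nest(table, {"name": "Ana", "dept": "CS", "phone": "222"}, "phone"))
print(table)   # one row with the phone values nested: {'111', '222'}
```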
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (ijdkp)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional
data. Many significant subspace clustering algorithms exist, each with different characteristics arising
from the techniques, assumptions, and heuristics used. A comprehensive classification scheme that considers
all such characteristics is needed to divide subspace clustering approaches into families; the algorithms
belonging to the same family satisfy common characteristics. Such a categorization will help future
developers better understand the quality criteria to use and the similar algorithms against which to
compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace
Clustering Algorithms' Family). The characteristics of a SCAF are based on classes such as cluster
orientation and overlap of dimensions. As an illustration, we further provide a comprehensive, systematic
description and comparison of a few significant algorithms belonging to the “Axis parallel, overlapping,
density based” SCAF.
DGBSA: A BATCH JOB SCHEDULING ALGORITHM WITH GA WITH REGARD TO THE THRESHOLD ... (IJCSEA Journal)
In this paper, we provide a scheduler for batch jobs that uses a GA together with a threshold detector. The proposed algorithm schedules batches of independent jobs with a new technique so that their schedule can be optimized. To do this, we use a threshold detector; among the selected jobs, processing resources can then process batch jobs by priority. The hierarchy of tasks in each batch is also determined using the DGBSA algorithm. Building on previous work, we add specific parameters to the fitness functions of earlier algorithms to develop an improved fitness function, which is used in the proposed algorithm. According to our evaluation, DGBSA performs better than comparable algorithms: the effective parameters used in the proposed algorithm reduce the total wasted time compared with previous algorithms, and the algorithm improves on previous problems in batch processing with a new technique.
Grid computing can involve many computational tasks that require trustworthy computational nodes. Load balancing in grid computing is a technique that optimizes the overall process of assigning computational tasks to processing nodes. Grid computing is a form of distributed computing, but it differs from conventional distributed computing in that it tends to be heterogeneous, more loosely coupled, and geographically dispersed. Optimizing this process must include maximizing overall resource utilization, balancing the load on each processing unit, and decreasing the overall completion time. Evolutionary methods such as genetic algorithms have been studied for implementing load balancing across grid networks, but these genetic algorithms are quite slow when a large number of tasks needs to be processed. In this paper we present a novel parallel genetic algorithm approach for enhancing the overall performance and optimization of the load balancing process across grid nodes.
PRIVACY PRESERVING DATA MINING BASED ON VECTOR QUANTIZATION (ijdms)
Huge volumes of detailed personal data are continuously collected and analyzed by different types of
applications using data mining, and analysing such data is beneficial to the application users. The data
are an important asset to users such as business organizations and governments for making effective
decisions, but analysing them threatens privacy if not done properly. This work aims to reveal useful
information while protecting sensitive data. Various methods, including randomization, k-anonymity, and
data hiding, have been suggested for this purpose. In this work, a novel technique is suggested that uses
the LBG codebook design algorithm to preserve the privacy of data along with data compression. Quantization
is performed on the training data to produce a transformed dataset. This provides individual privacy while
still allowing extraction of useful knowledge from the data, so privacy is preserved. Distortion measures
are used to analyze the accuracy of the transformed data.
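The sketch below illustrates the quantization step: derive a small codebook from the data and release each record as its nearest codeword, reporting the resulting distortion. K-Means is used here as a stand-in for LBG codebook design; the codebook size and sample data are assumptions.

```python
# Vector quantization for privacy: release codewords instead of raw records.
import numpy as np
from sklearn.cluster import KMeans

def quantize(data, codebook_size=4, random_state=0):
    km = KMeans(n_clusters=codebook_size, n_init=10, random_state=random_state)
    codes = km.fit_predict(data)
    codebook = km.cluster_centers_
    transformed = codebook[codes]                 # each record -> nearest codeword
    distortion = float(np.mean(np.sum((data - transformed) ** 2, axis=1)))
    return transformed, distortion

data = np.array([[21, 50000], [23, 52000], [45, 90000], [47, 91000]], dtype=float)
released, mse = quantize(data, codebook_size=2)
print(released)          # only codewords are released, not the original records
print("distortion:", mse)
```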
TWO LEVEL DATA FUSION MODEL FOR DATA MINIMIZATION AND EVENT DETECTION IN PERI... (pijans)
This document discusses a two-level data fusion model for periodic wireless sensor networks. At the first level, sensor nodes send the most common measurement to cluster heads using similarity functions to minimize data. The second level applies fusion at cluster heads to remove similar multi-attribute measurements using multiple correlation to detect events accurately with minimum delay. Experimental results validate the proposed model reduces data transfer, redundancy, and energy consumption over existing techniques, while also enabling early event detection in emergencies.
Data Mining Un-Compressed Images from cloud with Clustering Compression techn... (ijaia)
This document summarizes a research paper on compressing uncompressed images from the cloud using k-means clustering and Lempel-Ziv-Welch (LZW) compression. It begins by introducing cloud computing and k-means clustering. It then describes using k-means to group uncompressed images and compressing the images using LZW coding to reduce file sizes while maintaining image quality. The document discusses advantages of LZW compression like achieving compression ratios around 5:1. It provides examples of applying k-means clustering and LZW compression to simplify image compression.
DIRA: A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY (IJDKP)
Data integration is the process of collecting data from different data sources and providing the user with
a unified view of answers that meet his requirements. The quality of query answers can be improved by
assessing the quality of data sources according to some quality measures and retrieving data from only the
significant ones. Query answers returned from significant data sources can then be ranked according to the
quality requirements specified in the user query, and the proposed query types return only the top-k query
answers. In this paper, a data integration framework called Data Integration to Return Ranked Alternatives
(DIRA) is introduced. It depends on a data quality assessment module that uses data source quality to
choose the significant sources, and on a ranking algorithm that returns the top-k query answers according
to the different query types.
An Empirical Study for Defect Prediction using Clustering (idescitation)
Reliably predicting defects in software is one of the holy grails of software engineering. Researchers have
devised and implemented defect prediction approaches varying in accuracy, complexity, and the input data
they require. An accurate prediction of the number of defects in a software product during system testing
contributes not only to the management of the system testing process but also to the estimation of the
product’s required maintenance [1]. A prediction of the number of remaining defects in an inspected
artefact can be used for decision making. Defective software modules cause software failures, increase
development and maintenance costs, and decrease customer satisfaction. Defect prediction strives to improve
software quality and testing efficiency by constructing predictive models from code attributes to enable
timely identification of fault-prone modules [2]. In this paper, we discuss how clustering techniques are
used for software defect prediction; this helps developers detect software defects and correct them.
Unsupervised techniques may be used for defect prediction in software modules, especially in cases where
defect labels are not available [3].
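A short sketch of the unsupervised idea: cluster modules by their code metrics and flag the cluster with the higher average complexity as fault-prone, so no defect labels are needed. The metric values below are made up.

```python
# Label-free defect prediction: cluster modules on code metrics, flag the
# cluster whose centroid has the higher mean complexity as fault-prone.
import numpy as np
from sklearn.cluster import KMeans

# columns: lines of code, cyclomatic complexity
modules = np.array([[120, 4], [90, 3], [800, 25], [950, 31], [110, 5], [700, 22]], float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(modules)
risky_cluster = int(np.argmax(km.cluster_centers_[:, 1]))   # higher mean complexity

for metrics, label in zip(modules, km.labels_):
    status = "fault-prone" if label == risky_cluster else "likely clean"
    print(metrics, status)
```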
Survey comparison estimation of various routing protocols in mobile ad hoc ne... (ijdpsjournal)
This document summarizes and compares various routing protocols for mobile ad-hoc networks (MANETs). It first describes the characteristics and challenges of MANETs. It then classifies routing protocols for MANETs into three main categories: table-driven (proactive), on-demand (reactive), and hybrid protocols. Examples of protocols from each category are described in detail, including DSDV, AODV, DSR, and ZRP. Key features such as route discovery, table maintenance, and use of proactive and reactive approaches are discussed for each example protocol. Finally, the document compares different protocols based on parameters like scalability, latency, bandwidth overhead, and mobility impact.
HIGHLY SCALABLE, PARALLEL AND DISTRIBUTED ADABOOST ALGORITHM USING LIGHT WEIG... (ijdpsjournal)
This document describes a highly scalable parallel and distributed implementation of the AdaBoost algorithm using lightweight threads and web services across multiple machines. The implementation achieves nearly linear speedup by distributing the computation of feature types across different machines. In one approach, five machines are used, with each machine computing one of the five feature types. This results in a speedup of 95.1x when using 31 workstations with quad-core processors, reducing the training time to only 4.8 seconds per feature.
On the fly porn video blocking using distributed multi gpu and data mining ap... (ijdpsjournal)
Preventing users from accessing adult videos while still allowing them to access good educational videos
and other materials through a campus-wide network is a big challenge for organizations. Most existing web
filtering systems are based on textual content or link analysis; as a result, potential users cannot access
qualitative and informative video content that is available online. Adult content detection in video based
on motion features or skin detection requires significant computing power and time. The judgment to
identify pornographic videos is made by processing every chunk of the video, each consisting of a specific
number of frames, sequentially one after another. This solution is not feasible in real time, when the user
has started watching the video and the decision about blocking needs to be taken within a few seconds.
In this paper, we propose a model in which the user is allowed to start watching any video while, at the
backend, a porn detection process using extracted video and image features runs on distributed nodes with
multiple GPUs (Graphics Processing Units). The video is processed on a parallel and distributed platform in
the shortest time, and the decision about filtering the video is taken in real time. A track record of
blocked content and websites is also cached; for every new video download, the cache is checked to prevent
repetitive content analysis. On-the-fly blocking is feasible thanks to the latest GPU architectures, CUDA
(Compute Unified Device Architecture), and CUDA-aware MPI (Message Passing Interface). It is possible to
achieve coarse-grained as well as fine-grained parallelism: video chunks are processed in parallel on
distributed nodes, and the porn detection algorithm can also exploit GPU parallelism across the frames of a
chunk on a single node. This ultimately results in blocking porn videos on the fly while allowing
educational and informative videos.
Implementing database lookup method in mobile wimax for location management a... (ijdpsjournal)
This document summarizes a research paper that proposes implementing a database lookup method for location management in a mobile WiMAX network to reduce handover delay and improve bandwidth utilization and throughput. The paper introduces using location management areas (LMAs) within multicast and broadcast service (MBS) zones to minimize handover delay. Existing methods use paging groups to track user locations, increasing bandwidth usage and delay. The proposed method eliminates paging groups and instead stores user location data in a database accessed by an authentication, authorization, and accounting (AAA) server. Simulation results using the OPNET tool show the proposed method reduces handover delay and increases throughput compared to existing methods.
Management of context aware software resources deployed in a cloud environmen... (ijdpsjournal)
This document discusses a new scheduling algorithm proposed for managing requests for context-aware software deployed in a cloud computing environment. The algorithm aims to improve the performance of servers hosting high-demand context-aware applications while reducing cloud providers' costs. It does this by classifying similar context requests and dynamically scoring requests, with the goal of processing requests for similar context data in parallel to reduce response times. The algorithm is evaluated through simulation and found to improve efficiency compared to the gi-FIFO scheduling algorithm.
GPU Accelerated Automated Feature Extraction From Satellite Images (ijdpsjournal)
This document discusses GPU accelerated automated feature extraction from satellite images. The authors describe an algorithm that uses two Laplacian of Gaussian (LoG) masks on panchromatic or multispectral images to detect zero crossing points and extract pixels based on standard deviation. The extracted images from the two LoG masks are combined and can undergo additional smoothing. Applying this algorithm with GPUs provides a substantial performance improvement and speedup of 20 times compared to conventional computing.
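A compact CPU sketch of that pipeline (the paper runs it on GPUs): apply two Laplacian-of-Gaussian filters at different scales, detect zero crossings in each response, and combine the two maps. The sigma values and the synthetic image are illustrative assumptions.

```python
# Two LoG responses -> zero-crossing maps -> combined feature/edge map.
import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(response):
    sign = np.sign(response)
    crossings = np.zeros_like(response, dtype=bool)
    crossings[:-1, :] |= sign[:-1, :] != sign[1:, :]     # vertical neighbours
    crossings[:, :-1] |= sign[:, :-1] != sign[:, 1:]     # horizontal neighbours
    return crossings

def extract_features(image, sigmas=(1.0, 2.5)):
    maps = [zero_crossings(gaussian_laplace(image, sigma=s)) for s in sigmas]
    return maps[0] | maps[1]                              # combine both LoG masks

image = np.zeros((64, 64))
image[20:44, 20:44] = 1.0                                 # a bright square "feature"
edges = extract_features(image)
print("edge pixels:", int(edges.sum()))
```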
Crypto multi-tenant: an environment of secure computing using Cloud SQL (ijdpsjournal)
Today’s most active research area in computing is cloud computing, due to its ability to diminish the costs
associated with virtualization, its high availability, its dynamic resource pools, and the efficiency it
adds to computing. However, it still has drawbacks such as privacy and security. This paper focuses on the
security of the data in the multi-tenant model that arises from the virtualization feature of cloud
computing. We use the AES-128 algorithm and Cloud SQL to protect sensitive data before storing it in the
cloud. When an authorized customer requests the data, the data is first decrypted and then provided to the
customer. The multi-tenant infrastructure is supported by Google, which prefers pushing content in short
iteration cycles. As customers are distributed and their demands can arise anywhere and at any time, data
cannot be stored at a single site; it must be available at different sites as well, and for this faster
access by different users from different places Google is well suited. To obtain high reliability and
availability, data is encrypted before it is stored in the database and updated every time after usage. The
system is very easy to use, without requiring any additional software. An authenticated user can recover
their encrypted and decrypted data, providing efficient data storage security in the cloud.
Advanced delay reduction algorithm based on GPS with Load Balancingijdpsjournal
A Mobile Ad-Hoc Network (MANET) is a self-configuring network of mobile nodes connected by wireless
links, to form an arbitrary topology. The nodes are free to move arbitrarily in the topology. Thus, the
network's wireless topology may be random and may change quickly. An ad Hoc network is formed by
sensor networks consisting of sensing, data processing, and communication components. There is frequent
occurrence of congested links in such a network as wireless links inherently have significantly lower
capacity than hardwired links and are therefore more prone to congestion. Here we propose an algorithm
that reduces delay with the help of a Request_set created on the basis of the location information of the
destination node. The load is distributed equally across the paths found in the Route_reply (RREP)
packets.
Spectrum requirement estimation for imt systems in developing countriesijdpsjournal
In this paper we analyze the methodology developed by the International Telecommunication Union (ITU)
for estimating the spectrum requirement for International Mobile Telecommunications (IMT) systems. The
International Telecommunication Union estimates spectrum requirements by following ITU-R Rec. M.1768.
Although this methodology is adopted by ITU-R, there are discrepancies in estimating the spectrum
requirement for developing countries. ITU estimates the spectrum requirement by considering technical
and market parameters that were provided by the most developed countries with high income and high
development index. Developed countries have a very rapidly expanding telecom market due to the high level
of penetration, dominant user density and usage of high-volume multimedia services. In contrast, developing
countries use less bandwidth-intensive services such as voice communication, low rate data, and low and
medium multimedia. However, while the input parameters are adequate for developed countries, they
do not reflect the status of developing countries. For this reason the ITU spectrum estimation overestimates
the exact spectrum requirements for IMT systems in developing countries. This paper presents an approach
based on technical and market related parameters, which is thought to be applicable for overcoming the
shortcomings of the current ITU methodology in estimating the spectrum requirement for developing
countries like Bangladesh.
Design Of Elliptic Curve Crypto Processor with Modified Karatsuba Multiplier ...ijdpsjournal
ECDSA stands for “Elliptic Curve Digital Signature Algorithm”; it is used to create a digital signature of
data (a file, for example) in order to allow you to verify its authenticity without compromising its security.
This paper presents the architecture of a finite field multiplier: the multiplier used in this processor is a
hybrid Karatsuba multiplier. For the multiplicative inverse we choose the Itoh-Tsujii
Algorithm (ITA). This work presents the design of a high-performance elliptic curve crypto processor (ECCP)
for an elliptic curve over the finite field GF(2^233). The chosen curve is the standard curve for
digital signatures. The processor is synthesized for a Xilinx FPGA.
SURVEY ON QOE\QOS CORRELATION MODELS FORMULTIMEDIA SERVICESijdpsjournal
This paper presents a brief review of some existing correlation models which attempt to map Quality of
Service (QoS) to Quality of Experience (QoE) for multimedia services. The term QoS refers to deterministic
network behaviour, so that data can be transported with a minimum of packet loss, delay and maximum
bandwidth. QoE is a subjective measure that involves human dimensions; it ties together user perception,
expectations, and experience of the application and network performance. The Holy Grail of subjective
measurement is to predict it from the objective measurements; in other words predict QoE from a given set
of QoS parameters or vice versa. Whilst there are many quality models for multimedia, most of them are
only partial solutions to predicting QoE from a given QoS. This contribution analyses a number of previous
attempts and optimisation techniques that can reliably compute the weighting coefficients for the QoS/QoE
mapping.
Survey comparison estimation of various routing protocols in mobile ad hoc ne...ijdpsjournal
MANET is an autonomous system of mobile nodes attached by wireless links. It represents a complex and
dynamic distributed system that consists of mobile wireless nodes that can freely self-organize into an
ad-hoc network topology. The devices in the network may have limited transmission range, so multiple
hops may be needed by one node to transfer data to another node in the network. This leads to the need for an
effective routing protocol. In this paper we study various classifications of routing protocols and their types
for wireless mobile ad-hoc networks, such as DSDV, GSR, AODV, DSR, ZRP, FSR, CGSR, LAR, and Geocast
protocols. In this paper we also compare different routing protocols based on a given set of parameters:
scalability, latency, bandwidth, control overhead, and mobility impact.
Derivative threshold actuation for single phase wormhole detection with reduc...ijdpsjournal
Communication in mobile Ad hoc networks is completed via multi-hop paths. Owing to the distributed
specification and restricted resources of nodes, MANET is highly prone to wormhole attacks; wormhole
attacks place severe threats on both Ad hoc routing protocols and some security enhancements. Thus,
in order to discover wormholes, totally different techniques are in use. In all those techniques the threshold
is fixed merely by a trial-and-error methodology or in a random manner. Moreover, wormhole detection is done
in two phases: the nodes that lie above the threshold are placed in a suspicious set, and a node is then
predicted to be a wormhole by using some other algorithms. Our aim in this paper is to deduce the traffic
threshold level by a derivational approach for identifying wormholes in a single phase in a relay network
having dissimilar characteristics.
Current Studies On Intrusion Detection System, Genetic Algorithm And Fuzzy Logicijdpsjournal
This document summarizes a research paper on current studies of intrusion detection systems using genetic algorithms and fuzzy logic. The paper presents an overview of intrusion detection systems, including different techniques like misuse detection and anomaly detection. It discusses using genetic algorithms to generate fuzzy rules to characterize normal and abnormal network behavior in order to reduce false alarms. The paper also outlines the dataset, genetic algorithm approach, and use of fuzzy logic that are proposed for the intrusion detection system.
BREAST CANCER DIAGNOSIS USING MACHINE LEARNING ALGORITHMS –A SURVEYijdpsjournal
Breast cancer has become common nowadays. Despite this, not all general hospitals
have the facilities to diagnose breast cancer through mammograms. Waiting a long time for a breast
cancer diagnosis may increase the possibility of the cancer spreading. Therefore, computerized
breast cancer diagnosis has been developed to reduce the time taken to diagnose breast cancer and to
reduce the death rate. This paper summarizes a survey on breast cancer diagnosis using various machine
learning algorithms and methods, which are used to improve the accuracy of predicting cancer. This survey
also gives an overview of the number of papers implemented to diagnose breast cancer.
The document proposes a validated real-time middleware called DCPS-HMM for distributed cyber physical systems. DCPS-HMM uses a Hidden Markov Model approach to validate process outputs. It consists of several components: a Process Manager that schedules processes; a Process Allocator that assigns processes to resources based on their periodic/aperiodic nature; a Process Implementation module; a Process Tracker; and a Process Validator that uses HMM to validate outputs against past behavior. The system is simulated for credit card fraud detection and a CPS scenario. It aims to provide a flexible, efficient, and validated middleware for diverse distributed CPS requirements.
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
Advances in Integrated Circuit processing allow for more microprocessor design options. As the Chip Multiprocessor (CMP) becomes the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition, the virtualization of these computation resources exposes the system to a mix of diverse and competing workloads. On-chip cache memory is a resource of primary concern, as it can be dominant in controlling overall throughput. This paper presents an analysis of various parameters affecting the performance of multi-core architectures, such as varying the number of cores and the L2 cache size; we have further varied the directory size from 64 to 2048 entries on 4-node, 8-node, 16-node and 64-node chip multiprocessors. This in turn presents an open area of research on multicore processors with private/shared last-level caches, as the future trend seems to be towards tiled architectures executing multiple parallel applications with optimized silicon area utilization and excellent performance.
Target Detection System (TDS) for Enhancing Security in Ad hoc Networkijdpsjournal
The idea of an ad hoc network is a new paradigm that allows mobile hosts (nodes) to communicate without relying
on a predefined communication infrastructure to keep the network connected. Most nodes are assumed to be mobile and
communication is assumed to be wireless. Ad-hoc networks are collaborative in the sense that each node is
assumed to relay packets for other nodes, which will in return relay its packets. Thus all nodes in an ad-hoc
network form part of the network’s routing infrastructure. The mobility of nodes in an ad-hoc network
means that both the membership and the topology of the network are highly dynamic. It is very difficult to
design a once-for-all target detection system. Instead, an incremental enrichment strategy may be more
feasible. A safe and sound protocol should at least include mechanisms against known attack types. In
addition, it should provide a way to easily add new security features in the future. Due to the
significance of MANET routing protocols, we focus on the recognition of attacks targeted at MANET
routing protocols.
Intrusion detection techniques for cooperation of nodes in a MANET have been chosen as the security
parameter. This includes the Watchdog and Pathrater approach. It also covers reputation-based schemes, in
which the reputation of every node is measured and propagated to every node in the network.
Reputation is defined as a node’s contribution to network operation. The CONFIDANT [23], CORE [25], and
OCEAN [24] schemes are analyzed and compared here based on various parameters.
EFFICIENT SCHEDULING STRATEGY USING COMMUNICATION AWARE SCHEDULING FOR PARALL...ijdpsjournal
In the area of Computer Science, Parallel job scheduling is an important field of research. Finding a best
suitable processor on the high performance or cluster computing for user submitted jobs plays an
important role in measuring system performance. A new scheduling technique called communication aware
scheduling is devised and is capable of handling serial jobs, parallel jobs, mixed jobs and dynamic jobs.
This work focuses on the comparison of communication aware scheduling with the available parallel job
scheduling techniques, and the experimental results show that communication aware scheduling performs
better than the available parallel job scheduling techniques.
Concept Drift Identification using Classifier Ensemble Approach IJECEIAES
Abstract: In an internetworking system, a huge amount of data is scattered, generated and processed over the network. Data mining techniques are used to discover unknown patterns from the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications data is increasing in size with fluctuating patterns, so new features may arrive in the data. This occurs in many applications such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and prices based on demand and supply. Such changes in data distribution reduce the accuracy of classification: some patterns may be discovered as frequent while other patterns tend to disappear and are wrongly classified. To mine such data, traditional classification techniques may not be suitable, as the distribution generating the items can change over time, so data from the past may become irrelevant or even false for the current prediction. For handling such varying patterns of data, a concept drift mining approach is used to improve the accuracy of classification techniques. In this paper we have proposed an ensemble approach for improving the accuracy of a classifier. The ensemble classifier is applied to 3 different data sets. We investigated different features for the different chunks of data, which are then given to the ensemble classifier. We observed that the proposed approach improves the accuracy of the classifier for different chunks of data.
In this paper, a review of data replication protocols from the point of view of consistency is presented. A brief
discussion of consistency models in data replication is given. We also examine propagation
techniques such as eager and lazy propagation. Differences between replication protocols from the consistency
viewpoint are studied, and the advantages and disadvantages of the replication protocols are shown. We
go into the essential technical details and make careful comparisons in order to determine their respective
contributions as well as their restrictions. Finally, some literature research strategies in replication
and consistency techniques are reviewed.
Many real-time systems are naturally distributed, and these distributed systems require not only high
availability but also timely execution of transactions. Consequently, eventual consistency, a weaker form of
consistency than strong consistency, is an attractive choice for a consistency level. Unfortunately, standard
eventual consistency does not contain any real-time considerations. In this paper we have extended eventual
consistency with real-time constraints, which we call real-time eventual consistency. Following this new
definition, we have proposed a method that satisfies it. We present a new algorithm using
revision diagrams and fork-join data in a real-time distributed environment, and we show that the proposed
method solves the problem.
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...ijcsit
This document summarizes a research paper that proposes a new method for improving both fault tolerance and load balancing in grid computing networks. The method converts the tree structure of grid computing nodes into a distributed R-tree index structure and then applies an entropy estimation technique. This entropy estimation helps discard nodes with high entropy from the tree, reducing complexity. The method then uses thresholding and control algorithms to select optimal route paths based on load balance and fault tolerance. Various optimization techniques like genetic algorithms, ant colony optimization, and particle swarm optimization are also applied to reach better solutions. Experimental results showed the proposed method improved performance over other existing methods.
Implementation of Banker’s Algorithm Using Dynamic Modified Approachrahulmonikasharma
Banker’s algorithm is a resource allocation and deadlock avoidance algorithm that checks for safety by simulating the allocation of the predetermined maximum possible amounts of resources and moves the system into a safe state by checking the possible deadlock conditions for all other pending processes. It needs to know how much of each resource a process could possibly request. The number of processes is static in the algorithm, but in most systems processes vary dynamically, and no additional process can be started while it is in execution. The number of resources is also not allowed to go down during execution. In this research an approach for a Dynamic Banker's algorithm is proposed which allows the number of resources to be changed at runtime while preventing the system from falling into an unsafe state. It also gives details about all the resources and processes: which ones require resources and in what quantity. It also allocates resources automatically to a stopped process for execution and will always give an appropriate safe sequence for the given processes.
Online learning algorithms often have to operate in the presence of concept drift. A recent study
revealed that different diversity levels in an ensemble of learning machines are required in order to maintain high
generalization on both old and new concepts. Inspired by this study, and based on a further study of diversity with
different strategies to deal with drifts, we propose a new online ensemble learning approach called Diversity for
Dealing with Drifts (DDD). DDD maintains ensembles with different diversity levels and is able to attain better
accuracy than other approaches. Furthermore, it is very robust, outperforming other drift handling approaches in
terms of accuracy when there are false positive drift detections. It always performs at least as well as
other drift handling approaches under various conditions, with very few exceptions. The paper presents an analysis of low-
and high-diversity ensembles combined with different strategies to deal with concept drift and proposes a new
approach (DDD) to handle drifts.
Adaptive check-pointing and replication strategy to tolerate faults in comput...IOSR Journals
This document summarizes an adaptive checkpointing and replication strategy to tolerate faults in computational grids. It proposes maintaining a balance between the overheads of replication and checkpointing. Tasks are replicated on up to three resources based on each resource's probability of permanent failure. Checkpoints are taken adaptively based on the probability of recoverable failure. If a resource fails permanently, the task resumes from the last checkpoint. If a failure is recoverable, the task resumes on the same resource. This strategy aims to minimize resource wastage from replication while utilizing different resource speeds.
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
This document evaluates the performance of a hybrid differential evolution-genetic algorithm (DE-GA) approach for load balancing in cloud computing. It first provides background on cloud computing and load balancing. It then describes the DE-GA approach, which uses differential evolution initially and switches to genetic algorithm if needed. The results show that the hybrid DE-GA approach improves performance over differential evolution and genetic algorithm alone, reducing makespan, average response time, and improving resource utilization. The study demonstrates the benefits of the hybrid evolutionary algorithm for an important problem in cloud computing.
CLOUD COMPUTING – PARTITIONING ALGORITHM AND LOAD BALANCING ALGORITHMijcseit
This document summarizes a research paper on partitioning algorithms and load balancing algorithms for cloud computing. It discusses how cloud partitioning can improve load balancing and system performance. It reviews different partitioning and load balancing algorithms such as ant colony optimization and honey bee behavior algorithms. The paper presents a model for cloud partitioning based on geographic regions. It also provides experimental results comparing the ant colony and honey bee algorithms, finding that honey bee performs better in balancing load across nodes in the cloud.
An Improved Differential Evolution Algorithm for Data Stream ClusteringIJECEIAES
A few algorithms have been implemented by researchers for performing clustering of data streams. Most of these algorithms require that the number of clusters (K) be fixed by the user based on the input data, and K is kept fixed throughout the clustering process. Stream clustering has faced difficulties in picking K. In this paper, we propose an efficient approach for data stream clustering by adopting an Improved Differential Evolution (IDE) algorithm. The IDE algorithm is a fast, powerful and productive global optimization approach for automatic clustering. In our proposed approach, we additionally apply an entropy-based method for detecting concept drift in the data stream and thereby updating the clustering procedure online. We show that our proposed method, contrasted with a Genetic Algorithm, is identified as a proficient optimization algorithm. The performance of our proposed technique is assessed: it achieves an accuracy of 92.29%, precision of 86.96%, recall of 90.30% and an F-measure estimate of 88.60%.
This document discusses QoS aware replica control strategies for distributed real-time database management systems. It proposes a heuristic approach called Greedy-Cover Firefly algorithm that dynamically places replicas based on QoS requirements and replaces replicas using an adaptive algorithm. The algorithm calculates replication costs and selects optimal nodes for replica placement based on access history. It aims to improve system performance by reducing resources consumed over time while meeting QoS requirements. Simulation results show the proposed algorithms greatly improve the system performance.
Analyzing consistency models for semi active data replication protocol in dis...ijfcstjournal
Data replication is generally used for increasing the accessibility, availability, performance and scalability
of database systems. When implementing data replication mechanisms, we encounter consistency
problems; consistency is one of the important problems in implementing a data replication mechanism. In this paper,
the performance tradeoffs of consistency models for a semi-active data replication protocol in distributed systems
are analyzed. A brief discussion of consistency models in data replication is given. Research on how client-centric
guarantees relate to data-centric models is discussed, and how the guaranteeing conditions of data-centric
consistency models and client-centric consistency models are provided is also analyzed. An analysis of the
consistency model guarantees in terms of multi-client and single-client operation for the semi-active data
replication protocol without failure and leader death is presented. The experimental results show that the
semi-active data replication protocol is appropriate for distributed systems with multi-client replication, such as web services.
Novel Perspectives in Construction of Recovery Oriented ComputingIJSRD
In this paper we present novel views in implementation of a recovery oriented computing system. We discuss the various factors considered while ROC system design and move on to present the existing technologies in this field. Based on this we propose a ROC system design which enhances the robustness of such a system. Finally we state the future directions which the researchers are currently working in this area.
This document evaluates the performance of the First Come First Serve (FCFS) and Easy Backfilling (EBF) resource allocation algorithms in grid computing systems. It compares the resource utilization and throughput of the two algorithms when gridlet size increases linearly and non-linearly. The results show that EBF achieves better resource utilization and throughput than FCFS in both linear and non-linear cases. EBF is more efficient at scheduling jobs to maximize resource usage and the amount of work completed per time period.
Data Structures in the Multicore Age : NotesSubhajit Sahu
The document discusses the challenges of designing concurrent data structures for multicore processors. It begins by explaining Amdahl's Law, which states that the speedup gained from parallelization is limited by the sequential fraction of a program. For mainstream applications, the sequential fraction often involves coordinating concurrent access to shared data structures.
It then presents an example of designing a concurrent stack. It starts with a simple lock-based stack protected by a single lock. While this guarantees linearizability, it suffers from poor scalability due to the centralized locking bottleneck. It also relies on strong scheduling assumptions. The document indicates that future concurrent data structures will need to be more distributed and relaxed in their consistency requirements to achieve scalability on multicore
A CLOUD BASED ARCHITECTURE FOR WORKING ON BIG DATA WITH WORKFLOW MANAGEMENTIJwest
In real environments there are collections of noisy and vague data, called Big Data. On the other hand,
middleware has been developed to work on these data and is now very widely used. The challenge of
working on Big Data is its processing and management. Here, an integrated management system is required
to provide a solution for integrating data from multiple sensors and maximizing target success. This is in a
situation where the system has constant time constraints for processing and real-time decision-making
processes. A reliable data fusion model must meet this requirement and steadily let the user monitor the data
stream. With the widespread use of workflow interfaces, this requirement can be addressed, but working
with Big Data remains challenging. We provide a multi-agent cloud-based architecture with a higher-level vision to
solve this problem. This architecture provides the ability to perform Big Data fusion using a workflow management
interface. The proposed system is capable of self-repair in the presence of risks, and its risk is low.
A LIGHT-WEIGHT DISTRIBUTED SYSTEM FOR THE PROCESSING OF REPLICATED COUNTER-LIKE OBJECTS
International Journal of Distributed and Parallel Systems (IJDPS) Vol.4, No.3, May 2013
DOI : 10.5121/ijdps.2013.4301
A LIGHT-WEIGHT DISTRIBUTED SYSTEM FOR THE
PROCESSING OF REPLICATED COUNTER-LIKE
OBJECTS
Joel M. Crichlow, Stephen J. Hartley
Computer Science Department, Rowan University
Glassboro, NJ, USA
crichlow@rowan.edu, hartley@elvis.rowan.edu
Michael Hosein
Computing and Information Technology Department, University of the West Indies
St. Augustine, Trinidad
mhosein2006@gmail.com
ABSTRACT
In order to increase availability in a distributed system some or all of the data items are replicated and
stored at separate sites. This is an issue of key concern especially since there is such a proliferation of
wireless technologies and mobile users. However, the concurrent processing of transactions at separate
sites can generate inconsistencies in the stored information. We have built a distributed service that
manages updates to widely deployed counter-like replicas. There are many heavy-weight distributed
systems targeting large information critical applications. Our system is intentionally, relatively light-
weight and useful for the somewhat reduced information critical applications. The service is built on our
distributed concurrency control scheme which combines optimism and pessimism in the processing of
transactions. The service allows a transaction to be processed immediately (optimistically) at any
individual replica as long as the transaction satisfies a cost bound. All transactions are also processed in a
concurrent pessimistic manner to ensure mutual consistency.
KEYWORDS
Distributed System, Availability, Replication, Optimistic Processing, Pessimistic Processing, Concurrent
Processing, Client/Server
1. INTRODUCTION
Our system is called COPAR (Combining Optimism and Pessimism in Accessing Replicas). It
runs on a collection of computing nodes connected by a communications network. The
transactions access data that can be fully or partially replicated. Transactions can originate at any
node and the transaction processing system attempts to treat all transactions in a uniform manner
through cooperation among the nodes. We have had test runs on private LANs as well as over the
Internet, and preliminary results have been published and presented (see [1], [2], [3] and [4]).
This paper provides some background to the project in this section, explains the basic interactions
between the optimistic and pessimistic processing in section 2, and discusses recent key upgrades
in sections 3 to 6.
One of the main reasons for replicating the data in distributed systems is to increase availability.
Replication has become increasingly more useful in the face of wireless technology and roaming
users. However, this replication increases the need for effective control measures to preserve
some level of mutual consistency. Several replica control techniques have been proposed to deal
with this issue and these techniques are described to different levels of detail in many articles and
presentations (e.g. see [5], [6], [7], [8], [9], [10] and [11]).
The techniques vary in a number of ways including the number of the replicas that must
participate before a change can be made to a replica, the nature of the communication among the
replicas, and if a replica can be changed before the others how is the change propagated. A key
contributor to the choice of specific procedures is the nature of the application. For example some
applications can tolerate mutually inconsistent replicas for longer periods than others. The twin
objective is to
process the transaction correctly as quickly as possible, and
reflect this at all the replicas so that no harm is done.
One approach is to employ pessimistic strategies which take no action unless there are guarantees
that consistent results and states will be generated. Such techniques sacrifice availability. Another
approach is to employ optimistic techniques that take actions first and then clean up afterwards.
Such techniques may sacrifice data and transaction integrity. Saito & Shapiro [12] and Yu &
Vahdat [13] deal specifically with the issue of optimistic replication.
There is also the matter of failure. How do we achieve consistency and availability when the
network partitions? That is when some nodes cannot communicate with other nodes. In many
cases the key objective remains the same, i.e. to provide a quick response. Although that response
may not be accurate it may be tolerable. Strong arguments have been made for the relaxation of
consistency requirements in order to maintain good availability in the face of network
partitioning. Indeed in what is referred to as Brewer’s CAP theorem, the argument was made that
a system can provide just two from Consistency, Availability and Partition tolerance (see [14] and
[15]).
We will first demonstrate how our system works without partition tolerance to provide
consistency and availability. Then we will discuss a partition tolerance implementation that
maintains availability with delayed or weak consistency. Our system can be described as adhering
to the Base Methodology (see [16]). That is our system conforms to the following:
Basically Available: Provides a fast response even if a replica fails.
Soft State Service: Optimistic processing does not generate permanent state. Pessimistic
processing provides the permanent state.
Eventual Consistency: Optimistic processing responds to users. Pessimistic processing
validates and makes corrections.
The use of a cost bound in the processing of transactions is useful in a system where countable
objects are managed. Lynch et al [17] proposed such a technique as a correctness condition in
highly available replicated databases. Crichlow [18] incorporated the cost bound in a scheme that
combined a simple pessimistic technique with a simple optimistic mechanism to process objects
that are countable. We regard an object as countable if its data fields include only its type and
how many of that object exists. For example an object may be of type blanket and there are one
thousand blankets available.
Every transaction submitted to the system enters concurrently a global pessimistic two-phase
commit sequence and an optimistic individual replica sequence. The optimistic sequence is
moderated by a cost bound, which captures the extent of inconsistency the system will tolerate.
The pessimistic sequence serves to validate the processing and to commit the changes to the
replicas or to undo an optimistic run if it generated an inconsistency. Using this scheme we built
the COPAR service that can provide highly available access to counter-like replicas widely
deployed over a network.
There are several examples of systems that process countable data items. Reservation systems
handle available seats, rooms, vehicles, etc. Distribution systems handle available resources, e.g.
blankets, bottles of water, first-aid kits and so on for disaster relief. Traffic monitoring systems
count vehicles. Therefore our system COPAR although limited to countable objects has wide
applicability.
The main objectives in the design were to:
Provide a high level of availability at a known penalty to the application,
Permit wide distribution of replicas over the network,
Preserve data integrity, and
Build a system that is conceptually simple.
2. COPAR Operation
COPAR uses the client-server model of distributed processing. Servers maintain the “database” of
resources (i.e. the resource counts), and accept transactions from client machines. In our current
prototype there is one client machine called the generator (it generates the transactions) and the
“database” is fully replicated at all the servers. We call these servers the nodes.
Each node maintains two counts of available resources. One count is called the pessimistic or
permanent count; the other count is called the optimistic or temporary count. Changes to the
permanent count are synchronized with all the other nodes over the network using the two-phase
update/commit algorithm (see [19], and [20]). This algorithm forces all the participating nodes to
agree before any changes are made to the count. Thus, this count is the same at all the nodes and
represents true resource availability. The temporary count is maintained separately and
independently by each node.
In general, resource counts for availability are a single non-negative integer R, such as for one
type of resource, or a vector of non-negative integers (R1,R2, ...,Rm), such as for m types of
resources. Similarly, resource counts for transactions are a single integer r, negative for an
allocation and positive for a deallocation or release, or a vector of integers (r1, r2, ..., rm).
When the system is initialized, the permanent count Pjk at each node j (where k ranges from 1 to
m resource types) is set to the initial resource availability Rk. For example let R1 = 2000 first aid
kits, R2 = 1000 blankets, or R3 = 4000 bottles of water for disaster relief. Then the Pjk for 4 nodes
will be initialized as in Table 1:
Table 1. An initial state at 4 nodes
Node 1: P11 = 2000, P12 = 1000, P13 = 4000
Node 2: P21 = 2000, P22 = 1000, P23 = 4000
Node 3: P31 = 2000, P32 = 1000, P33 = 4000
Node 4: P41 = 2000, P42 = 1000, P43 = 4000
The temporary count Tjk at each node is set to the initial permanent count divided by the number
of nodes n. Tjk is then adjusted upward by an over-allocation allowance c, called the cost bound,
where c >= 1. Therefore,
Tjk = c * Pjk /n
For example, if there are four nodes, if R1 is 100, and if c is 1.16, then Pj1 is set to 100 and Tj1 is
set to 29 at each node as in Table 2:
Table 2. Initial permanent and temporary counts at 4 nodes
Node 1: P11 = 100, T11 = 29
Node 2: P21 = 100, T21 = 29
Node 3: P31 = 100, T31 = 29
Node 4: P41 = 100, T41 = 29
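To make the initialization rule concrete, here is a minimal sketch (our own illustrative code, not the COPAR implementation; rounding Tjk to a whole resource count is an assumption) that reproduces the Table 2 numbers:

def init_counts(initial_availability, n_nodes, cost_bound):
    # Pjk starts at the full availability Rk at every node.
    permanent = [list(initial_availability) for _ in range(n_nodes)]
    # Tjk = c * Pjk / n, rounded to a whole resource count (rounding is our assumption).
    temporary = [[round(cost_bound * r / n_nodes) for r in initial_availability]
                 for _ in range(n_nodes)]
    return permanent, temporary

# Table 2 scenario: 4 nodes, one resource type with R1 = 100 and c = 1.16.
P, T = init_counts([100], n_nodes=4, cost_bound=1.16)
print(P[0], T[0])   # [100] [29] at every node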
Most reservation/allocation systems allow some over-allocation to compensate for reservations
that are not used, such as passengers not showing up for an airline flight or people not picking up
supplies when delivered to a relief center. There is a cost involved in over-allocation, such as
compensating passengers denied boarding on an airline flight. Organizations using a
reservation/allocation system must carefully evaluate the cost of over-allocation and limit it to
what can be afforded or tolerated.
Currently, interaction with the system is simulated by generating a transaction ti, which makes a
request (r1, r2, ..., rm), i.e. for ri resources of type i, where i ranges from 1 to m types of resources.
This request is sent to a node j. This node is then considered the parent or owner of the
transaction.
The m integers in a transaction are generated randomly and the node j is chosen at random from 1
to n, where there are n nodes. Transactions from the generator are numbered sequentially.
Additions to the pool of resources are handled differently. Such additions are discussed in section
4.
For example, a transaction may make a request (-10, -20, -100) for 10 first aid kits, 20 blankets
and 100 bottles of water, where there are 3 resource types: type1 – first aid kits, type 2 – blankets
and type 3 – bottles of water. Furthermore a transaction deallocating or returning 10 first aid kits,
20 blankets and 100 bottles of water may be expressed as (10, 20, 100).
Each node maintains two queues of transactions, called the parent or owner queue and the child
queue. A parent node, on receiving a transaction, adds that transaction to its parent queue and to
its child queue. The transaction is also broadcast to all nodes to be added to each node’s child
queue. The transactions ti in each node’s parent queue, are kept sorted in order of increasing i, in
other words, in the order generated by the transaction generator.
Note that a particular transaction ti is in exactly one node’s parent queue. Each node j has two
processors (threads), one responsible for maintaining the parent queue and the permanent count
Pjk at the node, and the other responsible for maintaining the child queue and the temporary count
Tjk at the node (see Figure 1).
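The per-node state just described, two queues and two count vectors served by a permanent and a temporary thread, can be summarized in a short sketch. The class and field names below are ours, introduced only to mirror the description:

from collections import deque
from dataclasses import dataclass
import threading

@dataclass
class Transaction:
    tid: int        # sequence number assigned by the generator
    request: list   # rk per resource type: negative = allocation, positive = release

class Node:
    """Illustrative per-node state (field names are ours, not the paper's)."""
    def __init__(self, node_id, permanent, temporary):
        self.node_id = node_id
        self.permanent = permanent    # Pjk, kept mutually consistent by two-phase commit
        self.temporary = temporary    # Tjk, maintained independently at this node
        self.parent_queue = []        # transactions this node owns, sorted by tid
        self.child_queue = deque()    # every transaction, handled optimistically
        self.lock = threading.Lock()  # the permanent and temporary threads share this state

    def receive_transaction(self, txn):
        # A parent node places the transaction on both of its queues; the broadcast
        # that adds it to the other nodes' child queues is omitted from this sketch.
        with self.lock:
            self.parent_queue.append(txn)
            self.parent_queue.sort(key=lambda t: t.tid)
            self.child_queue.append(txn)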
The permanent processor at each node participates in a two-phase commit cycle with all the other
node permanent processors. After the processing of transaction ti−1 by its parent, the node whose
parent queue contains transaction ti becomes the coordinator for the two-phase commit cycle that
changes the permanent count Pjk at all nodes j to Pjk + rk for k = 1, 2, ...,m. The temporary counts
are also forced to change after permanent processing. We will discuss this in the following
section.
The change to the permanent count is subject to the restriction that Pjk + rk is nonnegative for all
k. If that is not the case, all Pjk are left unchanged and the transaction ti is marked as a violation.
This in effect means that if a request cannot be totally granted then nothing is granted. (This will
be upgraded during further research to allow non-violation if Pjk + rk is nonnegative for at least
one k, i.e. to allow granting of partial requests). At the end of the two-phase commit cycle, the
owner (parent) of transaction ti sends a message to all nodes, including itself, to remove ti from
the node’s child queue if ti is present.
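The all-or-nothing rule can be written down compactly. The helper below is only a sketch of the check applied to the permanent counts at each node; the two-phase commit messaging around it is omitted:

def commit_permanently(permanent_j, request):
    """All-or-nothing rule applied during the two-phase commit:
    commit only if Pjk + rk is non-negative for every resource type k."""
    if any(p + r < 0 for p, r in zip(permanent_j, request)):
        return False                 # violation: nothing is granted, counts unchanged
    for k, r in enumerate(request):
        permanent_j[k] += r          # new Pjk = old Pjk + rk
    return True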
Temporary processing takes place concurrently with permanent processing. The temporary
processor at each node j removes the transaction th at the head of its child queue, if any, and
calculates if the request rk made by th can be allocated or satisfied from its temporary (optimistic)
count Tjk. In other words, node j checks if it is the case that Tjk +rk is non-negative for all k = 1, 2,
… m. If that is not the case, th is discarded (This will be upgraded during further research so that
transaction th is not discarded if Tjk + rk is nonnegative for at least one k); otherwise, node j sets
Tjk to Tjk + rk and sends a message to the parent (owner) node of the transaction, i.e. the node
whose parent queue contains the transaction.
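Temporary processing follows the same pattern against Tjk. In the sketch below, which builds on the Node sketch above, send_to_parent stands in for whatever messaging the prototype actually uses and is purely illustrative:

def process_optimistically(node, send_to_parent):
    """Temporary thread at node j: take the head of the child queue and try to
    satisfy it from the temporary counts Tjk."""
    if not node.child_queue:
        return
    txn = node.child_queue.popleft()
    if any(t + r < 0 for t, r in zip(node.temporary, txn.request)):
        return                            # cannot satisfy: the transaction is discarded here
    for k, r in enumerate(txn.request):
        node.temporary[k] += r            # Tjk = Tjk + rk
    send_to_parent(txn, node.node_id)     # tell the owner this node did it optimistically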
When a parent node n receives such a message from node j for transaction th, node n makes two
checks.
• Is this the first such message received from any node’s temporary processor for transaction th?
• Has transaction th been done permanently yet?
If this is not the first such message, a reply is sent to node j that it should back out of the
temporary allocation it did for th, that is, change its temporary count Tjk to Tjk − rk. This operation
is necessary since another node will have done the temporary processing. This is possible because
all the nodes get a chance to make an individual response to a request. The fastest one wins.
A temporary transaction may later have to be “undone”, so the owner must record who performed it. Therefore, if this is the first such message and if the transaction th has not yet been done permanently (pessimistically), node j, the sender of the message, is marked as the node having done transaction th temporarily (optimistically). If this is
the first such message, but transaction th has already been done permanently, no node is recorded
as having done the transaction temporarily.
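The owner’s bookkeeping for these messages can be sketched as follows; the field and enum names are assumptions introduced only for illustration.

final class OwnerBookkeeping {
    enum Reply { KEEP, BACK_OUT }

    static final int NONE = -1;
    boolean firstMessageSeen = false;  // has any "done temporarily" message arrived for t_h?
    boolean donePermanently = false;   // has t_h already gone through the two-phase commit?
    int temporaryDoer = NONE;          // node recorded as having done t_h optimistically

    // Handles a "done temporarily" message from node j for this transaction.
    Reply onTemporarilyDone(int j) {
        if (firstMessageSeen) {
            return Reply.BACK_OUT;     // another node was faster: node j must set T_jk to T_jk - r_k
        }
        firstMessageSeen = true;
        if (!donePermanently) {
            temporaryDoer = j;         // node j may later have to be "undone"
        }
        // If t_h was already done permanently, no node is recorded as the temporary doer.
        return Reply.KEEP;
    }
}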
When the permanent processor in a node j coordinates the two-phase commit for a transaction ti
and has decided that transaction ti is a violation, that is, Pjk + rk is negative for one or more k,
node j checks to see if the transaction was marked as having been done optimistically earlier by
some node’s temporary processor. If so, the transaction ti is marked as “undone,” meaning that a
promised request cannot now be granted.
Figure 1. Each node A and B has a parent queue P and a child queue C. Node A owns
transactions 1 and 2 and will process these pessimistically in a two-phase commit protocol
involving A and B. Node B owns transaction 3 and will process it pessimistically in a two-phase
commit protocol involving A and B. Concurrently nodes A and B process transactions 1, 2 and 3
optimistically.
If no node has done the transaction optimistically and it is not a violation, the owner’s temporary
processor allocation Tjk is “charged” for it, Tjk = Tjk + rk. This is done to lessen the probability of a
later transaction being performed optimistically but then marked “undone” by the permanent
processor.
3. UPDATING OPTIMISTIC COUNTS AFTER PESSIMISTIC/PERMANENT PROCESSING
The temporary optimistic counts Tjk change at the end of optimistic transaction processing. The
pessimistic counts Pjk change at the end of permanent transaction processing. Whenever there is a
change to Pjk, this should generate an update to Tjk that is consistent with the new Pjk. Therefore Tjk is updated both by the temporary optimistic processing and after the pessimistic permanent processing.
As is stated above, when the system is initialized, the permanent count Pjk at each node j (where k
ranges from 1 to m resource types) is set to the initial resource availability Rk. However,
permanent processing of a transaction generates a new Pjk where
(new) Pjk = (old) Pjk + rk
Therefore a new Tjk is generated where
Tjk = c * (new) Pjk * wjk
Note that there is a difference between how Tjk is derived here and how it was derived initially: the division by n is replaced by multiplication by wjk. The weight wjk captures the amount of allocation done by a node and influences the reallocation of the Tjk values.
We will now discuss how the wjk is calculated. Permanent processing uses the two-phase commit
protocol which requires a coordinating node. Permanent processors via the coordinator maintain a
running total of allocations done by each node. Let rajk be the total allocations of resource k done
by node j on completion of permanent processing. Let RAk be the total allocations of resource k done by all the nodes at the end of permanent processing. We let
wjk = (rajk + 1) / (RAk + n), where n is the number of nodes.
Note that initially rajk and RAk are equal to zero, therefore on initialization wjk is equal to 1/n. This
is consistent with the initial derivation of Tjk.
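A sketch of this computation follows; the helper names and the rounding to whole resources are assumptions made for illustration.

final class Weights {
    // w_jk = (ra_jk + 1) / (RA_k + n): ra_jk is node j's running total of allocations of
    // resource k, RA_k is the total over all nodes, and n is the number of nodes.
    static double weight(long raJk, long totalRAk, int n) {
        return (raJk + 1.0) / (totalRAk + n);
    }

    // New temporary count after permanent processing: T_jk = c * P_jk * w_jk.
    static long newTemporaryCount(double c, long pJk, double wJk) {
        return Math.round(c * pJk * wJk);
    }
}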
The coordinating parent processor can now use
Tjk = c * Pjk * wjk
to compute the new temporary counts for optimistic processing. However, there is a problem here. While the coordinated pessimistic processing was being done, the optimistic temporary processors were still running. Therefore the information used in the computation of the Tjk can be stale. That is, the rajk used in the computation of the new Tjk for node j could have changed due to further optimistic processing by that node.
We must therefore distinguish between two values of rajk. Let the one that was used by pessimistic processing to compute the new count be called rajk,recorded and the current one be rajk,current. When a temporary processor receives Tjk from the permanent processor, it adjusts Tjk as follows in order to reflect its current position:
Tjk = Tjk – (rajk,current – rajk,recorded)
If this result is negative it is set to 0. The zero value forces temporary processing at this node to stop.
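A sketch of this adjustment is given below; the class and method names are assumptions.

final class StaleCountAdjustment {
    // T_jk = T_jk - (ra_jk,current - ra_jk,recorded), clamped at zero.
    // A zero result stops temporary processing at this node until a positive count arrives.
    static long adjust(long newTjk, long raCurrent, long raRecorded) {
        long adjusted = newTjk - (raCurrent - raRecorded);
        return Math.max(adjusted, 0);
    }
}

With the figures from the worked example that follows, adjust(22, 36, 30) gives 16, adjust(15, 24, 20) gives 11, adjust(7, 7, 10) gives 10, and adjust(7, 18, 10) gives 0.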
For example, let R1 denote resources of type 1 (say blankets) initially 100 and c = 1.1.
Then Pj1 = 100 and Tj1 = 110/n.
Let there be 3 replicas i.e. n = 3.
Therefore T11 = T21 = T31 = 37.
Let permanent processors at nodes 1, 2, 3 record allocations of 30, 20, 10 blankets respectively.
Therefore
ra11,recorded = 30, ra21,recorded = 20, and ra31,recorded = 10.
Therefore Pj1 = R1 is now 40 (i.e. 100 – 30 – 20 – 10) and
Tj1 = 1.1 * 40* wj1
Therefore
T11 = ((30 + 1) / (60 + 3)) * 44 = 22
Assume that 6 more blankets were allocated at temporary processor 1.
Therefore
T11 = T11 - (ra11,current - ra11,recorded) = 22 – (36 – 30) = 16
We now compute the new temporary count for temporary processor 2.
T21 = ((20 + 1) / (60 + 3)) * 44 = 15
Assume 4 more blankets were allocated at temporary processor 2.
Therefore
T21 = T21 - (ra21,current - ra21,recorded) = 15 – (24 – 20) = 11
We now compute the new temporary count for temporary processor 3.
T31 = ((10 + 1) / (60 + 3)) * 44 = 7
Assume 3 blankets were returned/deallocated at temporary processor 3.
Therefore
T31 = T31 - (ra31,current - ra31,recorded) = 7 – (7 – 10) = 10
On the other hand, assume that 8 more blankets were allocated at temporary processor 3; then
ra31,current = 18, and
T31 = T31 - (ra31,current - ra31,recorded) = 7 – (18 – 10) = -1. This temporary count is then set to 0 and
temporary processor 3 is stopped until it gets a count greater than 0.
Note that this still does not prevent temporary over-allocation, since one temporary processor does not know what the others are doing and the cost bound c = 1.1 allows the temporary counts, summed over the nodes, to exceed the permanent count. However, it reduces the incidence of over-allocation and hence the number of “undones”, and our objective of high availability is maintained.
4. ADDITIONS
At any time while the system is running, additions can be made to the available pool of resources, e.g. new donations can be made to a pool of resources for disaster relief. An addition is considered a unique transaction ai(r1 … rm) that adds rk (i.e. rk resources of type k, where k ranges from 1 to m) to the pool of available resources. It is not appended to the child queues. When this transaction is processed, Pjk and Tjk are updated by the permanent processor:
Pjk = Pjk + rk
Tjk = c * Pjk * wjk using the current values of the wjk.
The temporary processors will then update Tjk to reflect their current situation as discussed in
section 3.
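A minimal sketch of this update follows; the method name, and the handling of all resource types in a single call, are assumptions made for illustration.

final class Additions {
    // Folds a donation vector r into the counts and recomputes T with the current weights.
    static void applyAddition(long[] P, long[] T, long[] r, double c, double[] w) {
        for (int k = 0; k < r.length; k++) {
            P[k] += r[k];                        // P_jk = P_jk + r_k
            T[k] = Math.round(c * P[k] * w[k]);  // T_jk = c * P_jk * w_jk, using the current w_jk
        }
        // Each temporary processor then adjusts its T_jk for allocations it made in the
        // meantime, exactly as in Section 3.
    }
}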
5. RESULTS FROM TESTS WHERE THERE ARE NO FAILURES
The COPAR test-bed includes a transaction generator and servers on a LAN at Rowan University
in New Jersey (R) interconnected via the Internet with a server about 40 miles away at Richard
Stockton College in New Jersey (RS) and a server at the University of the West Indies (UWI)
located in Trinidad in the southern Caribbean approximately 2000 miles away.
The transaction generator and servers are all started with a program running on the transaction
generator node that reads and parses an XML file containing the data for the run. We have
demonstrated that a large percentage of transactions submitted to the system can be handled
optimistically (without multi-server agreement) at significantly faster turnaround times and with a
very small percentage of “undones”.
The figures and tables display results when 200 transactions were generated at a rate of 5
transactions per second. There are 200 resources of each type available initially. The cost bound
is 1.16. Transactions include requests for resources, returns of resources and new donations (i.e.
additions). Requests and returns range from 3 to 9 resources per transaction. Donations range
from 3 to 10 resources per donation. Tests were done on two platforms: a four node platform with
all servers (including transaction generator) at Rowan (R); and a six node platform including
Rowan, Richard Stockton and UWI (R+RS+UWI).
In Figure 2, of the 200 transactions 18 are donations totaling 136. There is one resource type of
200 resources available initially. On the R platform 159 transactions are done optimistically and 4
are undone. On the R+RS+UWI platform 182 transactions are done optimistically and 25 are
undone.
During these tests on the R platform, pessimistic processing times (PT) range from 29 milliseconds to 288 milliseconds, and optimistic processing times (OT) range from 1 millisecond to 20 milliseconds. The average PT to OT ratio is 18. The R+RS+UWI platform is subject to the vagaries of the Internet and the vast geographical expanse. The PT times range from 2.6 seconds to 10 minutes, and OT times range from 1 millisecond to 1 second. The average PT to OT ratio is 117000 (see Table 3).
Figure 2. Tests were done on two platforms: a four node platform with all servers (including
transaction generator) at Rowan (R); and a six node platform including Rowan, Richard Stockton
and UWI (R+RS+UWI); 200 transactions were generated at a rate of 5 transactions per second.
Table 3. Pessimistic (PT) and Optimistic (OT) times from tests in Figure 2
Platform      Times   Min        Max        Ave PT/OT ratio
R             PT      29 msec    288 msec   18
R             OT      1 msec     20 msec
R+RS+UWI      PT      2.6 sec    10 min     117000
R+RS+UWI      OT      1 msec     1 sec
In Figure 3 there are 3 resource types each with 200 resources initially. There are 28 donations
totaling 189, 172 and 181 for resource types 1, 2 and 3 respectively. On the R platform 161
transactions are done optimistically and 2 are undone. On the R+RS+UWI platform 172
transactions are done optimistically and 10 are undone.
During these tests on the R platform, PT times range from 29 milliseconds to 255 milliseconds, and OT times range from 1 millisecond to 21 milliseconds. The average PT to OT ratio is 19. The R+RS+UWI platform is subject to the vagaries of the Internet and the vast geographical expanse. The PT times range from 3 seconds to 8 minutes, and OT times range from 1 millisecond to 258 milliseconds. The average PT to OT ratio is 111000 (see Table 4).
Figure 3. There are 3 resource types each with 200 resources initially. There are 28 donations
totaling 189, 172 and 181 for resource types 1, 2 and 3 respectively. On the R platform 161
transactions are done optimistically and 2 are undone. On the R+RS+UWI platform 172
transactions are done optimistically and 10 are undone.
Table 4. Pessimistic (PT) and Optimistic (OT) times from tests in Fig. 3
Platform      Times   Min        Max        Ave PT/OT ratio
R             PT      29 msec    255 msec   19
R             OT      1 msec     21 msec
R+RS+UWI      PT      3 sec      8 min      111000
R+RS+UWI      OT      1 msec     258 msec
6. HANDLING FAILURE
Our failure handling model addresses only the case of a node that can no longer be reached.
Failing to reach a node may be due to that node’s failure, communication link failure, or an
unacceptably long response time. Such a failure handling model is workable in COPAR since the
transactions handled and the information maintained by the system can tolerate certain margins of
error.
If a node cannot be reached due to node or communication link failure then the pessimistic 2PC
processing will fail. However, optimistic processing will continue at all operating nodes until the
cost bound at those nodes is zero. The objective will be to restart pessimistic processing only if a
majority of the initial group of nodes can be reached.
The restart of pessimistic processing among the majority uses the concept of the “distinguished
partition”. That is, the network is now partitioned and processing is allowed to continue in a
favored partition. This favored partition is called the “distinguished partition”. Voting schemes in
which nodes possess read/write votes are often used to determine that “distinguished partition”
(see [21] and [9]).
Our “distinguished partition” for pessimistic processing will be the partition with the majority of
the initial group of nodes. The restart will use the current permanent/pessimistic resource counts
and generate new temporary/optimistic counts for the new reachable pool of nodes.
For example, given the following 4-node situation in Table 5:
Table 5. The current state at 4 nodes
Node 1: P11 = 100, T11 = 29
Node 2: P21 = 100, T21 = 29
Node 3: P31 = 100, T31 = 29
Node 4: P41 = 100, T41 = 29
After some processing, assume that each node has allocated 4 resources and that this has been processed pessimistically; the new situation is shown in Table 6:
Table 6. The new state after allocating 4 resources
Node 1: P11 = 84, T11 = 25
Node 2: P21 = 84, T21 = 25
Node 3: P31 = 84, T31 = 25
Node 4: P41 = 84, T41 = 25
Assume that node 4 can no longer be reached, but it is still operable. That node can continue to
process requests until T41 = 0. Pessimistic processing can restart with nodes 1, 2 and 3 with a P
value of 84 and 3 being the number of nodes.
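The restart computation can be sketched as follows. We assume here, for illustration only, that on restart the weights are re-initialized to 1/n' over the n' reachable nodes, and we infer the cost bound 1.16 from Table 5 (29 ≈ 1.16 × 100 / 4); neither assumption is stated explicitly above.

final class PartitionRestart {
    // New temporary counts for the surviving pool, assuming weights restart at 1/n'.
    static long[] restartTemporaryCounts(long[] P, double c, int reachableNodes) {
        long[] T = new long[P.length];
        for (int k = 0; k < P.length; k++) {
            T[k] = Math.round(c * P[k] / reachableNodes);
        }
        return T;
    }

    public static void main(String[] args) {
        // With the Table 6 state (P = 84) and nodes 1, 2 and 3 forming the majority,
        // each surviving node would get roughly 32 resources of optimistic headroom.
        long[] T = restartTemporaryCounts(new long[] {84}, 1.16, 3);
        System.out.println(T[0]);   // prints 32
    }
}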
Currently the system is controlled by a transaction generator, which can be viewed as a single
point of failure. In the future transaction handling should be separated from system management.
The system manager will handle system start-up, initialization, monitoring, restart, etc. The
transaction handlers should exist at all nodes. In the meantime the transaction generator assumes
the role of the monitor of the system.
We would like the two-phase commit processing to recover from the failure of a participating
node. Therefore we are proposing the following pseudo two-phase commit sequence.
In phase one, after a time-out before receiving all votes, the coordinator will count the number of
votes to determine if it heard from a majority of the initial set of participants. If it did not hear
from a majority the transaction will be aborted. If it heard from a majority it will start phase two
with the majority as the group of participants.
In phase two, after a time-out before receiving all commit responses, the coordinator will
determine if it heard from a majority of the initial set of participants. If it did not hear from a
majority the transaction will be aborted. If it heard from a majority the coordinator will complete
the commit sequence. The subsequent processing round will start phase one with this new group
of participants.
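The coordinator-side rule common to both phases can be sketched as follows; the types and method names are illustrative assumptions, and “majority” is read as more than half of the initial set of participants.

import java.util.Set;

final class PseudoTwoPhaseCommit {
    enum Outcome { ABORT, PROCEED }

    // True if the responders form a majority of the initial set of participants.
    static boolean isMajority(int heardFrom, int initialParticipants) {
        return heardFrom > initialParticipants / 2;
    }

    // On a time-out in either phase: abort unless a majority of the initial participants
    // responded; otherwise continue (to phase two, or to completion of the commit) with
    // that majority, which also becomes the group of participants for the next round.
    static Outcome onTimeout(Set<Integer> responders, int initialParticipants,
                             Set<Integer> nextRoundParticipants) {
        if (!isMajority(responders.size(), initialParticipants)) {
            return Outcome.ABORT;
        }
        nextRoundParticipants.clear();
        nextRoundParticipants.addAll(responders);
        return Outcome.PROCEED;
    }
}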
If the transaction generator times out on a coordinator it will assume that the coordinator is no
longer reachable. The transaction generator will determine if a majority of the initial set of nodes
is still operable. If a majority is operable the transaction generator will select a new coordinator
and restart transaction processing with the new group of nodes. If a majority is not operable the
transaction generator will wait until a majority of nodes come back online.
As a proof of concept we ran some tests of COPAR that simulated a node failure during
transaction processing. We did this in the following way. Whenever a server is notified that a
transaction has been selected for pessimistic processing, that server increments a counter that was
initially zero. When that counter reaches 50 the server checks the notification message to
determine if the sender was a specified server s. If it is s then s is classified as inactive and is
dropped from the two-phase commit pool. The pessimistic processing continues with the pool
reduced by one server.
However, since server s is in reality still active it will continue optimistic processing until it
empties its child queue or until its cost bound is less than or equal to zero. At this point the
transaction generator does not know that server s is no longer in the two-phase commit pool and
so the generator can continue to send new transactions to server s.
In order to prevent this, the generator increments a counter whenever it generates a new
transaction. When that counter reaches 25 the generator stops sending transactions to server s.
Transactions that should have gone to s are sent alternately to its downstream and upstream
neighbor. In the tests 200 transactions are generated, therefore the counter values must be less than 200. Since the selection of a coordinator/parent is pseudo-random, and since we do not keep a history of the interactions between servers, our choice of the counter values is somewhat arbitrary; it is intended primarily to ensure that new transactions are not sent to server s after it has been dropped from the pessimistic two-phase pool.
In the tests discussed below a server on the Rowan(R) LAN is dropped during the processing. In
Figure 4, of the 200 transactions 18 are donations totaling 136. There is one resource type of 200
resources available initially. On the R platform 157 transactions are done optimistically and 1 is
undone. On the R+RS+UWI platform 182 transactions are done optimistically and 25 are undone.
Notice the similarity between these results and those displayed in Figure 2 where the only change
here is in the dropped server.
[Figure 4 bar chart: AllTrans, Opt and Undone counts for the R and R+RS+UWI platforms.]
Figure 4. A server on the Rowan (R) LAN is dropped during the processing. Results are similar to the case when no server is dropped.
During these tests on the R platform, pessimistic processing times (PT) range from 29 milliseconds to 222 milliseconds, and optimistic processing times (OT) range from 1 millisecond to 17 milliseconds. The average PT to OT ratio is 21. The R+RS+UWI platform is subject to the vagaries of the Internet and the vast geographical expanse. The PT times range from 2.6 seconds to 8.8 minutes, and OT times range from 1 millisecond to 991 milliseconds. The average PT to OT ratio is 97251 (see Table 7).
Notice that whereas the numbers of completions are similar to the case when all servers were
operable (see Table 3), there are differences in completion times when a server is dropped. It is
expected that the pessimistic processing times should decrease after a server is dropped. On the R
platform max PT dropped from 288 milliseconds to 222 milliseconds, and on the R+RS+UWI
platform max PT dropped from 10 minutes to 8.8 minutes.
Table 7. Pessimistic (PT) and Optimistic (OT) times from tests in Fig. 4
Platform      Times   Min        Max        Ave PT/OT ratio
R             PT      29 msec    222 msec   21
R             OT      1 msec     17 msec
R+RS+UWI      PT      2.6 sec    8.8 min    97251
R+RS+UWI      OT      1 msec     991 msec
In Figure 5, of the 100 transactions 9 are donations totaling 66. There is one resource type of 200
resources available initially. The results of two tests on the R+RS+UWI platform are displayed.
In both tests a Rowan server was dropped after about 50 transactions. In the test labeled “more”
the distribution of transactions was such that the remote UWI server (about 2000 miles away from
the generator) got 50% more transactions than in the test labeled “less”. In each case 91
transactions are done optimistically and 0 is undone. The difference in the distribution of
transactions does not affect the numbers completed.
[Figure 5 bar chart: AllTrans, Opt and Undone counts for the R+RS+UWI (less) and R+RS+UWI (more) tests.]
Figure 5. The results of two tests on the R+RS+UWI platform are displayed. In both tests a
Rowan server was dropped after about 50 transactions. In the test labeled “more” the distribution
of transactions was such that the remote UWI server got 50% more transactions than in the test
labeled “less”. In each case 91 transactions are done optimistically and 0 is undone.
During these tests on the “less” platform, pessimistic processing times (PT) range from 2.8 seconds to 4 minutes, and optimistic processing times (OT) range from 1 millisecond to 835 milliseconds. The average PT to OT ratio is 47389. On the “more” platform the PT times range from 4.4 seconds to 6 minutes, and OT times range from 2 milliseconds to 1.4 seconds. The average PT to OT ratio is 45674 (see Table 8). On the “more” platform the far-distant UWI server performed the coordinator role more often than on the “less” platform, so the nature of the two-phase commit generates a longer maximum PT time. However, to the satisfaction of the users of the system, the maximum optimistic processing time is 1.4 seconds, with 0 undone.
Table 8. Pessimistic (PT) and Optimistic (OT) times from tests in Fig. 5
Test          Times   Min        Max        Ave PT/OT ratio
Less          PT      2.8 sec    4 min      47389
Less          OT      1 msec     835 msec
More          PT      4.4 sec    6 min      45674
More          OT      2 msec     1.4 sec
7. CONCLUSION
We feel that we have met the main objectives that we had set for COPAR. It targets applications
where there is need for very fast receipt and distribution of resources over possibly wide
geographical areas, e.g. a very wide disaster zone. COPAR provides a high level of availability.
There is very fast turnaround time on the processing of transactions. Validation is quick, thus minimizing the need to undo an optimistic result. There is a simple failure handling scheme
which permits all reachable nodes to continue optimistic processing and a “distinguished
partition” to continue pessimistic processing.
There is wide geographical distribution of replicas covering a range of approximately 2000 miles.
Data integrity is preserved through the pessimistic two-phase commit and the choice of an initial
cost bound. It is our view that the design embodies simple but workable concepts. All nodes
handle their child queues optimistically (independently) and their parent queues pessimistically
(two-phase commit).
However there is further work to be done. Three main tasks are (1) improving the handling of
failure, (2) separating the system manager from the transaction manager and (3) implementing
multiple transaction generators with interfaces that run on mobile devices.
REFERENCES
[1] Crichlow, J.M., Hartley, S. and Hosein, M., 2012. A High-Availability Distributed System for
Countable Objects like Disaster Relief, The Computing Science and Technology International
Journal, Vol. 2, No. 2, June, 29-32.
[2] Francis, M.F. and Crichlow, J.M., 1995. A mechanism for combining optimism and pessimism in
distributed processing, Proceedings of the IASTED/ISMM International Conference on Intelligent
Information Management Systems, Washington, D.C., June, 103-106.
[3] Hosein, M. and Crichlow, J.M., 1998. Fault-tolerant Optimistic Concurrency control in a Distributed
System, Proceedings of the IASTED International Conference on Software Engineering, Las Vegas,
October 28-31, 319-322.
[4] Innis, C., Crichlow, J., Hosein, M. and Hartley, S., 2002, A Java System that combines Optimism and
Pessimism in updating Replicas, Proceedings of the IASTED International Conference on Software
Engineering and Applications, Cambridge, MA, November 4-6.
[5] Bernstein, P.A., Hadzilacos, V. and Goodman, N., 1987. Concurrency Control and Recovery in
Database Systems, Addison-Wesley Pub. Co., Reading, Ma.
[6] Birman, K.P., 1996. Building Secure and Reliable Network Applications, Manning Pub. Co.
[7] Birman, K.P., 2012. Guide to Reliable Distributed Systems, Springer.
[8] Crichlow, J.M., 2000. The Essence of Distributed Systems, Prentice Hall/Pearson Education, U.K.
[9] Jajodia, S. and Mutchler, D., 1990. Dynamic voting algorithms for maintaining the consistency of a
replicated database, ACM Transactions on Database Systems, 15, 2(Jun), 230-280.
[10] Krishnakumar, N. and Bernstein, A.J., 1991. Bounded Ignorance in Replicated Systems, Tenth ACM Symp. on Principles of Database Systems.
[11] Wolfson, O., Jajodia, S. and Huang, Y., 1997. An adaptive data replication algorithm. ACM
Transactions on Database Systems, 22(2), 255-314.
[12] Saito, Y. and Shapiro, M., 2005. Optimistic Replication, ACM Computing Surveys, 37(1), March, 42-
81.
[13] Yu, H. & Vahdat, A., 2001. The Costs and Limits of Availability for Replicated Services, Proceedings
of the 18th ACM Symposium on Operating systems Principles, Alberta, Canada, October 21-24, 29-
42.
[14] Brewer, E., 2000. Towards Robust Distributed Systems, ACM Symposium on Principles of
Distributed Computing (PODC), Keynote address, July 19.
[15] Lynch, N.A. and Gilbert, S., 2002. Brewer’s Conjecture and the feasibility of consistent, available,
partition tolerant Web Services, ACM SIGACT News, Vol. 33, Issue 2, 51-59.
[16] Pritchett, D., 2008. An ACID Alternative, ACM Queue, Vol. 6, No. 3, May/June, 48-55.
[17] Lynch, N.A., Blaustein, B.T. and Siegel, M., 1986. Correctness conditions for highly available
replicated databases, Proceedings of the fifth annual ACM Symposium on Principles of Distributed
Computing, Aug., 11-28.
[18] Crichlow, J.M., 1994. Combining optimism and pessimism to produce high availability in distributed
transaction processing, ACM SIGOPS Operating Systems Review, 28, 3(July), 43-64.
[19] Crichlow, J.M., 2009. Distributed Systems – Computing over Networks, Prentice Hall India.
[20] Tanenbaum, A. S., van Steen, M., 2002. Distributed Systems, Principles and Paradigms, Prentice
Hall, NJ.
[21] Gifford, D.K., 1979. Weighted Voting for Replicated Data, Proceedings of the Seventh ACM
Symposium on Operating Systems Principles, Dec., 150-162.