This document summarizes a research paper that proposes a new resource scheduling algorithm called STRS for cloud computing environments. STRS aims to allocate data resources optimally across computational clusters in a distributed system to minimize data access costs. It does this through two distributed algorithms: STRSA runs at each parent node to determine optimal data allocation to child nodes, and STRSD runs at each child node to determine optimal data de-allocation. The paper also proposes an intra-cluster replication algorithm called ORPNDA that uses heuristic expansion-shrinking methods to determine optimal partial data replication within each cluster. Experimental results show STRS and ORPNDA significantly outperform general frequency-based replication schemes.
1) The document discusses mining data streams using an improved version of McDiarmid's bound. It aims to enhance the bounds obtained by McDiarmid's tree algorithm and improve processing efficiency.
2) Traditional data mining techniques cannot be directly applied to data streams due to their continuous, rapid arrival. The document proposes using Gaussian approximations to McDiarmid's bounds to reduce the size of training samples needed for split criteria selection.
3) It describes Hoeffding's inequality, which is commonly used but not sufficient for data streams. The document argues that McDiarmid's inequality, used appropriately, provides a more efficient technique for high-speed, time-changing data streams.
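For context, the split test these bounds drive can be stated in a few lines. The sketch below shows the standard Hoeffding bound as it is commonly used in streaming decision trees; the function name and numbers are illustrative, not taken from the paper.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean of a random variable with the given
    range lies within epsilon of the sample mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Split a leaf when the gain gap between the two best attributes exceeds
# epsilon; information gain has range 1.0 for a two-class problem.
gap = 0.08                                   # best_gain - second_best_gain
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=5000)
print(f"epsilon = {eps:.4f}; split now: {gap > eps}")
```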
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc... (IRJET Journal)
This document discusses techniques for clustering hierarchical documents based on their structural similarity. It summarizes several existing approaches:
1) A tree edit distance-based method that represents trees as paths and computes the distance between subtrees. However, it requires trees to have a pre-specified structure.
2) Chawathe's algorithm, which uses pre-order tree traversal to transform trees into sequences of node labels and depths for distance calculation (see the sketch after this list). It allows efficient assignment of new documents to clusters.
3) The XCLSC algorithm that clusters documents in two phases - grouping structurally similar documents and then searching to further improve clustering results and performance. However, it has high computational requirements.
4) The XPattern and PathXP
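As a rough illustration of the sequence-based idea in item 2 (my sketch, using a generic edit distance rather than Chawathe's exact cost model), a tree can be flattened in pre-order into (label, depth) pairs and compared as a sequence:

```python
def to_sequence(node, depth=0):
    """Pre-order traversal of a (label, children) tree into (label, depth) pairs."""
    seq = [(node[0], depth)]
    for child in node[1]:
        seq.extend(to_sequence(child, depth + 1))
    return seq

def seq_distance(a, b):
    """Levenshtein distance over (label, depth) tokens."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

t1 = ("article", [("title", []), ("body", [("sec", [])])])
t2 = ("article", [("title", []), ("body", [])])
print(seq_distance(to_sequence(t1), to_sequence(t2)))   # 1
```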
International Refereed Journal of Engineering and Science (IRJES)
A leading international journal for the publication of new ideas, state-of-the-art research results, and fundamental advances in all aspects of Engineering and Science. IRJES is an open access, peer-reviewed international journal whose primary objective is to provide the academic community and industry a platform for the submission of original research and applications.
A DELAY-CONSTRAINED MULTICAST ROUTING USING JIA ALGORITHM (IJCI JOURNAL)
Distributed multicast routing under delay constraints requires simultaneous transmission of a message from a source to a group of destinations within a specified time delay, as in a video conferencing system. Multicast routing finds a routing tree that is rooted at the source and contains all the destinations. The principal goal of multicast routing is to minimize the network cost; a tree with minimal overall cost is called a Steiner tree, and finding such a tree is NP-complete.
Many inexpensive heuristic algorithms have been proposed for the Steiner tree problem. However, most of them are centralized: a centralized algorithm requires a central node to be responsible for computing the tree, and this central node must have full knowledge of the global network, which is not practical in large networks. Existing algorithms therefore suffer from drawbacks such as heavy communication cost, long connection setup time, and poor quality of the produced routing trees. So far, little work has been done on finding a delay-bounded Steiner tree in a distributed manner, and this paper makes an effort to that effect. The study reveals that the drawbacks mentioned above have been substantially reduced.
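As a point of reference for the problem itself (a minimal centralized baseline, not Jia's distributed algorithm), a delay-bounded tree can be approximated by taking each destination's least-cost path and falling back to its least-delay path when the bound is violated:

```python
import heapq

def dijkstra(graph, src, by_cost=True):
    """graph: {u: [(v, cost, delay), ...]}. Returns, per node, the (cost,
    delay) of the best path from src (optimizing cost or delay) plus a
    predecessor map for path reconstruction."""
    dist, prev = {src: (0.0, 0.0)}, {}
    heap = [(0.0, src)]
    while heap:
        _, u = heapq.heappop(heap)
        for v, c, d in graph.get(u, []):
            cand = (dist[u][0] + c, dist[u][1] + d)
            key = cand[0] if by_cost else cand[1]
            old = dist.get(v)
            if old is None or key < (old[0] if by_cost else old[1]):
                dist[v], prev[v] = cand, u
                heapq.heappush(heap, (key, v))
    return dist, prev

def extract_path(prev, src, dst):
    p = [dst]
    while p[-1] != src:
        p.append(prev[p[-1]])
    return p[::-1]

def delay_bounded_tree(graph, src, dests, bound):
    """Union of per-destination paths: the min-cost path when its delay meets
    the bound, otherwise the min-delay path (a subgraph, not a pruned tree)."""
    cdist, cprev = dijkstra(graph, src, by_cost=True)
    ddist, dprev = dijkstra(graph, src, by_cost=False)
    edges = set()
    for t in dests:
        if cdist[t][1] <= bound:
            p = extract_path(cprev, src, t)
        elif ddist[t][1] <= bound:
            p = extract_path(dprev, src, t)
        else:
            raise ValueError(f"no delay-feasible path to {t}")
        edges.update(zip(p, p[1:]))
    return edges

g = {"s": [("a", 1, 5), ("b", 4, 1)], "a": [("t", 1, 5)], "b": [("t", 4, 1)]}
print(delay_bounded_tree(g, "s", ["t"], bound=3))   # {('s','b'), ('b','t')}
```

The union of per-destination paths only approximates a Steiner tree; the distributed algorithm surveyed above avoids both this redundancy and the global-knowledge requirement.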
Reduct generation for the incremental data using rough set theory (csandit)
In today's changing world, a huge amount of data is generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. The paper proposes a dimension-reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct. The method analyzes the new dataset when it becomes available and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied to a few benchmark datasets collected from the UCI repository, and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
Evaluating Classification Algorithms Applied To Data Streams (Esteban Donato)
This document summarizes and evaluates several algorithms for classification of data streams: VFDTc, UFFT, and CVFDT. It describes their approaches for handling concept drift, detecting outliers and noise. The algorithms were tested on synthetic data streams generated with configurable attributes like drift frequency and noise percentage. Results show VFDTc and UFFT performed best in accuracy, while CVFDT and UFFT were fastest. The study aims to help choose algorithms suitable for different data stream characteristics like gradual vs sudden drift or frequent vs infrequent drift.
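The synthetic streams described can be mimicked with a simple generator; this sketch is illustrative (names and concepts are mine, not the paper's generator) and produces abrupt concept drift at a configurable frequency plus class noise:

```python
import random

def drift_stream(n, drift_every, noise, seed=0):
    """Yield (x, label): the concept alternates between y = x > 0.5 and
    y = x < 0.5 every `drift_every` samples; labels flip with prob `noise`."""
    rng = random.Random(seed)
    concept = 0
    for i in range(n):
        if i and i % drift_every == 0:
            concept ^= 1                     # sudden (abrupt) concept drift
        x = rng.random()
        y = int(x > 0.5) if concept == 0 else int(x < 0.5)
        if rng.random() < noise:
            y ^= 1                           # class noise
        yield x, y

stream = list(drift_stream(n=10_000, drift_every=2_500, noise=0.05))
```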
This document proposes a new sketch method called Joint Sketch (JS) to measure the host connection degree distribution (HCDD) in real-time for high-speed network links. JS uses a discrete uniform Flajolet-Martin sketch combined with a small bitmap to build a compact digest of each host's network flows. Theoretical analysis and experimental results show that JS is significantly more accurate than previous methods at estimating the HCDD, using the same amount of memory. JS provides a more effective traffic summary than prior methods, especially for hosts with many connections.
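For background on the building block JS refines, here is a minimal single-hash Flajolet-Martin distinct-count estimator for the number of peers of a host (a toy under stated assumptions, not the paper's Joint Sketch; production FM sketches average many hash functions):

```python
import hashlib

class FMSketch:
    """Flajolet-Martin distinct-count estimator with a single hash."""
    def __init__(self):
        self.bitmap = 0

    @staticmethod
    def _rho(h):
        """Position of the least-significant set bit (0 if h == 0)."""
        return (h & -h).bit_length() - 1 if h else 0

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")
        self.bitmap |= 1 << self._rho(h)

    def estimate(self):
        r = 0
        while self.bitmap >> r & 1:
            r += 1                           # first zero bit position R
        return 2 ** r / 0.77351              # standard FM correction factor

fm = FMSketch()
for flow in range(1000):
    fm.add(("10.0.0.1", flow))               # e.g., distinct peers of one host
print(round(fm.estimate()))
```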
A Novel Optimization of Cloud Instances with Inventory Theory Applied on Real... (aciijournal)
Horizontal scaling is a cloud architectural strategy in which the number of nodes or computers is increased to meet the demand of a continuously growing workload. The cost of compute instances increases with the workload, and this research aims to optimize reserved cloud instances using principles of inventory theory applied to IoT datasets of a variable, stochastic nature. With a structured solution architecture laid down for the business problem to understand the checkpoints of compute instances, the range of approximate reserved compute instances has been optimized and pinpointed by analysing the probability distribution curves of the IoT datasets. Inventory theory applied to the distribution curves of the data yields the optimized number of compute instances required within the range prescribed by the solution architecture. The solution would help cloud solution architects and project sponsors plan the compute power required on the AWS® Cloud platform in any business situation where ingesting and processing data of a stochastic nature is a business need.
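Read this way, the core computation is the classical newsvendor quantile from inventory theory; a minimal sketch, assuming a Poisson demand stand-in for the IoT workload and illustrative under/over-provisioning costs:

```python
import numpy as np

def reserved_instances(demand_samples, under_cost, over_cost):
    """Newsvendor-style quantity: the demand quantile at the critical ratio
    Cu / (Cu + Co), where Cu = cost of being one instance short (e.g., the
    on-demand premium) and Co = cost of an idle reserved instance."""
    critical_ratio = under_cost / (under_cost + over_cost)
    return int(np.ceil(np.quantile(demand_samples, critical_ratio)))

# Hypothetical hourly instance demand drawn from an ingestion workload.
rng = np.random.default_rng(42)
demand = rng.poisson(lam=120, size=8760)             # one year of hours
print(reserved_instances(demand, under_cost=0.06, over_cost=0.025))
```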
This document discusses techniques for evaluating continuous aggregation queries over a network of data aggregators. It presents the following key points:
1. Continuous queries are used to monitor changing data and provide online results within a specified incoherency bound. This paper presents a technique for decomposing queries and assigning sub-queries to aggregators to minimize refresh messages.
2. A network of data aggregators maintains data at different incoherency bounds. The problem is assigning sub-queries to aggregators such that their combined incoherency satisfies the overall query bound with fewest messages.
3. A query cost model is developed to estimate refresh messages based on data dynamics and specified incoherency bounds. Performance
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic... (IRJET Journal)
This document proposes a two-stage sampling selection strategy (T3S) for large-scale data deduplication using Apache Spark. T3S reduces the labeling effort for training data by first selecting balanced subsets of candidate pairs, then removing redundant pairs to produce a smaller, more informative training set. It detects fuzzy region boundaries using this training set to classify candidate pairs. The approach is implemented in a distributed manner using Apache Spark and shows better performance than an existing method by reducing the training set size.
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden within these large amounts of data. If that information can be extracted from the database, it will create a lot of profit for the organization. The question is how to extract this value, and the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool for data mining practitioners.
Data Compression in Data Mining and Business Intelligence (ShahDhruv21)
This document discusses various techniques for numerosity reduction in data mining. Numerosity reduction aims to reduce the volume of data while maintaining integrity. Methods include parametric approaches like regression and log-linear models, and non-parametric methods such as histograms, clustering, sampling, and data cube aggregation. Histograms bin data into buckets to store average values, while clustering partitions data into groups. Sampling obtains a representative subset of data. Data cube aggregation precomputes and stores multidimensional summarized data. Together, these techniques provide more efficient analysis of large datasets.
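Two of these reductions fit in a few lines; the sketch below bins values into an equal-width histogram of per-bucket means and draws a simple random sample without replacement (bin counts and sizes are illustrative):

```python
import numpy as np

def equal_width_histogram(values, nbins):
    """Replace raw values with (bin_edges, per-bin means): a non-parametric
    numerosity reduction keeping only nbins summary numbers."""
    edges = np.linspace(values.min(), values.max(), nbins + 1)
    idx = np.clip(np.digitize(values, edges) - 1, 0, nbins - 1)
    means = np.array([values[idx == b].mean() if (idx == b).any() else np.nan
                      for b in range(nbins)])
    return edges, means

data = np.random.default_rng(1).lognormal(mean=3.0, sigma=0.5, size=100_000)
edges, means = equal_width_histogram(data, nbins=20)

# Sampling-based reduction: simple random sample without replacement.
sample = np.random.default_rng(2).choice(data, size=1_000, replace=False)
```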
Knowledge Discovery Query Language (KDQL) (Zakaria Zubi)
The document discusses Knowledge Discovery Query Language (KDQL), a proposed query language for interacting with i-extended databases in the knowledge discovery process. KDQL is designed to handle data mining rules and retrieve association rules from i-extended databases. The key points are:
1) KDQL is based on SQL and is intended to support tasks like association rule mining within the ODBC_KDD(2) model for knowledge discovery.
2) It can be used to query i-extended databases, which contain both data and discovered patterns.
3) The KDQL RULES operator allows users to specify data mining tasks like finding association rules that satisfy certain frequency and confidence thresholds.
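The filtering such a RULES clause expresses can be sketched in plain Python (this mirrors only the frequency and confidence thresholds; KDQL's actual syntax is not reproduced here):

```python
from itertools import combinations

def rules(support, min_support, min_conf):
    """support: dict mapping frozenset itemsets to relative support.
    Emit (antecedent, consequent, support, confidence) tuples meeting both
    thresholds, mirroring what a KDQL RULES clause would request."""
    for items, s in support.items():
        if len(items) < 2 or s < min_support:
            continue
        for r in range(1, len(items)):
            for ante in map(frozenset, combinations(items, r)):
                conf = s / support[ante]
                if conf >= min_conf:
                    yield ante, items - ante, s, conf

support = {frozenset("A"): 0.6, frozenset("B"): 0.5, frozenset("AB"): 0.4}
for ante, cons, s, c in rules(support, min_support=0.3, min_conf=0.6):
    print(set(ante), "=>", set(cons), f"sup={s} conf={c:.2f}")
```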
With the ever-increasing number of documents on the web and in other repositories, organizing and categorizing these documents to the diverse needs of users by manual means is a complicated job; hence a machine learning technique named clustering is very useful. Text documents are clustered by pairwise similarity, using similarity measures such as Cosine, Jaccard, or Pearson. The best clustering results are seen when the overlap of terms between documents is small, that is, when clusters are distinguishable. Hence, to find document similarity we apply the link and neighbor notions introduced in ROCK: significantly similar documents are called neighbors, and the link between a pair of documents is the number of neighbors they share. This work applies links and neighbors to Bisecting K-means clustering for identifying seed documents in the dataset, as a heuristic measure in choosing a cluster to be partitioned, and as a means to find the number of partitions possible in the dataset. Our experiments on real-time datasets showed a significant improvement in accuracy with minimum time.
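The link computation at the heart of this approach reduces to two matrix products; a minimal sketch, assuming row-normalized term vectors and an illustrative similarity threshold theta:

```python
import numpy as np

def links(docs, theta):
    """docs: row-normalized term-frequency vectors (n_docs x n_terms).
    Two documents are neighbors when cosine similarity >= theta;
    link(p, q) = number of neighbors they share (as in ROCK)."""
    sim = docs @ docs.T                      # cosine, rows unit-normalized
    neighbor = (sim >= theta).astype(int)
    np.fill_diagonal(neighbor, 0)
    return neighbor @ neighbor               # shared-neighbor counts

X = np.random.default_rng(0).random((6, 40))
X /= np.linalg.norm(X, axis=1, keepdims=True)
L = links(X, theta=0.8)
print(L[0, 1])    # link count used to pick seeds / clusters to split
```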
Distributed Decision Tree Learning for Mining Big Data Streams (Arinto Murdopo)
This document presents a distributed decision tree learning algorithm called Vertical Hoeffding Tree (VHT) for mining big data streams. It summarizes the contributions of the master's thesis, which include: (1) Developing the SAMOA framework for distributed streaming machine learning, (2) Integrating SAMOA with the Storm distributed stream processing engine, and (3) Implementing the VHT algorithm to improve scalability over the standard Hoeffding Tree algorithm when dealing with high-dimensional data streams. The evaluation shows that VHT achieves similar accuracy to Hoeffding Tree but higher throughput, especially on datasets with many attributes.
The document summarizes 10 influential data mining algorithms:
1. C4.5 decision tree algorithm and its successor C5.0, which can construct classifiers as decision trees or rulesets.
2. K-means clustering algorithm, an iterative algorithm that partitions data into k clusters by minimizing the distances between data points and cluster centers (sketched after this list).
3. Additional algorithms covered include SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These algorithms cover important data mining tasks such as classification, clustering, association analysis, and link mining.
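The k-means loop in item 2 is compact enough to sketch (Lloyd's algorithm with random initialization; parameters and data are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center,
    then recompute each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break                            # converged
        centers = new
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3)])
labels, centers = kmeans(X, k=2)
```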
Operating Task Redistribution in Hyperconverged Networks (IJECEIAES)
This document summarizes a research article that proposes a new method for rational task redistribution in hyperconverged networks. The method aims to minimize average packet delay by optimally distributing tasks across network nodes based on their available computational resources. It formulates the task redistribution problem and defines an objective function considering penalties for delays. An initial distribution is determined based on minimum penalties. The distribution is then iteratively optimized by redistributing tasks along "contours" to reduce penalties until an optimal solution is reached that balances resources and task requirements across all nodes. Simulation results using a university e-learning network model demonstrate the method's better performance over classical approaches.
A framework for clustering time-evolving data (iaemedu)
The document proposes a framework for clustering time-evolving categorical data using a sliding window technique. It uses an existing clustering algorithm (Node Importance Representative) and a Drifting Concept Detection algorithm to detect changes in cluster distributions between the current and previous data windows. If a threshold difference in clusters is exceeded, reclustering is performed on the new window. Otherwise, the new clusters are added to the previous results. The framework aims to improve on prior work by handling drifting concepts in categorical time-series data.
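The windowed drift test can be sketched abstractly; note that the paper's Drifting Concept Detection algorithm is based on Node Importance Representatives, whereas this toy substitutes plain total-variation distance between cluster-size distributions:

```python
import numpy as np

def drift_detected(prev_counts, curr_counts, threshold=0.2):
    """Compare cluster distributions of consecutive sliding windows; a
    total-variation distance above `threshold` triggers reclustering."""
    p = np.asarray(prev_counts, dtype=float); p /= p.sum()
    q = np.asarray(curr_counts, dtype=float); q /= q.sum()
    return 0.5 * np.abs(p - q).sum() > threshold

# Cluster-membership counts of the previous and current windows.
print(drift_detected([40, 35, 25], [10, 60, 30]))    # True -> recluster window
```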
Output Privacy Protection With Pattern-Based Heuristic Algorithm (ijcsit)
Privacy Preserving Data Mining (PPDM) is an ongoing research area aimed at bridging the gap between collaborative data mining and data confidentiality. Of the many different approaches that have been adopted for PPDM, the rule hiding approach is used in this article. This approach ensures output privacy by protecting the mined patterns (itemsets) from malicious inference problems. An efficient algorithm named the Pattern-based Maxcover Algorithm is proposed, with experimental results. This algorithm minimizes the dissimilarity between the source and the released database; moreover, the protected patterns cannot be retrieved from the released database by an adversary or counterpart, even with an arbitrarily low support threshold.
In the present day, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. To analyze the datasets properly, construction of an efficient classifier model is also necessary; the objective of developing such a classifier is to classify unlabeled data into appropriate classes. The paper proposes a dimension-reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct, together with an optimization algorithm that uses the reduct to build the corresponding classification system. The method analyzes the new dataset when it becomes available, modifies the reduct accordingly to fit the entire dataset, and generates interesting optimal classification rule sets from the entire data set. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied to benchmark datasets collected from the UCI repository; a dynamic reduct is computed, and optimal classification rules are generated from the reduct. Experimental results show the efficiency of the proposed method.
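Of the rough-set ingredients these reduct papers rely on, attribute dependency is the easiest to make concrete. A minimal sketch of gamma(C, D) = |POS_C(D)| / |U| on a toy decision table (data and names are illustrative):

```python
from collections import defaultdict

def dependency(table, cond, dec):
    """Rough-set attribute dependency gamma(C, D): the fraction of rows
    whose C-indiscernibility class is pure in the decision attribute D."""
    classes = defaultdict(list)
    for row in table:
        classes[tuple(row[a] for a in cond)].append(row[dec])
    pos = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return pos / len(table)

table = [  # toy decision table: a, b are condition attributes, d the decision
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 0, "b": 0, "d": "yes"},   # conflicts with row 1 on (a, b)
]
print(dependency(table, cond=("a", "b"), dec="d"))   # 0.5
```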
Data mining techniques are used to retrieve knowledge from large databases, helping organizations run their business effectively in a competitive world; sometimes, however, this violates the privacy of individual customers. In this paper we propose an algorithm that addresses the privacy issues of individual customers, along with a transformation technique based on the Walsh-Hadamard transformation (WHT) and rotation. The WHT generates an orthogonal matrix and transfers the entire data into a new domain while maintaining the distances between data records; since such records could be reconstructed by statistical techniques (i.e., an inverse matrix), this problem is resolved by additionally applying a rotation transformation. In this work, the rotation transformation increases the difficulty for unauthorized persons attempting to access the original data of other organizations. The experimental results show that the proposed transformation gives the same classification accuracy as the original data set. We compare the results with existing techniques such as data perturbation by Simple Additive Noise (SAN) and Multiplicative Noise (MN), the Discrete Cosine Transformation (DCT), Wavelet, and First and Second order Sum and Inner Product Preservation (FISIP) transformation techniques. Based on privacy measures, the paper concludes that the proposed transformation technique is better at maintaining the privacy of individual customers.
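The distance-preservation property that makes classification accuracy survive is that both transforms are orthogonal. A minimal sketch, using a random orthogonal matrix in place of whatever specific rotation the paper selects:

```python
import numpy as np
from scipy.linalg import hadamard

def perturb(X, seed=0):
    """Distance-preserving perturbation: orthonormal Walsh-Hadamard transform
    followed by a random orthogonal matrix (a rotation/reflection), so pairwise
    Euclidean distances between records are unchanged."""
    n = X.shape[1]                       # number of attributes; power of 2 here
    H = hadamard(n) / np.sqrt(n)         # orthonormal WHT matrix
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal matrix
    return X @ H @ Q

X = np.random.default_rng(1).random((100, 8))
Y = perturb(X)
d0 = np.linalg.norm(X[0] - X[1]); d1 = np.linalg.norm(Y[0] - Y[1])
print(np.isclose(d0, d1))                # True: distance-based learners agree
```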
Column store decision tree classification of unseen attribute set (ijma)
A decision tree can be used for clustering of frequently used attributes to improve tuple reconstruction time in column-store databases. Due to the ad-hoc nature of queries, strongly correlated attributes are grouped together using a decision tree so that they share a common minimum support probability distribution. At the same time, in order to predict the cluster for an unseen attribute set, the decision tree may work as a classifier. In this paper we propose classification and clustering of unseen attribute sets using a decision tree to improve tuple reconstruction time.
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients like to use the web to discover product details in the form of online reviews provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, presenting users with more practical and personalized information services. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of communities of similar users; the collaborative filtering method assumes that people with the same tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse data problem and a lack of scalability, so a new recommender system is required to deal with sparse data and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis, and the described recommendation mechanism for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of the conventional CF system. One of the essential operations for data analysis is the join, but MapReduce is not very efficient at executing joins, since it always processes all records in the datasets even when only a small fraction of them is applicable to the join. This problem can be reduced by applying the bloomjoin algorithm: bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using bloom filters reduces the number of intermediate results and improves join performance.
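A bare-bones version of the bloomjoin filtering step (bit size, hash count, and the toy relations are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.bits = bytearray(nbits // 8)
        self.nbits, self.nhashes = nbits, nhashes

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

# Bloomjoin: build a filter on R's join keys, ship it to S's mappers, and
# emit only S records whose key might join. False positives are resolved in
# the reduce phase; true non-matches never leave the mapper.
bf = BloomFilter()
users = {("u1", "Alice"), ("u2", "Bob")}
for uid, _ in users:
    bf.add(uid)
ratings = [("u1", 5), ("u9", 3), ("u2", 4)]
shuffled = [r for r in ratings if bf.might_contain(r[0])]   # drops ("u9", 3)
```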
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING (IJDKP)
Fuzzy logic deals with degrees of truth. In this paper, we have shown how to apply fuzzy logic in text mining in order to perform document clustering. We took an example of document clustering where the documents had to be clustered into two categories. The method involved cleaning up the text and stemming of words. Then, we chose ‘m’ features which differ significantly in their word frequencies (WF), normalized by document length, between documents belonging to these two clusters. The documents to be clustered were represented as a collection of ‘m’ normalized WF values. The fuzzy c-means (FCM) algorithm was used to cluster these documents into two clusters. After the FCM execution finished, the documents in the two clusters were analysed for the values of their respective ‘m’ features. It was known that documents belonging to a document type ‘X’ tend to have higher WF values for some particular features. If the documents belonging to a cluster had higher WF values for those same features, then that cluster was said to represent ‘X’. By fuzzy logic, we not only get the cluster name, but also the degree to which a document belongs to a cluster.
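For reference, the FCM update alternates between membership-weighted centroids and inverse-distance memberships; a minimal numpy sketch with illustrative parameters (fuzzifier m = 2, random Dirichlet initialization):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: returns soft memberships U (n x c) and centers.
    Rows of X would be the 'm'-dimensional normalized WF vectors."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))       # rows sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        p = 2.0 / (m - 1.0)
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)
    return U, centers

docs = np.random.default_rng(1).random((20, 5))      # 20 docs, 5 WF features
U, centers = fcm(docs, c=2)
print(U[0])   # degree to which document 0 belongs to each cluster
```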
Postdiffset Algorithm in Rare Pattern: An Implementation via Benchmark Case S... (IJECEIAES)
Frequent and infrequent itemset mining are trending data mining techniques. The Association Rule (AR) patterns generated help decision makers or business policy makers project the next intended items across a wide variety of applications. While frequent itemsets deal with the items that are most purchased or used, infrequent items are those that occur rarely, also called rare items. AR mining remains one of the most prominent areas in data mining, aiming to extract interesting correlations, patterns, associations, or causal structures among sets of items in transaction databases or other data repositories. The database structures in association rule mining algorithms are based on either horizontal or vertical data formats, and these two formats have been widely discussed with example algorithms for each. Efforts on the horizontal format suffer from huge candidate generation and multiple database scans, resulting in higher memory consumption; to overcome this, solutions based on vertical approaches have been proposed. One of the established algorithms in the vertical data format is Eclat (Equivalence Class Transformation). Because of its 'fast intersection', in this paper we analyze the fundamental Eclat and its variants such as diffset and sortdiffset. In response to the vertical data format, and as a continuation of the Eclat extensions, we propose a postdiffset algorithm as a new member of the Eclat family that uses the tidset format in the first loop and diffset in later loops. We present the performance of the Postdiffset algorithm prior to its application to mining infrequent or rare itemsets. Postdiffset outperforms diffset and sortdiffset by 23% and 84% on the mushroom dataset, and by 94% and 99% on the retail dataset.
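The tidset/diffset bookkeeping that Postdiffset alternates between can be shown on a toy vertical database (my sketch; the actual algorithms manage these structures recursively over an equivalence-class tree):

```python
def eclat_support(tidsets, itemset):
    """Vertical (tidset) format: support of an itemset is the size of the
    intersection of its items' tid-lists, Eclat's 'fast intersection'."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids), tids

# diffset variant: store d(PX) = t(P) - t(PX) instead of t(PX);
# support(PX) = support(P) - |d(PX)|, much smaller on dense data.
tidsets = {"A": {1, 2, 3, 5}, "B": {2, 3, 5}, "C": {1, 4}}
sup_ab, t_ab = eclat_support(tidsets, ("A", "B"))
d_ab = tidsets["A"] - t_ab                       # diffset of AB w.r.t. prefix A
assert sup_ab == len(tidsets["A"]) - len(d_ab)   # 3 == 4 - 1
print(sup_ab, d_ab)
```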
This document discusses a problem of determining the correct author of a paper from a dataset where author names may be ambiguous. The data provided includes information about papers, authors, conferences, and their relationships. Issues with data cleaning and formatting are described. Initially, the problem was viewed as a clustering problem, but clustering algorithms like k-means did not work well due to data type issues. Classification algorithms like J48 decision trees and ZeroR were then explored, but results were not promising. Next steps discussed include further feature engineering and exploring other classification models like Naive Bayes.
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive, so a distributed environment such as a cluster is employed to tackle such scenarios. Apache Hadoop is one of the cluster frameworks for distributed environments that helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
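One map/reduce round of such a design can be simulated in-process; a minimal sketch in which each "mapper" counts candidates on its split and a single "reducer" sums and prunes (splits and thresholds are illustrative, not the paper's Hadoop job):

```python
from collections import Counter
from itertools import combinations

def mapper(transactions, candidates):
    """Each mapper counts candidate itemsets within its data split."""
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:                       # candidate is a subset of the basket
                counts[c] += 1
    return counts

def reducer(partial_counts, min_count):
    """The reducer sums per-split counts and keeps frequent itemsets."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return {i: n for i, n in total.items() if n >= min_count}

splits = [[{"a", "b"}, {"a", "c"}], [{"a", "b", "c"}, {"b", "c"}]]   # 2 mappers
items = {frozenset([x]) for s in splits for t in s for x in t}
cand2 = {a | b for a, b in combinations(items, 2)}                    # k = 2
print(reducer([mapper(s, cand2) for s in splits], min_count=2))
```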
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place the data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and reducing energy cost can contradict each other. Consequently, we define two new problems, the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem, and develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes an algorithm called Replica Placement in Graph Topology Grid (RPGTG) to optimally place data replicas in a graph-based data grid while ensuring quality of service (QoS). The algorithm aims to minimize data access time, balance load among replica servers, and avoid unnecessary replications, while restricting QoS in terms of number of hops and deadline to complete requests. The article describes how the algorithm converts the graph structure of the data grid to a hierarchical structure to better manage replica servers and proposes services to facilitate dynamic replication, including a replica catalog to track replica locations and a replica manager to perform replication
Improvement of limited Storage Placement in Wireless Sensor NetworkIOSR Journals
This document discusses improving limited storage placement in wireless sensor networks. It aims to minimize the total energy cost for collecting raw sensor data and responding to queries by optimally placing a limited number of storage nodes in the network. An algorithm is presented that calculates the minimum energy cost for placing up to k storage nodes by constructing and evaluating a two-dimensional table. The table entries represent the energy costs at each node for different numbers of storage nodes placed in its subtree. Filling the table from the leaves to the root allows finding the optimal storage node placement with minimum total energy cost.
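A simplified version of such a leaves-to-root table computation can be sketched as follows, under simplifying assumptions that are not the paper's exact cost model (each node generates one unit of raw data that travels, at unit cost per hop, to the nearest storage node on its path to the root; the root/sink always stores data; query-response costs are ignored):

```python
# Simplified leaves-to-root DP for placing up to k storage nodes in a tree,
# in the spirit of the paper's table-filling scheme.
from functools import lru_cache

def min_energy(children, root, k):
    @lru_cache(maxsize=None)
    def dp(v, budget, dist):
        """Min cost of v's subtree with `budget` storage nodes available in
        it; `dist` = hops from v to its nearest storage ancestor."""
        # Option 1: v is not a storage node; its data travels `dist` hops.
        best = dist + combine(children[v], budget, dist + 1)
        # Option 2: v becomes a storage node (spends one unit of budget).
        if budget > 0:
            best = min(best, combine(children[v], budget - 1, 1))
        return best

    @lru_cache(maxsize=None)
    def combine(kids, budget, dist):
        """Knapsack split of the storage budget among child subtrees."""
        if not kids:
            return 0
        first, rest = kids[0], kids[1:]
        return min(dp(first, b, dist) + combine(rest, budget - b, dist)
                   for b in range(budget + 1))

    # The root is the sink: it stores data, so its children sit at distance 1.
    return combine(children[root], k, 1)

if __name__ == "__main__":
    # A path 0-1-2-3 rooted at the sink 0 (children maps must use tuples).
    children = {0: (1,), 1: (2,), 2: (3,), 3: ()}
    print(min_energy(children, 0, k=0))  # 1 + 2 + 3 = 6
    print(min_energy(children, 0, k=1))  # storage at node 2 -> 1 + 0 + 1 = 2
```

Memoizing on (node, budget, distance) mirrors the paper's idea of a table indexed by node and number of storage nodes placed in its subtree, filled from the leaves up.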
Abstract— Cloud storage is usually a distributed infrastructure in which data is not stored on a single device but is spread across several storage nodes located in different areas. To ensure data availability, some amount of redundancy has to be maintained, but introducing redundancy adds costs such as extra storage space and the communication bandwidth required for restoring data blocks. Existing systems treat the storage infrastructure as homogeneous, where all nodes have the same online availability, which leads to efficiency losses. The proposed system treats the distributed storage system as heterogeneous, where each node exhibits a different online availability. Monte Carlo sampling is used to measure the online availability of storage nodes, and a parallel version of Particle Swarm Optimization is used to assign redundant data blocks according to that availability. The resulting data assignment policy reduces redundancy and its associated cost.
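As a small illustration of the Monte Carlo step only (not the paper's code; the availabilities, block counts, and the erasure-style "any m of n blocks suffice" model are assumptions made for the example), one can estimate retrievability under a heterogeneous placement:

```python
# Monte Carlo sketch: estimate the probability that an object is
# retrievable when its n redundant blocks sit on heterogeneous nodes
# and any m blocks suffice to reconstruct it.
import random

def retrievability(node_avail, m, trials=100_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        online = sum(rng.random() < p for p in node_avail)  # nodes up this trial
        hits += online >= m
    return hits / trials

if __name__ == "__main__":
    # Blocks placed on nodes with differing online availability.
    print(retrievability([0.99, 0.9, 0.9, 0.6, 0.5], m=3))
    # A homogeneous-assumption policy would treat all five as equally reliable.
```

An optimizer such as PSO can then search over block-to-node assignments, using an estimate like this as (part of) its fitness function.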
Research Inventy : International Journal of Engineering and Scienceinventy
Research Inventy : International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open access journal, available online and in print, that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic and computer engineering, as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers are published by a rapid process within 20 days of acceptance, and the peer review process takes only 7 days. All articles published in Research Inventy are peer-reviewed.
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey IJECEIAES
In the modern era, workflows have been adopted as a powerful and attractive paradigm for expressing and solving a variety of applications, including scientific, data-intensive, and big data applications such as MapReduce and Hadoop. These complex applications are described using high-level representations in workflow methods. With the emergence of cloud computing, scheduling in the cloud has become an important research topic; consequently, the workflow scheduling problem has been studied extensively over the past few years, from homogeneous clusters and grids to the most recent paradigm, cloud computing. The challenges that need to be addressed lie in task-resource mapping, QoS requirements, resource provisioning, performance fluctuation, failure handling, resource scheduling, and data storage. This work presents a complete study of resource provisioning and scheduling algorithms in the cloud, focusing on Infrastructure as a Service (IaaS). We provide a comprehensive understanding of existing scheduling techniques and an insight into research challenges that suggest possible future directions for researchers.
PERFORMANCE EVALUATION OF SQL AND NOSQL DATABASE MANAGEMENT SYSTEMS IN A CLUSTERijdms
In this study, we evaluate the performance of SQL and NoSQL database management systems, namely Cassandra, CouchDB, MongoDB, PostgreSQL, and RethinkDB. We use a cluster of four nodes to run the database systems, with external load generators. The evaluation is conducted using data from Telenor Sverige, a telecommunication company that operates in Sweden. The experiments are conducted using three datasets of different sizes. The write throughput and latency as well as the read throughput and latency are evaluated for four queries: distance query, k-nearest-neighbour query, range query, and region query. For write operations, Cassandra has the highest throughput when multiple nodes are used, whereas PostgreSQL has the lowest latency and the highest throughput for a single node. For read operations, MongoDB has the lowest latency for all queries; however, Cassandra has the highest read throughput. Throughput decreases as the dataset size increases for both writes and reads, for both sequential and random order access; this decrease is more significant for random reads and writes. In this study, we also present the experience we had with these database management systems, including setup and configuration complexity.
Reactive Stream Processing for Data-centric Publish/SubscribeSumant Tambe
The document discusses the Industrial Internet of Things (IIoT) and key challenges in developing a dataflow programming model and middleware for IIoT systems. It notes that IIoT systems involve large-scale distributed data publishing and parallel stream processing. Existing pub/sub middleware such as DDS can handle data distribution but lacks support for composable local data processing. The document proposes combining DDS with reactive programming using Rx.NET to provide a unified dataflow model for both local processing and distribution.
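A tiny plain-Python stand-in can convey the composition idea (this is neither DDS nor Rx.NET; the class, operator set, and data are invented for the illustration):

```python
# Minimal observable-style stream with composable map/filter operators,
# standing in for the Rx-over-DDS composition the document describes.
class Stream:
    def __init__(self):
        self._subs = []

    def subscribe(self, fn):
        self._subs.append(fn)

    def on_next(self, value):
        for fn in self._subs:
            fn(value)

    def map(self, f):
        out = Stream()
        self.subscribe(lambda v: out.on_next(f(v)))
        return out

    def filter(self, pred):
        out = Stream()
        self.subscribe(lambda v: pred(v) and out.on_next(v))
        return out

if __name__ == "__main__":
    # Imagine `readings` being fed by a subscriber on a distributed topic.
    readings = Stream()
    readings.filter(lambda r: r["temp"] > 90) \
            .map(lambda r: f"ALERT {r['sensor']}: {r['temp']}") \
            .subscribe(print)
    readings.on_next({"sensor": "s1", "temp": 95})   # -> ALERT s1: 95
    readings.on_next({"sensor": "s2", "temp": 70})   # filtered out
```

The point of the proposal is that the same operator chain could sit behind a real middleware topic, so local processing composes in the same dataflow style as distribution.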
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes a new dynamic data replication and job scheduling strategy for data grids. The strategy aims to improve data access time and reduce bandwidth consumption by replicating data based on file popularity, storage limitations at nodes, and data category. It replicates more popular files that are in the same category as frequently accessed data to nodes close to where jobs are run. This is intended to optimize performance by locating data and jobs close together. The document provides context on related work and outlines the proposed system architecture and replication/scheduling approach.
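The paper's strategy is richer than this, but a hypothetical sketch of the core decision rule (popularity-ranked files filtered to the hot category, capped by the node's storage limit; scoring and data invented for the example) might look like:

```python
# Sketch of the popularity + category replication rule described above.
def pick_replicas(access_counts, categories, hot_category, capacity):
    """Choose files to replicate to a node near the running jobs: popular
    files first, restricted to the frequently accessed category, until the
    node's storage capacity (here simply a file count) is exhausted."""
    candidates = [f for f in access_counts if categories[f] == hot_category]
    candidates.sort(key=lambda f: access_counts[f], reverse=True)
    return candidates[:capacity]

if __name__ == "__main__":
    counts = {"f1": 40, "f2": 5, "f3": 22, "f4": 31}
    cats = {"f1": "genomics", "f2": "climate", "f3": "genomics", "f4": "climate"}
    print(pick_replicas(counts, cats, hot_category="genomics", capacity=1))
    # ['f1']
```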
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSijcses
Nodes in a mobile ad hoc network are connected wirelessly and the network is auto-configuring [1]. This paper introduces the data warehouse as an alternative for managing data collected by WSNs. Wireless sensor networks produce huge quantities of data that need to be processed and homogenised so as to help researchers and other people interested in the information. Collected data is managed and compared with data coming from other sources and systems, and can feed into technical reports and decision making. This paper proposes a model to design, extract, transform, and normalize data collected by wireless sensor networks by implementing a multidimensional warehouse for comparing many aspects of a WSN, such as routing protocol [4], sensor, sensor mobility, cluster, and so on. A data warehouse defined and applied in this context gives specialists raw data and information for decision processes and lets them navigate from one aspect to another, as sketched below.
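As a toy illustration of the multidimensional idea only (the schema, measure, and data are invented for the example), a fact table of readings can be rolled up along dimensions such as protocol and cluster:

```python
# Toy multidimensional rollup in the spirit of the proposed WSN warehouse.
from collections import defaultdict

facts = [  # (protocol, cluster, sensor, energy_used_mJ)
    ("AODV", "c1", "s1", 12.0), ("AODV", "c1", "s2", 9.5),
    ("DSR",  "c1", "s3", 14.2), ("DSR",  "c2", "s4", 8.1),
]

def rollup(facts, dims):
    """Aggregate the energy measure along the chosen dimension columns."""
    idx = {"protocol": 0, "cluster": 1, "sensor": 2}
    agg = defaultdict(float)
    for row in facts:
        key = tuple(row[idx[d]] for d in dims)
        agg[key] += row[3]
    return dict(agg)

if __name__ == "__main__":
    print(rollup(facts, ["protocol"]))             # compare routing protocols
    print(rollup(facts, ["protocol", "cluster"]))  # drill down by cluster
```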
Implementation of p pic algorithm in map reduce to handle big dataeSAT Publishing House
This document presents an implementation of the p-PIC clustering algorithm using the MapReduce framework to handle big data. p-PIC is a parallel version of the Power Iteration Clustering (PIC) algorithm that can cluster large datasets in a distributed environment. The document first provides background on PIC and the challenges of scaling it to big data, then describes how p-PIC addresses these challenges using MPI for parallelization. The design of p-PIC within MapReduce is presented, including the map and reduce functions. Experimental results on synthetic datasets of up to 100,000 records show that the MapReduce-based p-PIC offers better performance and scalability than the original MPI-based p-PIC implementation.
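The distributed MapReduce/MPI machinery aside, the core PIC computation can be sketched on one machine with NumPy (an illustrative sketch, not the paper's code; note that PIC depends on stopping the power iteration early, since full convergence of a row-stochastic matrix yields a uniform vector):

```python
# Single-machine NumPy sketch of Power Iteration Clustering (PIC): the
# core computation that p-PIC distributes. The 1-D embedding is cut with
# a tiny 2-means; affinity data is invented for the example.
import numpy as np

def pic_embedding(A, iters=10):
    """A few power-iteration steps on the row-normalized affinity matrix
    W = D^-1 A; early stopping keeps the cluster-revealing component."""
    W = A / A.sum(axis=1, keepdims=True)
    v = np.random.default_rng(0).random(A.shape[0])
    v /= np.abs(v).sum()
    for _ in range(iters):
        v = W @ v
        v /= np.abs(v).sum()   # keep the vector from vanishing
    return v

def two_means_1d(v, iters=20):
    """Tiny 1-D 2-means to cut the embedding into two clusters."""
    c = np.array([v.min(), v.max()])
    for _ in range(iters):
        labels = (np.abs(v - c[0]) > np.abs(v - c[1])).astype(int)
        for k in (0, 1):
            if (labels == k).any():
                c[k] = v[labels == k].mean()
    return labels

if __name__ == "__main__":
    # Two obvious blobs: strong affinities within, weak across.
    A = np.array([[9, 8, 1, 1], [8, 9, 1, 1],
                  [1, 1, 9, 8], [1, 1, 8, 9]], float)
    print(two_means_1d(pic_embedding(A)))   # e.g. [1 1 0 0]
```

The matrix-vector product inside the loop is exactly the step that p-PIC parallelizes, whether over MPI ranks or as a map/reduce pass over matrix rows.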
AN INVERTED LIST BASED APPROACH TO GENERATE OPTIMISED PATH IN DSR IN MANETS –...Editor IJCATR
In this paper, we design and formulate an inverted-list-based approach for providing safer paths and effective communication in the DSR protocol. Some nodes in the network participate far more frequently than others, which creates the need for an approach that can make intelligent decisions about allocating bandwidth or other resources to a node or group of nodes. Dynamic Source Routing (DSR) is an on-demand, source routing protocol, whereby all routing information is maintained (continually updated) at the mobile nodes.
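A hypothetical sketch of the inverted-list idea (data and scoring invented for the illustration): index each node to the routes it participates in, then prefer candidate paths whose nodes participate often, treating participation frequency as a rough proxy for reliability.

```python
# Inverted list over known DSR routes: node -> routes it appears in.
from collections import defaultdict

def build_inverted_list(known_routes):
    inv = defaultdict(set)
    for i, route in enumerate(known_routes):
        for node in route:
            inv[node].add(i)
    return inv

def path_score(path, inv):
    """Average participation count of the path's nodes."""
    return sum(len(inv[n]) for n in path) / len(path)

if __name__ == "__main__":
    routes = [["A", "B", "C"], ["A", "B", "D"], ["E", "F"]]
    inv = build_inverted_list(routes)
    cand1, cand2 = ["A", "B", "C"], ["E", "F", "C"]
    print(max((cand1, cand2), key=lambda p: path_score(p, inv)))
    # ['A', 'B', 'C'] -- its nodes participate in more known routes
```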
An Optimal Cooperative Provable Data Possession Scheme for Distributed Cloud ...IJMER
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum for scholarly research related to engineering and science education.
A location based least-cost scheduling for data-intensive applicationsIAEME Publication
This document summarizes a research paper that proposes a location-based least-cost scheduling algorithm for transferring multiple data-intensive files simultaneously to multiple compute nodes in a grid environment. The proposed model includes an optimized meta-scheduler that receives multiple files, predicts the optimal number of parallel TCP streams to use for each file transfer based on sampling, and schedules the files to compute nodes using a greedy algorithm that considers location and cost. Experimental results showed the optimized model achieved better transfer times and throughput compared to non-optimized transfers.
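The paper's TCP-stream prediction step is not modeled here; a hypothetical greedy sketch of the location- and cost-aware assignment (all names and numbers invented for the example) might look like the following, where each file goes to the node on which its transfer would finish earliest given per-location bandwidth:

```python
# Greedy location-aware file-to-node scheduling sketch: transfer-time
# estimates come from per-location bandwidth, and each node's link is
# modeled as busy until its previously assigned transfers finish.
def schedule(files, nodes, bandwidth):
    """files: {name: size_MB}; nodes: {name: location};
    bandwidth: {location: MB_per_s}. Returns {file: (node, finish_s)}."""
    finish = {n: 0.0 for n in nodes}   # when each node's link frees up
    plan = {}
    for f, size in sorted(files.items(), key=lambda kv: -kv[1]):  # big first
        # pick the node where this transfer would complete earliest
        node = min(nodes, key=lambda n: finish[n] + size / bandwidth[nodes[n]])
        finish[node] += size / bandwidth[nodes[node]]
        plan[f] = (node, round(finish[node], 2))
    return plan

if __name__ == "__main__":
    files = {"genome.dat": 4000, "logs.tar": 800}
    nodes = {"n1": "local-rack", "n2": "remote-site"}
    bw = {"local-rack": 100.0, "remote-site": 20.0}
    print(schedule(files, nodes, bw))
    # {'genome.dat': ('n1', 40.0), 'logs.tar': ('n2', 40.0)}
```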
International Journal of Computational Engineering Research (IJCER) is an international, monthly, English-language online journal. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
A Survey of File Replication Techniques In Grid SystemsEditor IJCATR
A grid is a type of parallel and distributed system designed to provide reliable access to data and computational resources over wide area networks. These resources are distributed across different geographical locations. Efficient data sharing in global networks is complicated by erratic node failure, unreliable network connectivity, and limited bandwidth. Replication is a technique used in grid systems to improve applications' response time and to reduce bandwidth consumption. In this paper, we present a survey of basic and new replication techniques proposed by other researchers, followed by a full comparative study of these replication strategies.