Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. It also serves as an elementary foundation for supervised learning, which encompasses classifier and feature-extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed to tackle such scenarios. The Apache Hadoop distribution is one such cluster framework for distributed environments; it helps by distributing voluminous data across a number of nodes in the cluster. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
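To make the map/reduce formulation concrete, here is a minimal sketch of one Apriori candidate-counting pass, with plain Python generators standing in for the Hadoop map and reduce phases; the transactions and support threshold are illustrative, not taken from the paper.

```python
from itertools import combinations
from collections import Counter

# Toy transactions; in the paper's setting these would be HDFS splits.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def mapper(transaction, candidates):
    # Emit (candidate, 1) for every candidate itemset contained in the transaction.
    for c in candidates:
        if c <= transaction:
            yield frozenset(c), 1

def reducer(pairs, min_support):
    # Sum counts per candidate and keep those meeting the support threshold.
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: n for s, n in counts.items() if n >= min_support}

# One Apriori pass: count size-2 candidates built from all observed items.
items = set().union(*transactions)
candidates = [frozenset(p) for p in combinations(sorted(items), 2)]
pairs = (kv for t in transactions for kv in mapper(t, candidates))
print(reducer(pairs, min_support=3))
```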
With the rapid development of Geographic Information Systems (GISs) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access remain a major problem for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data query. The paper describes a framework for integrating information from repositories containing different vector data set formats and repositories containing raster datasets. The presented approach converts different vector data formats into a single unified format (File Geo-Database "GDB"). In addition, we employ metadata to support a wide range of user queries that retrieve relevant geographic information from heterogeneous and distributed repositories. This use of metadata enhances both query processing and performance.
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents the information about the data stored in Data Warehouses and is a mandatory element in building an efficient Data Warehouse. Metadata helps in data integration, lineage, data quality and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for the spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time while accurate information, including the spatial information, is fetched from the Data Warehouse.
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR... (cscpconf)
Data mining is one of the most significant tools for discovering association patterns that are useful in many knowledge domains. Yet there are drawbacks in existing mining techniques. Three main weaknesses of current data-mining techniques are: 1) the entire database must be re-scanned whenever new attributes are added; 2) an association rule may hold at a certain granularity but fail at a finer one, and vice versa; 3) current methods can find either frequent rules or infrequent rules, but not both at the same time. This research proposes a novel data schema and an algorithm that address these weaknesses while improving the efficiency and effectiveness of data mining strategies. The crucial mechanisms in each step are clarified in this paper. Finally, the paper presents experimental results on the efficiency, scalability, and information loss of the proposed approach to demonstrate its advantages.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. Semi-supervised classification uses both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy improvement of 4% on average, and up to 20%, compared with the traditional semi-supervised model.
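Scikit-learn ships no transductive SVM, so the sketch below uses its self-training wrapper around a linear SVM as a rough stand-in for the semi-supervised setup described above, and omits the ontology enrichment entirely; the corpus and labels are toy assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny corpus; -1 marks unlabeled documents, as the semi-supervised setting requires.
docs = ["cheap flights and hotel deals", "win a free prize now",
        "flight delayed at the airport", "claim your free prize money",
        "hotel booking for the flight", "prize draw winners announced"]
labels = np.array([0, 1, 0, 1, -1, -1])  # 0 = travel, 1 = spam, -1 = unlabeled

X = TfidfVectorizer().fit_transform(docs)

# Self-training wraps a base SVM: a pragmatic stand-in for a transductive SVM.
model = SelfTrainingClassifier(SVC(probability=True, kernel="linear"))
model.fit(X, labels)
print(model.predict(X[4:]))  # predictions for the two unlabeled documents
```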
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA (cscpconf)
Data mining applications often involve complex data such as multiple heterogeneous data sources, user preferences, decision-making actions and business impacts. A single data mining method cannot extract all the useful information in the form of informative patterns, and joining large relevant data sources to discover patterns covering every aspect of the useful information would consume excessive time and space. We therefore consider combined mining as an approach for mining informative patterns from multiple data sources, multiple features, or by multiple methods, as the requirements dictate. In the combined mining approach, we applied the Lossy Counting algorithm on each data source to obtain the frequent itemsets and then derived the combined association rules. In the multi-feature combined mining approach, we obtained pair patterns and cluster patterns and then generated incremental pair patterns and incremental cluster patterns, which cannot be generated directly by existing methods. In the multi-method combined mining approach, we combine FP-growth and a Bayesian Belief Network to build a classifier that yields more informative knowledge.
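The per-source frequency step above names the Lossy Counting algorithm; below is a minimal single-stream sketch of Manku and Motwani's method, not the paper's multi-source pipeline. The stream and thresholds are illustrative.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy counting: frequency estimates with error at most epsilon * N."""
    width = math.ceil(1 / epsilon)          # bucket width
    counts, deltas = {}, {}
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1
        if n % width == 0:                  # bucket boundary: prune infrequent entries
            for x in [x for x in counts if counts[x] + deltas[x] <= bucket]:
                del counts[x], deltas[x]
            bucket += 1
    return counts, n

# Items whose true frequency exceeds s * N are guaranteed to be reported.
stream = list("aabacadaaabbbca")
counts, n = lossy_count(stream, epsilon=0.2)
s = 0.3
print({x: c for x, c in counts.items() if c >= (s - 0.2) * n})
```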
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes) or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the outcomes provided by search engines in reply to users' queries. In this paper, we provide a comprehensive survey of document clustering.
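To illustrate why the criterion must come from the user, the toy sketch below scores K-means clusterings of three synthetic blobs with one internal criterion, the silhouette; a different goal, such as aggressive data reduction, could justify a k that this score does not favor. The data and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three loose blobs; the "right" k depends on the user's goal, not the data alone.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ((0, 0), (4, 0), (2, 3))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # one criterion among many
```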
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION... (ijcsit)
The wide use of the Internet and the availability of powerful computers and high-speed networks as low-cost commodity components have had a deep impact on the way we use computers today: these technologies have made it possible to use multi-owner, geographically distributed resources to address large-scale problems in areas such as science, engineering, and commerce. The paradigm of Grid computing has evolved from research on these topics. Performance and utilization of the grid depend on a complex and highly dynamic procedure for optimally balancing the load among the available nodes. In this paper, we suggest a novel two-dimensional figure of merit that captures network effects on load balance and fault-tolerance estimation to improve network utilization. Fault tolerance is enhanced by adaptively decreasing replication time and message cost, while load balance is improved by adaptively decreasing mean job response time. Finally, Genetic Algorithm, Ant Colony Optimization, and Particle Swarm Optimization are analysed with regard to their solutions, issues and improvements concerning load balancing in the computational grid. A significant improvement in system utilization was attained, and experimental results demonstrate that the proposed method's performance surpasses that of other methods.
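The abstract does not spell out its two-dimensional figure of merit, so the snippet below only illustrates the shape of such a score: a hypothetical weighted combination of a load-balance term and a fault-tolerance proxy that a GA, ACO, or PSO search could minimize. All names, weights, and the cost model are assumptions.

```python
# Hypothetical two-objective score for a candidate task assignment; the paper's
# actual figure of merit is not reproduced here, this is only an illustration.
def figure_of_merit(assignment, job_cost, node_speed, w_load=0.7, w_fault=0.3):
    # assignment[i] = index of the node that runs job i
    loads = {}
    for job, node in enumerate(assignment):
        loads[node] = loads.get(node, 0.0) + job_cost[job] / node_speed[node]
    mean_response = sum(loads.values()) / len(node_speed)          # load-balance term
    worst_node = max(loads.values())                               # fault-tolerance proxy
    return w_load * mean_response + w_fault * worst_node           # lower is better

print(figure_of_merit([0, 1, 0, 2], job_cost=[4, 2, 3, 5], node_speed=[1.0, 0.5, 2.0]))
```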
A Survey of File Replication Techniques In Grid Systems (IJCATR)
A grid is a type of parallel and distributed system designed to provide reliable access to data and computational resources in wide area networks. These resources are distributed across different geographical locations. Efficient data sharing in global networks is complicated by erratic node failure, unreliable network connectivity and limited bandwidth. Replication is a technique used in grid systems to improve applications' response time and to reduce bandwidth consumption. In this paper, we present a survey of basic and new replication techniques that have been proposed by other researchers, followed by a full comparative study of these replication strategies.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been studied extensively, and there are many algorithms for different types of clustering, but these classical algorithms cannot be applied to big data because of its distinct features; applying traditional techniques to large unstructured data is a challenge. This study proposes a hybrid model to cluster big data using the well-known traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase and a Reduce phase. The first phase uses a map-reduce algorithm to split big data into small datasets, the second phase runs the traditional K-means clustering algorithm on each of the split small data sets, and the last phase is responsible for producing the general cluster output for the complete data set. Two functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the more suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results proved the efficiency of the proposed model in clustering big data with the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
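Here is a compact sketch of the three-phase idea in plain Python with scikit-learn in place of Hadoop: split the data (map), run K-means per split, then combine the local results. The reduce step below simply clusters the local centroids, a stand-in for the Mode and Fuzzy Gaussian combination functions the study compares; the data and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.4, size=(300, 2)) for c in ((0, 0), (5, 5), (0, 5))])
rng.shuffle(data)

k, n_splits = 3, 4
splits = np.array_split(data, n_splits)                       # phase 1: map/split

local_centroids = np.vstack([                                 # phase 2: per-split K-means
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(s).cluster_centers_
    for s in splits
])

# Phase 3 (reduce): combine local results by clustering the local centroids,
# a simple stand-in for the paper's Mode / Fuzzy Gaussian combination functions.
global_model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(local_centroids)
print(global_model.cluster_centers_.round(2))
```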
Recommendation system using bloom filter in MapReduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews provided by other clients and by specialists. Recommender systems offer an important response to the information overload problem, as they present users with more practical and personalized information facilities. Collaborative filtering methods are a vital component of recommender systems, since they generate high-quality recommendations by leveraging the preferences of a community of similar users; the collaborative filtering method assumes that people with the same tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse-data problem and a lack of scalability, so a new recommender system is required to deal with sparse data and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering on MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations for data analysis is the join, but MapReduce is not very efficient at executing joins, as it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the bloomjoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using Bloom filters reduces the number of intermediate results and improves join performance.
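A minimal sketch of the bloomjoin idea: build a Bloom filter over the join keys of the small dataset, then use it map-side to drop non-joining records from the large dataset before the shuffle. The filter parameters, hash scheme, and datasets are illustrative; false positives are possible by design and are resolved in the actual join.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        # Derive several bit positions per key from salted MD5 digests.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[p] for p in self._positions(key))

ratings = [("u1", 5), ("u2", 3), ("u3", 4)]                   # small side of the join
clicks = [("u2", "itemA"), ("u9", "itemB"), ("u3", "itemC")]  # large side of the join

bf = BloomFilter()
for user, _ in ratings:
    bf.add(user)

# Map-side filtering: only records that may have a matching user reach the reducer.
filtered = [rec for rec in clicks if bf.might_contain(rec[0])]
print(filtered)
```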
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores from both the data sources and the clustered duplicates.
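The abstract does not spell out its probabilistic scores, so the sketch below only illustrates the selection step: candidate values for one attribute are scored by an assumed per-source reliability, and the highest-scoring value is fused into the single representative record. The sources, values, and weights are all hypothetical.

```python
from collections import defaultdict

# (source, value) pairs for one real-world object's "city" attribute,
# gathered from a cluster of duplicate records.
duplicates = [("A", "Cairo"), ("B", "Cairo"), ("C", "Kairo")]
source_reliability = {"A": 0.9, "B": 0.7, "C": 0.4}   # assumed source scores

scores = defaultdict(float)
for source, value in duplicates:
    scores[value] += source_reliability[source]        # reliability-weighted vote

fused_value = max(scores, key=scores.get)
print(fused_value)  # "Cairo": the highest-scoring candidate is selected
```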
A New Architecture for Group Replication in Data Grid (IJCATR)
Nowadays, grid systems are a vital technology for running high-performance programs and solving large-scale problems in science, engineering and business. In grid systems, heterogeneous computational resources and data are shared between independent organizations that are scattered geographically. A data grid is a kind of grid that relates computational and storage resources. Data replication is an efficient way to obtain high performance and high availability in a data grid by saving numerous replicas in different locations, e.g. grid sites. In this research, we propose a new architecture for dynamic group data replication. In our architecture, we add two components to the OptorSim architecture: a Group Replication Management component (GRM) and a Management of Popular Files Group component (MPFG). OptorSim was developed by the European DataGrid project to evaluate replication algorithms. Using this architecture, popular file groups are replicated to grid sites at the end of each predefined time interval.
The huge volume of text documents available on the internet has made it difficult for users to find valuable information, so efficient applications for extracting knowledge of interest from textual documents are vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed, in order to design and develop a system for analysing and extracting useful patterns from text documents. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in the data set, and a Vector Space Model (VSM) is then used to represent the dataset. The system was implemented in two main phases: in phase 1, the clustering analysis process groups documents into several clusters, while in phase 2, an information retrieval process ranks clusters against the user query in order to retrieve the relevant documents from the specific clusters deemed relevant. The results are evaluated using Recall and Precision (P@5, P@10) of the retrieved results; P@5 was 0.660 and P@10 was 0.655.
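A toy sketch of the two phases under the stated VSM representation: TF-IDF vectors are clustered with K-means, and a query is matched to the closest cluster centroid before retrieving that cluster's documents. The corpus, query, and cluster count are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apriori frequent itemset mining", "mapreduce cluster for big data",
        "frequent pattern growth algorithm", "hadoop distributed file system",
        "association rule support confidence"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                                  # VSM representation

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # phase 1: cluster

query = vec.transform(["frequent itemset algorithms"])       # phase 2: rank clusters
sims = cosine_similarity(query, km.cluster_centers_)[0]
best = sims.argmax()
hits = [d for d, c in zip(docs, km.labels_) if c == best]
print(hits)  # documents retrieved from the cluster closest to the query
```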
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
The Science Information Network (SINET) is a Japanese academic backbone network serving more than 800 universities and research institutions. SINET traffic is characteristically enormous and highly variable. In this paper, we present task-decomposition based anomaly detection for massive and high-volatility session data on SINET. Three main features are discussed: task scheduling, traffic discrimination, and histogramming. We adopt a task-decomposition based dynamic scheduling method to handle SINET's massive session data stream. In the experiment, we analysed SINET traffic from 2/27 to 3/8 and detected some anomalies using LSTM-based time-series data processing.
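The paper's detector is LSTM-based; the sketch below keeps only the surrounding thresholding logic visible, using the error of a simple moving-average forecaster on synthetic traffic so that no deep-learning dependency is needed. The series, window, and 3-sigma threshold are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
traffic = rng.normal(100, 5, size=500)      # synthetic per-interval traffic volume
traffic[350:355] += 60                      # injected burst to detect

# Moving-average forecaster standing in for the LSTM predictor.
window = 20
preds = np.convolve(traffic, np.ones(window) / window, mode="valid")[:-1]
errors = np.abs(traffic[window:] - preds)   # prediction error per time step

threshold = errors.mean() + 3 * errors.std()
anomalies = np.flatnonzero(errors > threshold) + window
print(anomalies)                            # indices flagged as anomalous
```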
Data mining is used to manage the huge amounts of information stored in data warehouses and databases and to discover the required information and knowledge. Many data mining techniques have been proposed, for example association rules, decision trees, neural networks, clustering, and so on, and the field has been a focus of attention for many years. Among the available data mining strategies, clustering of the dataset is one of the most effective: it groups the dataset into a number of clusters based on certain predefined guidelines and can reliably discover the connections between the distinctive characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed to cluster the data points. In this work, the authors enhance the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant, dissimilar points within clusters remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
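A small sketch of the gating idea described above: a point joins its nearest cluster only if an assumed distance-derived similarity clears a threshold, otherwise it is left unassigned. The similarity mapping and threshold are illustrative, not the paper's exact similarity function.

```python
import numpy as np

def gated_kmeans_assign(X, centroids, min_similarity=0.5):
    """Assign each point to its nearest centroid only if it is similar enough;
    otherwise mark it -1 (unassigned). A sketch of the gating idea only."""
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        sim = 1.0 / (1.0 + d.min())        # illustrative distance-to-similarity map
        if sim >= min_similarity:
            labels[i] = d.argmin()
    return labels

X = np.array([[0.1, 0.0], [0.2, 0.1], [5.0, 5.0], [9.9, 9.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(gated_kmeans_assign(X, centroids))   # the far point stays unassigned
```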
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications, such as information retrieval, image processing and social network analytics, and it helps the user understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Various researchers have analysed different types of clustering algorithms. K-means is the most popular partitioning-based algorithm and provides good results through accurate calculation on numerical data, but K-means works well for numerical data only, whereas big data is a combination of numerical and categorical data. The K-prototype algorithm deals with numerical as well as categorical data by combining the distances calculated from the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculation and so on, there are vast collections of structured, semi-structured and unstructured data, so K-prototype needs to be optimized to analyse these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce achieves better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better at large scales of data.
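The sketch below shows the standard K-prototypes dissimilarity the abstract alludes to: squared Euclidean distance on the numeric attributes plus a gamma-weighted count of categorical mismatches. The record, prototype, and gamma value are illustrative.

```python
import numpy as np

def kprototype_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """K-prototypes distance: squared Euclidean on numeric attributes plus a
    weighted count of categorical mismatches (gamma balances the two parts)."""
    numeric = np.sum((np.asarray(x_num) - np.asarray(c_num)) ** 2)
    categorical = sum(a != b for a, b in zip(x_cat, c_cat))
    return numeric + gamma * categorical

# A mixed record split into numeric and categorical parts, as an
# "intelligent splitter" in the paper's sense would produce.
record = ([35.0, 52000.0], ["engineer", "urban"])
prototype = ([40.0, 50000.0], ["engineer", "rural"])
print(kprototype_distance(*record, *prototype, gamma=0.5))
```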
A plethora of data is continuously generated by the Internet and other information sources. Analyzing this massive data in real time and extracting valuable knowledge with different mining platforms has been an active area for both research and industry. However, data stream mining poses challenges that make it different from traditional data mining. Recently, many studies have addressed the concerns of massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real-time clustering and classification mining techniques for data streams, analyze the characteristics of data stream mining, discuss its challenges and research issues, and present some of the platforms for data stream mining.
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...) (iosrjce)
In distributed peer-to-peer systems, huge amounts of data are dispersed, and grouping those data from multiple sources is a tedious task. Applying effective data mining techniques makes clustering of distributed data easier and lowers the processing, storage, and transmission costs that otherwise hinder it. To perform dynamic distributed clustering, a fully decentralized clustering method is proposed. HGD Cluster can cluster a data set dispersed among a large number of nodes in a distributed environment using hierarchical and grid-based clustering techniques, and it applies even when nodes are fully asynchronous, decentralized, and subject to churn. The general design principles employed in the proposed algorithm also allow customization for other classes of clustering, and it is fully capable of clustering dynamic and distributed data sets. Using the algorithm, every node can maintain summarized views of the dataset; customizing HGD Cluster to execute the hierarchical and grid-based clustering methods on these summarized views is the main aim of the proposed system. Coping with dynamic data is made possible by gradually adapting the clustering model.
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS (ijcses)
Nodes in a Mobile Ad-hoc network are connected wirelessly and the network is auto-configuring [1]. This paper introduces the usefulness of a data warehouse as an alternative for managing data collected by WSNs. A Wireless Sensor Network produces huge quantities of data that need to be processed and homogenised in order to help researchers and other people interested in the information. Collected data is managed and compared with data coming from other sources, and the systems involved can contribute to technical reports and decision making. This paper proposes a model to design, extract, transform and normalize data collected by Wireless Sensor Networks by implementing a multidimensional warehouse for comparing many aspects of a WSN, such as routing protocol [4], sensor, sensor mobility, and cluster. Hence, a data warehouse defined and applied to the above context is presented as a useful approach that gives specialists raw data and information for decision processes and lets them navigate from one aspect to another.
Frequent pattern mining techniques help to find interesting trends or patterns in massive data, and prior domain knowledge helps in deciding an appropriate minimum support threshold. This review article presents different frequent pattern mining techniques, based on Apriori, FP-tree, or user-defined techniques, under different computing environments such as parallel and distributed settings and available data mining tools; these help to determine interesting frequent patterns/itemsets with or without prior domain knowledge. The review is intended to help in developing efficient and scalable frequent pattern mining techniques.
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S... (IAEME Publication)
Association rule mining plays an important role in decision support systems. Nowadays, in the era of the internet, various online marketing sites and social networking sites generate enormous amounts of structured/semi-structured data in the form of sales data, tweets, emails, web pages and so on. This online-generated data is so large that it becomes very complex and time-consuming to process and analyze with traditional systems. This paper overcomes the main-memory bottleneck of single computing systems. The paper has two major goals. First, a big sales dataset from AMUL dairy is preprocessed using Hadoop MapReduce to convert it into a transactional dataset; then, after removing the null transactions, the distributed frequent pattern mining algorithm MR-DARM (Map Reduce based Distributed Association Rule Mining) is used to find the most frequent itemsets, and strong association rules are generated from them. Second, the paper compares the time efficiency of the MR-DARM algorithm with the existing Count Distributed Algorithm (CDA) and Fast Distributed Mining (FDM) distributed frequent pattern mining algorithms. The compared algorithms are presented together with experimental results that lead to the final conclusions.
Multi-threaded approach in generating frequent itemset of Apriori algorithm b... (TELKOMNIKA JOURNAL)
This research concerns the application of multi-threading and trie data structures to the support-calculation problem in the Apriori algorithm. The support calculation results feed the search for association rules in market basket analysis problems; the support calculation is a bottleneck and can delay the following stages. This work observed five multi-threaded models based on Flynn's taxonomy: single program, multiple data (SPMD); multiple program, single data (MPSD); multiple program, multiple data (MPMD); and two double-SPMD variants, all aimed at shortening the processing time of the support calculation. In addition to processing time, this work also considers the time difference between the multi-threaded models as the number of item variants increases. The experimental timings show that the model applying a double-SPMD structure can perform almost three times faster than the models applying the SPMD structure, the MPMD structure, or a combination of MPMD and SPMD, based on the time difference between the 5-itemset and 10-itemset experimental results.
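To make the SPMD structure concrete, the sketch below runs the same support-counting routine over different chunks of the transaction list on a thread pool and merges the partial counts. Note that CPython's GIL means this illustrates the structure rather than real speedup; the data and candidates are toy assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# SPMD-style support counting: one program, many data chunks, merged at the end.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}] * 1000
candidates = [frozenset({"a", "b"}), frozenset({"b", "c"})]

def count_chunk(chunk):
    # The single shared "program": count candidate occurrences in one chunk.
    c = Counter()
    for t in chunk:
        for cand in candidates:
            if cand <= t:
                c[cand] += 1
    return c

n_threads = 4
size = len(transactions) // n_threads
chunks = [transactions[i * size:(i + 1) * size] for i in range(n_threads)]

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    total = sum(pool.map(count_chunk, chunks), Counter())
print(total)
```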
TAXONOMY OF OPTIMIZATION APPROACHES OF RESOURCE BROKERS IN DATA GRIDS (ijcsit)
A novel taxonomy of replica selection techniques is proposed. We studied several data grid approaches whose data-management selection strategies differ. The aim of the study is to determine the common concepts, observe the performance of these approaches, and compare their performance with our strategy.
Optimization of resource allocation in computational grids (ijgca)
Resource allocation in a Grid computing system needs to be scalable, reliable and smart, and it should adapt its allocation mechanism to the environment and the user's requirements. Therefore, a scalable and optimized approach for resource allocation, in which the system can adapt itself to a changing environment and fluctuating resources, is essential. In this paper, a Teaching-Learning-Based Optimization approach for resource allocation in computational grids is proposed. The proposed algorithm is found to outperform existing ones in terms of execution time and cost. The algorithm is simulated using GridSim and the simulation results are presented.
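The abstract names Teaching-Learning-Based Optimization without detailing it, so below is a minimal generic TLBO loop (teacher phase, then learner phase) minimizing a toy continuous objective that stands in for an allocation cost model. The population size, iteration count, and objective are assumptions.

```python
import numpy as np

def tlbo(objective, bounds, pop_size=20, iters=100, seed=0):
    """Minimal Teaching-Learning-Based Optimization for a continuous objective."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    pop = rng.uniform(low, high, size=(pop_size, len(low)))
    fit = np.apply_along_axis(objective, 1, pop)
    for _ in range(iters):
        # Teacher phase: move everyone toward the best solution, away from the mean.
        teacher = pop[fit.argmin()]
        tf = rng.integers(1, 3)                     # teaching factor, 1 or 2
        new = np.clip(pop + rng.random(pop.shape) * (teacher - tf * pop.mean(axis=0)),
                      low, high)
        new_fit = np.apply_along_axis(objective, 1, new)
        better = new_fit < fit
        pop[better], fit[better] = new[better], new_fit[better]
        # Learner phase: each learner moves relative to a random peer.
        for i in range(pop_size):
            j = rng.integers(pop_size)
            if j == i:
                continue
            direction = pop[j] - pop[i] if fit[j] < fit[i] else pop[i] - pop[j]
            cand = np.clip(pop[i] + rng.random(len(low)) * direction, low, high)
            cf = objective(cand)
            if cf < fit[i]:
                pop[i], fit[i] = cand, cf
    return pop[fit.argmin()], fit.min()

# Toy use: minimize the sphere function (stand-in for an allocation cost model).
best, cost = tlbo(lambda x: float(np.sum(x ** 2)),
                  (np.array([-5.0] * 3), np.array([5.0] * 3)))
print(best.round(3), round(cost, 6))
```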
Grid computing can involve a large number of computational tasks, which require trustworthy computational nodes. Load balancing in grid computing is a technique that optimizes the overall process of assigning computational tasks to processing nodes. Grid computing is a form of distributed computing, but it differs from conventional distributed computing in that it tends to be heterogeneous, more loosely coupled and geographically dispersed. Optimizing this process entails maximizing overall resource utilization while balancing the load on each processing unit and decreasing the overall completion time. Evolutionary algorithms such as genetic algorithms have been studied for the implementation of load balancing across grid networks, but the problem with these genetic algorithms is that they are quite slow when a large number of tasks needs to be processed. In this paper we give a novel approach based on parallel genetic algorithms for enhancing the overall performance and optimization of the whole load-balancing process across the grid nodes.
An effective search on web log from most popular downloaded content - ijdpsjournal
A Web page recommender system effectively predicts the best related web page for a search. While searching for a word, a search engine may display unnecessary links and data unrelated to the user; to avoid this problem, the conceptual prediction model combines both web usage and domain knowledge. The proposed model automatically generates a semantic network of semantic Web usage knowledge, which is the integration of domain knowledge and web usage information. Web usage mining aims to discover interesting and frequent user access patterns from web browsing data. The discovered knowledge can then be used for many practical web applications such as web recommendations, adaptive web sites, and personalized web search and surfing.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMINOUS DATA-SETS
Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 6, November 2012. DOI: 10.5121/acij.2012.3604
Anjan K Koundinya, Srinath N K, K A K Sharma, Kiran Kumar, Madhu M N and Kiran U Shanbag
Department of Computer Science and Engineering, R V College of Engineering, Bangalore, India
annjank2@gmail.com
ABSTRACT
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
KEYWORDS
Frequent Itemset, Distributed Computing, Hadoop, Apriori, Distributed Data Mining
1. INTRODUCTION
In many real-world applications, the generated data is of great concern to the stakeholder, as it delivers meaningful information / knowledge that assists in predictive analysis. This knowledge helps in modifying certain decision parameters of the application, which changes the overall outcome of a business process. The volume of data, collectively called data-sets, generated by an application is very large, so large data-sets need to be processed efficiently. The data collected may come from heterogeneous sources and may be structured or unstructured. Processing such data generates useful patterns from which knowledge can be extracted.
Data mining is the process of finding correlations or patterns among fields in large data-sets and building up a knowledge-base, based on the given constraints. The overall goal of data mining is to extract knowledge from an existing data-set and transform it into a human-understandable structure for further use. This process is often referred to as Knowledge Discovery in Data-sets (KDD), and it has revolutionized the approach to solving complex real-world problems. The KDD process consists of a series of tasks such as selection, pre-processing, transformation, data mining and interpretation, as shown in Figure 1.
Figure 1: KDD Process
A distributed computing environment is a collection of loosely coupled processing nodes connected by a network. Each node contributes to the execution or to the distribution / replication of data; such a collection is referred to as a cluster of nodes. There are various methods of setting up a cluster, one of which is usually referred to as a cluster framework. Such frameworks enforce the setting up of processing and replication nodes for data; examples are DryadLINQ and Apache Hadoop (also called Map/Reduce). The other methods involve setting up cluster nodes on an ad-hoc basis, without being bound by a rigid framework; such methods just involve a set of API calls, basically for remote method invocation (RMI) as a part of inter-process communication. Examples are the Message Passing Interface (MPI) and a variant of MPI called MPJ Express.
The method of setting up a cluster depends upon the data densities and upon the scenarios listed below:
• The data is generated at various locations and needs to be accessed locally most of the time for processing.
• The data and processing are distributed to the machines in the cluster to reduce the impact of any particular machine being overloaded, which would harm its processing.
This paper is organized as follows. The next section presents a survey of related work in the domain of distributed data mining, focused especially on finding frequent itemsets. Section 3 discusses the design and implementation of the Apriori algorithm tuned to the distributed environment, with a key focus on the experimental test-bed requirements. Section 4 discusses the results of the test setup based on Map/Reduce – Hadoop. Finally, Section 5 concludes our work.
2. RELATED WORK
Distributed Data Mining in Peer-to-Peer Networks (P2P) [1] offers an overview of
distributed data-mining applications and algorithms for peer-to-peer environments. It describes
both exact and approximate distributed data-mining algorithms that work in a decentralized
manner. It illustrates these approaches for the problem of computing and monitoring clusters in
the data residing at the different nodes of a peer-to-peer network. This paper focuses on an
emerging branch of distributed data mining called peer-to-peer data mining. It also offers a
sample of exact and approximate P2P algorithms for clustering in such distributed
environments.
A Web Service-based approach for data mining in distributed environments [2] presents an approach to developing a data mining system in distributed environments. The system built using this approach offers a uniform presentation and storage mechanism, a platform-independent interface, and a dynamically extensible architecture. The proposed approach also allows users to classify new incoming data by selecting one of the previously learnt models.
Architecture for data mining in distributed environments [3] describes a system architecture for scalable and portable distributed data-mining applications. The approach presents a document metaphor called "Living Documents" for accessing and searching digital documents in modern distributed information systems. The paper also describes a corpus-linguistic analysis of large text corpora based on collocations, with the aim of extracting semantic relations from unstructured text.
Distributed Data Mining of Large Classifier Ensembles [4] presents a new classifier combination strategy that scales up efficiently and achieves both high predictive accuracy and tractability for problems of high complexity. It induces a global model by learning from the averages of the local classifiers' outputs, thereby achieving an effective combination of a large number of classifiers.
Multi Agent-Based Distributed Data Mining (MADM) [5] is the integration of multi-agent systems and distributed data mining. The perspective here is in terms of significance, system overview, existing systems, and research trends. The paper presents an overview of MADM systems that are prominently in use, defines the components common to such systems, and describes their strategies and architectures.
Preserving Privacy and Sharing the Data in Distributed Environment using Cryptographic
Technique on Perturbed data [6] proposes a framework that allows systematic transformation of
original data using randomized data perturbation technique. The modified data is then submitted
to the system through cryptographic approach. This technique is applicable in distributed
environments where each data owner has his own data and wants to share this with the other
data owners. At the same time, this data owner wants to preserve the privacy of sensitive data in
the records.
Distributed anonymous data perturbation method for privacy-preserving data mining [7] discusses a light-weight anonymous data perturbation method for efficient privacy preservation in distributed data mining. Two protocols are proposed to address these constraints and to protect data statistics and the randomization process against collusion attacks.
An Algorithm for Frequent Pattern Mining Based on Apriori [8] proposes three different frequent-pattern mining approaches (Record filter, Intersection and the Proposed Algorithm) based on the classical Apriori algorithm. The paper performs a comparative study of all three approaches on a data-set of 2000 transactions, surveys the existing association rule mining techniques, and compares the algorithms with their modified approach.
Using Apriori-like Algorithms for Spatio-Temporal Pattern Queries [9] presents a way to construct Apriori-like algorithms for mining spatio-temporal patterns. The paper addresses the different types of comparing functions that can be used to mine frequent patterns.
Map-Reduce for Machine Learning on Multicore [10] discusses ways to develop a broadly applicable parallel programming paradigm that applies to different learning algorithms. By taking advantage of the summation form in a map-reduce framework, the paper parallelizes a wide range of machine learning algorithms and achieves a significant speed-up on dual processor cores.
Using Spot Instances for MapReduce Workflows [11] describes new techniques to improve the runtime of MapReduce jobs. The paper presents Spot Instances (SI) as a means of attaining performance gains at low monetary cost.
3. DESIGN AND IMPLEMENTATION
3.1 Experimental Setup
The experimental setup has three nodes connected to a managed switch on a private LAN. One of these nodes is configured as the Hadoop master, or namenode, which controls the data distribution over the Hadoop cluster. All the nodes are identical in terms of system configuration, i.e., all the nodes have an identical processor (Intel Core 2 Duo) and are assembled by a standard manufacturer. As an investigative effort, the configuration made to understand Hadoop has three nodes in fully distributed mode. The intention is to scale the number of nodes by using standard cluster-management software that can easily add new nodes to Hadoop, rather than installing Hadoop on every node. This setup is visualized in Figure 2.
Figure 2: Experimental Setup for Hadoop Multi-node
3.1.1 Special Configuration Setup for Hadoop
The focus here is on a Hadoop setup with a multi-node configuration. The prerequisites for Hadoop installation are mentioned below –
• The set-up requires Java installed on the system; before installing Hadoop, make sure that the JDK is up and running.
• The operating system can be Ubuntu, Suse or Fedora.
• Create a new user group and user on the Linux system. This has to be replicated on all the nodes planned for the cluster setup. This user and group will be used exclusively for Hadoop operations.
3.1.1.1 Single Node Setup
Hadoop must first be configured in the single-node setup, also called pseudo-distributed mode, which runs on a standalone machine to start with. Later this setup is extended to a fully configured multi-node setup with at least two nodes and a basic network interconnect such as Cat 6 cable. The required configuration for the single-node setup is given below –
• Download Hadoop version 0.20.x from the Apache Hadoop site; it arrives as a tar ball (tar.gz). Change the permissions of the file to 755 or 765 (octal representation for R W X).
• Unzip the file to /usr/local and create a folder Hadoop.
• Update the bashrc file located at $HOME/.bashrc. The content to be appended to this file is illustrated at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#update-home-bashrc.
• Modify the shell script hadoop-env.sh; the contents to be appended are described at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#hadoop-env-sh.
• Change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml with the contents illustrated at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#conf-site-xml.
• Format the HDFS file system via the Namenode by using the command given below –
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
• Start the single-node cluster using the command –
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a Namenode, Datanode, Jobtracker and a Tasktracker.
• Configure SSH access for the user created for Hadoop on the Linux machine.
3.1.1.2 Multi-Node Configuration
• Assign names to the nodes by editing the file /etc/hosts. For instance –
$ sudo vi /etc/hosts
192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2
• SSH access – The hduser on the master node must be able to connect to its own user account and to the hduser account on the slave nodes through password-less SSH login. To connect the master to itself –
hduser@master:~$ ssh master
• Configuration for the Master Node –
o conf/masters – This file specifies the machines on which Hadoop starts secondary NameNodes in a multi-node cluster. The bin/start-dfs.sh and bin/start-mapred.sh shell scripts run the primary NameNode and the JobTracker on the node where they are invoked. Update the file conf/masters to contain the single line –
master
o conf/slaves – This file lists the host machines on which the Hadoop slave daemons run. Update the file by adding these lines –
master
slave1
slave2
• Modify the following configuration files, previously addressed under the single-node setup –
o In the file conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
o In the file conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
o In the file conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>
• Start the HDFS via the Namenode –
bin/start-dfs.sh
This command brings up HDFS with the NameNode, and the DataNodes on the machines listed in the conf/slaves file.
3.1.1.3 Running the Map/Reduce Daemon
To run the mapred daemons, we need to run the following command on the master –
hduser@master:/usr/local/hadoop$ bin/start-mapred.sh
Check whether all daemons have started by running the jps command on the master node –
hduser@master:/usr/local/hadoop$ jps
To stop all the daemons, run the command bin/stop-all.sh.
3.1.1.4 Running the Map-Reduce job
• To run code on the Hadoop cluster, pack the code into a jar file named apriori.jar. Put the input files in the directory /user/hduser/input; the output will be written to /user/hduser/output/.
• To run the jar file, issue the following command at the HADOOP_HOME prompt –
hduser@master:/usr/local/hadoop$ bin/hadoop jar apriori.jar apriori /user/hduser/input /user/hduser/output
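The paper does not list the driver source inside apriori.jar, so the following is only a sketch of what a driver for a job of this shape might look like under the Hadoop 0.20 mapreduce API; the class names are assumptions, and the mapper and reducer it wires up are sketched in Section 3.3 below.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver class for the apriori.jar job invoked above.
    public class Apriori {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(Apriori.class);
            job.setJobName("apriori-support-count");
            // args[0] = /user/hduser/input, args[1] = /user/hduser/output
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(AprioriSupport.CandidateMapper.class);
            job.setReducerClass(AprioriSupport.SupportReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }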
3.2 System Deployment
The overall deployment of the required system is visualized in the system organization described in Figure 3. A series of Map calls sends the data to the cluster nodes in <Key, Value> form, and a Reduce call is then applied to summarize the results from the different nodes. For example, a map call that finds the candidate itemset {milk, bread} in a transaction emits the pair <{milk, bread}, 1>, and the reduce over all such pairs yields that candidate's support. A simple GUI is sufficient to display these results to the user operating the Hadoop Master.
Figure 3: System Organization
3.3 Algorithm Design
The algorithm discussed produces all the subsets that can be generated from the given itemset. These subsets are then searched against the data-sets and their frequencies are noted. Since there are many data items and subsets, they need to be searched simultaneously so that the search time reduces; hence the Map/Reduce concept of the Hadoop architecture comes into the picture. A Map function is forked for every subset of the items. These maps can run on any node in the distributed environment configured under the Hadoop configuration; the job distribution is taken care of by the Hadoop system, and the files and data-sets required are put into HDFS. In each Map function, the value is the itemset. The whole data-set is scanned to find the entries of the value itemset and the frequency is noted. This is given as an output to the Reduce function in the Reducer class defined in the Hadoop core package.
In the reduce function, the output of each map is collected and put into the required file with its frequency. The algorithm is stated below in natural language (a code sketch follows the list):
• Read the subsets file.
• For each subset in the file:
o Initialise the count to zero and invoke the map function.
o Read the database.
o For each entry in the database:
- Compare it with the subset.
- If equal, increment the count by one.
o Write the output counts into the intermediate file.
• Call the reduce function to sum the counts per subset and report the output to the file.
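As a concrete illustration of the Map and Reduce steps above, here is a minimal Java sketch against the Hadoop 0.20 mapreduce API. One deliberate simplification: the design above forks a map per candidate subset, whereas this sketch uses the common equivalent formulation that maps over transactions and emits a count for every candidate contained in the transaction; the hard-coded candidate list and all class names are illustrative assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AprioriSupport {

        // Emits (candidate, 1) for every candidate itemset contained in a transaction.
        public static class CandidateMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final List<Set<String>> candidates = new ArrayList<Set<String>>();

            @Override
            protected void setup(Context context) {
                // Hypothetical candidates; a real job would load the subsets
                // file produced by the previous Apriori iteration from HDFS.
                candidates.add(new HashSet<String>(Arrays.asList("milk", "bread")));
                candidates.add(new HashSet<String>(Arrays.asList("milk", "butter")));
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // One input line = one transaction, items separated by whitespace.
                Set<String> transaction = new HashSet<String>(
                        Arrays.asList(value.toString().trim().split("\\s+")));
                for (Set<String> candidate : candidates) {
                    if (transaction.containsAll(candidate)) {
                        context.write(new Text(candidate.toString()), ONE);
                    }
                }
            }
        }

        // Sums the per-transaction counts into the support of each candidate.
        public static class SupportReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int support = 0;
                for (IntWritable v : values) {
                    support += v.get();
                }
                // A minimum-support threshold could be applied here before emitting.
                context.write(key, new IntWritable(support));
            }
        }
    }

Either formulation yields the same support counts; mapping over transactions simply lets Hadoop's input splits drive the parallelism instead of the candidate file.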
4. RESULTS
The experimental setup described before has been rigorously tested against a pseudo-distributed configuration of Hadoop and against a standalone PC, for varying intensities of data and transactions. A fully configured multi-node Hadoop with differential system configuration (FHDSC) takes comparatively longer to process data than a fully configured setup of similar multi-nodes (FHSSC). Similarity here is in terms of the system configuration, from the computer architecture to the operating system running on it. This is clearly depicted in Figure 4.
Figure 4: FHDSC vs. FHSSC
The results taken from the 3-node fully-distributed and pseudo-distributed modes of Hadoop for large transaction volumes are fairly good until the maximum threshold capacity of the nodes is reached. The result is depicted in Figure 5.
Figure 5: Transactions vs. Hadoop Configuration
Looking at the graph, a large variance in time is seen at the threshold of 12,000 transactions, beyond which the time grows exponentially. This is because of the computer architecture and the limited storage capacity of 80 GB per node: the superset transaction generation, and hence the frequent-itemset miner, take longer to compute.
The performance is expressed as the lateral comparison between FHDSC and FHSSC, given below –
η = FHDSC / FHSSC = log_e N
where N is the number of nodes installed in the cluster.
5. CONCLUSIONS AND FUTURE ENHANCEMENTS
This paper presents a novel approach to designing algorithms for a clustered environment, applicable to scenarios where data-intensive computation is required. Such a setup gives a broad avenue for investigation and research in data mining; given the demand for such algorithms, there is an urgent need to focus on and explore the clustered environment for this domain.
Future work concentrates on building a framework that deals with unstructured data analysis and complex data-mining algorithms, and on integrating one of the following parallel computing infrastructures with Map/Reduce to utilize computing power for complex calculations on data, especially for classifiers under supervised learning:
• MPJ Express, a Java-based parallel computing toolkit that can be integrated with a distributed environment.
• The AtejiPX parallel programming toolkit with Hadoop.
ACKNOWLEDGEMENTS
Prof. Anjan K would like to thank the late Dr. V. K. Ananthashayana, erstwhile Head, Department of Computer Science and Engineering, M.S. Ramaiah Institute of Technology, Bangalore, for igniting the passion for research. The authors would also like to thank the late Prof. Harish G for his consistent effort and encouragement throughout this research project.
REFERENCES
[1] Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, and Hillol Kargupta, "Distributed Data Mining in Peer-to-Peer Networks", University of Maryland, Baltimore County, Baltimore, MD, USA, IEEE Internet Computing, Volume 10, Issue 4, pages 18-26, July 2006.
[2] Ning Chen, Nuno C. Marques, and Narasimha Bolloju, "A Web Service-based approach for data mining in distributed environments", Department of Information Systems, City University of Hong Kong, 2005.
[3] Mafruz Zaman Ashrafi, David Taniar, and Kate A. Smith, "A Data Mining Architecture for Distributed Environments", pages 27-34, Springer-Verlag, London, UK, 2007.
[4] Grigorios Tsoumakas and Ioannis Vlahavas, "Distributed Data Mining of Large Classifier Ensembles", SETN-2008, Thessaloniki, Greece, Proceedings, Companion Volume, pp. 249-256, 11-12 April 2008.
[5] Vuda Sreenivasa Rao, "Multi Agent-Based Distributed Data Mining: An Overview", International Journal of Reviews in Computing, pages 83-92, 2009.
[6] P. Kamakshi, A. Vinaya Babu, "Preserving Privacy and Sharing the Data in Distributed Environment using Cryptographic Technique on Perturbed data", Journal of Computing, Volume 2, Issue 4, ISSN 2151-9617, April 2010.
[7] Feng Li, Jin Ma, Jian-hua Li, "Distributed anonymous data perturbation method for privacy-preserving data mining", Journal of Zhejiang University SCIENCE A, ISSN 1862-1775, pages 952-963, 2008.
[8] Goswami D.N. et al., "An Algorithm for Frequent Pattern Mining Based On Apriori", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 04, pages 942-947, 2010.
[9] Marcin Gorawski and Pawel Jureczek, "Using Apriori-like Algorithms for Spatio-Temporal Pattern Queries", Silesian University of Technology, Institute of Computer Science, Akademicka 16, Poland, 2010.
[10] Cheng-Tao Chu et al., "Map-Reduce for Machine Learning on Multicore", CS Department, Stanford University, Stanford, CA, 2006.
[11] Navraj Chohan et al., "See Spot Run: Using Spot Instances for MapReduce Workflows", Computer Science Department, University of California, 2005.
AUTHORS PROFILE
Anjan K Koundinya received his B.E. degree from Visvesvaraya Technological University, Belgaum, India in 2007, and his master's degree from the Department of Computer Science and Engineering, M.S. Ramaiah Institute of Technology, Bangalore, India. He was awarded Best Performer PG 2010 and was a rank holder for his academic excellence. His areas of research include Network Security and Cryptology, Ad-hoc Networks, Mobile Computing, Agile Software Engineering and Advanced Computing Infrastructure. He is currently working as Assistant Professor in the Dept. of Computer Science and Engineering, R V College of Engineering.
Srinath N K received his M.E. degree in Systems Engineering and Operations Research from Roorkee University in 1986 and his PhD degree from Avinash Lingum University, India in 2009. His areas of research interest include Operations Research, Parallel and Distributed Computing, DBMS and Microprocessors. He is working as Professor and Head, Dept. of Computer Science and Engineering, R V College of Engineering.