This document proposes a "truly stepwise" approach to managing and implementing the data mining process. It involves storing the output of each main phase - such as pre-processing, feature extraction, and modeling - in a storage medium between phases. This allows each phase to work independently without needing to rerun previous phases. It compares this approach to the traditional method where all phases are combined and run sequentially without intermediate storage. The document then presents a case study applying this new approach to a project on resistance spot welding quality estimation, finding it performed better than the traditional approach.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it serves as an elementary foundation for supervised learning, which encompasses classifier and feature-extraction methods. Applying this algorithm is therefore crucial to understanding the behaviour of structured data. Most structured data in scientific domains are voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one such cluster framework, distributing voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
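The paper itself targets Hadoop; as a hedged illustration only, the sketch below mimics one map/reduce pass of Apriori candidate counting in plain Python (the mapper emits (candidate, 1) pairs per transaction, the reducer sums them and applies the minimum support). The in-memory driver and all names are assumptions, not the paper's Hadoop code.

```python
from collections import defaultdict
from itertools import combinations

def mapper(transaction, candidates):
    """Emit (candidate, 1) for every candidate itemset contained in the transaction."""
    items = set(transaction)
    for cand in candidates:
        if set(cand).issubset(items):
            yield cand, 1

def reducer(pairs, min_support):
    """Sum the counts per candidate and keep those meeting the minimum support."""
    counts = defaultdict(int)
    for cand, one in pairs:
        counts[cand] += one
    return {cand: c for cand, c in counts.items() if c >= min_support}

def apriori_pass(transactions, prev_frequent, k, min_support):
    """One Apriori level: build k-item candidates from (k-1)-item frequents, then count."""
    items = sorted({i for itemset in prev_frequent for i in itemset})
    candidates = [c for c in combinations(items, k)
                  if all(sub in prev_frequent for sub in combinations(c, k - 1))]
    emitted = (pair for t in transactions for pair in mapper(t, candidates))
    return reducer(emitted, min_support)

if __name__ == "__main__":
    db = [["bread", "milk"], ["bread", "butter", "milk"], ["butter", "milk"], ["bread", "butter"]]
    # Level 1: frequent single items, then level 2 via the simulated map/reduce pass.
    f1 = reducer(((tuple([i]), 1) for t in db for i in t), min_support=2)
    f2 = apriori_pass(db, f1, k=2, min_support=2)
    print(f1, f2, sep="\n")
```

On a real cluster the mapper and reducer would run as separate Hadoop tasks over data splits; the generator-based driver here only stands in for that shuffle.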
Progressive Mining of Sequential Patterns Based on Single Constraint (TELKOMNIKA JOURNAL)
Data that appear in time order and are stored in a sequence database can be processed to obtain sequential patterns; sequential pattern mining is the process of obtaining such patterns from the database. However, the large amount of data, the variety of data types, and rapid data growth raise scalability issues in the mining process. At the same time, users need to analyze data according to specific organizational needs, so constraints are used to impose limitations on the mining process. A constraint in sequential pattern mining can filter out short and trivial sequential patterns so that the remaining patterns satisfy user needs. Progressive mining of sequential patterns (PISA) based on a single constraint uses a Period of Interest (POI), a predefined time frame set by the user, in a progressive sequential tree. Single-constraint checking in PISA relies on the concept of anti-monotonic or monotonic constraints. As a result, the number of sequential patterns and the total execution time of the mining process decrease, and system scalability is achieved.
PERFORMANCE EVALUATION OF SQL AND NOSQL DATABASE MANAGEMENT SYSTEMS IN A CLUSTER (ijdms)
In this study, we evaluate the performance of SQL and NoSQL database management systems, namely Cassandra, CouchDB, MongoDB, PostgreSQL, and RethinkDB. We use a cluster of four nodes to run the database systems, with external load generators. The evaluation is conducted using data from Telenor Sverige, a telecommunication company that operates in Sweden, and the experiments use three datasets of different sizes. Write throughput and latency as well as read throughput and latency are evaluated for four queries: distance query, k-nearest-neighbour query, range query, and region query. For write operations, Cassandra has the highest throughput when multiple nodes are used, whereas PostgreSQL has the lowest latency and the highest throughput for a single node. For read operations, MongoDB has the lowest latency for all queries, while Cassandra has the highest read throughput. Throughput decreases as the dataset size increases for both writes and reads, for both sequential and random access, and this decrease is more significant for random reads and writes. We also present our experience with these database management systems, including setup and configuration complexity.
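The abstract does not describe the load generators in detail; the following is a minimal, generic sketch of how per-query throughput and latency could be measured against any client exposing an execute(query) call. The execute interface and the dummy workload are assumptions for illustration.

```python
import time
import statistics

def benchmark(execute, queries, label):
    """Run each query through the supplied client call and report throughput
    (ops/s) and latency (ms). `execute` is any callable wrapping a database
    driver; it is an assumed interface, not a specific driver API."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        execute(q)                       # e.g. a SQL cursor call or a NoSQL driver call
        latencies.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
    print(f"{label}: {len(queries) / elapsed:8.1f} ops/s | "
          f"median {statistics.median(latencies):6.2f} ms | p95 {p95:6.2f} ms")

if __name__ == "__main__":
    # Stand-in workload: a no-op "database" so the harness itself is runnable.
    dummy_execute = lambda q: time.sleep(0.001)
    benchmark(dummy_execute, ["range query"] * 200, "dummy-range")
```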
Data performance characterization of frequent pattern mining algorithms (IJDKP)
Big data has quickly come under the spotlight in recent years. As big data applications are expected to handle extremely large amounts of data, it is natural that demand for computational environments that accelerate and scale out such applications is increasing. However, the behavior of big data applications is not yet clearly defined. Among big data applications, this paper focuses specifically on stream mining, whose behavior varies according to the characteristics of the input data. The parameters for characterizing that data are not yet clearly defined either, and no study has investigated explicit relationships between the input data and stream mining applications. Therefore, this paper takes frequent pattern mining as a representative stream mining application and interprets the relationships between the characteristics of the input data and the behavior of signature algorithms for frequent pattern mining.
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA (cscpconf)
Data mining applications often involve complex data such as multiple heterogeneous data sources, user preferences, decision-making actions, and business impacts. In such settings, a single data mining method cannot extract the complete useful information in the form of informative patterns, because joining the large relevant data sources needed to discover patterns covering all aspects of the information consumes too much time and space. We consider combined mining as an approach for mining informative patterns from multiple data sources, multiple features, or multiple methods, as the requirements dictate. In the multi-source combined mining approach, we apply the Lossy Counting algorithm to each data source to obtain frequent itemsets and then derive combined association rules. In the multi-feature combined mining approach, we obtain pair patterns and cluster patterns and then generate incremental pair patterns and incremental cluster patterns, which cannot be generated directly by existing methods. In the multi-method combined mining approach, we combine FP-growth and a Bayesian belief network to build a classifier that yields more informative knowledge.
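The combined-mining pipeline itself is not reproduced here; as one concrete ingredient, the sketch below is a minimal implementation of the classic Lossy Counting algorithm the abstract names, applied to single items of one stream (epsilon is the error bound). How the per-source results are merged into combined rules is left out.

```python
import math

class LossyCounter:
    """Classic Lossy Counting for frequent items in a stream.
    Keeps (count, max_error) per item and prunes at every bucket boundary,
    so counts are undercounted by at most epsilon * N."""
    def __init__(self, epsilon=0.01):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)   # bucket width
        self.n = 0                              # items seen so far
        self.entries = {}                       # item -> (count, max_error)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        count, err = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, err)
        if self.n % self.width == 0:            # prune at bucket boundary
            self.entries = {i: (c, e) for i, (c, e) in self.entries.items()
                            if c + e > bucket}

    def frequent(self, support):
        """Items whose true frequency may exceed `support` (a fraction of the stream)."""
        threshold = (support - self.epsilon) * self.n
        return {i: c for i, (c, e) in self.entries.items() if c >= threshold}

if __name__ == "__main__":
    stream = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
    lc = LossyCounter(epsilon=0.05)
    for x in stream:
        lc.add(x)
    print(lc.frequent(support=0.2))   # expect 'a' and 'b'
```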
Combined mining approach to generate patterns for complex data (csandit)
The document discusses a combined mining approach to generate patterns from complex data. It proposes applying a lossy-counting algorithm to each data source to obtain frequent itemsets, then generating combined association rules. It also describes obtaining pair and cluster patterns by considering multiple features, and generating incremental pair and cluster patterns. Further, it combines FP-growth and Bayesian belief networks to classify data and obtain more informative knowledge.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
This document summarizes a research paper that presents a task-decomposition based anomaly detection system for analyzing massive and highly volatile session data from the Science Information Network (SINET), Japan's academic backbone network. The system uses a master-worker design with dynamic task scheduling to process over 1 billion sessions per day. It discriminates incoming and outgoing traffic using GPU parallelization and generates histograms of traffic volumes over time. Long short-term memory (LSTM) neural networks detect anomalies like spikes in incoming traffic volumes. The experiment analyzed SINET data from February 27 to March 8, 2021, detecting some anomalies while processing 500-650 gigabytes of daily session data.
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY (IJDKP)
This document summarizes an approach to improve source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs. A similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show the retrieval model using this measure improves retrieval performance over other models by up to 90.9% relative to the number of retrieved methods.
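The exact similarity formula is not given in this summary, so the sketch below shows one plausible reading of a full/partial statement-match score over aligned statement sequences; the tokenisation and the 0.5 weight for partial matches are illustrative assumptions, not the paper's measure.

```python
def sequence_similarity(query_seq, candidate_seq):
    """Toy score over aligned statement sequences: a full match counts 1,
    a partial match (same leading statement keyword, different detail) counts 0.5."""
    full, partial = 0, 0
    for q, c in zip(query_seq, candidate_seq):
        if q == c:
            full += 1
        elif q.split()[0] == c.split()[0]:     # e.g. both start with 'for' or 'if'
            partial += 1
    total = max(len(query_seq), len(candidate_seq))
    return (full + 0.5 * partial) / total if total else 0.0

if __name__ == "__main__":
    query = ["for i", "if x > 0", "call log"]
    method_a = ["for i", "if y < 0", "call log"]    # structurally close candidate
    method_b = ["while j", "call print", "return"]  # structurally distant candidate
    print(sequence_similarity(query, method_a))      # higher score
    print(sequence_similarity(query, method_b))      # lower score
```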
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING (IJDKP)
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
With the rapid development of Geographic Information Systems (GISs) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access remain a big problem for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data query. The paper describes a framework for integrating information from repositories containing different vector data formats and repositories containing raster datasets. The presented approach converts the different vector data formats into a single unified format (File Geo-Database "GDB"). In addition, we employ metadata to support a wide range of user queries that retrieve relevant geographic information from heterogeneous and distributed repositories, which enhances both query processing and performance.
Data characterization towards modeling frequent pattern mining algorithms (csandit)
Big data has quickly come under the spotlight in recent years. As big data applications are expected to handle extremely large amounts of data, it is natural that demand for computational environments that accelerate and scale out such applications is increasing. However, the behavior of big data applications is not yet clearly defined. Among big data applications, this paper focuses specifically on stream mining, whose behavior varies according to the characteristics of the input data. The parameters for characterizing that data are not yet clearly defined either, and no study has investigated explicit relationships between the input data and stream mining applications. Therefore, this paper takes frequent pattern mining as a representative stream mining application and interprets the relationships between the characteristics of the input data and the behavior of signature algorithms for frequent pattern mining.
Further Analysis Of A Framework To Analyze Network Performance Based On Infor... (CSCJournals)
In [1], Geng and Li presented a framework to analyze network performance based on information quality. In that paper, the authors based their framework on the flow of information from a Base Station (BS) to clients. The theory they established can, and needs to, be extended to accommodate the flow of information from the clients to the BS. In this work, we use their framework and study the case of client-to-BS data transmission. Our work closely parallels that of Geng and Li; we use the same notation and liberally reference their work.
Keywords: information theory, information quality, network protocols, network performance
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
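GCUBE's linearization step relies on the Hilbert curve; the sketch below is the standard 2-D coordinate-to-curve-distance conversion, used only to show how grid cells could be sorted into Hilbert order before an index is built on the single resulting key. The index structure itself, and the handling of higher dimensions, are not shown.

```python
def hilbert_d(n, x, y):
    """Map grid coordinates (x, y) on an n x n grid (n a power of two) to the
    point's distance along the Hilbert curve (standard iterative conversion)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level stays contiguous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

if __name__ == "__main__":
    # Sort a handful of cells into Hilbert order; an index (e.g. a B-tree) could
    # then be built on this single key instead of two spatial dimensions.
    points = [(0, 0), (3, 3), (1, 2), (2, 1)]
    print(sorted(points, key=lambda p: hilbert_d(4, *p)))
```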
Comparative analysis of various data stream mining procedures and various dim... (Alexander Decker)
This document provides a comparative analysis of various data stream mining procedures and dimension reduction techniques. It discusses 10 different data stream clustering algorithms and their working mechanisms. It also compares 6 dimension reduction techniques and their objectives. The document proposes applying a dimension reduction technique to reduce the dimensionality of a high-dimensional data stream, before clustering it using a weighted fuzzy c-means algorithm. This combined approach aims to improve clustering quality and enable better visualization of streaming data.
This document summarizes a research paper that proposes a heuristic approach to preserve privacy in stream data classification. The approach applies data perturbation to the stream data before performing classification. This allows privacy to be preserved while also building an accurate classification model for the large-scale stream data. The approach is implemented in two phases: first the data stream is perturbed, then classification is performed on both the perturbed and original data. Experimental results show that this approach can effectively preserve privacy while reducing the complexity of mining large stream data.
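The summary does not state which perturbation the paper uses, so the sketch below uses additive Gaussian noise, a common perturbation choice, and checks that aggregate statistics of the stream survive; sigma and the synthetic stream are illustrative assumptions.

```python
import random
import statistics

def perturb(stream, sigma=0.5, seed=42):
    """Additive Gaussian noise perturbation of a numeric data stream.
    Individual values are masked while aggregate statistics stay close,
    which is what allows a model to be trained on the perturbed stream."""
    rng = random.Random(seed)
    for value in stream:
        yield value + rng.gauss(0.0, sigma)

if __name__ == "__main__":
    rng = random.Random(1)
    original = [rng.uniform(0, 10) for _ in range(5000)]
    noisy = list(perturb(original, sigma=0.5))
    print("original mean %.3f  perturbed mean %.3f" %
          (statistics.fmean(original), statistics.fmean(noisy)))
    print("original stdev %.3f  perturbed stdev %.3f" %
          (statistics.stdev(original), statistics.stdev(noisy)))
```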
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes a new dynamic data replication and job scheduling strategy for data grids. The strategy aims to improve data access time and reduce bandwidth consumption by replicating data based on file popularity, storage limitations at nodes, and data category. It replicates more popular files that are in the same category as frequently accessed data to nodes close to where jobs are run. This is intended to optimize performance by locating data and jobs close together. The document provides context on related work and outlines the proposed system architecture and replication/scheduling approach.
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam... (IRJET Journal)
This document summarizes several techniques for handling concept drift in data stream mining. It discusses how ensemble methods are commonly used to deal with concept drift and categorizes ensemble approaches into online and block-based. It also reviews several existing studies on handling concept drift, including methods that use adaptive windowing and online learning as well as techniques for detecting concept drift and efficiently updating models. The document concludes by discussing the need for approaches that can adapt to different types of concept drift and changes in non-stationary data streams.
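As a concrete stand-in for the windowing-based detectors the review covers, the sketch below compares a frozen reference window with a sliding recent window and flags drift when their means diverge by several standard errors; it is a simplified illustration, not any specific published detector.

```python
from collections import deque
import math
import random

class TwoWindowDriftDetector:
    """Compare the mean of a sliding 'recent' window with a frozen 'reference'
    window; flag drift when the gap exceeds k standard errors."""
    def __init__(self, window=200, k=6.0):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.k = k

    def update(self, x):
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(x)
            return False
        self.recent.append(x)
        if len(self.recent) < self.recent.maxlen:
            return False
        mu_ref = sum(self.reference) / len(self.reference)
        mu_rec = sum(self.recent) / len(self.recent)
        var = sum((v - mu_ref) ** 2 for v in self.reference) / len(self.reference)
        stderr = math.sqrt(var / len(self.recent)) or 1e-9
        if abs(mu_rec - mu_ref) > self.k * stderr:
            # Drift: the recent window becomes the new reference.
            self.reference = deque(self.recent, maxlen=self.reference.maxlen)
            self.recent.clear()
            return True
        return False

if __name__ == "__main__":
    rng = random.Random(0)
    stream = [rng.gauss(0, 1) for _ in range(2000)] + [rng.gauss(3, 1) for _ in range(2000)]
    det = TwoWindowDriftDetector()
    for i, x in enumerate(stream):
        if det.update(x):
            print("drift flagged near index", i)
```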
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents information about the data to be stored in a data warehouse and is a mandatory element for building an efficient data warehouse. Metadata helps with data integration, lineage, data quality, and populating transformed data into the warehouse. Spatial data warehouses are based on spatial data mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where spatial information and data modeling must be brought together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including the spatial information, is fetched from the data warehouse.
Clustering search results helps the user get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results; by catalogue we mean a structured label list that helps the user understand the labels and the results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for a query and losing extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric for assessing the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, and we perform the experiments on the publicly available Ambient and ODP-239 datasets.
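The paper's labelling method and evaluation metric are not spelled out in this abstract; as a baseline for comparison, the sketch below labels each cluster with the terms that are frequent inside it but rare elsewhere (a tf-idf-like score treating clusters as documents), which is the kind of naive labelling such methods aim to improve on.

```python
import math
from collections import Counter

def label_clusters(clusters, top_k=3):
    """Label each cluster of documents with its most distinctive terms:
    term count inside the cluster, discounted by how many clusters contain the term."""
    term_counts = [Counter(w for doc in docs for w in doc.lower().split())
                   for docs in clusters]
    n = len(clusters)
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())                # clusters containing each term
    labels = []
    for counts in term_counts:
        scored = {t: c * math.log(n / df[t]) for t, c in counts.items()}
        labels.append([t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]])
    return labels

if __name__ == "__main__":
    clusters = [
        ["jaguar speed engine", "jaguar car dealer", "engine repair car"],
        ["jaguar habitat jungle", "big cat jungle predator", "jaguar predator prey"],
    ]
    print(label_clusters(clusters))   # e.g. car/engine terms vs jungle/predator terms
```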
This paper addresses the challenges of mining frequent items over data streams under a variable window size and low memory space. To detect the point at which the context changes in the streaming transactions, we developed a two-level window structure that supports adjusting the window size on the fly, controlling heterogeneity and ensuring homogeneity among the transactions added to the window. To minimize memory utilization and computational cost and to improve scalability, the design allows the coverage, or support, to be fixed at the window level. We introduce incremental mining of frequent itemsets from the window together with a context-variation analysis approach; the complete technique is named Mining Frequent Item-sets using Variable Window Size fixed by Context Variation Analysis (MFI-VWSCVA). There are clear boundaries between frequent and infrequent itemsets in specific itemsets, and in this design the change in window size represents the concept drift in the information stream; in other words, when the window size cannot be set effectively, itemsets become infrequent. The experiments we executed and documented show that the designed algorithm is considerably more efficient than existing ones.
This document summarizes a research paper that proposes a method for mining interesting medical knowledge from big uncertain data using MapReduce. The key points are:
1) Mining patterns from large amounts of uncertain medical data generated by hospitals can provide useful knowledge but is computationally intensive.
2) The proposed method uses MapReduce to efficiently mine the data for frequent patterns related to user-specified interests or constraints.
3) This focuses the mining to be more effective and produce personalized results depending on individual user needs and interests in the medical data.
Parallel Key Value Pattern Matching Model (ijsrd.com)
Mining frequent itemsets from huge transactional databases is an important task in data mining, and finding frequent itemsets is the key step in extracting association rules, which are used to find relationships among large datasets. Many algorithms have been developed to find frequent itemsets. This work presents a summary and a new parallel key-value pattern matching model that shards a large-scale mining task into independent, parallel tasks. It produces frequent patterns efficiently in terms of time consumption, avoids high computational cost, and discovers the frequent itemsets from the database.
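As a generic illustration of sharding a mining task into independent parallel workers, the sketch below counts item pairs per shard with Python's multiprocessing and merges the partial counts; the pair-counting workload and shard layout are assumptions, not the paper's key-value model.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def count_shard(transactions):
    """Count item pairs within one shard of the transaction database."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    return counts

def parallel_pair_counts(db, shards=4):
    """Split the database into shards, count pairs in parallel, merge the results."""
    chunks = [db[i::shards] for i in range(shards)]
    with Pool(shards) as pool:
        partials = pool.map(count_shard, chunks)
    total = Counter()
    for p in partials:
        total.update(p)
    return total

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"], ["a", "b", "c"]] * 100
    counts = parallel_pair_counts(db)
    print(counts.most_common(3))
```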
This document provides an overview of stream data mining techniques. It discusses how traditional data mining cannot be directly applied to data streams due to their continuous, rapid nature. The document outlines some essential methodologies for analyzing data streams, including sampling, load shedding, sketching, and data summarization techniques like reservoirs, histograms, and wavelets. It also discusses challenges in applying these techniques to data streams and open problems in the emerging field of stream data mining.
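Of the methodologies listed, sampling is the simplest to show concretely; below is a minimal reservoir sampling sketch (Algorithm R) that keeps a uniform fixed-size sample of an unbounded stream. The stream and sample size are illustrative.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream of unknown length
    (Algorithm R): the i-th item replaces a random slot with probability k/i."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = rng.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=10))
```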
A plethora of infinite data is generated from the Internet and other information sources. Analyzing this massive data in real time and extracting valuable knowledge with different mining platforms has been an active area for both research and industry. However, data stream mining poses challenges that make it different from traditional data mining. Recently, many studies have addressed these massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real-time clustering and classification techniques for data streams, analyze the characteristics of data stream mining, discuss its challenges and research issues, and present some of the platforms for data stream mining.
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH (IJCI JOURNAL)
In this paper we investigate the colocation mining problem in the context of uncertain data, that is, data that are only partially complete. Much real-world data is uncertain, for example demographic data, sensor network data, and GIS data. Handling such data is a challenge for knowledge discovery, particularly in colocation mining. One straightforward method is to find the Probabilistic Prevalent Colocations (PPCs), i.e., all colocations likely to be generated by a random world. For this, we first apply an approximation error to find all the PPCs, which reduces the computation. Next, we find the possible worlds, split them into two different worlds, and compute the prevalence probability; these worlds are compared against a minimum probability threshold to decide whether a colocation is a Probabilistic Prevalent Colocation or not. Experimental results on the selected dataset show a significant improvement in computation time compared with some of the existing colocation mining methods.
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION (cscpconf)
Storage space and data have recently been growing in parallel, and at any instant the data may exceed the storage capacity. A good RDBMS should try to reduce redundancy as far as possible to maintain consistency and control storage cost; beyond that, a huge database with replicated copies wastes space that could be used for other purposes. The first aim should therefore be to apply data deduplication techniques within the RDBMS, while checking the access time complexity along with the space complexity. Here, different data deduplication approaches are discussed. Finally, based on the drawbacks of those approaches, a new approach involving the row id, column id, and domain-key constraint of the RDBMS is theoretically illustrated. Although this model may appear tedious and non-optimistic, for a large database with many tables containing many lengthy fields it can be shown to reduce the space requirement drastically while retaining the same access speed.
On Tracking Behavior of Streaming Data: An Unsupervised Approach (Waqas Tariq)
In recent years, data streams have been the focus of a large number of researchers in different domains. These researchers share the same difficulty when discovering unknown patterns within data streams: concept change, i.e., the points where the underlying distribution of the data changes over time. Different methods have been proposed to detect changes in a data stream, but most of them rest on the unrealistic assumption that data labels are available to the learning algorithm, whereas in real-world problems labels for streaming data are rarely available. This is the main reason why the data stream community has recently focused on the unsupervised setting. This study starts from the observation that unsupervised approaches for learning from data streams are not yet mature; they provide only mediocre performance, especially on multi-dimensional data streams. In this paper, we propose a method for Tracking Changes in the behavior of instances using the Cumulative Density Function, abbreviated TrackChCDF. Our method accurately detects change points along an unlabeled data stream and also determines the trend of the data, called closing or opening. The advantages of our approach are threefold: it detects change points accurately, it works well on multi-dimensional data streams, and it can determine the type of change (closing or opening of instances over time), which has wide applications in fields such as economics, the stock market, and medical diagnosis. We compare our algorithm to the state-of-the-art method for concept change detection in data streams, and the results obtained are very promising.
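TrackChCDF's exact statistic is not reproduced in this summary; as an illustrative stand-in for CDF-based change tracking, the sketch below compares the empirical CDFs of a reference sample and a recent sample via their largest gap (a Kolmogorov-Smirnov style distance). Window sizes and the synthetic data are assumptions.

```python
import bisect
import random

def ecdf_distance(reference, recent):
    """Largest gap between the empirical CDFs of two samples, evaluated at
    every observed value (a Kolmogorov-Smirnov style distance)."""
    ref = sorted(reference)
    rec = sorted(recent)
    gap = 0.0
    for v in ref + rec:
        f_ref = bisect.bisect_right(ref, v) / len(ref)
        f_rec = bisect.bisect_right(rec, v) / len(rec)
        gap = max(gap, abs(f_ref - f_rec))
    return gap

if __name__ == "__main__":
    rng = random.Random(0)
    before = [rng.gauss(0, 1) for _ in range(500)]
    same = [rng.gauss(0, 1) for _ in range(500)]
    shifted = [rng.gauss(1.5, 1) for _ in range(500)]   # the distribution has moved
    print("no change:", round(ecdf_distance(before, same), 3))
    print("change:   ", round(ecdf_distance(before, shifted), 3))
```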
The document discusses representing relational spatiotemporal data using information granules. It proposes:
1) Describing the relational data using a vocabulary of granular descriptors formed from Cartesian products of spatial, temporal, and signal information granules. This granular representation provides an interpretable perspective on the data.
2) Analyzing the capabilities of different vocabularies to capture the essence of the data through the processes of granulation and degranulation, where the original data is reconstructed from its granular representation. The quality of reconstruction is used to optimize the vocabulary.
3) Extending the approach to analyze evolvability of the granular description as the relational data changes across consecutive
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
A Quantified Approach for large Dataset Compression in Association Mining (IOSR Journals)
Abstract: With the rapid development of computer and information technology over the last several decades, an enormous amount of data in science and engineering is continuously being generated at massive scale; data compression is needed to reduce storage cost and space. Compression, and the discovery of association rules by identifying relationships among sets of items in a transaction database, are important problems in data mining. Finding frequent itemsets is computationally the most expensive step in association rule discovery and has therefore attracted significant research attention. However, existing compression algorithms are not appropriate for data mining on large datasets. In this research a new approach is described in which the original dataset is sorted in lexicographical order and a desired number of groups are formed to generate quantification tables. These quantification tables are used to generate the compressed dataset, yielding a more efficient algorithm for mining complete frequent itemsets from the compressed dataset. The experimental results show that the proposed algorithm performs better than the mining merge algorithm across different supports and execution times.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches (IRJET Journal)
This document summarizes research on techniques for handling concept drift in data stream mining. It begins with an introduction to the challenges of concept drift in data streams and the two main approaches for handling concept drift using ensembles: online and block-based. It then reviews several existing studies on concept drift detection and handling in data streams. Finally, it proposes an adaptive online ensemble approach that uses an internal change detector to dynamically determine block sizes and capture concept drifts in a timely manner. Experimental results show this approach outperforms other ensemble techniques, especially on datasets with sudden concept changes.
Enhancement techniques for data warehouse staging areaIJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesVenu Madhav
Frequent itemset mining and association rule generation is a challenging task in data streams. Even though various algorithms have been proposed to solve the issue, it has been found that frequency alone does not decide the significance or interestingness of the mined itemsets, and hence of the association rules. This has motivated algorithms that mine association rules based on utility, i.e. the proficiency of the mined rules. However, few such algorithms exist in the literature, as most work focuses on reducing the complexity of frequent itemset and association rule mining. Moreover, those few algorithms consider only the overall utility of the association rules and not the consistency of the rules throughout a defined number of periods. To solve this issue, this paper proposes an enhanced association rule mining algorithm. The algorithm introduces a new weightage validation into conventional association rule mining to validate the utility of the mined association rules and its consistency. The utility is validated by an integrated calculation of the cost/price efficiency of the itemsets and their frequency. The consistency validation is performed at every defined number of windows using the probability distribution function, assuming that the weights are normally distributed. Hence, the obtained rules are frequent and utility-efficient, and their interestingness is distributed throughout the entire time period. The algorithm is implemented and the resulting rules are compared against the rules obtained from conventional mining algorithms.
Drsp dimension reduction for similarity matching and pruning of time series ...IJDKP
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
1) This document discusses different techniques for cross-domain data fusion, including stage-based, feature-level, probabilistic, and multi-view learning methods.
2) It reviews literature on data fusion definitions, implementations, and techniques for handling data conflicts. Common steps in data fusion are data transformation, schema mapping, and duplicate detection.
3) The proposed system architecture performs data cleaning, then applies stage-based, feature-level, probabilistic, and multi-view learning fusion methods before analyzing dataset, hardware, and software requirements.
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICSijistjournal
The growth of smart and intelligent devices known as sensors generates large amounts of data. Over time, these data reach such a large volume that they are designated as big data, and the repository holds unstructured data. Traditional data analytics methods are well developed and widely used to analyse structured data and, to a limited extent, semi-structured data, which involves additional processing overhead. The methods used to analyse unstructured data differ because they rely on a distributed computing approach, whereas centralized processing is possible for structured and semi-structured data. The undertaken work is confined to an analysis of both varieties of methods. The result of this study is intended to introduce the methods available for analysing big data.
This document provides an overview of techniques for temporal data mining. It discusses how temporal sequences arise in domains like engineering, science, finance, and healthcare. It covers approaches for representing temporal sequences, including keeping data in its original form, piecewise approximations, transformations to other domains, discretization, and generative models. It also discusses measuring similarity between sequences and techniques for classification and relation finding in temporal data, including frequent pattern detection and prediction. The goal is to classify and organize temporal data mining techniques to help practitioners address problems involving temporal data.
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopBRNSSPublicationHubI
This document describes research on improving the Apriori algorithm for association rule mining on large datasets using Hadoop. The researchers implemented an improved Apriori algorithm that uses MapReduce on Hadoop to reduce the number of database scans needed. They tested the proposed algorithm on various datasets and found it had faster execution times and used less memory compared to the traditional Apriori algorithm.
This document proposes using the discrete wavelet transform (DWT) with truncation for privacy-preserving data mining. The DWT decomposes data into approximation and detail coefficients. Truncating the detail coefficients distorts the data while maintaining statistical properties. An experiment shows the method effectively conceals sensitive information while preserving data mining performance after distortion.
The document proposes streaming algorithms for performing Pearson's chi-square goodness-of-fit test in a streaming setting with minimal assumptions. It presents algorithms for the one-sample and two-sample continuous chi-square tests that use O(K^2log(N)√N) space, where K is the number of bins and N is the stream length. It also shows that no sublinear solution exists for the categorical chi-square test and provides a heuristic algorithm. The algorithms are validated on real and synthetic data and can detect deviations from distributions or differences between streams with low memory requirements.
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
95Orchestrating Big Data Analysis Workflows in the Cloud.docxfredharris32
95
Orchestrating Big Data Analysis Workflows in the Cloud:
Research Challenges, Survey, and Future Directions
MUTAZ BARIKA and SAURABH GARG, University of Tasmania
ALBERT Y. ZOMAYA, University of Sydney
LIZHE WANG, China University of Geoscience (Wuhan)
AAD VAN MOORSEL, Newcastle University
RAJIV RANJAN, China University of Geoscience (Wuhan) and Newcastle University
Interest in processing big data has increased rapidly to gain insights that can transform businesses, government policies, and research outcomes. This has led to advancement in communication, programming, and processing technologies, including cloud computing services and technologies such as Hadoop, Spark, and Storm. This trend also affects the needs of analytical applications, which are no longer monolithic but composed of several individual analytical steps running in the form of a workflow. These big data workflows are vastly different in nature from traditional workflows. Researchers are currently facing the challenge of how to orchestrate and manage the execution of such workflows. In this article, we discuss in detail orchestration requirements of these workflows as well as the challenges in achieving these requirements. We also survey current trends and research that supports orchestration of big data workflows and identify open research challenges to guide future developments in this area.
CCS Concepts: • General and reference → Surveys and overviews; • Computer systems organization
→ Cloud computing; • Computing methodologies → Distributed algorithms; • Computer systems
organization → Real-time systems;
Additional Key Words and Phrases: Big data, cloud computing, workflow orchestration, research taxonomy,
approaches, and techniques
ACM Reference format:
Mutaz Barika, Saurabh Garg, Albert Y. Zomaya, Lizhe Wang, Aad van Moorsel, and Rajiv Ranjan. 2019. Or-
chestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions.
ACM Comput. Surv. 52, 5, Article 95 (September 2019), 41 pages.
https://doi.org/10.1145/3332301
This research is supported by an Australian Government Research Training Program (RTP) Scholarship. This research is
also partially funded by two Natural Environment Research Council (UK) projects including (LANDSLIP:NE/P000681/1
and FloodPrep:NE/P017134/1).
Authors’ addresses: M. Barika and S. Garg, Discipline of ICT | School of Technology, Environments and Design (TED),
College of Sciences and Engineering, University of Tasmania, Hobart, Tasmania, Australia 7001; emails: {mutaz.barika,
saurabh.garg}@utas.edu.au; A. Zomaya, School of Computer Science, Faculty of Engineering, J12 - Computer Science Build-
ing. The University of Sydney, Sydney, New South Wales, Australia; email: [email protected]; L. Wang, School
of Computer Science, China University of Geoscience, No. 388 Lumo Road, Wuhan, P. R China; email: [email protected];
A Van Moorsel, School of Computing, Newcastle University,.
Big data service architecture: a surveyssuser0191d4
This document discusses big data service architecture. It begins with an introduction to big data services and their economic benefits. It then describes the key components of big data service architecture, including data collection and storage, data processing, and applications. For data collection and storage, it covers Extract-Transform-Load tools, distributed file systems, and NoSQL databases. For data processing, it discusses batch, stream, and hybrid processing frameworks like MapReduce, Storm, and Spark. It concludes by noting big data applications in various fields and cloud computing services for big data.
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE cscpconf
Traditional medical analysis is based on static data: the medical data is analysed only after the collection of the data sets is completed, which is far from satisfying the actual demand. Large amounts of medical data are generated in real time, so real-time analysis can yield more value. This paper introduces the design of Sentinel, a real-time analysis system based on a clustering algorithm. Sentinel performs clustering analysis of real-time data and issues early alerts.
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
Managing and implementing the data mining process using a truly stepwise approach
Managing and Implementing the Data Mining Process
Using a Truly Stepwise Approach
Perttu Laurinen(1), Lauri Tuovinen(1), Eija Haapalainen(1), Heli Junno(1), Juha Röning(1) and Dietmar Zettel(2)
(1) Intelligent Systems Group, Department of Electrical and Information Engineering, PO BOX 4500, FIN-90014 University of Oulu, Finland. E-mail: perttu.laurinen@ee.oulu.fi.
(2) Fachhochschule Karlsruhe, Institut für Innovation und Transfer, Moltkestr. 30, 76133 Karlsruhe, Germany. E-mail: dietmar.zettel@fh-karlsruhe.de.
Abstract. Data mining consists of transformation of information with a
variety of algorithms to discover the underlying dependencies. The
information is passed through a chain of algorithms and usually not stored
until it has reached the end of the chain, which may result in a number of
difficulties. This paper presents a method for better management and
implementation of the data mining process and reports a case study of the
method applied to the pre-processing of spot welding data. The developed
approach, called ‘truly stepwise data mining’, enables more systematic
processing of data. It verifies the correctness of the data, allows easier
application of a variety of algorithms to the data, manages the work chain, and
differentiates between the data mining tasks. The method is based on storage
of the data between the main phases of the data mining process. The different
layers of the storage medium are defined on the basis of the type of algorithms
applied to the data. The layers defined in this research consist of raw data, pre-
processed data, features, and models. In conclusion, we present a systematic,
easy-to-apply method for implementing and managing the work flow of the
data mining process. A case study of applying the method to a resistance spot
welding quality estimation project is presented to illustrate the superior
performance of the method compared to the currently used approach.
Key words: hierarchical data storage, work flow management, data mining
work flow implementation.
1. Introduction
Data mining consists of transformation of information with a variety of
algorithms to discover the underlying dependencies. The information is passed
through a chain of algorithms, and the success of the process is determined by the
outcome. The typical phases of a data mining process are: raw data acquisition, pre-
processing, feature extraction, and modeling. The method of managing the
interactions between these phases has a major impact on the outcome of the project.
The traditional approach of implementing the data mining process is to combine the
algorithms developed for the different phases and to run the data through the chain,
as presented in Figure 1.
The emphasis in the approach presented in Figure 1 is on the algorithms
processing the data. The information is expected to flow smoothly through the chain
from the beginning to the end on a single run, and the algorithms are usually
implemented within the same application. It is not unusual for the data analyst to take
care of all the phases, and often little attention is paid to the (non-standard)
storage format of the data. This may result in a number of difficulties that detract
from the quality of the data mining process. To name a few, the approach makes it
more challenging to apply methods implemented in a variety of tools to the data,
requires comprehensive knowledge from the analyst, and results in incoherent
storage of the research results and data.
The approach proposed in this study has a different perspective toward
implementing the data mining process. The emphasis is on a standard way of storing
the data between the different phases of the process. This, in turn, increases the
independence between the transformations and the data storage. Standard storage
makes it possible for the algorithms to access the data through a standard interface,
which allows interaction between the data and the algorithms implemented in
various applications supporting the interface. The approach will be explained in
more detail in the next chapter, and after that, the benefits of applying it will be
illustrated with a comparison to the traditional approach and a case study.
Extensive searches of scientific databases and the World Wide Web did not bring
to light similar approaches applied to the implementation of the data mining process.
However, there are studies and projects on the management of the data mining
process. These studies identify the main phases of the process in a manner similar to
that presented in Figure 1 and give a general outline of the steps that should be kept
in mind when realizing the process. One of the earliest efforts – perhaps the very
earliest one – was the CRISP-DM, initiated in 1996 by three companies that
proceeded to form a consortium called CRISP-DM (CRoss-Industry Standard
Process for Data Mining). CRISP-DM is also the name of the process model created
by the consortium, which was proposed to serve as a standard reference for all
appliers of data mining [1]. The goal of the process model is to offer a representation
of the phases and tasks involved that is generic enough to be applicable to any data
mining effort as well as guidelines on how to apply the process model. Although it
is difficult to verify a method as generic beyond all doubt, several studies testify to
the usefulness of CRISP-DM as a tool for managing data mining ventures ([2], [3],
[4]). The approach proposed in CRISP-DM was extended in RAMSYS [5], which
proposed a methodology for performing collaborative data mining work. Other
proposals, with many similarities to CRISP-DM, for the data mining process were
presented in [6] and [7]. Nevertheless, these studies did not take a stand on what
would be an effective implementation of the data mining process in practice. This
study proposes an effective approach for implementing the data mining process and
compares it to the traditional way of implementing the process, pointing out the
obvious advantages of the proposed method.
Figure 1: The traditional data mining process. (Diagram: the data flows directly through the chain raw data, pre-processing, feature extraction, modelling.)
2. A truly stepwise method for managing and implementing the
data mining process
This chapter presents a general framework for the proposed method and defines
the way in which it can be applied to the management of the data mining process.
The two basic issues that result in a number of difficulties when using the traditional
approach to data mining are:
1) The transformations are organized in a way that makes them highly
dependent on each other, and
2) all transformations are usually calculated at once.
To demonstrate these problems and to present an idea for a solution, the following formalism is used. The data supplied for the analysis can be assumed to be stored in a matrix X_0. The result of the analysis, X_n, is obtained when the nth transformation (function) f_n(X) is applied to the data. The transformations are applied step-by-step to the data, but they are calculated all at once, and the results are not stored until the last transformation has been applied. Using the above notation, the process can be defined as the inner functions of the supplied data, which leads to:
Definition: Stepwise data mining process (the traditional approach). The stepwise data mining process is a chain of inner transformations, f_1, ..., f_n, that process the raw data, X_0, without storing it until the desired data, the output X_n, has been obtained:

X_n = f_n( ... f_2( f_1( X_0 ) ) ... ).

This points out clearly the marked dependence between the transformations and the fact that all transformations are calculated at once. However, the data is not dependent on the transformations in such a way that all transformations would have to be calculated in a single run. The result, X_n, might equally well be generated in a truly stepwise application of the transformations, which leads to the definition of the proposed data mining process.
Definition: Truly stepwise data mining process. The results, X_1, ..., X_n, of each transformation, f_1, ..., f_n, are stored in a storage medium before applying the next transformation in the chain to them:

1)  X_1 = f_1( X_0 )
2)  X_2 = f_2( X_1 )
    ...
n)  X_n = f_n( X_{n-1} ).
This approach makes the transformations less dependent on each other: to be able to calculate the kth transformation (k = 1, ..., n), one does not need to calculate all the (k-1) transformations prior to k, but just to fetch the data, X_{k-1}, stored after the (k-1)th transformation and to apply the transformation k to that. The obvious difference between the two processes is that, in the latter, the result of the kth transformation is dependent only on the data, X_{k-1}, while in the former, it is dependent on X_0 and the transformations f_1, ..., f_{k-1}. The difference between these two definitions, or approaches, might seem small at this stage, but it will be shown below how large it actually is.
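To make the contrast concrete, the following minimal Python sketch (not part of the original paper; the transformations and file names are purely illustrative) computes the same X_n both ways: once as a single chained run, and once one phase at a time with every intermediate result stored in a simple storage medium (pickle files stand in here for the database recommended below).

import pickle

def f1(x):
    # example transformation 1 (hypothetical): remove zero entries
    return [v for v in x if v != 0.0]

def f2(x):
    # example transformation 2 (hypothetical): scale by the maximum value
    m = max(x)
    return [v / m for v in x]

def f3(x):
    # example transformation 3 (hypothetical): reduce to a single summary value
    return sum(x) / len(x)

x0 = [0.0, 1.0, 2.0, 4.0, 0.0]

# Stepwise (traditional) process: X_n = f_n(...f_2(f_1(X_0))...), nothing stored in between.
x3_chained = f3(f2(f1(x0)))

# Truly stepwise process: each X_k is stored, and phase k only reads the stored X_{k-1}.
def store(name, data):
    with open(name, "wb") as f:
        pickle.dump(data, f)

def load(name):
    with open(name, "rb") as f:
        return pickle.load(f)

store("x1.pkl", f1(x0))                 # phase 1, run once
store("x2.pkl", f2(load("x1.pkl")))     # phase 2 depends only on the stored X_1
x3_stored = f3(load("x2.pkl"))          # phase 3 depends only on the stored X_2

assert x3_chained == x3_stored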
In theory, the result of each transformation could be stored in the storage medium
(preferably a database). In the context of data mining, however, it is more feasible to
store the data only after the main phases of the transformations. The main phases are
the same as those shown in Figure 1. Now that the proposed truly stepwise data
mining process has been defined and the main phases have been identified, the
stepwise process presented in Figure 1 can be altered to reflect the developments, as
shown in Figure 2. The apparent change is the emphasis on the storage of the data.
In Figure 1, the data flowed from one transformation to another, and the boxes
represented the transformations. In Figure 2, the boxes represent the data stored in
the different layers, and the transformations make the data flow from one layer to
another. Thus, the two figures have all the same components, but the effect of
emphasizing the storage of the data is apparent. The notion of the transformations
carrying the data between the storage layers also seems more natural than the idea
that the data is transmitted between the different transformations.
A few more comments on the diagram should be made before presenting the
comparison of the two approaches. Four storage layers are defined, i.e. the layers of
raw data, pre-processed data, features, and models. One more layer could be added
to the structure: a layer representing the best model selected from the pool of
available models. On the other hand, this is not necessary, since the presented
approach could be applied to the pool of models, treating the generated models as
raw data. In this case, the layers would define the required steps for choosing the
best model. Another comment can be made concerning the amount and scope of data
stored in the different layers. As the amount of data grows toward the bottom layers,
the scope of data decreases, and vice versa. In practice, if the storage capabilities of
the system are limited and unlimited amounts of data are available, the stored
features may cover a broader range of data than pure data could. This is pointed out
in the figure by the two arrows on the sides.
Figure 2: The four storage layers of the proposed data mining process. (Diagram: data layer 1, raw data; data layer 2, pre-processed data; data layer 3, features; data layer 4, models. The layers are connected by the pre-processing, feature extraction, and analysis/modelling transformations, and the two arrows on the sides indicate that the amount of data grows in one direction across the layers while the scope of the data grows in the other.)
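If a relational database is chosen as the storage medium, the four layers map naturally onto four groups of tables. The following Python sketch is only an illustration of the idea; the table and column names are hypothetical and are not the schema used in the project.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Data layer 1: raw data (largest amount of data, narrowest scope)
    CREATE TABLE raw_data     (spot_id INTEGER, t REAL, voltage REAL, current_a REAL);
    -- Data layer 2: pre-processed data
    CREATE TABLE preprocessed (spot_id INTEGER, t REAL, resistance REAL);
    -- Data layer 3: features extracted from the pre-processed curves
    CREATE TABLE features     (spot_id INTEGER, name TEXT, value REAL);
    -- Data layer 4: models built on the features (smallest amount of data, broadest scope)
    CREATE TABLE models       (model_id INTEGER, description TEXT, parameters BLOB);
""")
conn.commit()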
3. The proposed vs. the traditional method
In this chapter, the various benefits of the truly stepwise approach over the
stepwise one are illustrated.
Independence between the different phases of the data mining process. In the
stepwise approach, the output of a transformation is directly dependent on each of
the transformations applied prior to it. To use an old phrase, the chain is as weak as
its weakest link. In other words, if one of the transformations does not work
properly, none of the transformations following it can be assumed to work properly,
either, since each is directly dependent on the output of the previous
transformations. In the truly stepwise method, a transformation is directly dependent
only on the data stored in the layer immediately prior to the transformation, not on
the previous transformations. The transformations prior to a certain transformation
do not necessarily have to work perfectly, it is enough that the data stored in the
previous layers is correct. From the viewpoint of the transformations, it does not
matter how the data was acquired, e.g. whether it was calculated using the previous
transformations or even inserted manually.
The multitude of algorithms easily applicable to the data. In the stepwise
procedure, the algorithms must be implemented in one way or another inside the
same tool, since the data flows directly from one algorithm to another. In the truly
stepwise approach, the number of algorithms is not limited to those implemented in
a certain tool, but is proportional to the number of tools that implement an interface
for accessing the storage medium. The most frequently used interface is the database
interface for accessing data stored in a database using SQL. Therefore, if a standard
database is used as a storage medium, the number of algorithms is limited to the
number of tools implementing a database interface – which is large.
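For example, a few lines of Python (or any other tool with a database interface) are enough to fetch a stored pre-processed curve for further analysis; the database file and table below are the hypothetical ones from the schema sketch above.

import sqlite3

conn = sqlite3.connect("welding.db")    # hypothetical database file holding the storage layers
curve = conn.execute(
    "SELECT t, resistance FROM preprocessed WHERE spot_id = ? ORDER BY t",
    (42,),
).fetchall()
# 'curve' can now be passed to any algorithm, regardless of which tool produced the layer.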
Specialization and teamwork of researchers. The different phases of the data
mining process require so much expertise that it is hard to find people who would be
experts in all of them. It is easier to find an expert specialized in some of the phases
or transformations. However, in most data mining projects, the researcher must
apply or know details of many, if not all, of the steps of the data mining chain, to be
able to conduct the work. This results in wasted resources, since it takes some of her
/ his time away from the area she / he is specialized in. Furthermore, when a team of
data miners is performing a data mining project, it might be that everybody is doing
a bit of everything. This results in confusion in the project management and de-
synchronization of the tasks. Using the proposed method, the researchers can work
on the data relevant to their specialization. When a team of data miners are working
on the project, the work can be naturally divided between the workers by allocating
the data stored in the different layers to suit the expertise and skills of each person.
Data storage and on-line monitoring. The data acquired in the different phases of
the data mining process is stored in a coherent way when, for example, a standard
database is used to implement the truly stepwise process. When the data can be
accessed through a standard interface after the transformations, one can peek in on
the data at any time during the process. This can be convenient, especially in
situations where the data mining chain is delivered as a finished implementation.
When using a database interface, one can even select the monitoring tools from a set
of readily available software. To monitor the different phases of the stepwise
process, it would be necessary to display the output of the transformations in some
way, which requires extra work.
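For instance, on-line monitoring can be as light as an ad-hoc query against the storage medium while the process is running; the sketch below again assumes the hypothetical preprocessed table introduced earlier.

import sqlite3

conn = sqlite3.connect("welding.db")    # hypothetical database file
(n_spots,) = conn.execute("SELECT COUNT(DISTINCT spot_id) FROM preprocessed").fetchone()
print(f"{n_spots} welding spots have been pre-processed so far")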
Time savings. When the data in the different layers has been calculated once in
the truly stepwise process, it does not need to be re-calculated unless it needs to be
changed. When working with large data sets, this may result in enormous time
savings. Using the traditional method, the transformations must be recalculated
when one wants to access the output of any phase of the data mining chain, which
results in unnecessary waste of staff and CPU time.
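One simple way to realize these savings is to guard each phase so that a layer is recomputed only when its stored copy is missing; the sketch below uses the same illustrative pickle-file storage as before, and the commented-out phase functions are hypothetical.

import os
import pickle

def layer(path, compute):
    """Return the stored result of a phase, computing and storing it only if it is absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)       # reuse the stored layer, no recomputation
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Usage (hypothetical phase functions):
# raw = layer("raw.pkl", acquire_raw_data)
# pre = layer("preprocessed.pkl", lambda: preprocess(raw))
# feats = layer("features.pkl", lambda: extract_features(pre))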
Now that the numerous benefits of the proposed method have been presented, we could ask what the drawbacks of the method are. The most obvious cost is the care and effort one has to invest in defining the interface for transferring the intermediate data to the storage medium. On the other hand, if this work is left undone, one may have to spend twice as much time tackling the flaws in the data mining process. It might also seem that calculating the whole data mining chain is faster with the stepwise process than with the truly stepwise process. That is true, but once the transformations in the truly stepwise process are ready and finished, the process can be run in the stepwise manner. In conclusion, no obvious drawbacks of the truly stepwise process have been detected so far.
4. A case study – pre-processing spot welding data
This chapter illustrates the benefits of the proposed method in practice. The idea
is here applied to a data mining project analysing the quality of spot welding joints,
and a detailed comparison to the traditional approach is made concerning the amount
of work required for acquiring pre-processed data.
The spot welding quality improvement project (SIOUX) is a two-year EU-
sponsored CRAFT project aiming to create non-destructive quality estimation
methods for a wide range of spot welding applications. Spot welding is a welding
technique widely used in, for example, the electrical and automotive industries,
where more than 100 million spot welding joints are produced daily in the European
vehicle industry alone [8]. Non-destructive quality estimates can be calculated based
on the shape of the signal curves measured during the welding event [9], [10]. The
method results in savings in time, material, environment, and salary costs – which
are the kind of advantages that the European manufacturing industry should have in
their competition against outsourcing work to cheaper countries.
The collected data consists of information regarding the welded materials, the
quality of the welding spot, the settings of the welding machine, and the voltage and
current signals measured during the welding event. To demonstrate the data, the left
panel of Figure 3 displays a typical voltage curve acquired from a welding spot, and
the right panel shows a resistance curve obtained by pre-processing the data.
Figure 3: The left panel shows a voltage signal of a welding spot measured during a welding event. The
high variations and the flat regions are still apparent in the diagram. The right panel shows the resistance
curve after pre-processing.
The data transformations needed for pre-processing signal curves consist of
removal of the flat regions from the signal curves (welding machine inactivity),
normalization of the curves to a pre-defined interval, smoothing of the curves using
a filter, and calculation of the resistance curve based on the voltage and current
signals.
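One possible reading of these four steps in Python is sketched below; the flat-region threshold, the normalization, and the filter width are hypothetical choices, not the settings used in the SIOUX project.

import numpy as np

def preprocess_curve(voltage, current, flat_threshold=1e-3, window=5):
    voltage = np.asarray(voltage, dtype=float)
    current = np.asarray(current, dtype=float)

    # 1) Remove the flat regions (welding machine inactivity), detected here as near-zero current.
    active = np.abs(current) > flat_threshold
    voltage, current = voltage[active], current[active]

    # 2) Normalize the curves to a pre-defined interval, here by scaling each curve by its peak.
    voltage = voltage / np.max(np.abs(voltage))
    current = current / np.max(np.abs(current))

    # 3) Smooth the curves using a filter, here a simple moving average.
    kernel = np.ones(window) / window
    voltage = np.convolve(voltage, kernel, mode="same")
    current = np.convolve(current, kernel, mode="same")

    # 4) Calculate the resistance curve from the voltage and current signals
    #    (proportional to the true resistance because of the scaling in step 2).
    return voltage / np.where(current == 0.0, np.nan, current)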
The transformations are implemented in software written specifically for this
project, called Tomahawk. The software incorporates all the algorithms required for
calculating the quality estimate of a welding spot, along with a database for storing
the welding data. The software and the database are closely connected, but
independent. The basic principles of the system are presented in Figure 4. The
special beauty of Tomahawk lies in the way the algorithms are implemented as a
connected chain. Hence, the product of applying all the algorithms is the desired
output of the data mining process. The algorithms are called plug-ins, and the way
the data is transmitted between each pair of plug-ins is well defined. When the
program is executed, the chain of plug-ins is executed at once. This is an
implementation of the definition of the stepwise (traditional) data mining process.
Figure 4: The operating principle of the Tomahawk software. (Diagram: the welding data enters a chain of plug-ins, Plug-in 1: Transformation 1 through Plug-in n: Transformation n, whose final output is the quality measure.) The architecture is a realization of the stepwise data mining process.
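The plug-in chain principle of Figure 4 can be outlined roughly as follows; this is an illustration only, since the paper does not describe Tomahawk's actual plug-in interface, and the two example plug-ins and their threshold are hypothetical.

class PlugIn:
    """A single transformation; the output of one plug-in is the input of the next."""
    def run(self, data):
        raise NotImplementedError

class RemoveFlatRegions(PlugIn):
    def run(self, data):
        # keep only the (voltage, current) samples where the welding machine is active
        return [(v, i) for v, i in data if abs(i) > 1e-3]

class ComputeResistance(PlugIn):
    def run(self, data):
        return [v / i for v, i in data]

def run_chain(plugins, welding_data):
    # Stepwise (traditional) execution: the whole chain is run at once,
    # and the intermediate results are not stored anywhere.
    data = welding_data
    for plugin in plugins:
        data = plugin.run(data)
    return data

result = run_chain([RemoveFlatRegions(), ComputeResistance()],
                   [(1.2, 0.0), (1.1, 0.9), (1.3, 1.0)])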
When the project has been completed, all the plug-ins should be ready and work
for all kinds of welding data as seamlessly as presented in Figure 4. However, in the
production phase of the system, when the plug-ins are still under active
development, three major issues that interfere with the daily work of the
development team can be recognized; they correspond to the benefits discussed in
the chapter “The proposed vs. the traditional method”.
• Independence. It cannot be guaranteed that all parts of the pre-processing
algorithms would work as they should for all the available data. However,
the researcher working on the pre-processed data is dependent on the pre-
processing sequence. Because of this, she/he cannot be sure that the data is
always correctly pre-processed.
• Specialization and teamwork. The expert working on the pre-processed
data might not have the expertise to correctly pre-process the raw data in
the context of Tomahawk, which would make it impossible for him/her to
perform her/his work correctly.
• The multitude of algorithms easily applicable to the data. In the production
phase, it is better if the range of algorithms tested on the data is not
exclusively limited to the implementation of the algorithms in Tomahawk,
since it would require a lot of effort to re-implement algorithms available
elsewhere as plug-ins before testing them.
The solution was to develop Tomahawk in such a way that it supports the truly
stepwise data mining process. A plug-in capable of storing and delivering pre-
processed data was implemented. Figure 5 presents the effects of the developments.
The left panel displays the pre-processing sequence prior to the adjustments. All the
plug-ins were calculated at once, and they had to be properly configured to obtain
properly pre-processed data. The right panel shows the situation after the adoption
of the truly stepwise data mining process. The pre-processing can be done in its own
sequence, after which a plug-in that inserts the data into the database is applied.
Now the pre-processed data is in the database and available for further use at any
given time.
Figure 5: The left panel shows the application of the stepwise data mining process on the pre-processing of the raw data in Tomahawk. The right panel shows Tomahawk after the modifications that made it support the truly stepwise data mining process for pre-processing. (Diagram: a sequence of pre-processing plug-ins, the Tomahawk database, the expert, and plug-ins for outputting pre-processed data to the database and for loading pre-processed data from the database.)
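In the same illustrative style as above, the modification of Figure 5 amounts to two extra plug-ins wrapping the database: one that ends the pre-processing sequence by writing its output to the database, and one that lets later sequences start from the stored, correctly pre-processed data. The class and table names below are hypothetical.

import sqlite3

class StorePreprocessedData:
    """Last plug-in of the pre-processing sequence: writes its input to the database."""
    def __init__(self, db_path):
        self.db_path = db_path
    def run(self, data):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS preprocessed_curve (idx INTEGER, value REAL)")
            conn.executemany("INSERT INTO preprocessed_curve VALUES (?, ?)", list(enumerate(data)))
        return data

class LoadPreprocessedData:
    """First plug-in of a later sequence: reads correctly pre-processed data from the database."""
    def __init__(self, db_path):
        self.db_path = db_path
    def run(self, _ignored=None):
        with sqlite3.connect(self.db_path) as conn:
            rows = conn.execute("SELECT value FROM preprocessed_curve ORDER BY idx").fetchall()
        return [value for (value,) in rows]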
The first and second issues are simple to solve by using the new approach. The
pre-processing expert of the project takes care of properly configuring the pre-
processing plug-ins. If the plug-ins need to be re-configured or re-programmed for
different data sets, she / he has the requisite knowledge to do it, and after the
application of the re-configured plug-ins, the data can be saved in the database. If it
is not possible to find a working combination of plug-ins at the current state of
development, the data can still be pre-processed manually, which would not be
feasible when using the stepwise process. After this, the expert working on pre-
processed data can load the data from the database and be confident that the data she
/ he is working on has been correctly pre-processed. The third issue is also easy to
solve; after the modifications, the set of algorithms that can be tested on the data is
no longer limited to those implemented in Tomahawk, but includes tools that have a
database interface implemented in them, for example Matlab. This drastically
expands the range of available algorithms, which in turn also makes it faster to
find an algorithm suitable for a given task. As soon as a suitable algorithm has been
found, it can be implemented in Tomahawk.
Finally, a comparison of the steps required for pre-processing the data in the
SIOUX project using the stepwise and truly stepwise approaches is presented. The
motivation of the comparison is to demonstrate how large a task it would be for the
researcher working on pre-processed data to pre-process the data using the stepwise
approach before she / he could start the actual work.
If one wants to acquire pre-processed data using the stepwise approach, it takes
the application and configuration of 8 plug-ins to pre-process the data. The left panel
of Figure 6 shows one of the configuration dialogs of the plug-ins. This particular
panel has 4 numerical values that must be set correctly and the option of setting 6
check boxes. The total number of options the researcher has to set in the 8 plug-ins
for acquiring correctly pre-processed data is 68. The 68 options are not the same for
all the data gathered in the project, and it requires advanced pre-processing skills to
configure them correctly. Therefore, it is quite a complicated task to pre-process the
data, and it is especially difficult for a researcher who has not constructed the pre-
processing plug-ins. The need to configure the 68 options of the pre-processing
sequence would take a lot of time and expertise away from the actual work and still
give little confidence that the data is correctly pre-processed.
To acquire the pre-processed data using the truly stepwise approach, one only
needs to fetch the data from the database. The right panel of Figure 6 shows the
configuration dialog of the database plug-in, which is used to configure the data
fetched for analysis from the database. Using the dialog, the researcher working on
the pre-processed data can simply choose the pre-processed data items that will be
used in the further analyses, and she / he does not have to bother with the actual pre-
processing of the data. The researcher can be sure that all the data loaded from the
database has been correctly pre-processed by the expert in pre-processing. From the
viewpoint of the researcher responsible for the pre-processing, it is good to know
that the sequence of pre-processing plug-ins does not have to be run every time that
pre-processed data is needed, and that she / he can be sure that correctly pre-
processed data will be used in the further steps of the data mining process.
In conclusion, by using the stepwise process, a researcher working with pre-
processed data could never be certain that the data had been correctly pre-processed,
or that all the plug-ins had been configured the way they should have been, which
resulted in confusion and uncertainty about the quality of the data. The truly stepwise
process, on the other hand, allowed a notably simple way to access the pre-processed
data, resulted in time savings, and ensured that the analyzed data were correctly pre-
processed.
Figure 6: The left panel shows one of the 8 dialogues that need to be filled in to acquire pre-processed
signal curves. The right panel shows the dialogue that is used for fetching raw and pre-processed data
directly from the database.
5. Conclusions
This paper presented a new approach for managing the data mining process,
called truly stepwise data mining process. In the truly stepwise process, the
transformed data is stored after the main phases of the data mining process, and the
transformations are applied to data fetched from the data storage medium. The
benefits of the process compared to the stepwise data mining process (the traditional
approach) were analyzed. It was noticed that the proposed approach increases the
independence of the algorithms applied to the data, increases the number of algorithms
easily applicable to the data, and makes it easier to manage and allocate the expertise
and teamwork of the data analysts. Also, data storage and on-line monitoring of the
data mining process are easier to organize using the new method, and it saves both
staff and CPU time. The approach was illustrated using a case study of a spot
welding data mining project. The two approaches were compared, and it was
demonstrated that the proposed method markedly simplified the tasks of the
specialist working on the pre-processed data.
In the future, the possibilities to apply the approach on a finer scale will be
studied - here it was only applied after the main phases of the data mining process.
The feature and model data of the approach will also be demonstrated, and the
application of the method will be extended to other data mining projects.
6. Acknowledgements
We would like to express our gratitude to our colleagues at Fachhochschule
Karlsruhe, Institut für Innovation und Transfer, in Harms + Wende GmbH & Co.KG
[11], in Technax Industrie [12] and in Stanzbiegetechnik GesmbH [13] for providing
the data set, the expertise needed in the case study and for numerous other things
that made it possible to accomplish this work. We also wish to thank the graduate
school GETA [14], supported by Academy of Finland, for sponsoring this research.
Furthermore, this study has been financially supported by the Commission of the
European Communities, specific RTD programme “Competitive and Sustainable
Growth”, G1ST-CT-2002-50245, “SIOUX” (“Intelligent System for Dynamic Online
Quality Control of Spot Welding Processes for Cross(X)-Sectoral Applications”). It
does not necessarily reflect the views of this programme and in no way anticipates
the Commission’s future policy in this area.
References
[1] P. Chapman, J. Clinton, T. Khabaza, T. Reinartz and R. Wirth, "CRISP-DM 1.0 Step-by-step data mining guide," August 2000.
[2] E. Hotz, U. Grimmer, W. Heuser and G. Nakhaeizadeh, "REVI-MINER, a KDD-Environment for Deviation Detection and Analysis of Warranty and Goodwill Cost Statements in Automotive Industry," in Proc. Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), 2001, pp. 432-437.
[3] J.B. Liu and J. Han, "A Practical Knowledge Discovery Process for Distributed Data Mining," in Proc. ISCA 11th International Conference on Intelligent Systems: Emerging Technologies, 2002, pp. 11-16.
[4] E.M. Silva, H.A. do Prado and E. Ferneda, "Text mining: crossing the chasm between the academy and the industry," in Proc. Third International Conference on Data Mining, 2002, pp. 351-361.
[5] S. Moyle and A. Jorge, "RAMSYS - A methodology for supporting rapid remote collaborative data mining projects," in ECML/PKDD'01 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning: Internal SolEuNet Session, 2001, pp. 20-31.
[6] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers, 1999.
[7] R.J. Brachman and T. Anand, "The Process of Knowledge Discovery in Databases: A Human-Centered Approach," in Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, Eds. MIT Press, 1996, pp. 37-58.
[8] TWI World Centre for Materials Joining Technology, information available at their homepage: http://www.twi.co.uk/j32k/protected/band_3/kssaw001.html, referenced 31.12.2003.
[9] P. Laurinen, H. Junno, L. Tuovinen and J. Röning, "Studying the Quality of Resistance Spot Welding Joints Using Bayesian Networks," Artificial Intelligence and Applications (AIA 2004), Innsbruck, Austria, February 16-18, 2004.
[10] H. Junno, P. Laurinen, L. Tuovinen and J. Röning, "Studying the Quality of Resistance Spot Welding Joints Using Self-Organising Maps," Fourth International ICSC Symposium on Engineering of Intelligent Systems (EIS 2004), Madeira, Portugal, February 29 - March 2, 2004.
[11] Harms+Wende GmbH & Co. KG, http://www.harms-wende.de/, referenced 13.2.2004.
[12] Technax Industrie, http://www.technaxindustrie.com/, referenced 13.2.2004.
[13] Stanzbiegetechnik GesmbH, http://www.stanzbiegetechnik.at/Startseite/index.php, referenced 13.2.2004.
[14] Graduate school GETA, http://wooster.hut.fi/geta/, referenced 13.2.2004.