This document provides an overview of techniques for temporal data mining. It discusses how temporal sequences arise in domains like engineering, science, finance, and healthcare. It covers approaches for representing temporal sequences, including keeping data in its original form, piecewise approximations, transformations to other domains, discretization, and generative models. It also discusses measuring similarity between sequences and techniques for classification and relation finding in temporal data, including frequent pattern detection and prediction. The goal is to classify and organize temporal data mining techniques to help practitioners address problems involving temporal data.
This document discusses using Fourier transforms to measure image similarity. It proposes a metric that uses both the real and complex components of the Fourier transform to compute a similarity ranking between two images. The metric calculates the intersection of the covariance matrix of the Fourier transform magnitude and phase spectra of the two images. The approach is shown to be advantageous for images with varying degrees of lighting. It summarizes existing methods for image comparison and feature extraction, and discusses implementing the proposed similarity metric using OpenCV. Sample results are given comparing the new Fourier-based method to existing histogram comparison techniques.
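As a concrete (and heavily simplified) illustration of comparing signals in the frequency domain, the sketch below runs a naive 1-D DFT and scores two signals by histogram intersection of their normalized magnitude spectra. The paper's covariance-based metric on 2-D magnitude and phase spectra, and its OpenCV implementation, are more involved; all names and data here are illustrative.

```python
import cmath

def dft(signal):
    # Naive discrete Fourier transform (O(n^2)); fine for a small sketch.
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def spectrum_similarity(a, b):
    # Compare normalized magnitude spectra with histogram intersection:
    # identical spectra give 1.0, very different ones score much lower.
    ma = [abs(c) for c in dft(a)]
    mb = [abs(c) for c in dft(b)]
    sa, sb = sum(ma) or 1.0, sum(mb) or 1.0
    ma = [v / sa for v in ma]
    mb = [v / sb for v in mb]
    return sum(min(x, y) for x, y in zip(ma, mb))

bright = [10, 12, 10, 12, 10, 12, 10, 12]   # a "pattern"
dim    = [v * 0.5 for v in bright]          # same pattern, half the lighting
other  = [1, 9, 2, 8, 3, 7, 4, 6]           # a different pattern

print(spectrum_similarity(bright, dim))     # high: structure matches
print(spectrum_similarity(bright, other))   # lower
```

Because normalization cancels a uniform intensity scale, the "dim" signal still scores as identical in structure, which mirrors the lighting-robustness claim.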
This document provides an overview of data clustering techniques from a statistical pattern recognition perspective. Clustering is the unsupervised classification of patterns into groups based on similarity. The key components of clustering include pattern representation, similarity measures, clustering techniques, and cluster validation. A wide variety of clustering techniques have been developed, each with different assumptions and applications. Selecting an appropriate clustering technique requires an understanding of the data and domain expertise.
On Tracking Behavior of Streaming Data: An Unsupervised Approach (Waqas Tariq)
In recent years, data streams have drawn the attention of many researchers across different domains. These researchers share a common difficulty when discovering unknown patterns within data streams: concept change. The notion of concept change refers to points where the underlying distribution of the data changes over time. Different methods have been proposed to detect changes in a data stream, but most rest on the unrealistic assumption that data labels are available to the learning algorithm. In real-world problems, however, labels for streaming data are rarely available, which is why the data stream community has recently focused on the unsupervised setting. This study is based on the observation that unsupervised approaches for learning from data streams are not yet mature; they provide only mediocre performance, especially on multi-dimensional streams. In this paper, we propose a method for Tracking Changes in the behavior of instances using the Cumulative Density Function, abbreviated TrackChCDF. Our method detects change points along an unlabeled data stream accurately and also determines the trend of the data, called closing or opening. The advantages of our approach are threefold. First, it detects change points accurately. Second, it works well on multi-dimensional data streams. Last but not least, it determines the type of change, namely the closing or opening of instances over time, which has broad applications in fields such as economics, the stock market, and medical diagnosis. We compare our algorithm to the state-of-the-art method for concept change detection in data streams, and the results obtained are very promising.
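The abstract does not spell out TrackChCDF's procedure, but a common CDF-based building block for unsupervised change detection is the two-sample Kolmogorov-Smirnov distance between a reference window and a sliding test window. A minimal sketch, with illustrative window size and threshold:

```python
import random

def ks_distance(a, b):
    # Maximum gap between the two empirical CDFs (two-sample KS statistic).
    xs = sorted(set(a) | set(b))
    sa, sb = sorted(a), sorted(b)
    def ecdf(s, x):
        # Fraction of s that is <= x, via binary search.
        lo, hi = 0, len(s)
        while lo < hi:
            mid = (lo + hi) // 2
            if s[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        return lo / len(s)
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in xs)

def detect_change(stream, window=50, threshold=0.5):
    # Slide a test window against the initial reference window and flag
    # the first position where the empirical CDFs drift apart.
    ref = stream[:window]
    for i in range(window, len(stream) - window):
        if ks_distance(ref, stream[i:i + window]) > threshold:
            return i
    return None

random.seed(0)
stream = [random.gauss(0, 1) for _ in range(200)] + \
         [random.gauss(4, 1) for _ in range(200)]
print(detect_change(stream))  # fires once shifted samples dominate the window
```

Labels are never used: the detector looks only at the shape of the data's distribution, which is the setting the paper targets.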
Hybrid Approach for Brain Tumour Detection in Image Segmentation (ijtsrd)
In this paper we illustrate a few techniques; the number of available techniques is so large that they cannot all be addressed. Image segmentation forms the basis of pattern recognition and scene analysis problems. Segmentation techniques are numerous, and the choice of one technique over another depends only on the application or the requirements of the problem under consideration. Cluster analysis is a descriptive task that identifies homogeneous groups of objects, and it is one of the fundamental analytical methods in data mining. The main idea of this paper is to present facts about a brain tumour detection system and the various data mining methods used in it. It focuses on scalable data systems, which include a set of tools and mechanisms to load, extract, and transform disparate data for complex analysis, with comparison measured between the Fourier and Wavelet Transform distances. Sandeep | Jyoti Kataria, "Hybrid Approach for Brain Tumour Detection in Image Segmentation", Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-6, October 2020, URL: https://www.ijtsrd.com/papers/ijtsrd33409.pdf Paper URL: https://www.ijtsrd.com/medicine/other/33409/hybrid-approach-for-brain-tumour-detection-in-image-segmentation/sandeep
Brain Tumor Segmentation Based on SFCM using Neural Network (IRJET Journal)
This document describes a proposed system for brain tumor segmentation using neural networks. The system involves 4 phases: 1) Preprocessing MRI images using dual-tree complex wavelet transforms for feature extraction. 2) Spatial fuzzy C-means clustering to classify tissues into normal, tumor core and edema classes. 3) Extracting features using the dual-tree complex wavelet transforms. 4) Classifying the features using a backpropagation neural network to identify normal and abnormal brain tissues. The goal is to automatically and accurately segment brain tumors from MRI images to aid diagnosis and reduce analysis time for radiologists. The system was tested on real patient MRI data and achieved accurate segmentation results.
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER (cscpconf)
In this paper, we study the performance criteria of machine learning tools in classifying breast cancer. We compare data mining tools such as Naïve Bayes, support vector machines, radial basis neural networks, J48 decision trees and simple CART. We used both binary and multi-class data sets, namely WBC, WDBC and Breast Tissue from the UCI machine learning repository. The experiments are conducted in WEKA. The aim of this research is to find the best classifier with respect to accuracy, precision, sensitivity and specificity in detecting breast cancer.
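The four evaluation criteria named above all derive from the binary confusion matrix; a small helper (illustrative labels, not tied to WEKA's output) shows the computation:

```python
def binary_metrics(y_true, y_pred, positive="malignant"):
    # Tally the confusion matrix for the chosen positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "precision":   tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall / TPR
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # TNR
    }

truth = ["malignant", "malignant", "benign", "benign", "malignant", "benign"]
preds = ["malignant", "benign", "benign", "malignant", "malignant", "benign"]
print(binary_metrics(truth, preds))
```

Sensitivity and specificity matter separately in a diagnostic setting: a classifier can score high accuracy while still missing many positive (malignant) cases.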
System for Prediction of Non Stationary Time Series based on the Wavelet Radi... (IJECEIAES)
This document proposes and examines the performance of a hybrid time series forecasting model called the wavelet radial basis function neural network (WRBFNN) model. The WRBFNN model uses wavelet transforms to extract coefficients from time series data as inputs to a radial basis function neural network for forecasting. The performance of the WRBFNN model is compared to a wavelet feed forward neural network (WFFNN) model using four types of non-stationary time series data. Results show the WRBFNN model achieves more accurate forecasts than the WFFNN model for certain types of non-stationary data, and trains significantly faster with fewer computational resources required.
This document provides an overview and introduction to data mining techniques. It discusses how data mining is used to discover patterns, associations, and structures in large amounts of data in a semi-automatic way. The document outlines the typical data mining process, which includes understanding the problem domain, collecting and cleaning data, applying data mining algorithms like association rules, sequence mining, classification, and clustering, and then interpreting and evaluating the results. Several categories of data mining problems and techniques are described at a high level.
Different Classification Technique for Data mining in Insurance Industry usin... (IOSRjournaljce)
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining means the effective discovery of unknown patterns from a large database. It is an interactive knowledge discovery procedure that includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me... (IJAEMSJORNAL)
The current trend in industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. But the challenge with huge data sets is the high dimensionality associated with them. Sometimes in data analytics applications, large amounts of data produce worse performance. Also, most data mining algorithms are implemented column-wise, and too many columns restrict performance and make them slower. Therefore, dimensionality reduction is an important step in data analysis. Dimensionality reduction is a technique that converts high-dimensional data into a much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction. Each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self-Organising Map, a neural network technique, are presented in this paper. Also, a hierarchical clustering approach has been applied to the reduced data set. A case study of air quality measurement has been considered to evaluate the performance of the proposed techniques.
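As a sketch of the PCA step described above, the following snippet finds the leading principal component of 2-D data by power iteration on the covariance matrix and projects the points onto it. Real PCA on multivariate air-quality data keeps several components; all data here is synthetic.

```python
import math
import random

def first_principal_component(points, iters=200):
    # Center the data, build the 2x2 covariance matrix, then use power
    # iteration to find its dominant eigenvector (the first PC).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy) or 1.0
        vx, vy = wx / norm, wy / norm
    return (vx, vy), centered

random.seed(1)
# Strongly correlated 2-D data: most variance lies along the line y = x.
pts = [(t, t + random.gauss(0, 0.1))
       for t in [random.uniform(-1, 1) for _ in range(100)]]
direction, centered = first_principal_component(pts)
# 1-D projection: maximum-variance summary of the 2-D data.
scores = [x * direction[0] + y * direction[1] for x, y in centered]
print(direction)  # close to (0.707, 0.707), i.e. the y = x axis
```

Dropping the second coordinate of each projected point halves the dimensionality while keeping almost all of the variance, which is the rationale the abstract describes.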
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges involved. It then reviews various existing approaches for clustering uncertain data, including using soft classifiers and probabilistic databases. The document proposes combining k-means clustering with Voronoi diagrams and indexing techniques to improve the performance and efficiency of clustering uncertain datasets. It outlines a plan to integrate k-means with Voronoi diagrams and indexing to reduce execution time and increase clustering performance and results for uncertain data. Finally, it concludes that combining clustering with indexing approaches can better handle uncertain data clustering challenges.
Pattern recognition using context dependent memory model (cdmm) in multimodal... (ijfcstjournal)
Pattern recognition is one of the prime concepts in current technologies in both the private and public sectors. The analysis and recognition of two or more patterns is a complex task due to several factors: considering two or more patterns requires large storage space as well as computational effort. Vector logic gives a very good strategy for recognizing patterns. This paper proposes pattern recognition in a multimodal authentication system using vector logic, which makes the computational model hard to break and lowers the error rate. Using PCA, two to three biometric patterns are fused, and then keys of various sizes are extracted using an LU factorization approach. The selected keys are combined using vector logic, which introduces a memory model called the Context Dependent Memory Model (CDMM) as the computational model in the multimodal authentication system, giving very accurate and effective outcomes for both authentication and verification. In the verification step, Mean Square Error (MSE) and Normalized Correlation (NC) are used as metrics to minimize the error rate of the proposed model, and the performance analysis is presented.
Semi-Supervised Discriminant Analysis Based On Data Structure (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
The document proposes a strategy for clustering distributed databases using self-organizing maps (SOM) and K-means algorithms. The strategy applies SOM locally to each distributed data set to obtain representative subsets, then combines the results and applies SOM and K-means globally. Specifically, it performs local SOM clustering, sends representative data to a central site, applies SOM again on the combined data, then uses K-means on the unified map to produce the final clustering result.
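The two-stage strategy above can be sketched with plain k-means standing in for the SOM at both stages. This is a simplification for illustration only: the seeds, cluster counts and the 1-D data are all made up, and a real SOM produces a topology-preserving map rather than flat centers.

```python
import random

def kmeans_1d(values, k, iters=50):
    # Plain Lloyd's algorithm on scalars with deterministic quantile init.
    srt = sorted(values)
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

def make_site(seed, mean):
    # One "distributed site" holding locally generated readings.
    rng = random.Random(seed)
    return [rng.gauss(mean, 0.3) for _ in range(100)]

sites = [make_site(1, 0.0), make_site(2, 5.0), make_site(3, 10.0)]

# Stage 1 (local): each site condenses its data into a few representatives,
# which are the only values sent to the central site.
representatives = [c for site in sites for c in kmeans_1d(site, 4)]

# Stage 2 (central): cluster the combined representatives for the final result.
final_centers = kmeans_1d(representatives, 3)
print(final_centers)  # roughly [0, 5, 10]
```

The point of the two stages is bandwidth: each site ships only its representatives (4 values here) instead of its full data set, yet the global structure is still recovered.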
Parametric comparison based on split criterion on classification algorithm (IAEME Publication)
This document presents a comparison of different attribute selection criteria for classification algorithms in stream data mining. It analyzes two common criteria - information gain and Gini index - and evaluates their impact on classification accuracy using different datasets. The results show that information gain generally achieves higher accuracy than Gini index, especially for larger data sizes. The document aims to improve the performance of stream data classification algorithms by optimizing the split criterion selection approach.
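The two split criteria compared in the document are straightforward to compute for a candidate split; a minimal illustration with made-up labels:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class distribution, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity: probability two random draws disagree on class.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(parent, left, right):
    # Information gain = parent entropy minus weighted child entropy;
    # Gini gain is the analogous drop in Gini impurity.
    n = len(parent)
    w = lambda f, a, b: (len(a) / n) * f(a) + (len(b) / n) * f(b)
    return {
        "information_gain": entropy(parent) - w(entropy, left, right),
        "gini_gain": gini(parent) - w(gini, left, right),
    }

labels = ["spam"] * 4 + ["ham"] * 4
pure_split = split_scores(labels, ["spam"] * 4, ["ham"] * 4)
poor_split = split_scores(labels, ["spam", "ham"] * 2, ["spam", "ham"] * 2)
print(pure_split)  # both gains at their maximum for this node
print(poor_split)  # both gains zero: the split separates nothing
```

A tree learner scores every candidate attribute this way and splits on the highest-gain one; the document's comparison amounts to swapping which of these two scoring functions drives that choice.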
RAINFALL PREDICTION USING DATA MINING TECHNIQUES - A SURVEY (csandit)
Rainfall is considered one of the major components of the hydrological process; it plays a significant part in evaluating drought and flooding events. Therefore, it is important to have an accurate model for rainfall prediction. Recently, several data-driven modeling approaches have been investigated to perform such forecasting tasks, such as multilayer perceptron neural networks (MLP-NN). In fact, rainfall time series modeling (SARIMA) involves important temporal dimensions. In order to evaluate the outcomes of both models, statistical parameters were used to make the comparison between the two. These parameters include the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Coefficient of Correlation (CC) and BIAS. Two-thirds of the data was used for training the model and one-third for testing.
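The four comparison statistics named in the abstract compute directly from paired observed/predicted series; the rainfall values below are made up for illustration:

```python
import math

def forecast_metrics(observed, predicted):
    # RMSE, MAE, correlation coefficient and BIAS for a forecast.
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    bias = sum(errors) / n  # mean signed error: systematic over/under-forecast
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    cc = cov / (so * sp) if so and sp else 0.0
    return {"RMSE": rmse, "MAE": mae, "CC": cc, "BIAS": bias}

rain_obs = [12.0, 0.0, 3.5, 20.0, 8.0, 0.0]
rain_pred = [10.0, 1.0, 4.0, 18.0, 9.0, 0.5]
print(forecast_metrics(rain_obs, rain_pred))
```

RMSE penalizes large misses more than MAE does, while BIAS reveals a systematic tendency that the other two hide, which is why surveys like this one report all of them together.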
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
An efficient algorithm for sequence generation in data mining (ijcisjournal)
Data mining is the method or activity of analyzing data from different perspectives and summarizing it into useful information. Several major data mining techniques have been developed and are used in data mining projects, including association, classification, clustering, sequential patterns, prediction and decision trees. Among the different tasks in data mining, sequential pattern mining is one of the most important. Sequential pattern mining involves the mining of subsequences that appear frequently in a set of sequences. It has a variety of applications in several domains, such as the analysis of customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access patterns, seismologic data and weather observations. Various models and algorithms have been developed for the efficient mining of sequential patterns in large amounts of data. This research paper analyzes the efficiency of three sequence generation algorithms, namely GSP, SPADE and PrefixSpan, on a retail dataset by applying various performance factors. From the experimental results, it is observed that the PrefixSpan algorithm is more efficient than the other two.
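GSP, SPADE and PrefixSpan all rest on the same notions of subsequence containment and support; the sketch below shows those two definitions on a toy retail database (it is not any of the three full algorithms):

```python
def is_subsequence(pattern, sequence):
    # A pattern <a, b, ...> occurs in a sequence if its items appear
    # in order, not necessarily consecutively.
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` consumes the iterator

def support(pattern, database):
    # Fraction of customer sequences that contain the pattern.
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

# A toy retail database: each row is one customer's purchase sequence.
db = [
    ["bread", "milk", "beer"],
    ["bread", "diapers", "milk"],
    ["milk", "bread", "beer"],
    ["bread", "milk", "eggs", "beer"],
]
print(support(["bread", "milk"], db))  # 0.75: three of four sequences
print(support(["milk", "bread"], db))  # 0.25: order matters
```

A pattern is "frequent" when its support clears a user-chosen threshold; the three algorithms differ only in how cleverly they enumerate candidate patterns without scanning the whole database for each one.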
Predicting electricity consumption using hidden parameters (IJLT EMAS)
This paper presents a data mining technique to forecast the power demand of a geographical region based on meteorological conditions. The forecasting technique is implemented with a Hidden Markov Model (HMM). The data comprises the values of the factors on which consumption depends, such as temperature, humidity, and public holidays, together with the daily consumption values. Data mining operations are performed on this historical data to form a forecast model capable of predicting daily consumption given the meteorological parameters. The steps of the knowledge discovery from data process are implemented: the data is preprocessed and fed to the HMM for training, and the trained HMM is then used to predict the electricity demand for the given meteorological conditions.
This document summarizes a research paper on using a k-means clustering method to detect brain tumors in MRI images. The paper introduces brain tumors and MRI imaging. It then describes using k-means clustering for tumor segmentation, which groups similar image patterns into clusters to identify the tumor region. The paper presents results of applying k-means to two MRI images, including statistical measures of segmentation accuracy, tumor area comparison, and timing. The k-means method achieved average rand index of 0.8358, low average errors, and tumor areas close to manual segmentation in under 3 seconds, demonstrating potential for accurate and efficient brain tumor detection.
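The clustering step can be illustrated on a 1-D strip of gray levels: k-means groups the intensities, and the brightest cluster is taken as the tumour region. This is a toy stand-in for 2-D MRI segmentation; all pixel values below are synthetic.

```python
def kmeans_intensities(pixels, k=3, iters=30):
    # Lloyd's algorithm on gray levels; centers start at spread quantiles.
    srt = sorted(pixels)
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(iters):
        for idx, p in enumerate(pixels):
            labels[idx] = min(range(k), key=lambda j: abs(p - centers[j]))
        for j in range(k):
            members = [p for p, l in zip(pixels, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels

# A toy 1-D "slice": background (~20), healthy tissue (~100), tumour (~220).
row = [18, 22, 25, 95, 102, 98, 105, 218, 225, 222, 99, 21]
centers, labels = kmeans_intensities(row, k=3)
tumor_cluster = max(range(3), key=lambda j: centers[j])
tumor_mask = [l == tumor_cluster for l in labels]
print(tumor_mask)  # True exactly at the three brightest pixels
```

On a real image the same idea runs over all pixel intensities (or feature vectors) at once, which is why the paper can report segmentation in seconds: k-means needs only a few passes over the data.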
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F... (CSCJournals)
Spike sorting is of prime importance in neurophysiology and hence has received considerable attention. However, conventional methods suffer from the degradation of clustering results in the presence of high levels of noise contamination. This paper presents a scheme for taking advantage of automatic clustering and enhancing the feature extraction efficiency, especially for low-SNR spike data. The method employs linear discriminant analysis based on a fuzzy c-means (FCM) algorithm. Simulated spike data [1] were used as the test bed due to better a priori knowledge of the spike signals. Application to both high and low signal-to-noise ratio (SNR) data showed that the proposed method outperforms conventional principal-component analysis (PCA) and FCM algorithm. FCM failed to cluster spikes for low-SNR data. For two discriminative performance indices based on Fisher's discriminant criterion, the proposed approach was over 1.36 times the ratio of between- and within-class variation of PCA for spike data with SNR ranging from 1.5 to 4.5 dB. In conclusion, the proposed scheme is unsupervised and can enhance the performance of fuzzy c-means clustering in spike sorting with low-SNR data.
Data mining is a process to extract information from a huge amount of data and transform it into an understandable structure. Data mining provides a number of tasks to extract data from large databases, such as classification, clustering, regression, and association rule mining. This paper covers the concept of classification. Classification is an important data mining technique based on machine learning which is used to classify each item on the basis of its features with respect to a predefined set of classes or groups. This paper summarises various techniques that are implemented for classification, such as k-NN, Decision Tree, Naïve Bayes, SVM, ANN and RF. The techniques are analyzed and compared on the basis of their advantages and disadvantages.
Missing data exploration in air quality data set using R-package data visuali... (journalBEEI)
Missing values often occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated or replaced using an imputation method. Thus, the pattern of missing values must be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, including the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of percentage as well as gap size. This paper presents the application of several data visualisation tools from five R-packages, namely visdat, VIM, ggplot2, Amelia and UpSetR, for exploring data missingness. For illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to show the contribution of the visualisation tools in providing input and insight into the pattern of data missingness. Based on the results, it is shown that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness but long gaps of missingness.
The document discusses employing a rough set based approach for clustering categorical time-evolving data. It proposes using node importance and a rough membership function to label unlabeled data points and detect concept drift between data clusters over time. Specifically, it defines key terms like nodes, introduces the problem of clustering categorical time-series data and concept drift detection. It then describes using a rough membership function to calculate similarity between unlabeled data and existing clusters in order to label the data and detect changes in cluster characteristics over time.
A Review on Classification Based Approaches for STEGanalysis Detection (Editor IJCATR)
This document summarizes two approaches for image steganalysis detection. The first approach proposes novel steganalysis algorithms based on how data hiding affects the rate-distortion characteristics of images. Features are extracted based on increased image entropy and small, imperceptible distortions from data embedding. A Bayesian classifier is then trained on these features. The second approach uses contourlet transform to represent images. It extracts features based on the first four normalized statistical moments of high and low frequency subbands and structural similarity measure of medium frequency subbands. A non-linear support vector machine is then used for classification. Experimental results show the proposed approaches can efficiently detect stego images with high accuracy and low computational cost.
The main goal of cluster analysis is to classify elements into groups based on their similarity. Clustering has many applications, such as astronomy, bioinformatics, bibliography, and pattern recognition. In this paper, a survey of clustering methods and techniques, along with the advantages and disadvantages of these methods, is presented to give a solid background for choosing the best method to extract strong association rules.
Collaboration For Greater Impact Report Jan 09 V05 Final (Aidan Stacey)
This document discusses collaboration among non-profit organizations. It provides an overview of different types of collaboration, from informal networks to full mergers. The main motivations for collaboration include financial benefits from efficiencies, enhancing skills and expertise, and better serving organizational missions by coordinating services. Some key challenges to collaboration include overcoming cultural differences between organizations and potential staff turnover resulting from changes during the collaboration process. The document aims to provide food for thought for non-profits considering collaborative opportunities.
National Talent is a recruitment firm in Saudi Arabia that aims to promote and increase employment of Saudi nationals. Their mission is to develop Saudi talent for the benefit of local economies. Their vision is to be recognized as the leading recruitment provider in the kingdom by becoming a trusted partner for candidates and employers. They value operating with integrity, working in partnerships, encouraging local talent, performance, and innovation. The company is overseen by Tom McGregor and has recruitment consultants and specialists across branches in major Saudi cities.
This document outlines 50 essential content marketing hacks presented by Matt Heinz, President of Heinz Marketing Inc. at CMWorld. It provides an agenda for the presentation and covers topics such as content planning, measurement, formats, distribution, influencer engagement, repurposing content, and getting sales teams to leverage content. The goal is to provide new tools, tricks and best practices to help convert readers into customers through effective content marketing.
The document discusses prototyping and provides examples of different types of prototypes including paper prototypes, digital prototypes, storyboards, role plays, and space prototypes. It explains that prototyping is used to make ideas tangible and test reactions from users in order to gain insights. Prototypes should be iterated on and fail early to push ideas further and save time and money. Both low and high fidelity prototypes are mentioned as ways to test ideas at different stages of the design process.
Different Classification Technique for Data mining in Insurance Industry usin... (IOSRjournaljce)
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining is the discovery of previously unknown patterns in large databases. It is an interactive knowledge discovery procedure which includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me... (IJAEMSJORNAL)
The current trend in the industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. But the challenge with huge data sets is the high dimensionality associated with them. Sometimes in data analytics applications, large amounts of data produce worse performance. Also, most data mining algorithms are implemented column-wise, and too many columns restrict performance and make them slower. Therefore, dimensionality reduction is an important step in data analysis. Dimensionality reduction is a technique that converts high-dimensional data into a much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction. Each method has a different rationale for preserving the relationship between input parameters during analysis. Principal Component Analysis, a multivariate technique, and Self Organising Map, a neural network technique, are presented in this paper. Also, a hierarchical clustering approach has been applied to the reduced data set. A case study of air quality measurement has been considered to evaluate the performance of the proposed techniques.
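As a rough illustration of the variance-explained idea behind PCA (a minimal sketch, not the paper's implementation; the toy data and the function name `pca_2d` are invented for illustration):

```python
import math

def pca_2d(points):
    """Toy PCA for 2-D data: centre the data, build the 2x2 covariance
    matrix, and get its eigenvalues from the quadratic formula."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, y in centred) / (n - 1)
    syy = sum(y * y for x, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    return l1 / (l1 + l2)   # fraction of variance explained by PC1

# Strongly correlated data: one direction carries almost all the variance,
# so a single principal component suffices.
data = [(i, 2 * i + 0.1 * ((-1) ** i)) for i in range(10)]
ratio = pca_2d(data)
```

For correlated data like this, `ratio` is close to 1, which is exactly the condition under which dropping the remaining dimensions loses little information.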
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges involved. It then reviews various existing approaches for clustering uncertain data, including using soft classifiers and probabilistic databases. The document proposes combining k-means clustering with Voronoi diagrams and indexing techniques to improve the performance and efficiency of clustering uncertain datasets. It outlines a plan to integrate k-means with Voronoi diagrams and indexing to reduce execution time and increase clustering performance and results for uncertain data. Finally, it concludes that combining clustering with indexing approaches can better handle uncertain data clustering challenges.
Pattern recognition using context dependent memory model (cdmm) in multimodal... (ijfcstjournal)
Pattern recognition is one of the prime concepts in current technologies in both private and public sectors. The analysis and recognition of two or more patterns is a complex task due to several factors: considering two or more patterns demands substantial storage space as well as computational effort. Vector logic gives a very good strategy for recognizing patterns. This paper proposes pattern recognition in a multimodal authentication system with the use of vector logic, making the computational model hard to attack while keeping the error rate low. Using PCA, two to three biometric patterns are fused, and then various key sizes are extracted using an LU factorization approach. The selected keys are combined using vector logic, which introduces a memory model called the Context Dependent Memory Model (CDMM) as the computational model in the multimodal authentication system; this gives very accurate and effective outcomes for both authentication and verification. In the verification step, Mean Square Error (MSE) and Normalized Correlation (NC) are used as metrics to minimize the error rate of the proposed model, and a performance analysis is presented.
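The two verification metrics named in the abstract, MSE and normalized correlation, can be sketched as follows (a generic formulation on 1-D signals; the variable names and sample data are invented, not taken from the paper):

```python
def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def normalized_correlation(a, b):
    """Normalized correlation: 1.0 when the two signals are identical."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den

original  = [1.0, 2.0, 3.0, 4.0]
recovered = [1.0, 2.0, 3.0, 4.0]
print(mse(original, recovered), normalized_correlation(original, recovered))
# 0.0 1.0
```

A verification step would accept `recovered` when MSE is near 0 and NC is near 1.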
Semi-Supervised Discriminant Analysis Based On Data Structure (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
The document proposes a strategy for clustering distributed databases using self-organizing maps (SOM) and K-means algorithms. The strategy applies SOM locally to each distributed data set to obtain representative subsets, then combines the results and applies SOM and K-means globally. Specifically, it performs local SOM clustering, sends representative data to a central site, applies SOM again on the combined data, then uses K-means on the unified map to produce the final clustering result.
Parametric comparison based on split criterion on classification algorithm (IAEME Publication)
This document presents a comparison of different attribute selection criteria for classification algorithms in stream data mining. It analyzes two common criteria - information gain and Gini index - and evaluates their impact on classification accuracy using different datasets. The results show that information gain generally achieves higher accuracy than Gini index, especially for larger data sizes. The document aims to improve the performance of stream data classification algorithms by optimizing the split criterion selection approach.
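The two split criteria compared in that study can be computed directly (a textbook sketch on a toy label set; the function names and example split are invented for illustration):

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def split_scores(parent, left, right):
    """Information gain and Gini decrease for one binary split."""
    n = len(parent)
    wl, wr = len(left) / n, len(right) / n
    info_gain = entropy(parent) - (wl * entropy(left) + wr * entropy(right))
    gini_drop = gini(parent) - (wl * gini(left) + wr * gini(right))
    return info_gain, gini_drop

parent = ['yes'] * 5 + ['no'] * 5
ig, gd = split_scores(parent, ['yes'] * 5, ['no'] * 5)  # a perfect split
print(ig, gd)  # 1.0 0.5
```

A stream classifier would evaluate candidate splits with one of these scores and keep the best; the study's point is that the choice of score affects accuracy.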
RAINFALL PREDICTION USING DATA MINING TECHNIQUES - A SURVEY (csandit)
Rainfall is considered one of the major components of the hydrological process; it plays a significant part in evaluating drought and flooding events. Therefore, it is important to have an accurate model for rainfall prediction. Recently, several data-driven modeling approaches have been investigated for such forecasting tasks, such as multilayer perceptron neural networks (MLP-NN). In addition, rainfall time series modeling with SARIMA involves important temporal dimensions. To evaluate the two models, statistical parameters were used for the comparison: the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Coefficient of Correlation (CC), and BIAS. Two-thirds of the data was used for training the models and one-third for testing.
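The four comparison statistics named in the abstract are standard and can be computed in a few lines (a generic sketch; the observed/predicted series are invented sample values):

```python
import math

def forecast_stats(obs, pred):
    """RMSE, MAE, correlation coefficient (CC) and BIAS of a forecast."""
    n = len(obs)
    bias = sum(p - o for o, p in zip(obs, pred)) / n
    mae  = sum(abs(p - o) for o, p in zip(obs, pred)) / n
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so  = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp  = math.sqrt(sum((p - mp) ** 2 for p in pred))
    cc  = cov / (so * sp)
    return rmse, mae, cc, bias

observed  = [10.0, 20.0, 30.0, 40.0]   # e.g. monthly rainfall totals
predicted = [12.0, 18.0, 33.0, 41.0]
rmse, mae, cc, bias = forecast_stats(observed, predicted)
```

A positive BIAS means the model systematically over-predicts; CC close to 1 means the predicted series tracks the observed one.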
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
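Of the stream-summarization techniques listed, sampling is the simplest to sketch. Reservoir sampling keeps a uniform sample of fixed size from a stream of unknown length in one pass (a standard textbook algorithm, not code from the document; the stream and seed are invented):

```python
import random

def reservoir_sample(stream, k, rng):
    """Uniform random sample of k items from a stream of unknown length,
    using O(k) memory: one pass, no buffering of the stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item i kept with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

rng = random.Random(42)                  # fixed seed for reproducibility
sample = reservoir_sample(range(10_000), 5, rng)
print(sample)
```

This is why sampling works for streams: memory stays constant no matter how long the stream runs, which is exactly the constraint that rules out traditional batch mining.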
An efficient algorithm for sequence generation in data mining (ijcisjournal)
Data mining is the method or activity of analyzing data from different perspectives and summarizing it into useful information. Several major data mining techniques have been developed and are used in data mining projects, including association, classification, clustering, sequential patterns, prediction, and decision trees. Among the different tasks in data mining, sequential pattern mining is one of the most important: it involves mining the subsequences that appear frequently in a set of sequences. It has a variety of applications in several domains, such as the analysis of customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access patterns, seismologic data, and weather observations. Various models and algorithms have been developed for the efficient mining of sequential patterns in large amounts of data. This paper analyzes the efficiency of three sequence generation algorithms, namely GSP, SPADE, and PrefixSpan, on a retail dataset by applying various performance factors. From the experimental results, it is observed that the PrefixSpan algorithm is more efficient than the other two.
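The core operation all three algorithms share, counting how many sequences support a candidate pattern (with gaps allowed), can be sketched naively (a simplification of the GSP/SPADE/PrefixSpan machinery; the item names and toy database are invented):

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)   # `in` advances the iterator

def support(pattern, sequences):
    """Fraction of sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

# Toy customer-purchase sequences (hypothetical items).
db = [['bread', 'milk', 'beer'],
      ['bread', 'beer'],
      ['milk', 'bread', 'beer'],
      ['beer', 'bread']]
print(support(['bread', 'beer'], db))  # 0.75: contained in 3 of 4 sequences
```

GSP, SPADE and PrefixSpan differ in how they avoid scanning every sequence for every candidate (candidate generation, vertical id-lists, and projected databases respectively), which is where the efficiency differences measured in the paper come from.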
Predicting electricity consumption using hidden parameters (IJLT EMAS)
This paper presents a data mining technique to forecast the power demand of a geographical region based on meteorological conditions. The forecasting technique is implemented with a Hidden Markov Model (HMM). The data comprise the values of factors that influence consumption, such as temperature, humidity, and public holidays, together with the daily usage values. Data mining operations are performed on this historical data to build a forecast model capable of predicting daily consumption given the meteorological parameters. The steps of the knowledge discovery process are implemented: the data is preprocessed and fed to the HMM for training, and the trained HMM is then used to predict the electricity demand for the given meteorological conditions.
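To make the HMM idea concrete, here is the standard Viterbi decoding step, recovering the most likely hidden demand level from observed weather, on a tiny hypothetical model (the states, probabilities and observation labels are invented, not the paper's trained model):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Hypothetical model: hidden demand level, observed temperature band.
states  = ['low', 'high']
start_p = {'low': 0.6, 'high': 0.4}
trans_p = {'low': {'low': 0.7, 'high': 0.3}, 'high': {'low': 0.4, 'high': 0.6}}
emit_p  = {'low': {'mild': 0.8, 'hot': 0.2}, 'high': {'mild': 0.3, 'hot': 0.7}}
decoded = viterbi(['mild', 'hot', 'hot'], states, start_p, trans_p, emit_p)
print(decoded)  # ['low', 'high', 'high']
```

In a forecasting setting the transition and emission tables would be learned from the historical data during the training step described above.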
This document summarizes a research paper on using a k-means clustering method to detect brain tumors in MRI images. The paper introduces brain tumors and MRI imaging. It then describes using k-means clustering for tumor segmentation, which groups similar image patterns into clusters to identify the tumor region. The paper presents results of applying k-means to two MRI images, including statistical measures of segmentation accuracy, tumor area comparison, and timing. The k-means method achieved average rand index of 0.8358, low average errors, and tumor areas close to manual segmentation in under 3 seconds, demonstrating potential for accurate and efficient brain tumor detection.
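Intensity-based k-means segmentation, the core of that approach, reduces to 1-D k-means on pixel values (a minimal sketch with invented pixel data, not the paper's pipeline, which also includes MRI preprocessing):

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means; for image segmentation each cluster approximates
    one region brightness level (e.g. background vs. tumour tissue)."""
    centres = sorted(values)[::max(len(values) // k, 1)][:k]  # spread seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[idx].append(v)          # assign to nearest centre
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]  # recompute centres
    return centres

# Two obvious brightness groups: background (~10) and bright region (~200).
pixels = [8, 10, 12, 9, 198, 202, 200, 205]
centres = sorted(kmeans_1d(pixels, 2))
print(centres)  # [9.75, 201.25]
```

Thresholding each pixel by its nearest centre then yields the segmented tumour mask; the paper's rand-index and timing figures measure how well and how fast this grouping matches manual segmentation.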
A linear-Discriminant-Analysis-Based Approach to Enhance the Performance of F... (CSCJournals)
Spike sorting is of prime importance in neurophysiology and hence has received considerable attention. However, conventional methods suffer from the degradation of clustering results in the presence of high levels of noise contamination. This paper presents a scheme for taking advantage of automatic clustering and enhancing the feature extraction efficiency, especially for low-SNR spike data. The method employs linear discriminant analysis based on a fuzzy c-means (FCM) algorithm. Simulated spike data [1] were used as the test bed due to better a priori knowledge of the spike signals. Application to both high and low signal-to-noise ratio (SNR) data showed that the proposed method outperforms conventional principal-component analysis (PCA) and FCM algorithm. FCM failed to cluster spikes for low-SNR data. For two discriminative performance indices based on Fisher's discriminant criterion, the proposed approach was over 1.36 times the ratio of between- and within-class variation of PCA for spike data with SNR ranging from 1.5 to 4.5 dB. In conclusion, the proposed scheme is unsupervised and can enhance the performance of fuzzy c-means clustering in spike sorting with low-SNR data.
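The FCM membership update at the heart of that scheme can be sketched for the 1-D case (the standard fuzzy c-means membership formula with fuzzifier m; the points and centres are invented, and the paper's full method additionally wraps LDA around this step):

```python
def fcm_memberships(points, centres, m=2.0):
    """Fuzzy c-means membership of each 1-D point in each cluster centre.
    Memberships lie in [0, 1] and sum to 1 for every point."""
    out = []
    for x in points:
        d = [abs(x - c) for c in centres]
        if 0.0 in d:                      # point sits exactly on a centre
            out.append([1.0 if di == 0 else 0.0 for di in d])
            continue
        expo = 2.0 / (m - 1.0)
        row = [1.0 / sum((d[i] / d[j]) ** expo for j in range(len(centres)))
               for i in range(len(centres))]
        out.append(row)
    return out

u = fcm_memberships([0.0, 1.0, 9.0], centres=[0.0, 10.0])
for row in u:
    print([round(v, 3) for v in row])
```

Unlike hard k-means, each spike gets a graded membership in every cluster, which is what degrades gracefully (rather than failing outright) as noise increases.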
Data mining is a process to extract information from a huge amount of data and transform it into an understandable structure. Data mining provides a number of tasks for extracting knowledge from large databases, such as classification, clustering, regression, and association rule mining. This paper focuses on classification, an important data mining technique based on machine learning, which is used to classify each item on the basis of its features with respect to a predefined set of classes or groups. The paper summarises various techniques implemented for classification, such as k-NN, Decision Tree, Naïve Bayes, SVM, ANN and RF. The techniques are analyzed and compared on the basis of their advantages and disadvantages.
Missing data exploration in air quality data set using R-package data visuali... (journalBEEI)
Missing values often occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated or replaced using an imputation method, and exploring the missing-values pattern must be conducted beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, which includes the missing data mechanism (MCAR, MAR and MNAR), the distribution pattern of missingness in terms of percentage, as well as the gap size. The paper presents several data visualisation tools from five R-packages, namely visdat, VIM, ggplot2, Amelia and UpSetR, for data missingness exploration. As an illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to show the contribution of the visualisation tools in providing input and insight on the pattern of data missingness. Based on the results, it is shown that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness but long gap sizes.
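The two quantities the visualisations summarize, the percentage of missingness and the gap sizes, are easy to compute directly (a sketch with an invented hourly PM10 series; the paper itself works with R packages, not Python):

```python
def missingness_profile(series):
    """Percentage of missing values (None) and the run lengths
    ('gap sizes') of consecutive missing entries."""
    gaps, run = [], 0
    for v in series:
        if v is None:
            run += 1
        elif run:
            gaps.append(run)   # a gap just ended
            run = 0
    if run:
        gaps.append(run)       # series ended inside a gap
    pct = 100.0 * sum(v is None for v in series) / len(series)
    return pct, gaps

# Hypothetical hourly PM10 readings with two gaps (lengths 2 and 3).
pm10 = [21, 24, None, None, 30, 28, None, None, None, 25]
print(missingness_profile(pm10))  # (50.0, [2, 3])
```

Long gaps matter because simple interpolation works poorly across them, which is why the paper inspects gap size before choosing an imputation method.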
The document discusses employing a rough set based approach for clustering categorical time-evolving data. It proposes using node importance and a rough membership function to label unlabeled data points and detect concept drift between data clusters over time. Specifically, it defines key terms like nodes, introduces the problem of clustering categorical time-series data and concept drift detection. It then describes using a rough membership function to calculate similarity between unlabeled data and existing clusters in order to label the data and detect changes in cluster characteristics over time.
A Review on Classification Based Approaches for STEGanalysis Detection (Editor IJCATR)
10 Insightful Quotes On Designing A Better Customer Experience (Yuan Wang)
In an ever-changing landscape of one digital disruption after another, companies and organisations are looking for new ways to understand their target markets and engage them better. Increasingly they invest in user experience (UX) and customer experience design (CX) capabilities by working with a specialist UX agency or developing their own UX lab. Some UX practitioners are touting leaner and faster ways of developing customer-centric products and services, via methodologies such as guerilla research, rapid prototyping and Agile UX. Others seek innovation and fulfilment by spending more time in research, being more inclusive, and designing for social goods.
Experience is more than just an interface. It is a relationship, as well as a series of touch points between your brand and your customer. Here are our top 10 highlights and takeaways from the recent UX Australia conference to help you transform your customer experience design.
For full article, continue reading at https://yump.com.au/10-ways-supercharge-customer-experience-design/
http://inarocket.com
Learn BEM fundamentals as fast as possible. What is BEM (Block, element, modifier), BEM syntax, how it works with a real example, etc.
The document discusses how personalization and dynamic content are becoming increasingly important on websites. It notes that 52% of marketers see content personalization as critical and 75% of consumers like it when brands personalize their content. However, personalization can create issues for search engine optimization as dynamic URLs and content are more difficult for search engines to index than static pages. The document provides tips for SEOs to help address these personalization and SEO challenges, such as using static URLs when possible and submitting accurate sitemaps.
A preliminary survey on optimized multiobjective metaheuristic methods for da... (ijcsit)
This survey provides the state of the art of research devoted to Evolutionary Approaches (EAs) for clustering, exemplified with a diversity of evolutionary computations. The survey provides a nomenclature that highlights some aspects that are very important in the context of evolutionary data clustering. The paper examines the clustering trade-offs addressed by a wide range of Multi-Objective Evolutionary Approach (MOEA) methods. Finally, the study addresses the potential challenges of MOEA design and data clustering, along with conclusions and recommendations for novices and researchers, identifying the most promising paths for future research.
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
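Single-pass frequent-item mining over a stream is commonly done with a counter-based summary; the Misra-Gries algorithm is a classic example (a generic sketch, not the fuzzy-slice method the document describes; the click stream is invented):

```python
def misra_gries(stream, k):
    """One-pass frequent-item summary using at most k-1 counters:
    every item occurring more than n/k times in a stream of length n
    is guaranteed to survive in the counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:                            # decrement-all step evicts rare items
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

clicks = ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'd', 'a']
summary = misra_gries(clicks, k=3)
print(summary)  # 'a' dominates the stream and is retained
```

The fuzzy approach in the document layers graded ("liked"/"frequently viewed") membership on top of this kind of counting, but the single-pass, bounded-memory constraint is the same.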
The document discusses representing relational spatiotemporal data using information granules. It proposes:
1) Describing the relational data using a vocabulary of granular descriptors formed from Cartesian products of spatial, temporal, and signal information granules. This granular representation provides an interpretable perspective on the data.
2) Analyzing the capabilities of different vocabularies to capture the essence of the data through the processes of granulation and degranulation, where the original data is reconstructed from its granular representation. The quality of reconstruction is used to optimize the vocabulary.
3) Extending the approach to analyze evolvability of the granular description as the relational data changes across consecutive
This document discusses and compares different methods for anomaly detection in multivariate time series data, including conventional, machine learning-based, and deep neural network methods. It analyzes the performance of sixteen such methods on five real-world datasets. The results show that no single family of methods consistently outperforms the others, suggesting researchers should incorporate and compare multiple types of methods in benchmarks.
Ontology Based PMSE with Manifold Preference (IJCERT)
International journal from http://www.ijcert.org. IJCERT Standard online Journal, ISSN (Online): 2349-7084 (an ISO 9001:2008 Certified Journal).
IJCERT (ISSN 2349–7084 (Online)) is approved by National Science Library (NSL), National Institute of Science Communication And Information Resources (NISCAIR), Council of Scientific and Industrial Research, New Delhi, India.
DRSP: Dimension reduction for similarity matching and pruning of time series ... (IJDKP)
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
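The segment-mean style of dimension reduction in step 1 can be sketched as a piecewise mean representation (a single-level simplification of the paper's Multi-level Segment Mean technique; the sample series is invented):

```python
def segment_means(series, n_segments):
    """Split the series into equal-length segments and keep each
    segment's mean: a compact representation that preserves the
    coarse shape of the signal while shrinking its dimension."""
    n = len(series)
    out = []
    for s in range(n_segments):
        lo = s * n // n_segments
        hi = (s + 1) * n // n_segments
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

ticks = [1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3, 3]
print(segment_means(ticks, 3))  # [1.0, 5.0, 3.0]
```

Similarity matching and pruning then operate on these short mean vectors instead of the raw streams, which is what reduces the storage and computation costs.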
Efficiency of recurrent neural networks for seasonal trended time series mode... (IJECEIAES)
Seasonal time series with trends are the most common data sets used in forecasting. This work focuses on the automatic processing of a non-pre-processed time series by studying the efficiency of recurrent neural networks (RNN), in particular both long short-term memory (LSTM) and bidirectional long short-term memory (Bi-LSTM) extensions, for modelling seasonal time series with trend. For this purpose, we are interested in the learning stability of the established systems, using the mean average percentage error (MAPE) as a measure. Both simulated and real data were examined, and we have found a positive correlation between the signal period and the system input vector length for stable and relatively efficient learning. We also examined the white noise impact on the learning performance.
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T... (Editor IJCATR)
In this paper we focus on some techniques for solving data mining tasks, namely statistics, decision trees, and neural networks. The approach defines new criteria for the evaluation process and obtains valuable results based on what each technique is, the environment for using it, its advantages and disadvantages, the consequences of choosing it to extract hidden predictive information from large databases, and the methods of implementing it. Finally, the paper presents some valuable recommendations in this field.
This document discusses using data mining techniques like machine learning to analyze air quality data and generate models for predicting pollution levels. It summarizes applying decision trees and neural networks to data on pollutants and weather factors in Cambridge, UK. The models showed air temperature as the dominant predictor of ozone levels. While data mining provides an empirical approach, useful insights, and short-term predictions, the author notes it is most useful complementing existing scientific domain knowledge, and physical models are still needed to fully understand urban air quality given small-scale variations.
A Time Series ANN Approach for Weather Forecasting (ijctcm)
Weather forecasting is among the most challenging problems around the world, both because of its practical value in meteorology and because it is a typical time series forecasting problem in scientific research. Many methods have been proposed by various scientists, the motive being to predict more accurately. This paper contributes to the same goal using an artificial neural network (ANN), simulated in MATLAB, to predict two important weather parameters, i.e. maximum and minimum temperature. The model has been trained using 60 years of real data (1901-1960) and tested over the following 40 years to forecast maximum and minimum temperature. The results, based on the mean square error (MSE) function, confirm that this model, which is based on a multilayer perceptron, has the potential for successful application to weather forecasting.
HITS: A History-Based Intelligent Transportation System (IJDKP)
Transportation is the driving force of any country. Today we are facing an explosion in the number of motor vehicles, which affects our daily routines. Intelligent transportation systems (ITS) aim to provide efficient tools that solve traffic problems, and predicting route congestion during different periods of the day can help drivers choose better routes for their trips. In this paper we propose "HITS", a traffic control system that integrates moving object database techniques [30, 28] along with data warehousing techniques [15].
Our system uses historical traffic information to answer queries about moving objects on road network, and to analyze historical traffic conditions to enhance future traffic related decisions.
In this paper, we propose models of a process chain and a knowledge base of meteorological reanalysis datasets that help scientists working in the field of climate, and in particular of rainfall evolution, to solve uncertainty of spatial resources (data, process) in monitoring rainfall evolution. Indeed, rainfall evolution mobilizes much research, and various methods of processing meteorological reanalysis datasets have been proposed. The meteorological reanalysis datasets available at present are voluminous and heterogeneous in terms of source and spatial and temporal resolutions.
MODELING PROCESS CHAIN OF METEOROLOGICAL REANALYSIS PRECIPITATION DATA USING ... (Zac Darcy)
In this paper, we propose models of a process chain and a knowledge base of meteorological reanalysis datasets that help scientists working in the field of climate, and in particular of rainfall evolution, to solve uncertainty of spatial resources (data, process) in monitoring rainfall evolution. Indeed, rainfall evolution mobilizes much research, and various methods of processing meteorological reanalysis datasets have been proposed. The meteorological reanalysis datasets available at present are voluminous and heterogeneous in terms of source and spatial and temporal resolutions. The use of these datasets may resolve uncertainty in the data. In addition, phenomena such as rainfall evolution require the analysis of time series of meteorological reanalysis datasets and the development of automated and reusable processing chains for monitoring rainfall evolution. We propose to formalize these processing chains by modeling abstract and concrete models based on existing standards for interoperability. The processing chains modelled in this way can be capitalized on and diffused in operational environments. Our modeling approach uses Work-Context concepts, which require the organization of human resources, data, and processes in order to establish a knowledge base connecting the latter two. This knowledge base is used to resolve uncertainty of meteorological reanalysis dataset resources for monitoring rainfall evolution.
Modeling Process Chain of Meteorological Reanalysis Precipitation Data Using ... (Zac Darcy)
MODELING PROCESS CHAIN OF METEOROLOGICAL REANALYSIS PRECIPITATION DATA USING ... (ijitmcjournal)
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPINGijdkp
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time
grows as the size and complexity of the training dataset increases. For this reason, it is important to have
efficient computational methods and algorithms that can be applied on large datasets, such that it is still
possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper
a more accurate simple process to speed up ML methods. An unsupervised clustering algorithm is
combined with Expectation, Maximization (EM) algorithm to develop an efficient Hidden Markov Model
(HMM) training. The idea of the proposed process consists of two steps. In the first step, training instances
with similar inputs are clustered and a weight factor which represents the frequency of these instances is
assigned to each representative cluster. Dynamic Time Warping technique is used as a dissimilarity
function to cluster similar examples. In the second step, all formulas in the classical HMM training
algorithm (EM) associated with the number of training instances are modified to include the weight factor
in appropriate terms. This process significantly accelerates HMM training while maintaining the same
initial, transition and emission probabilities matrixes as those obtained with the classical HMM training
algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set,
speedups of up to 2200 times is possible when the size is about 100.000 instances. The proposed approach
is not limited to training HMMs, but it can be employed for a large variety of MLs methods.
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
Data mining works to extract information known in advance from the enormous quantities of data which can lead to knowledge. It provides information that helps to make good decisions. The effectiveness of data mining in access to knowledge to achieve the goal of which is the discovery of the hidden facts contained in databases and through the use of multiple technologies. Clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter cluster similarity. This paper deals with K-means clustering algorithm which collect a number of data based on the characteristics and attributes of this data, and process the Clustering by reducing the distances between the data center. This algorithm is applied using open source tool called WEKA, with the Insurance dataset as its input
Quantitative and Qualitative Analysis of Time-Series Classification using Dee...Nader Ale Ebrahim
Time-series classification is utilized in a variety of applications leading to the development of many data mining techniques for time-series analysis. Among the broad range of time-series classification algorithms, recent studies are considering the impact of deep learning methods on time-series classification tasks. The quantity of related publications requires a bibliometric study to explore most prominent keywords, countries, sources and research clusters. The paper conducts a bibliometric analysis on related publications in time-series classification, adopted from Scopus database between 2010 and 2019. Through keywords co-occurrence analysis, a visual network structure of top keywords in time-series classification research has been produced and deep learning has been introduced as the most common topic by additional inquiry of the bibliography. The paper continues by exploring the publication trends of recent deep learning approaches for time-series classification. The annual number of publications, the productive and collaborative countries, the growth rate of sources, the most occurred keywords and the research collaborations are revealed from the bibliometric analysis within the study period. The research field has been broken down into three main categories as different frameworks of deep neural networks, different applications in remote sensing and also in signal processing for time-series classification tasks. The qualitative analysis highlights the categories of top citation rate papers by describing them in details.
This document proposes InterFusion, an unsupervised method for anomaly detection and interpretation in multivariate time series (MTS) data. InterFusion uses a hierarchical variational autoencoder (HVAE) with two stochastic latent variables to jointly model inter-metric and temporal dependencies in MTS. It evaluates InterFusion on four real-world MTS datasets, where it achieves over 94% F1-score for anomaly detection and 87% accuracy for anomaly interpretation, outperforming state-of-the-art methods.
Lecture Notes in Computer Science
Temporal Data Mining: an overview
Cláudia M. Antunes 1, and Arlindo L. Oliveira 2
1
Instituto Superior Técnico, Dep. Engenharia Informática, Av. Rovisco Pais 1,
1049-001 Lisboa, Portugal
cman@rnl.ist.utl.pt
2
IST / INESC-ID, Dep. Engenharia Informática, Av. Rovisco Pais 1,
1049-001 Lisboa, Portugal
aml@inesc.pt
Abstract. One of the main unresolved problems that arise during the data
mining process is treating data that contains temporal information. In this case,
a complete understanding of the entire phenomenon requires that the data
should be viewed as a sequence of events. Temporal sequences appear in a
vast range of domains, from engineering, to medicine and finance, and the
ability to model and extract information from them is crucial for the advance
of the information society. This paper provides a survey on the most signifi-
cant techniques developed in the past ten years to deal with temporal se-
quences.
1 Introduction
With the rapid increase of stored data, the interest in the discovery of hidden infor-
mation has exploded in the last decade. This discovery has mainly been focused on
data classification, data clustering and relationship finding. One important problem
that arises during the discovery process is treating data with temporal dependencies.
The attributes related with the temporal information present in this type of datasets
need to be treated differently from other kinds of attributes. However, most of the
data mining techniques tend to treat temporal data as an unordered collection of
events, ignoring its temporal information.
The aim of this paper is to present an overview of the techniques proposed to date
that deal specifically with temporal data mining. Our objective is not only to enu-
merate the techniques proposed so far, but also to classify and organize them in a
way that may be of help for a practitioner looking for solutions to a concrete problem. Other overviews of this area have appeared in the literature ([25], [56]), but the first is not widely available, while the second uses a significantly different approach to classify the existing work and is considerably less comprehensive than the present work.
The paper is organized as follows: Section 2 defines the main goals of temporal
data mining, related problems and concepts, and its application areas. Section 3
presents the principal approaches to represent temporal sequences, and Section 4 the
methods developed to measure the similarity between two sequences. Section 5 ad-
dresses the problems of non-supervised sequence classification and relation finding,
including frequent event detection and prediction. Section 6 presents the conclusions
and points some possible directions for future research.
2 Mining Temporal Sequences
One possible definition of data mining is “the nontrivial extraction of implicit, previously unknown and potentially useful information from data” [19]. The ultimate goal
of temporal data mining is to discover hidden relations between sequences and sub-
sequences of events. The discovery of relations between sequences of events involves
mainly three steps: the representation and modeling of the data sequence in a suit-
able form; the definition of similarity measures between sequences; and the applica-
tion of models and representations to the actual mining problems. Other authors
have used a different approach to classify data mining problems and algorithms.
Roddick [56] has used three dimensions: data type, mining operations and type of
timing information. Although both approaches are equally valid, we preferred to use
representation, similarity and operations, since it provided a more comprehensive
and novel view of the field.
Depending on the nature of the event sequence, the approaches to solve the problem may be quite different. A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, while a sequence of continuous, real-valued elements is known as a time series.
Time series or, more generally, temporal sequences, appear naturally in a variety
of different domains, from engineering to scientific research, finance and medicine.
In engineering, they usually arise from either sensor-based monitoring, such as telecommunications control (e.g. [14], [15]), or log-based systems monitoring (see, for instance, [44], [47]). In scientific research they appear, for example, in space missions (e.g. [34], [50]) or in the genetics domain (e.g. [15]).
In finance, applications to the analysis of product sales or inventory consumption are of great importance to business planning (see, for instance, [5]). Another very common application in finance is the prediction of the evolution of financial data (e.g. [3], [16], [20], [11], [7], [21], [17], [55]).
In healthcare, temporal sequences have been a reality for decades, with data originating from complex acquisition systems like ECGs (see, for instance, [13], [21], [24], [57]) or even from simple ones like measuring a patient's temperature or a treatment's effectiveness ([9], [6], [12]). In recent years, with the development of medical informatics, the amount of data has increased considerably and, more than ever, the ability to react in real time to any change in the patient's behavior is crucial.
In general, applications that deal with temporal sequences serve mainly to support
diagnosis and to predict future behaviors ([56]).
3 Representation of Temporal Sequences
A fundamental problem that needs to be solved is the representation of the data and the pre-processing that needs to be applied before actual data mining operations take place. The representation problem is especially important when dealing with time
series, since direct manipulation of continuous, high-dimensional data in an efficient
way is extremely difficult. This problem can be addressed in several different ways.
One possible solution is to use the data with only minimal transformation, either
keeping it in its original form or using windowing and piecewise linear approxima-
tions to obtain manageable sub-sequences. These methods are described in section
3.1. A second possibility is to use a transformation that maps the data to a more
manageable space, either continuous (section 3.2) or discrete (section 3.3). A more
sophisticated approach, discussed in section 3.4, is to use a generative model, where a statistical or deterministic model is obtained from the data and can be used to answer more complex questions. Approaches that deal with transaction databases of arbitrary dimension are presented in section 3.5.
One issue that is relevant for all the representation methods addressed is the abil-
ity to discover and represent the subsequences of a sequence. A method commonly
used to find subsequences is to use a sliding window of size w and place it at every
possible position of the sequence. Each such window defines a new subsequence, composed of the elements inside the window [16].
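This windowing scheme can be sketched in a few lines (a minimal illustration in plain Python; the function name is ours):

```python
def sliding_windows(sequence, w):
    """Place a window of size w at every possible position of the
    sequence; each placement defines one subsequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]

print(sliding_windows([1.0, 2.0, 3.0, 4.0, 5.0], 3))
# [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
```

A sequence of length n thus yields n - w + 1 overlapping subsequences.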
3.1 Time-Domain Continuous Representations
A simple approach to represent a sequence of real-valued elements (a time series) is to use the original elements, ordered by their instant of occurrence, without any pre-processing ([3], [41]).
An alternative consists in finding a piecewise linear function able to approxi-
mately describe the entire initial sequence. In this way, the initial sequence is parti-
tioned in several segments and each segment is represented by a linear function. In
order to partition the sequence there are two main possibilities: define the number of segments beforehand, or discover that number and then identify the corresponding segments ([14], [27]). The objective is to obtain a representation amenable to the detection of significant changes in the sequence.
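As an illustration of the fixed-number-of-segments variant, the following sketch (our own, using NumPy; not the specific algorithms of [14] or [27]) splits the index range into k equal pieces and least-squares fits a line to each:

```python
import numpy as np

def piecewise_linear(series, k):
    """Approximate a series by k linear segments: split the index range
    into k equal pieces and fit a least-squares line to each piece."""
    series = np.asarray(series, dtype=float)
    segments = []
    for chunk in np.array_split(np.arange(len(series)), k):
        slope, intercept = np.polyfit(chunk, series[chunk], 1)
        segments.append((chunk[0], chunk[-1], slope, intercept))
    return segments

# a flat stretch followed by a rising one is recovered as two segments
series = np.concatenate([np.zeros(50), np.arange(50) * 2.0])
for start, end, slope, intercept in piecewise_linear(series, 2):
    print(start, end, round(slope, 2))
```

A change in slope between consecutive segments then signals a candidate change point.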
While this is a relatively straightforward representation, not much is gained in
terms of our ability to manipulate the generated representations. One possible appli-
cation of this type of representations is on change-point detection, where one wants
to locate the points where a significant change in behavior takes place.
Another approach is based on the idea that the human visual system partitions smooth curves into linear segments. It mainly consists of segmenting a sequence by iteratively merging two similar segments. Choosing which segments are to be merged is based on a squared-error minimization criterion
[34]. An extension to this model consists in associating with each segment a weight
value, which represents the segment's importance relative to the entire sequence. In
this manner, it is possible to compare sequences mainly by looking at their most
important segments [35]. With this representation, a more effective similarity meas-
ure can be defined, and consequently, mining operations may be applied.
One significant advantage of these approaches is their ability to reduce the impact of noise. However, problems with amplitude differences (scaling problems) and the existence of gaps or other time-axis distortions are not easily addressed.
3.2 Transformation Based Representations
The main idea of Transformation Based Representations is to transform the initial
sequences from time to another domain, and then to use a point in this new domain
to represent each original sequence.
One proposal [1] uses the Discrete Fourier Transform (DFT) to map a sequence from the time domain to a point in the frequency domain. This is achieved by choosing the first k frequencies and then representing each sequence as a point in k-dimensional space. The DFT has the attractive property that the amplitude of the Fourier coefficients is invariant under shifts, which allows extending the method to find similar sequences while ignoring shifts [1].
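A minimal sketch of this representation with NumPy (function names are ours; note that NumPy's FFT is unnormalized, so frequency-domain distances carry a factor of sqrt(n) relative to time-domain ones):

```python
import numpy as np

def dft_features(series, k):
    """Represent a time series by its first k DFT coefficients,
    i.e. as a point in a k-dimensional (complex) space."""
    return np.fft.fft(np.asarray(series, dtype=float))[:k]

def dft_distance(a, b, k):
    """Euclidean distance between the truncated-coefficient points."""
    return np.linalg.norm(dft_features(a, k) - dft_features(b, k))

flat, raised = [1.0] * 8, [2.0] * 8
print(dft_distance(flat, flat, 3))    # 0.0
print(dft_distance(flat, raised, 3))  # 8.0 (difference in the DC coefficient)
```

Because truncating coefficients can only shrink the distance, this representation never overestimates dissimilarity, which is what makes index-based filtering safe.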
Other possibilities that have been proposed represent each sequence as a point in
the n-dimensional space, with n being the sequence length. This representation can
be viewed as a time-domain continuous representation ([20]). However, if one maps
each subsequence to a point in a new space with coordinates that are the derivatives
of each original sequence point, this should be considered as a transformation based
representation. In this case, the derivative at each point is determined by the differ-
ence between the point and its preceding one [20].
A more recent approach uses the Discrete Wavelet Transform (DWT) to translate
each sequence from the time domain into the time/frequency domain [11]. The DWT
is a linear transformation that decomposes the original sequence into different frequency components, without losing information about the instants at which the elements occur. The sequence is then represented by its features, expressed as the
wavelet coefficients. Again, only a few coefficients are needed to approximately
represent the sequence.
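For intuition, a single level of the simplest wavelet, the Haar transform, can be written directly (a toy sketch; [11] uses the full multi-level DWT):

```python
import numpy as np

def haar_dwt(series):
    """One level of the Haar wavelet transform: pairwise averages give
    the low-frequency approximation, pairwise half-differences give the
    high-frequency detail. The series length must be even."""
    x = np.asarray(series, dtype=float).reshape(-1, 2)
    approx = x.mean(axis=1)
    detail = (x[:, 0] - x[:, 1]) / 2.0
    return approx, detail

approx, detail = haar_dwt([2.0, 4.0, 8.0, 6.0])
print(approx)  # [3. 7.]
print(detail)  # [-1.  1.]
```

Keeping the approximation coefficients halves the length while preserving the coarse shape, and the step can be repeated on the approximation for further compression.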
With these kinds of representations, time series become more manageable objects, which permits the definition of efficient similarity measures and an easier application of common data mining operations.
3.3 Discretization Based Methods
A third approach to deal with time series is the translation of the initial sequence
(with real-valued elements) to a discretized sequence, this time composed of symbols
from an alphabet. One language to describe the symbols of an alphabet and the rela-
tionship between two sequences was proposed in [4]. This Shape Definition Lan-
guage (SDL) is also able to describe shape queries to pose to a sequence database and
it allows performing ‘blurry’ matching. A blurry match is one where the important
thing is the overall shape, therefore ignoring the specific details of a particular se-
quence. The first step in the representation process is defining the alphabet of sym-
bols and then translating the initial sequence to a sequence of symbols. The transla-
tion is done by considering transitions from an instant to the following one, and then
assigning a symbol of the described alphabet to each transition.
Constraint-Based Pattern Specification Language (CAPSUL) [56] is another lan-
guage used to specify patterns, which similarly to SDL, describes patterns by
considering some abstractions of the values at time intervals. The key difference
between these two languages is that CAPSUL uses a set of expressive constraints
upon temporal objects, which allows describing more complex patterns, for instance,
periodic patterns. Finally, to perform each of those abstractions the system accesses
the knowledge bases, where the relationships between the different values are
defined.
Another approach to discretize time series [33] suggests the segmentation of a sequence by computing the change ratio from one time point to the following one, and
representing all consecutive points with equal change ratios by a unique segment.
After this partition, each segment is represented by a symbol, and the sequence is
represented as a string of symbols. A particular case is when change ratios are classi-
fied into one of two classes: large fluctuations (symbol 1) or small fluctuations (sym-
bol 0). This way, each time series is converted into a binary sequence, perfectly
suited to be manipulated by genetic algorithms [63].
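The two-class case can be sketched as follows (a toy illustration; the threshold separating large from small fluctuations is our assumption):

```python
def binarize(series, threshold):
    """Classify each change ratio as a large (symbol '1') or small
    (symbol '0') fluctuation, yielding a binary string."""
    bits = []
    for prev, cur in zip(series, series[1:]):
        ratio = abs(cur - prev) / abs(prev) if prev != 0 else float('inf')
        bits.append('1' if ratio > threshold else '0')
    return ''.join(bits)

print(binarize([100, 101, 150, 151], threshold=0.1))
# small, large, small fluctuations -> '010'
```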
A different approach to convert a sequence into a discrete representation is to use clustering. First, a set of subsequences of length w is generated by sliding a window of width w along the sequence. Then the set of all subsequences is clustered into k clusters, and a different symbol is associated with each cluster. The discretized version of the initial sequence is obtained by substituting each subsequence with the symbol associated with the cluster to which it belongs [15].
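A sketch of this window-clustering discretization, using a naive k-means seeded with the first k windows for determinism (our simplification; [15] does not prescribe this initialization):

```python
import numpy as np

def discretize_by_clustering(series, w, k, iters=10):
    """Slide a width-w window over the series, cluster the resulting
    subsequences with a naive k-means, and replace each subsequence by
    the symbol of its cluster."""
    windows = np.array([series[i:i + w] for i in range(len(series) - w + 1)])
    centroids = windows[:k].astype(float).copy()   # seed: first k windows
    for _ in range(iters):
        # assign each window to the nearest centroid
        dists = np.linalg.norm(windows[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned windows
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = windows[labels == j].mean(axis=0)
    symbols = 'abcdefghijklmnopqrstuvwxyz'
    return ''.join(symbols[l] for l in labels)

series = [0, 0, 0, 9, 9, 9, 0, 0, 0]
# the flat windows and the spike windows get different symbols
print(discretize_by_clustering(series, w=3, k=2))
```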
Another method to convert time series into a sequence of symbols is based on the
use of self-organizing maps (SOMs) ([23], [24]). The conversion consists of three steps: first, a new series composed of the differences between consecutive values of the original time series is derived; second, windows of size d (the delay embedding) are fed to the SOM; finally, the SOM outputs the winning node for each window. Each node is associated with a symbol, so the resulting sequence may be viewed as a sequence of symbols.
The advantage of these methods is that the time series is partitioned in a natural
way, depending on its values. However, the symbols of the alphabet are usually chosen externally: either they are imposed by a user, who has to know the most suitable symbols, or they are established in an artificial way. The discretized time
series is more amenable to manipulation and processing than the original, continu-
ous valued, time series.
3.4 Generative Models
A significantly different approach consists in obtaining a model that can be viewed as a generator for the observed sequences.
Several methods that use probabilistic generators have been proposed. One such
proposal [21] uses semi-Markov models to model a sequence and presents an effi-
cient algorithm that can find appearances of a particular sub-sequence in another
sequence.
A different alternative is to find partial orders between symbols and build complex
episodes out of serial and parallel compositions of basic events ([43], [26], [8]). A
further development [47] builds a mixture model that generates sequences similar to
the one that one wishes to model, with high probability. In this approach, an intui-
tively appealing model is derived from the data, but no particular applications in
classification, prediction or similarity-based mining are presented. These models can
be viewed as graph-based models, since one or more graphs that describe the rela-
tionships between basic events are used to model the observed sequences.
A different class of approaches aims at inferring grammars from the given time sequence. Extensive research in grammatical inference methods has led to many interesting results ([48], [31], [51]), but these have not yet found their way into a significant number of real applications. Usually, grammatical inference methods are based on
discrete search algorithms that are able to infer very complex grammars. Neural net-
based approaches have also been tried (see, for example, part VI of reference [49])
but have been somewhat less successful when applied to complex problems ([32],
[22]). However, some recent results have shown that neural-network based inference
of grammars can be used in realistic applications [23].
The inferred grammars can belong to a variety of classes, ranging from determi-
nistic regular grammars ([39], [38], [52]) to stochastic ([30], [61], [10]) and context-free grammars ([58], [59]). It is clear that the grammar models obtained by induction
from examples can be used for prediction and classification, but, so far, these meth-
ods have not been extensively applied to actual problems in data mining.
3.5 Transactional databases with timing information
A type of time sequence that does not easily match any of the classes described above is the transactional dataset that incorporates timing information. For example,
one might have a list of items bought by customers and would like to consider the
timing information contained in the database. This type of high-dimensional discrete
data cannot be modelled effectively by any of the previously described approaches.
Although the modelling and representation of this type of data is an important
problem in temporal data mining, to our knowledge, there exist only a few proposals for the representation of data with these characteristics. In one of the proposed representations [5] each particular set of items is considered a new symbol of
the alphabet and a new sequence is created from the original one, by using this new
set of symbols. These symbols are then processed using methods that aim at finding common occurrences of sub-sequences in an efficient way. This representation is
effective if the objective is to find common events and significant correlations in
time, but fails in applications like prediction and classification. Other authors have
used a similar representation [28], although they have proposed new methods to
manipulate it.
4 Similarity Measures for Sequences
After representing each sequence in a suitable form, it is necessary to define a simi-
larity measure between sequences, in order to determine if they match. An event is
then identified when there are a significant number of similar sequences in the data-
base.
An important issue in measuring similarity between two sequences is the ability to
deal with outlying points, noise in the data, amplitude differences (scaling problems)
and the existence of gaps and other time axis distortion problems. As such, the simi-
larity measure needs to be sufficiently flexible to be applied to sequences with differ-
ent lengths, noise levels, and time scales.
There are many proposals for similarity measures. The model used to represent
sequences clearly has a great impact on the similarity measure adopted, and, there-
fore, this section follows an organization that closely matches that of section 3.
4.1 Distances in time-domain continuous representations
When no pre-processing is applied to the original sequence, a simple approach to measure the similarity between two sequences consists of comparing the ith element of each sequence. This simple method, however, is insufficient for many applications.
The similarity measure most commonly used with time-domain continuous representations is the Euclidean distance, obtained by viewing each sub-sequence of n points as a point in R^n.
In order to deal with noise, scaling and translation problems, a simple improvement consists in determining the pairs of sequence portions that agree in both sequences after some linear transformation is applied. A concrete proposal [3] achieves this by finding one window of some fixed size from each sequence, normalizing the values in them to the [-1, 1] interval, and then comparing them to determine if they match.
After identifying these pairs of windows, several non-overlapping pairs can be
joined, in order to find the longest match length. Normalizing the windows solves
the scaling problem, and searching for pairs of windows that match solves the trans-
lation problem.
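The normalization step can be sketched as follows; rescaling every window to [-1, 1] makes two windows that differ only by amplitude and offset identical:

```python
import numpy as np

def normalize_window(window):
    """Linearly rescale a window to the [-1, 1] interval, removing
    amplitude (scaling) and offset (translation) differences."""
    w = np.asarray(window, dtype=float)
    lo, hi = w.min(), w.max()
    if hi == lo:                      # constant window: map to all zeros
        return np.zeros_like(w)
    return 2.0 * (w - lo) / (hi - lo) - 1.0

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]               # same shape, different scale and offset
print(np.allclose(normalize_window(a), normalize_window(b)))  # True
```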
A more complex approach consists in determining if there is a linear function f,
such that a long subsequence of one sequence can be approximately mapped into a
long subsequence of the other [14]. Since, in this approach, the subsequences don’t
necessarily consist of consecutive original elements but only have to appear in the
same order, this similarity measure allows for the existence of gaps and outlying
points in the sequence.
In order to overcome the high sensitivity of the Euclidean distance to small distortions in the time axis, the technique of Dynamic Time Warping (DTW) was proposed to deal with temporal sequences. This technique consists of aligning two sequences so that a predetermined distance measure is minimized. The goal is to create a new sequence, the warping path, whose elements are (i, j) pairs, with i and j the indexes of the original sequences' elements that match [7]. Some algorithms that improve the
performance of dynamic time warping techniques were proposed in ([36], [65]).
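The standard dynamic-programming formulation of DTW can be sketched as follows (using absolute difference as the local cost, which is our choice):

```python
def dtw(a, b):
    """Dynamic time warping: cost of the best alignment (warping path)
    between two sequences, with absolute difference as local distance."""
    INF = float('inf')
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch b
                                 D[i][j - 1],      # stretch a
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

# a time-stretched copy aligns perfectly, unlike with the Euclidean distance
print(dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```

Tracing back through the minima of D recovers the warping path itself.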
A different similarity measure consists of describing the distance between two sequences using a probabilistic model [34]. The general idea is that a sequence can be ‘deformed’, according to a prior probability distribution, to generate the other sequence. This is accomplished by composing a global shape sequence with local features, each of which is allowed some degree of deformation. In this way,
some elasticity is allowed in the global shape sequence, permitting stretching in time
and distortions in amplitude. The degree of deformation and elasticity are governed
by the prior probability distributions [34].
4.2 Distances in Transformation Based Methods
When Transformation Based Representations are used, a simple approach is to com-
pare the points, in the new domain, which represent each sequence. For example, the
similarity between two sequences may be reduced to the similarity between two
points in the frequency domain, again measured using the Euclidean distance ([1],
[11]). In order to allow for the existence of outlying points in the sequence, the dis-
covery of sequences similar to a given query sequence is reduced to the selection of
points in an ε-neighborhood of the query point. However, this method may allow false hits, so after choosing the neighboring points in the new domain, the distance in the time domain is computed to validate the similarity between the chosen sequences.
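This filter-and-refine scheme can be sketched as follows (our own sketch with NumPy; the sqrt(n) factor compensates for NumPy's unnormalized FFT, so the frequency-domain filter produces no false dismissals while the time-domain check removes false hits):

```python
import numpy as np

def query(database, q, k, eps):
    """Filter-and-refine search: select candidate sequences whose first
    k DFT coefficients lie within the (rescaled) eps-neighborhood of the
    query's, then validate candidates with the exact time-domain distance."""
    qf = np.fft.fft(q)[:k]
    hits = []
    for idx, s in enumerate(database):
        if np.linalg.norm(np.fft.fft(s)[:k] - qf) <= eps * np.sqrt(len(s)):
            # refine: discard false hits using the true distance
            if np.linalg.norm(np.asarray(s) - np.asarray(q)) <= eps:
                hits.append(idx)
    return hits

db = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
print(query(db, [1.0, 1.0, 1.0, 2.0], k=2, eps=1.5))  # [0]
```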
Notice that this method implies that the sequences have the same length; in order to deal with different lengths, the method was improved to find relevant subsequences in the original data [16]. Another important issue is that, to compare different time series, the k Fourier coefficients chosen to represent the sequences must be the same for all of them, which may not be the optimal choice for each sequence.
4.3 Similarity measures in discrete spaces
When a sequence is represented as a sequence of discrete symbols of an alphabet, the similarity between two sequences is, most of the time, computed by comparing each element of one sequence with the corresponding one in the other sequence, as in the first approach described in 4.1.
However, since the discrete symbols are not necessarily ordered, some provisions must be made to allow for blurry matching.
One possible approach is to use the Shape Definition Language, described in 3.3, where a blurry match is directly possible. This is achieved because the language defines a set of operators, such as any, noless or nomore, that allow describing overall shapes, solving the problem of gaps in an elegant way [4].
Another possibility is to use well-known algorithms for string editing to define a
distance over the space of strings of discrete symbols ([33], [45]). This distance reflects the amount of work needed to transform one sequence into another, and can deal with sequences of different lengths and with the existence of gaps.
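A standard string edit (Levenshtein) distance over symbol sequences, as used by such approaches, can be sketched as:

```python
def edit_distance(s, t):
    """Minimum number of symbol insertions, deletions and substitutions
    turning s into t, so sequences of different lengths and with gaps
    can still be compared."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1])) # substitution
        prev = cur
    return prev[n]

print(edit_distance("abcab", "abab"))  # 1: a single deletion bridges the gap
```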
4.4 Similarity measures for generative models
When a generative model is obtained from the data, the similarity measure between sequences follows naturally from how well the data fits one of the available models. For deterministic models (graph-based models [43], [26], and
deterministic grammars [32], [22], [58], [59]), verifying if a sequence matches a
given model will provide a yes or no answer, and therefore the distance measures are
necessarily discrete, and, in many cases, will take only one of two possible values.
For stochastic generative models like Markov chains ([21]), stochastic grammars
([30], [61], [10]) and mixture models ([47]), it is possible to obtain a number that
indicates the probability that a given sequence was generated by a given model.
In both cases, this similarity measure can be used effectively in classification
problems, without the need to resort to complex heuristic similarity measures be-
tween sequences.
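As a toy illustration of scoring sequences against a stochastic generative model (a first-order Markov chain here, far simpler than the semi-Markov models of [21]; the floor probability for unseen transitions is our assumption):

```python
from collections import defaultdict
import math

def train_markov(sequences):
    """Estimate first-order transition probabilities from sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(model, seq, floor=1e-6):
    """Score how probable a sequence is under the model; unseen
    transitions receive a small floor probability."""
    return sum(math.log(model.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))

model = train_markov(["ababab", "abab"])
print(log_likelihood(model, "abab") > log_likelihood(model, "aabb"))  # True
```

A sequence is then assigned to the model (class) under which its likelihood is highest.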
5 Mining Operations
5.1 Discovery of Association Rules
One of the most common approaches to mining frequent patterns is the apriori method [2]. When a transactional database is represented as a set of sequences of transactions performed by one entity (as described in section 3.5), the manipulation of temporal sequences requires some adaptations to the apriori algorithm. The most important modification concerns the notion of support: support is now the fraction of entities that consumed the itemset in any of their transactions [5], i.e. an entity contributes at most once to the support of each itemset, even if it consumed that itemset several times.
After identifying the large itemsets (those with support greater than the minimum support allowed), each one is mapped to an integer, and each sequence is transformed into a new sequence whose elements are the large itemsets of the previous one. The next step is to find the large sequences. To achieve this, the algorithm acts iteratively, like apriori: first it generates the candidate sequences, and then it chooses the large sequences among the candidates, until no candidates remain.
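The modified notion of support can be sketched as follows (a minimal illustration; the data layout, one list of transactions per entity, is our assumption):

```python
def itemset_support(sequences, itemset):
    """Support counted per entity: an entity's sequence contributes at
    most once, no matter how many of its transactions contain itemset."""
    itemset = set(itemset)
    hits = sum(any(itemset <= set(t) for t in seq) for seq in sequences)
    return hits / len(sequences)

# each sequence is one customer's ordered list of transactions
customers = [
    [{"bread"}, {"bread", "milk"}],        # supports {bread} once
    [{"milk"}],
    [{"bread"}, {"bread"}, {"bread"}],     # still counts only once
]
print(itemset_support(customers, {"bread"}))  # 2/3
```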
One of the most costly operations in apriori-based approaches is the candidate
generation. One proposal for frequent pattern mining shows that it is possible to find frequent patterns while avoiding the candidate generation-and-test step [28]. An extension of this approach to sequential data is presented in [29].
The discovery of relevant association rules is one of the most important methods
used to perform data mining on transactional databases. An effective algorithm to
discover association rules is apriori [2]. Adapting this method to deal with temporal
information has led to several different approaches.
Common sub-sequences can be used to derive association rules with predictive
value, as is done, for instance, in the analysis of discretized, multi-dimensional time
series [24].
A possible approach consists of extending the notion of a typical rule X ⇒ Y
(which states that if X occurs then Y occurs) to a rule with a new meaning: X ⇒T Y
(which states that if X occurs then Y will occur within time T) [15]. Stating a rule in
this new form allows controlling the impact of the occurrence of one event on the
occurrence of the other, within a specific time interval.
Another method considers cyclic rules [53]. A cyclic rule is one that occurs at
regular time intervals, i.e. the transactions that support a specific rule occur
periodically, for example on every first Monday of a month. In order to discover
these rules, it is necessary to search for them in restricted portions of time, since
they may occur repeatedly at specific time instants but only over a small portion of
the global time considered. One method to discover such rules applies an algorithm
similar to apriori and then, given the set of traditional rules, detects the cycles
behind them. A more efficient approach inverts the process: first discover the cyclic
large itemsets and then generate the rules. A natural extension of this method allows
for different time units, such as days, weeks or months, and is achieved by defining
a calendar algebra to define and manipulate groups of time intervals. The rules so
discovered are designated calendric association rules [54].
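The cycle-detection step behind cyclic rules can be sketched as follows (a naive illustration under our own encoding of time units; [53] uses pruning techniques this sketch omits):

```python
def detect_cycles(supported, n_units, max_period):
    """Given the set of time units (0 .. n_units-1) in which a rule holds,
    return (period, offset) pairs such that the rule holds in every unit
    congruent to offset modulo period -- the cycle behind the rule."""
    cycles = []
    for period in range(2, max_period + 1):
        for offset in range(period):
            units = list(range(offset, n_units, period))
            if units and all(u in supported for u in units):
                cycles.append((period, offset))
    return cycles

# A rule that holds on the first day of every 7-day week, over 4 weeks.
held = {0, 7, 14, 21}
cycles = detect_cycles(held, n_units=28, max_period=7)   # [(7, 0)]
```

A real implementation would also discard cycles that are multiples of shorter ones and, as noted above, interleave cycle detection with itemset discovery for efficiency.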
A different approach to the discovery of relations in multivariate time sequences is
based on the definition of N-dimensional transaction databases. Transactions in
these databases are obtained by discretizing continuous attributes when necessary
[42]. This type of database can then be mined to obtain association rules. However,
new definitions of association rule, support and confidence are needed. The main
difference is the notion of address, which locates each event in a multi-dimensional
space and allows expressing the confidence and support levels in a new way.
5.2 Classification
Classification is one of the most typical operations in supervised learning, but it has
received relatively little attention in temporal data mining. In fact, a comprehensive
search for applications based on classification returned relatively few instances of
actual uses of temporal information.
One classification method that deals with temporal sequences is based on the merge
operator [35]. This operator receives two sequences and returns "a sequence whose
shape is a compromise between the two original sequences" [35]. The basic idea is
to iteratively merge a typical example of a class with each positive example,
building a more general model for the class. Using negative examples is also
possible, but then it is necessary to emphasize the shape differences. This is
accomplished by using an influence factor to control the merge operation: a positive
influence factor generalizes the input sequences, while a negative one exaggerates
the differences between the positive and negative shapes.
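A toy version of this idea can be sketched as an element-wise blend (our own simplification, not the exact operator of [35], which also handles sequences of different lengths):

```python
def merge(a, b, influence=0.5):
    """Toy merge of two equal-length numeric sequences: a positive
    influence moves the result toward b (generalization); a negative
    influence moves it away from b, exaggerating the differences."""
    return [ai + influence * (bi - ai) for ai, bi in zip(a, b)]

def build_class_model(prototype, positives, influence=0.5):
    """Iteratively merge a class prototype with each positive example."""
    model = list(prototype)
    for example in positives:
        model = merge(model, example, influence)
    return model

proto = [0.0, 1.0, 0.0]
model = build_class_model(proto, [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]], 0.5)
```

Classification then reduces to comparing a new sequence against each class model with whatever distance measure the representation supports.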
Since traditional classification algorithms are difficult to apply to sequential
examples, mostly because there is a vast number of potentially useful features for
describing each example, an interesting improvement consists of applying a pre-
processing mechanism to extract relevant features. One approach [40] implements
this idea by discovering frequent subsequences and then using them as the relevant
features to classify sequences with traditional methods, such as Naive Bayes or
Winnow.
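The feature-extraction step can be sketched as follows (a simplification of ours using contiguous subsequences only; [40] mines more general patterns, and all names here are hypothetical):

```python
from collections import Counter

def frequent_subsequences(sequences, length, min_count):
    """Contiguous subsequences of a given length that occur in at least
    min_count of the training sequences, used as boolean features."""
    counts = Counter()
    for seq in sequences:
        counts.update({tuple(seq[i:i + length])
                       for i in range(len(seq) - length + 1)})
    return sorted(g for g, c in counts.items() if c >= min_count)

def featurize(seq, features):
    """Map a sequence to a 0/1 vector: one bit per frequent pattern."""
    def contains(f):
        k = len(f)
        return any(tuple(seq[i:i + k]) == f for i in range(len(seq) - k + 1))
    return [1 if contains(f) else 0 for f in features]

train = ["abab", "abba", "cccc"]
features = frequent_subsequences(train, length=2, min_count=2)
vector = featurize("abba", features)
```

The resulting fixed-length vectors can be fed directly to any conventional classifier, which is exactly the point of the pre-processing step.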
Classification is relatively straightforward if generative models are employed to
model the temporal data. Deterministic and probabilistic models can be applied in a
straightforward way to perform classification [39] since they give a clear answer to
the question of whether a sequence matches a given model. However, actual applica-
tions in classification have been lacking in the literature.
5.3 Unsupervised Clustering
The fundamental problem in clustering databases of temporal sequences consists of
discovering a number of clusters, say K, able to represent the different sequences.
There are two central problems in clustering: choosing the number of clusters and
initializing their parameters. A third important problem appears when dealing with
temporal sequences: the need for a meaningful similarity measure between
sequences.
Assuming that K is known, if a sequence is viewed as being generated according to
some probabilistic model, for example a Markov model, clustering may be viewed
as modeling the data sequences as a finite mixture of K components. The parameters
of the mixture can be estimated with the EM algorithm, and each of the K
components then corresponds to a cluster [61]. Learning the value of K, when it is
unknown, may be accomplished by a Monte-Carlo cross-validation approach, as
suggested in [60].
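A heavily simplified sketch of the mixture approach follows, using memoryless per-cluster symbol distributions in place of the Markov models of [61] (the smoothing constants, initialization scheme and all names are our own choices):

```python
import math
import random

def em_cluster(seqs, k, alphabet, iters=50, seed=0):
    """EM for a finite mixture of k memoryless symbol distributions over
    discrete sequences: the E-step computes cluster responsibilities, the
    M-step re-estimates mixing weights and symbol probabilities."""
    rng = random.Random(seed)
    pi = [1.0 / k] * k                                    # mixing weights
    theta = [{a: rng.random() + 0.1 for a in alphabet} for _ in range(k)]
    for t in theta:                                       # normalize each row
        z = sum(t.values())
        for a in alphabet:
            t[a] /= z
    for _ in range(iters):
        # E-step: responsibility of each cluster for each sequence
        resp = []
        for s in seqs:
            logs = [math.log(pi[j]) + sum(math.log(theta[j][a]) for a in s)
                    for j in range(k)]
            m = max(logs)
            w = [math.exp(l - m) for l in logs]           # log-sum-exp trick
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: weighted re-estimation of pi and theta
        for j in range(k):
            pi[j] = max(sum(r[j] for r in resp) / len(seqs), 1e-12)
            den = sum(r[j] * len(s) for r, s in zip(resp, seqs))
            for a in alphabet:
                num = sum(r[j] * s.count(a) for r, s in zip(resp, seqs))
                theta[j][a] = (num + 1e-6) / (den + 1e-6 * len(alphabet))
    return resp

data = ["aaab", "aaaa", "bbbb", "babb"]
resp = em_cluster(data, k=2, alphabet="ab")
labels = [r.index(max(r)) for r in resp]   # a-heavy vs. b-heavy sequences
```

Replacing the per-symbol probabilities with transition matrices recovers the Markov-chain mixture; the E-step and M-step structure is unchanged.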
A different approach proposes a hierarchical clustering method for databases of
temporal sequences [37]. The algorithm used is COBWEB [18], and it works in two
steps: first grouping the elements of the sequences, and then grouping the sequences
themselves. For a simple time series the first step is accomplished without difficulty,
but to group the sequences it is necessary to define a generalization mechanism for
sequences. Such a mechanism has to be able to choose the most specific description
of what is common to the different sequences.
5.4 Applications in Prediction
Prediction is one of the most important problems in data mining. However, in many
cases, prediction problems may be formulated as classification, association rule find-
ing or clustering problems. Generative models can also be used effectively to predict
the evolution of time series.
Nonetheless, prediction problems have some specific characteristics that differen-
tiate them from other problems. A vast literature exists on computer-assisted predic-
tion of time series, in a variety of domains ([17], [64]). However, we have failed to
find in the data mining literature significant applications that involve prediction of
time series and that do not fall into any of the previously described categories.
Granted, several authors have presented work that aims specifically at obtaining
algorithms that can be used to predict the evolution of time series ([46], [42]), but the
applications described, if any, are used more to illustrate the algorithms and do not
represent an actual problem that needs to be solved. One exception is the work by
Giles et al. [23], which combines discretization with grammar inference and applies
the resulting automata to explicitly predict the evolution of financial time series. In
the specific case of prediction, care must also be taken with the domain to which
prediction is applied. Well-known and generally accepted results on the inherent
unpredictability of many financial time series [17] imply that significant gains in
prediction accuracy are not possible, no matter how sophisticated the techniques
used.
In other cases, however, it will be interesting to see if advances in data mining
techniques will lead to significant improvements in prediction accuracy, when com-
pared with existing parametric and non-parametric models.
6 Conclusions
We have presented a comprehensive overview of techniques for the mining of tempo-
ral sequences. This domain has relationships with many other areas of knowledge,
and an exhaustive survey that includes all relevant contributions is well beyond the
scope of this work.
Nonetheless, we have surveyed and classified a significant fraction of the proposed
approaches to this important problem, taking into consideration the representations
they utilize, the similarity measures they propose and the applications to which they
have been applied.
In our opinion, this survey has shown that a significant number of algorithms exist
for the problem, but that the research has been driven not so much by actual
problems as by an interest in proposing new approaches. It has also shown that
researchers have ignored, at least for some time, some very significant contributions
from related fields, such as Markov-process based modeling and grammatical
inference techniques. We believe that the field will be significantly enriched if
knowledge from these other sources is incorporated into data mining algorithms and
applications.
References
1. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence data-
bases. FODO - Foundations of Data Organization and Algorithms Proceedings
(1993) 69-84
2. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. VLDB
(1994) 487-499
3. Agrawal, R., Lin, K., Sawhney, H., Shim, K.: Fast Similarity Search in the Presence
of Noise, Scaling, and Translation in Time-Series Databases. VLDB (1995) 490-501
4. Agrawal, R., Giuseppe, P., Edward, W.L., Zait, M.: Querying Shapes of Histories.
VLDB (1995) 502-514
5. Agrawal, R., Srikant, R.: Mining sequential patterns. ICDE (1995) 3-14
6. Antunes, C.: Knowledge Acquisition System to Support Low Vision Consultation.
MSc Thesis at Instituto Superior Técnico, Universidade Técnica de Lisboa. (2001)
7. Berndt, D., Clifford, J.: Finding Patterns in Time Series: a Dynamic Programming
Approach. In: Fayyad, U., Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances
in Knowledge Discovery and Data Mining. AAAI Press. (1996) 229-248
8. Bettini, C., Wang, S.X., Jajodia, S., Lin, J.L.: Discovering Frequent Event Patterns
with Multiple Granularities in Time Sequences. IEEE Transactions on Knowledge
and Data Engineering, vol. 10, nº 2 (1998) 222-237
9. Caraça-Valente, J., Chavarrías, I.: Discovering Similar Patterns in Time Series.
KDD (2000) 497-505
10. Carrasco, R., Oncina, J.: Learning Stochastic Regular Grammars by means of a State
Merging Method. ICGI (1994) 139-152
11. Chan, K., Fu, W.: Efficient Time Series Matching by Wavelets. ICDE (1999) 126-
133
12. Coiera, E.: The Role of Knowledge Based Systems in Clinical Practice. In Bara-
hona, P., Christensen, J.: Knowledge and Decisions in Health Telematics – The
Next Decade. IOS Press Amsterdam (1994) 199-203
13. Dajani, R., Miquel, M., Forilini, M.C., Rubel, P.: Modeling of Ventricular Repo-
larisation Time Series by Multi-layer Perceptrons. 8th Conference on Artificial In-
telligence in Medicine in Europe, AIME (2001) 152-155
14. Das, G., Gonopulos, D., Mannila, H.: Finding Similar Time Series. PKDD (1997)
88-100
15. Das, G., Mannila, H., Smyth, P.: Rule Discovery from Time Series. KDD (1998) 16-
22
16. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast Subsequence Matching in
Time-Series Databases. ACM SIGMOD Int. Conf. on Management of Data (1994)
419-429
17. Fama, E.: Efficient Capital Markets: a review of theory and empirical work. Journal
of Finance (May 1970) 383-417
18. Fisher, D.: Knowledge Acquisition Via Incremental Conceptual Clustering. Machine
Learning vol. 2 (1987) 139-172
19. Frawley, W., Piatetsky-Shapiro, G., Matheus, C.: Knowledge discovery in data-
bases: An overview. AI Magazine, vol. 13 n. 3 (1992) 57-70
20. Gavrilov, M., Anguelov, D., Indyk, P., Motwani, R.: Mining The Stock Market:
Which Measure Is Best? KDD (2000) 487-496
21. Ge, X., Smyth, P.: Deformable Markov Model Templates for Time Series Pattern
Matching. KDD (2000) 81-90
22. Giles, C., Chen, D., Sun, G., Chen, H., Lee, Y.: Extracting and Learning an Un-
known Grammar with Recurrent Neural Networks. In Moody, J., Hanson, S.,
Lippmann, R. (eds.): Advances in Neural Information Processing Systems, vol. 4.
Morgan Kaufmann (1992) 317-324
23. Giles, C., Lawrence, S., Tsoi, A.C.: Noisy Time Series Prediction using Recurrent
Neural Networks and Grammatical Inference. Machine Learning, vol. 44. (2001)
161-184
24. Guimarães, G.: The Induction of Temporal Grammatical Rules from Multivariate
Time Series. ICGI (2000) 127-140
25. Gunopulos, D., Das, G.: Time Series Similarity Measures and Time Series Index-
ing. In WOODSTOCK (1997)
26. Guralnik, V., Wijesekera, D., Srivastava, J.: Pattern Directed Mining of Sequence
Data. KDD (1998) 51-57
27. Guralnik, V., Srivastava, J.: Event Detection from Time Series Data. KDD (1999)
33-42
28. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation.
ACM SIGMOD Int. Conf. on Management of Data (2000) 1-12
29. Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M.: FreeSpan: Fre-
quent pattern-projected sequential pattern mining. ACM SIGKDD (2000) 355-359
30. Higuera, C.: Learning Stochastic Finite Automata from Experts. ICGI (1998) 79-89
31. Honavar, V., Slutzki, G.: Grammatical Inference. In Lecture Notes in Artificial In-
telligence, Vol. 1433. Springer (1998)
32. Horne, B., Giles, C.: An Experimental Comparison of Recurrent Neural Networks.
In Tesauro, G., Touretzky, D., Leen, T. (eds.): Advances in Neural Information
Processing Systems, vol. 7. MIT Press (1995) 697-704
33. Huang, Y., Yu, P.: Adaptive Query Processing for Time Series Data. KDD (1999)
282-286
34. Keogh, E., Smyth, P.: A probabilistic Approach to Fast Pattern Matching in Time
Series Databases. KDD (1997) 24-30
35. Keogh, E., Pazzani, M.: An enhanced representation of time series which allows fast
and accurate classification, clustering and relevance feedback. Knowledge Discov-
ery in Databases and Data Mining (1998) 239-241
36. Keogh, E., Pazzani, M.: Scaling up Dynamic Time Warping to Massive Datasets.
Proc. Principles and Practice of Knowledge Discovery in Databases (1999) 1-11
37. Ketterlin, A.: Clustering Sequences of Complex Objects. KDD (1997) 215-218
38. Juillé, H., Pollack, J.: A Stochastic Search Approach to Grammar Induction. ICGI
(1998) 79-89
39. Lang, K., Pearlmutter, B., Price, R.: Results of the Abbadingo One DFA Learning
Competition and a new Evidence-Driven State Merging Algorithm. ICGI (1998) 79-
89
40. Lesh, N., Zaki, M., Ogihara, M.: Mining Features for Sequence Classification. KDD
(1999). 342-346
41. Lin, L., Risch, T.: Querying Continuous Time Sequences. VLDB (1998) 170-181
42. Lu, H., Han, J., Feng, L.: Stock Price Movement Prediction and N-Dimensional In-
ter-Transaction Association Rules. ACM SIGMOD Workshop on Research Issues in
Data Mining and Knowledge Discovery (1998) 12.1-12.7
43. Mannila, H., Toivonen, H., Verkamo, I.: Discovering Frequent Episodes in Se-
quences. KDD (1995) 210-215
44. Mannila, H., Toivonen, H.: Discovering Generalized episodes using minimal occur-
rences. KDD (1996) 146-151
45. Mannila, H., Ronkainen, P.: Similarity of Event Sequences. In TIME (1997). 136-
139
46. Mannila, H., Pavlov, D., Smyth, P.: Prediction with Local Patterns using Cross-
Entropy. KDD (1999) 357-361
47. Mannila, H., Meek, C.: Global Partial Orders from Sequential Data. KDD (2000)
161-168
48. Miclet, L., Higuera, C.: Grammatical Inference: Learning Syntax from Sentences. In
Lecture Notes in Artificial Intelligence, Vol. 1147. Springer (1996)
49. Moody, J., Hanson, S., Lippmann, R. (eds): Advances in Neural Information Proc-
essing Systems, Vol. 4. Morgan Kaufmann (1992)
50. Oates, T.: Identifying Distinctive Subsequences in Multivariate Time Series by
Clustering. KDD (1999) 322-326
51. Oliveira, A.. Grammatical Inference: Algorithms and Applications. In Lecture Notes
in Artificial Intelligence, Vol. 1891. Springer (2000)
52. Oliveira, A., Silva, J.: Efficient Algorithms for the Inference of Minimum Size
DFAs. Machine Learning, Vol. 44 (2001) 93-119
53. Özden, B., Ramaswamy, S., Silberschatz, A.: Cyclic association rules. ICDE (1998)
412-421
54. Ramaswamy, S., Mahajan, S., Silberschatz, A.: On the Discovery of Interesting Pat-
terns in Association Rules. VLDB (1998) 368-379
55. Refenes, A.: Neural Networks in the Capital Markets. John Wiley and Sons, New
York, USA, (1995)
56. Roddick, J., Spiliopoulou, M.: A Survey of Temporal Knowledge Discovery Para-
digms and Methods. IEEE Transactions on Knowledge and Data Engineering, vol.
13 (2001). To appear
57. Shahar, Y., Musen, M.A.: Knowledge-Based Temporal Abstraction in Clinical Do-
mains. Artificial Intelligence in Medicine 8 (1996) 267-298
58. Sakakibara, Y.: Recent Advances of Grammatical Inference. Theoretical Computer
Science, Vol. 185 (1997) 15-45
59. Sakakibara, Y., Muramatsu, H.: Learning Context Free Grammars from Partially
Structured Examples. ICGI (2000) 229-240
60. Smyth, P.: Clustering Sequences with Hidden Markov Models. In Tesauro, G., To-
uretzky, D., Leen, T.: Advances in Neural Information Processing Systems, Vol. 9.
The MIT Press (1997) 648-654
61. Smyth, P.: Probabilistic Model-Based Clustering of Multivariate and Sequential
Data. Artificial Intelligence and Statistics (1999) 299-304
62. Stolcke, A., Omohundro, S.: Inducing Probabilistic Grammars by Bayesian Model
Merging. ICGI (1994) 106-118
63. Szeto, K.Y., Cheung, K.H.: Application of genetic algorithms in stock market pre-
diction. In Weigend, A.S. et al.: Decision Technologies for Financial Engineering.
World Scientific (1996) 95-103
64. Weigend, A., Gershenfeld, N.: Time Series Prediction: Forecasting the Future and
Understanding the Past. Addison-Wesley (1994)
65. Yi, B., Jagadish, H., Faloutsos, C.: Efficient Retrieval of Similar Time Sequences
Under Time Warping. ICDE (1998) 201-208