This paper addresses the challenges of mining frequent itemsets over data streams with a variable window size and limited memory. To detect the points of context change in the streaming transactions, we developed a two-level window structure that adjusts the window size on the fly, controls heterogeneity, and preserves homogeneity among the transactions added to the window. To minimize memory utilization and computational cost and to improve the scalability of the process, the design allows the coverage (support) threshold to be fixed at the window level. This paper introduces an incremental algorithm for mining frequent item-sets from the window together with a context variation analysis approach. The complete technique presented here is named Mining Frequent Item-sets using Variable Window Size fixed by Context Variation Analysis (MFI-VWSCVA). In a given window there are clear boundaries between frequent and infrequent item-sets. In this design, a change in window size represents concept drift in the information stream; in other words, whenever the window size cannot be set effectively, item-sets tend to become infrequent. The experiments we executed and documented show that the proposed algorithm is considerably more efficient than existing ones.
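The two-level window mechanism is described above only at a high level. As a hedged illustration (not the actual MFI-VWSCVA procedure), the core idea, growing the window while incoming batches remain homogeneous with the current context and shrinking it when a context change is detected, can be sketched as follows; the Jaccard similarity test and its threshold are illustrative assumptions:

```python
from collections import Counter


def similarity(window_items, batch_items):
    """Jaccard similarity between the item vocabularies of two collections."""
    a, b = set(window_items), set(batch_items)
    return len(a & b) / len(a | b) if a | b else 1.0


class VariableWindow:
    """Illustrative variable-size window: grows while new batches look
    homogeneous with the current context, resets on a context change."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.transactions = []
        self.items = Counter()

    def add_batch(self, batch):
        batch_items = [item for txn in batch for item in txn]
        if self.transactions and similarity(self.items, batch_items) < self.threshold:
            # context change detected: shrink the window to the new context
            self.transactions, self.items = [], Counter()
        self.transactions.extend(batch)
        self.items.update(batch_items)
        return len(self.transactions)  # current window size
```

A batch whose item vocabulary barely overlaps the window's triggers a reset, so the window size itself traces the concept drift, as the abstract describes.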
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over... - Waqas Tariq
The sliding window is an interesting model for frequent pattern mining over data streams because it handles concept change by considering recent data. In this study, a novel approximate algorithm for frequent itemset mining is proposed which operates in both the transactional and the time-sensitive sliding-window models. The algorithm divides the current window into a set of partitions and estimates the support of newly appeared itemsets within the previous partitions of the window. By monitoring an essential set of itemsets within the incoming data, the algorithm does not waste processing power on itemsets that are not frequent in the current window. Experimental evaluations using both synthetic and real datasets show the superiority of the proposed algorithm over previously proposed algorithms.
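A minimal sketch of the partition idea described above, assuming per-partition counting and a simple optimistic bound of (min_sup - 1) occurrences per missed partition for itemsets that first appear late in the window; the paper's actual estimation rule may differ:

```python
from collections import defaultdict
from itertools import combinations


def mine_sliding_window(partitions, min_sup, max_len=2):
    """Count itemsets per window partition. For an itemset first observed
    in partition j, estimate its support in partitions 0..j-1 with the
    optimistic bound of (min_sup - 1) per missed partition: an itemset
    unseen so far can have been at most borderline frequent there.
    min_sup is a per-partition count; an itemset is reported when its
    estimated whole-window count reaches min_sup * len(partitions)."""
    exact = defaultdict(int)
    first_seen = {}
    for j, part in enumerate(partitions):
        for txn in part:
            for k in range(1, max_len + 1):
                for iset in combinations(sorted(set(txn)), k):
                    first_seen.setdefault(iset, j)
                    exact[iset] += 1
    frequent = {}
    for iset, count in exact.items():
        estimated = count + first_seen[iset] * (min_sup - 1)
        if estimated >= min_sup * len(partitions):
            frequent[iset] = estimated
    return frequent
```

Because counts are kept per partition, expiring the oldest partition as the window slides only requires subtracting its contribution rather than rescanning the window.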
A Quantified Approach for Large Dataset Compression in Association Mining - IOSR Journals
Abstract: With the rapid development of computer and information technology in the last several decades, an enormous amount of data in science and engineering is continuously generated on a massive scale; data compression is needed to reduce cost and storage space. Compressing data and discovering association rules by
identifying relationships among sets of items in a transaction database is an important problem in Data Mining.
Finding frequent itemsets is computationally the most expensive step in association rule discovery and therefore
it has attracted significant research attention. However, existing compression algorithms are not appropriate for data mining on large datasets. In this research a new approach is described in which the original dataset is sorted in lexicographical order and the desired number of groups is formed to generate quantification tables. These quantification tables are used to generate the compressed dataset, from which complete frequent itemsets can be mined more efficiently. The experimental results show that the proposed algorithm outperforms the mining-merge algorithm at different support thresholds in terms of execution time.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
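As a rough sketch of the approach in the abstract above, assuming that a quantification table simply maps each distinct (lexicographically sorted) transaction in a group to its multiplicity; the paper's actual table layout may be richer:

```python
from collections import Counter


def compress(dataset, n_groups=2):
    """Illustrative compression in the spirit of the abstract: sort the
    transactions lexicographically, split them into n_groups, and build
    one quantification table per group mapping each distinct transaction
    to its multiplicity. Group boundaries and table layout are assumptions."""
    ordered = sorted(tuple(sorted(txn)) for txn in dataset)
    size = -(-len(ordered) // n_groups)  # ceiling division
    groups = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [Counter(group) for group in groups]
```

Support of any itemset can then be counted from the tables without touching the raw data, e.g. sum(c for table in tables for txn, c in table.items() if set(iset) <= set(txn)).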
This document describes a proposed Optimal Frequent Patterns System (OFPS) that uses a genetic algorithm to discover optimal frequent patterns from transactional databases more efficiently. The OFPS is a three-fold system that first prepares data through cleaning, integration and transformation. It then constructs a Frequent Pattern Tree to discover frequent patterns. Finally, it applies a genetic algorithm to generate optimal frequent patterns, simulating biological evolution to find the best solutions. The proposed system aims to overcome limitations of conventional association rule mining approaches and efficiently discover optimal patterns from large, changing datasets.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... - acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. This stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
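One candidate-counting pass of Apriori maps naturally onto map/reduce. The Python sketch below imitates the mapper and reducer roles only; real Hadoop code would implement Mapper/Reducer classes, and the candidate generation between Apriori passes is omitted here:

```python
from collections import Counter


def map_phase(txn_split, candidates):
    """Mapper: for one data split, emit (candidate, 1) for every
    candidate itemset contained in a transaction."""
    pairs = []
    for txn in txn_split:
        items = set(txn)
        for cand in candidates:
            if set(cand) <= items:
                pairs.append((cand, 1))
    return pairs


def reduce_phase(pairs, min_count):
    """Reducer: sum the partial counts and keep itemsets meeting min_count."""
    totals = Counter()
    for cand, one in pairs:
        totals[cand] += one
    return {cand: n for cand, n in totals.items() if n >= min_count}
```

Each node counts candidates over its own split of the voluminous data; only the small (candidate, count) pairs travel to the reducer, which is the point of the distributed design.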
Mining Regular Patterns in Data Streams Using Vertical Format - CSCJournals
The increasing prominence of data streams has led to the study of online mining in order to capture interesting trends, patterns and exceptions. Recently, the temporal regularity of a pattern's occurrence behavior has been treated as an emerging topic in several online applications such as network traffic, sensor networks, e-business and stock market analysis. A pattern is said to be regular in a data stream if the period of its occurrence does not exceed a user-given regularity threshold. Although some efforts have been made to find regular patterns over stream data, no method has yet been developed that uses the vertical data format. Therefore, in this paper we develop a new method, called the VDSRP-method, to generate the complete set of regular patterns over a data stream at a user-given regularity threshold. Our experimental results show high efficiency in terms of execution time and memory consumption.
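Under the usual definition of regularity (the maximum period between consecutive occurrences of a pattern must not exceed the threshold), the vertical tid-list representation makes the check straightforward. A small sketch, assuming 1-based transaction ids:

```python
def max_period(tids, n):
    """Maximum gap between consecutive occurrences of a pattern, including
    the gaps before the first and after the last occurrence, over a
    stream of n transactions (tids are 1-based and sorted)."""
    points = [0] + list(tids) + [n]
    return max(b - a for a, b in zip(points, points[1:]))


def is_regular(tids, n, reg_threshold):
    """A pattern is regular when its maximum period stays within the
    user-given regularity threshold."""
    return max_period(tids, n) <= reg_threshold
```

The vertical format pays off for longer patterns: the tid-list of a 2-pattern is just the intersection of its items' tid-lists, e.g. sorted(set(tids_a) & set(tids_b)), with no rescan of the stream.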
An incremental mining algorithm for maintaining sequential patterns using pre... - Editor IJMTER
Mining useful information and helpful knowledge from large databases has evolved into
an important research area in recent years. Among the classes of knowledge derived, finding
sequential patterns in temporal transaction databases is very important since it can help model
customer behavior. In the past, researchers usually assumed databases were static to simplify data-mining problems. In real-world applications, new transactions may be added to databases
frequently. Designing an efficient and effective mining algorithm that can maintain sequential
patterns as a database grows is thus important. In this paper, we propose a novel incremental mining
algorithm for maintaining sequential patterns based on the concept of pre-large sequences to reduce
the need for rescanning original databases.
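The pre-large idea rests on two support thresholds: sequences between the lower and upper thresholds are kept as "pre-large" so that modest database growth cannot turn a small sequence directly into a large one. A sketch of the classification, together with one form of the safety bound used in the pre-large literature:

```python
def classify(count, db_size, s_upper, s_lower):
    """Pre-large classification: 'large' at or above the upper (normal)
    support threshold, 'pre-large' between the two thresholds, 'small'
    below the lower threshold."""
    support = count / db_size
    if support >= s_upper:
        return "large"
    if support >= s_lower:
        return "pre-large"
    return "small"


def safe_increment(db_size, s_upper, s_lower):
    """One form of the safety bound used in the pre-large literature: the
    number of new sequences that can arrive before the original database
    may need rescanning."""
    return int((s_upper - s_lower) * db_size / (1 - s_upper))
```

As long as fewer new sequences than safe_increment(...) have been inserted, maintaining the large and pre-large sets is enough, which is what lets the algorithm avoid rescanning the original database.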
Enhancement techniques for data warehouse staging area - IJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
Different Classification Technique for Data mining in Insurance Industry usin... - IOSRjournaljce
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining means the discovery of effective unknown patterns from large databases. It is an interactive knowledge discovery procedure which includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
The document discusses mining frequent patterns from object-relational data. It begins with an introduction to data mining and frequent pattern mining. It then reviews related work on data models, including relational, object-oriented, and object-relational data models. Several frequent pattern mining algorithms are described, including Apriori, FP-Growth, ECLAT and RElim. Two approaches for mining object-relational data are proposed: a fundamental approach that treats it similarly to transactional data, and a nested-relations approach to handle nested attributes. The document concludes by outlining parameters for applying frequent pattern mining algorithms to object-relational data.
A SURVEY ON DATA MINING IN STEEL INDUSTRIES - IJCSES Journal
In industrial environments, huge amounts of data are generated and collected in databases and data warehouses from all involved areas such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, customer relationship management, and so on. Data mining has become a useful tool for knowledge acquisition in the industrial processes of iron and steel making. Owing to the rapid growth in data mining, various industries have started using the technology to search for hidden patterns, which may add new knowledge to the system and support new models that enhance production quality, productivity, cost optimization, maintenance, and so on. The continuous improvement of all steel production processes with respect to the avoidance of quality deficiencies, and the related improvement of production yield, is an essential task for steel producers. A zero-defect strategy is therefore popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes their application in the industrial environment and especially in the steel industry.
1. The document discusses stream data mining and compares classification algorithms. It defines stream data and challenges in mining stream data.
2. It describes sampling techniques and classification algorithms for stream data mining including Naive Bayesian, Hoeffding Tree, VFDT, and CVFDT.
3. The algorithms are experimentally compared in terms of time, memory usage, accuracy, and ability to handle concept drift. VFDT and CVFDT are found to have advantages over Hoeffding Tree in accuracy while maintaining speed, but CVFDT can additionally detect and respond to concept drift.
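The Hoeffding Tree family decides when to split using the Hoeffding bound, which is worth stating concretely: with probability 1 - delta, the true mean of a variable with range R lies within epsilon = sqrt(R^2 ln(1/delta) / 2n) of the mean observed over n stream examples. A minimal sketch:

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the mean observed over
    n samples of a variable with range value_range lies within the
    returned epsilon of the true mean. Hoeffding Tree/VFDT split a node
    once the gain gap between the two best attributes exceeds epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

Because epsilon shrinks as n grows, the tree commits to a split only after enough stream examples have been seen, which is the source of the speed/accuracy trade-off compared in the experiments above.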
Managing and implementing the data mining process using a truly stepwise appr... - Shanmugaraj Ramaiah
This document proposes a "truly stepwise" approach to managing and implementing the data mining process. It involves storing the output of each main phase - such as pre-processing, feature extraction, and modeling - in a storage medium between phases. This allows each phase to work independently without needing to rerun previous phases. It compares this approach to the traditional method where all phases are combined and run sequentially without intermediate storage. The document then presents a case study applying this new approach to a project on resistance spot welding quality estimation, finding it performed better than the traditional approach.
Certain Investigation on Dynamic Clustering in Dynamic Datamining - ijdmtaiir
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a newer research area concerned with datasets that have dynamic aspects: the clusters must be updated whenever new data records are added, which may change the clustering over time. When data is continuously updated in huge volumes, rescanning the database, as static data mining would require, is not feasible; dynamic data mining handles this case. It applies when the derived information is needed for analysis while the environment is dynamic, i.e. many updates occur. Now that this need has been established by most researchers, attention is moving to solving the associated problems, and this research concentrates on the problem of mining dynamic databases. This paper surveys existing work related to dynamic clustering and incremental data clustering.
A Novel Data mining Technique to Discover Patterns from Huge Text Corpus - IJMER
Today, we have far more information than we can handle: from business transactions and scientific
data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough
anymore for decision-making. Confronted with huge collections of data, we have now created new needs to
help us make better managerial choices. These needs are automatic summarization of data, extraction of the "essence" of information stored, and the discovery of patterns in raw data. With this, data mining with inventory patterns came into existence and became popular. Data mining finds these patterns and relationships using data analysis tools and techniques to build models.
With the development of databases, the volume of stored data increases rapidly, and much important information lies hidden in these large amounts of data. If this information can be extracted from the database, it will create a lot of profit for the organization. The question organizations ask is how to extract this value; the answer is data mining. Many technologies are available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks because of their black-box nature, even though they have proven themselves in many situations. This paper gives an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data so as to give individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so many investors look for criteria to compare stocks, select the best ones, and choose strategies that maximize the earning value of the investment process. The enormous amount of valuable data generated by the stock market has therefore attracted researchers to explore this problem domain with different methodologies, and research in data mining has gained high attention due to the importance of its applications and the ever-increasing generation of information. Data mining tools such as association rules, rule induction methods, and the Apriori algorithm are used to find associations between different scrips of the stock market, and much research and development has addressed the reasons for fluctuations in the Indian stock exchanges. Nowadays, two factors, gold prices and US dollar prices, dominate the Indian stock market; statistical correlation is used to find the relationship between gold prices, dollar prices, and the BSE index, which helps the activities of stock operators, brokers, investors, and jobbers. These activities are based on forecasting the fluctuation of index share prices, gold prices, dollar prices, and customer transactions. Hence the researcher has taken up these problems as a topic for research.
An efficient algorithm for sequence generation in data mining - ijcisjournal
Data mining is the method or the activity of analyzing data from different perspectives and summarizing it
into useful information. There are several major data mining techniques that have been developed and are
used in the data mining projects which include association, classification, clustering, sequential patterns,
prediction and decision tree. Among different tasks in data mining, sequential pattern mining is one of the
most important tasks. Sequential pattern mining involves the mining of the subsequences that appear
frequently in a set of sequences. It has a variety of applications in several domains such as the analysis of
customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access
patterns, seismologic data and weather observations. Various models and algorithms have been developed for the efficient mining of sequential patterns in large amounts of data. This research paper analyzes the efficiency of three sequence generation algorithms, namely GSP, SPADE and PrefixSpan, on a retail dataset using various performance factors. From the experimental results, it is observed that the PrefixSpan algorithm is more efficient than the other two algorithms.
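To make the pattern-growth idea behind PrefixSpan concrete, here is a deliberately simplified sketch for sequences of single items; the full algorithm also handles itemset elements and uses pseudo-projection for efficiency:

```python
def prefixspan(sequences, min_sup):
    """Minimal PrefixSpan sketch for sequences of single items: grow
    frequent prefixes depth-first, mining each projected database."""
    results = []

    def mine(prefix, projected):
        # count items occurring in the projected postfixes
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, c in sorted(counts.items()):
            if c >= min_sup:
                new_prefix = prefix + [item]
                results.append((tuple(new_prefix), c))
                # project: keep the postfix after the first occurrence of item
                new_proj = [seq[seq.index(item) + 1:]
                            for seq in projected if item in seq]
                mine(new_prefix, new_proj)

    mine([], sequences)
    return results
```

Because each recursion works only on the projected postfixes, no candidate sequences are generated and tested against the whole database, which is the usual explanation for PrefixSpan outperforming GSP.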
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE - IJDKP
Metadata represents information about the data stored in Data Warehouses and is a mandatory element in building an efficient Data Warehouse. Metadata helps in data integration,
lineage, data quality and populating transformed data into data warehouse. Spatial data warehouses are
based on spatial data mostly collected from Geographical Information Systems (GIS) and the transactional
systems that are specific to an application or enterprise. Metadata design and deployment is the most
critical phase in building of data warehouse where it is mandatory to bring the spatial information and
data modeling together. In this paper, we present a holistic metadata framework that drives metadata
creation for spatial data warehouses. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time while ensuring that accurate information, including the spatial information, is fetched from the Data Warehouse.
This document presents an analytical framework for classifying data stream mining techniques based on their approaches to challenges. It discusses how data streams pose computational challenges due to their continuous, massive, and potentially infinite nature. It classifies data stream mining challenges and techniques for addressing them, such as approaches that modify existing data mining algorithms or develop new ones. The document proposes an analytical framework to evaluate how data mining applications can help develop novel data stream mining algorithms to handle different tasks.
Knowledge discovery in databases
Data pyramid
Introduction to Data Mining
Definition of Data Mining
Data Mining as an Interdisciplinary field
Data Mining and Business Intelligence
This document discusses negative sequential pattern mining. It begins with an introduction to data mining, sequential pattern mining, and the difference between positive and negative sequential patterns. Negative sequential patterns consider absent items in sequences and can provide valuable information. The document then surveys five different algorithms that have been proposed for mining negative sequential patterns more efficiently. These algorithms aim to reduce computational costs by pruning candidate patterns or avoiding rescanning databases multiple times.
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
Analysis of Pattern Transformation Algorithms for Sensitive Knowledge Protect... - IOSR Journals
The document analyzes pattern transformation algorithms for sensitive knowledge protection in data mining. It discusses:
1) Three main privacy preserving techniques - heuristic, cryptography, and reconstruction-based. The proposed algorithms use heuristic-based techniques.
2) Four proposed heuristic-based algorithms - item-based Maxcover (IMA), pattern-based Maxcover (PMA), transaction-based Maxcover (TMA), and Sensitivity Cost Sanitization (SCS) - that modify sensitive transactions to decrease support of restrictive patterns.
3) Performance improvements including parallel and incremental approaches to handle large, dynamic databases while balancing privacy and utility.
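The common mechanics behind these heuristics can be illustrated in a few lines: lower the support of a restrictive pattern by removing a "victim" item from supporting transactions until a target support is reached. Victim and transaction selection below are deliberately naive (first item, first transaction); the surveyed algorithms choose them by cost measures such as Maxcover or sensitivity cost instead:

```python
def sanitize(transactions, restrictive, target_support):
    """Illustrative heuristic sanitization: while the restrictive pattern's
    support exceeds target_support, pick a supporting transaction and
    delete one item of the pattern (the 'victim'), lowering the support.
    Selection strategies here are placeholders for the cost-based ones
    used by IMA, PMA, TMA and SCS."""
    txns = [set(txn) for txn in transactions]
    pattern = set(restrictive)
    victim = sorted(pattern)[0]
    while sum(pattern <= txn for txn in txns) > target_support:
        for txn in txns:
            if pattern <= txn:
                txn.discard(victim)
                break
    return [sorted(txn) for txn in txns]
```

The privacy/utility balance mentioned above shows up directly here: every deleted item hides the restrictive pattern a little more but also distorts the legitimate patterns the transaction supports.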
New proximity estimate for incremental update of non uniformly distributed cl... - IJDKP
The conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters. Many real life databases keep growing incrementally. For such dynamic databases, the patterns
extracted from the original database become obsolete. The conventional clustering algorithms are thus not suitable for incremental databases, since they lack the capability to modify the clustering results in accordance
with recent updates. In this paper, the author proposes a new incremental clustering algorithm called
CFICA(Cluster Feature-Based Incremental Clustering Approach for numerical data) to handle numerical
data and suggests a new proximity metric called Inverse Proximity Estimate (IPE) which considers the
proximity of a data point to a cluster representative as well as its proximity to a farthest point in its vicinity.
CFICA makes use of the proposed proximity metric to determine the membership of a data point into a
cluster.
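As a hedged reading of the metric described above (the precise combination used in CFICA may differ), a proximity that adds the distance to the cluster representative and the distance to the farthest cluster member could look like:

```python
import math


def inverse_proximity(point, centroid, cluster_points):
    """Hedged reading of IPE: distance to the cluster representative plus
    distance to the farthest member of the cluster; the exact combination
    used in CFICA may differ."""
    farthest = max(math.dist(point, q) for q in cluster_points)
    return math.dist(point, centroid) + farthest


def assign(point, clusters):
    """Assign an incremental point to the cluster minimizing the metric."""
    return min(clusters, key=lambda cid: inverse_proximity(
        point, clusters[cid]["centroid"], clusters[cid]["points"]))
```

Including the farthest-member term penalizes stretched, non-uniform clusters, which is the motivation the abstract gives for going beyond a plain centroid distance.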
With the rapid development in Geographic Information Systems (GISs) and their applications, more and
more geographical databases have been developed by different vendors. However, data integration and access are still a big problem for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data queries. The paper
describes a framework for integrating information from repositories containing different vector data sets
formats and repositories containing raster datasets. The presented approach converts different vector data
formats into a single unified format (File Geo-Database “GDB”). In addition, we employ “metadata” to
support a wide range of users’ queries to retrieve relevant geographic information from heterogeneous and
distributed repositories. Such an employment enhances both query processing and performance.
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING - IJDKP
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
A new clustering approach for anomaly intrusion detection (IJDKP)
Recent advances in technology have made our work easier compared to earlier times. Computer networks are growing day by day, but the security of computers and networks has always been a major concern for organizations ranging from small to large enterprises. Organizations are aware of the possible threats and attacks and try to stay on the safe side, yet attackers are still able to exploit loopholes.
Intrusion detection is one of the major fields of research, and researchers are trying to find new algorithms for detecting intrusions. Clustering techniques from data mining are an interesting area of research for detecting possible intrusions and attacks. This paper presents a new clustering approach for anomaly intrusion detection based on the K-medoids method of clustering and certain modifications of it. The proposed algorithm achieves a high detection rate and overcomes the disadvantages of the K-means algorithm.
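The abstract does not spell out the paper's specific modifications, but the underlying K-medoids idea it builds on can be sketched as a minimal PAM-style loop; the distance function and the toy points below are illustrative, not from the paper.

```python
import random

def k_medoids(points, k, dist, iters=50, seed=0):
    """Minimal K-medoids sketch: unlike K-means, cluster centers are actual
    data points, which makes the method less sensitive to outliers."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    clusters = {}
    for _ in range(iters):
        # assignment step: each point goes to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        # update step: pick the member minimising total in-cluster distance
        new_medoids = [min(members, key=lambda c: sum(dist(c, q) for q in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters

euclid2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(pts, 2, euclid2)
```

Because centers are restricted to actual records, a single extreme outlier cannot drag a center away from its cluster the way it can with a K-means mean, which is one of the K-means disadvantages the paper refers to.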
The efficient and effective monitoring of mobile networks is vital given the number of users who rely on such networks and the importance of those networks. The purpose of this paper is to present a monitoring scheme for mobile networks based on the use of rules and decision tree data mining classifiers to improve fault detection and handling. The goal is to have optimisation rules that improve anomaly detection. In addition, a monitoring scheme that relies on Bayesian classifiers was also implemented for the purpose of fault isolation and localisation. The data mining techniques described in this paper are intended to allow a system to be trained to actually learn network fault rules. The results of the tests that were conducted support the conclusion that the rules were highly effective in improving network troubleshooting.
Different Classification Technique for Data mining in Insurance Industry usin... (IOSRjournaljce)
This paper addresses the issues and techniques involved when Property/Casualty actuaries apply data mining methods. Data mining means the discovery of previously unknown patterns in large databases. It is an interactive knowledge discovery procedure that includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the knowledge discovery process and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
The document discusses mining frequent patterns from object-relational data. It begins with an introduction to data mining and frequent pattern mining. It then reviews related work on data models, including relational, object-oriented, and object-relational data models. Several frequent pattern mining algorithms are described, including Apriori, FP-Growth, ECLAT and RElim. Two approaches for mining object-relational data are proposed: a fundamental approach that treats it similarly to transactional data, and a nested-relations approach to handle nested attributes. The document concludes by outlining parameters for applying frequent pattern mining algorithms to object-relational data.
A SURVEY ON DATA MINING IN STEEL INDUSTRIES (IJCSES Journal)
In industrial environments, a huge amount of data is generated and collected in databases and data warehouses from all involved areas, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, and customer relationship management. Data mining has become a useful tool for knowledge acquisition in the industrial processes of iron and steel making. Due to the rapid growth of data mining, various industries have started using data mining technology to search for hidden patterns, which can feed the system with new knowledge and inform new models that enhance production quality, productivity, cost optimisation, maintenance, and so on. The continuous improvement of all steel production processes with regard to avoiding quality deficiencies, and the related improvement of production yield, is an essential task for steel producers. Therefore, the zero-defect strategy is popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes their application in the industrial environment and especially in the steel industry.
1. The document discusses stream data mining and compares classification algorithms. It defines stream data and challenges in mining stream data.
2. It describes sampling techniques and classification algorithms for stream data mining including Naive Bayesian, Hoeffding Tree, VFDT, and CVFDT.
3. The algorithms are experimentally compared in terms of time, memory usage, accuracy, and ability to handle concept drift. VFDT and CVFDT are found to have advantages over Hoeffding Tree in accuracy while maintaining speed, but CVFDT can additionally detect and respond to concept drift.
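The Hoeffding tree family (VFDT, CVFDT) decides when a split is statistically safe using the Hoeffding bound; a one-line sketch of that bound follows, with illustrative parameter values.

```python
from math import log, sqrt

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the observed mean of n samples of a random
    variable with the given range is within eps of the true mean.  VFDT splits
    a node once the observed gap between the two best split attributes
    exceeds this eps, so it never needs to revisit past examples."""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))

eps = hoeffding_bound(value_range=1.0, delta=0.05, n=1000)
```

The bound shrinks as more examples arrive, which is why these algorithms can maintain near-batch accuracy at streaming speed.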
Managing and implementing the data mining process using a truly stepwise appr... (Shanmugaraj Ramaiah)
This document proposes a "truly stepwise" approach to managing and implementing the data mining process. It involves storing the output of each main phase - such as pre-processing, feature extraction, and modeling - in a storage medium between phases. This allows each phase to work independently without needing to rerun previous phases. It compares this approach to the traditional method where all phases are combined and run sequentially without intermediate storage. The document then presents a case study applying this new approach to a project on resistance spot welding quality estimation, finding it performed better than the traditional approach.
Certain Investigation on Dynamic Clustering in Dynamic Datamining (ijdmtaiir)
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects. It requires updates of the clusters whenever new data records are added to the dataset, and may result in a change of clustering over time. When there are continuous updates and huge amounts of dynamic data, rescanning the database is not feasible in static data mining, but it is possible in the dynamic data mining process. Dynamic data mining occurs when the derived information is kept available for analysis while the environment is dynamic, i.e. many updates occur. Since this need has now been established by most researchers, the research focus is moving toward solving the problems of mining dynamic databases. This paper surveys existing work related to dynamic clustering and incremental data clustering.
A Novel Data mining Technique to Discover Patterns from Huge Text Corpus (IJMER)
Today, we have far more information than we can handle: from business transactions and scientific data to satellite pictures, text reports and military intelligence. Information retrieval alone is simply not enough anymore for decision-making. Confronted with huge collections of data, we now have new needs to help us make better managerial choices: automatic summarization of data, extraction of the "essence" of the information stored, and the discovery of patterns in raw data. Out of these needs, data mining came into existence and became popular. Data mining finds these patterns and relationships using data analysis tools and techniques to build models.
With the development of databases, the volume of data stored in them is increasing rapidly, and much important information is hidden in these large amounts of data. If this information can be extracted, it will create a lot of profit for the organization. The question organizations are asking is how to extract this value, and the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data so as to give individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so at present many investors look for criteria to compare stocks and select the best, and choose strategies that maximize the earning value of the investment process. The enormous amount of valuable data generated by the stock market has attracted researchers to explore this problem domain using different methodologies, and research in data mining has gained high attention due to the importance of its applications and the increasing generation of information. Data mining tools such as association rules, rule induction methods and the Apriori algorithm are used to find associations between different scripts of the stock market, and much research and development has taken place regarding the reasons for fluctuations of the Indian stock exchange. Nowadays, however, two factors, gold prices and US dollar prices, are particularly dominant influences on the Indian stock market; statistical correlation is used to find the correlation between gold prices, dollar prices and the BSE index, which helps the activities of stock operators, brokers, investors and jobbers. These activities are based on forecasting the fluctuation of index share prices, gold prices, dollar prices and customer transactions. Hence the researcher has taken up these problems as a topic for research.
An efficient algorithm for sequence generation in data mining (ijcisjournal)
Data mining is the method or activity of analyzing data from different perspectives and summarizing it into useful information. Several major data mining techniques have been developed and are used in data mining projects, including association, classification, clustering, sequential patterns, prediction and decision trees. Among the different tasks in data mining, sequential pattern mining is one of the most important. Sequential pattern mining involves mining the subsequences that appear frequently in a set of sequences. It has a variety of applications in several domains, such as the analysis of customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access patterns, seismologic data and weather observations. Various models and algorithms have been developed for the efficient mining of sequential patterns in large amounts of data. This research paper analyzes the efficiency of three sequence generation algorithms, namely GSP, SPADE and PrefixSpan, on a retail dataset by applying various performance factors. From the experimental results, it is observed that the PrefixSpan algorithm is more efficient than the other two algorithms.
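All three algorithms ultimately count the same thing: the number of data sequences that contain a candidate pattern as an ordered subsequence. A minimal support-counting sketch, with itemsets as Python sets and an invented toy database:

```python
def is_subsequence(pattern, sequence):
    """True if pattern's itemsets occur, in order, within the sequence.
    Consuming a shared iterator enforces the left-to-right order."""
    it = iter(sequence)
    return all(any(p <= s for s in it) for p in pattern)

def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

db = [
    [{"bread"}, {"milk", "eggs"}, {"jam"}],
    [{"bread"}, {"jam"}],
    [{"milk"}, {"jam"}],
]
s = support([{"bread"}, {"jam"}], db)   # contained in the first two sequences
```

GSP, SPADE and PrefixSpan differ in how they enumerate candidates and organize the data (horizontal candidate generation, vertical id-lists, and prefix-projected databases, respectively), not in this underlying notion of support.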
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents information about the data to be stored in a Data Warehouse, and is a mandatory element in building an efficient Data Warehouse. Metadata helps in data integration, lineage, data quality and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and transactional systems that are specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of accessing data in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time while accurate information, including the spatial information, is fetched from the Data Warehouse.
This document presents an analytical framework for classifying data stream mining techniques based on their approaches to challenges. It discusses how data streams pose computational challenges due to their continuous, massive, and potentially infinite nature. It classifies data stream mining challenges and techniques for addressing them, such as approaches that modify existing data mining algorithms or develop new ones. The document proposes an analytical framework to evaluate how data mining applications can help develop novel data stream mining algorithms to handle different tasks.
Knowledge discovery in databases
Data pyramid
Introduction to data mining
Definition of Data Mining
Data Mining as an Interdisciplinary field
Data Mining and Business Intelligence
This document discusses negative sequential pattern mining. It begins with an introduction to data mining, sequential pattern mining, and the difference between positive and negative sequential patterns. Negative sequential patterns consider absent items in sequences and can provide valuable information. The document then surveys five different algorithms that have been proposed for mining negative sequential patterns more efficiently. These algorithms aim to reduce computational costs by pruning candidate patterns or avoiding rescanning databases multiple times.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
Analysis of Pattern Transformation Algorithms for Sensitive Knowledge Protect... (IOSR Journals)
The document analyzes pattern transformation algorithms for sensitive knowledge protection in data mining. It discusses:
1) Three main privacy preserving techniques - heuristic, cryptography, and reconstruction-based. The proposed algorithms use heuristic-based techniques.
2) Four proposed heuristic-based algorithms - item-based Maxcover (IMA), pattern-based Maxcover (PMA), transaction-based Maxcover (TMA), and Sensitivity Cost Sanitization (SCS) - that modify sensitive transactions to decrease support of restrictive patterns.
3) Performance improvements including parallel and incremental approaches to handle large, dynamic databases while balancing privacy and utility.
Prediction of quality features in Iberian ham by applying data mining on data... (IJDKP)
This paper aims to predict quality features of Iberian hams by using non-destructive methods of analysis
and data mining. Iberian hams were analyzed by Magnetic Resonance Imaging (MRI) and Computer
Vision Techniques (CVT) throughout their ripening process and physico-chemical parameters from them
were also measured. The obtained data were used to create an initial database. Deductive techniques of
data mining (multiple linear regression) were used to estimate new data, allowing the insertion of new
records in the database. Predictive techniques of data mining were applied (multiple linear regression) on
MRI-CVT data, achieving prediction equations of weight, moisture and lipid content. Finally, data from
prediction equations were compared to data determined by physico-chemical analysis, obtaining high
correlation coefficients in most cases. Therefore, data mining, MRI and CVT are suitable tools to estimate
quality traits of Iberian hams. This would improve the control of the ham processing in a non-destructive
way.
The document summarizes research on performing spatio-textual similarity joins. It discusses:
1) Developing a filter-and-refine framework to efficiently find similar object pairs from two datasets using signatures.
2) Generating spatial and textual signatures for objects and building inverted indexes on the signatures to find candidate pairs.
3) Refining the candidate pairs to obtain the final result pairs that satisfy spatial and textual similarity thresholds.
The problem considered is that of finding frequent subpaths in a database of paths in a fixed undirected graph. This problem arises in applications such as predicting congestion in network and vehicular traffic. An algorithm called AFS, based on the classic frequent itemset mining algorithm Apriori, is developed, with efficiency significantly improved over Apriori, from exponential in transaction size to quadratic, by exploiting the underlying graph structure. This efficiency makes AFS feasible for practical input path sizes. It is also proved that a natural generalization of the frequent subpaths problem is not amenable to any solution quicker than Apriori.
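The key idea, growing candidates only along already-frequent shorter subpaths rather than over all itemsets, can be sketched as a level-wise loop. This is a simplification of AFS: the pruning rule shown is the generic Apriori border check on contiguous subpaths, and the toy paths are invented.

```python
from collections import Counter

def frequent_subpaths(paths, minsup):
    """Level-wise mining of frequent contiguous subpaths.

    A length-(k+1) candidate is counted only if both of its length-k borders
    are already frequent, so candidates grow only along the graph's paths
    instead of over arbitrary itemsets."""
    # level 1: subpaths of length 2 (edges)
    counts = Counter(tuple(p[i:i + 2]) for p in paths for i in range(len(p) - 1))
    frequent = {sp for sp, c in counts.items() if c >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        counts = Counter()
        for p in paths:
            for i in range(len(p) - k):
                sub = tuple(p[i:i + k + 1])
                # Apriori-style prune on the two overlapping borders
                if sub[:-1] in frequent and sub[1:] in frequent:
                    counts[sub] += 1
        frequent = {sp for sp, c in counts.items() if c >= minsup}
        result |= frequent
        k += 1
    return result

paths = [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "d")]
found = frequent_subpaths(paths, minsup=2)
```

Because a path of length L has only O(L²) contiguous subpaths, the candidate space stays quadratic in transaction size, which is the efficiency gain the abstract describes.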
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two approaches in data mining which may also be used to perform text classification and text clustering; classification is supervised, while clustering is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
Data mining techniques are used to retrieve knowledge from large databases, which helps organizations establish their business effectively in a competitive world. Sometimes, however, this violates the privacy of individual customers. In this paper we propose an algorithm that addresses the problem of privacy issues related to individual customers, together with a transformation technique based on the Walsh-Hadamard transformation (WHT) and rotation. The WHT generates an orthogonal matrix that transfers the entire data into a new domain while maintaining the distances between data records; however, these records can be reconstructed by applying statistics-based techniques, i.e. the inverse matrix, so this problem is resolved by applying a rotation transformation. In this work, we increase the difficulty for unauthorized persons of accessing the original data of other organizations by applying the rotation transformation. The experimental results show that the proposed transformation gives the same classification accuracy as the original data set. In this paper we compare the results with existing techniques such as data perturbation by Simple Additive Noise (SAN) and Multiplicative Noise (MN), the Discrete Cosine Transformation (DCT), wavelets, and First and Second order Sum and Inner product Preservation (FISIP) transformation techniques. Based on privacy measures, the paper concludes that the proposed transformation technique is better at maintaining the privacy of individual customers.
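A rough sketch of the WHT-plus-rotation idea follows. The Sylvester construction, the choice of which attributes to rotate, and the angle theta are illustrative assumptions, not the paper's exact construction; the point being demonstrated is that both factors are orthogonal, so pairwise distances between records (and hence distance-based classification) are preserved.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power
    of 2), normalised so that the result is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def perturb(data, theta=0.7):
    """Map records through the WHT, then rotate the first two transformed
    attributes by theta so a plain inverse WHT no longer recovers the data."""
    n = data.shape[1]
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[:2, :2] = [[c, -s], [s, c]]       # rotation block (illustrative)
    return data @ hadamard(n) @ R

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 3.0, 2.0, 1.0]])
Y = perturb(X)
# orthogonal transforms preserve inter-record Euclidean distances
```

An attacker who knows only that a WHT was applied can invert it with the (public) Hadamard matrix; the extra secret rotation is what blocks that reconstruction.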
Accurate time series classification using shapelets (IJDKP)
Time series data are sequences of values measured over time. One of the most recent approaches to classification of time series data is to find shapelets within a data set. Time series shapelets are time series subsequences which represent a class. In order to compare two time series sequences, existing work uses the Euclidean distance measure. The problem with Euclidean distance is that it requires data to be standardized if scales differ. In this paper, we perform classification of time series data using time series shapelets with the Mahalanobis distance measure. The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance (residual) from a common point. It is used to identify and gauge the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. We show that the Mahalanobis distance results in higher accuracy than the Euclidean distance measure.
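The distance swap at the heart of the paper can be illustrated directly; the vectors and covariance matrices below are invented for the example.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x, y = np.array([1.0, 2.0]), np.array([3.0, 5.0])
d_euclid = mahalanobis(x, y, np.eye(2))            # identity cov -> Euclidean
d_scaled = mahalanobis(x, y, np.diag([4.0, 9.0]))  # per-axis variances
```

With an identity covariance the measure reduces to ordinary Euclidean distance; with the true per-axis variances each difference is divided by its own scale, which is why no prior standardization of the series is needed.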
Emotion detection is one of the most active emerging issues in human-computer interaction. A sufficient amount of work has been done by researchers to detect emotions from facial and audio information, whereas recognizing emotions from textual data is still a fresh and hot research area. This paper presents a knowledge-based survey of emotion detection from textual data and the methods used for this purpose. The paper then proposes a new architecture for recognizing emotions from text documents. The proposed architecture is composed of two main parts: an emotion ontology and an emotion detector algorithm. The proposed emotion detector system takes a text document and the emotion ontology as inputs and produces one of six emotion classes (i.e. love, joy, anger, sadness, fear and surprise) as the output.
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING" (IJDKP)
Clustering is one of the data mining techniques used to discover business intelligence by grouping objects into clusters using a similarity measure. Clustering is an unsupervised learning process that has many uses in real-time applications in the fields of marketing, biology, libraries, insurance, city planning, earthquake studies and document clustering. Latent trends and relationships among data objects can be unearthed using clustering algorithms. Many clustering algorithms exist; however, the quality of the clusters has to be given paramount importance. The quality objective is to achieve the highest similarity between objects of the same cluster and the lowest similarity between objects of different clusters. In this context, we studied two widely used clustering algorithms, K-Means and Fuzzy K-Means: K-Means is an exclusive clustering algorithm, while Fuzzy K-Means is an overlapping clustering algorithm. In this paper we prove the hypothesis “Fuzzy K-Means is better than K-Means for Clustering” through both a literature and an empirical study. We built a prototype application to demonstrate the differences between the two clustering algorithms. The experiments are made on a diabetes dataset obtained from the UCI repository. The empirical results reveal that the performance of Fuzzy K-Means is better than that of K-Means in terms of quality, or accuracy, of clusters. Thus, our empirical study proved the hypothesis “Fuzzy K-Means is better than K-Means for Clustering”.
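The difference being tested shows up in the update rule: K-Means assigns each point to exactly one cluster, while Fuzzy K-Means (fuzzy C-means) maintains graded memberships. A minimal sketch follows; the fuzzifier m = 2 and the toy data are illustrative choices, not the paper's setup.

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0):
    """Minimal fuzzy C-means: U[i, j] is the degree to which point i belongs
    to cluster j; each row of U sums to 1 (overlapping clusters)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m                                    # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # soft re-assignment
    return centers, U

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, U = fuzzy_kmeans(X, k=2)
```

A point midway between two clusters ends up with memberships near 0.5/0.5 rather than being forced into one cluster, which is the behavioural difference the hypothesis is about.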
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results. By catalogue we mean a structured label list that can help the user understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for their query and wasting time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric that is employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available datasets Ambient and ODP-239.
Comparison between RISS and dCHARM for mining gene expression data (IJDKP)
Since the rapid advance of microarray technology, gene expression data have been gaining interest as a way to reveal biological information about gene functions and their relation to health. Data mining techniques are effective and efficient in extracting useful patterns, but most current data mining algorithms suffer from high processing time while generating frequent itemsets. The aim of this paper is to provide a comparative study of two Closed Frequent Itemset (CFI) algorithms, dCHARM and RISS. They are examined on high-dimensional data, specifically gene expression data. Nine experiments are conducted with different numbers of genes to examine the performance of both algorithms. It is found that RISS outperforms dCHARM in terms of processing time.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US... (IJDKP)
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
Recommendation system using bloom filter in MapReduce (IJDKP)
Many clients use the Web to discover product details in the form of online reviews provided by other
clients and specialists. Recommender systems are an important response to the information overload
problem, as they present users with more practical and personalized information. Collaborative filtering
methods are a vital component of recommender systems because they generate high-quality
recommendations by leveraging the preferences of communities of similar users; the underlying
assumption is that people with similar tastes choose the same items. Conventional collaborative filtering
suffers from the sparse-data problem and a lack of scalability, so a new recommender system is needed
that handles sparse data and produces high-quality recommendations in a large-scale mobile
environment. MapReduce is a programming model widely used for large-scale data analysis. The
described recommendation mechanism for mobile commerce is user-based collaborative filtering
implemented in MapReduce, which reduces the scalability problem of conventional CF systems. One
essential operation in this analysis is the join, but MapReduce executes joins inefficiently because it
processes all records in the datasets even when only a small fraction is relevant to the join. This problem
can be reduced by applying the bloomjoin algorithm: Bloom filters are constructed and used to filter out
redundant intermediate records. The proposed algorithm thus reduces the number of intermediate
results and improves join performance.
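The bloomjoin idea described above can be sketched with a minimal Bloom filter in Python. The `BloomFilter` class, the `users`/`ratings` data, and the size/hash-count parameters are all hypothetical; in a real MapReduce job the filter would be built in one phase and broadcast to the mappers of the join phase.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over string keys (illustrative sketch)."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

# Build the filter from the join keys of the smaller dataset ...
users = {"u1": "Alice", "u2": "Bob"}
bf = BloomFilter()
for user_id in users:
    bf.add(user_id)

# ... and use it to drop records that cannot participate in the join.
ratings = [("u1", 5), ("u3", 2), ("u2", 4)]
candidates = [r for r in ratings if bf.might_contain(r[0])]
```

Records whose key is definitely absent are dropped before the shuffle; a false positive only costs one wasted intermediate record, never a lost join result.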
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONIJDKP
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches - root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes text, removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
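The bigram/Dice step above can be sketched as follows. The paper compares Arabic word bigrams against a root dictionary; a Latin-script pair is used here purely for readability, and the function names are assumptions.

```python
def bigrams(word):
    """Set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over character bigrams: 2|A∩B| / (|A| + |B|)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# "night" vs "nacht" share only the bigram "ht": 2*1 / (4+4) = 0.25
score = dice_similarity("night", "nacht")
```

A candidate root would be accepted when its Dice score against the stemmed word exceeds some threshold; the threshold value itself is not given in the summary above.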
A statistical data fusion technique in virtual data integration environmentIJDKP
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real world object. In this paper, a
statistical technique for data fusion is introduced, based on probabilistic scores from both the data
sources and the clustered duplicates.
A key objective of every financial organization is to retain existing customers and attract new
prospective customers for the long term. In manual banking, the economic behaviour of a customer and
the nature of the organization are governed by a prescribed form called Know Your Customer (KYC).
Depositor customers in some sectors (jewellery/gold, arms, money exchange, etc.) carry high risk; those
in other sectors (transport operators, auto dealers, religious organizations) carry medium risk; and the
remaining sectors (retail, corporate, service, farming, etc.) carry low risk. Counterparty credit risk can be
broadly categorized under quantitative and qualitative factors. Although many customer-retention and
customer-attrition systems exist in banking, these methods lack a clear, well-defined approach for
disbursing loans in the business sector. In this paper, we use records of business customers of a retail
commercial bank covering rural and urban areas of Tangail city, Bangladesh, to analyse the major
transactional determinants of customers and to build a model that predicts prospective sectors in retail
banking. A data mining approach is adopted, in which a pruned decision tree classification technique is
used to develop the model, and its performance is then tested against Weka results.
The Innovator’s Journey: Asset Owners Insights State Street
On behalf of State Street, Longitude conducted a global survey of senior executives at investment
organizations during October and November 2014. We asked them to self-assess their confidence and
progress across six data capabilities, including infrastructure, insight, adaptability, compliance, talent and
governance. The 400 respondents were drawn from 11 countries and included insurance companies,
private and public pension funds, fund-of-funds, foundations, central banks, endowments, sovereign
wealth funds and supranationals. One hundred asset owners participated in the survey.
For a human body to function properly it is essential to have a certain amount of body fat. Fat helps
regulate body temperature and pads and protects the organs, and it is the body's fundamental form of
energy storage. A healthy amount of body fat is important, but excess body fat increases the risk of
serious health problems. Anthropometry is a widely available and simple method for assessing body
composition; anthropometric measures include weight, height, Body Mass Index (BMI), waist
circumference, biceps, skinfold, etc. Body fat percentage is calculated from such anthropometric
variables. We propose a methodology to determine body fat percentage using R programming and a
regression formula, analysing 10 anthropometric variables and 3 demographic variables. Our study
shows that certain variables have an edge over others in predicting body fat percentage.
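The study fits its regression in R over 10 anthropometric and 3 demographic variables; as a hedged illustration, here is a one-predictor least-squares fit in Python (using BMI as the predictor and invented data points, neither of which comes from the paper):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope*x + intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical BMI values and measured body-fat percentages.
bmi = [20.0, 25.0, 30.0, 35.0]
fat = [15.0, 22.0, 29.0, 36.0]
slope, intercept = fit_line(bmi, fat)

# Predict body fat % for a new subject with BMI 27.
predicted = slope * 27.0 + intercept
```

The paper's multi-variable model generalizes this to a weighted sum over all 13 variables; comparing the fitted coefficients is what reveals which variables "have an edge" in prediction.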
Interpreting available data is a focal issue in data mining. Gathering primary data is a difficult and
expensive affair when assessing trends for business decisions, especially when multiple players are
present. There is no uniform, formula-type procedure for deducing information from a vast data set,
especially when the data formats in the secondary sources are not uniform and need extensive cleansing
before statistical analysis. In this paper, an incremental approach to cleansing data using a simple yet
extensible procedure is presented, and it is shown how to draw conclusions that facilitate business
decisions. Freely available Indian Telecom Industry data covering one year is used to illustrate the
process: the superiority of one telecom service provider over the others is established by comparing
parameters such as network availability and customer service quality using a relative parameter
quantification technique. The method is found to be computationally less costly than other known
methods.
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
This document summarizes a research paper that proposes a new approach called CBSW (Chernoff Bound based Sliding Window) for mining frequent itemsets from data streams. CBSW uses concepts from the Chernoff bound to dynamically determine the window size for mining frequent itemsets. It monitors boundary movements in a synopsis data structure to detect changes in the data stream and adjusts the window size accordingly. Experimental results demonstrate the effectiveness of CBSW in mining frequent itemsets from high-speed data streams.
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...IJDKP
The document describes a proposed algorithm called RAQ-FIG for mining frequent itemsets from transactional data streams. It operates using a sliding window model composed of basic windows. The algorithm has three phases: 1) initializing the sliding window by filling it with recent transactions from a buffer, 2) generating bit sequences for each basic window and finding frequent itemsets through bitwise operations, and 3) adapting the algorithm's processing based on available memory and quality metrics to ensure efficient resource usage and accurate results. The algorithm aims to account for computational resources and dynamically adjust the processing rate based on available memory while computing recent approximate frequent itemsets with a single pass.
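The bit-sequence idea in phase 2 above can be sketched as follows: each item gets one bit per transaction in the window, and the support of an itemset is the popcount of the AND of its items' bit vectors. The toy window and the helper names are illustrative assumptions, not RAQ-FIG's actual data structures.

```python
# Each item maps to a bit vector over transactions in the window:
# bit i is set if transaction i contains the item.
window = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
items = sorted({x for t in window for x in t})
bitvec = {
    item: sum(1 << i for i, t in enumerate(window) if item in t)
    for item in items
}

def support(itemset):
    """Support of an itemset = popcount of the AND of its items' vectors."""
    v = ~0
    for item in itemset:
        v &= bitvec[item]
    # Mask to the window length, then count set bits.
    return bin(v & ((1 << len(window)) - 1)).count("1")
```

Sliding the window amounts to shifting every vector and appending bits for the new transactions, which is why bitwise schemes like this keep the per-slide cost low.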
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesVenu Madhav
Frequent itemset mining and association rule generation are challenging tasks over data streams.
Although various algorithms have been proposed, frequency alone does not determine the significance
(interestingness) of the mined itemsets, and hence of the association rules. This has motivated
algorithms that mine association rules based on utility, i.e. the proficiency of the mined rules. However,
few algorithms in the literature deal with utility, since most focus on reducing the complexity of frequent
itemset/association rule mining. Moreover, those few algorithms consider only the overall utility of the
association rules and not the consistency of the rules across a defined number of periods. To solve this,
this paper proposes an enhanced association rule mining algorithm that introduces a new weightage
validation into conventional association rule mining to validate both the utility of the mined rules and its
consistency. Utility is validated through an integrated calculation of the cost/price efficiency of the
itemsets and their frequency. Consistency validation is performed at every defined number of windows
using the probability distribution function, assuming the weights are normally distributed. The resulting
rules are therefore frequent and utility-efficient, with their interestingness distributed throughout the
entire time period. The algorithm is implemented and the resulting rules are compared against rules
obtained from conventional mining algorithms.
This document summarizes a research paper that proposes an algorithm called ESW-FI to efficiently mine frequent itemsets from data streams using a sliding window model. The algorithm maintains potentially frequent itemsets in a compact data structure using only a single pass over the data, an improvement over existing algorithms that require multiple scans or must retain all transaction data within the window. ESW-FI guarantees output quality and bounds memory usage while processing continuous, unpredictable streams in a timely manner. It divides the sliding window into fixed-size segments and processes window slides by inserting new segments and removing old ones, avoiding reprocessing of all transactions on each slide.
Mining Maximum Frequent Item Sets Over Data Streams Using Transaction Sliding...ijitcs
Online mining of streaming data is one of the most important issues in data mining. In this paper, we
propose an efficient one-pass algorithm to mine the set of all frequent itemsets in data streams with a
transaction-sensitive sliding window. An effective bit-sequence representation of items is used to reduce
the time and memory needed to slide the window. The experiments show that the proposed algorithm
not only attains highly accurate mining results, but also runs significantly faster and consumes less
memory than existing algorithms for mining frequent itemsets over recent data streams. Our theoretical
analysis and experimental studies show that the proposed algorithm is efficient, scalable, and performs
better at mining the set of all maximum frequent itemsets over the entire history of the data stream.
Frequent pattern mining techniques help to find interesting trends or patterns in massive data. Prior
domain knowledge helps in deciding an appropriate minimum support threshold. This review article
presents different frequent pattern mining techniques based on Apriori, FP-tree, or user-defined
approaches under different computing environments (parallel, distributed, or available data mining
tools), which help determine interesting frequent patterns/itemsets with or without prior domain
knowledge. The review is intended to support the development of efficient and scalable frequent pattern
mining techniques.
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train CoachesIRJET Journal
This document summarizes research on techniques for handling concept drift in data stream mining. It begins with an introduction to the challenges of concept drift in data streams and the two main approaches for handling concept drift using ensembles: online and block-based. It then reviews several existing studies on concept drift detection and handling in data streams. Finally, it proposes an adaptive online ensemble approach that uses an internal change detector to dynamically determine block sizes and capture concept drifts in a timely manner. Experimental results show this approach outperforms other ensemble techniques, especially on datasets with sudden concept changes.
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam...IRJET Journal
This document summarizes several techniques for handling concept drift in data stream mining. It discusses how ensemble methods are commonly used to deal with concept drift and categorizes ensemble approaches into online and block-based. It also reviews several existing studies on handling concept drift, including methods that use adaptive windowing and online learning as well as techniques for detecting concept drift and efficiently updating models. The document concludes by discussing the need for approaches that can adapt to different types of concept drift and changes in non-stationary data streams.
Mining Stream Data using k-Means clustering AlgorithmManishankar Medi
This document discusses using k-means clustering to analyze urban road traffic stream data. Stream data arrives continuously over time and is challenging to process due to its high volume, velocity and volatility. The document proposes using a sliding window technique with k-means clustering to analyze recent urban traffic data and visualize clusters in real-time to provide insights into traffic patterns and congested roads. This analysis could help travelers and authorities respond to traffic issues more quickly.
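The sliding-window k-means approach described above can be sketched as follows; the `deque(maxlen=...)` window, the toy 2-D readings, and plain Lloyd's iterations are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on 2-D points (illustrative, not optimized)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

# Sliding window over the stream: keep only the most recent readings,
# so old traffic conditions age out automatically.
window = deque(maxlen=100)
for reading in [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9)]:
    window.append(reading)
centers = kmeans(list(window), k=2)
```

Re-running the clustering each time the window advances is what lets the visualization reflect only recent traffic rather than the full history.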
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Drsp dimension reduction for similarity matching and pruning of time series ...IJDKP
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
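The Multi-level Segment Mean technique is not detailed in this summary; as an assumption about the general idea, a single-level, piecewise segment-mean reduction (in the style of piecewise aggregate approximation) can be sketched as:

```python
def segment_means(series, num_segments):
    """Reduce a numeric series to num_segments segment means.

    Each segment's mean stands in for all its points, shrinking the
    dimensionality while retaining the series' coarse shape.
    """
    n = len(series)
    means = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        chunk = series[start:end]
        means.append(sum(chunk) / len(chunk))
    return means

reduced = segment_means([1, 2, 3, 4, 5, 6, 7, 8], 4)
```

Similarity matching and pruning then operate on these short mean vectors instead of the raw streams, which is where the storage and computation savings come from.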
Mining high utility itemsets in data streams based on the weighted sliding wi...IJDKP
This document proposes an algorithm for mining high utility itemsets from data streams based on the weighted sliding window model. The weighted sliding window model allows the user to specify the number of windows, window size, and weight for each window. The algorithm, called HUI_W, makes a single pass over the data stream and takes advantage of reusing stored information to efficiently discover high utility itemsets. It first identifies high transaction-weighted utilization (HTWU) 1-itemsets, then generates candidate itemsets and determines actual high utility itemsets based on a transaction-weighted downward closure property to prune the search space.
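The utility computation underlying a weighted-window scheme like HUI_W can be sketched as follows; the profits, quantities, and window weights are invented for illustration, and the real algorithm additionally prunes candidates with the transaction-weighted downward closure property rather than scanning every transaction per itemset.

```python
# Utility of an itemset in a transaction = sum over its items of
# quantity (internal utility) * unit profit (external utility).
profit = {"a": 3, "b": 1, "c": 5}          # assumed external utilities
windowed = [
    # (transaction as {item: quantity}, weight of its window)
    ({"a": 2, "b": 1}, 0.5),
    ({"a": 1, "c": 2}, 1.0),
]

def weighted_utility(itemset):
    """Window-weighted utility of an itemset over the weighted windows."""
    total = 0.0
    for txn, w in windowed:
        if all(item in txn for item in itemset):
            total += w * sum(txn[i] * profit[i] for i in itemset)
    return total
```

An itemset is then reported as high utility when this weighted total clears a user-defined minimum utility threshold.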
This document provides an overview of stream data mining techniques. It discusses how traditional data mining cannot be directly applied to data streams due to their continuous, rapid nature. The document outlines some essential methodologies for analyzing data streams, including sampling, load shedding, sketching, and data summarization techniques like reservoirs, histograms, and wavelets. It also discusses challenges in applying these techniques to data streams and open problems in the emerging field of stream data mining.
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...Editor IJMTER
Basic idea is that the search tree could be divided into sub process of equivalence
classes. And since generating item sets in sub process of equivalence classes is independent from
each other, we could do frequent item set mining in sub trees of equivalence classes in parallel. So
the straightforward approach to parallelize Éclat is to consider each equivalence class as a data
(agriculture). We can distribute data to different nodes and nodes could work on data without any
synchronization. Even though the sorting helps to produce different sets in smaller sizes, there is a
cost for sorting. Our Research to analysis is that the size of equivalence class is relatively small
(always less than the size of the item base) and this size also reduces quickly as the search goes
deeper in the recursion process. Base on time using more than using agriculture data we can handle
large amount of data so first we develop éclat algorithm then develop parallel éclat algorithm then
compare with using same data with respect time .with the help of support and confidence.
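A minimal sequential Eclat over tidsets, illustrating the equivalence-class independence that the parallel version exploits; the function names and the toy transactions are assumptions for illustration, not the authors' code.

```python
from collections import defaultdict

def eclat(transactions, min_support):
    """Eclat: vertical mining by recursive tidset intersection.

    Each call on (prefix, items) mines one equivalence class; the
    recursive calls are independent of each other, which is exactly
    what a parallel version distributes across nodes.
    """
    # Vertical layout: item -> set of transaction ids containing it.
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[item].add(tid)

    frequent = {}

    def recurse(prefix, items):
        for i, (item, tids) in enumerate(items):
            if len(tids) < min_support:
                continue
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend the class by intersecting with later items' tidsets.
            suffix = [(it2, tids & tids2) for it2, tids2 in items[i + 1:]]
            recurse(itemset, suffix)

    recurse((), sorted(tidsets.items()))
    return frequent

found = eclat([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], min_support=2)
```

Handing each top-level `(item, tidset)` entry and its suffix to a different node gives the synchronization-free parallelism described above.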
A Survey Report on High Utility Itemset Mining for Frequent Pattern MiningIJSRD
Data mining is the process of extracting knowledge from data or data streams. Mining high utility itemsets focuses on itemsets with high profit; since the data may be static or streaming, mining such itemsets has become a significant research topic. In this paper we present different methods available for mining high utility itemsets from data streams and compare the following algorithms: the One-Pass algorithm, the Two-Phase algorithm, mining top-k utility frequent itemsets, and sliding-window-based algorithms such as MHUI-BIT (Mining High Utility Itemsets based on Bit Vectors) and MHUI-TID (Mining High Utility Itemsets based on TID-Lists) with a lexicographical-tree-based summary data structure. We also cover the FHM (Fast High Utility Miner) algorithm, which uses estimated utility co-occurrence pruning and which we will use for our proposed work.
An Efficient Compressed Data Structure Based Method for Frequent Item Set Miningijsrd.com
Frequent pattern mining is very important for business organizations. The major applications of frequent pattern mining include disease prediction and analysis, rain forecasting, profit maximization, etc. In this paper, we are presenting a new method for mining frequent patterns. Our method is based on a new compact data structure. This data structure will help in reducing the execution time.
PATTERN DISCOVERY FOR MULTIPLE DATA SOURCES BASED ON ITEM RANKIJDKP
A retail company's data may be geographically spread across different locations due to the huge amount
of data and rapid growth in transactions, but for decision making, knowledge workers need integrated
data from all sites. The main challenge is therefore to extract generalized patterns or knowledge from
transactional data spread across various locations. Transporting the data from those locations to a
server site increases transportation cost, and finding patterns in the resulting huge dataset on the server
increases time and space complexity. Multi-database mining thus plays a vital role in extracting
knowledge from different data sources. The proposed technique finds patterns at the various sites and
transports only those patterns, rather than the data, to the server to produce the final deliverable
patterns. The technique uses a ranking algorithm to rank items based on their profit, expiry date, and
stock available at each location; association rule mining (ARM) is then used to extract patterns based on
the item rankings. Finally, all the patterns discovered at the various locations are merged using a pattern
merger algorithm. The proposed algorithm is implemented, and experimental results are taken both for
classical association rule mining on integrated data and for the datasets at the individual sources.
Finally, all patterns are combined to discover actionable patterns using the pattern merger algorithm
given in section
This document discusses an approach for mining frequent itemsets from data streams using the Chernoff bound and sliding window model. The proposed CB-based method approximates itemset counts from summary information without rescanning the stream, making it adaptive to streams with different distributions. Experiments showed the method performs better in optimizing memory usage and mining recent patterns in less time with accurate results. The document reviews related work on frequent itemset mining from data streams and motivates the need for an efficient model to handle time-sensitive items in uncertain streams.
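The exact form of the bound used in the CB-based method is not reproduced in this summary; one common Chernoff/Hoeffding-style form for sizing a window is n ≥ ln(2/δ) / (2ε²), the number of transactions after which the observed support of an itemset deviates from its true support by more than ε with probability at most δ. A sketch under that assumption:

```python
import math

def required_window_size(epsilon, delta):
    """Transactions needed so that observed support is within epsilon of
    true support with probability at least 1 - delta (Hoeffding-style
    bound): n >= ln(2/delta) / (2 * epsilon**2)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# e.g. tolerating a 5% support error with 99% confidence:
n = required_window_size(0.05, 0.01)
```

Tightening ε or δ grows the window quadratically or logarithmically respectively, which is why such methods adapt the window size rather than fixing one value for streams with different distributions.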
Survey of the Euro Currency Fluctuation by Using Data Miningijcsit
Data mining, or Knowledge Discovery in Databases (KDD), is a new field in information technology that emerged from progress in the creation and maintenance of large databases, combining statistical and artificial intelligence methods with database management. Data mining is used to recognize hidden patterns and provide relevant information for decision making on complex problems where conventional methods are inefficient or too slow. It can be used as a powerful tool to predict future trends and behaviors, and this prediction allows proactive, knowledge-driven business decisions. Since the automated prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools, it can answer business questions that are traditionally time-consuming to resolve, which makes it of great interest to government, industry, and commerce. In this paper we use this tool to investigate Euro currency fluctuation. For this investigation, we use three different algorithms, K*, IBK and MLP, and we extract Euro currency volatility using the same criteria for all algorithms. The dataset has
21,084 records, collected from daily price fluctuations of the Euro currency in the period
10/2006 to 04/2010.
1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.4, No.4, July 2014
DOI : 10.5121/ijdkp.2014.4402
MINING FREQUENT ITEMSETS (MFI) OVER DATA STREAMS: VARIABLE WINDOW SIZE (VWS) BY CONTEXT VARIATION ANALYSIS (CVA) OF THE STREAMING TRANSACTIONS

V. Sidda Reddy¹, T.V. Rao² and A. Govardhan³

¹Department of Computer Science and Engineering, Sagar Institute of Technology, Chevella, Hyderabad, T.S, India
²Department of Computer Science and Engineering, PVP Siddhartha Institute of Technology, Vijayawada, A.P, India
³School of Information Technology, Jawaharlal Nehru Technological University, Hyderabad, T.S, India
ABSTRACT
This paper addresses the challenges of mining frequent itemsets over data streams under a variable window size and a low memory budget. To detect the points of context change in the streaming transactions, we developed a two-level window structure that fixes the window size on the fly, limiting heterogeneity and ensuring homogeneity among the transactions added to the window. To minimize memory utilization and computational cost and to improve the scalability of the process, the design allows the coverage (support) threshold to be fixed at the window level. We introduce a context variation analysis approach together with incremental mining of frequent itemsets from the resulting window. The complete technique presented here is named Mining Frequent Itemsets using Variable Window Size fixed by Context Variation Analysis (MFI-VWS-CVA). Clear boundaries emerge between frequent and infrequent itemsets. In this design the change in window size represents the concept drift in the data stream; in other words, an itemset can appear infrequent simply because the window size was not set effectively. The experiments we executed and document here show that the proposed algorithm is considerably more efficient than existing ones.
KEYWORDS
Data Mining, Data Streams, Frequent Itemset, Frequent Itemset Mining, Data Stream Mining, Variable
Window, Sliding Window
1. INTRODUCTION
Association rule mining is an important and well-researched method for finding frequent itemsets (patterns) among sets of objects in large databases [1]. Frequent itemsets indicate the closeness among the items available in a dataset. A wide range of applications, including association rules for market-basket analysis, grading and clustering of documents, text, and pages in text mining, and web mining, allow users to disclose hidden patterns and relationships in huge datasets. In the emerging applications where information arrives as an enormous, continuous stream, the rate at which data is produced outstrips the speed at which it can be mined [2]. This is quite the opposite of traditional static databases, so data stream mining differs considerably from traditional data mining in a good number of aspects. Primarily, the amount of data embedded in the lifetime of a data stream could
be overwhelmingly high [3]. In addition, queries on such information streams must receive spontaneous responses within a bounded response time because of rigid resource constraints [4]. For these reasons data stream mining is an area of very active research, and obtaining timely and appropriate association rules remains a contemporary challenge: conventional data mining methods must give way to highly efficient methods that can operate on an open-ended, high-speed stream of data [5]. The challenges that the unique and inherent qualities of a data stream impose on all mining technologies are as follows [3]. Primarily, conventional technology is not applicable, since building a model requires scanning the database multiple times, which is not possible given the continuous nature of stream data. High priority has been given to the scalability of frequent itemset mining in approaches such as association rules, classification, and clustering. Such algorithms often require extensive transformation of the data from one representation to another, which uses resources heavily and results in high CPU overhead. Our proposal in this paper is a model known as Mining Frequent Itemsets using Variable Window Size fixed by Context Variation Analysis (MFI-VWS-CVA). The model is resource-effective and scalable, as it functions with limited memory requirements and limited computational expense, and it portrays the trade-offs between computation, data representation, I/O, and heuristics. The proposed algorithm stores transactions in a dynamic window driven by context variation analysis, and applies TIFIM [15] to mine the frequent itemsets from each concluded window.
2. RELATED WORK
Syed Khairuzzaman Tanbeer [6] proposed CPS-tree, a prefix-tree structure that uses a dynamic tree restructuring technique to manage stream data. The basic pitfall of this model is that it reconstructs the tree for every newly arriving item, which results in high memory usage and a time-consuming process. Pauray S.M. Tsai [7] initiated the Weighted Sliding Window (WSW) algorithm, in which the weight of each transaction in each window is calculated; here again memory space and time are the major concerns, and the method fails to use these two economically. An Apriori-style algorithm is used for candidate generation. Hua-Fu Li [8] introduced an effective bit-sequence based algorithm named MFI-TransSW (Mining Frequent Itemsets within a Transaction-Sensitive Sliding Window), which runs in three phases. As the window size increases, the memory usage of MFI-TransSW increases, and so does the processing time of its first two phases. Younghee Kim [9, 16] initiated an effective algorithm with normalized weights over data streams called WSFI-mine (Weighted Support Frequent Itemsets mining), which can mine all frequent itemsets from the database in a single scan. Chowdhury Farhan Ahmed [10] recommended HUPMS (High Utility Pattern Mining in Stream data), a different algorithm for sliding-window-based high utility pattern mining over data streams; this algorithm is suitable only for interactive mining. Jing Guo [11] discussed mining frequent patterns across multiple data streams, using real-time newspaper data for the analysis; in multiple streams it is vital to identify both collaborative and comparative frequent patterns. Anushree Gowtham Ringe [12] addressed prevention of misuse of sensitive data in a stream and initiated a new technique for guarding the privacy of data streams. Fan Guidan [13] proposed a model conceptually similar to ours, in which a matrix in a sliding window over the data stream plays the major role: the algorithm stores 2-itemsets in two 0-1 matrices and applies relative operations on the two matrices to extract frequent itemsets.
3. MINING FREQUENT ITEMSETS OVER DATA STREAMS WITH CONTEXT
AWARE VARIABLE SIZE WINDOWS
If streaming data is the input to mining strategies such as frequent itemset mining, traditional approaches are not suitable, since they mainly work by making multiple passes through the entire dataset. Hence, mining strategies adopted for streaming data gather the tuples of the streaming transactions into windows, and these windows are used as input to the mining algorithms. The significant issue in this model is fixing the window size. For data streams whose transactions carry transitional and temporal state identities, those identities can be used to fix the window size. In the other cases, where the streaming transactions have no transitional or temporal state identities, fixing the window size is a big constraint on achieving quality factors such as result accuracy and process scalability. In this regard we propose a novel context-variation-based approach to fixing a dynamic window size for mining frequent itemsets over data streams. The proposed frequent itemset mining strategy targets the following qualities:
• The window size should be optimal and dynamic.
• The window size should be fixed dynamically between the minimal and maximal sizes given as thresholds.
• Within the range of the minimal and maximal sizes, the window size should be fixed based on the context variation observed in the input transactions from the data stream.
The only significant constraint the proposed model places on the streaming data is that the context changes of the transactions should occur in order.
Minimal memory utilization and low computational cost are the two main quality metrics expected from this proposal. The proposed window fixing strategy is explored as follows:
Let ds be the data stream, with the streamed transactions treated as horizontal partitions, each partition holding one transaction. Let n be the total count of attributes used to form the transactions of ds, and let A = {a_1, a_2, ..., a_i, a_{i+1}, ..., a_n} be the set of those attributes. Let {t_1, t_2, t_3, ..., t_i, t_{i+1}, ..., t_{i+m}, t_{i+(m+1)}, ...} be the transactions streaming in sequence. Let ws_min be the minimal window size and ws_max the maximal window size. Let w_tran be the transaction window and w_cca the context change analysis window, and let s(w_cca) be the size of w_cca. The initial values of ws_min, ws_max, and s(w_cca) are set during the pre-processing step.
Initially, ws_min transactions from the given data stream ds are moved into w_tran; the following s(w_cca) transactions are moved into w_cca, and the context variation analysis (CVA) process is initialized. The CVA process is explored as follows:
The attributes involved in forming the transactions moved into w_tran are collected as al(w_tran), and the attributes involved in forming the transactions found in w_cca are collected as al(w_cca). The similarity score of these two attribute lists, al(w_tran) and al(w_cca), is then found as in Eq. 1, which is derived from the Jaccard similarity measure.
ss(w_tran ↔ w_cca) = |al(w_tran) ∩ al(w_cca)| / |al(w_tran) ∪ al(w_cca)| … (Eq. 1)
If the similarity score ss(w_tran ↔ w_cca) is greater than the given similarity score threshold τ_ss, then the transactions of w_cca are moved into w_tran (see Eq. 2).

w_tran = w_tran ∪ w_cca … (Eq. 2)
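Eq. 1 amounts to a Jaccard index over the two attribute sets. A minimal Python sketch (the function name and the empty-set convention are our own):

```python
def similarity_score(al_tran, al_cca):
    """Eq. 1: Jaccard similarity of the two attribute lists,
    |al(w_tran) ∩ al(w_cca)| / |al(w_tran) ∪ al(w_cca)|."""
    a, b = set(al_tran), set(al_cca)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```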
If the size of w_tran is greater than or equal to ws_max, then w_tran is finalized and the mining of frequent itemsets from its transactions is initiated; otherwise the next s(w_cca) streaming transactions are moved into w_cca and the CVA process continues.
Once w_tran is finalized and the mining of frequent itemsets is initiated, both w_tran and w_cca are cleared, and the window-preparation process explored above continues for the further transactions streaming from the data stream ds.
This process continues as long as transactions arrive from the data stream ds. Mining frequent itemsets from each finalized window is done using TIFIM [15], a tree-based incremental frequent itemset mining approach that we devised in our earlier research paper (see Sec. 3.3).
3.1 Algorithmic Exploration of Fixing the Variable Window Size by Context Variation Analysis
Inputs:
• Data stream ds
• Minimal transaction window size ws_min
• Maximal transaction window size ws_max
• Similarity score threshold τ_ss
• Size of the context change analysis window s(w_cca)
1. Begin
2. For each transaction t ∈ ds Begin
3. If |w_tran| < ws_min then w_tran ← w_tran ∪ {t}
4. Else Begin
5. w_cca ← w_cca ∪ {t}
6. If |w_cca| ≥ s(w_cca)
7. ss ← CVA(w_tran, w_cca)
8. If ss ≥ τ_ss Begin
9. w_tran ← w_tran ∪ w_cca
10. If |w_tran| ≥ ws_max Begin
11. Finalize window w_tran
12. Initiate TIFIM(w_tran)
13. Set w_tran ← ∅ // empty w_tran
14. Set w_cca ← ∅ // empty w_cca
15. End of 10
16. End of 8
17. Else Begin
18. Finalize window w_tran
19. Initiate TIFIM(w_tran)
20. Set w_tran ← ∅ // empty w_tran
21. Set w_tran ← w_cca // move transactions of window w_cca to the new window w_tran
22. Set w_cca ← ∅ // empty w_cca
23. End of 17
24. End of 4
25. End of 2
26. End of 1
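The steps above can be sketched in Python as a generator that yields each finalized window. The names, the similarity callback, and the in-memory window representation are our own assumptions; the mining step (TIFIM) is left to the caller:

```python
def fix_windows(ds, ws_min, ws_max, tau_ss, s_cca, cva):
    """Sketch of Sec. 3.1: fix a variable window size by context variation
    analysis. `ds` yields transactions (iterables of attribute names) and
    `cva` scores two attribute sets as in Eq. 1. Yields each finalized
    transaction window w_tran; the caller mines each yielded window."""
    attrs = lambda w: set().union(*(set(t) for t in w)) if w else set()
    w_tran, w_cca = [], []
    for t in ds:
        if len(w_tran) < ws_min:
            w_tran.append(t)              # steps 2-3: fill up to ws_min
            continue
        w_cca.append(t)                   # step 5
        if len(w_cca) < s_cca:
            continue                      # step 6: CCA window not full yet
        if cva(attrs(w_tran), attrs(w_cca)) >= tau_ss:
            w_tran.extend(w_cca)          # step 9: homogeneous, merge windows
            w_cca = []
            if len(w_tran) >= ws_max:     # steps 10-14: finalize and reset
                yield w_tran
                w_tran = []
        else:                             # steps 17-22: context changed
            yield w_tran
            w_tran, w_cca = w_cca, []     # CCA transactions seed the new window
    if w_tran:
        yield w_tran                      # flush the tail of the stream
```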
3.2 Algorithmic Exploration of the Context Variation Analysis
1. CVA(w_tran, w_cca) Begin
2. Set fs_tran ← ∅ // initialize the field set fs_tran of transaction window w_tran as empty
3. For each transaction t ∈ w_tran Begin
4. fs_tran ← fs_tran ∪ t
5. End of 3
6. Set fs_cca ← ∅ // initialize the field set fs_cca of window w_cca as empty
7. For each transaction t ∈ w_cca Begin
8. fs_cca ← fs_cca ∪ t
9. End of 7
10. ss ← |fs_tran ∩ fs_cca| / |fs_tran ∪ fs_cca| // measure the similarity score of w_tran and w_cca
11. Return ss
12. End of 1
3.3 Tree (Bush) Based Incremental Frequent Itemset Mining (TIFIM)
3.3.1 Finding Frequent Itemsets
The primary representation of the transactions of the data stream ds is as described above. An asynchronous parallel process runs to find frequent itemsets in an incremental manner.
A bush represents an itemset by a pair of attributes belonging to fs_tran, together with the transactions that contain that pair. The coverage used to measure the frequency of an itemset can be considered and set in the context of the size of window w_tran; the coverage of a two-attribute itemset is the number of children of the bush represented by that pair of attributes.
An asynchronous parallel process called the frequent itemset finder (FIF) performs as follows:
Initially it picks the bushes with coverage greater than the given coverage threshold cov.
It prepares a new bush from each pair of bushes by taking the union of their roots and the intersection of their children, and retains the new bush if its coverage is greater than or equal to cov; otherwise it is discarded.
This continues until no new bush is formed.
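The FIF loop can be sketched as follows. This is a hedged illustration under our own representation: a bush is a mapping from its root (a frozenset of attributes) to the set of transaction ids that are its children, so coverage is simply the child count:

```python
def frequent_itemset_finder(bushes, min_cov):
    """Sketch of FIF (Sec. 3.3.1): keep bushes whose coverage (child count)
    meets min_cov, then repeatedly join pairs of bushes by taking the union
    of their roots and the intersection of their children, retaining a new
    bush only when its coverage still meets min_cov."""
    frequent = {r: c for r, c in bushes.items() if len(c) >= min_cov}
    current = dict(frequent)
    while True:
        new = {}
        roots = list(current)
        for i, r1 in enumerate(roots):
            for r2 in roots[i + 1:]:
                root = r1 | r2                         # union of the roots
                children = current[r1] & current[r2]   # intersect the children
                if (root not in frequent and root not in new
                        and len(children) >= min_cov):
                    new[root] = children
        if not new:
            break                                      # no new bush formed
        frequent.update(new)
        current = new
    return frequent
```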
3.3.2 The Pruning Process
A bush b_i is said to be a sub-bush of bush b_j if r_{b_i} ⊆ r_{b_j} and cov(b_i) ≤ cov(b_j). Since the sub-bush b_i is represented by b_j, bush b_i can be pruned from the bush set B.
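Under the same bush representation as in Sec. 3.3.1 (root as a frozenset of attributes, children as a set of transaction ids, coverage equal to the child count — all our own assumptions), the pruning rule can be sketched as:

```python
def prune_bushes(bushes):
    """Sketch of the pruning rule (Sec. 3.3.2): bush b_i is pruned when some
    bush b_j has root(b_i) ⊂ root(b_j) and cov(b_i) <= cov(b_j), since b_j
    then already represents b_i. (A proper subset is used here so a bush is
    never compared against itself.)"""
    return {
        root: kids
        for root, kids in bushes.items()
        if not any(root < other and len(kids) <= len(okids)
                   for other, okids in bushes.items())
    }
```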
3.4 Finding Frequent Items
At any event of time, frequent itemsets can be found as follows: the roots of the bushes with coverage greater than the given cov can be claimed as frequent itemsets. The coverage of a bush b_i can be found as follows: if a bush b_j is found such that b_i ⊆ b_j and the coverage of b_j is higher than that of any other bush b_k with b_i ⊆ b_k, then the coverage of b_i is said to be cov(b_j) + cov(b_i).
3.4.1 An Algorithmic Representation of the Caching Process
Input: at an event of time, a window w_i with transaction t_i received.
For each transaction t_i:
Let {(a_1, a_2, a_3, ..., a_i)} be the set of attributes such that (a_1, a_2, a_3, ..., a_i) ∈ t_i and (a_1, a_2, a_3, ..., a_i) ⊆ A_set.
For each pair of attributes (a_m, a_n) ∈ t_i: if a bush b_i with root (a_m, a_n) exists, then add transaction t_i as a node of bush b_i; otherwise prepare a new bush b_i with (a_m, a_n) as root and t_i as a node.
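The caching step can be sketched in Python; representing the bush store as a dict keyed by the attribute pair is our own assumption:

```python
from itertools import combinations

def cache_transaction(bushes, tid, transaction):
    """Sketch of the caching process (Sec. 3.4.1): for every attribute pair
    (a_m, a_n) in the transaction, add the transaction as a node (child) of
    the bush rooted at that pair, creating the bush if it does not exist."""
    for a_m, a_n in combinations(sorted(set(transaction)), 2):
        root = frozenset({a_m, a_n})
        bushes.setdefault(root, set()).add(tid)  # add t_i as a node of b_i
    return bushes
```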
3.4.2 An Algorithmic Approach to FIF
The bush set B prepared by the caching process is the input to FIF.
For each bush b_i ∈ B, and for each bush b_{i+c} ∈ B (c = 1, 2, 3, ..., n), form a bush b_i ∪ b_{i+c} ∉ B by taking the union of the roots of b_i and b_{i+c} (r_{b_i} ∪ r_{b_{i+c}}) and the intersection of the nodes of b_i and b_{i+c} (ts_{b_i} ∩ ts_{b_{i+c}}).
4. EMPIRICAL ANALYSIS OF FIXING THE VARIABLE WINDOW SIZE BY CONTEXT VARIATION ANALYSIS
4.1 Dataset characteristics
Multiple data sets were streamed to perform the experiments, with the following characteristics:
• To achieve sparseness in the streaming transactions, the number of fields was set to 75, 100, 125, and 150, the maximum transaction length was set in the range of 12 to 18, the minimum transaction length was set to 5, and the total number of transactions ranged from 1,000 to 10,000.
• To achieve denseness in the streaming transactions, the number of fields was set to 20, 30, 40, and 50, the maximum transaction length was set in the range of 10 to 15, the minimum transaction length was set to 5, and the total number of transactions ranged from 1,000 to 10,000.
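Synthetic streams of this shape can be reproduced with a small generator; the field naming and the uniform length distribution are our own assumptions:

```python
import random

def synthetic_stream(n_trans, n_fields, min_len, max_len, seed=0):
    """Sketch of a synthetic stream per Sec. 4.1: sparse settings use many
    fields (75-150) with max lengths 12-18; dense settings use few fields
    (20-50) with max lengths 10-15; minimum length is 5 in both cases."""
    rng = random.Random(seed)              # seeded for reproducible runs
    fields = [f"a{i}" for i in range(n_fields)]
    for _ in range(n_trans):
        length = rng.randint(min_len, max_len)
        yield rng.sample(fields, length)   # distinct attributes per transaction
```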
4.2 Experimental Results
We compare our algorithm with the frequent itemset mining model for data streams devised in [13], a matrix-based frequent itemset mining (MFIM) algorithm. Both our MFI-VWS-CVA and MFIM were implemented in Java 7, with a set of flat files as streaming data sources. The streaming environment was emulated using Java RMI, and the parallel process involved in the proposed MFI-VWS-CVA was realized with Java multithreading. The three parameters of each synthetic dataset are the total number of transactions, the average length, and the divergence count of items. Each transaction of a dataset is scanned only once in our experiments to simulate the environment of data streams. To measure computational cost and scalability, the algorithms were run under divergent coverage values in the range of 10% to 90%.
Figure 1: MFI-VWS-CVA advantage over MFIM in memory usage.
Figure 2: MFI-VWS-CVA advantage over MFIM in terms of execution time.
Figures 1 and 2 show the comparison of memory usage and execution time under divergent coverage values in the range of 10% to 40%. Figure 3 explores the scalability of MFI-VWS-CVA over MFIM under divergent streaming data sizes. In Figures 1 and 2 the horizontal axis is the given coverage, and the vertical axes are memory in megabytes and time in seconds, respectively. In Figure 3 the horizontal axis is the streaming data size in transactions, and the vertical axes are the execution time in seconds and the percentage of elapsed time. As the coverage value decreases, the average increments in memory usage for the matrix-based FIM and MFI-VWS-CVA are 2.29 and 0.7, respectively (see Figure 1), and the average increments in execution time are 83.2 and 27.9, respectively (see Figure 2). These results clearly indicate that the performance of MFI-VWS-CVA is far ahead of the matrix-based FIM. MFI-VWS-CVA is also scalable, as the matrix-based FIM takes an average of 14.16% more elapsed time under uniform increments of the streaming data size by 1,500 transactions.
Figure 3: MFI-VWS-CVA advantage over MFIM in scalability under divergent streaming data sizes.
5. CONCLUSION
We explored a novel approach for mining frequent itemsets from a data stream. In our earlier research paper we implemented an efficient tree-based incremental frequent itemset mining model [15]; here we further developed an approach for mining frequent itemsets using a variable window size fixed by context variation analysis (MFI-VWS-CVA) over data streams. Because the window size is fixed dynamically by concept variation analysis, the model proves optimal and scalable. A parallel process that determines frequent itemsets from the cached bush structures, which is our earlier proposal [15], performs the frequent itemset mining over the data stream. We extended this incremental frequent itemset mining algorithm with a variable-window-size technique for windowing the streaming transactions, in order to achieve efficient memory usage and execution time. The experimental results confirm that MFI-VWS-CVA is scalable under divergent streaming data sizes and coverage values. In the future this model can be extended to perform utility-based frequent itemset mining over data streams.
ACKNOWLEDGEMENT
The authors would like to thank the anonymous reviewers for their valuable comments.
We also thank the authors of all referenced works, which helped us set up this paper.
REFERENCES
[1] V.Sidda Reddy, M.Narendra and K.Helini, “Knowledge Discovery from Static Datasets to Evolving
Data Streams and Challenges”, International Journal of Computer Applications, Volume 87, No.15,
pp 22-25, 2014.
[2] Moses Charikar, Kevin Chen and Martin Farach-Colton, “Finding frequent items in data streams”,
Journal of Theoretical Computer Science – Special issue on automata, languages and programming,
pp 3-15, 2004.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, “Mining data streams: a
review”, ACM SIGMOD Record, Volume 34, Issue 2, pp 18-26, 2005.
[4] Nan Jiang and Le Gruenwald, “CFI-stream: Mining closed frequent itemsets in data streams”, Proceedings of the 12th ACM SIGKDD International Conference on KDD, USA, 2006.
[5] Gurmeet Singh Manku and Rajeev Motwani, “Approximate frequency counts over data streams”, Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, 2002.
[6] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, and Young-Koo Lee, “Efficient frequent pattern mining over data streams”, Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp 22-30, 2008.
[7] Pauray S.M.Tsai, “Mining frequent item sets in data streams using the weighted sliding window
model”, Elsevier publication, 2009.
[8] Hua-Fu Li and Suh-Yin Lee, “Mining frequent itemsets over data streams using efficient window sliding techniques”, Elsevier publication, 2009.
[9] Younghee Kim, Won Young Kim and Ungmo Kim, “Mining frequent itemsets with normalized weight in continuous data streams”, Journal of Information Processing Systems, 2010.
[10] Chowdhury Farhan Ahmed and Byeong-Soo Jeong, “Efficient mining of high utility patterns over data streams with a sliding window model”, Springerlink.com, 2011.
[11] Jing Guo, Peng Zhang, Jianlong Tan and Li Guo, “Mining frequent patterns across multiple data
streams”, Proceedings of the 20th ACM international conference on Information and knowledge
management, pp 2325-2328, 2011.
[12] Anushree Gowtham Ringe, Deeksha Sood and Turga Toshniwa, “Compression and privacy preservation of data streams using moments”, International Journal of Machine Learning and Computing, 2011.
[13] Fan Guidan and Yin Shaohong, "A Frequent Itemsets Mining Algorithm Based on Matrix in Sliding
Window over Data Streams", Intelligent System Design and Engineering Applications (ISDEA),
Third International Conference, pp 66-69, 2013.
[14] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer and R. Srikant, “The Quest data mining
system”, Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and
Data Mining, 1996.
[15] V. Sidda Reddy, Dr T.V.Rao and Dr A.Govardhan, "TIFIM: Tree based Incremental Frequent Itemset
Mining over Streaming Data", International Journal Of Computers & Technology Vol 10, No 5,
Council for Innovative Research, www.cirworld.com, 2013.
[16] www.borgelt.net/slides/fpm.pdf.